Evidence
train.py:133-137 calls builder.collect_and_compile_contracts(..., max_workers=3, max_compiler_configs=...).
src/dataset_pipeline.py:793-797 defines collect_and_compile_contracts(self, contract_addresses, max_compiler_configs=2) with no max_workers parameter.
src/dataset_pipeline.py:816-918 then processes contracts, compiler configs, and compiled contracts in nested serial loops.
Impact
The full training pipeline can fail before training starts with TypeError: collect_and_compile_contracts() got an unexpected keyword argument 'max_workers'. Even after the signature mismatch is fixed, dataset compilation/TAC preprocessing remains serial, increasing wall-clock time and compute cost before every fresh training run.
Recommended fix
Accept and honor a bounded max_workers argument in collect_and_compile_contracts, or remove the caller argument if serial execution is intentional. If parallelizing, use per-thread SQLite connections, rate-limit Etherscan requests, and guard solc installation/cache access.
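The parallel variant could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: collect_contracts_sketch, process_one, MinIntervalLimiter, get_connection, and MAX_WORKERS_CAP are hypothetical names, and the real per-contract work (Etherscan fetch, solc compile, TAC preprocessing) is reduced to a caller-supplied callback.

```python
import sqlite3
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS_CAP = 8              # assumed upper bound; tune for the real workload
_local = threading.local()       # per-thread storage for SQLite connections
_solc_lock = threading.Lock()    # serializes solc install/cache access


def get_connection(db_path):
    """Return the calling thread's own SQLite connection for db_path."""
    conns = getattr(_local, "conns", None)
    if conns is None:
        conns = _local.conns = {}
    if db_path not in conns:
        conns[db_path] = sqlite3.connect(db_path)
    return conns[db_path]


class MinIntervalLimiter:
    """Allow at most one call per min_interval seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Sleeping while holding the lock is deliberate: waiters queue up.
        with self._lock:
            delay = self._next_allowed - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            self._next_allowed = time.monotonic() + self.min_interval


def collect_contracts_sketch(contract_addresses, db_path, process_one,
                             max_workers=1, max_compiler_configs=2,
                             limiter=None):
    """Run process_one for each address with a bounded number of workers.

    process_one(address, conn, max_compiler_configs, solc_lock) is a
    hypothetical callback standing in for the real fetch/compile step.
    """
    max_workers = max(1, min(int(max_workers), MAX_WORKERS_CAP))

    def task(address):
        if limiter is not None:
            limiter.wait()               # throttle outbound API requests
        conn = get_connection(db_path)   # never share a connection across threads
        return process_one(address, conn, max_compiler_configs, _solc_lock)

    if max_workers == 1:
        # Serial path stays available when parallelism is not wanted.
        return [task(a) for a in contract_addresses]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order regardless of completion order.
        return list(pool.map(task, contract_addresses))
```

Clamping max_workers keeps a caller typo from spawning an unbounded pool, and keeping the one-worker path free of the executor makes the serial and parallel modes easy to compare.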
Acceptance criteria
- A unit/integration test covers the train.py collection call path and proves max_workers is accepted.
- Dataset collection can be configured with 1 worker and >1 workers without SQLite/solc races.
- Logs include contract/config throughput so preprocessing speedups are measurable.
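The first criterion could be checked with a signature-level test along these lines. This is only a sketch: OldBuilder mirrors the signature quoted under Evidence, FixedBuilder is an assumed shape for the fix, and accepts_call is a hypothetical helper. A real test would import the actual builder class from src/dataset_pipeline.py instead of these stand-ins.

```python
import inspect


def accepts_call(func, *args, **kwargs):
    """Return True if func's signature can bind the given arguments."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


class OldBuilder:
    # Stand-in with the signature currently reported in the pipeline.
    def collect_and_compile_contracts(self, contract_addresses,
                                      max_compiler_configs=2):
        return []


class FixedBuilder:
    # Assumed post-fix signature; the default of 1 keeps serial behavior.
    def collect_and_compile_contracts(self, contract_addresses,
                                      max_workers=1, max_compiler_configs=2):
        return []


def test_train_call_path_accepts_max_workers():
    # Same keyword arguments that train.py passes.
    kwargs = {"max_workers": 3, "max_compiler_configs": 2}
    assert not accepts_call(OldBuilder().collect_and_compile_contracts, [], **kwargs)
    assert accepts_call(FixedBuilder().collect_and_compile_contracts, [], **kwargs)
```

A binding check like this only shows that the argument is accepted. Showing that it is honored still needs a run with 1 worker and with more than 1 worker, as the second criterion requires.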