
[Training] Fix collect-and-compile max_workers mismatch #56

Description

@agorevski

Evidence

  • train.py:133-137 calls builder.collect_and_compile_contracts(..., max_workers=3, max_compiler_configs=...).
  • src/dataset_pipeline.py:793-797 defines collect_and_compile_contracts(self, contract_addresses, max_compiler_configs=2) with no max_workers parameter.
  • src/dataset_pipeline.py:816-918 then processes contracts, compiler configs, and compiled contracts in nested serial loops.
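The mismatch described above can be reproduced in isolation. This is a minimal sketch: `DatasetBuilder` is a hypothetical stand-in for the real builder class, and only the two signatures quoted in the evidence are taken from the repository.

```python
class DatasetBuilder:
    """Hypothetical stand-in; only the method signature mirrors the issue."""

    def collect_and_compile_contracts(self, contract_addresses, max_compiler_configs=2):
        return []


def call_like_train_py(builder):
    """Call the method the way train.py does and return the error text, if any."""
    try:
        builder.collect_and_compile_contracts(
            ["0x0"], max_workers=3, max_compiler_configs=2
        )
    except TypeError as exc:
        return str(exc)
    return None


message = call_like_train_py(DatasetBuilder())
print(message)  # the text names 'max_workers' as an unexpected keyword argument
```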

Impact

The full training pipeline can fail before training starts with TypeError: collect_and_compile_contracts() got an unexpected keyword argument 'max_workers'. Even after the signature mismatch is fixed, dataset compilation and TAC preprocessing remain serial, which increases wall-clock time and compute cost before every fresh training run.

Recommended fix

Accept and honor a bounded max_workers argument in collect_and_compile_contracts, or remove the caller argument if serial execution is intentional. If parallelizing, use per-thread SQLite connections, rate-limit Etherscan requests, and guard solc installation/cache access.
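One possible shape for the parallel variant is sketched below. It is not the project's implementation: the standalone function, the `compiled` table, `MAX_WORKERS_CAP`, `RateLimiter`, and the stubbed fetch/compile steps are all assumptions made for illustration. Each task opens and closes its own SQLite connection inside the worker thread, so no connection is ever shared between threads.

```python
import sqlite3
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import closing

MAX_WORKERS_CAP = 8  # assumed upper bound, not a value from the repository
_solc_lock = threading.Lock()  # serializes solc install/cache access


class RateLimiter:
    """Space calls at least min_interval seconds apart across all threads."""

    def __init__(self, min_interval):
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self._min_interval
        if delay:
            time.sleep(delay)


def bounded_workers(max_workers, cap=MAX_WORKERS_CAP):
    """Clamp the requested worker count to the range [1, cap]."""
    return max(1, min(int(max_workers), cap))


def collect_and_compile_contracts(contract_addresses, db_path,
                                  max_compiler_configs=2, max_workers=1,
                                  limiter=None):
    limiter = limiter or RateLimiter(min_interval=0.0)

    def process(address):
        limiter.wait()  # throttle the (stubbed) Etherscan request
        with _solc_lock:
            pass  # placeholder: install or look up the solc version here
        # The connection is created, used, and closed in this worker thread.
        with closing(sqlite3.connect(db_path, timeout=30)) as conn:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO compiled (address, configs) VALUES (?, ?)",
                    (address, max_compiler_configs),
                )
        return address

    with ThreadPoolExecutor(max_workers=bounded_workers(max_workers)) as pool:
        return list(pool.map(process, contract_addresses))
```

With `max_workers=1` this degenerates to serial execution, so the same code path covers both the serial and the parallel configuration.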

Acceptance criteria

  • A unit/integration test covers the train.py collection call path and proves max_workers is accepted.
  • Dataset collection can be configured with 1 worker and >1 workers without SQLite/solc races.
  • Logs include contract/config throughput so preprocessing speedups are measurable.
