Finding
The Etherscan DatasetBuilder path used by train.py --collect-data still relies on logs and aggregate counts; it does not persist typed per-stage failures, filter-drop counts, or a dataset-generation manifest.
Evidence
- train.py:107-194 uses src.dataset_pipeline.DatasetBuilder, then logs only total pairs, post-filter count, export path, and aggregate dataset statistics.
- src/dataset_pipeline.py:793-931 logs skips/compile/TAC-analysis failures (not verified, empty source, no compiler, no functions, compile failure, TAC failure) but returns only total_pairs.
- src/dataset_pipeline.py:1606-1651 deletes rows for length, TAC length, duplicates, test names, and simple patterns without recording per-rule row counts.
- src/dataset_pipeline.py:1663-1744 exports JSONL/CSV/Parquet but does not write a manifest with inputs, code/env metadata, output hashes, drop counts, or compile diagnostics.
- The Hugging Face generator has stronger observability (download_hf_contracts.py:760-807 export selection stats and download_hf_contracts.py:810-888 compile diagnostics), so the repository has two dataset paths with different debuggability.
Impact
If a model trained from the Etherscan path performs poorly, maintainers cannot distinguish upstream fetch failures, parser failures, compiler-selection issues, solc errors, TAC-analysis failures, selector-match misses, or filtering/dedup drops without re-running and scraping logs. The exported dataset is also hard to trace back to exact inputs and environment.
Recommended fix
Add a first-class manifest and diagnostics store for DatasetBuilder runs. Persist per-contract/per-compiler status rows with typed failure categories, bounded error samples, per-filter deletion counts, input address file hash/count, output artifact hashes/row counts, git/env metadata, and timings.
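A minimal sketch of what the typed failure categories, bounded error samples, and per-filter deletion counts could look like in memory before being persisted. The names FailureCategory, RunDiagnostics, record_failure, and record_filter are hypothetical and do not exist in the repository; the category values simply mirror the skip/failure reasons that src/dataset_pipeline.py already logs.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from enum import Enum


class FailureCategory(str, Enum):
    # Hypothetical names mirroring the reasons currently only written to logs.
    NOT_VERIFIED = "not_verified"
    EMPTY_SOURCE = "empty_source"
    NO_COMPILER = "no_compiler"
    NO_FUNCTIONS = "no_functions"
    COMPILE_FAILURE = "compile_failure"
    TAC_FAILURE = "tac_failure"


@dataclass
class RunDiagnostics:
    # Assumed cap on stored error samples per category, so the store stays small.
    max_samples: int = 5
    failures: Counter = field(default_factory=Counter)
    samples: dict = field(default_factory=lambda: defaultdict(list))
    filter_drops: Counter = field(default_factory=Counter)

    def record_failure(self, category, address, error=""):
        """Count every failure, but keep only the first max_samples examples."""
        self.failures[category.value] += 1
        bucket = self.samples[category.value]
        if len(bucket) < self.max_samples:
            # Truncate long compiler output so one contract cannot bloat the record.
            bucket.append({"address": address, "error": error[:500]})

    def record_filter(self, rule, rows_before, rows_after):
        """Record how many rows a single filter rule removed."""
        self.filter_drops[rule] += rows_before - rows_after

    def to_dict(self):
        """Return a JSON-serializable summary suitable for a manifest."""
        return {
            "failures": dict(self.failures),
            "failure_samples": {k: list(v) for k, v in self.samples.items()},
            "filter_drops": dict(self.filter_drops),
        }
```

Under these assumptions, each early-return or except branch in the collection loop would call record_failure, and each delete step in the filter stage would call record_filter with the row count before and after the rule ran.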
Acceptance criteria
- train.py --collect-data writes a dataset manifest next to the exported dataset.
- The manifest or SQLite DB reports counts for each skip/failure/filter/drop category.
- Compile/TAC/parser failures include typed categories and bounded sample errors/contracts.
- Documentation explains how to use the manifest during data-quality/model-quality triage.
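One way the first two criteria could be met is sketched below, using only the standard library. The function write_manifest, the ".manifest.json" suffix, and the manifest field names are assumptions for illustration, not an existing interface; git metadata is omitted for brevity, and the row count assumes a line-oriented export such as JSONL.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path):
    """Hash a file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_lines(path):
    """Count lines; equals the row count for JSONL and one-address-per-line files."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)


def write_manifest(dataset_path, address_file, diagnostics):
    """Write <dataset>.manifest.json next to the exported dataset (hypothetical layout)."""
    dataset_path = Path(dataset_path)
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "input": {
            "path": str(address_file),
            "sha256": sha256_file(address_file),
            "count": count_lines(address_file),
        },
        "output": {
            "path": str(dataset_path),
            "sha256": sha256_file(dataset_path),
            "rows": count_lines(dataset_path),
        },
        # Per-category failure and filter-drop counts collected during the run.
        "diagnostics": diagnostics,
    }
    manifest_path = dataset_path.with_name(dataset_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path
```

During triage, a maintainer could then load the manifest with json.loads and compare the failure and filter-drop counts against the input address count to see which stage removed the most data, without re-running the collection or scraping logs.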