Finding
Dataset generation records useful aggregate counts, but lineage and failure diagnostics are too lossy for debugging data quality issues.
Evidence
- download_hf_contracts.py:616-628 writes a download manifest with hf_revision or "default" and GITHUB_SHA, but not the resolved Hugging Face commit, actual git SHA/dirty state, command-line args, dependency versions, solc versions, or content hashes of generated artifacts.
- download_hf_contracts.py:1492-1506 writes an export manifest with aggregate counts and solc_versions, but not source DB hash, output file hash, split lineage, command, or environment metadata.
- Worker preparation disables logging and returns None for any exception at download_hf_contracts.py:745-784; the parent records generic no_pairs/prepare produced no compile jobs at download_hf_contracts.py:951-958.
- Compile outcome persistence groups failures and stores only the first failure message per contract in download_hf_contracts.py:550-582.
Impact
If generated training data produces poor model quality, maintainers cannot reconstruct the exact upstream dataset revision or distinguish parser failures, solc install failures, compile errors, TAC analysis errors, selector matching misses, and filtering/dedup drops at sufficient granularity.
Recommended fix
Promote dataset manifests to first-class artifacts. Capture the resolved HF dataset revision, repo git SHA and dirty state, CLI args, package versions, solc versions used, input/output SHA-256 hashes, row counts, filter/dedup/drop counts, and per-phase timings. Persist per-contract and per-compile statuses with typed error categories and bounded traceback samples instead of collapsing preparation exceptions to None.
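One way the manifest capture could look is sketched below. This is a minimal illustration, not the existing code in download_hf_contracts.py: the function names (sha256_file, git_state, package_versions, build_manifest, write_manifest), the manifest keys, and the default package list are all hypothetical, and the caller is assumed to supply the already-resolved HF revision. Per-phase timings are omitted for brevity.

```python
import hashlib
import json
import platform
import subprocess
import sys
from importlib import metadata


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_state(repo_dir="."):
    """Return (sha, dirty), or (None, None) when git or the repo is unavailable."""
    try:
        sha = subprocess.run(
            ["git", "rev-parse", "HEAD"], cwd=repo_dir,
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        status = subprocess.run(
            ["git", "status", "--porcelain"], cwd=repo_dir,
            capture_output=True, text=True, check=True,
        ).stdout
        return sha, bool(status.strip())
    except (OSError, subprocess.CalledProcessError):
        return None, None


def package_versions(names):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(resolved_hf_revision, inputs, outputs, solc_versions,
                   counts, packages=("datasets",)):
    """Assemble a JSON-serializable manifest; all key names are illustrative."""
    sha, dirty = git_state()
    return {
        "hf_revision_resolved": resolved_hf_revision,  # supplied by the caller
        "git_sha": sha,
        "git_dirty": dirty,
        "argv": list(sys.argv),
        "python": platform.python_version(),
        "packages": package_versions(packages),
        "solc_versions": sorted(solc_versions),
        "inputs": {str(p): sha256_file(p) for p in inputs},
        "outputs": {str(p): sha256_file(p) for p in outputs},
        "counts": dict(counts),
    }


def write_manifest(manifest, path):
    """Write the manifest with stable key order so diffs between runs stay readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
```

Recording the git dirty flag alongside the SHA matters because a SHA alone cannot identify the code when the working tree has uncommitted changes.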
Acceptance criteria
- Each data-generation/export run writes a manifest that can identify the exact source data and code/environment used.
- The SQLite state or companion JSON contains typed counts for download skips, parser failures, solc install failures, compile failures, TAC analysis failures, match misses, filter drops, and dedup drops.
- Preparation exceptions are distinguishable from true no_pairs outcomes.
- Documentation explains where to find the manifest and how to use it during quality triage.
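The criteria on typed counts and on separating preparation exceptions from true no_pairs outcomes could be met along the following lines. This is a sketch under stated assumptions: the Outcome enum, the PrepareResult record, the prepare_contract wrapper, and the traceback bound are hypothetical names and values chosen for illustration, and prepare_fn stands in for whatever the worker currently calls.

```python
import traceback
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    """Hypothetical typed categories mirroring the failure modes listed above."""
    OK = "ok"
    NO_PAIRS = "no_pairs"
    PREPARE_EXCEPTION = "prepare_exception"
    PARSER_FAILURE = "parser_failure"
    SOLC_INSTALL_FAILURE = "solc_install_failure"
    COMPILE_FAILURE = "compile_failure"
    TAC_ANALYSIS_FAILURE = "tac_analysis_failure"
    MATCH_MISS = "match_miss"


@dataclass
class PrepareResult:
    contract_id: str
    outcome: Outcome
    jobs: list
    error_type: Optional[str] = None
    traceback_sample: Optional[str] = None


MAX_TRACEBACK_CHARS = 2000  # assumed bound; keeps stored samples small


def prepare_contract(contract_id, prepare_fn):
    """Run prepare_fn and return a typed result instead of None on failure."""
    try:
        jobs = prepare_fn(contract_id)
    except Exception as exc:
        # Keep the tail of the traceback, where the raising frame appears.
        sample = traceback.format_exc()[-MAX_TRACEBACK_CHARS:]
        return PrepareResult(contract_id, Outcome.PREPARE_EXCEPTION, [],
                             type(exc).__name__, sample)
    if not jobs:
        # Preparation succeeded but genuinely yielded no compile jobs.
        return PrepareResult(contract_id, Outcome.NO_PAIRS, [])
    return PrepareResult(contract_id, Outcome.OK, list(jobs))


def summarize(results):
    """Count results per outcome category, suitable for a manifest or JSON report."""
    return dict(Counter(r.outcome.value for r in results))
```

With a wrapper of this shape, the parent can persist error_type and traceback_sample per contract and write the summarize output into the manifest, so an exception in preparation no longer looks identical to a contract that simply has no pairs.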