[Observability] Persist typed failure and filter-drop manifests for Etherscan DatasetBuilder runs #110

Description

@agorevski

Finding

The Etherscan DatasetBuilder path used by train.py --collect-data still relies on logs and aggregate counts; it does not persist typed per-stage failures, filter-drop counts, or a dataset-generation manifest.

Evidence

  • train.py:107-194 uses src.dataset_pipeline.DatasetBuilder, then logs only total pairs, post-filter count, export path, and aggregate dataset statistics.
  • src/dataset_pipeline.py:793-931 logs skip, compile, and TAC-analysis failures (not verified, empty source, no compiler, no functions, compile failure, TAC failure) but returns only total_pairs.
  • src/dataset_pipeline.py:1606-1651 deletes rows for length, TAC length, duplicates, test names, and simple patterns without recording per-rule row counts.
  • src/dataset_pipeline.py:1663-1744 exports JSONL/CSV/Parquet but does not write a manifest with inputs, code/env metadata, output hashes, drop counts, or compile diagnostics.
  • The Hugging Face generator has stronger observability (export selection stats at download_hf_contracts.py:760-807 and compile diagnostics at download_hf_contracts.py:810-888), so the repository has two dataset paths with different debuggability.
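
For the filtering step specifically, per-rule row counts could be collected along these lines. This is a minimal sketch, not the existing code: `apply_filters`, the rule names, and the row shape are all hypothetical, and each dropped row is attributed to the first rule that rejects it.

```python
from collections import Counter
from typing import Callable

# Hypothetical shapes: a row is a plain dict, a rule is (name, keep_predicate).
Row = dict
Rule = tuple[str, Callable[[Row], bool]]


def apply_filters(rows: list[Row], rules: list[Rule]) -> tuple[list[Row], dict[str, int]]:
    """Apply rules in order and count how many rows each rule drops."""
    # Pre-seed every rule with 0 so rules that drop nothing still appear.
    drops = Counter({name: 0 for name, _ in rules})
    kept = []
    for row in rows:
        for name, keep in rules:
            if not keep(row):
                drops[name] += 1  # attribute the drop to the first failing rule
                break
        else:
            kept.append(row)  # no rule rejected this row
    return kept, dict(drops)


# Illustrative rules only; the real thresholds live in dataset_pipeline.py.
example_rules = [
    ("body_too_long", lambda r: len(r["body"]) <= 100),
    ("test_name", lambda r: not r["name"].startswith("test")),
]
example_rows = [
    {"name": "transfer", "body": "x" * 10},
    {"name": "testFoo", "body": "y"},
    {"name": "approve", "body": "z" * 500},
]
kept_rows, drop_counts = apply_filters(example_rows, example_rules)
```

A stateful rule such as deduplication would need a predicate that remembers what it has seen (for example a closure over a set); the counting logic stays the same.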

Impact

If a model trained from the Etherscan path performs poorly, maintainers cannot distinguish upstream fetch failures, parser failures, compiler-selection issues, solc errors, TAC-analysis failures, selector-match misses, or filtering/dedup drops without re-running and scraping logs. The exported dataset is also hard to trace back to exact inputs and environment.

Recommended fix

Add a first-class manifest and diagnostics store for DatasetBuilder runs. Persist per-contract/per-compiler status rows with typed failure categories, bounded error samples, per-filter deletion counts, input address file hash/count, output artifact hashes/row counts, git/env metadata, and timings.
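
One possible shape for the typed failure categories and bounded error samples is sketched below. The category names mirror the skip/failure reasons currently logged in src/dataset_pipeline.py:793-931; the `FailureKind` and `Diagnostics` names, the sample cap, and the message truncation length are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class FailureKind(str, Enum):
    """Hypothetical typed categories, one per reason the pipeline logs today."""
    NOT_VERIFIED = "not_verified"
    EMPTY_SOURCE = "empty_source"
    NO_COMPILER = "no_compiler"
    NO_FUNCTIONS = "no_functions"
    COMPILE_FAILURE = "compile_failure"
    TAC_FAILURE = "tac_failure"


@dataclass
class Diagnostics:
    """Counts every failure but keeps only a bounded number of samples per kind."""
    max_samples: int = 5
    counts: Counter = field(default_factory=Counter)
    samples: dict = field(default_factory=dict)

    def record(self, kind: FailureKind, address: str, message: str = "") -> None:
        self.counts[kind.value] += 1
        bucket = self.samples.setdefault(kind.value, [])
        if len(bucket) < self.max_samples:
            # Truncate long compiler output so the store stays small.
            bucket.append({"address": address, "error": message[:500]})


diag = Diagnostics(max_samples=2)
for i in range(3):
    diag.record(FailureKind.COMPILE_FAILURE, f"0x{i:040x}", "solc exited with code 1")
diag.record(FailureKind.NOT_VERIFIED, "0x" + "f" * 40)
```

The counts stay exact while the samples are capped, so the diagnostics remain cheap to persist even when one category dominates a run.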

Acceptance criteria

  • train.py --collect-data writes a dataset manifest next to the exported dataset.
  • The manifest or SQLite DB reports counts for each skip/failure/filter/drop category.
  • Compile/TAC/parser failures include typed categories and bounded sample errors/contracts.
  • Documentation explains how to use the manifest during data-quality/model-quality triage.
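
To make the first criterion concrete, here is a rough sketch of writing a JSON manifest next to the exported file. The `write_manifest` function, the manifest field names, and the `.manifest.json` suffix are hypothetical; only standard-library calls are used, and the git lookup returns `None` when `git` is unavailable or the directory is not a repository.

```python
import hashlib
import json
import platform
import subprocess
import time
from pathlib import Path


def sha256_file(path) -> str:
    """Hash a file in chunks so large exports do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def git_commit():
    """Return the current commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def write_manifest(export_path, address_file, failure_counts, drop_counts, row_count) -> Path:
    """Write <export>.manifest.json beside the export and return its path."""
    export_path, address_file = Path(export_path), Path(address_file)
    addresses = [ln for ln in address_file.read_text().splitlines() if ln.strip()]
    manifest = {
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "git_commit": git_commit(),
        "python_version": platform.python_version(),
        "input": {"address_file": address_file.name,
                  "sha256": sha256_file(address_file),
                  "address_count": len(addresses)},
        "output": {"file": export_path.name,
                   "sha256": sha256_file(export_path),
                   "row_count": row_count},
        "failure_counts": failure_counts,
        "filter_drop_counts": drop_counts,
    }
    out = export_path.with_name(export_path.name + ".manifest.json")
    out.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out
```

During triage, comparing `failure_counts` and `filter_drop_counts` between two manifests would show whether a quality regression comes from upstream failures or from filtering, without re-running collection.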
