[Observability] Record dataset lineage and compile failure diagnostics #68

Description

@agorevski

Finding

Dataset generation records useful aggregate counts, but lineage and failure diagnostics are too lossy for debugging data quality issues.

Evidence

  • download_hf_contracts.py:616-628 writes a download manifest with hf_revision or "default" and GITHUB_SHA, but not the resolved Hugging Face commit, actual git SHA/dirty state, command-line args, dependency versions, solc versions, or content hashes of generated artifacts.
  • download_hf_contracts.py:1492-1506 writes an export manifest with aggregate counts and solc_versions, but not source DB hash, output file hash, split lineage, command, or environment metadata.
  • Worker preparation disables logging and returns None on any exception (download_hf_contracts.py:745-784); the parent then records only the generic no_pairs / "prepare produced no compile jobs" outcome (download_hf_contracts.py:951-958), so a crash is indistinguishable from a contract that genuinely yields no pairs.
  • Compile outcome persistence groups failures and stores only the first failure message per contract in download_hf_contracts.py:550-582.

Impact

If generated training data produces poor model quality, maintainers cannot reconstruct the exact upstream dataset revision or distinguish parser failures, solc install failures, compile errors, TAC analysis errors, selector matching misses, and filtering/dedup drops at sufficient granularity.

Recommended fix

Promote dataset manifests to first-class artifacts. Capture the resolved HF dataset revision, the repo git SHA and dirty state, CLI args, package versions, solc versions used, input/output SHA-256 hashes, row counts, filter/dedup/drop counts, and per-phase timings. Persist per-contract and per-compile statuses with typed error categories and bounded traceback samples instead of collapsing preparation exceptions to None.
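
As a rough illustration of the manifest fields listed above, here is a minimal sketch. Every name in it (`sha256_file`, `git_state`, `package_versions`, `build_manifest`, and the manifest keys) is hypothetical and not taken from download_hf_contracts.py. The resolved revision is passed in by the caller; one possible source is the `sha` field returned by `huggingface_hub.HfApi().dataset_info(...)`, but that is an assumption about how the script talks to the Hub.

```python
import hashlib
import platform
import subprocess
import sys
from importlib import metadata
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_state(repo_dir: Path) -> dict:
    """Return the checked-out commit and whether the work tree has local changes."""
    def run(*args: str) -> str:
        proc = subprocess.run(["git", *args], cwd=repo_dir,
                              capture_output=True, text=True, check=True)
        return proc.stdout.strip()
    try:
        return {"sha": run("rev-parse", "HEAD"),
                "dirty": bool(run("status", "--porcelain"))}
    except (OSError, subprocess.CalledProcessError):
        # git missing or not a repository: record the gap instead of guessing.
        return {"sha": None, "dirty": None}


def package_versions(names) -> dict:
    """Look up installed versions; None marks a package that is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(*, resolved_hf_revision, repo_dir, argv, packages,
                   solc_versions, inputs, outputs, counts, timings) -> dict:
    """Assemble a JSON-serializable manifest (hypothetical layout)."""
    return {
        "hf_revision_resolved": resolved_hf_revision,
        "git": git_state(Path(repo_dir)),
        "argv": list(argv),
        "python": platform.python_version(),
        "packages": package_versions(packages),
        "solc_versions": sorted(solc_versions),
        "inputs": {str(p): sha256_file(Path(p)) for p in inputs},
        "outputs": {str(p): sha256_file(Path(p)) for p in outputs},
        "counts": dict(counts),
        "timings_s": dict(timings),
    }
```

A caller would pass `sys.argv` as `argv` and write the result with `json.dump` next to the generated artifacts, so that the manifest and the data it describes travel together.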

Acceptance criteria

  • Each data-generation/export run writes a manifest that can identify the exact source data and code/environment used.
  • The SQLite state or companion JSON contains typed counts for download skips, parser failures, solc install failures, compile failures, TAC analysis failures, match misses, filter drops, and dedup drops.
  • Preparation exceptions are distinguishable from true no_pairs outcomes.
  • Documentation explains where to find the manifest and how to use it during quality triage.
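
One way the typed-status criteria could be met is sketched below. It is illustrative only: `Outcome`, `PrepareResult`, `prepare_safely`, `tally`, and `MAX_TRACEBACK_CHARS` are hypothetical names, and the `prepare` callable stands in for whatever the worker actually does at download_hf_contracts.py:745-784. The key point is that an exception produces its own category with a bounded traceback sample, rather than a None that the parent reads as no_pairs.

```python
import traceback
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(str, Enum):
    """Hypothetical status categories mirroring the counts listed above."""
    OK = "ok"
    NO_PAIRS = "no_pairs"
    DOWNLOAD_SKIP = "download_skip"
    PARSER_FAILURE = "parser_failure"
    SOLC_INSTALL_FAILURE = "solc_install_failure"
    COMPILE_FAILURE = "compile_failure"
    TAC_ANALYSIS_FAILURE = "tac_analysis_failure"
    MATCH_MISS = "match_miss"
    FILTER_DROP = "filter_drop"
    DEDUP_DROP = "dedup_drop"
    PREPARE_EXCEPTION = "prepare_exception"


MAX_TRACEBACK_CHARS = 2000  # assumed bound; tune to the storage budget


@dataclass
class PrepareResult:
    contract_id: str
    outcome: Outcome
    jobs: list
    error_type: Optional[str] = None
    traceback_sample: Optional[str] = None


def prepare_safely(contract_id: str,
                   prepare: Callable[[str], list]) -> PrepareResult:
    """Run a preparation step and classify the result instead of returning None."""
    try:
        jobs = prepare(contract_id)
    except Exception as exc:
        # Keep the tail of the traceback: it holds the raising frame and message.
        sample = traceback.format_exc()[-MAX_TRACEBACK_CHARS:]
        return PrepareResult(contract_id, Outcome.PREPARE_EXCEPTION, [],
                             type(exc).__name__, sample)
    if not jobs:
        # Preparation ran cleanly but found nothing to compile.
        return PrepareResult(contract_id, Outcome.NO_PAIRS, [])
    return PrepareResult(contract_id, Outcome.OK, list(jobs))


def tally(results) -> dict:
    """Aggregate per-contract outcomes into typed counts for the manifest."""
    return dict(Counter(r.outcome.value for r in results))
```

Mapping specific exception types to the narrower categories (parser versus solc install, for example) depends on which exceptions the real code raises, so the sketch leaves that step out and only separates crashes from genuine no_pairs results.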
