[Observability] Persist evaluation failures and traceable prediction samples #66

Description

@agorevski

Finding

The main evaluation path drops failed examples and stores predictions without enough identifiers to trace poor outputs back to the dataset.

Evidence

  • train.py:470-498 evaluates rows in a loop, but on exception only logs Error: ... and does not append a failed result row.
  • The details saved for successful rows at train.py:483-489 include only original, decompiled, and metrics; they omit the TAC input, dataset row index, metadata, input/output hashes, generation parameters, per-example timing, and error fields.
  • The summary at train.py:523-554 reports global means and replication aggregates but no attempted/succeeded/failed counts or worst-example samples.
  • The alternate SmartContractTrainingPipeline.evaluate_model path records sampled_indices at src/training_pipeline.py:927-934, but the CLI train.py --eval-only path does not.

Impact

When quality is bad, maintainers cannot quickly identify which compiler versions, selectors, functions, long TAC inputs, or failure modes caused the drop. Silently omitting failed rows can also make aggregate quality look better than it is, because the means are computed over succeeded rows only rather than over every attempted row.
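To make the inflation concrete, here is a small sketch with made-up scores. The helper names are hypothetical, and scoring a failed row as 0.0 is only one possible convention, not something train.py does today.

```python
from typing import List, Optional


def mean_over_successes(scores: List[Optional[float]]) -> float:
    """Mean over rows that produced a score; failed rows (None) are dropped."""
    ok = [s for s in scores if s is not None]
    return sum(ok) / len(ok)  # assumes at least one success


def mean_over_attempts(scores: List[Optional[float]], failure_score: float = 0.0) -> float:
    """Mean over every attempted row; failed rows count as failure_score (assumed 0.0)."""
    return sum(failure_score if s is None else s for s in scores) / len(scores)


# Made-up per-row scores: three successes and one failure (None).
scores = [1.0, 0.5, 0.75, None]
print(mean_over_successes(scores))  # 0.75
print(mean_over_attempts(scores))   # 0.5625
```

Whichever convention is chosen, reporting the failed count next to the mean lets a reader see how many rows the mean leaves out.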

Recommended fix

Persist one result object per attempted evaluation row, whether it succeeded or failed. Include the dataset index, the TAC input or a stable hash of it, source metadata, generation config, token lengths and truncation flags, timing, output, metrics, and an error/traceback summary. Add summary fields for num_attempted, num_succeeded, num_failed, and top/worst samples by metric.
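A minimal sketch of that loop shape, not the actual train.py code. The row keys ("tac", "reference", "metadata"), the record field names, and the generate/score callables are all hypothetical stand-ins for whatever the evaluation path really uses. Token lengths and truncation flags are left out because they depend on the tokenizer.

```python
import hashlib
import time
import traceback
from typing import Any, Callable, Dict, List


def stable_hash(text: str) -> str:
    """Short SHA-256 digest so a row can be matched without storing the full text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def evaluate_rows(
    rows: List[Dict[str, Any]],
    generate: Callable[[str], str],
    score: Callable[[str, str], Dict[str, float]],
    generation_config: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return exactly one detail record per attempted row, failed or not."""
    details: List[Dict[str, Any]] = []
    for index, row in enumerate(rows):
        # Start from a "failed" record; it is upgraded only if every step succeeds.
        record: Dict[str, Any] = {
            "dataset_index": row.get("dataset_index", index),
            "input_hash": stable_hash(row["tac"]),
            "metadata": row.get("metadata", {}),
            "generation_config": generation_config,
            "status": "failed",
            "output": None,
            "output_hash": None,
            "metrics": None,
            "error": None,
        }
        start = time.perf_counter()
        try:
            output = generate(row["tac"])
            record["output"] = output
            record["output_hash"] = stable_hash(output)
            record["metrics"] = score(row["reference"], output)
            record["status"] = "succeeded"
        except Exception as exc:
            # Keep a short summary rather than the full traceback.
            record["error"] = {
                "type": type(exc).__name__,
                "message": str(exc),
                "traceback_tail": traceback.format_exc().splitlines()[-3:],
            }
        finally:
            record["elapsed_seconds"] = time.perf_counter() - start
        details.append(record)  # appended on both paths
    return details
```

Appending after the try block, rather than inside it, is what guarantees a record for failed rows. If generation succeeds but scoring raises, the record keeps the output alongside the error, which helps separate model failures from metric failures.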

Acceptance criteria

  • Evaluation JSON contains a details entry for every attempted row, including failures.
  • Each detail can be traced back to the original dataset row without relying on ordering alone.
  • Summary includes failed count/rate and worst-N examples for semantic similarity, edit distance, and replication F1.
  • The existing latest_results.txt points to the detailed JSON and reports failure counts.
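The summary criteria could be met along these lines. This is a sketch under assumptions: it expects detail records with hypothetical "status", "dataset_index", and "metrics" fields, and the metric keys used below are placeholders. Because "worst" depends on whether a metric is a similarity or a distance, the direction is passed in explicitly rather than guessed.

```python
from typing import Any, Dict, List


def summarize(
    details: List[Dict[str, Any]],
    higher_is_better: Dict[str, bool],
    worst_n: int = 5,
) -> Dict[str, Any]:
    """Count attempted/succeeded/failed rows and list the worst-N rows per metric."""
    attempted = len(details)
    succeeded = [d for d in details if d["status"] == "succeeded"]
    failed = attempted - len(succeeded)
    summary: Dict[str, Any] = {
        "num_attempted": attempted,
        "num_succeeded": len(succeeded),
        "num_failed": failed,
        "failure_rate": failed / attempted if attempted else 0.0,
        "worst_examples": {},
    }
    for metric, higher_better in higher_is_better.items():
        scored = [d for d in succeeded if metric in d["metrics"]]
        # Worst first: ascending when higher is better, descending otherwise.
        ranked = sorted(
            scored,
            key=lambda d: d["metrics"][metric],
            reverse=not higher_better,
        )
        summary["worst_examples"][metric] = [
            {"dataset_index": d["dataset_index"], "value": d["metrics"][metric]}
            for d in ranked[:worst_n]
        ]
    return summary


# Usage with placeholder metric names; directions are assumptions.
example = summarize(
    [
        {"dataset_index": 0, "status": "succeeded",
         "metrics": {"semantic_similarity": 0.9, "edit_distance": 3}},
        {"dataset_index": 1, "status": "succeeded",
         "metrics": {"semantic_similarity": 0.2, "edit_distance": 40}},
        {"dataset_index": 2, "status": "failed", "metrics": None},
    ],
    {"semantic_similarity": True, "edit_distance": False},
    worst_n=1,
)
```

Each worst-example entry carries the dataset index, so it can be looked up in the details list or the dataset without relying on ordering.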
