Finding
The main evaluation path drops failed examples and stores predictions without enough identifiers to trace poor outputs back to the dataset.
Evidence
- train.py:470-498 evaluates rows in a loop, but on exception only logs Error: ... and does not append a failed result row.
- Successful details saved at train.py:483-489 include only original, decompiled, and metrics; they omit the TAC input, dataset row index, metadata, input/output hashes, generation parameters, per-example timing, and error fields.
- The summary at train.py:523-554 reports global means and replication aggregates but no attempted/succeeded/failed counts or worst-example samples.
- The alternate SmartContractTrainingPipeline.evaluate_model path records sampled_indices in src/training_pipeline.py:927-934, but the CLI train.py --eval-only path does not.
Impact
When quality is bad, maintainers cannot quickly identify which compiler versions, selectors, functions, long TAC inputs, or failure modes caused the drop. Silent omission of failed rows can also make aggregate quality look better than the actual attempted evaluation.
Recommended fix
Persist one result object per attempted evaluation row, successful or failed. Include dataset index, TAC input or stable hash, source metadata, generation config, token lengths/truncation flags, timing, output, metrics, and error/traceback summary. Add summary fields for num_attempted, num_succeeded, num_failed, and top/worst samples by metric.
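A minimal sketch of the per-row persistence described above, not the actual train.py code. The helper names (evaluate_rows, stable_hash), the row keys "input", "output", and "metadata", and the generate/score callables are all assumptions made for illustration; the real loop would plug in the model's generation call and the existing metric function.

```python
import hashlib
import time
import traceback
from typing import Any, Callable, Dict, List


def stable_hash(text: str) -> str:
    """Short SHA-256 prefix so long TAC inputs can be matched without storing them whole."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def evaluate_rows(
    rows: List[Dict[str, Any]],
    generate: Callable[[str], str],                 # hypothetical: TAC text -> decompiled text
    score: Callable[[str, str], Dict[str, float]],  # hypothetical: (original, decompiled) -> metrics
    generation_config: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return one detail record per attempted row, whether it succeeded or failed."""
    details: List[Dict[str, Any]] = []
    for index, row in enumerate(rows):
        record: Dict[str, Any] = {
            # Prefer an explicit dataset index if the row carries one (assumed key).
            "dataset_index": row.get("dataset_index", index),
            "input_hash": stable_hash(row["input"]),
            "metadata": row.get("metadata", {}),
            "generation_config": generation_config,
            "original": row["output"],
            "decompiled": None,
            "output_hash": None,
            "metrics": None,
            "status": "ok",
            "error": None,
        }
        start = time.perf_counter()
        try:
            record["decompiled"] = generate(row["input"])
            record["output_hash"] = stable_hash(record["decompiled"])
            record["metrics"] = score(row["output"], record["decompiled"])
        except Exception as exc:
            # Keep the failed row in the results instead of only logging it.
            record["status"] = "failed"
            record["error"] = {
                "type": type(exc).__name__,
                "message": str(exc),
                "traceback_tail": traceback.format_exc().splitlines()[-3:],
            }
        finally:
            record["elapsed_s"] = round(time.perf_counter() - start, 4)
            details.append(record)
    return details
```

Appending in the finally clause means a row can never be dropped silently: every attempted index appears in details exactly once, with status distinguishing successes from failures. Token lengths and truncation flags are left out here because they depend on the tokenizer in use.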
Acceptance criteria
- Evaluation JSON contains a details entry for every attempted row, including failures.
- Each detail can be traced back to the original dataset row without relying on ordering alone.
- Summary includes failed count/rate and worst-N examples for semantic similarity, edit distance, and replication F1.
- Existing latest_results.txt links or points to the detailed JSON and reports failure counts.
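The summary criteria above could be met with something like the following sketch. It assumes each detail record carries status, dataset_index, and a metrics dict; the function name summarize, the metric key names, and the direction of each metric (for example, that a larger edit distance is worse) are assumptions, not taken from train.py.

```python
from typing import Any, Dict, List


def summarize(
    details: List[Dict[str, Any]],
    metric_directions: Dict[str, bool],  # metric name -> True if higher is better
    worst_n: int = 5,
) -> Dict[str, Any]:
    """Count attempted/succeeded/failed rows and list the worst-N rows per metric."""
    succeeded = [d for d in details if d["status"] == "ok"]
    num_attempted = len(details)
    num_failed = num_attempted - len(succeeded)
    summary: Dict[str, Any] = {
        "num_attempted": num_attempted,
        "num_succeeded": len(succeeded),
        "num_failed": num_failed,
        "failure_rate": num_failed / num_attempted if num_attempted else 0.0,
        "worst": {},
    }
    for metric, higher_is_better in metric_directions.items():
        scored = [d for d in succeeded if metric in d["metrics"]]
        # Worst first: ascending when higher is better, descending otherwise.
        ranked = sorted(
            scored,
            key=lambda d: d["metrics"][metric],
            reverse=not higher_is_better,
        )
        summary["worst"][metric] = [
            {"dataset_index": d["dataset_index"], "value": d["metrics"][metric]}
            for d in ranked[:worst_n]
        ]
    return summary
```

Because the worst-N lists store dataset_index rather than list position, each entry can be looked up in the dataset directly, and the failure rate is computed over attempted rows so that failures are not hidden from the aggregate.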