[Observability] Persist evaluation failures and traceable prediction samples

## Finding
The main evaluation path drops failed examples and stores predictions without enough identifiers to trace poor outputs back to the dataset.

## Evidence
- `train.py:470-498` evaluates rows in a loop, but on exception only logs `Error: ...` and does not append a failed result row.
- Successful details saved at `train.py:483-489` include only `original`, `decompiled`, and `metrics`; they omit the TAC input, dataset row index, metadata, input/output hashes, generation parameters, per-example timing, and error fields.
- The summary at `train.py:523-554` reports global means and replication aggregates but no attempted/succeeded/failed counts or worst-example samples.
- The alternate `SmartContractTrainingPipeline.evaluate_model` path records `sampled_indices` in `src/training_pipeline.py:927-934`, but the CLI `train.py --eval-only` path does not.

## Impact
When evaluation quality drops, maintainers cannot quickly identify which compiler versions, selectors, functions, long TAC inputs, or failure modes caused it. Because failed rows are silently omitted, the reported means cover only the rows that succeeded, so aggregate quality can look better than it was across all attempted rows.

## Recommended fix
Persist one result object per attempted evaluation row, successful or failed. Include dataset index, TAC input or stable hash, source metadata, generation config, token lengths/truncation flags, timing, output, metrics, and error/traceback summary. Add summary fields for `num_attempted`, `num_succeeded`, `num_failed`, and top/worst samples by metric.
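
A minimal sketch of the per-row record described above, not the actual `train.py` code. The row keys `tac_input`, `reference`, `dataset_index`, and `metadata`, the `generate` and `score` callables, and the helper names are hypothetical and would need to be mapped onto the real evaluation loop. Token lengths and truncation flags are omitted because they depend on the tokenizer in use.

```python
import hashlib
import time
import traceback
from typing import Any, Callable, Dict, List


def stable_hash(text: str) -> str:
    """Short SHA-256 digest so a record can be matched to its input text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def evaluate_rows(
    rows: List[Dict[str, Any]],
    generate: Callable[[str], str],
    score: Callable[[str, str], Dict[str, float]],
    generation_config: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return one detail record per attempted row, including failures."""
    details: List[Dict[str, Any]] = []
    for position, row in enumerate(rows):
        record: Dict[str, Any] = {
            # Prefer an explicit dataset index; fall back to loop position.
            "dataset_index": row.get("dataset_index", position),
            "input_hash": stable_hash(row["tac_input"]),
            "metadata": row.get("metadata", {}),
            "generation_config": generation_config,
            "status": "failed",  # overwritten only when the row succeeds
            "output": None,
            "output_hash": None,
            "metrics": None,
            "error": None,
        }
        start = time.perf_counter()
        try:
            output = generate(row["tac_input"])
            record["output"] = output
            record["output_hash"] = stable_hash(output)
            record["metrics"] = score(row["reference"], output)
            record["status"] = "succeeded"
        except Exception as exc:
            # Keep a compact error summary instead of dropping the row.
            record["error"] = {
                "type": type(exc).__name__,
                "message": str(exc),
                "traceback_tail": traceback.format_exc().splitlines()[-3:],
            }
        finally:
            record["elapsed_seconds"] = time.perf_counter() - start
            details.append(record)
    return details
```

Appending in `finally` means a row reaches `details` whether it succeeded or raised, so the number of records always equals the number of attempted rows.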

## Acceptance criteria
- Evaluation JSON contains a `details` entry for every attempted row, including failures.
- Each detail can be traced back to the original dataset row without relying on ordering alone.
- Summary includes failed count/rate and worst-N examples for semantic similarity, edit distance, and replication F1.
- The existing `latest_results.txt` points to the detailed JSON and reports failure counts.
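
The summary criteria above could be met along these lines. This is a sketch under assumptions: each detail record is assumed to carry `status`, `dataset_index`, and a `metrics` dict, and the metric key names and their directions (whether higher is better) are hypothetical and should be taken from the real metric definitions.

```python
from typing import Any, Dict, List


def summarize(
    details: List[Dict[str, Any]],
    metric_directions: Dict[str, bool],
    worst_n: int = 5,
) -> Dict[str, Any]:
    """Build attempted/succeeded/failed counts and worst-N samples per metric.

    metric_directions maps a metric name to True when higher is better.
    """
    attempted = len(details)
    succeeded = [d for d in details if d["status"] == "succeeded"]
    failed = attempted - len(succeeded)
    summary: Dict[str, Any] = {
        "num_attempted": attempted,
        "num_succeeded": len(succeeded),
        "num_failed": failed,
        "failure_rate": failed / attempted if attempted else 0.0,
        "worst": {},
    }
    for metric, higher_is_better in metric_directions.items():
        scored = [d for d in succeeded if metric in d["metrics"]]
        # Worst first: ascending when higher is better, descending otherwise.
        scored.sort(key=lambda d: d["metrics"][metric], reverse=not higher_is_better)
        summary["worst"][metric] = [
            {"dataset_index": d["dataset_index"], "value": d["metrics"][metric]}
            for d in scored[:worst_n]
        ]
    return summary
```

For example, `summarize(details, {"semantic_similarity": True, "edit_distance": False})` would list the lowest-similarity and highest-distance rows by dataset index, which makes each one traceable without relying on ordering.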

[Observability] Persist evaluation failures and traceable prediction samples #66

Finding

Evidence

Impact

Recommended fix

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[Observability] Persist evaluation failures and traceable prediction samples #66

Description

Finding

Evidence

Impact

Recommended fix

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions