[Observability] Add prompt and truncation diagnostics to offline evaluation results

## Finding
Offline evaluation persists per-example outputs and hashes, but it does not record whether TAC prompts were truncated or how close each request was to the model context window.

## Evidence
- `train.py:1759-1762` fixes the eval generation config to `max_new_tokens` and `eval_batch_size` only.
- `train.py:1815-1829` writes each detail with `generation_config`, hashes, an input preview, output, and metrics, but no token counts, prompt budget, truncation flag, or generated-token count.
- `train.py:1895-1899` calls `SmartContractDecompiler.decompile_tac_to_solidity(...)` and only receives generated text.
- `src/model_setup.py:1693-1742` can silently remove comments/dead-code blocks or hard-truncate TAC before generation.
- The web path has the missing observability surface in `web/app.py:540-581` (`_prompt_diagnostics` records TAC tokens before/after, `tac_truncated`, prompt tokens, and generated tokens), but eval-only does not reuse it.

## Impact
When eval quality is poor on long or complex functions, maintainers cannot tell whether the model actually saw the full TAC, whether compiler metadata consumed too much prompt budget, or whether failures correlate with near-context-window prompts. This makes root-cause analysis look like a model-quality issue when it may be a prompt budgeting/truncation issue.

## Recommended fix
Expose prompt diagnostics from the decompiler or a shared helper and persist them in eval `details`. Aggregate summary counts/rates for truncated examples and token-budget pressure, and surface the fields in worst-sample reports.

## Acceptance criteria
- Each eval detail includes context window, prompt budget, `max_new_tokens`, TAC tokens before/after, prompt tokens, generated tokens when available, `tac_truncated`, and truncation strategy/marker.
- Eval summary includes truncated-example count/rate and token-budget percentiles.
- `latest_results.txt` or worst samples call out truncated low-quality examples.
- Tests cover an over-budget TAC example and verify diagnostics are persisted.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Observability] Add prompt and truncation diagnostics to offline evaluation results #108

Finding

Evidence

Impact

Recommended fix

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[Observability] Add prompt and truncation diagnostics to offline evaluation results #108

Description

Finding

Evidence

Impact

Recommended fix

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions