[Observability] Add prompt and truncation diagnostics to offline evaluation results #108

Description

@agorevski

Finding

Offline evaluation persists per-example outputs and hashes, but it does not record whether TAC prompts were truncated or how close each request was to the model context window.

Evidence

  • train.py:1759-1762 builds the eval generation config from max_new_tokens and eval_batch_size only.
  • train.py:1815-1829 writes each detail with generation_config, hashes, an input preview, output, and metrics, but no token counts, prompt budget, truncation flag, or generated-token count.
  • train.py:1895-1899 calls SmartContractDecompiler.decompile_tac_to_solidity(...) and receives only the generated text.
  • src/model_setup.py:1693-1742 can silently remove comments/dead-code blocks or hard-truncate TAC before generation.
  • The web path already has the missing observability surface in web/app.py:540-581 (_prompt_diagnostics records TAC tokens before/after, tac_truncated, prompt tokens, and generated tokens), but the eval-only path does not reuse it.

Impact

When eval quality is poor on long or complex functions, maintainers cannot tell whether the model actually saw the full TAC, whether compiler metadata consumed too much of the prompt budget, or whether failures correlate with prompts near the context window. As a result, a prompt budgeting/truncation problem can be misdiagnosed as a model-quality problem during root-cause analysis.

Recommended fix

Expose prompt diagnostics from the decompiler or a shared helper and persist them in eval details. Aggregate summary counts/rates for truncated examples and token-budget pressure, and surface the fields in worst-sample reports.
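
As a rough illustration of the shared helper, the sketch below builds one diagnostics record per example from token counts the decompiler already has at hand. Everything here is hypothetical: PromptDiagnostics, build_prompt_diagnostics, and budget_pressure are illustrative names, not existing code in train.py, src/model_setup.py, or web/app.py. It also assumes that the prompt budget is the context window minus max_new_tokens and that any reduction in TAC tokens counts as truncation; the real budgeting logic may differ.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptDiagnostics:
    """Hypothetical per-example record to persist alongside eval details."""
    context_window: int
    max_new_tokens: int
    prompt_budget: int
    tac_tokens_before: int
    tac_tokens_after: int
    prompt_tokens: int
    generated_tokens: Optional[int]
    tac_truncated: bool
    truncation_strategy: Optional[str]
    budget_pressure: float


def build_prompt_diagnostics(
    *,
    context_window: int,
    max_new_tokens: int,
    tac_tokens_before: int,
    tac_tokens_after: int,
    prompt_tokens: int,
    generated_tokens: Optional[int] = None,
    truncation_strategy: Optional[str] = None,
) -> PromptDiagnostics:
    # Assumption: the prompt may use whatever the generation budget leaves over.
    if context_window <= max_new_tokens:
        raise ValueError("context_window must exceed max_new_tokens")
    prompt_budget = context_window - max_new_tokens
    return PromptDiagnostics(
        context_window=context_window,
        max_new_tokens=max_new_tokens,
        prompt_budget=prompt_budget,
        tac_tokens_before=tac_tokens_before,
        tac_tokens_after=tac_tokens_after,
        prompt_tokens=prompt_tokens,
        generated_tokens=generated_tokens,
        # Assumption: any token reduction is reported as truncation;
        # truncation_strategy says how the reduction happened.
        tac_truncated=tac_tokens_after < tac_tokens_before,
        truncation_strategy=truncation_strategy,
        # Fraction of the prompt budget that the final prompt consumed.
        budget_pressure=prompt_tokens / prompt_budget,
    )


# Usage: merge the record into the detail dict that eval already writes.
detail = {"output": "contract C {}", "metrics": {}}
detail["prompt_diagnostics"] = asdict(
    build_prompt_diagnostics(
        context_window=4096,
        max_new_tokens=1024,
        tac_tokens_before=5000,
        tac_tokens_after=2800,
        prompt_tokens=3000,
        truncation_strategy="hard_truncate",
    )
)
```

Taking plain integer counts keeps the helper independent of any tokenizer, so the web and eval paths could both call it with counts from whatever tokenizer they already use.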

Acceptance criteria

  • Each eval detail includes context window, prompt budget, max_new_tokens, TAC tokens before/after, prompt tokens, generated tokens when available, tac_truncated, and truncation strategy/marker.
  • Eval summary includes truncated-example count/rate and token-budget percentiles.
  • latest_results.txt or the worst-sample report calls out low-quality examples that were truncated.
  • Tests cover an over-budget TAC example and verify diagnostics are persisted.
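
The summary criterion could be met with something like the following sketch. It assumes each eval detail is a dict carrying tac_truncated, prompt_tokens, and prompt_budget; summarize_truncation and nearest_rank_percentile are hypothetical names, and the nearest-rank percentile method and the p50/p90/p99 choice are illustrative rather than prescribed by this issue.

```python
from typing import Dict, List, Optional, Sequence


def nearest_rank_percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_truncation(details: List[dict]) -> Dict[str, Optional[float]]:
    """Aggregate truncation counts and budget-pressure percentiles."""
    total = len(details)
    truncated = sum(1 for d in details if d["tac_truncated"])
    # Budget pressure: share of the prompt budget used by the final prompt.
    pressures = [d["prompt_tokens"] / d["prompt_budget"] for d in details]
    summary: Dict[str, Optional[float]] = {
        "examples": total,
        "truncated_count": truncated,
        "truncated_rate": truncated / total if total else 0.0,
    }
    for pct in (50, 90, 99):
        key = f"budget_pressure_p{pct}"
        summary[key] = nearest_rank_percentile(pressures, pct) if pressures else None
    return summary


# Usage with four synthetic details; only the last one was truncated.
details = [
    {"tac_truncated": False, "prompt_tokens": 250, "prompt_budget": 1000},
    {"tac_truncated": False, "prompt_tokens": 500, "prompt_budget": 1000},
    {"tac_truncated": False, "prompt_tokens": 750, "prompt_budget": 1000},
    {"tac_truncated": True, "prompt_tokens": 1000, "prompt_budget": 1000},
]
summary = summarize_truncation(details)
```

A test for the over-budget case could feed one such detail through the eval writer and assert that tac_truncated is true and that the summary's truncated_count is nonzero.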
