history: compare should diff score deltas, not just pass/fail verdicts #104

Description

@renaudcepre

Tracked for when the `protest history` browse CLI is reintroduced (code archived on `archive/history-cli`).

Problem

`_classify_changes` (archive/history-cli:`protest/cli/history.py`) only diffs verdicts (`passed`) and the two integrity hashes. It never reads the numeric scores — even though the writer persists them on every run:

  • per case: `scores: {name: float}` (`protest/history/plugin.py` `_serialize_eval_case`)
  • per run: `evals.scores_summary: {score: {mean, median, p5, p95, ...}}`

So a case that stays `passed=True` across two runs while its score mean drops 0.95 → 0.62 is reported as "No changes". For a score-centric eval tool, the continuous drift is the signal we want — the boolean verdict is the coarse one.

Proposal

When a case is comparable (same `case_hash` and same `eval_hash` in both runs), compute and display the per-score delta, e.g. `recall 0.95 → 0.62 (-0.33)`. Add a `score_regression` category gated on a configurable threshold (e.g. |Δmean| ≥ 0.05). When the hashes differ, keep the current "modified" behaviour: do not pretend to compare across a changed thermometer (see the epoch issue).
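A rough sketch of what the comparison could look like, not the actual `_classify_changes` code. It assumes each stored case is a plain dict carrying `case_hash`, `eval_hash` and `scores`; the helper names (`ScoreDelta`, `score_deltas`, `is_score_regression`) are hypothetical, and treating only drops (not gains) as a regression is an assumption about how the threshold would be applied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScoreDelta:
    """One score compared across two runs (hypothetical helper type)."""
    name: str
    before: float
    after: float

    @property
    def delta(self) -> float:
        return self.after - self.before

    def render(self) -> str:
        # Same shape as the example in the proposal: "recall 0.95 → 0.62 (-0.33)"
        return f"{self.name} {self.before:.2f} → {self.after:.2f} ({self.delta:+.2f})"


def score_deltas(old_case: dict, new_case: dict) -> Optional[List[ScoreDelta]]:
    """Return per-score deltas, or None when the two cases are not comparable.

    Assumed record shape: {"case_hash": str, "eval_hash": str,
    "scores": {name: float}}. A missing or differing hash means "modified".
    """
    for key in ("case_hash", "eval_hash"):
        old_hash = old_case.get(key)
        if old_hash is None or old_hash != new_case.get(key):
            return None
    old_scores = old_case.get("scores", {})
    new_scores = new_case.get("scores", {})
    # Only scores present in both runs can be diffed.
    shared = sorted(old_scores.keys() & new_scores.keys())
    return [ScoreDelta(name, old_scores[name], new_scores[name]) for name in shared]


def is_score_regression(deltas: List[ScoreDelta], threshold: float = 0.05) -> bool:
    """Flag the case when any score dropped by at least `threshold`."""
    return any(d.delta <= -threshold for d in deltas)


# Illustrative records only; the values are made up for the example.
old = {"case_hash": "c1", "eval_hash": "e1", "passed": True, "scores": {"recall": 0.95}}
new = {"case_hash": "c1", "eval_hash": "e1", "passed": True, "scores": {"recall": 0.62}}

deltas = score_deltas(old, new)
if deltas is not None:
    for d in deltas:
        print(d.render())
    print("score_regression" if is_score_regression(deltas) else "no score change")
```

Returning `None` for non-comparable cases keeps the "modified" path separate from the delta path, so a caller cannot display a delta across changed hashes by accident.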

No storage change: the data is already on disk (schema_version=1).

Priority: this is the #1 gap — it answers the daily question "did my last change help?".

Labels: cli, enhancement
