Tracked for when the `protest history` browse CLI is reintroduced (code archived on `archive/history-cli`).
Problem
`_classify_changes` (archive/history-cli:`protest/cli/history.py`) only diffs verdicts (`passed`) and the two integrity hashes. It never reads the numeric scores — even though the writer persists them on every run:
- per case: `scores: {name: float}` (`protest/history/plugin.py` `_serialize_eval_case`)
- per run: `evals.scores_summary: {score: {mean, median, p5, p95, ...}}`
So a case that stays `passed=True` across two runs while its score mean drops 0.95 → 0.62 is reported as "No changes". For a score-centric eval tool, the continuous drift is the signal we want — the boolean verdict is the coarse one.
Proposal
When a case is comparable (same `case_hash` and same `eval_hash` in both runs), compute and display the per-score delta, e.g. `recall 0.95 → 0.62 (-0.33)`, and add a `score_regression` category gated on a configurable threshold (e.g. |Δmean| ≥ 0.05). When either hash differs, keep the current "modified" behaviour: do not pretend to compare across a changed thermometer (see the epoch issue).
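A minimal sketch of the comparison described above, not the archived implementation. `CaseRecord`, `score_deltas`, `classify`, and `format_deltas` are hypothetical names, and the record shape only mirrors the fields named in this issue (`case_hash`, `eval_hash`, `passed`, `scores`); the real serialized layout may differ. The `"unchanged"` label is a placeholder for whatever the existing classifier reports when nothing moved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical in-memory view of one persisted case (field names assumed)."""
    case_hash: str
    eval_hash: str
    passed: bool
    scores: dict[str, float]


def score_deltas(
    prev: CaseRecord, curr: CaseRecord
) -> Optional[dict[str, tuple[float, float, float]]]:
    """Return {score: (old, new, delta)} for scores present in both runs.

    Returns None when either integrity hash differs, because the two
    runs are then not comparable.
    """
    if (prev.case_hash, prev.eval_hash) != (curr.case_hash, curr.eval_hash):
        return None
    shared = prev.scores.keys() & curr.scores.keys()
    return {
        name: (prev.scores[name], curr.scores[name],
               curr.scores[name] - prev.scores[name])
        for name in sorted(shared)
    }


def classify(prev: CaseRecord, curr: CaseRecord, threshold: float = 0.05) -> str:
    """Classify one case; the threshold default follows the example in the proposal."""
    deltas = score_deltas(prev, curr)
    if deltas is None:
        return "modified"  # hashes differ: keep the current behaviour
    # Absolute-value gate, as written in the proposal (|delta| >= threshold).
    if any(abs(delta) >= threshold for _, _, delta in deltas.values()):
        return "score_regression"
    return "unchanged"  # placeholder label, an assumption of this sketch


def format_deltas(deltas: dict[str, tuple[float, float, float]]) -> list[str]:
    """Render each delta in the `recall 0.95 → 0.62 (-0.33)` style."""
    return [
        f"{name} {old:.2f} → {new:.2f} ({delta:+.2f})"
        for name, (old, new, delta) in deltas.items()
    ]
```

Returning `None` for non-comparable cases keeps the "changed thermometer" rule in one place: callers cannot compute a delta without first passing the hash check.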
No storage change: the data is already on disk (schema_version=1).
Priority: this is the #1 gap — it answers the daily question "did my last change help?".