history: `compare` should diff score deltas, not just pass/fail verdicts

Tracked for when the `protest history` browse CLI is reintroduced (code archived on `archive/history-cli`).

## Problem
`_classify_changes` (archive/history-cli:`protest/cli/history.py`) only diffs **verdicts** (`passed`) and the two integrity hashes. It never reads the numeric scores, even though the writer persists them on every run:

- per case: `scores: {name: float}` (`protest/history/plugin.py` `_serialize_eval_case`)
- per run: \`evals.scores_summary: {score: {mean, median, p5, p95, ...}}\`

So a case that stays \`passed=True\` across two runs while its score mean drops **0.95 → 0.62** is reported as **"No changes"**. For a score-centric eval tool, the continuous drift is the signal we want — the boolean verdict is the coarse one.
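
A minimal sketch of the blind spot, not the archived `_classify_changes` code. The record shape is an assumption built only from the field names mentioned in this issue (`passed`, `case_hash`, `eval_hash`, `scores`); the function name and the `regression` / `fixed` labels are hypothetical.

```python
# Hypothetical per-case records keyed by case name; only the field names
# (passed, case_hash, eval_hash, scores) come from this issue.
run_a = {
    "case_1": {"passed": True, "case_hash": "c1", "eval_hash": "e1",
               "scores": {"recall": 0.95}},
}
run_b = {
    "case_1": {"passed": True, "case_hash": "c1", "eval_hash": "e1",
               "scores": {"recall": 0.62}},
}


def verdict_only_changes(old, new):
    """Diff hashes and verdicts only; scores are never read."""
    changes = []
    for name in sorted(old.keys() & new.keys()):
        a, b = old[name], new[name]
        if (a["case_hash"], a["eval_hash"]) != (b["case_hash"], b["eval_hash"]):
            changes.append((name, "modified"))
        elif a["passed"] != b["passed"]:
            # Hypothetical labels for a verdict flip.
            changes.append((name, "regression" if a["passed"] else "fixed"))
    return changes


# The recall drop from 0.95 to 0.62 is invisible to this comparison.
print(verdict_only_changes(run_a, run_b) or "No changes")
```

Because both runs share hashes and verdict, the list of changes is empty even though `recall` fell by 0.33.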

## Proposal
When a case is **comparable** (same `case_hash` *and* same `eval_hash` between the two runs), compute and display the per-score delta, e.g. `recall 0.95 → 0.62 (-0.33)`, and add a `score_regression` category gated on a configurable threshold (e.g. |Δmean| ≥ 0.05). When hashes differ, keep the current "modified" behaviour (do not pretend to compare across a changed thermometer; see the epoch issue).
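
One possible shape for this, as a sketch under assumptions rather than a proposed implementation. `ScoreDelta`, `is_comparable`, `score_deltas`, `classify`, and the `unchanged` label are hypothetical names, and the per-case dict layout is assumed from the fields listed above. The sketch flags only drops as `score_regression`, which is one reading of the |Δ| threshold; an improvement of the same size would need its own category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreDelta:
    name: str
    old: float
    new: float

    @property
    def delta(self) -> float:
        # Rounded so float noise (e.g. 0.90 - 0.95) does not dodge the threshold.
        return round(self.new - self.old, 6)

    def render(self) -> str:
        return f"{self.name} {self.old:.2f} → {self.new:.2f} ({self.delta:+.2f})"


def is_comparable(a: dict, b: dict) -> bool:
    """Same case and same evaluator: the 'thermometer' has not changed."""
    return a["case_hash"] == b["case_hash"] and a["eval_hash"] == b["eval_hash"]


def score_deltas(a: dict, b: dict):
    """Per-score deltas for comparable cases, or None when hashes differ."""
    if not is_comparable(a, b):
        return None
    shared = sorted(a["scores"].keys() & b["scores"].keys())
    return [ScoreDelta(n, a["scores"][n], b["scores"][n]) for n in shared]


def classify(a: dict, b: dict, threshold: float = 0.05):
    """Return (category, deltas) for one case across two runs."""
    deltas = score_deltas(a, b)
    if deltas is None:
        return "modified", []  # keep the current behaviour
    if any(d.delta <= -threshold for d in deltas):
        return "score_regression", deltas
    return "unchanged", deltas


old = {"case_hash": "c1", "eval_hash": "e1", "passed": True,
       "scores": {"recall": 0.95}}
new = dict(old, scores={"recall": 0.62})
category, deltas = classify(old, new)
print(category, [d.render() for d in deltas])
```

Returning `None` from `score_deltas` for non-comparable cases keeps the "do not compare across changed hashes" rule in one place, so callers cannot accidentally render a delta for a modified case.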

No storage change: the data is already on disk (schema_version=1).

Priority: this is the #1 gap — it answers the daily question "did my last change help?".

history: `compare` should diff score deltas, not just pass/fail verdicts #104