[Evaluation] Add baseline/regression comparisons and confidence intervals #69

## Finding
Evaluation artifacts report single-run means, standard deviations, and target checks, but they neither compare against a stored baseline nor report statistical uncertainty. The CLI summary includes only mean/std and pass rates (`train.py:523-539`). The human-readable report prints target checks against absolute thresholds (`src/evaluation_report.py:116-144`), with no confidence intervals, bootstrap intervals, sample-size warnings, or deltas versus a previous/baseline run.

The docs recommend comparing against a Llama 3.2 3B baseline using the same split (`docs/runbook.md:263`), but the evaluator has no first-class baseline input or regression gate.

## Impact
Small eval runs (for example `--eval-limit 3`, shown in `latest_results.txt:55-68`) can produce noisy metrics that look precise. Model developers cannot tell whether a change is a real improvement, a regression, or sampling noise.
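To make the noise concrete, here is a minimal sketch of a Wilson score interval for a pass rate. The function name `wilson_interval` is hypothetical and not part of the current codebase. With 3 passes out of 3 examples, the 95% interval's lower bound is only about 0.44, even though the reported pass rate is 100%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# A perfect score on three examples still leaves a very wide interval.
low, high = wilson_interval(3, 3)
print(f"pass rate 3/3: 95% CI [{low:.3f}, {high:.3f}]")
```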

## Recommended fix
Add a baseline comparison mode and uncertainty estimates: accept a previous `results/eval_*.json` as the baseline, compute per-metric deltas against it, report bootstrap or binomial confidence intervals for key metrics and pass rates, and emit pass/fail regression gates for important metrics.
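One possible shape for the delta and gate logic is sketched below. It assumes per-example scores are available for both runs and that both runs used the same split in the same order, so a paired bootstrap applies; the actual layout of `results/eval_*.json` is not specified here, so the sketch works on plain lists. `bootstrap_delta_ci` and `regression_gate` are hypothetical names.

```python
import random
from statistics import fmean


def bootstrap_delta_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Paired percentile bootstrap for mean(candidate - baseline).

    Assumes baseline[i] and candidate[i] score the same eval example.
    Returns (delta, ci_low, ci_high).
    """
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed keeps reports reproducible
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return fmean(diffs), low, high


def regression_gate(delta_ci, higher_is_better=True):
    """Classify a delta interval as regression, improvement, or inconclusive."""
    _, low, high = delta_ci
    if not higher_is_better:
        low, high = -high, -low  # flip so that positive always means better
    if high < 0:
        return "regression"
    if low > 0:
        return "improvement"
    return "inconclusive"
```

Gating on the interval rather than the point delta means a small negative delta on a tiny eval set is reported as inconclusive instead of failing the run. Whether inconclusive results should block a merge is a policy choice left open here.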

## Acceptance criteria
- CLI accepts a baseline result path or discovers a configured baseline for comparison.
- JSON/report include deltas and 95% confidence intervals for key aggregate metrics and pass rates.
- Reports flag statistically meaningful regressions and warn on very small eval sizes.
- Tests cover baseline delta calculation and CI formatting on deterministic sample data.

