Finding
Evaluation artifacts report single-run means/stddevs and target checks, but do not compare against a stored baseline or report statistical uncertainty. The CLI summary includes mean/std and pass rates only (train.py:523-539). The human-readable report prints target checks for absolute thresholds (src/evaluation_report.py:116-144) but no confidence intervals, bootstrap intervals, sample-size warnings, or deltas versus a previous/baseline run.
The docs recommend comparing against a Llama 3.2 3B baseline using the same split (docs/runbook.md:263), but the evaluator has no first-class baseline input or regression gate.
Impact
Small eval runs (for example --eval-limit 3, shown in latest_results.txt:55-68) can produce noisy metrics that look precise. Model developers cannot tell whether a change is a real improvement, a regression, or sampling noise.
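To make the noise concrete, here is a minimal sketch of a Wilson score interval for a pass rate. The function name `wilson_interval` is hypothetical and not part of the current evaluator; it only illustrates how wide a 95% interval is when the eval set has three examples.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Same observed pass rate (2/3), very different certainty.
small_lo, small_hi = wilson_interval(2, 3)
large_lo, large_hi = wilson_interval(200, 300)
print(f"n=3:   [{small_lo:.3f}, {small_hi:.3f}]")
print(f"n=300: [{large_lo:.3f}, {large_hi:.3f}]")
```

With three examples the interval spans most of the unit range, so a reported pass rate of 0.667 says very little about the underlying rate.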
Recommended fix
Add a baseline comparison mode and uncertainty estimates. Support passing a previous results/eval_*.json as the baseline, compute per-metric deltas against it, report bootstrap or binomial confidence intervals for key metrics and pass rates, and emit pass/fail regression gates for important metrics.
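One way the delta and gate could work is sketched below, assuming per-example scores are available for both the baseline and the current run (the stored JSON schema is not shown in this finding, so that is an assumption). The names `bootstrap_delta_ci` and `regression_gate` are hypothetical. The sketch uses an unpaired percentile bootstrap; if both runs score the same split, a paired bootstrap over per-example differences would likely be tighter.

```python
import random
import statistics


def bootstrap_delta_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(candidate) - mean(baseline)."""
    if not baseline or not candidate:
        raise ValueError("both samples must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps reports reproducible
    deltas = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))    # resample with replacement
        c = rng.choices(candidate, k=len(candidate))
        deltas.append(statistics.fmean(c) - statistics.fmean(b))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    delta = statistics.fmean(candidate) - statistics.fmean(baseline)
    return delta, lo, hi


def regression_gate(lo, hi, higher_is_better=True):
    """Classify a delta CI; only intervals that exclude zero count as a change."""
    if not higher_is_better:
        lo, hi = -hi, -lo  # flip so that positive always means "better"
    if hi < 0:
        return "regression"
    if lo > 0:
        return "improvement"
    return "inconclusive"


# Synthetic scores for illustration only (not real evaluation results).
baseline_scores = [0.80 + 0.01 * (i % 5) for i in range(40)]
candidate_scores = [0.60 + 0.01 * (i % 5) for i in range(40)]
delta, lo, hi = bootstrap_delta_ci(baseline_scores, candidate_scores)
print(f"delta={delta:+.3f}  95% CI [{lo:+.3f}, {hi:+.3f}]  ->", regression_gate(lo, hi))
```

Gating on the interval rather than on the point delta means a small run with a wide interval is reported as inconclusive instead of failing or passing on noise.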
Acceptance criteria
- CLI accepts a baseline result path or discovers a configured baseline for comparison.
- JSON/report include deltas and 95% confidence intervals for key aggregate metrics and pass rates.
- Reports flag statistically meaningful regressions and warn on very small eval sizes.
- Tests cover baseline delta calculation and CI formatting on deterministic sample data.
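The last criterion could be exercised with helpers along these lines. This is a sketch only: `metric_deltas`, `format_ci`, and the flat `{metric: value}` layout are hypothetical stand-ins, since the actual structure of results/eval_*.json is not described here.

```python
def metric_deltas(baseline: dict, current: dict) -> dict:
    """Return current - baseline for every metric present in both runs."""
    shared = sorted(set(baseline) & set(current))
    return {name: current[name] - baseline[name] for name in shared}


def format_ci(value: float, lo: float, hi: float, digits: int = 3) -> str:
    """Render a point estimate with its interval, e.g. '0.25 [0.10, 0.40]'."""
    return f"{value:.{digits}f} [{lo:.{digits}f}, {hi:.{digits}f}]"


def test_metric_deltas_and_ci_formatting():
    # Deterministic sample data; metric names are illustrative.
    baseline = {"accuracy": 0.5, "pass_rate": 0.25, "baseline_only": 1.0}
    current = {"accuracy": 0.75, "pass_rate": 0.25, "current_only": 2.0}
    assert metric_deltas(baseline, current) == {"accuracy": 0.25, "pass_rate": 0.0}
    assert format_ci(0.25, 0.1, 0.4, digits=2) == "0.25 [0.10, 0.40]"


test_metric_deltas_and_ci_formatting()
```

Metrics that appear in only one run are skipped here rather than treated as zero, which avoids reporting a spurious delta when the metric set changes between runs.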