Split tests/evals/test_e2e.py (1229 LOC) by concern #98

## Context

`tests/evals/test_e2e.py` is a 1229-LOC catch-all hosting 13 test classes, each covering a distinct concern (setup, filtering, output, payload flow, history, `clean_dirty`, hashing, builtin evaluators, scoring v2, short-circuit, results files, multi-dataset history, task fixtures). It has grown organically and now bundles several themes that the rest of `tests/evals/` already extracts into dedicated files (`test_hashing.py`, `test_judge.py`, `test_score_stats.py`, `test_evalcase_tags_wiring.py`, etc.).

This is a follow-up to the eval framework review — non-blocking, deferred until after the eval feature ships.

## Current breakdown

| Lines | Class | Concern |
|---|---|---|
| 106-189 | `TestEvalSetup` | Setup, async, metadata |
| 190-289 | `TestKindFiltering` | `kind=test/eval` filtering |
| 291-383 | `TestEvalOutput` | Report format (stats, pass count) |
| 384-518 | `TestEvalPayloadFlow` | EvalPayload flow through the pipeline |
| 520-609 | `TestHistory` | History file persistence |
| 611-650 | `TestCleanDirty` | `clean_dirty` workflow |
| 652-702 | `TestCaseHashing` | Case/eval hashing |
| 704-844 | `TestBuiltinEvaluators` | 14 builtin evaluators |
| 846-956 | `TestScoringV2` | Scoring (bool / dataclass / float) |
| 958-1027 | `TestShortCircuit` | ShortCircuit semantics |
| 1028-1085 | `TestResultsFiles` | On-disk results files |
| 1087-1141 | `TestMultiDatasetHistory` | Multi-dataset history persistence |
| 1142-1228 | `TestEvalTaskFixtures` | DI fixtures inside the eval task |
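The line ranges above will drift as the file changes, so it may help to regenerate them rather than maintain them by hand. A minimal sketch using the standard-library `ast` module follows; `class_spans` and `SAMPLE` are hypothetical names for illustration, and the sample source stands in for the real file.

```python
import ast

def class_spans(source: str) -> list:
    """Return (first_line, last_line, name) for each top-level class."""
    tree = ast.parse(source)
    return [
        (node.lineno, node.end_lineno, node.name)
        for node in tree.body
        if isinstance(node, ast.ClassDef)
    ]

# Stand-in source; in the repository this would be the text of
# tests/evals/test_e2e.py (e.g. read with pathlib.Path.read_text()).
SAMPLE = '''\
class TestA:
    def test_one(self):
        assert True


class TestB:
    def test_two(self):
        assert True
'''

for first, last, name in class_spans(SAMPLE):
    print(f"| {first}-{last} | `{name}` |")
```

Note that `lineno` points at the `class` keyword, so decorators above a class are not included in the reported range.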

## Proposed split (minimal — 3 new files + 1 migration)

- **Migrate `TestCaseHashing` → `tests/evals/test_hashing.py`** (file already exists, 289 LOC). Eliminates a thematic duplicate.
- **New `tests/evals/test_history.py`** ← `TestHistory` + `TestCleanDirty` + `TestMultiDatasetHistory` (~180 LOC). All three are about history-file persistence.
- **New `tests/evals/test_evaluators_runtime.py`** ← `TestBuiltinEvaluators` + `TestScoringV2` + `TestShortCircuit` (~330 LOC). Evaluator/scoring semantics, distinct from execution flow.
- **`test_e2e.py` keeps** Setup, Filtering, EvalOutput, PayloadFlow, ResultsFiles, EvalTaskFixtures (~570 LOC). True end-to-end pipeline coverage.
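As a sanity check on the estimates above, the per-destination sizes can be recomputed from the line ranges in the breakdown table. The sketch below counts class bodies only (imports, shared helpers, and blank lines between classes are excluded), so its totals are expected to differ slightly from the rounded figures in the list. `SPANS`, `PLAN`, and `planned_loc` are illustrative names, not part of the codebase.

```python
# Inclusive line ranges copied from the "Current breakdown" table.
SPANS = {
    "TestEvalSetup": (106, 189),
    "TestKindFiltering": (190, 289),
    "TestEvalOutput": (291, 383),
    "TestEvalPayloadFlow": (384, 518),
    "TestHistory": (520, 609),
    "TestCleanDirty": (611, 650),
    "TestCaseHashing": (652, 702),
    "TestBuiltinEvaluators": (704, 844),
    "TestScoringV2": (846, 956),
    "TestShortCircuit": (958, 1027),
    "TestResultsFiles": (1028, 1085),
    "TestMultiDatasetHistory": (1087, 1141),
    "TestEvalTaskFixtures": (1142, 1228),
}

# Destination of each class under the proposed split.
PLAN = {
    "test_hashing.py": ["TestCaseHashing"],
    "test_history.py": ["TestHistory", "TestCleanDirty", "TestMultiDatasetHistory"],
    "test_evaluators_runtime.py": [
        "TestBuiltinEvaluators", "TestScoringV2", "TestShortCircuit",
    ],
    "test_e2e.py": [
        "TestEvalSetup", "TestKindFiltering", "TestEvalOutput",
        "TestEvalPayloadFlow", "TestResultsFiles", "TestEvalTaskFixtures",
    ],
}

def planned_loc() -> dict:
    """Sum the class-body line counts moved to (or kept in) each file."""
    return {
        dest: sum(SPANS[c][1] - SPANS[c][0] + 1 for c in classes)
        for dest, classes in PLAN.items()
    }

# Every class must be assigned to exactly one destination.
assigned = [c for classes in PLAN.values() for c in classes]
assert sorted(assigned) == sorted(SPANS)

print(planned_loc())
```

Under these ranges the class bodies come to 51 lines for the hashing migration, 185 for history, 322 for evaluators, and 557 remaining in `test_e2e.py`.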

## Decisions to make at implementation time

1. **Shared helpers** (`FakeAccuracyResult`, `fake_accuracy`, `echo_task` at lines 61-104). Options:
   - Move to a local `tests/evals/conftest.py` (pytest-idiomatic, auto-discovered)
   - Move to a `tests/evals/_helpers.py` module imported explicitly
   - Duplicate in each new file (lowest ceremony, slight drift risk)
2. **Verify there is no overlap** between `TestCaseHashing` and the existing `test_hashing.py` before migrating — some cases may already be covered.
3. **Granularity**: a more aggressive split (file-per-class, ~5 more files) was considered and rejected as fragmentation.
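For decision 2, a first-pass overlap check could compare test names between `TestCaseHashing` and the existing `test_hashing.py`. The sketch below is only a heuristic: matching names suggest a duplicate worth reading, while differently named tests may still cover the same behavior. `collect_test_names` and the two sample sources are hypothetical; the test names in them are invented for illustration and are not taken from the repository.

```python
from __future__ import annotations

import ast
from typing import Optional

def collect_test_names(source: str, class_name: Optional[str] = None) -> set[str]:
    """Collect names of test_* functions, optionally limited to one class."""
    tree = ast.parse(source)
    if class_name is None:
        roots = [tree]
    else:
        roots = [
            n for n in ast.walk(tree)
            if isinstance(n, ast.ClassDef) and n.name == class_name
        ]
    return {
        node.name
        for root in roots
        for node in ast.walk(root)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test_")
    }

# Stand-ins for the real files (hypothetical test names).
E2E_SOURCE = '''\
class TestCaseHashing:
    def test_hash_is_stable(self): ...
    def test_hash_changes_with_input(self): ...

class TestHistory:
    def test_history_written(self): ...
'''
HASHING_SOURCE = '''\
def test_hash_is_stable(): ...
def test_hash_ignores_key_order(): ...
'''

overlap = collect_test_names(E2E_SOURCE, "TestCaseHashing") & collect_test_names(HASHING_SOURCE)
print(sorted(overlap))
```

In the repository, the two sources would be read from `tests/evals/test_e2e.py` and `tests/evals/test_hashing.py`; any name in the intersection is a candidate to drop or merge during the migration.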

## Acceptance criteria

- All tests still pass after the split (`pytest tests/evals/`)
- No new public surface, no behavior change
- `just lint` clean
- Each file documents what it covers in a module docstring
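The module-docstring criterion can be checked mechanically. A minimal sketch, assuming the check runs over file contents already read into memory; `missing_module_docstring` and the sample sources are hypothetical names for illustration.

```python
import ast

def missing_module_docstring(sources: dict) -> list:
    """Return the names of sources that have no (or an empty) module docstring."""
    return sorted(
        name
        for name, text in sources.items()
        if not ast.get_docstring(ast.parse(text))
    )

# Stand-in contents; in the repository these could be built from
# pathlib.Path("tests/evals").glob("test_*.py").
SOURCES = {
    "test_history.py": '"""History-file persistence tests."""\n\ndef test_x(): ...\n',
    "test_evaluators_runtime.py": "import os\n\ndef test_y(): ...\n",
}

print(missing_module_docstring(SOURCES))
```

An empty result would indicate that every checked file opens with a docstring; whether the docstring actually describes the file's coverage still needs a human read.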
