Context
tests/evals/test_e2e.py is a 1229-LOC catch-all hosting 13 test classes that cover distinct concerns (setup, filtering, output, payload flow, history, clean_dirty, hashing, builtin evaluators, scoring v2, short-circuit, results files, multi-dataset history, task fixtures). It has grown organically and now bundles several themes that the rest of tests/evals/ already extracts into dedicated files (test_hashing.py, test_judge.py, test_score_stats.py, test_evalcase_tags_wiring.py, etc.).
This is a follow-up to the eval framework review — non-blocking, deferred until after the eval feature ships.
Current breakdown
| Lines | Class | Concern |
| --- | --- | --- |
| 106-189 | TestEvalSetup | Setup, async, metadata |
| 190-289 | TestKindFiltering | kind=test/eval filtering |
| 291-383 | TestEvalOutput | Report format (stats, pass count) |
| 384-518 | TestEvalPayloadFlow | EvalPayload flow through the pipeline |
| 520-609 | TestHistory | History file persistence |
| 611-650 | TestCleanDirty | clean_dirty workflow |
| 652-702 | TestCaseHashing | Case/eval hashing |
| 704-844 | TestBuiltinEvaluators | 14 builtin evaluators |
| 846-956 | TestScoringV2 | Scoring (bool / dataclass / float) |
| 958-1027 | TestShortCircuit | ShortCircuit semantics |
| 1028-1085 | TestResultsFiles | On-disk results files |
| 1087-1141 | TestMultiDatasetHistory | Multi-dataset history persistence |
| 1142-1228 | TestEvalTaskFixtures | DI fixtures inside the eval task |
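The line ranges above can be regenerated mechanically instead of by hand. A minimal sketch, assuming Python 3.9+ and that the test classes are top-level `class` statements; `class_ranges` is a hypothetical helper name, not something that exists in the repo:

```python
import ast


def class_ranges(source: str) -> list[tuple[str, int, int]]:
    """Return (name, first_line, last_line) for each top-level class."""
    tree = ast.parse(source)
    return [
        # lineno is the `class` line itself; end_lineno is the last body line.
        (node.name, node.lineno, node.end_lineno)
        for node in tree.body
        if isinstance(node, ast.ClassDef)
    ]


def as_table_rows(source: str) -> list[str]:
    """Format the ranges as pipe-table rows matching the breakdown table."""
    return [
        f"| {start}-{end} | {name} |"
        for name, start, end in class_ranges(source)
    ]
```

Feeding it the contents of tests/evals/test_e2e.py (for example via `pathlib.Path.read_text`) should reproduce the Lines and Class columns; the Concern column still has to be written by hand.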
Proposed split (minimal: 2 new files + 1 migration)
- Migrate: TestCaseHashing → tests/evals/test_hashing.py (file already exists, 289 LOC). Eliminates a thematic duplicate.
- New: tests/evals/test_history.py ← TestHistory + TestCleanDirty + TestMultiDatasetHistory (~180 LOC). All three are about history-file persistence.
- New: tests/evals/test_evaluators_runtime.py ← TestBuiltinEvaluators + TestScoringV2 + TestShortCircuit (~330 LOC). Evaluator/scoring semantics, distinct from execution flow.
- test_e2e.py keeps Setup, Filtering, EvalOutput, PayloadFlow, ResultsFiles, EvalTaskFixtures (~570 LOC). True end-to-end pipeline coverage.
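The LOC estimates for the proposed split can be sanity-checked against the line ranges in the breakdown table. The sketch below is only arithmetic on those ranges; the `RANGES` and `DESTINATION` names are illustrative, and the sums count class bodies only (no imports, shared helpers, or blank lines between classes), so they are expected to land near the approximate figures rather than match them exactly:

```python
# Inclusive line ranges copied from the "Current breakdown" table.
RANGES = {
    "TestEvalSetup": (106, 189),
    "TestKindFiltering": (190, 289),
    "TestEvalOutput": (291, 383),
    "TestEvalPayloadFlow": (384, 518),
    "TestHistory": (520, 609),
    "TestCleanDirty": (611, 650),
    "TestCaseHashing": (652, 702),
    "TestBuiltinEvaluators": (704, 844),
    "TestScoringV2": (846, 956),
    "TestShortCircuit": (958, 1027),
    "TestResultsFiles": (1028, 1085),
    "TestMultiDatasetHistory": (1087, 1141),
    "TestEvalTaskFixtures": (1142, 1228),
}

# Destination file for each class, following the proposed split.
DESTINATION = {
    "test_hashing.py": ["TestCaseHashing"],
    "test_history.py": ["TestHistory", "TestCleanDirty", "TestMultiDatasetHistory"],
    "test_evaluators_runtime.py": [
        "TestBuiltinEvaluators", "TestScoringV2", "TestShortCircuit",
    ],
    "test_e2e.py": [
        "TestEvalSetup", "TestKindFiltering", "TestEvalOutput",
        "TestEvalPayloadFlow", "TestResultsFiles", "TestEvalTaskFixtures",
    ],
}


def loc(class_name: str) -> int:
    """Number of lines in an inclusive range."""
    start, end = RANGES[class_name]
    return end - start + 1


def totals() -> dict[str, int]:
    """Lines moved to (or kept in) each destination file."""
    return {
        dest: sum(loc(name) for name in classes)
        for dest, classes in DESTINATION.items()
    }


for dest, lines in totals().items():
    print(f"{dest}: {lines}")
# test_hashing.py: 51
# test_history.py: 185
# test_evaluators_runtime.py: 322
# test_e2e.py: 557
```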
Decisions to make at implementation time
- Shared helpers (FakeAccuracyResult, fake_accuracy, echo_task at lines 61-104). Options:
  - Move to a local tests/evals/conftest.py (pytest-idiomatic, auto-discovered)
  - Move to a tests/evals/_helpers.py module imported explicitly
  - Duplicate in each new file (lowest ceremony, slight drift risk)
- Verify there is no overlap between TestCaseHashing and the existing test_hashing.py before migrating; some cases may already be covered.
- Granularity: a more aggressive split (file-per-class, ~5 more files) was considered and rejected as fragmentation.
Acceptance criteria
- All tests still pass after the split (pytest tests/evals/)
- No new public surface, no behavior change
- just lint passes cleanly
- Each file documents what it covers in a module docstring
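The module-docstring criterion is easy to check automatically. A minimal sketch, assuming the test files follow the `test_*.py` naming already used in tests/evals/; `missing_docstrings` is a hypothetical helper, not an existing repo utility:

```python
import ast
from pathlib import Path


def missing_docstrings(directory: str) -> list[str]:
    """Return the test_*.py files in `directory` that lack a module docstring."""
    missing = []
    for path in sorted(Path(directory).glob("test_*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        # get_docstring returns None when the module has no leading string.
        if not ast.get_docstring(tree):
            missing.append(path.name)
    return missing
```

Calling `missing_docstrings("tests/evals")` after the split should return an empty list if every file states what it covers.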