
Split tests/evals/test_e2e.py (1229 LOC) by concern #98

Description

@renaudcepre

Context

tests/evals/test_e2e.py is a 1229-LOC catch-all hosting 13 test classes that cover distinct concerns (setup, kind filtering, output, payload flow, history, clean_dirty, hashing, builtin evaluators, scoring v2, short-circuit, results files, multi-dataset history, task fixtures). It has grown organically and now bundles several themes that the rest of tests/evals/ already splits into dedicated files (test_hashing.py, test_judge.py, test_score_stats.py, test_evalcase_tags_wiring.py, etc.).

This is a follow-up to the eval framework review. It is non-blocking and deferred until after the eval feature ships.

Current breakdown

| Lines | Class | Concern |
|---|---|---|
| 106-189 | TestEvalSetup | Setup, async, metadata |
| 190-289 | TestKindFiltering | kind=test/eval filtering |
| 291-383 | TestEvalOutput | Report format (stats, pass count) |
| 384-518 | TestEvalPayloadFlow | EvalPayload flow through the pipeline |
| 520-609 | TestHistory | History file persistence |
| 611-650 | TestCleanDirty | clean_dirty workflow |
| 652-702 | TestCaseHashing | Case/eval hashing |
| 704-844 | TestBuiltinEvaluators | 14 builtin evaluators |
| 846-956 | TestScoringV2 | Scoring (bool / dataclass / float) |
| 958-1027 | TestShortCircuit | ShortCircuit semantics |
| 1028-1085 | TestResultsFiles | On-disk results files |
| 1087-1141 | TestMultiDatasetHistory | Multi-dataset history persistence |
| 1142-1228 | TestEvalTaskFixtures | DI fixtures inside the eval task |

Proposed split (minimal — 3 new files + 1 migration)

  • Migrate TestCaseHashing → tests/evals/test_hashing.py (file already exists, 289 LOC). Eliminates a thematic duplicate.
  • New tests/evals/test_history.py → TestHistory + TestCleanDirty + TestMultiDatasetHistory (~180 LOC). All three are about history-file persistence.
  • New tests/evals/test_evaluators_runtime.py → TestBuiltinEvaluators + TestScoringV2 + TestShortCircuit (~330 LOC). Evaluator/scoring semantics, distinct from execution flow.
  • test_e2e.py keeps Setup, Filtering, EvalOutput, PayloadFlow, ResultsFiles, EvalTaskFixtures (~570 LOC). True end-to-end pipeline coverage.
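
The LOC estimates in this list can be cross-checked against the line ranges in the current breakdown. A quick sketch, counting class bodies only (the imports and shared helpers at the top of test_e2e.py are not included, and the variable names are illustrative):

```python
# Inclusive (start, end) line ranges, copied from the "Current breakdown" table.
RANGES = {
    "TestEvalSetup": (106, 189),
    "TestKindFiltering": (190, 289),
    "TestEvalOutput": (291, 383),
    "TestEvalPayloadFlow": (384, 518),
    "TestHistory": (520, 609),
    "TestCleanDirty": (611, 650),
    "TestCaseHashing": (652, 702),
    "TestBuiltinEvaluators": (704, 844),
    "TestScoringV2": (846, 956),
    "TestShortCircuit": (958, 1027),
    "TestResultsFiles": (1028, 1085),
    "TestMultiDatasetHistory": (1087, 1141),
    "TestEvalTaskFixtures": (1142, 1228),
}

# Target file -> classes it would receive (or keep) under the proposed split.
PLAN = {
    "tests/evals/test_hashing.py": ["TestCaseHashing"],
    "tests/evals/test_history.py": [
        "TestHistory", "TestCleanDirty", "TestMultiDatasetHistory",
    ],
    "tests/evals/test_evaluators_runtime.py": [
        "TestBuiltinEvaluators", "TestScoringV2", "TestShortCircuit",
    ],
    "tests/evals/test_e2e.py": [
        "TestEvalSetup", "TestKindFiltering", "TestEvalOutput",
        "TestEvalPayloadFlow", "TestResultsFiles", "TestEvalTaskFixtures",
    ],
}


def loc(class_name: str) -> int:
    """Number of lines covered by a class, both endpoints included."""
    start, end = RANGES[class_name]
    return end - start + 1


# Lines moved into (or kept in) each target file.
totals = {target: sum(loc(name) for name in names) for target, names in PLAN.items()}

# Every class must be assigned to exactly one target file.
assigned = [name for names in PLAN.values() for name in names]
assert sorted(assigned) == sorted(RANGES)

if __name__ == "__main__":
    for target, lines in totals.items():
        print(f"{target}: {lines} lines")
```

This gives 185 lines for test_history.py, 322 for test_evaluators_runtime.py, and 557 staying in test_e2e.py, roughly in line with the ~180 / ~330 / ~570 estimates, plus 51 lines moving into test_hashing.py.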

Decisions to make at implementation time

  1. Shared helpers (FakeAccuracyResult, fake_accuracy, echo_task at lines 61-104). Options:
    • Move to a local tests/evals/conftest.py (pytest-idiomatic, auto-discovered)
    • Move to a tests/evals/_helpers.py module imported explicitly
    • Duplicate in each new file (lowest ceremony, slight drift risk)
  2. Verify there is no overlap between TestCaseHashing and the existing test_hashing.py before migrating — some cases may already be covered.
  3. Granularity: a more aggressive split (file-per-class, ~5 more files) was considered and rejected as fragmentation.
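
For decision 1, the explicit-import option could look roughly like the sketch below. Only the three helper names come from this issue; the fields, signatures, and bodies are hypothetical placeholders, since the real definitions live at lines 61-104 of test_e2e.py and are not reproduced here.

```python
# Sketch of a possible tests/evals/_helpers.py (bodies are hypothetical).
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeAccuracyResult:
    # Hypothetical fields: the real class may carry different data.
    score: float
    passed: bool


def fake_accuracy(output: str, expected: str) -> FakeAccuracyResult:
    """Hypothetical evaluator: exact match scores 1.0, anything else 0.0."""
    matched = output == expected
    return FakeAccuracyResult(score=1.0 if matched else 0.0, passed=matched)


def echo_task(inputs: str) -> str:
    """Hypothetical task: returns its input unchanged (the real one may be async)."""
    return inputs


# Each split test file would then import what it needs explicitly, e.g.:
#     from tests.evals._helpers import echo_task, fake_accuracy
# (the exact import path depends on how the test package is laid out).
```

One point to weigh against the conftest.py option: pytest auto-discovers fixtures and hooks defined in conftest.py, but plain classes and functions placed there are not injected into test modules, so the helpers would either need to be wrapped in fixtures or imported anyway.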

Acceptance criteria

  • All tests still pass after the split (pytest tests/evals/)
  • No new public surface, no behavior change
  • just lint runs clean
  • Each file documents what it covers in a module docstring
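
The module-docstring criterion is easy to check mechanically. A minimal sketch using only the standard library; the function name and the test_*.py glob pattern are assumptions, not part of the project:

```python
import ast
from pathlib import Path


def files_missing_docstring(directory: Path, pattern: str = "test_*.py") -> list[Path]:
    """Return the files matching `pattern` that have no (or an empty) module docstring."""
    missing = []
    for path in sorted(directory.glob(pattern)):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        # ast.get_docstring returns None when the module has no docstring.
        if not ast.get_docstring(tree):
            missing.append(path)
    return missing


if __name__ == "__main__":
    for path in files_missing_docstring(Path("tests/evals")):
        print(f"missing module docstring: {path}")
```

Run from the repository root after the split, it prints any file under tests/evals/ that still lacks a module docstring.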
