Add characterization tests for exam grading helpers #249

@brianlan

Summary

Add direct backend characterization tests for the existing exam grading helper behavior before any future extraction or modification.

Background / Context

The V4 refactoring plan identified exam grading as behavior-sensitive. backend/app/presentation/exam_grading.py contains pure grading helpers and async grading orchestration that are mostly covered indirectly today. Before moving or changing that code, we need direct tests that pin current scoring, result shape, retry, and fallback behavior.

Problem

Exam grading behavior affects user scores and exam results. Without direct tests, later refactors could silently change grading status values, scores, retry counts, or returned dict shapes.

Goal / Expected Behavior

Create passing tests that document the current behavior of exam grading and exam helper functions without changing production code.

Scope

This issue should cover:

  • Create backend/tests/presentation/test_exam_grading.py.
  • Create backend/tests/presentation/test_exam_helpers.py.
  • Cover grade_objective_item, grade_short_answer_item, build_tracking_update, build_grading_result, build_exam_summary, and make_exam_item.
  • Cover all existing ProblemType branches and the PENDING_REVIEW fallback path.
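One way to make sure no ProblemType branch is silently skipped is a table of cases keyed by the enum, plus a guard test that fails when a member has no entry. The sketch below is illustrative only: the `ProblemType` members, the `CASES` table, and the expected statuses are hypothetical placeholders, and the real enum should be imported from the backend instead.

```python
from enum import Enum


class ProblemType(Enum):
    """Hypothetical stand-in; import the real enum in the actual tests."""

    SINGLE_CHOICE = "single_choice"
    MULTIPLE_CHOICE = "multiple_choice"
    SHORT_ANSWER = "short_answer"


# One representative (answer, expected status) pair per branch. The values are
# placeholders to be replaced with whatever the current code actually returns.
CASES = {
    ProblemType.SINGLE_CHOICE: ("B", "CORRECT"),
    ProblemType.MULTIPLE_CHOICE: ("A,C", "CORRECT"),
    ProblemType.SHORT_ANSWER: ("free text", "PENDING_REVIEW"),
}


def test_every_problem_type_has_a_case():
    # Fails as soon as a new enum member is added without a matching case.
    missing = set(ProblemType) - set(CASES)
    assert not missing, f"No case for: {sorted(m.name for m in missing)}"
```

With this guard in place, adding a member to the enum later forces a decision about its expected grading behavior instead of leaving the branch uncovered.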

Out of Scope

This issue should not cover:

  • Refactoring production grading code.
  • Changing scoring thresholds or grading algorithms.
  • Changing exam route handlers.
  • Moving grading helpers to another package.

Chosen Implementation Approach

Use characterization tests only. Build minimal representative item dictionaries and light test doubles for VLM/storage where async grading requires them. Assert the exact current output shape and current retry/fallback behavior.
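As a rough illustration of this approach, the sketch below pins the full output of a pure helper using a minimal item dictionary. Everything in it is an assumption made for the example: `grade_objective_item_stub` stands in for the real helper (whose signature may differ), and the item keys and status strings are invented.

```python
from typing import Optional


def grade_objective_item_stub(item: dict, answer: Optional[str]) -> dict:
    """Hypothetical stand-in for the real helper in exam_grading.py."""
    is_correct = answer is not None and answer == item["correct_answer"]
    return {
        "status": "CORRECT" if is_correct else "INCORRECT",
        "score": item["max_score"] if is_correct else 0,
    }


def make_item(**overrides) -> dict:
    """Build a minimal representative item dict; the keys are assumptions."""
    item = {"problem_type": "SINGLE_CHOICE", "correct_answer": "B", "max_score": 5}
    item.update(overrides)
    return item


def test_correct_answer_pins_status_and_score():
    result = grade_objective_item_stub(make_item(), "B")
    # Compare the whole dict so an added, removed, or renamed key fails the test.
    assert result == {"status": "CORRECT", "score": 5}


def test_missing_answer_pins_status_and_score():
    result = grade_objective_item_stub(make_item(), None)
    assert result == {"status": "INCORRECT", "score": 0}
```

Comparing against a full literal dict, not individual fields, is what makes the test a characterization test: any change in shape shows up as a failure.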

Implementation Plan

The implementor should:

  1. Inspect exam_grading.py and exam_helpers.py to enumerate inputs and return shapes.
  2. Add pure helper tests for objective grading, result construction, and tracking updates.
  3. Add async tests for short-answer grading success, retryable failure, non-retryable failure, and unexpected exception fallback.
  4. Add exam helper tests for summary/item construction where behavior is currently exercised only indirectly.
  5. Run the targeted tests and then the full backend suite.
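The async cases in step 3 could be driven by a small scripted test double, as sketched below. This is not the real orchestration: `grade_short_answer_stub`, `RetryableVLMError`, `FakeVLM`, the `MAX_ATTEMPTS` value, and the `GRADED` status are all hypothetical, and the actual retry cap and error types should be read from exam_grading.py.

```python
import asyncio


class RetryableVLMError(Exception):
    """Hypothetical error type standing in for a retryable VLM failure."""


MAX_ATTEMPTS = 3  # assumed cap; read the real value from exam_grading.py


async def grade_short_answer_stub(vlm, item: dict, answer: str) -> dict:
    """Hypothetical orchestration: retry retryable errors, else fall back."""
    for _ in range(MAX_ATTEMPTS):
        try:
            score = await vlm.grade(item, answer)
            return {"status": "GRADED", "score": score}
        except RetryableVLMError:
            continue  # try again until the cap is reached
        except Exception:
            break  # unexpected errors are not retried
    return {"status": "PENDING_REVIEW", "score": None}


class FakeVLM:
    """Light test double that replays scripted outcomes and counts calls."""

    def __init__(self, outcomes):
        self._outcomes = list(outcomes)
        self.calls = 0

    async def grade(self, item, answer):
        self.calls += 1
        outcome = self._outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def test_retryable_errors_exhaust_cap_then_fall_back():
    vlm = FakeVLM([RetryableVLMError()] * MAX_ATTEMPTS)
    result = asyncio.run(grade_short_answer_stub(vlm, {"id": 1}, "text"))
    assert result == {"status": "PENDING_REVIEW", "score": None}
    assert vlm.calls == MAX_ATTEMPTS


def test_unexpected_exception_falls_back_without_retry():
    vlm = FakeVLM([ValueError("boom")])
    result = asyncio.run(grade_short_answer_stub(vlm, {"id": 1}, "text"))
    assert result["status"] == "PENDING_REVIEW"
    assert vlm.calls == 1
```

Counting calls on the double is what pins the retry count; asserting only on the final status would let a changed cap slip through.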

Relevant Files / Areas

Likely relevant areas:

  • backend/app/presentation/exam_grading.py
  • backend/app/presentation/exam_helpers.py
  • backend/tests/presentation/test_exam_grading.py
  • backend/tests/presentation/test_exam_helpers.py

Tests Required

The implementor must add or update automated tests covering:

  • Direct unit tests for grading result dict shape.
  • Objective grading tests across supported problem types.
  • Short-answer VLM success and fallback tests.
  • Tracking update tests for correct and incorrect answers.
  • Exam helper tests for existing summary/item behavior.

At minimum, tests should verify:

  • Correct objective answers produce the same status and score as before.
  • Incorrect or missing answers produce the same status and score as before.
  • Retryable VLM errors are retried up to the current cap, after which grading falls back to PENDING_REVIEW.
  • build_grading_result includes the exact expected keys.
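For the key-shape check, comparing key sets catches both missing and unexpected keys. In the sketch below, `build_grading_result_stub` and `EXPECTED_KEYS` are hypothetical; the real key set should be copied from what build_grading_result returns today.

```python
# Assumed key set for illustration; copy the real one from current output.
EXPECTED_KEYS = {"item_id", "status", "score", "feedback"}


def build_grading_result_stub(item_id, status, score, feedback=None) -> dict:
    """Hypothetical stand-in for build_grading_result."""
    return {
        "item_id": item_id,
        "status": status,
        "score": score,
        "feedback": feedback,
    }


def test_result_has_exact_keys():
    result = build_grading_result_stub(7, "PENDING_REVIEW", None)
    actual = set(result)
    # Report both directions so a failure names the offending keys.
    missing = EXPECTED_KEYS - actual
    extra = actual - EXPECTED_KEYS
    assert not missing and not extra, f"missing={missing}, extra={extra}"
```

Reporting missing and extra keys separately keeps the failure message readable when a later refactor renames a field.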

Manual Verification / Self-Check

Before claiming this issue is done, the implementor must:

  1. Run the relevant automated test suite.
  2. Manually verify the main behavior described in this issue when applicable.
  3. Verify that no related existing behavior regressed.
  4. Record the exact commands run and their results in the PR description.

Suggested verification commands:

cd backend && uv run --frozen pytest tests/presentation/test_exam_grading.py
cd backend && uv run --frozen pytest tests/presentation/test_exam_helpers.py
cd backend && uv run --frozen pytest

Reviewer Acceptance Checklist

The reviewer should verify that:

  • The implementation matches the expected behavior described above.
  • The chosen implementation approach was followed, or any deviation is clearly justified.
  • The change is appropriately scoped and does not include unrelated work.
  • Required automated tests were added or updated.
  • The tests pass against the current production code and would fail if the pinned behavior changed.
  • The implementor included self-verification results in the PR.
  • Edge cases and regression risks were considered.
  • Documentation, comments, or user-facing text were updated if needed.

Dependencies

None.

Follow-Up Work

Future exam-domain extraction may use these tests as a safety net, but that extraction is not part of this issue.

Definition of Done

This issue is done when:

  • The new presentation test files exist and pass.
  • The full backend suite passes.
  • No production code is changed, except for test-only support where absolutely necessary and justified in the PR.
  • The PR description records exact verification commands and results.
