Scope 4: Imputation Quality Evaluation — RMSE and MAE via shared holdback #93

@DEVunderdog

Problem Statement

After ImputationOrchestrator.fit() returns, a data scientist has no direct, unit-grounded measure of how accurate the imputation actually was. R² (added in Scope 3) tells you how much variance the model explains, but it does not tell you how far off the imputed values are in the column's own units. A data scientist working with a medical dataset cannot tell from R²=0.6 whether the imputer is off by ±2 mmHg or ±40 mmHg on a blood pressure column. RMSE and MAE express imputation error in the column's own units, making the diagnostic immediately interpretable to the domain expert.

Solution

Extend ImputationFitDiagnostic with two new fields — rmse and mae — computed during fit() from the same 20% held-back complete rows already used for r2_train. All three metrics (R², RMSE, MAE) are produced in a single evaluation pass per model; the final stored model is then re-fit on all complete rows. No new config parameters, no separate masking pass, no change to the fit() / transform() contract.

User Stories

  1. As a data scientist, I want to see RMSE on each imputed column so that I can understand the imputation error in the column's own units rather than as a dimensionless ratio.
  2. As a data scientist, I want to see MAE alongside RMSE so that I can assess whether large errors are concentrated in a few outlier rows (RMSE >> MAE) or distributed evenly.
  3. As a data scientist, I want RMSE and MAE to be computed automatically when I call fit() so that I do not need to run a separate evaluation step.
  4. As a data scientist, I want RMSE and MAE to appear alongside r2_train on ImputationFitDiagnostic so that all quality signals are in one place.
  5. As a data scientist, I want no RMSE or MAE to be reported for scalar strategies (Mean, Median, Mode), where diagnostic itself is None, so that the diagnostic is not cluttered with trivially predictable numbers.
  6. As a data scientist, I want RMSE and MAE to be None when there are too few complete rows to produce a reliable estimate, consistent with how r2_train is already guarded.
  7. As a data scientist, I want RMSE and MAE to be available on ColumnImputationRecord.diagnostic for every KNN, Regression, and MICE column so that I can inspect them column by column.
  8. As a data scientist using MICE, I want RMSE and MAE computed per column from the same shared holdback set so that all MICE columns are evaluated under identical conditions.
  9. As a data scientist, I want the ImputationFitDiagnostic to be serializable via FittedImputer.to_dict() / from_dict() so that I can persist a fitted imputer and recover the full diagnostic later.
  10. As a data scientist, I want RMSE and MAE to reflect out-of-sample prediction quality (not in-sample), consistent with the held-back evaluation already used for r2_train.
  11. As a data scientist, I want the diagnostic to be part of the Public API so that I can type-hint against ImputationFitDiagnostic in downstream code.
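
User story 2 relies on the gap between RMSE and MAE as an outlier signal. The sketch below uses made-up residuals (not taken from any real dataset) to show two error patterns with the same MAE but different RMSE; the helper names are illustrative only.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Same total absolute error, spread evenly vs. concentrated in one row.
even_errors = [2.0, -2.0, 2.0, -2.0]
outlier_errors = [0.0, 0.0, 0.0, 8.0]

print(mae(even_errors), rmse(even_errors))        # 2.0 2.0
print(mae(outlier_errors), rmse(outlier_errors))  # 2.0 4.0
```

When errors are spread evenly the two metrics coincide; when one row carries the whole error, RMSE doubles while MAE stays put.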

Implementation Decisions

Scope boundary

Scope 4 adds only rmse and mae to ImputationFitDiagnostic. Distribution comparison (observed_mean, observed_std, variance_ratio) and convergence signals (converged, n_iter) are part of Scope 3 (Issue #92). r2_train is also Scope 3. Scope 4 is strictly additive on top of that design.

New dataclass: ImputationFitDiagnostic

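Lives in _config.py. Scope 4 contributes rmse and mae; the remaining fields come from the Scope 3 design (Issue #92). Full field list (ten fields): r2_train, rmse, mae, converged, n_iter, imputed_mean, imputed_std, observed_mean, observed_std, variance_ratio. All optional (float | None). Part of the Public API.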
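
A minimal sketch of what the dataclass could look like, assuming a plain standard-library dataclass. The field names and the float | None typing are taken from the list above; the None defaults and the use of Optional are assumptions, and the real definition may type converged and n_iter differently.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ImputationFitDiagnostic:
    # Holdback metrics (r2_train from Scope 3; rmse and mae from Scope 4).
    r2_train: Optional[float] = None
    rmse: Optional[float] = None
    mae: Optional[float] = None
    # Convergence signals (Scope 3). Typed as in the issue text; the real
    # code may use bool / int here. That is not confirmed by the source.
    converged: Optional[float] = None
    n_iter: Optional[float] = None
    # Distribution comparison (Scope 3).
    imputed_mean: Optional[float] = None
    imputed_std: Optional[float] = None
    observed_mean: Optional[float] = None
    observed_std: Optional[float] = None
    variance_ratio: Optional[float] = None


field_names = [f.name for f in fields(ImputationFitDiagnostic)]
```

With every field defaulting to None, a caller can construct a partially populated diagnostic, for example ImputationFitDiagnostic(rmse=4.2, mae=3.1).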

New field on ColumnImputationRecord

Add diagnostic: ImputationFitDiagnostic | None = None. Present for KNN, Regression, MICE. None for Mean, Median, Mode, Constant, Dropped, Passthrough.

New internal function: _evaluate_holdback()

Pure computation function: takes a fitted model, the full feature matrix, holdback row indices, and the true target values for those rows. Returns R², RMSE, MAE. No Polars, no I/O. Shared by KNN, Regression, and MICE evaluation paths.
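
The description above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the project's implementation: the function name, the predict() interface, the list-based matrix, and the handling of a zero-variance target are all hypothetical, and the real function would more likely work on NumPy arrays.

```python
import math
from typing import Sequence, Tuple


def evaluate_holdback_sketch(
    model,
    X: Sequence[Sequence[float]],
    holdback_idx: Sequence[int],
    y_true: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (r2, rmse, mae) for the holdback rows in one pass."""
    # Select holdback rows from the full feature matrix and predict once.
    y_pred = list(model.predict([X[i] for i in holdback_idx]))
    errors = [p - t for p, t in zip(y_pred, y_true)]
    n = len(errors)

    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    if ss_tot > 0:
        r2 = 1.0 - ss_res / ss_tot
    else:
        # Assumed convention for a constant target, where R² is undefined.
        r2 = 1.0 if ss_res == 0 else 0.0
    return r2, rmse, mae


class SumModel:
    """Toy stand-in for a fitted model: predicts the sum of each row."""

    def predict(self, X):
        return [sum(row) for row in X]


X_full = [[1.0], [2.0], [3.0], [4.0]]
r2, rmse, mae = evaluate_holdback_sketch(SumModel(), X_full, [1, 3], [2.0, 5.0])
```

Here the toy model predicts 2.0 and 4.0 against true values 2.0 and 5.0, so one residual is zero and the other is -1.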

Shared holdback pass

The same 20% held-back complete rows used for r2_train (ADR-0013) also produce rmse and mae. R², RMSE, and MAE are computed together in one evaluation pass. The final stored model is then re-fit on all complete rows. No separate masking pass is introduced (ADR-0014).
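
The flow described above (fit on 80%, score all three metrics on the 20% holdback, then re-fit on everything) might look like the sketch below. MeanModel is a toy stand-in for a KNN or regression model, and the seeded shuffle, the rounding of the holdback size, and the function name are assumptions; the source only fixes the 20% fraction.

```python
import math
import random

HOLDBACK_FRACTION = 0.2  # hardcoded at 20% per the text (ADR-0013)


class MeanModel:
    """Toy model: always predicts the mean of its training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        self.n_fit_ = len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_with_shared_holdback(X, y, seed=0):
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)  # seeding strategy is an assumption
    n_hold = max(1, round(len(y) * HOLDBACK_FRACTION))
    hold_idx, train_idx = idx[:n_hold], idx[n_hold:]

    # 1) Evaluation model sees only the 80% training split.
    eval_model = MeanModel().fit(
        [X[i] for i in train_idx], [y[i] for i in train_idx]
    )

    # 2) A single pass over the holdback yields all three metrics.
    y_true = [y[i] for i in hold_idx]
    y_pred = eval_model.predict([X[i] for i in hold_idx])
    errors = [p - t for p, t in zip(y_pred, y_true)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    metrics = {
        "r2_train": (1.0 - ss_res / ss_tot) if ss_tot > 0
        else (1.0 if ss_res == 0 else 0.0),
        "rmse": math.sqrt(ss_res / len(errors)),
        "mae": sum(abs(e) for e in errors) / len(errors),
    }

    # 3) The stored model is re-fit on every complete row.
    final_model = MeanModel().fit(X, y)
    return final_model, metrics, sorted(hold_idx)


X_demo = [[float(i)] for i in range(10)]
y_demo = [float(i) for i in range(10)]
final_model, metrics, hold_idx = fit_with_shared_holdback(X_demo, y_demo)
```

The metrics describe the 80% model, while the returned model has seen all ten rows, which is the trade-off the shared-holdback design accepts.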

Minimum rows guard

When complete rows fall below refit_r2_min_complete_rows, all three of r2_train, rmse, and mae are None. One guard fires, all three fields go None together.
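
A sketch of the single guard, assuming "fall below" means a strict less-than comparison. The function name and the callback shape are hypothetical; only refit_r2_min_complete_rows is named in the source.

```python
from typing import Callable, Optional, Tuple

Metrics = Tuple[Optional[float], Optional[float], Optional[float]]


def guarded_holdback_metrics(
    n_complete_rows: int,
    min_complete_rows: int,  # value of refit_r2_min_complete_rows
    compute: Callable[[], Tuple[float, float, float]],
) -> Metrics:
    """Return (r2_train, rmse, mae), or three Nones when the guard fires."""
    if n_complete_rows < min_complete_rows:
        # One guard, three fields: no metric is reported on too little data.
        return (None, None, None)
    return compute()


too_few = guarded_holdback_metrics(5, 10, lambda: (0.9, 1.0, 0.8))
enough = guarded_holdback_metrics(10, 10, lambda: (0.9, 1.0, 0.8))
```

Routing all three metrics through one check makes it impossible for rmse to be populated while r2_train is None.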

MICE holdback

The holdback for a MICE block is the set of rows complete across all MICE columns. 20% of that intersection is held back; R², RMSE, and MAE are computed per column from that single shared holdback set. If the intersection is smaller than refit_r2_min_complete_rows, all three fields are None for every column in the block.
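
The intersection rule can be sketched with plain Python lists, using None for missing cells. The function name, the seeded shuffle, and the rounding of the 20% share are assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional


def mice_shared_holdback(
    columns: Dict[str, List[Optional[float]]],
    min_complete_rows: int,
    seed: int = 0,
) -> Optional[List[int]]:
    """Return shared holdback row indices for a MICE block, or None."""
    n_rows = len(next(iter(columns.values())))
    # Rows that are complete across every column in the block.
    complete = [
        i for i in range(n_rows)
        if all(col[i] is not None for col in columns.values())
    ]
    if len(complete) < min_complete_rows:
        return None  # every column in the block gets None metrics

    shuffled = complete[:]
    random.Random(seed).shuffle(shuffled)
    n_hold = max(1, round(len(complete) * 0.2))
    return sorted(shuffled[:n_hold])


block = {
    "a": [1.0, None, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
    "b": [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
}
holdback = mice_shared_holdback(block, min_complete_rows=5)
```

In this example rows 1 and 2 each miss one value, so ten rows form the intersection and two of them are held back for every column in the block.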

RMSE and MAE are diagnostic-only

suggest_refit_config does not use rmse or mae for automated rules. RMSE and MAE are in column-specific units with no universal threshold; r2_train already provides the dimensionless automated signal. (ADR-0015)

Serialization

ColumnImputationRecord.to_dict() serializes diagnostic as a nested dict. FittedImputer.from_dict() restores it into an ImputationFitDiagnostic instance. Round-trip fidelity is required for all ten ImputationFitDiagnostic fields, including None values.
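
A minimal round-trip sketch, assuming the diagnostic is a standard-library dataclass and that the nested dict is JSON-compatible. The two helper functions are hypothetical and stand in for whatever ColumnImputationRecord.to_dict() and FittedImputer.from_dict() do internally.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class ImputationFitDiagnostic:
    r2_train: Optional[float] = None
    rmse: Optional[float] = None
    mae: Optional[float] = None
    converged: Optional[float] = None
    n_iter: Optional[float] = None
    imputed_mean: Optional[float] = None
    imputed_std: Optional[float] = None
    observed_mean: Optional[float] = None
    observed_std: Optional[float] = None
    variance_ratio: Optional[float] = None


def diagnostic_to_dict(
    diag: Optional[ImputationFitDiagnostic],
) -> Optional[Dict[str, Any]]:
    # Keep every key, including those whose value is None.
    return None if diag is None else asdict(diag)


def diagnostic_from_dict(
    data: Optional[Dict[str, Any]],
) -> Optional[ImputationFitDiagnostic]:
    if data is None:
        return None
    known = {f.name for f in fields(ImputationFitDiagnostic)}
    return ImputationFitDiagnostic(**{k: v for k, v in data.items() if k in known})


original = ImputationFitDiagnostic(r2_train=0.62, rmse=4.5, mae=3.25)
payload = json.dumps(diagnostic_to_dict(original))  # None becomes JSON null
restored = diagnostic_from_dict(json.loads(payload))
```

Because asdict() emits every field, None values survive as explicit nulls rather than as missing keys or zeros.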

No new config parameters

The holdback fraction (20%) is hardcoded, consistent with ADR-0013. No refit_holdback_fraction or equivalent is added to NumericImputationConfig.

Testing Decisions

What makes a good test here: tests assert on the external shape and semantics of ImputationFitDiagnostic fields — not on the exact float values produced by sklearn models. A test that asserts rmse >= 0 and rmse < observed_std * 2 is correct; a test that asserts rmse == 1.2345 is brittle. Tests use minimal synthetic DataFrames constructed inline, following the pattern established in test_model_strategies.py and test_numeric_imputer.py.
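
As an illustration of the shape-and-semantics style described above, a reusable check might look like this. The helper name is hypothetical, and SimpleNamespace stands in for a real ImputationFitDiagnostic so the sketch runs on its own; the numbers are invented.

```python
from types import SimpleNamespace


def check_diagnostic_shape(diag) -> None:
    """Assert semantic bounds rather than exact float values."""
    assert diag.rmse is not None and diag.mae is not None
    assert diag.rmse >= 0 and diag.mae >= 0
    # RMSE can never be smaller than MAE (small tolerance for float noise).
    assert diag.rmse + 1e-12 >= diag.mae
    # Loose sanity bound taken from the text above.
    assert diag.rmse < diag.observed_std * 2


fake_diag = SimpleNamespace(rmse=3.1, mae=2.4, observed_std=5.0)
check_diagnostic_shape(fake_diag)
```

A check like this keeps passing when the underlying model or random seed changes, which an exact-value assertion would not.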

Module 1: _evaluate_holdback() (unit tests)

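  • Returns (r2, rmse, mae) as a tuple of floats for a trivially predictable case (for example, a target that is an exact linear function of one feature; a constant target leaves R² undefined because its total sum of squares is zero)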
  • rmse and mae are non-negative for any input
  • rmse >= mae always holds (Cauchy-Schwarz)
  • Returns (None, None, None) when the complete-row count is below refit_r2_min_complete_rows
  • rmse and mae are zero when the model predicts perfectly
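
For reference, the inequality in the third bullet follows from the Cauchy-Schwarz inequality applied to the absolute residuals and a vector of ones. Writing e_i for the residual on holdback row i and n for the number of holdback rows (notation introduced here, not in the source):

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |e_i| \cdot 1
\;\le\; \frac{1}{n}\sqrt{\sum_{i=1}^{n} e_i^{2}}\,\sqrt{\sum_{i=1}^{n} 1^{2}}
= \frac{\sqrt{n}}{n}\sqrt{\sum_{i=1}^{n} e_i^{2}}
= \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}}
= \mathrm{RMSE}
```

Equality holds exactly when every |e_i| is the same, which includes the perfect-prediction case where both metrics are zero.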

Module 2: ColumnImputationRecord serialization (unit tests)

  • to_dict() / from_dict() round-trip preserves all ImputationFitDiagnostic fields
  • Round-trip preserves None values correctly (not converted to 0 or missing keys)
  • diagnostic is absent from the dict for scalar strategies (or serialized as null)

Module 3: NumericImputer.fit() integration (integration tests, following test_model_strategies.py)

  • diagnostic is not None for KNN, Regression, and MICE columns
  • diagnostic is None for Mean, Median, Mode, Constant, Dropped, and Passthrough columns
  • diagnostic.rmse and diagnostic.mae are non-negative floats for model-based columns with sufficient complete rows
  • diagnostic.rmse and diagnostic.mae are None when complete rows < refit_r2_min_complete_rows
  • For a MICE block: all MICE columns share the same holdback, so r2_train, rmse, and mae are either all None or all populated across every column in the block
  • rmse and mae are not used by suggest_refit_config (the function's output config is unchanged when only rmse/mae are poor)

Out of Scope

Further Notes
