Scope 4: Imputation Quality Evaluation — RMSE and MAE via shared holdback #93

## Problem Statement

After `ImputationOrchestrator.fit()` returns, a data scientist has no direct, unit-grounded measure of how accurate the imputation actually was. R² (added in Scope 3) tells you how much variance the model explains, but it does not tell you how far off the imputed values are in the column's own units. A data scientist working with a medical dataset cannot tell from R²=0.6 whether the imputer is off by ±2 mmHg or ±40 mmHg on a blood pressure column. RMSE and MAE express imputation error in the column's own units, making the diagnostic immediately interpretable to the domain expert.

## Solution

Extend `ImputationFitDiagnostic` with two new fields — `rmse` and `mae` — computed during `fit()` from the same 20% held-back complete rows already used for `r2_train`. All three metrics (R², RMSE, MAE) are produced in a single evaluation pass per model; the final stored model is then re-fit on all complete rows. No new config parameters, no separate masking pass, no change to the `fit()` / `transform()` contract.

## User Stories

1. As a data scientist, I want to see RMSE on each imputed column so that I can understand the imputation error in the column's own units rather than as a dimensionless ratio.
2. As a data scientist, I want to see MAE alongside RMSE so that I can assess whether large errors are concentrated in a few outlier rows (RMSE >> MAE) or distributed evenly.
3. As a data scientist, I want RMSE and MAE to be computed automatically when I call `fit()` so that I do not need to run a separate evaluation step.
4. As a data scientist, I want RMSE and MAE to appear alongside `r2_train` on `ImputationFitDiagnostic` so that all quality signals are in one place.
5. As a data scientist, I want no RMSE or MAE reported for scalar strategies (Mean, Median, Mode), where `diagnostic` itself is `None`, so that the diagnostic is not cluttered with trivially predictable numbers.
6. As a data scientist, I want RMSE and MAE to be `None` when there are too few complete rows to produce a reliable estimate, consistent with how `r2_train` is already guarded.
7. As a data scientist, I want RMSE and MAE to be available on `ColumnImputationRecord.diagnostic` for every KNN, Regression, and MICE column so that I can inspect them column by column.
8. As a data scientist using MICE, I want RMSE and MAE computed per column from the same shared holdback set so that all MICE columns are evaluated under identical conditions.
9. As a data scientist, I want the `ImputationFitDiagnostic` to be serializable via `FittedImputer.to_dict()` / `from_dict()` so that I can persist a fitted imputer and recover the full diagnostic later.
10. As a data scientist, I want RMSE and MAE to reflect out-of-sample prediction quality (not in-sample), consistent with the held-back evaluation already used for `r2_train`.
11. As a data scientist, I want the diagnostic to be part of the Public API so that I can type-hint against `ImputationFitDiagnostic` in downstream code.
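
Story 2 relies on the gap between RMSE and MAE as a signal of outlier-driven error. The sketch below illustrates that reading with two invented residual sets that share the same MAE. The values and helper names are made up for illustration and are not part of the library.

```python
import math


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical residuals in mmHg. Both sets have an MAE of 4.0.
even_errors = [4.0, -4.0, 4.0, -4.0, 4.0]
outlier_errors = [1.0, -1.0, 1.0, -1.0, 16.0]

# A ratio near 1 means errors are spread evenly across rows;
# a ratio well above 1 means a few rows carry most of the error.
even_gap = rmse(even_errors) / mae(even_errors)
outlier_gap = rmse(outlier_errors) / mae(outlier_errors)
```

Because squaring weights large residuals more heavily, the second set produces a visibly larger RMSE even though its MAE is identical.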

## Implementation Decisions

### Scope boundary
Scope 4 adds only `rmse` and `mae` to `ImputationFitDiagnostic`. Distribution comparison (`observed_mean`, `observed_std`, `imputed_mean`, `imputed_std`, `variance_ratio`) and convergence signals (`converged`, `n_iter`) are part of Scope 3 (Issue #92). `r2_train` is also Scope 3. Scope 4 is strictly additive on top of that design.

### New dataclass: `ImputationFitDiagnostic`
Lives in `_config.py`. Scope 3 introduces the class; Scope 4 adds `rmse` and `mae` to it. Fields after this scope: `r2_train`, `rmse`, `mae`, `converged`, `n_iter`, `imputed_mean`, `imputed_std`, `observed_mean`, `observed_std`, `variance_ratio`. All optional (`float | None`). Part of the Public API.
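
As a rough picture of the resulting shape, the dataclass might look like the sketch below. The field names come from the list above. The defaults, the use of `Optional[float]` for every field, and the absence of `frozen=True` are assumptions of this sketch; the real class may well narrow the types of `converged` and `n_iter`.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ImputationFitDiagnostic:
    """Sketch only: every field is optional, as stated in this issue."""

    # Holdback metrics (r2_train from Scope 3, rmse and mae from Scope 4).
    r2_train: Optional[float] = None
    rmse: Optional[float] = None
    mae: Optional[float] = None
    # Convergence signals (Scope 3).
    converged: Optional[float] = None
    n_iter: Optional[float] = None
    # Distribution comparison (Scope 3).
    imputed_mean: Optional[float] = None
    imputed_std: Optional[float] = None
    observed_mean: Optional[float] = None
    observed_std: Optional[float] = None
    variance_ratio: Optional[float] = None


# Fields that are not supplied stay None.
example = ImputationFitDiagnostic(r2_train=0.6, rmse=2.5, mae=1.9)
field_names = [f.name for f in fields(ImputationFitDiagnostic)]
```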

### New field on `ColumnImputationRecord`
Add `diagnostic: ImputationFitDiagnostic | None = None`. Present for KNN, Regression, MICE. `None` for Mean, Median, Mode, Constant, Dropped, Passthrough.

### New internal function: `_evaluate_holdback()`
Pure computation function: takes a fitted model, the full feature matrix, holdback row indices, and the true target values for those rows. Returns R², RMSE, MAE. No Polars, no I/O. Shared by KNN, Regression, and MICE evaluation paths.
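
A minimal sketch of such a function, written against plain Python sequences, could look like this. The name `evaluate_holdback_sketch`, the `predict` protocol, and the choice to return NaN for R² when the held-back targets have zero variance are all assumptions of the sketch, not the planned implementation.

```python
import math
from typing import Protocol, Sequence, Tuple


class Predictor(Protocol):
    """Anything with a predict method (hypothetical stand-in for a fitted model)."""

    def predict(self, X: Sequence[Sequence[float]]) -> Sequence[float]: ...


def evaluate_holdback_sketch(
    model: Predictor,
    X: Sequence[Sequence[float]],
    holdback_idx: Sequence[int],
    y_true: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (r2, rmse, mae) for the held-back rows in one pass."""
    y_pred = model.predict([X[i] for i in holdback_idx])
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² is undefined when the targets do not vary; this sketch returns NaN.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae


class DoubleModel:
    """Toy model that predicts twice the first feature."""

    def predict(self, X):
        return [2.0 * row[0] for row in X]


X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
perfect = evaluate_holdback_sketch(DoubleModel(), X, [1, 3], [4.0, 8.0])
off_by_one = evaluate_holdback_sketch(DoubleModel(), X, [1, 3], [5.0, 7.0])
```

Computing all three metrics from the same residual list is what makes the single evaluation pass cheap: the model predicts once and the sums are reused.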

### Shared holdback pass
The same 20% held-back complete rows used for `r2_train` (ADR-0013) also produce `rmse` and `mae`. R², RMSE, and MAE are computed together in one evaluation pass. The final stored model is then re-fit on all complete rows. No separate masking pass is introduced (ADR-0014).
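
The sequence described above (hold back 20%, fit on the rest, evaluate once, refit on everything) can be sketched as follows. A one-feature least-squares line stands in for the real model, and the shuffle-based split and `seed` argument are assumptions, since the split mechanics of ADR-0013 are not reproduced in this issue.

```python
import math
import random

HOLDBACK_FRACTION = 0.2  # hardcoded, mirroring the 20% described above


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdback_pass(xs, ys, seed=0):
    """Evaluate on held-back rows, then refit on every complete row."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)  # seeding is an assumption of this sketch
    n_hold = max(1, round(len(idx) * HOLDBACK_FRACTION))
    hold, train = idx[:n_hold], idx[n_hold:]

    # 1. Fit a temporary model on the rows that were not held back.
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])

    # 2. One evaluation pass yields all three metrics.
    y_hold = [ys[i] for i in hold]
    residuals = [ys[i] - (slope * xs[i] + intercept) for i in hold]
    ss_res = sum(r * r for r in residuals)
    mean_hold = sum(y_hold) / len(y_hold)
    ss_tot = sum((y - mean_hold) ** 2 for y in y_hold)
    metrics = {
        "r2_train": 1.0 - ss_res / ss_tot if ss_tot > 0 else None,
        "rmse": math.sqrt(ss_res / len(hold)),
        "mae": sum(abs(r) for r in residuals) / len(hold),
    }

    # 3. The stored model is refit on all complete rows.
    final_model = fit_line(xs, ys)
    return final_model, metrics


xs = [float(i) for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
(final_slope, final_intercept), metrics = holdback_pass(xs, ys)
```

The temporary model from step 1 is discarded; only its metrics survive, which is why the reported numbers describe out-of-sample quality while the stored model still benefits from every complete row.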

### Minimum rows guard
When complete rows fall below `refit_r2_min_complete_rows`, all three of `r2_train`, `rmse`, and `mae` are `None`. A single guard governs all three fields, so they become `None` together.

### MICE holdback
The holdback for a MICE block is the set of rows complete across **all** MICE columns. 20% of that intersection is held back; R², RMSE, and MAE are computed per column from that single shared holdback set. If the intersection is smaller than `refit_r2_min_complete_rows`, all three fields are `None` for every column in the block.
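
One way to picture the intersection rule is the sketch below, which works on plain lists with `None` marking a missing value. The function name, the `seed` argument, and the rounding of the 20% share are assumptions; only the intersection logic and the minimum-rows guard come from the text above.

```python
import random
from typing import Dict, List, Optional

HOLDBACK_FRACTION = 0.2  # hardcoded, as elsewhere in this design


def mice_holdback_rows(
    columns: Dict[str, List[Optional[float]]],
    min_complete_rows: int,
    seed: int = 0,
) -> Optional[List[int]]:
    """Pick one shared holdback set for a whole MICE block.

    Returns None when the guard fires, meaning r2_train, rmse and mae
    stay None for every column in the block.
    """
    n_rows = len(next(iter(columns.values())))
    # A row qualifies only if it is complete in ALL MICE columns.
    complete = [
        i for i in range(n_rows)
        if all(col[i] is not None for col in columns.values())
    ]
    if len(complete) < min_complete_rows:
        return None
    shuffled = list(complete)
    random.Random(seed).shuffle(shuffled)
    n_hold = max(1, round(len(shuffled) * HOLDBACK_FRACTION))
    return sorted(shuffled[:n_hold])


# Hypothetical block: row 0 is missing in "a", rows 0 and 1 are missing in "b".
columns = {
    "a": [None, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "b": [None, None, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5],
}
shared_rows = mice_holdback_rows(columns, min_complete_rows=5)
guarded = mice_holdback_rows(columns, min_complete_rows=9)
```

Every column in the block is then scored on `shared_rows`, so differences in RMSE or MAE between columns reflect the columns themselves and not different evaluation samples.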

### RMSE and MAE are diagnostic-only
`suggest_refit_config` does not use `rmse` or `mae` for automated rules. RMSE and MAE are in column-specific units with no universal threshold; `r2_train` already provides the dimensionless automated signal. (ADR-0015)

### Serialization
`ColumnImputationRecord.to_dict()` serializes `diagnostic` as a nested dict. `FittedImputer.from_dict()` restores it into an `ImputationFitDiagnostic` instance. Round-trip fidelity is required for all ten fields, including `None` values.
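
The round-trip requirement can be illustrated with a trimmed stand-in. `DiagnosticSketch` and the two helper functions below are hypothetical and carry only three of the fields; the point is that `None` survives a trip through JSON as `null` with its key still present.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DiagnosticSketch:
    """Hypothetical three-field stand-in for ImputationFitDiagnostic."""

    r2_train: Optional[float] = None
    rmse: Optional[float] = None
    mae: Optional[float] = None


def diagnostic_to_dict(diagnostic):
    """Nested-dict form; a missing diagnostic serializes as None."""
    return None if diagnostic is None else asdict(diagnostic)


def diagnostic_from_dict(payload):
    """Inverse of diagnostic_to_dict."""
    return None if payload is None else DiagnosticSketch(**payload)


original = DiagnosticSketch(r2_train=0.6, rmse=None, mae=3.5)
# Going through a JSON string mimics persisting the fitted imputer.
payload = json.loads(json.dumps(diagnostic_to_dict(original)))
restored = diagnostic_from_dict(payload)
```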

### No new config parameters
The holdback fraction (20%) is hardcoded, consistent with ADR-0013. No `refit_holdback_fraction` or equivalent is added to `NumericImputationConfig`.

## Testing Decisions

**What makes a good test here:** tests assert on the external shape and semantics of `ImputationFitDiagnostic` fields — not on the exact float values produced by sklearn models. A test that asserts `rmse >= 0` and `rmse < observed_std * 2` is correct; a test that asserts `rmse == 1.2345` is brittle. Tests use minimal synthetic DataFrames constructed inline, following the pattern established in `test_model_strategies.py` and `test_numeric_imputer.py`.

### Module 1: `_evaluate_holdback()` (unit tests)
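- Returns `(r2, rmse, mae)` as a tuple of floats for a trivially predictable case (for example, a target that is an exact linear function of one feature; a constant target makes R² degenerate because the total sum of squares is zero)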
- `rmse` and `mae` are non-negative for any input
- `rmse >= mae` always holds (Cauchy-Schwarz)
- Returns `(None, None, None)` when holdback row count is below `refit_r2_min_complete_rows`
- `rmse` and `mae` are zero when the model predicts perfectly

### Module 2: `ColumnImputationRecord` serialization (unit tests)
- `to_dict()` / `from_dict()` round-trip preserves all `ImputationFitDiagnostic` fields
- Round-trip preserves `None` values correctly (not converted to 0 or missing keys)
- `diagnostic` is absent from the dict for scalar strategies (or serialized as `null`)

### Module 3: `NumericImputer.fit()` integration (integration tests, following `test_model_strategies.py`)
- `diagnostic` is not `None` for KNN, Regression, and MICE columns
- `diagnostic` is `None` for Mean, Median, Mode, Constant, Dropped, and Passthrough columns
- `diagnostic.rmse` and `diagnostic.mae` are non-negative floats for model-based columns with sufficient complete rows
- `diagnostic.rmse` and `diagnostic.mae` are `None` when complete rows < `refit_r2_min_complete_rows`
- For a MICE block: all MICE columns share the same holdback (`r2_train`, `rmse`, and `mae` are either all `None` or all non-`None` across every column in the block)
- `rmse` and `mae` are not used by `suggest_refit_config` (the function's output config is unchanged when only `rmse`/`mae` are poor)

## Out of Scope

- **Distribution comparison fields** (`observed_mean`, `observed_std`, `imputed_mean`, `imputed_std`, `variance_ratio`) — Scope 3 / Issue #92
- **Convergence signals** (`converged`, `n_iter`) — Scope 3 / Issue #92
- **`r2_train`** — Scope 3 / Issue #92
- **`suggest_refit_config` rules based on RMSE/MAE** — RMSE/MAE are diagnostic-only; R² handles automated decisions
- **Separate MCAR masking pass** (masking values within rows rather than holding back whole rows) — rejected in ADR-0014
- **Normalised RMSE as an automated rule trigger** — redundant with R²; rejected in ADR-0015
- **Configurable holdback fraction** — hardcoded at 20% to match ADR-0013
- **Per-column holdback for MICE** (Option B: each MICE column uses its own independent holdback) — rejected in favour of the intersection approach for consistency

## Further Notes

- Depends on Scope 3 (Issue #92) being implemented first: `ImputationFitDiagnostic` and `ColumnImputationRecord.diagnostic` must exist before `rmse` and `mae` can be added to them.
- Depends on Scopes 0–2 (Issues #89–91) for the overhaul of Regression and MICE internals that `_evaluate_holdback()` will call into.
- ADR-0014 (shared holdback) and ADR-0015 (RMSE/MAE diagnostic-only) are written and committed.
- `CONTEXT.md` has been updated to reflect `rmse` and `mae` on `ImputationFitDiagnostic`.
