Problem Statement
After ImputationOrchestrator.fit() returns, a data scientist has no direct, unit-grounded measure of how accurate the imputation actually was. R² (added in Scope 3) tells you how much variance the model explains, but it does not tell you how far off the imputed values are in the column's own units. A data scientist working with a medical dataset cannot tell from R²=0.6 whether the imputer is off by ±2 mmHg or ±40 mmHg on a blood pressure column. RMSE and MAE express imputation error in the column's own units, making the diagnostic immediately interpretable to the domain expert.
Solution
Extend ImputationFitDiagnostic with two new fields — rmse and mae — computed during fit() from the same 20% held-back complete rows already used for r2_train. All three metrics (R², RMSE, MAE) are produced in a single evaluation pass per model; the final stored model is then re-fit on all complete rows. No new config parameters, no separate masking pass, no change to the fit() / transform() contract.
User Stories
- As a data scientist, I want to see RMSE on each imputed column so that I can understand the imputation error in the column's own units rather than as a dimensionless ratio.
- As a data scientist, I want to see MAE alongside RMSE so that I can assess whether large errors are concentrated in a few outlier rows (RMSE >> MAE) or distributed evenly.
- As a data scientist, I want RMSE and MAE to be computed automatically when I call fit() so that I do not need to run a separate evaluation step.
- As a data scientist, I want RMSE and MAE to appear alongside r2_train on ImputationFitDiagnostic so that all quality signals are in one place.
- As a data scientist, I want RMSE and MAE to be None for scalar strategies (Mean, Median, Mode) so that the diagnostic is not cluttered with trivially predictable numbers.
- As a data scientist, I want RMSE and MAE to be None when there are too few complete rows to produce a reliable estimate, consistent with how r2_train is already guarded.
- As a data scientist, I want RMSE and MAE to be available on ColumnImputationRecord.diagnostic for every KNN, Regression, and MICE column so that I can inspect them column by column.
- As a data scientist using MICE, I want RMSE and MAE computed per column from the same shared holdback set so that all MICE columns are evaluated under identical conditions.
- As a data scientist, I want the ImputationFitDiagnostic to be serializable via FittedImputer.to_dict() / from_dict() so that I can persist a fitted imputer and recover the full diagnostic later.
- As a data scientist, I want RMSE and MAE to reflect out-of-sample prediction quality (not in-sample), consistent with the held-back evaluation already used for r2_train.
- As a data scientist, I want the diagnostic to be part of the Public API so that I can type-hint against ImputationFitDiagnostic in downstream code.
Implementation Decisions
Scope boundary
Scope 4 adds only rmse and mae to ImputationFitDiagnostic. Distribution comparison (observed_mean, observed_std, variance_ratio) and convergence signals (converged, n_iter) are part of Scope 3 (Issue #92). r2_train is also Scope 3. Scope 4 is strictly additive on top of that design.
New dataclass: ImputationFitDiagnostic
Add to _config.py. Fields: r2_train, rmse, mae, converged, n_iter, imputed_mean, imputed_std, observed_mean, observed_std, variance_ratio. All optional (float | None). Part of the Public API.
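As a rough illustration of the field list above, the dataclass might look like the following sketch. The frozen=True setting and the bool/int typing of converged and n_iter are assumptions of this sketch (the text describes every field as float | None); the field names and the None defaults come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen=True is an assumption, not stated in the issue
class ImputationFitDiagnostic:
    """Per-column fit diagnostics; every field is optional and defaults to None."""

    # Held-back evaluation metrics: R² is dimensionless, RMSE/MAE are in column units.
    r2_train: Optional[float] = None
    rmse: Optional[float] = None
    mae: Optional[float] = None
    # Convergence signals. Typing these as bool/int is an assumption of this sketch.
    converged: Optional[bool] = None
    n_iter: Optional[int] = None
    # Distribution comparison between imputed and observed values.
    imputed_mean: Optional[float] = None
    imputed_std: Optional[float] = None
    observed_mean: Optional[float] = None
    observed_std: Optional[float] = None
    variance_ratio: Optional[float] = None


# Fields that were not computed simply stay None.
diag = ImputationFitDiagnostic(r2_train=0.6, rmse=12.5, mae=9.0)
```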
New field on ColumnImputationRecord
Add diagnostic: ImputationFitDiagnostic | None = None. Present for KNN, Regression, MICE. None for Mean, Median, Mode, Constant, Dropped, Passthrough.
New internal function: _evaluate_holdback()
Pure computation function: takes a fitted model, the full feature matrix, holdback row indices, and the true target values for those rows. Returns R², RMSE, MAE. No Polars, no I/O. Shared by KNN, Regression, and MICE evaluation paths.
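A minimal sketch of what such a pure function could look like, written with the standard library only. It assumes the fitted model exposes an sklearn-style predict() and that rows are plain Python lists; the NaN fallback for a constant target is also an assumption, since the text does not say how that case is handled.

```python
import math


def _evaluate_holdback(model, X, holdback_idx, y_true):
    """Return (r2, rmse, mae) for the held-back rows in a single pass."""
    y_pred = model.predict([X[i] for i in holdback_idx])
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² is undefined when the held-back targets are constant (ss_tot == 0);
    # returning NaN here is a choice made for this sketch.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae


class _DoubleFirstFeature:
    """Hypothetical stand-in for a fitted model: predicts 2 * first feature."""

    def predict(self, X):
        return [2.0 * row[0] for row in X]


X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
holdback_idx = [1, 3]   # predictions for these rows: 4.0 and 8.0
y_true = [4.0, 12.0]    # residuals: 0.0 and 4.0
r2, rmse, mae = _evaluate_holdback(_DoubleFirstFeature(), X, holdback_idx, y_true)
# r2 == 0.5, mae == 2.0, rmse == sqrt(8) ≈ 2.83
```

In this toy case the whole error sits in one row, which is why rmse exceeds mae: this is the "errors concentrated in a few rows" signal that the user stories describe.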
Shared holdback pass
The same 20% held-back complete rows used for r2_train (ADR-0013) also produce rmse and mae. R², RMSE, and MAE are computed together in one evaluation pass. The final stored model is then re-fit on all complete rows. No separate masking pass is introduced (ADR-0014).
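The three steps of this pass (fit on 80%, evaluate once on the held-back 20%, re-fit on everything) can be sketched as follows. The helper names, the seeded shuffle, and the toy MeanModel are hypothetical; the text does not say how the held-back rows are selected.

```python
import random

HOLDBACK_FRACTION = 0.2  # hardcoded per the text; the constant name is hypothetical


def split_holdback(n_complete, seed=0):
    """Return (train_idx, holdback_idx) over range(n_complete), holding back 20%."""
    idx = list(range(n_complete))
    random.Random(seed).shuffle(idx)  # seeded shuffle is an assumption
    n_hold = max(1, round(n_complete * HOLDBACK_FRACTION))
    return sorted(idx[n_hold:]), sorted(idx[:n_hold])


def fit_with_holdback(make_model, X, y, evaluate, seed=0):
    train_idx, hold_idx = split_holdback(len(y), seed)
    # 1) Fit a temporary model on the training rows only.
    probe = make_model().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    # 2) One evaluation pass on the held-back rows yields all metrics together.
    metrics = evaluate(probe, X, hold_idx, [y[i] for i in hold_idx])
    # 3) The model that gets stored is re-fit on all complete rows.
    final = make_model().fit(X, y)
    return final, metrics


class MeanModel:
    """Toy model used only to make the sketch runnable."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        self.n_fit_ = len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def mae_only(model, X, hold_idx, y_true):
    """Stand-in evaluator; the real one would return R², RMSE, and MAE."""
    preds = model.predict([X[i] for i in hold_idx])
    return sum(abs(t - p) for t, p in zip(y_true, preds)) / len(y_true)


X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
final, metric = fit_with_holdback(MeanModel, X, y, mae_only)
```

The temporary model never leaves the function, so the metrics stay out-of-sample while the stored model still benefits from every complete row.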
Minimum rows guard
When the number of complete rows falls below refit_r2_min_complete_rows, all three of r2_train, rmse, and mae are None. A single guard fires, and all three fields go None together.
MICE holdback
The holdback for a MICE block is the set of rows complete across all MICE columns. 20% of that intersection is held back; R², RMSE, and MAE are computed per column from that single shared holdback set. If the intersection is smaller than refit_r2_min_complete_rows, all three fields are None for every column in the block.
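A small sketch of how the shared MICE holdback could be selected, including the minimum-rows guard. The function name, the dict-of-lists input, and the seeded sampling are assumptions made for illustration.

```python
import random


def mice_shared_holdback(columns, min_complete_rows, fraction=0.2, seed=0):
    """Pick one holdback set shared by every column in a MICE block.

    columns maps column name -> list of values (None marks a missing value).
    Returns sorted row indices, or None when the guard fires.
    """
    n_rows = len(next(iter(columns.values())))
    # Rows that are complete across ALL MICE columns (the intersection).
    complete = [
        i for i in range(n_rows)
        if all(col[i] is not None for col in columns.values())
    ]
    if len(complete) < min_complete_rows:
        return None  # caller sets r2_train, rmse, mae to None for every column
    n_hold = max(1, round(len(complete) * fraction))
    return sorted(random.Random(seed).sample(complete, n_hold))


block = {
    "systolic": [120.0, None, 130.0, 125.0, 118.0, 140.0,
                 None, 122.0, 135.0, 128.0, 119.0, 131.0],
    "diastolic": [80.0, 85.0, None, 82.0, 79.0, 90.0,
                  88.0, 81.0, 86.0, 84.0, 78.0, 87.0],
}
held = mice_shared_holdback(block, min_complete_rows=5)       # 2 of the 9 complete rows
too_few = mice_shared_holdback(block, min_complete_rows=20)   # None: guard fires
```

Because every column is scored on the same row indices, the per-column metrics are directly comparable within the block.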
RMSE and MAE are diagnostic-only
suggest_refit_config does not use rmse or mae for automated rules. RMSE and MAE are in column-specific units with no universal threshold; r2_train already provides the dimensionless automated signal. (ADR-0015)
Serialization
ColumnImputationRecord.to_dict() serializes diagnostic as a nested dict. FittedImputer.from_dict() restores it into an ImputationFitDiagnostic instance. Round-trip fidelity is required for every ImputationFitDiagnostic field, including None values.
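The round-trip requirement can be illustrated with a trimmed stand-in for the dataclass. The helper names diagnostic_to_dict and diagnostic_from_dict are hypothetical, and the JSON step is included only to show that None survives as null rather than turning into 0 or a missing key.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ImputationFitDiagnostic:
    # Trimmed to three fields for brevity; the real class has more.
    r2_train: Optional[float] = None
    rmse: Optional[float] = None
    mae: Optional[float] = None


def diagnostic_to_dict(diag):
    """Nested-dict form; a missing diagnostic (scalar strategies) stays None."""
    return None if diag is None else asdict(diag)


def diagnostic_from_dict(payload):
    return None if payload is None else ImputationFitDiagnostic(**payload)


original = ImputationFitDiagnostic(r2_train=0.6, rmse=12.5, mae=None)
payload = json.loads(json.dumps(diagnostic_to_dict(original)))
restored = diagnostic_from_dict(payload)
```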
No new config parameters
The holdback fraction (20%) is hardcoded, consistent with ADR-0013. No refit_holdback_fraction or equivalent is added to NumericImputationConfig.
Testing Decisions
What makes a good test here: tests assert on the external shape and semantics of ImputationFitDiagnostic fields — not on the exact float values produced by sklearn models. A test that asserts rmse >= 0 and rmse < observed_std * 2 is correct; a test that asserts rmse == 1.2345 is brittle. Tests use minimal synthetic DataFrames constructed inline, following the pattern established in test_model_strategies.py and test_numeric_imputer.py.
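One way to package the semantic checks described above is a small helper that tests can share. This is only a sketch: check_error_metrics is a hypothetical name, and SimpleNamespace stands in for a real ImputationFitDiagnostic.

```python
import math
from types import SimpleNamespace


def check_error_metrics(diag):
    """Semantic checks on a diagnostic-like object; no exact float comparisons."""
    for name in ("rmse", "mae"):
        value = getattr(diag, name)
        assert isinstance(value, float), name
        assert math.isfinite(value) and value >= 0.0, name
    # RMSE can never be below MAE (small tolerance for rounding).
    assert diag.rmse >= diag.mae - 1e-12
    # Loose sanity bound taken from the text above.
    assert diag.rmse < diag.observed_std * 2


# Passes silently for a plausible diagnostic.
check_error_metrics(SimpleNamespace(rmse=3.2, mae=2.5, observed_std=10.0))
```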
Module 1: _evaluate_holdback() (unit tests)
- Returns (r2, rmse, mae) as a tuple of floats for a trivially predictable case (constant target)
- rmse and mae are non-negative for any input
- rmse >= mae always holds (Cauchy-Schwarz)
- Returns (None, None, None) when holdback row count is below refit_r2_min_complete_rows
- rmse and mae are zero when the model predicts perfectly
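For reference, the rmse >= mae property follows from applying the Cauchy-Schwarz inequality to the absolute residuals and the all-ones vector, where e_i denotes the residual on held-back row i and n the number of held-back rows (notation introduced here for illustration):

```latex
\[
\mathrm{MAE}
  = \frac{1}{n}\sum_{i=1}^{n} |e_i| \cdot 1
  \le \frac{1}{n}\Bigl(\sum_{i=1}^{n} e_i^{2}\Bigr)^{1/2}\Bigl(\sum_{i=1}^{n} 1^{2}\Bigr)^{1/2}
  = \frac{\sqrt{n}}{n}\Bigl(\sum_{i=1}^{n} e_i^{2}\Bigr)^{1/2}
  = \Bigl(\frac{1}{n}\sum_{i=1}^{n} e_i^{2}\Bigr)^{1/2}
  = \mathrm{RMSE}.
\]
```

Equality holds only when every |e_i| is the same, so a test can assert rmse >= mae (with a small floating-point tolerance) for any input.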
Module 2: ColumnImputationRecord serialization (unit tests)
- to_dict() / from_dict() round-trip preserves all ImputationFitDiagnostic fields
- Round-trip preserves None values correctly (not converted to 0 or missing keys)
- diagnostic is absent from the dict for scalar strategies (or serialized as null)
Module 3: NumericImputer.fit() integration (integration tests, following test_model_strategies.py)
- diagnostic is not None for KNN, Regression, and MICE columns
- diagnostic is None for Mean, Median, Mode, Constant, Dropped, and Passthrough columns
- diagnostic.rmse and diagnostic.mae are non-negative floats for model-based columns with sufficient complete rows
- diagnostic.rmse and diagnostic.mae are None when complete rows < refit_r2_min_complete_rows
- For a MICE block: all MICE columns share the same holdback (i.e. r2_train is None, or not None, for all columns simultaneously)
- rmse and mae are not used by suggest_refit_config (the function's output config is unchanged when only rmse/mae are poor)
Out of Scope
- Distribution comparison (observed_mean, observed_std, imputed_mean, imputed_std, variance_ratio): Scope 3 / Issue #92 ("Scope 3: Multi-round Iteration and Feedback Loop - ImputationFitDiagnostic and suggest_refit_config")
- Convergence signals (converged, n_iter): Scope 3 / Issue #92
- r2_train: Scope 3 / Issue #92
- suggest_refit_config rules based on RMSE/MAE: RMSE/MAE are diagnostic-only; R² handles automated decisions
Further Notes
- ImputationFitDiagnostic and ColumnImputationRecord.diagnostic must exist before rmse and mae can be added to them.
- [...] _evaluate_holdback() will call into.
- CONTEXT.md has been updated to reflect rmse and mae on ImputationFitDiagnostic.