Scope 3: Multi-round Iteration and Feedback Loop — ImputationFitDiagnostic and suggest_refit_config #92

@DEVunderdog

Description

Problem Statement

After ImputationOrchestrator.fit() returns, a data scientist has no way to know whether the imputation was effective. A MICE block that reached max_iter without converging, a regression model that learned nothing above a scalar fill, or a KNN pass that collapsed all imputed values to near-mean — all are indistinguishable from a good fit from the caller's perspective. The ColumnImputationRecord records why a strategy was chosen, but nothing about whether the resulting fit was any good.

When quality is poor, the user has no re-fit path either. Parameters like max_iter, tol, and n_neighbors are computed dynamically inside NumericImputer.fit() after Scopes 0/1/2, but there is no mechanism to override them for a specific column, and no library helper to translate "column X converged poorly" into a concrete config change.

Solution

Extend Phase 2 with a structured fit quality diagnostic and a standalone re-fit suggestion helper:

  1. Compute ImputationFitDiagnostic per column during fit() — eight fields covering predictive quality (R² on held-out complete rows), convergence status, iteration count, and distribution comparison between imputed and observed values. Attached to ColumnImputationRecord.diagnostic; None for scalar strategies and Passthrough/Dropped/Constant.

  2. Add per-column parameter overrides to NumericImputationConfig: per_column_max_iter, per_column_n_neighbors, and per_column_strategy let the user override dynamically-computed values for specific columns without touching the rest of the fit.

  3. Provide suggest_refit_config(records, config) — a standalone Public API function that reads the diagnostics, identifies columns with poor fit, and returns a new NumericImputationConfig with per-column overrides pre-populated. The user reviews, adjusts if needed, and calls fit() again unchanged.

User Stories

  1. As a data scientist, I want a structured quality diagnostic attached to each column's imputation record after fit(), so that I can see whether the imputation was effective without inspecting human-readable signal strings.
  2. As a data scientist, I want r2_train computed from held-out complete rows (not the same rows used for training), so that the quality score is an honest estimate of model quality rather than an optimistic in-sample figure.
  3. As a data scientist, I want r2_train to be None when fewer than 25 complete rows are available for a held-out evaluation, so that I am not presented with an unreliable R² computed from too few points.
  4. As a data scientist, I want converged as a boolean field on the diagnostic (not a parsed string), so that I can programmatically identify non-converging columns without scanning signal text.
  5. As a data scientist, I want n_iter — the actual iteration count — alongside the boolean converged, so that I can see how close to convergence a model came (e.g., 9/10 iterations vs 1/10) and make a more informed re-fit decision.
  6. As a data scientist, I want imputed_mean and imputed_std in the diagnostic, so that I can compare the distribution of imputed values against the distribution of observed values.
  7. As a data scientist, I want observed_mean and observed_std in the diagnostic, so that I have the reference distribution from training data alongside the imputed distribution without having to recompute it myself.
  8. As a data scientist, I want variance_ratio (imputed_std / observed_std) in the diagnostic, so that I have a single number that directly flags distribution collapse — when all imputed values are near-constant the ratio is near zero and easy to threshold on.
  9. As a data scientist, I want diagnostic to be None for scalar strategies (Mean, Median, Mode) and for Passthrough, Dropped, and Constant columns, so that I am not presented with metrics that have no meaningful interpretation for non-model-based strategies.
  10. As a data scientist, I want per-MICE-column r2_train scores — one per column in the block, not one score for the whole block — so that I can identify exactly which columns within a MICE block had poor quality without re-running the entire block.
  11. As a data scientist, I want the final fitted model (stored in FittedImputer) to always be trained on all available training data, so that the model used at inference is the best possible fit regardless of the held-out evaluation step.
  12. As a data scientist, I want suggest_refit_config(records, config) importable directly from dataforge_ml, so that I can call it with the output of FittedImputer.records and my current config without digging through submodule paths.
  13. As a data scientist, I want suggest_refit_config to set per_column_max_iter for columns where converged=False, multiplied by refit_max_iter_multiplier, so that the fix for a non-converging column is automatic and I do not have to guess a new iteration count myself.
  14. As a data scientist, I want suggest_refit_config to set per_column_strategy[col] = Median for columns where r2_train < refit_r2_threshold, so that a column whose model learned nothing is routed to a reliable scalar fallback on the next run rather than repeating the same bad fit.
  15. As a data scientist, I want suggest_refit_config to flag columns where variance_ratio < refit_variance_ratio_threshold without automatically changing their strategy, so that I am informed of distribution collapse and can review those columns myself before deciding how to proceed.
  16. As a data scientist, I want suggest_refit_config to skip the R² rule for a column when r2_train is None, so that columns with too few complete rows are not incorrectly routed to Median on the basis of a missing metric.
  17. As a data scientist, I want to set per_column_max_iter manually in NumericImputationConfig for any column, so that I can override the dynamically-computed max_iter even when suggest_refit_config did not flag the column.
  18. As a data scientist, I want to set per_column_n_neighbors manually in NumericImputationConfig for any column, so that I can override the dynamically-computed n_neighbors for KNN columns where I have domain knowledge about the appropriate neighbourhood size.
  19. As a data scientist, I want to set per_column_strategy manually in NumericImputationConfig for any column, so that I can force a specific imputation strategy regardless of what the routing decision tree would select — for example, forcing Median on a column that the library routed to MICE but I know has no predictive structure.
  20. As a data scientist, I want per_column_strategy to bypass the routing decision tree entirely for a named column, so that the override is respected unconditionally regardless of the column's missingness flags or severity.
  21. As a data scientist, I want the re-fit to use the same orchestrator.fit(train_df, profile) entry point unchanged, so that the multi-round workflow requires no new methods to learn — fit, inspect diagnostics, update config, fit again.
  22. As a data scientist, I want ImputationFitDiagnostic importable directly from dataforge_ml, so that I can type-hint against it in my own helper functions without using a fragile internal submodule path.
  23. As a data scientist, I want refit_r2_threshold, refit_variance_ratio_threshold, refit_max_iter_multiplier, and refit_r2_min_complete_rows to be configurable fields in NumericImputationConfig, so that I can tune the sensitivity of the re-fit suggestion to the needs of my specific dataset.
  24. As a data scientist, I want ImputationFitDiagnostic to be preserved through FittedImputer.to_dict() / from_dict() serialisation, so that I can persist a fitted imputer and later inspect its diagnostics without having to re-fit.
  25. As a library contributor, I want the R² held-back evaluation and the final model re-fit to be clearly separated inside NumericImputer.fit(), so that the two steps can be tested and reasoned about independently.

Implementation Decisions

New: ImputationFitDiagnostic dataclass

A new dataclass in the imputation config module. Eight fields:

  • r2_train: Optional[float] — R² on held-out complete rows. None when fewer than refit_r2_min_complete_rows complete rows are available.
  • converged: Optional[bool] — True when IterativeImputer converged before reaching max_iter; False when it hit max_iter without converging. None for KNN and scalar strategies.
  • n_iter: Optional[int] — actual iteration count of IterativeImputer. None for KNN and scalar strategies.
  • imputed_mean: float — mean of the values imputed for null rows during fit.
  • imputed_std: float — standard deviation of imputed values.
  • observed_mean: float — mean of non-null training values.
  • observed_std: float — standard deviation of non-null training values.
  • variance_ratio: float — imputed_std / observed_std. Near-zero indicates distribution collapse.

Part of the Public API. Added to to_dict() / from_dict() serialisation on ColumnImputationRecord.
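
The eight fields above could be sketched as a plain dataclass. This is a minimal stand-in, not the library's implementation: the build_diagnostic helper is hypothetical, and the choice of population standard deviation and of NaN for a zero observed_std are assumptions the issue does not settle.

```python
from __future__ import annotations

import math
import statistics
from dataclasses import asdict, dataclass
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class ImputationFitDiagnostic:
    """Stand-in carrying the eight fields listed in this issue."""

    r2_train: Optional[float]
    converged: Optional[bool]
    n_iter: Optional[int]
    imputed_mean: float
    imputed_std: float
    observed_mean: float
    observed_std: float
    variance_ratio: float

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ImputationFitDiagnostic":
        return cls(**data)


def build_diagnostic(
    observed: Sequence[float],
    imputed: Sequence[float],
    r2_train: Optional[float] = None,
    converged: Optional[bool] = None,
    n_iter: Optional[int] = None,
) -> ImputationFitDiagnostic:
    """Hypothetical helper: derive the four distribution fields and the ratio."""
    # Assumption: population standard deviation (the issue does not say which).
    observed_std = statistics.pstdev(observed)
    imputed_std = statistics.pstdev(imputed)
    # Assumption: a constant observed column yields NaN rather than an error.
    ratio = imputed_std / observed_std if observed_std else math.nan
    return ImputationFitDiagnostic(
        r2_train=r2_train,
        converged=converged,
        n_iter=n_iter,
        imputed_mean=statistics.fmean(imputed),
        imputed_std=imputed_std,
        observed_mean=statistics.fmean(observed),
        observed_std=observed_std,
        variance_ratio=ratio,
    )
```

When every imputed value is identical, imputed_std is zero and variance_ratio is exactly 0.0, which is the distribution-collapse case the field is meant to expose.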

Modified: ColumnImputationRecord

Add one new field:

  • diagnostic: Optional[ImputationFitDiagnostic] = None

None for Passthrough, Dropped, Constant, and all scalar strategies. Present for KNN, Regression, and MICE.

Modified: NumericImputationConfig

Seven new fields alongside the existing four:

Per-column overrides (populated by suggest_refit_config or set manually):

  • per_column_max_iter: dict[str, int] — overrides dynamically-computed max_iter for named Regression/MICE columns
  • per_column_n_neighbors: dict[str, int] — overrides dynamically-computed n_neighbors for named KNN columns
  • per_column_strategy: dict[str, ImputationStrategy] — bypasses routing decision tree entirely for named columns

Thresholds for suggest_refit_config:

  • refit_r2_threshold: float = 0.1 — R² below this value triggers a Median routing override
  • refit_variance_ratio_threshold: float = 0.3 — variance ratio below this value triggers a flag (no automatic override)
  • refit_max_iter_multiplier: float = 2.0 — multiplier applied to computed max_iter when converged=False
  • refit_r2_min_complete_rows: int = 25 — minimum complete rows required to attempt R² computation

All seven new fields serialised in to_dict() / from_dict().
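
A sketch of the seven new fields, with the existing four omitted. The threshold defaults are the ones listed above. The empty-dict defaults for the per-column overrides, the members of the ImputationStrategy stand-in, and serialising strategies by enum value are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class ImputationStrategy(Enum):
    # Stand-in enum; the real one is assumed to have more members.
    Median = "median"
    KNN = "knn"
    Regression = "regression"
    MICE = "mice"


@dataclass
class NumericImputationConfig:
    # The four pre-existing fields are omitted from this sketch.
    per_column_max_iter: dict[str, int] = field(default_factory=dict)
    per_column_n_neighbors: dict[str, int] = field(default_factory=dict)
    per_column_strategy: dict[str, ImputationStrategy] = field(default_factory=dict)
    refit_r2_threshold: float = 0.1
    refit_variance_ratio_threshold: float = 0.3
    refit_max_iter_multiplier: float = 2.0
    refit_r2_min_complete_rows: int = 25

    def to_dict(self) -> dict[str, Any]:
        return {
            "per_column_max_iter": dict(self.per_column_max_iter),
            "per_column_n_neighbors": dict(self.per_column_n_neighbors),
            # Assumption: strategies are serialised by their enum value.
            "per_column_strategy": {
                col: strategy.value for col, strategy in self.per_column_strategy.items()
            },
            "refit_r2_threshold": self.refit_r2_threshold,
            "refit_variance_ratio_threshold": self.refit_variance_ratio_threshold,
            "refit_max_iter_multiplier": self.refit_max_iter_multiplier,
            "refit_r2_min_complete_rows": self.refit_r2_min_complete_rows,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "NumericImputationConfig":
        defaults = cls()
        return cls(
            per_column_max_iter=dict(data.get("per_column_max_iter", {})),
            per_column_n_neighbors=dict(data.get("per_column_n_neighbors", {})),
            per_column_strategy={
                col: ImputationStrategy(value)
                for col, value in data.get("per_column_strategy", {}).items()
            },
            refit_r2_threshold=data.get("refit_r2_threshold", defaults.refit_r2_threshold),
            refit_variance_ratio_threshold=data.get(
                "refit_variance_ratio_threshold", defaults.refit_variance_ratio_threshold
            ),
            refit_max_iter_multiplier=data.get(
                "refit_max_iter_multiplier", defaults.refit_max_iter_multiplier
            ),
            refit_r2_min_complete_rows=data.get(
                "refit_r2_min_complete_rows", defaults.refit_r2_min_complete_rows
            ),
        )
```

Falling back to the dataclass defaults for missing keys is what lets from_dict({}) reproduce a default config, which one of the test cases below requires.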

R² computation approach inside NumericImputer.fit()

For each model-based strategy, before fitting the final model:

  1. Identify complete rows (no NaN across the relevant columns).
  2. If complete row count ≥ refit_r2_min_complete_rows: hold back 20% as a validation set (fixed random_state=0).
  3. Fit a temporary model on the 80% training portion.
  4. Mask the target column in the held-out rows, run transform(), compute R².
  5. Re-fit the final model on all complete rows — this is the model stored in FittedImputer.

For MICE, step 4 is repeated once per MICE column (masking each column individually in the held-out rows). The final stored MICE model is fit on all data once.

r2_train = None when complete row count < refit_r2_min_complete_rows.
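
The five steps can be illustrated with a self-contained sketch. A one-predictor least-squares line stands in for the real imputation model, and the names fit_line, r2_score, and fit_with_held_out_r2 are hypothetical. The shuffle-based split only mirrors the 80/20 hold-out with a fixed seed; it is not the library's splitting code.

```python
import random
from typing import List, Optional, Sequence, Tuple

Line = Tuple[float, float]  # (slope, intercept)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Line:
    """Ordinary least squares with one predictor (stand-in for the real model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Assumption: report 0.0 when the held-out target is constant.
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def fit_with_held_out_r2(
    predictor: Sequence[Optional[float]],
    target: Sequence[Optional[float]],
    min_complete_rows: int = 25,
    holdout_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[Optional[float], Line]:
    # Step 1: keep only complete rows.
    rows: List[Tuple[float, float]] = [
        (x, y) for x, y in zip(predictor, target) if x is not None and y is not None
    ]
    if not rows:
        raise ValueError("no complete rows to fit on")

    r2: Optional[float] = None
    if len(rows) >= min_complete_rows:
        # Step 2: hold back a fixed-seed 20% validation set.
        shuffled = rows[:]
        random.Random(seed).shuffle(shuffled)
        n_holdout = max(1, round(len(shuffled) * holdout_fraction))
        holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
        # Step 3: fit a temporary model on the remaining 80%.
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        # Step 4: predict the masked target on held-out rows and score it.
        predictions = [slope * x + intercept for x, _ in holdout]
        r2 = r2_score([y for _, y in holdout], predictions)

    # Step 5: the stored model is always re-fit on all complete rows.
    final_model = fit_line([x for x, _ in rows], [y for _, y in rows])
    return r2, final_model
```

The final model is returned even when r2 is None, so a column with too few complete rows still gets a fitted model, just without a quality score.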

Modified: NumericImputer.fit() — per-column override consumption

Before strategy routing for each column:

  • If config.per_column_strategy.get(col) is set, use that strategy directly — skip the routing decision tree.

Before fitting model-based strategies:

  • If config.per_column_max_iter.get(col) is set, use that value instead of the dynamically-computed max_iter.
  • If config.per_column_n_neighbors.get(col) is set, use that value instead of the dynamically-computed n_neighbors.
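
One way to express the lookup order described above. OverrideConfig is a hypothetical stand-in holding only the three override dicts, and route stands in for the routing decision tree; neither name comes from the library.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict


class ImputationStrategy(Enum):
    # Stand-in enum with just enough members for the example.
    Median = "median"
    KNN = "knn"
    MICE = "mice"


@dataclass
class OverrideConfig:
    per_column_max_iter: Dict[str, int] = field(default_factory=dict)
    per_column_n_neighbors: Dict[str, int] = field(default_factory=dict)
    per_column_strategy: Dict[str, ImputationStrategy] = field(default_factory=dict)


def resolve_strategy(
    col: str,
    config: OverrideConfig,
    route: Callable[[str], ImputationStrategy],
) -> ImputationStrategy:
    # An explicit override wins unconditionally; routing is never consulted.
    override = config.per_column_strategy.get(col)
    return override if override is not None else route(col)


def resolve_max_iter(col: str, config: OverrideConfig, computed: int) -> int:
    # Fall back to the dynamically-computed value when no override is set.
    return config.per_column_max_iter.get(col, computed)


def resolve_n_neighbors(col: str, config: OverrideConfig, computed: int) -> int:
    return config.per_column_n_neighbors.get(col, computed)
```

Checking `is not None` rather than truthiness keeps the strategy lookup correct even if an enum member were ever defined with a falsy value.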

New: suggest_refit_config standalone function

Importable from dataforge_ml. Signature:

suggest_refit_config(records: dict[str, ColumnImputationRecord], config: NumericImputationConfig) -> NumericImputationConfig

Three rules applied per column with a non-None diagnostic:

  1. converged=False → per_column_max_iter[col] = current_computed_max_iter × config.refit_max_iter_multiplier
  2. r2_train < config.refit_r2_threshold (and r2_train is not None) → per_column_strategy[col] = ImputationStrategy.Median
  3. variance_ratio < config.refit_variance_ratio_threshold → append a flag to signals only; no automatic strategy change

Returns a new NumericImputationConfig with these overrides set. Does not mutate the input config.
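
The three rules might look like the sketch below, written against stand-in dataclasses rather than the real types. Several parts are assumptions: the current max_iter is recovered from diagnostic.n_iter (which equals the cap when converged is False, per the test cases in this issue), the multiplied value is rounded up to an integer, and the variance flag is exposed through a separate hypothetical helper, low_variance_columns, because the issue does not say where the flag is stored.

```python
import math
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Diagnostic:  # stand-in for ImputationFitDiagnostic (subset of fields)
    r2_train: Optional[float]
    converged: Optional[bool]
    n_iter: Optional[int]
    variance_ratio: float


@dataclass(frozen=True)
class Record:  # stand-in for ColumnImputationRecord
    diagnostic: Optional[Diagnostic] = None


@dataclass(frozen=True)
class Config:  # stand-in for NumericImputationConfig
    per_column_max_iter: Dict[str, int] = field(default_factory=dict)
    per_column_strategy: Dict[str, str] = field(default_factory=dict)
    refit_r2_threshold: float = 0.1
    refit_variance_ratio_threshold: float = 0.3
    refit_max_iter_multiplier: float = 2.0


def suggest_refit_config(records: Dict[str, Record], config: Config) -> Config:
    # Copy the override dicts so the input config is never mutated.
    max_iter = dict(config.per_column_max_iter)
    strategy = dict(config.per_column_strategy)
    for col, record in records.items():
        diag = record.diagnostic
        if diag is None:
            continue  # scalar / Passthrough / Dropped / Constant columns
        # Rule 1: raise the iteration cap for non-converged columns.
        if diag.converged is False and diag.n_iter is not None:
            # Assumption: n_iter equals the cap that was hit; round up to an int.
            max_iter[col] = math.ceil(diag.n_iter * config.refit_max_iter_multiplier)
        # Rule 2: route low-R² columns to Median; skipped when r2_train is None.
        if diag.r2_train is not None and diag.r2_train < config.refit_r2_threshold:
            strategy[col] = "Median"
    return replace(config, per_column_max_iter=max_iter, per_column_strategy=strategy)


def low_variance_columns(records: Dict[str, Record], config: Config) -> List[str]:
    # Rule 3 (hypothetical helper): report collapse, change nothing.
    return [
        col
        for col, record in records.items()
        if record.diagnostic is not None
        and record.diagnostic.variance_ratio < config.refit_variance_ratio_threshold
    ]
```

Because the override dicts are copied before any rule runs, overrides from an earlier round survive unless a rule targets the same column again.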

Public API exports

Both ImputationFitDiagnostic and suggest_refit_config added to the Public API, importable directly from dataforge_ml.

Dependency

Hard dependency on Scopes 0, 1, and 2 (issues #89, #90, #91). NonlinearityTag, RegressionEstimatorFactory, dynamic max_iter/tol/n_neighbors computation, and convergence monitoring must all be in place before this scope is implemented.

Testing Decisions

What makes a good test here: test external behaviour through public outputs — ColumnImputationRecord.diagnostic fields, suggest_refit_config return values, and per-column override effects on strategy routing. Do not test the internal split fractions or which temporary model was constructed during R² evaluation. Mirror the assertion style of test_model_strategies.py (verify .strategy, .signals, and model presence in the bundle) and test_imputation_config.py (verify defaults, to_dict() keys, and from_dict() round-trips).

Modules with tests:

  • NumericImputer (diagnostic computation) — test_model_strategies.py, new cases:

    • A Regression column on a dataset with strong linear signal produces r2_train > 0 in its diagnostic.
    • A Regression column where the model predicts near-mean (no signal) produces r2_train near zero.
    • A MICE block produces a separate r2_train per column in the block (not one shared score).
    • A KNN column produces r2_train in its diagnostic, and converged and n_iter are None.
    • A column with fewer than 25 complete rows produces r2_train = None in its diagnostic.
    • A column routed to Mean/Median/Mode has diagnostic = None.
    • A column routed to Passthrough/Dropped/Constant has diagnostic = None.
    • variance_ratio is present and positive for any model-based column with imputed values.
    • converged = False and n_iter == max_iter are set together when the IterativeImputer hits its limit.
  • NumericImputer (per-column overrides) — test_model_strategies.py, new cases:

    • A column with per_column_strategy = Median is routed to Median regardless of its missingness flags.
    • A column with per_column_strategy = Median that would otherwise be routed to MICE does not appear in the MICE block.
    • per_column_max_iter is consumed: the recorded n_iter cap in signals reflects the override value, not the dynamically-computed value.
  • suggest_refit_config — new test file, cases:

    • A column with converged=False in its diagnostic produces a per_column_max_iter entry in the returned config.
    • The multiplied max_iter value equals computed_max_iter × refit_max_iter_multiplier.
    • A column with r2_train < refit_r2_threshold produces a per_column_strategy[col] = Median entry.
    • A column with r2_train = None does NOT produce a per_column_strategy Median entry.
    • A column with variance_ratio < refit_variance_ratio_threshold does NOT produce a strategy override (flag only).
    • A column with diagnostic = None is silently skipped.
    • The returned config is a new object — the input config is not mutated.
    • Calling suggest_refit_config on a config already containing per-column overrides from a prior call preserves existing overrides not targeted by new rules.
  • NumericImputationConfig — test_imputation_config.py, new cases:

    • All seven new fields have correct defaults.
    • to_dict() key set includes all seven new field names.
    • from_dict() round-trip preserves non-default values for all seven fields.
    • from_dict({}) produces correct defaults for all seven new fields.
  • FittedImputer (diagnostic serialisation) — test_fitted_imputer.py, new cases:

    • to_dict() / from_dict() round-trip preserves ColumnImputationRecord.diagnostic fields including r2_train, variance_ratio, converged, and n_iter.
    • A FittedImputer round-tripped through serialisation produces identical transform() output.
  • Integration test — test_imputation_end_to_end.py, one new case:

    • Dataset with strong linear structure: Regression column diagnostic shows r2_train > 0.5 and converged = True.
    • Call suggest_refit_config, apply returned config, call fit() again — second fit completes without error and produces no nulls.

Out of Scope

Further Notes

  • The held-back R² evaluation runs each model twice: once on 80% of complete rows (diagnostic), once on 100% (final stored model). This is deliberate — see ADR 0013. The stored FittedImputer always learns from all available training data.
  • For MICE, the O(n_mice_cols) transform calls for per-column R² are the accepted cost of column-level diagnostic precision, consistent with the accuracy-over-speed principle.
  • All design decisions from this session are recorded in CONTEXT.md under ImputationFitDiagnostic, suggest_refit_config, and Per-Column Imputation Override, and in ADR 0013 (held-back evaluation rationale).
  • suggest_refit_config is a pure function with no side effects — same inputs always produce the same output. It is straightforward to unit-test exhaustively.
