Scope 2: MICE Overhaul — estimator selection, dynamic convergence, and data-adaptive hyperparameters #91

@DEVunderdog

Problem Statement

When a data scientist uses DataForgeML to impute columns routed to `ImputationStrategy.MICE`, the library produces imputed values that are systematically less accurate than the data allows, in five distinct ways:

  1. Internal estimator is always BayesianRidge regardless of data linearity — `IterativeImputer` is estimator-agnostic, but the library hardcodes `BayesianRidge` for every MICE block. When the relationships between MICE columns are non-linear, interaction-driven, or mixed-scale, a linear estimator produces biased per-round predictions that compound across iterations. The library has no mechanism to detect this and no signal to the user that a better estimator exists.

  2. `max_iter=10` with no convergence monitoring — ten iterations is not guaranteed to be sufficient. The appropriate number of iterations depends on how strongly the MICE columns are correlated with each other — high inter-column correlation requires more rounds before imputed values stabilise. There is no post-fit check for whether `n_iter_` reached `max_iter` (indicating non-convergence), and `tol` is left at the sklearn default without any data-driven justification.

  3. `n_nearest_features` is not set — `IterativeImputer` defaults to using all columns as predictors for each imputation target. For large MICE blocks this adds irrelevant predictors to every round's regression, introducing noise and unnecessary compute. The profiler already computes value-level Pearson correlations for every numeric column pair — a directly applicable signal for selecting which predictors are informative per target — but it is not used.

  4. `initial_strategy` is always `"mean"` regardless of skew. Before any regression round runs, `IterativeImputer` fills every missing value using `initial_strategy`. For skewed columns, the mean is not representative of central tendency. The profiler computes `SkewSeverity` per column. Columns with moderate or higher skew should receive a median initial fill, but none do.

  5. No visibility into fit quality — the `ColumnImputationRecord` records why MICE was chosen, but nothing about whether the fit converged or which estimator was used. A data scientist has no signal for columns where imputation quality may be poor.

Solution

Overhaul `ImputationStrategy.MICE` to be a principled, data-driven, self-monitoring strategy:

  1. Auto-select the internal estimator using Phase 1's `NonlinearityTag`. Take the most complex tag across all MICE columns and pass it to `RegressionEstimatorFactory` (the same factory introduced in Scope 0 for single-column Regression). This ensures the estimator is appropriate for the hardest structural relationship in the block.

  2. Compute `max_iter` and `tol` dynamically using the same seven-signal framework introduced in Scope 0, with conservative aggregation rules across MICE columns. No fixed defaults.

  3. Set `initial_strategy` based on skew — use `"median"` if any MICE column has `SkewSeverity >= Moderate`; otherwise `"mean"`.

  4. Set `n_nearest_features` for large MICE blocks — derived from value-level Pearson correlations in the `CorrelationProfiler` output, applied only when the block exceeds a configurable column count threshold.

  5. Monitor convergence post-fit and surface it, along with the chosen estimator, in every MICE column's `ColumnImputationRecord.signals`.

User Stories

  1. As a data scientist, I want the MICE internal estimator to be automatically selected based on the nonlinearity structure of the MICE columns, so that I get accurate imputed values without manually specifying a model.
  2. As a data scientist, I want the library to use a tree-based estimator inside MICE when any column in the block has a complex non-linear relationship with its predictors, so that per-round predictions are not biased by a linear model applied to non-linear structure.
  3. As a data scientist, I want the same `RegressionEstimatorFactory` used by the Regression strategy to also drive MICE estimator selection, so that the library's estimator-selection logic is consistent across both iterative imputation strategies.
  4. As a data scientist, I want `max_iter` to be computed from the inter-column correlation structure of the MICE block, so that strongly correlated blocks receive enough rounds to stabilise.
  5. As a data scientist, I want `max_iter` to be informed by the fraction of missing cells in the MICE matrix, so that blocks with high missingness are not under-iterated.
  6. As a data scientist, I want `max_iter` to increase when the complete row fraction is low, so that sparse MICE blocks are given enough rounds to converge.
  7. As a data scientist, I want `tol` to be set relative to the minimum IQR across MICE columns rather than an absolute default, so that convergence detection is consistent across columns with very different numeric ranges.
  8. As a data scientist, I want convergence monitoring to apply to MICE just as it does to the Regression strategy, so that non-convergence is never silently swallowed.
  9. As a data scientist, I want a convergence warning to appear in every MICE column's audit record when `n_iter_` reaches `max_iter`, so that I can identify the full set of columns affected by non-convergence without having to inspect each one individually.
  10. As a data scientist, I want the chosen estimator to be recorded in every MICE column's `ColumnImputationRecord.signals`, so that I can see which model was used and why without reading source code.
  11. As a data scientist, I want `initial_strategy` set to `"median"` automatically when any MICE column is skewed, so that the first iteration is not anchored to a biased initial fill.
  12. As a data scientist, I want `n_nearest_features` to limit which predictors are used per MICE round when the block is large, so that irrelevant features do not add noise to each round's regression.
  13. As a data scientist, I want `n_nearest_features` to be derived from value-level feature correlations rather than missingness correlations, so that the selected predictors are informative about column values, not just about which columns tend to be missing together.
  14. As a data scientist, I want `n_nearest_features` to be left unset (all predictors used) for small MICE blocks, so that small blocks are not artificially constrained when all columns are plausibly informative.
  15. As a data scientist, I want the minimum MICE block size for `n_nearest_features` to activate to be configurable, so that I can tune the threshold for domain-specific datasets.
  16. As a data scientist, I want the Pearson correlation threshold used to count informative predictors to be configurable, so that I can adjust sensitivity to weak correlations.
  17. As a data scientist, I want `GradientBoostingRegressor` selected automatically for MICE blocks dominated by complex non-linear structure on large datasets, so that I get the most accurate imputation without manually tuning estimator selection.
  18. As a data scientist, I want `RandomForestRegressor` selected for MICE blocks dominated by complex non-linear structure on smaller datasets, so that imputation is accurate without the compute cost of gradient boosting on small data.
  19. As a data scientist, I want MICE columns that are all `Unpredictable` to each fall back to Median individually, so that I receive a clear signal that regression was attempted and found unsuitable rather than a silently degraded MICE block.
  20. As a data scientist, I want the `NonlinearityTag` driving the MICE estimator choice to be recorded in `signals` alongside the estimator name, so that I understand which column's structure triggered the selection.
  21. As a library contributor, I want the MICE hyperparameter computation to reuse the same signal framework as the Regression strategy, so that the two iterative strategies are consistent and the framework is tested in one place.

Implementation Decisions

Dependency: Scope 0 must ship first

Scope 2 has a hard dependency on Scope 0: it requires `NonlinearityProfiler` (Phase 1), `RegressionEstimatorFactory`, and a `NonlinearityTag` stored per column in `NumericStats`. Scope 2 will not be implemented until Scope 0 is complete, and there is no graceful degradation path.

Modified: MICE fitting block in `NumericImputer`

Replace `IterativeImputer(random_state=0, max_iter=10)` with a fully parameterised construction. At fit time:

Estimator selection:

  • Collect `NonlinearityTag` for each MICE column from `NumericStats`.
  • Take the most complex tag across the block (precedence: `ComplexNonlinear` > `MonotonicNonlinear` > `Linear` > `Unpredictable`).
  • Pass the winning tag and `n_rows` to `RegressionEstimatorFactory`.
  • If all MICE columns are `Unpredictable`: skip the MICE block entirely; fall back each column to Median individually via `_fallback_to_median`.
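
The most-complex-tag-wins rule above can be sketched as follows. This is a minimal illustration, not the library's code: the `NonlinearityTag` enum below is a hypothetical stand-in whose integer values simply encode the stated precedence, and `driving_tag` is an invented helper name.

```python
from enum import IntEnum
from typing import Iterable, Optional


class NonlinearityTag(IntEnum):
    """Hypothetical stand-in; integer values encode the precedence above."""
    UNPREDICTABLE = 0
    LINEAR = 1
    MONOTONIC_NONLINEAR = 2
    COMPLEX_NONLINEAR = 3


def driving_tag(tags: Iterable[NonlinearityTag]) -> Optional[NonlinearityTag]:
    """Return the most complex tag in the block.

    Returns None when every column is Unpredictable, which is the case
    where the caller would skip MICE and fall back to Median per column.
    """
    tags = list(tags)
    if not tags:
        raise ValueError("MICE block has no columns")
    winner = max(tags)
    return None if winner is NonlinearityTag.UNPREDICTABLE else winner
```

A single `Unpredictable` column does not trigger the fallback under this rule; only a block made up entirely of `Unpredictable` columns does.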

`max_iter` and `tol` (seven-signal framework, MICE aggregation):

  1. `NonlinearityTag` — most complex across MICE columns; `ComplexNonlinear` increases base `max_iter`
  2. Feature matrix missingness fraction — fraction of NaN cells across the full MICE matrix
  3. R² strength — minimum R²_linear across MICE columns (worst-case convergence speed); high minimum R² → lower `max_iter`
  4. Inter-feature correlation — maximum pairwise Pearson `|r|` among MICE columns (from `CorrelationProfiler`); high correlation → higher `max_iter`
  5. Complete row fraction — fraction of rows with no NaN across all MICE columns; low fraction → higher `max_iter`
  6. Scale-relative `tol` — `tol = min_iqr * scaling_factor` where `min_iqr` is the minimum IQR across MICE columns
  7. `ComplexNonlinear` tol tightening — tighter `tol` when most complex tag is `ComplexNonlinear`
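
As a rough sketch of how the block-level inputs to signals 2, 5, and 6 could be aggregated, the function below works on plain Python lists with `float("nan")` marking missing cells. The function name, the dictionary keys, and the `scaling_factor` argument are assumptions for illustration. How these signals combine into `max_iter` is not specified here, so the sketch deliberately stops at the raw signals and the scale-relative `tol`.

```python
import math
from statistics import quantiles
from typing import Dict, List, Sequence


def mice_block_signals(
    columns: Dict[str, Sequence[float]],
    scaling_factor: float,
) -> Dict[str, float]:
    """Aggregate block-level signals 2, 5 and 6 from raw column values.

    `columns` maps column name -> values, NaN marks a missing cell.
    `scaling_factor` is a hypothetical knob; no default value is implied.
    """
    names = list(columns)
    n_rows = len(columns[names[0]])
    if any(len(columns[name]) != n_rows for name in names):
        raise ValueError("all MICE columns must have the same length")

    # Signal 2: fraction of NaN cells across the whole MICE matrix.
    n_missing = sum(math.isnan(v) for name in names for v in columns[name])
    missing_fraction = n_missing / (n_rows * len(names))

    # Signal 5: fraction of rows with no NaN in any MICE column.
    complete_rows = sum(
        not any(math.isnan(columns[name][i]) for name in names)
        for i in range(n_rows)
    )
    complete_row_fraction = complete_rows / n_rows

    # Signal 6: smallest IQR across columns, computed on observed values only.
    iqrs: List[float] = []
    for name in names:
        observed = [v for v in columns[name] if not math.isnan(v)]
        q1, _, q3 = quantiles(observed, n=4, method="inclusive")
        iqrs.append(q3 - q1)
    min_iqr = min(iqrs)

    return {
        "missing_fraction": missing_fraction,
        "complete_row_fraction": complete_row_fraction,
        "min_iqr": min_iqr,
        "tol": min_iqr * scaling_factor,
    }
```

One point worth checking during benchmarking: scikit-learn documents the `IterativeImputer` stopping condition as `max(abs(X_t - X_{t-1})) / max(abs(X[known_vals])) < tol`, so `tol` is already normalised by the largest observed absolute value. An IQR-derived `tol` would be applied on top of that normalisation.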

`initial_strategy`:

  • If any MICE column has `SkewSeverity >= Moderate`: `"median"`
  • Otherwise: `"mean"`
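
A minimal sketch of this rule, assuming `SkewSeverity` is an ordered enum. Only the `Moderate` level is named in this issue; the other member names below are placeholders.

```python
from enum import IntEnum
from typing import Iterable


class SkewSeverity(IntEnum):
    """Hypothetical ordering; only Moderate is named in the text above."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3


def choose_initial_strategy(severities: Iterable[SkewSeverity]) -> str:
    """Return "median" if any MICE column is at least moderately skewed."""
    if any(s >= SkewSeverity.MODERATE for s in severities):
        return "median"
    return "mean"
```

Because `initial_strategy` is a single setting on `IterativeImputer`, one skewed column switches the initial fill for the whole block.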

`imputation_order`: No change — keep sklearn default `"ascending"` (fewest-missing first). This is already the better-grounded order; the gap description had the direction reversed.

`n_nearest_features`:

  • If MICE block has ≤ `mice_n_nearest_features_min_cols` columns: leave `n_nearest_features=None` (use all).
  • Otherwise: for each MICE column, count other MICE columns with `|Pearson r| > mice_correlation_threshold` (from `CorrelationProfiler`). Take the median count across all MICE columns, capped at `mice_max_nearest_features`. Pass this as `n_nearest_features`.
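
The derivation above could look roughly like this. The lookup format for the correlations (a dict keyed by `frozenset` column pairs) is an assumption, since the shape of the `CorrelationProfiler` output is not shown here. Two further assumptions are flagged in the code: a fractional median is truncated, and the result is clamped to at least 1.

```python
from statistics import median
from typing import Dict, FrozenSet, Optional, Sequence


def choose_n_nearest_features(
    mice_columns: Sequence[str],
    pearson: Dict[FrozenSet[str], float],
    min_cols: int = 10,      # mice_n_nearest_features_min_cols
    max_nearest: int = 20,   # mice_max_nearest_features
    threshold: float = 0.1,  # mice_correlation_threshold
) -> Optional[int]:
    """Return n_nearest_features for the block, or None to use all columns."""
    if len(mice_columns) <= min_cols:
        return None

    counts = []
    for target in mice_columns:
        # Count the other MICE columns whose |r| with this target clears
        # the threshold; pairs missing from the lookup count as r = 0.
        informative = sum(
            1
            for other in mice_columns
            if other != target
            and abs(pearson.get(frozenset((target, other)), 0.0)) > threshold
        )
        counts.append(informative)

    # Assumptions: truncate a fractional median and never return less than 1.
    return max(1, min(int(median(counts)), max_nearest))
```

For example, with four columns, `min_cols=3`, and correlations where the per-column informative counts are 2, 2, 3, and 1, the median is 2 and that value is passed as `n_nearest_features`.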

Post-fit convergence monitoring:

  • After `fit`, check whether the fitted imputer's `n_iter_` equals `max_iter`.
  • If true, append a convergence warning to every MICE column's `ColumnImputationRecord.signals`.
  • Regardless of convergence: append the chosen estimator and the driving `NonlinearityTag` to every MICE column's signals.
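
A sketch of the post-fit bookkeeping described above. The signal strings and the function name are invented for illustration; the actual format of `ColumnImputationRecord.signals` is not specified in this issue. The function takes plain values so it can be read independently of scikit-learn.

```python
from typing import Dict, List, Sequence


def mice_fit_signals(
    columns: Sequence[str],
    n_iter: int,
    max_iter: int,
    estimator_name: str,
    driving_tag: str,
) -> Dict[str, List[str]]:
    """Build the per-column signal entries for one fitted MICE block."""
    # Recorded for every column regardless of convergence.
    shared = [
        f"mice_estimator={estimator_name}",
        f"mice_driving_tag={driving_tag}",
    ]
    # n_iter_ reaching max_iter is treated as possible non-convergence.
    if n_iter >= max_iter:
        shared.append(
            f"mice_convergence_warning: n_iter_={n_iter} reached max_iter={max_iter}"
        )
    # Each column gets its own list so later per-column edits do not leak.
    return {col: list(shared) for col in columns}
```

Note that `n_iter_ == max_iter` also holds when the tolerance happens to be met on the final round. scikit-learn emits a `ConvergenceWarning` from `fit` when the early stopping criterion is not reached, so capturing that warning with `warnings.catch_warnings(record=True)` may be a stricter check than comparing the two counters.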

Modified: `NumericImputationConfig`

Three new fields:

  • `mice_n_nearest_features_min_cols: int = 10` — MICE block size below which `n_nearest_features` is left unset
  • `mice_max_nearest_features: int = 20` — cap on the computed `n_nearest_features` value
  • `mice_correlation_threshold: float = 0.1` — minimum `|Pearson r|` for a column to count as an informative predictor

All three serialised in `to_dict()` / `from_dict()`.
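
For illustration, the three fields and their round-trip could be sketched with a dataclass. The class name below is a placeholder, and the real `NumericImputationConfig` carries other fields that are omitted here; ignoring unknown keys in `from_dict` is an assumption, not documented behaviour.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


@dataclass
class MiceConfigSketch:
    """Placeholder holding only the three new MICE fields."""
    mice_n_nearest_features_min_cols: int = 10
    mice_max_nearest_features: int = 20
    mice_correlation_threshold: float = 0.1

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MiceConfigSketch":
        # Assumption: unknown keys are ignored, missing keys use defaults.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```

Falling back to defaults for missing keys would let configs serialised before Scope 2 still load.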

No new storage wrapper

The fitted `IterativeImputer` continues to be stored as a bare object under `"mice"` in `FittedImputer.models`. No `_FittedMICE` wrapper is needed: unlike KNN, `IterativeImputer` handles feature NaNs internally at transform time, so no scaling parameters need to be stored alongside the model.

No `FittedImputer.transform()` changes

`_apply_block_model(df, "mice", ...)` calls `model.transform(arr)` unchanged. The new hyperparameters are baked into the fitted `IterativeImputer` at fit time.

`sample_posterior` stays `False`

Not configurable in Scope 2. See ADR 0012.

Testing Decisions

What makes a good test here: test external behaviour through the public interface, not internal implementation details. Verify that `ColumnImputationRecord.signals` contains the expected entries (estimator choice, convergence status, `n_nearest_features` decision). Do not assert on which sklearn estimator class was constructed internally. Mirror the pattern of `test_model_strategies.py`.

Modules with tests:

  • `NumericImputer` (MICE path) — `test_model_strategies.py` — new cases:

    • A MICE block where at least one column has `ComplexNonlinear` tag produces signals containing a non-linear estimator name for all MICE columns.
    • A MICE block where all columns are `Unpredictable` produces no MICE model and each column's record shows `Median` strategy with a fallback signal.
    • A MICE block with any `SkewSeverity >= Moderate` column produces `initial_strategy=median` visible in signals.
    • A MICE block exceeding `mice_n_nearest_features_min_cols` produces a `n_nearest_features` entry in signals; a small block does not.
    • A block that hits `max_iter` before convergence produces a convergence warning in the signals of every MICE column in the block, not just one.
  • `NumericImputationConfig` — `test_model_strategies.py` or config unit tests — new cases:

    • `to_dict()` / `from_dict()` round-trip preserves the three new fields with correct defaults.
  • `FittedImputer` (MICE path) — `test_fitted_imputer.py` — new cases:

    • `transform()` on a MICE-backed imputer produces no nulls in the output.
    • `to_dict()` / `from_dict()` round-trip preserves the fitted `IterativeImputer` correctly (same behaviour as existing MICE serialisation test, extended for new estimator variants).
  • Integration test — `test_imputation_end_to_end.py` — one new case: dataset with multiple MAR-suspect columns where at least one column has a non-linear relationship with others; final imputed output has no nulls; every MICE column's `ColumnImputationRecord.signals` contains estimator choice and convergence status.

Out of Scope

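  • `sample_posterior`: enabling posterior sampling requires multiple-imputation infrastructure (run MICE k times, combine via Rubin's rules). It is also incompatible with the non-linear estimators selected in Scope 2, because `IterativeImputer` requires the estimator's `predict` to support `return_std` when `sample_posterior=True`, and the tree-based regressors do not. Deferred to a future multiple-imputation scope. See ADR 0012.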
  • `imputation_order` change — the gap description had the sklearn parameter direction reversed. `"ascending"` (fewest-missing first) is already the better-grounded default. No change needed.
  • Per-column estimator inside MICE — assigning each MICE target its own estimator is not supported natively by `IterativeImputer` and would require a bespoke wrapper. The most-complex-tag-wins rule produces acceptable outcomes without this complexity. See ADR 0011.
  • KNN estimator inside MICE — `KNeighborsRegressor` as MICE internal estimator is not addressed; deferred.
  • Scope 0 (Regression Imputer Overhaul) — prerequisite, separate issue.
  • Scope 1 (KNN Hyperparameter Selection) — separate issue.
  • Scope A (MICE multiple-imputation / Rubin's rules) — separate session.
  • Gradient boosting variants beyond `GradientBoostingRegressor` (XGBoost, LightGBM) — deferred.

Further Notes

  • Scope 2 has a hard dependency on Scope 0. `NonlinearityProfiler`, `RegressionEstimatorFactory`, and `NonlinearityTag` in `NumericStats` must all be shipped before Scope 2 is implemented. No graceful degradation path is provided.
  • The `CorrelationProfiler` already computes pairwise Pearson correlations in Phase 1. `n_nearest_features` derivation should reuse those values directly rather than recomputing them.
  • The size threshold separating `RandomForestRegressor` from `GradientBoostingRegressor` for `ComplexNonlinear` blocks is inherited from `RegressionEstimatorFactory` (Scope 0), controlled by `gradient_boost_min_rows` in `NumericImputationConfig`.
  • The seven-signal `max_iter` / `tol` formula and its configurable scaling factors should be benchmarked empirically during implementation to validate default values.
  • All design decisions from this PRD are recorded in `CONTEXT.md` under the `MICE` entry in `Imputation Strategy`, and in ADR 0011 (estimator aggregation) and ADR 0012 (`sample_posterior` deferral).
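
The size threshold mentioned in the notes above, for `ComplexNonlinear` blocks only, amounts to something like the following. The function name is hypothetical, the choice of `>=` over `>` at the boundary is an assumption, and no default for `gradient_boost_min_rows` is implied. The actual logic lives in `RegressionEstimatorFactory` from Scope 0.

```python
def complex_nonlinear_estimator_name(n_rows: int, gradient_boost_min_rows: int) -> str:
    """Name the estimator for a ComplexNonlinear block given the row count."""
    # Assumption: a block exactly at the threshold counts as "large".
    if n_rows >= gradient_boost_min_rows:
        return "GradientBoostingRegressor"
    return "RandomForestRegressor"
```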
