Scope 0: Regression Imputer Overhaul — IterativeImputer, NonlinearityTag, dynamic convergence #89

@DEVunderdog


Problem Statement

When a data scientist uses DataForgeML to impute a column routed to ImputationStrategy.Regression, the library silently produces biased imputed values in three interconnected ways:

  1. Complete-row dropping during fit — if any feature column also has missing values, those rows are dropped from the training set entirely. For MAR (missing at random) columns, where missingness depends on the values of other observed columns, the training set systematically excludes exactly the rows most relevant to the missing-value population.

  2. Mean-patching at inference — when predicting imputed values for rows where feature columns are also missing, the library substitutes column means for those feature NaNs. The model was never trained on mean-patched inputs, so predictions are out-of-distribution.

  3. Blind linear assumption — BayesianRidge is always used regardless of whether the relationship between the target column and its predictors is linear. For non-linear or interaction-driven relationships, a linear model produces systematically biased imputed values with no warning or signal.

Together these produce imputed values that can silently degrade model quality downstream. The library gives the user no signal that any of this is happening — the ColumnImputationRecord records only why the strategy was chosen, not whether the resulting imputation was any good.
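
The first failure mode can be illustrated on synthetic data. This is a minimal sketch, assuming a MAR mechanism in which a fully observed column (`driver`) raises the missingness probability of both the feature and the target. The column names and the logistic missingness model are illustrative assumptions, not part of DataForgeML.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

driver = rng.normal(size=n)                        # fully observed column
feature = driver + rng.normal(scale=0.5, size=n)   # predictor of the target
target = 2.0 * feature + rng.normal(scale=0.5, size=n)

# Assumed MAR mechanism: both columns go missing more often when `driver` is high.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * driver))
feature_missing = rng.random(n) < p_missing
target_missing = rng.random(n) < p_missing

# Complete-row dropping keeps only rows where feature and target are both observed.
complete = ~feature_missing & ~target_missing

driver_in_training = driver[complete].mean()
driver_needing_imputation = driver[target_missing].mean()

print(f"fraction of rows kept for fitting: {complete.mean():.2f}")
print(f"mean driver in training rows:      {driver_in_training:.2f}")
print(f"mean driver in rows to impute:     {driver_needing_imputation:.2f}")
```

The training rows sit in the low-`driver` region while the rows that need imputation sit in the high-`driver` region, which is the mismatch described in item 1.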

Solution

Overhaul ImputationStrategy.Regression to be a principled, data-driven, self-monitoring strategy:

  1. Replace standalone BayesianRidge with a single-column IterativeImputer that handles missing feature columns iteratively — no rows are dropped, no mean-patching at inference.

  2. Auto-select the estimator inside the IterativeImputer based on a new NonlinearityTag signal computed in Phase 1. The library detects whether the column's relationship with its predictors is linear, monotonically non-linear, or complexly non-linear, and picks the appropriate estimator automatically.

  3. Compute max_iter and tol dynamically from seven data signals rather than using fixed defaults, so the IterativeImputer converges appropriately for the specific column's structure.

  4. Monitor convergence post-fit and surface it in the ColumnImputationRecord so users have visibility into fit quality.

User Stories

  1. As a data scientist, I want the library to correctly handle columns where the feature predictors also have missing values, so that my imputed values are not biased by complete-row dropping.
  2. As a data scientist, I want the library to automatically detect when a column's relationship with its predictors is non-linear, so that I don't have to manually specify a different estimator.
  3. As a data scientist, I want the library to automatically select between BayesianRidge, RandomForestRegressor, and GradientBoostingRegressor based on data signals, so that I get accurate imputed values without trial-and-error.
  4. As a data scientist, I want to see whether the regression imputer converged in the audit log, so that I can identify columns where imputation quality may be poor.
  5. As a data scientist, I want the library to fall back to Median when no model can predict the column better than a scalar fill, so that I don't receive misleadingly precise but inaccurate imputed values.
  6. As a data scientist, I want feature scaling to be applied automatically when BayesianRidge is selected, so that columns with different numeric ranges receive equal regularisation treatment without any manual pre-processing step on my part.
  7. As a data scientist, I want the ColumnImputationRecord to tell me which estimator was chosen and why, so that I understand the imputation decision without reading source code.
  8. As a data scientist, I want convergence warnings surfaced at the column level, not just as a silent degradation, so that I know when to increase max_iter for a specific column.
  9. As a data scientist, I want the NonlinearityTag for each column to be visible in the Phase 1 profile output, so that I can understand why the library made a particular estimator choice before running imputation.
  10. As a data scientist, I want the four non-linearity signals (Spearman/Pearson discrepancy, mutual information, R² gap, heteroscedasticity) to each be accessible in the profile, so that I can inspect the evidence behind the tag.
  11. As a data scientist, I want an Unpredictable tag to route the column to Median with a clear signal, so that I know regression was attempted and found unsuitable rather than silently falling back.
  12. As a data scientist, I want the same convergence monitoring to apply whether the column is MonotonicNonlinear or Linear, so that non-convergence is never silently swallowed regardless of estimator.
  13. As a data scientist, I want tol to be relative to the column's IQR rather than absolute, so that convergence detection is consistent across columns with very different numeric ranges.
  14. As a data scientist, I want the max_iter to be informed by the number of feature columns that are themselves missing, so that columns with complex missingness structure are not under-iterated.
  15. As a data scientist, I want the complete row fraction to influence max_iter, so that columns with very few complete training rows get more rounds to stabilise.
  16. As a data scientist, I want inter-feature correlation among missing feature columns to inform max_iter, so that co-shifting feature updates are given enough rounds to settle.
  17. As a data scientist, I want GradientBoostingRegressor to be selected automatically on large datasets with ComplexNonlinear structure, so that I get the most accurate imputation without tuning estimator selection myself.
  18. As a data scientist, I want RandomForestRegressor to be selected on smaller datasets with ComplexNonlinear structure, so that imputation is accurate without incurring the compute cost of gradient boosting on a small sample.
  19. As a data scientist, I want the NonlinearityTag to be part of NumericStats in Phase 1, so that downstream phases beyond imputation can also consume it if relevant.
  20. As a library contributor, I want NonlinearityProfiler to be a testable, isolated sub-processor, so that I can verify signal computation independently of the imputation path.

Implementation Decisions

New: NonlinearityTag enum

A new enum in the numeric profiling config alongside SkewSeverity and KurtosisTag. Four values: Linear, MonotonicNonlinear, ComplexNonlinear, Unpredictable. Stored in NumericStats.nonlinearity_tag.
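
A minimal sketch of the enum, assuming string-valued members so the tag serialises directly. The member naming style and the `str` mixin are assumptions; only the four value names come from this PRD.

```python
from enum import Enum


class NonlinearityTag(str, Enum):
    """Relationship between a numeric column and its numeric predictors."""

    LINEAR = "Linear"
    MONOTONIC_NONLINEAR = "MonotonicNonlinear"
    COMPLEX_NONLINEAR = "ComplexNonlinear"
    UNPREDICTABLE = "Unpredictable"


# The string value round-trips through a serialised payload.
tag = NonlinearityTag("ComplexNonlinear")
print(tag.name, tag.value)
```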

New: NonlinearityProfiler sub-processor (Phase 1)

A new Phase 1 sub-processor following the existing ColumnBatchProfiler interface. Computes four signals for each numeric column against its numeric predictors — all four always computed, never staged:

  • Spearman/Pearson discrepancy — large |Spearman - Pearson| across predictors indicates monotonic non-linearity
  • Mutual information: mutual_info_regression per predictor pair; high MI not explained by Pearson/Spearman indicates complex non-linear structure
  • R² gap test — fit LinearRegression and a shallow RandomForestRegressor on a bootstrap sample of complete rows; the gap quantifies the imputation quality improvement from switching estimators. Near-zero R²_RF → Unpredictable
  • Breusch-Pagan heteroscedasticity — linear model residuals tested for non-constant variance; heteroscedastic residuals indicate that BayesianRidge's constant-variance (homoscedastic) noise assumption is violated

The profiler assigns a NonlinearityTag based on combined signal thresholds. The tag and the four underlying signal values are all stored in NumericStats.
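
The four signals could be computed roughly as follows. This is a sketch under stated assumptions, not the profiler's implementation: the function name, the returned keys, the forest settings, and scoring R² on out-of-bag rows are all hypothetical choices. The Breusch-Pagan test is written out in Koenker's studentized form (LM = n·R² of the squared residuals regressed on the predictors) to avoid an extra dependency. For self-containment it recomputes Pearson rather than reusing CorrelationProfiler output, and it stops short of tag assignment because the thresholds are not specified here.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression


def nonlinearity_signals(X, y, seed=0):
    """Return the raw signals for target y (n,) against complete-row predictors X (n, k)."""
    n, k = X.shape
    rng = np.random.default_rng(seed)

    # Signal 1: largest |Spearman - Pearson| over predictors.
    # Spearman is the Pearson correlation of the ranks.
    y_rank = stats.rankdata(y)
    discrepancy = max(
        abs(
            np.corrcoef(stats.rankdata(X[:, j]), y_rank)[0, 1]
            - np.corrcoef(X[:, j], y)[0, 1]
        )
        for j in range(k)
    )

    # Signal 2: mean mutual information between each predictor and y.
    mean_mi = mutual_info_regression(X, y, random_state=seed).mean()

    # Signal 3: fit both models on a bootstrap sample and score them on the
    # out-of-bag rows (assumed evaluation set); the gap is forest minus linear.
    boot = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), boot)
    linear = LinearRegression().fit(X[boot], y[boot])
    forest = RandomForestRegressor(
        n_estimators=50, max_depth=4, random_state=seed
    ).fit(X[boot], y[boot])
    r2_linear = linear.score(X[oob], y[oob])
    r2_forest = forest.score(X[oob], y[oob])

    # Signal 4: Breusch-Pagan (Koenker form). Regress squared residuals of a
    # linear fit on X; LM = n * R^2 is compared against chi-square with k dof.
    residuals_sq = (y - LinearRegression().fit(X, y).predict(X)) ** 2
    aux_r2 = LinearRegression().fit(X, residuals_sq).score(X, residuals_sq)
    bp_pvalue = stats.chi2.sf(n * aux_r2, df=k)

    return {
        "spearman_pearson_max_discrepancy": float(discrepancy),
        "mean_mutual_information": float(mean_mi),
        "r2_linear": float(r2_linear),
        "r2_forest": float(r2_forest),
        "r2_gap": float(r2_forest - r2_linear),
        "heteroscedasticity_pvalue": float(bp_pvalue),
    }


# Synthetic check: a quadratic target should show a large R^2 gap,
# a linear target should show almost none.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 2))
quadratic = nonlinearity_signals(X, X[:, 0] ** 2 + rng.normal(scale=0.1, size=400))
linear = nonlinearity_signals(X, 2 * X[:, 0] + rng.normal(scale=0.5, size=400))
print(quadratic["r2_gap"], linear["r2_gap"])
```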

Modified: NumericStats

Add nonlinearity_tag: Optional[NonlinearityTag] and four raw signal fields (Spearman/Pearson max discrepancy, mean MI, R² gap, heteroscedasticity p-value). All serialised in to_dict().
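
The additions might look like the sketch below. Only `nonlinearity_tag` and `iqr` are named in this PRD; the four signal field names are hypothetical, and the existing NumericStats fields are omitted.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class NonlinearityTag(str, Enum):
    LINEAR = "Linear"
    MONOTONIC_NONLINEAR = "MonotonicNonlinear"
    COMPLEX_NONLINEAR = "ComplexNonlinear"
    UNPREDICTABLE = "Unpredictable"


@dataclass
class NumericStats:
    # Existing fields are omitted; `iqr` stands in for them.
    iqr: float
    nonlinearity_tag: Optional[NonlinearityTag] = None
    # Hypothetical names for the four raw signal fields.
    spearman_pearson_max_discrepancy: Optional[float] = None
    mean_mutual_information: Optional[float] = None
    r2_gap: Optional[float] = None
    heteroscedasticity_pvalue: Optional[float] = None

    def to_dict(self) -> dict:
        data = asdict(self)
        # Store the tag as its plain string value so the dict is JSON-safe.
        tag = self.nonlinearity_tag
        data["nonlinearity_tag"] = tag.value if tag is not None else None
        return data


numeric_stats = NumericStats(
    iqr=2.5,
    nonlinearity_tag=NonlinearityTag.LINEAR,
    spearman_pearson_max_discrepancy=0.01,
    mean_mutual_information=0.8,
    r2_gap=-0.005,
    heteroscedasticity_pvalue=0.4,
)
payload = json.dumps(numeric_stats.to_dict())
print(payload)
```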

New: RegressionEstimatorFactory (Phase 2, imputation)

A focused module that takes NonlinearityTag and n_rows and returns an unfitted, ready-to-fit sklearn estimator:

  • Linear → Pipeline([StandardScaler, BayesianRidge(fit_intercept=True)])
  • MonotonicNonlinear → RandomForestRegressor
  • ComplexNonlinear, large dataset → GradientBoostingRegressor
  • ComplexNonlinear, small dataset → RandomForestRegressor
  • Unpredictable → signals caller to route to Median fallback
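
The mapping above can be sketched as a single function. The function name, returning `None` to signal the Median fallback, the estimator hyperparameters, and the 10,000-row placeholder threshold are assumptions; the PRD leaves the real default of `gradient_boost_min_rows` to be benchmarked.

```python
from enum import Enum

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class NonlinearityTag(str, Enum):
    LINEAR = "Linear"
    MONOTONIC_NONLINEAR = "MonotonicNonlinear"
    COMPLEX_NONLINEAR = "ComplexNonlinear"
    UNPREDICTABLE = "Unpredictable"


# Placeholder only; the real default is to be benchmarked.
GRADIENT_BOOST_MIN_ROWS = 10_000


def make_regression_estimator(tag, n_rows, gradient_boost_min_rows=GRADIENT_BOOST_MIN_ROWS):
    """Return an unfitted estimator, or None when the caller should use Median."""
    tag = NonlinearityTag(tag)
    if tag is NonlinearityTag.UNPREDICTABLE:
        return None
    if tag is NonlinearityTag.LINEAR:
        return Pipeline(
            [("scaler", StandardScaler()), ("ridge", BayesianRidge(fit_intercept=True))]
        )
    if tag is NonlinearityTag.MONOTONIC_NONLINEAR:
        return RandomForestRegressor(n_estimators=50, random_state=0)
    # ComplexNonlinear: choose by dataset size.
    if n_rows >= gradient_boost_min_rows:
        return GradientBoostingRegressor(random_state=0)
    return RandomForestRegressor(n_estimators=50, random_state=0)
```

A caller would check for `None` first and route the column to Median, otherwise pass the returned estimator to `IterativeImputer(estimator=...)`.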

Modified: _fit_regression / NumericImputer

Replace standalone BayesianRidge with IterativeImputer(estimator=<from factory>). Compute max_iter and tol dynamically from seven signals at fit time:

  1. NonlinearityTag: ComplexNonlinear increases base max_iter
  2. Count of feature columns with missing values — more missing features → higher max_iter
  3. R² strength from the R² gap test (the third profiler signal) — high R²_linear → lower max_iter (faster convergence)
  4. Inter-feature correlation among missing features — high pairwise correlation → higher max_iter
  5. Complete row fraction — low fraction → higher max_iter
  6. Scale-relative tol: tol = column_iqr * scaling_factor, with column_iqr taken from NumericStats.iqr
  7. Scale-relative tol adjustment for ComplexNonlinear — tighter tol for non-linear paths
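
One way the seven signals could combine is sketched below. Every weight, threshold, and name here (`ConvergenceInputs`, `dynamic_convergence`, the base of 10 rounds, the 1e-3 scaling factor) is a hypothetical placeholder chosen only to show the direction of each adjustment; the PRD does not fix these values.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceInputs:
    tag: str                          # NonlinearityTag value
    n_missing_features: int           # feature columns that have missing values
    r2_linear: float                  # linear R^2 from the R^2 gap test
    max_missing_feature_corr: float   # largest |corr| among missing features
    complete_row_fraction: float      # share of rows with no missing values
    column_iqr: float                 # NumericStats.iqr of the target column


def dynamic_convergence(s, base_max_iter=10, tol_scale=1e-3):
    """Return (max_iter, tol); all constants are illustrative placeholders."""
    max_iter = base_max_iter
    if s.tag == "ComplexNonlinear":        # signal 1
        max_iter += 10
    max_iter += 2 * s.n_missing_features   # signal 2
    if s.r2_linear >= 0.8:                 # signal 3
        max_iter -= 3
    if s.max_missing_feature_corr >= 0.7:  # signal 4
        max_iter += 5
    if s.complete_row_fraction < 0.5:      # signal 5
        max_iter += 5

    tol = s.column_iqr * tol_scale         # signal 6
    if s.tag == "ComplexNonlinear":        # signal 7
        tol *= 0.5
    return max(max_iter, 1), tol


easy = dynamic_convergence(ConvergenceInputs("Linear", 0, 0.9, 0.0, 1.0, 2.0))
hard = dynamic_convergence(ConvergenceInputs("ComplexNonlinear", 3, 0.1, 0.9, 0.2, 2.0))
print(easy, hard)
```

When wiring this into scikit-learn, note that `IterativeImputer` multiplies `tol` by the largest absolute observed value before applying its stopping test, so the effective threshold is the product of the two scalings.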

Post-fit: check IterativeImputer.n_iter_ == max_iter; if true, append a convergence warning to ColumnImputationRecord.signals.
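
The post-fit check might be sketched as follows. The helper name and the wording of the signal strings are assumptions; `n_iter_`, `max_iter`, and `tol` are real `IterativeImputer` attributes and parameters. scikit-learn's own `ConvergenceWarning` is silenced here because the same information is recorded in `signals`.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge


def fit_with_convergence_signal(X, max_iter, tol, signals):
    """Fit an IterativeImputer and append a convergence entry to `signals`."""
    imputer = IterativeImputer(
        estimator=BayesianRidge(), max_iter=max_iter, tol=tol, random_state=0
    )
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        imputer.fit(X)

    if imputer.n_iter_ == max_iter:
        signals.append(
            f"regression: IterativeImputer reached max_iter={max_iter}; "
            "convergence not confirmed"
        )
    else:
        signals.append(
            f"regression: IterativeImputer stopped after {imputer.n_iter_} "
            f"of {max_iter} rounds"
        )
    return imputer


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] += X[:, 0]
X[rng.random(X.shape) < 0.2] = np.nan  # knock out about 20% of cells

signals = []
fit_with_convergence_signal(X, max_iter=1, tol=1e-3, signals=signals)
print(signals[0])
```

With `max_iter=1` the imputer always runs exactly one round, so the warning branch is taken; this makes the case easy to reproduce in a unit test.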

Modified: FittedImputer

Remove feat_means storage and inference-time mean-patching for the Regression path. The fitted IterativeImputer handles inference-time feature NaNs internally. Update storage format: "regression:{col}" now stores a fitted IterativeImputer directly, not a (BayesianRidge, feat_means) tuple.
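
A sketch of the new storage shape and of inference on rows whose features are also missing. The dictionary store, the column name `target`, and the use of `pickle` for the round trip are assumptions made for illustration; the actual FittedImputer serialisation format is not specified here.

```python
import pickle

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n = 200

# Columns 0 and 1 are features; column 2 is the regression-imputed target.
train = rng.normal(size=(n, 3))
train[:, 2] = train[:, 0] + 0.5 * train[:, 1] + rng.normal(scale=0.1, size=n)
train[rng.random(n) < 0.2, 2] = np.nan  # target missing
train[rng.random(n) < 0.2, 0] = np.nan  # a feature is missing too

imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
imputer.fit(train)

# Store the fitted imputer itself under "regression:{col}", with no feat_means.
store = {"regression:target": pickle.dumps(imputer)}

# Round trip, then impute new rows that have NaNs in features and target.
restored = pickle.loads(store["regression:target"])
new_rows = np.array(
    [
        [np.nan, 0.3, np.nan],
        [1.0, np.nan, np.nan],
    ]
)
filled = restored.transform(new_rows)
print(filled)
```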

Modified: ColumnImputationRecord

No new fields required — convergence status and estimator choice are recorded in the existing signals: list[str] field as structured human-readable entries.

Testing Decisions

What makes a good test here: test external behaviour through the public interface, not internal implementation. For NonlinearityProfiler, test that known linear datasets produce Linear and known non-linear datasets produce MonotonicNonlinear or ComplexNonlinear. For the imputation path, test that missing feature columns are handled correctly, that Unpredictable columns fall back to Median, and that convergence warnings appear in signals when max_iter is hit. Do not test which internal sklearn estimator was constructed.

Modules with tests:

  • NonlinearityProfiler — unit tests covering all four signal paths and all four tag outcomes using synthetic datasets with known linearity properties. Mirror the pattern of test_numeric_profiler.py.
  • RegressionEstimatorFactory — unit tests verifying that each NonlinearityTag × size combination returns an estimator that produces non-null predictions on held-out rows, and that Unpredictable signals the Median fallback instead. Test via fit/predict, not by inspecting the estimator type.
  • NumericImputer (Regression path) — unit tests verifying: (a) rows with missing feature columns are imputed, not dropped; (b) Unpredictable tag routes to Median with signal recorded; (c) convergence warning appears in signals when n_iter_ == max_iter. Mirror the pattern of test_model_strategies.py.
  • FittedImputer (Regression path) — unit tests verifying that inference-time feature NaNs are handled without feat_means patching, and that serialisation round-trip works for the new IterativeImputer-based storage format. Mirror the pattern of test_fitted_imputer.py.
  • Integration test — add a case to test_imputation_end_to_end.py with a dataset where feature columns are also partially missing, verifying that the final imputed output has no nulls and the audit log records the estimator choice and convergence status.

Out of Scope

  • Scope B (MICE estimator selection and hyperparameter signals) — separate session
  • Scope A (KNNImputer hyperparameter adaptation) — separate session
  • Scope C (multi-round feedback loop and re-fit path) — separate session
  • Scope D (imputation quality evaluation / holdout error estimation) — separate session
  • Scopes 1–10 (effective nulls consistency, strategy routing, MNAR treatment, etc.) — separate sessions
  • Hyperparameter tuning of BayesianRidge priors — handled by EM self-adaptation and StandardScaler
  • sample_posterior for BayesianRidge — deferred to Scope B

Further Notes

  • NonlinearityProfiler should reuse Pearson correlations already computed by CorrelationProfiler rather than recomputing them.
  • RegressionEstimatorFactory is a strong candidate for a deep module: simple interface (tag, n_rows → estimator), rich internal logic, no side effects, fully testable in isolation.
  • The size threshold separating RandomForestRegressor from GradientBoostingRegressor for ComplexNonlinear columns should be configurable via a new NumericImputationConfig.gradient_boost_min_rows field; the default should be benchmarked empirically during implementation.
  • All design decisions in this PRD are recorded in CONTEXT.md under NonlinearityTag and the Regression entry in Imputation Strategy.
