# Scope 0: Regression Imputer Overhaul — IterativeImputer, NonlinearityTag, dynamic convergence #89

## Problem Statement

When a data scientist uses DataForgeML to impute a column routed to `ImputationStrategy.Regression`, the library silently produces biased imputed values in three interconnected ways:

1. **Incomplete-row dropping during fit** — if a row has a missing value in any feature column, it is dropped from the training set entirely. When missingness is MAR (missing at random, so it depends on observed values and tends to co-occur across columns), the training set systematically excludes exactly the rows that most resemble the rows needing imputation.

2. **Mean-patching at inference** — when predicting imputed values for rows where feature columns are also missing, the library substitutes column means for those feature NaNs. The model was never trained on mean-patched inputs, so predictions are out-of-distribution.

3. **Blind linear assumption** — BayesianRidge is always used regardless of whether the relationship between the target column and its predictors is linear. For non-linear or interaction-driven relationships, a linear model produces systematically biased imputed values with no warning or signal.

Together these produce imputed values that can silently degrade model quality downstream. The library gives the user no signal that any of this is happening — the `ColumnImputationRecord` records only why the strategy was chosen, not whether the resulting imputation was any good.

## Solution

Overhaul `ImputationStrategy.Regression` to be a principled, data-driven, self-monitoring strategy:

1. **Replace standalone BayesianRidge with a single-column `IterativeImputer`** that handles missing feature columns iteratively — no rows are dropped, no mean-patching at inference.

2. **Auto-select the estimator inside the IterativeImputer** based on a new `NonlinearityTag` signal computed in Phase 1. The library detects whether the column's relationship with its predictors is linear, monotonically non-linear, or complexly non-linear, and picks the appropriate estimator automatically.

3. **Compute `max_iter` and `tol` dynamically** from seven data signals rather than using fixed defaults, so the IterativeImputer converges appropriately for the specific column's structure.

4. **Monitor convergence post-fit** and surface it in the `ColumnImputationRecord` so users have visibility into fit quality.

## User Stories

1. As a data scientist, I want the library to correctly handle columns where the feature predictors also have missing values, so that my imputed values are not biased by dropping incomplete rows from the training set.
2. As a data scientist, I want the library to automatically detect when a column's relationship with its predictors is non-linear, so that I don't have to manually specify a different estimator.
3. As a data scientist, I want the library to automatically select between BayesianRidge, RandomForestRegressor, and GradientBoostingRegressor based on data signals, so that I get accurate imputed values without trial-and-error.
4. As a data scientist, I want to see whether the regression imputer converged in the audit log, so that I can identify columns where imputation quality may be poor.
5. As a data scientist, I want the library to fall back to Median when no model can predict the column better than a scalar fill, so that I don't receive misleadingly precise but inaccurate imputed values.
6. As a data scientist, I want feature scaling to be applied automatically when BayesianRidge is selected, so that columns with different numeric ranges receive equal regularisation treatment without any manual pre-processing step on my part.
7. As a data scientist, I want the `ColumnImputationRecord` to tell me which estimator was chosen and why, so that I understand the imputation decision without reading source code.
8. As a data scientist, I want convergence warnings surfaced at the column level, not just as a silent degradation, so that I know when to increase `max_iter` for a specific column.
9. As a data scientist, I want the `NonlinearityTag` for each column to be visible in the Phase 1 profile output, so that I can understand why the library made a particular estimator choice before running imputation.
10. As a data scientist, I want the four non-linearity signals (Spearman/Pearson discrepancy, mutual information, R² gap, heteroscedasticity) to each be accessible in the profile, so that I can inspect the evidence behind the tag.
11. As a data scientist, I want an `Unpredictable` tag to route the column to Median with a clear signal, so that I know regression was attempted and found unsuitable rather than silently falling back.
12. As a data scientist, I want the same convergence monitoring to apply whether the column is `MonotonicNonlinear` or `Linear`, so that non-convergence is never silently swallowed regardless of estimator.
13. As a data scientist, I want `tol` to be relative to the column's IQR rather than absolute, so that convergence detection is consistent across columns with very different numeric ranges.
14. As a data scientist, I want the `max_iter` to be informed by the number of feature columns that are themselves missing, so that columns with complex missingness structure are not under-iterated.
15. As a data scientist, I want the complete row fraction to influence `max_iter`, so that columns with very few complete training rows get more rounds to stabilise.
16. As a data scientist, I want inter-feature correlation among missing feature columns to inform `max_iter`, so that co-shifting feature updates are given enough rounds to settle.
17. As a data scientist, I want GradientBoostingRegressor to be selected automatically on large datasets with `ComplexNonlinear` structure, so that I get the most accurate imputation without tuning estimator selection myself.
18. As a data scientist, I want RandomForestRegressor to be selected on smaller datasets with `ComplexNonlinear` structure, so that imputation is accurate without incurring the compute cost of gradient boosting on a small sample.
19. As a data scientist, I want the `NonlinearityTag` to be part of `NumericStats` in Phase 1, so that downstream phases beyond imputation can also consume it if relevant.
20. As a library contributor, I want `NonlinearityProfiler` to be a testable, isolated sub-processor, so that I can verify signal computation independently of the imputation path.

## Implementation Decisions

### New: `NonlinearityTag` enum
A new enum in the numeric profiling config alongside `SkewSeverity` and `KurtosisTag`. Four values: `Linear`, `MonotonicNonlinear`, `ComplexNonlinear`, `Unpredictable`. Stored in `NumericStats.nonlinearity_tag`.
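
A minimal sketch of how the enum might look. The definitions of `SkewSeverity` and `KurtosisTag` are not shown in this PRD, so the `str`-mixin base class and the string values used here are assumptions rather than the library's actual convention.

```python
from enum import Enum


class NonlinearityTag(str, Enum):
    """Hypothetical sketch of the four-valued tag stored in NumericStats."""

    Linear = "Linear"
    MonotonicNonlinear = "MonotonicNonlinear"
    ComplexNonlinear = "ComplexNonlinear"
    Unpredictable = "Unpredictable"


# Round-trip from a serialised string back to the enum member.
tag = NonlinearityTag("ComplexNonlinear")
print(tag.name, tag.value)
```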

### New: `NonlinearityProfiler` sub-processor (Phase 1)
A new Phase 1 sub-processor following the existing `ColumnBatchProfiler` interface. Computes four signals for each numeric column against its numeric predictors — all four always computed, never staged:

- **Spearman/Pearson discrepancy** — large `|Spearman - Pearson|` across predictors indicates monotonic non-linearity
- **Mutual information** — `mutual_info_regression` between each predictor and the target; high MI not explained by Pearson/Spearman indicates complex non-linear structure
- **R² gap test** — fit `LinearRegression` and a shallow `RandomForestRegressor` on a bootstrap sample of complete rows; the gap quantifies the imputation quality improvement from switching estimators. Near-zero R²_RF → `Unpredictable`
- **Breusch-Pagan heteroscedasticity** — residuals of the linear model are tested for non-constant variance; a significant result indicates that BayesianRidge's constant-variance (homoscedastic) noise assumption is violated

The profiler assigns a `NonlinearityTag` based on combined signal thresholds. The tag and the four underlying signal values are all stored in `NumericStats`.
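
The four signals and the tag assignment could be sketched as follows. This is an illustration under stated assumptions, not the profiler's implementation:

- The function names, dictionary keys, and all three thresholds are hypothetical placeholders; the PRD does not specify values.
- The R² comparison uses a held-out split instead of the bootstrap sample described above.
- The Breusch-Pagan test is the studentized (Koenker) variant, computed directly with NumPy and SciPy.
- Mutual information and the heteroscedasticity p-value are reported but not used in this simplified tag rule.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder thresholds for illustration only; the PRD does not specify values.
MIN_R2 = 0.10        # forest R² below this -> Unpredictable
MIN_R2_GAP = 0.10    # forest must beat linear by this much to leave Linear
MIN_RANK_GAP = 0.15  # |Spearman - Pearson| that marks a monotonic relationship


def nonlinearity_signals(X, y, seed=0):
    """Compute the four raw signals for target y against predictors X.

    X and y must contain complete rows only (no NaNs).
    """
    n_rows, n_features = X.shape

    # Signal 1: largest |Spearman - Pearson| across predictors.
    rank_gap = max(
        abs(stats.spearmanr(X[:, j], y)[0] - stats.pearsonr(X[:, j], y)[0])
        for j in range(n_features)
    )

    # Signal 2: mean mutual information between each predictor and the target.
    mean_mi = float(mutual_info_regression(X, y, random_state=seed).mean())

    # Signal 3: held-out R² of a linear model versus a shallow random forest.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    r2_linear = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
    forest = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=seed)
    r2_forest = forest.fit(X_tr, y_tr).score(X_te, y_te)

    # Signal 4: studentized Breusch-Pagan test. The statistic is n * R² of the
    # auxiliary regression of squared residuals on X, compared against a
    # chi-squared distribution with n_features degrees of freedom.
    resid_sq = (y - LinearRegression().fit(X, y).predict(X)) ** 2
    aux_r2 = max(LinearRegression().fit(X, resid_sq).score(X, resid_sq), 0.0)
    bp_pvalue = float(stats.chi2.sf(n_rows * aux_r2, df=n_features))

    return {
        "spearman_pearson_max_gap": float(rank_gap),
        "mean_mutual_info": mean_mi,
        "r2_linear": float(r2_linear),
        "r2_forest": float(r2_forest),
        "r2_gap": float(r2_forest - r2_linear),
        "heteroscedasticity_pvalue": bp_pvalue,
    }


def assign_tag(signals):
    """Map raw signals to a tag name using the placeholder thresholds above."""
    if signals["r2_forest"] < MIN_R2:
        return "Unpredictable"
    if signals["r2_gap"] < MIN_R2_GAP:
        return "Linear"
    if signals["spearman_pearson_max_gap"] >= MIN_RANK_GAP:
        return "MonotonicNonlinear"
    return "ComplexNonlinear"


rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(400, 2))
y_quadratic = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=400)
print(assign_tag(nonlinearity_signals(X_demo, y_quadratic)))
```

A quadratic target is a useful smoke test because both Pearson and Spearman are close to zero while the forest still predicts it well, so only the R² gap separates it from an unpredictable column.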

### Modified: `NumericStats`
Add `nonlinearity_tag: Optional[NonlinearityTag]` and four raw signal fields (Spearman/Pearson max discrepancy, mean MI, R² gap, heteroscedasticity p-value). All serialised in `to_dict()`.
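
As a rough illustration of the added fields, the sketch below uses a stand-in dataclass. Only `nonlinearity_tag` and `iqr` are names taken from this PRD; the class name `NumericStatsSketch` and the four signal field names are hypothetical, and the real `NumericStats` carries other fields not shown here.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class NonlinearityTag(Enum):
    Linear = "Linear"
    MonotonicNonlinear = "MonotonicNonlinear"
    ComplexNonlinear = "ComplexNonlinear"
    Unpredictable = "Unpredictable"


@dataclass
class NumericStatsSketch:
    """Stand-in for NumericStats showing only the fields relevant here."""

    iqr: float
    nonlinearity_tag: Optional[NonlinearityTag] = None
    # Hypothetical names for the four raw signal fields.
    spearman_pearson_max_gap: Optional[float] = None
    mean_mutual_info: Optional[float] = None
    r2_gap: Optional[float] = None
    heteroscedasticity_pvalue: Optional[float] = None

    def to_dict(self) -> dict:
        data = asdict(self)
        # Enums are not JSON-serialisable, so store the plain string value.
        tag = self.nonlinearity_tag
        data["nonlinearity_tag"] = tag.value if tag is not None else None
        return data


profiled = NumericStatsSketch(iqr=2.5, nonlinearity_tag=NonlinearityTag.Linear, r2_gap=0.01)
print(json.dumps(profiled.to_dict()))
```

Keeping every new field optional means columns that were never profiled for non-linearity still serialise cleanly.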

### New: `RegressionEstimatorFactory` (Phase 2, imputation)
A focused module that takes `NonlinearityTag` and `n_rows` and returns an unfitted, ready-to-fit sklearn estimator:
- `Linear` → `make_pipeline(StandardScaler(), BayesianRidge(fit_intercept=True))`
- `MonotonicNonlinear` → `RandomForestRegressor`
- `ComplexNonlinear`, large dataset → `GradientBoostingRegressor`
- `ComplexNonlinear`, small dataset → `RandomForestRegressor`
- `Unpredictable` → signals caller to route to Median fallback
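
The mapping above might be sketched as a single function. The function name, the use of `None` as the Median-fallback signal, and the default of 10,000 rows for the size threshold are all assumptions made for illustration; the PRD leaves the threshold to be benchmarked.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_regression_estimator(tag, n_rows, gradient_boost_min_rows=10_000, seed=0):
    """Return an unfitted estimator for the tag, or None to signal Median fallback.

    gradient_boost_min_rows=10_000 is a placeholder, not a benchmarked default.
    """
    if tag == "Linear":
        # Scaling first so BayesianRidge regularises all features equally.
        return make_pipeline(StandardScaler(), BayesianRidge(fit_intercept=True))
    if tag == "MonotonicNonlinear":
        return RandomForestRegressor(random_state=seed)
    if tag == "ComplexNonlinear":
        if n_rows >= gradient_boost_min_rows:
            return GradientBoostingRegressor(random_state=seed)
        return RandomForestRegressor(random_state=seed)
    if tag == "Unpredictable":
        return None
    raise ValueError(f"unknown NonlinearityTag: {tag!r}")


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)

estimator = make_regression_estimator("Linear", n_rows=len(X_demo))
predictions = estimator.fit(X_demo[:60], y_demo[:60]).predict(X_demo[60:])
print(predictions[:3])
```

Returning a fresh, unfitted estimator on every call keeps the function free of side effects, which is what makes it testable through `fit`/`predict` alone.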

### Modified: `_fit_regression` / `NumericImputer`
Replace standalone `BayesianRidge` with `IterativeImputer(estimator=<from factory>)`. Compute `max_iter` and `tol` dynamically from seven signals at fit time:

1. `NonlinearityTag` — `ComplexNonlinear` increases base `max_iter`
2. Count of feature columns with missing values — more missing features → higher `max_iter`
3. R² strength from the profiler's R² gap test — high R²_linear → lower `max_iter` (faster convergence)
4. Inter-feature correlation among missing features — high pairwise correlation → higher `max_iter`
5. Complete row fraction — low fraction → higher `max_iter`
6. Scale-relative `tol` — `tol = column_iqr * scaling_factor` derived from `NumericStats.iqr`
7. Scale-relative `tol` adjustment for `ComplexNonlinear` — tighter tol for non-linear paths
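
One way the seven signals could combine is sketched below. Every constant in it (base value, increments, cut-offs, scaling factor) is a hypothetical placeholder, since the PRD specifies only the direction of each adjustment. One point worth checking during implementation: scikit-learn's `IterativeImputer` already normalises `tol` by the largest absolute observed value in the input, so how an IQR-scaled tolerance should be passed through is left open here.

```python
def compute_convergence_params(
    tag,
    n_missing_features,
    r2_linear,
    mean_abs_corr_missing,
    complete_row_fraction,
    column_iqr,
    base_max_iter=10,
    tol_scaling_factor=1e-3,
):
    """Derive (max_iter, tol) from the seven signals. All constants are placeholders."""
    max_iter = base_max_iter
    if tag == "ComplexNonlinear":
        max_iter += 10                     # 1. complex structure needs more rounds
    max_iter += 2 * n_missing_features     # 2. more missing features, more rounds
    if r2_linear >= 0.8:
        max_iter -= 5                      # 3. strong linear fit converges faster
    if mean_abs_corr_missing >= 0.7:
        max_iter += 5                      # 4. co-shifting features settle slowly
    if complete_row_fraction < 0.5:
        max_iter += 5                      # 5. few complete rows, more rounds

    tol = column_iqr * tol_scaling_factor  # 6. tolerance relative to column scale
    if tag == "ComplexNonlinear":
        tol *= 0.5                         # 7. tighter tolerance on non-linear paths

    return max(max_iter, 1), tol


easy = compute_convergence_params("Linear", 0, 0.9, 0.0, 0.95, column_iqr=2.0)
hard = compute_convergence_params("ComplexNonlinear", 3, 0.2, 0.8, 0.3, column_iqr=2.0)
print(easy, hard)
```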

Post-fit: check whether the fitted imputer's `n_iter_` equals `max_iter`; if it does, append a convergence warning to `ColumnImputationRecord.signals`.
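
The post-fit check can be sketched with scikit-learn directly. The function name and the wording of the signal strings are hypothetical. Note that `n_iter_ == max_iter` is a heuristic: it is also true when the stopping criterion happens to be met on the final permitted round.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge


def fit_regression_with_signals(X, max_iter, tol, estimator=None):
    """Fit an IterativeImputer and return (imputer, imputed array, signal strings)."""
    imputer = IterativeImputer(
        estimator=estimator if estimator is not None else BayesianRidge(),
        max_iter=max_iter,
        tol=tol,
        random_state=0,
    )
    with warnings.catch_warnings():
        # sklearn warns on non-convergence; here it is recorded as a signal instead.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        imputed = imputer.fit_transform(X)

    signals = [f"regression: estimator={type(imputer.estimator).__name__}"]
    if imputer.n_iter_ == max_iter:
        signals.append(
            f"regression: convergence warning, n_iter_={imputer.n_iter_} reached max_iter={max_iter}"
        )
    else:
        signals.append(f"regression: converged after {imputer.n_iter_} rounds")
    return imputer, imputed, signals


# Three correlated columns with roughly 10% of cells missing in each.
rng = np.random.default_rng(0)
x0 = rng.normal(size=300)
X_full = np.column_stack([
    x0,
    x0 + rng.normal(scale=0.3, size=300),
    2 * x0 + rng.normal(scale=0.3, size=300),
])
X_missing = X_full.copy()
X_missing[rng.random(X_full.shape) < 0.1] = np.nan

_, imputed_capped, signals_capped = fit_regression_with_signals(X_missing, max_iter=1, tol=1e-12)
print(signals_capped)
```

Rows with missing feature values stay in the fit, and the output array has no NaNs regardless of whether the iteration cap was reached.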

### Modified: `FittedImputer`
Remove `feat_means` storage and inference-time mean-patching for the Regression path. The fitted `IterativeImputer` handles inference-time feature NaNs internally. Update storage format: `"regression:{col}"` now stores a fitted `IterativeImputer` directly, not a `(BayesianRidge, feat_means)` tuple.
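
A small sketch of the new storage format and a serialisation round-trip, assuming `pickle` as the serialisation mechanism (the PRD does not say which one `FittedImputer` uses). The column names `age` and `income` and the dictionary name `fitted_store` are made up for the example.

```python
import pickle

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the import below)
from sklearn.impute import IterativeImputer

# Toy training data with columns (age, income); about 10% of cells missing.
rng = np.random.default_rng(1)
age = rng.normal(40, 10, size=200)
train = np.column_stack([age, 1000 + 50 * age + rng.normal(scale=20, size=200)])
train[rng.random(train.shape) < 0.1] = np.nan

# New format: the key maps straight to a fitted IterativeImputer,
# with no (BayesianRidge, feat_means) tuple.
fitted_store = {
    "regression:income": IterativeImputer(max_iter=10, random_state=0).fit(train)
}

restored = pickle.loads(pickle.dumps(fitted_store))

# Inference rows: missing target, missing feature, and both missing.
new_rows = np.array([[35.0, np.nan], [np.nan, 3200.0], [np.nan, np.nan]])
before = fitted_store["regression:income"].transform(new_rows)
after = restored["regression:income"].transform(new_rows)
print(after)
```

The second and third rows are the cases that previously needed `feat_means` patching; the fitted imputer fills them itself.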

### Modified: `ColumnImputationRecord`
No new fields required — convergence status and estimator choice are recorded in the existing `signals: list[str]` field as structured human-readable entries.

## Testing Decisions

**What makes a good test here:** test external behaviour through the public interface, not internal implementation. For `NonlinearityProfiler`, test that known linear datasets produce `Linear` and known non-linear datasets produce `MonotonicNonlinear` or `ComplexNonlinear`. For the imputation path, test that missing feature columns are handled correctly, that `Unpredictable` columns fall back to Median, and that convergence warnings appear in `signals` when `max_iter` is hit. Do not test which internal sklearn estimator was constructed.

**Modules with tests:**
- `NonlinearityProfiler` — unit tests covering all four signal paths and all four tag outcomes using synthetic datasets with known linearity properties. Mirror the pattern of `test_numeric_profiler.py`.
- `RegressionEstimatorFactory` — unit tests verifying that each `NonlinearityTag` × size combination returns an estimator that produces non-null predictions on held-out rows, and that `Unpredictable` signals the Median fallback instead of returning an estimator. Test via `fit/predict`, not by inspecting the estimator type.
- `NumericImputer` (Regression path) — unit tests verifying: (a) rows with missing feature columns are imputed, not dropped; (b) `Unpredictable` tag routes to Median with signal recorded; (c) convergence warning appears in signals when `n_iter_ == max_iter`. Mirror the pattern of `test_model_strategies.py`.
- `FittedImputer` (Regression path) — unit tests verifying that inference-time feature NaNs are handled without `feat_means` patching, and that serialisation round-trip works for the new `IterativeImputer`-based storage format. Mirror the pattern of `test_fitted_imputer.py`.
- Integration test — add a case to `test_imputation_end_to_end.py` with a dataset where feature columns are also partially missing, verifying that the final imputed output has no nulls and the audit log records the estimator choice and convergence status.
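
The synthetic datasets these tests rely on could be generated along the following lines. The helper names `make_synthetic` and `punch_holes`, and the specific generating functions, are illustrative choices and not part of the existing test suite.

```python
import numpy as np


def make_synthetic(kind, n_rows=500, seed=0):
    """Return (X, y) with a known relationship between predictors and target."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-2.0, 2.0, size=(n_rows, 2))
    noise = rng.normal(scale=0.1, size=n_rows)
    if kind == "linear":
        y = 2.0 * X[:, 0] - X[:, 1] + noise
    elif kind == "monotonic":
        y = np.exp(2.0 * X[:, 0]) + noise   # monotone but strongly curved
    elif kind == "complex":
        y = X[:, 0] ** 2 + noise            # non-monotone, near-zero correlation
    elif kind == "unpredictable":
        y = rng.normal(size=n_rows)         # independent of X
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return X, y


def punch_holes(X, fraction, seed=0):
    """Set the same fraction of cells to NaN in every feature column."""
    rng = np.random.default_rng(seed)
    X_missing = X.astype(float).copy()
    n_holes = int(round(fraction * X.shape[0]))
    for j in range(X.shape[1]):
        rows = rng.choice(X.shape[0], size=n_holes, replace=False)
        X_missing[rows, j] = np.nan
    return X_missing


X_lin, y_lin = make_synthetic("linear", n_rows=2000)
X_holes = punch_holes(X_lin, fraction=0.2)
print(np.isnan(X_holes).sum(axis=0))
```

Fixing the seed keeps each dataset reproducible, so a test that expects a particular tag does not become flaky.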

## Out of Scope

- Scope B (MICE estimator selection and hyperparameter signals) — separate session
- Scope A (KNNImputer hyperparameter adaptation) — separate session
- Scope C (multi-round feedback loop and re-fit path) — separate session
- Scope D (imputation quality evaluation / holdout error estimation) — separate session
- Scopes 1–10 (effective nulls consistency, strategy routing, MNAR treatment, etc.) — separate sessions
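- Hyperparameter tuning of BayesianRidge priors — not needed here: BayesianRidge re-estimates its noise and weight precisions from the data during fitting, and StandardScaler puts the features on a common scale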
- `sample_posterior` for BayesianRidge — deferred to Scope B

## Further Notes

- `NonlinearityProfiler` should reuse Pearson correlations already computed by `CorrelationProfiler` rather than recomputing them.
- `RegressionEstimatorFactory` is a strong candidate for a deep module: simple interface (`tag`, `n_rows` → estimator), rich internal logic, no side effects, fully testable in isolation.
- The size threshold separating `RandomForestRegressor` from `GradientBoostingRegressor` for `ComplexNonlinear` columns should be configurable via a new `NumericImputationConfig.gradient_boost_min_rows` field; the default should be benchmarked empirically during implementation.
- All design decisions in this PRD are recorded in `CONTEXT.md` under `NonlinearityTag` and the `Regression` entry in `Imputation Strategy`.

