Scope 3: Multi-round Iteration and Feedback Loop — ImputationFitDiagnostic and suggest_refit_config

## Problem Statement

After `ImputationOrchestrator.fit()` returns, a data scientist has no way to know whether the imputation was effective. A MICE block that reached `max_iter` without converging, a regression model that learned nothing above a scalar fill, and a KNN pass that collapsed all imputed values to near-mean all look the same as a good fit from the caller's perspective. The `ColumnImputationRecord` records why a strategy was chosen, but nothing about whether the resulting fit was any good.

When quality is poor, the user has no re-fit path either. Parameters like `max_iter`, `tol`, and `n_neighbors` are computed dynamically inside `NumericImputer.fit()` after Scopes 0/1/2, but there is no mechanism to override them for a specific column, and no library helper to translate "column X converged poorly" into a concrete config change.

## Solution

Extend Phase 2 with a structured fit quality diagnostic and a standalone re-fit suggestion helper:

1. **Compute `ImputationFitDiagnostic` per column during `fit()`** — eight fields covering predictive quality (R² on held-out complete rows), convergence status, iteration count, and distribution comparison between imputed and observed values. Attached to `ColumnImputationRecord.diagnostic`; `None` for scalar strategies and Passthrough/Dropped/Constant.

2. **Add per-column parameter overrides to `NumericImputationConfig`** — `per_column_max_iter`, `per_column_n_neighbors`, and `per_column_strategy` let the user override dynamically-computed values for specific columns without touching the rest of the fit.

3. **Provide `suggest_refit_config(records, config)`** — a standalone Public API function that reads the diagnostics, identifies columns with poor fit, and returns a new `NumericImputationConfig` with per-column overrides pre-populated. The user reviews, adjusts if needed, and calls `fit()` again unchanged.

## User Stories

1. As a data scientist, I want a structured quality diagnostic attached to each column's imputation record after `fit()`, so that I can see whether the imputation was effective without inspecting human-readable signal strings.
2. As a data scientist, I want `r2_train` computed from held-out complete rows (not the same rows used for training), so that the quality score is an honest estimate of model quality rather than an optimistic in-sample figure.
3. As a data scientist, I want `r2_train` to be `None` when fewer than 25 complete rows are available for a held-out evaluation, so that I am not presented with an unreliable R² computed from too few points.
4. As a data scientist, I want `converged` as a boolean field on the diagnostic (not a parsed string), so that I can programmatically identify non-converging columns without scanning signal text.
5. As a data scientist, I want `n_iter` — the actual iteration count — alongside the boolean `converged`, so that I can see how close to convergence a model came (e.g., 9/10 iterations vs 1/10) and make a more informed re-fit decision.
6. As a data scientist, I want `imputed_mean` and `imputed_std` in the diagnostic, so that I can compare the distribution of imputed values against the distribution of observed values.
7. As a data scientist, I want `observed_mean` and `observed_std` in the diagnostic, so that I have the reference distribution from training data alongside the imputed distribution without having to recompute it myself.
8. As a data scientist, I want `variance_ratio` (`imputed_std / observed_std`) in the diagnostic, so that I have a single number that directly flags distribution collapse — when all imputed values are near-constant the ratio is near zero and easy to threshold on.
9. As a data scientist, I want `diagnostic` to be `None` for scalar strategies (Mean, Median, Mode) and for Passthrough, Dropped, and Constant columns, so that I am not presented with metrics that have no meaningful interpretation for non-model-based strategies.
10. As a data scientist, I want per-MICE-column `r2_train` scores — one per column in the block, not one score for the whole block — so that I can identify exactly which columns within a MICE block had poor quality without re-running the entire block.
11. As a data scientist, I want the final fitted model (stored in `FittedImputer`) to always be trained on all available training data, so that the model used at inference is the best possible fit regardless of the held-out evaluation step.
12. As a data scientist, I want `suggest_refit_config(records, config)` importable directly from `dataforge_ml`, so that I can call it with the output of `FittedImputer.records` and my current config without digging through submodule paths.
13. As a data scientist, I want `suggest_refit_config` to set `per_column_max_iter` for columns where `converged=False`, multiplied by `refit_max_iter_multiplier`, so that the fix for a non-converging column is automatic and I do not have to guess a new iteration count myself.
14. As a data scientist, I want `suggest_refit_config` to set `per_column_strategy[col] = Median` for columns where `r2_train < refit_r2_threshold`, so that a column whose model learned nothing is routed to a reliable scalar fallback on the next run rather than repeating the same bad fit.
15. As a data scientist, I want `suggest_refit_config` to flag columns where `variance_ratio < refit_variance_ratio_threshold` without automatically changing their strategy, so that I am informed of distribution collapse and can review those columns myself before deciding how to proceed.
16. As a data scientist, I want `suggest_refit_config` to skip the R² rule for a column when `r2_train` is `None`, so that columns with too few complete rows are not incorrectly routed to Median on the basis of a missing metric.
17. As a data scientist, I want to set `per_column_max_iter` manually in `NumericImputationConfig` for any column, so that I can override the dynamically-computed `max_iter` even when `suggest_refit_config` did not flag the column.
18. As a data scientist, I want to set `per_column_n_neighbors` manually in `NumericImputationConfig` for any column, so that I can override the dynamically-computed `n_neighbors` for KNN columns where I have domain knowledge about the appropriate neighbourhood size.
19. As a data scientist, I want to set `per_column_strategy` manually in `NumericImputationConfig` for any column, so that I can force a specific imputation strategy regardless of what the routing decision tree would select — for example, forcing Median on a column that the library routed to MICE but I know has no predictive structure.
20. As a data scientist, I want `per_column_strategy` to bypass the routing decision tree entirely for a named column, so that the override is respected unconditionally regardless of the column's missingness flags or severity.
21. As a data scientist, I want the re-fit to use the same `orchestrator.fit(train_df, profile)` entry point unchanged, so that the multi-round workflow requires no new methods to learn — fit, inspect diagnostics, update config, fit again.
22. As a data scientist, I want `ImputationFitDiagnostic` importable directly from `dataforge_ml`, so that I can type-hint against it in my own helper functions without using a fragile internal submodule path.
23. As a data scientist, I want `refit_r2_threshold`, `refit_variance_ratio_threshold`, `refit_max_iter_multiplier`, and `refit_r2_min_complete_rows` to be configurable fields in `NumericImputationConfig`, so that I can tune the sensitivity of the re-fit suggestion to the needs of my specific dataset.
24. As a data scientist, I want `ImputationFitDiagnostic` to be preserved through `FittedImputer.to_dict()` / `from_dict()` serialisation, so that I can persist a fitted imputer and later inspect its diagnostics without having to re-fit.
25. As a library contributor, I want the R² held-back evaluation and the final model re-fit to be clearly separated inside `NumericImputer.fit()`, so that the two steps can be tested and reasoned about independently.

## Implementation Decisions

### New: `ImputationFitDiagnostic` dataclass

A new dataclass in the imputation config module. Eight fields:

- `r2_train: Optional[float]` — R² on held-out complete rows. `None` when fewer than `refit_r2_min_complete_rows` complete rows are available.
- `converged: Optional[bool]` — `True` when `IterativeImputer` converged before reaching `max_iter`; `False` when it hit `max_iter` without converging. `None` for KNN and scalar strategies.
- `n_iter: Optional[int]` — actual iteration count of `IterativeImputer`. `None` for KNN and scalar strategies.
- `imputed_mean: float` — mean of the values imputed for null rows during fit.
- `imputed_std: float` — standard deviation of imputed values.
- `observed_mean: float` — mean of non-null training values.
- `observed_std: float` — standard deviation of non-null training values.
- `variance_ratio: float` — `imputed_std / observed_std`. Near-zero indicates distribution collapse.

Part of the Public API. Added to `to_dict()` / `from_dict()` serialisation on `ColumnImputationRecord`.
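The eight fields and the serialisation round-trip can be sketched as below. This is a minimal illustration, not the library's implementation: `build_diagnostic` is a hypothetical helper, the use of population standard deviation is an assumption (the spec does not say which estimator is used), and returning NaN when `observed_std` is zero is likewise an assumption.

```python
from dataclasses import asdict, dataclass
from statistics import fmean, pstdev
from typing import Optional, Sequence


@dataclass
class ImputationFitDiagnostic:
    r2_train: Optional[float]
    converged: Optional[bool]
    n_iter: Optional[int]
    imputed_mean: float
    imputed_std: float
    observed_mean: float
    observed_std: float
    variance_ratio: float

    def to_dict(self) -> dict:
        # All fields are plain scalars or None, so asdict is sufficient.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ImputationFitDiagnostic":
        return cls(**data)


def build_diagnostic(
    observed: Sequence[float],
    imputed: Sequence[float],
    r2_train: Optional[float] = None,
    converged: Optional[bool] = None,
    n_iter: Optional[int] = None,
) -> ImputationFitDiagnostic:
    """Hypothetical helper: derive the five distribution fields from raw values."""
    observed_std = pstdev(observed)  # assumption: population std
    imputed_std = pstdev(imputed)
    # Assumption: an undefined ratio (zero observed spread) is reported as NaN.
    ratio = imputed_std / observed_std if observed_std > 0 else float("nan")
    return ImputationFitDiagnostic(
        r2_train=r2_train,
        converged=converged,
        n_iter=n_iter,
        imputed_mean=fmean(imputed),
        imputed_std=imputed_std,
        observed_mean=fmean(observed),
        observed_std=observed_std,
        variance_ratio=ratio,
    )


# Every imputed value is identical, so the ratio collapses to zero.
collapsed = build_diagnostic(observed=[1.0, 2.0, 3.0, 4.0], imputed=[2.5, 2.5, 2.5])
restored = ImputationFitDiagnostic.from_dict(collapsed.to_dict())
```

A collapsed column like the one above is the case `variance_ratio` is meant to expose: the imputed mean matches the observed mean, yet the spread is gone.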

### Modified: `ColumnImputationRecord`

Add one new field:
- `diagnostic: Optional[ImputationFitDiagnostic] = None`

`None` for Passthrough, Dropped, Constant, and all scalar strategies. Present for KNN, Regression, and MICE.

### Modified: `NumericImputationConfig`

Seven new fields alongside the existing four:

**Per-column overrides** (populated by `suggest_refit_config` or set manually):
- `per_column_max_iter: dict[str, int]` — overrides dynamically-computed `max_iter` for named Regression/MICE columns
- `per_column_n_neighbors: dict[str, int]` — overrides dynamically-computed `n_neighbors` for named KNN columns
- `per_column_strategy: dict[str, ImputationStrategy]` — bypasses routing decision tree entirely for named columns

**Thresholds for `suggest_refit_config`:**
- `refit_r2_threshold: float = 0.1` — R² below this value triggers a Median routing override
- `refit_variance_ratio_threshold: float = 0.3` — variance ratio below this value triggers a flag (no automatic override)
- `refit_max_iter_multiplier: float = 2.0` — multiplier applied to computed `max_iter` when `converged=False`
- `refit_r2_min_complete_rows: int = 25` — minimum complete rows required to attempt R² computation

All seven new fields serialised in `to_dict()` / `from_dict()`.
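A sketch of the seven new fields with the defaults listed above follows. The four existing fields are omitted, the `ImputationStrategy` enum here is a stand-in with hypothetical member values, empty dicts as defaults for the per-column overrides are an assumption, and serialising strategies by `.value` is one possible choice rather than the library's confirmed format.

```python
from dataclasses import dataclass, field
from enum import Enum


class ImputationStrategy(Enum):
    # Stand-in enum; member values are hypothetical.
    Median = "median"
    MICE = "mice"


@dataclass
class NumericImputationConfig:
    # The four pre-existing fields are left out of this sketch.
    per_column_max_iter: dict[str, int] = field(default_factory=dict)
    per_column_n_neighbors: dict[str, int] = field(default_factory=dict)
    per_column_strategy: dict[str, ImputationStrategy] = field(default_factory=dict)
    refit_r2_threshold: float = 0.1
    refit_variance_ratio_threshold: float = 0.3
    refit_max_iter_multiplier: float = 2.0
    refit_r2_min_complete_rows: int = 25

    def to_dict(self) -> dict:
        return {
            "per_column_max_iter": dict(self.per_column_max_iter),
            "per_column_n_neighbors": dict(self.per_column_n_neighbors),
            # Enum members are stored by value so the dict stays JSON-friendly.
            "per_column_strategy": {
                col: strategy.value for col, strategy in self.per_column_strategy.items()
            },
            "refit_r2_threshold": self.refit_r2_threshold,
            "refit_variance_ratio_threshold": self.refit_variance_ratio_threshold,
            "refit_max_iter_multiplier": self.refit_max_iter_multiplier,
            "refit_r2_min_complete_rows": self.refit_r2_min_complete_rows,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "NumericImputationConfig":
        kwargs = dict(data)
        if "per_column_strategy" in kwargs:
            kwargs["per_column_strategy"] = {
                col: ImputationStrategy(value)
                for col, value in kwargs["per_column_strategy"].items()
            }
        # Missing keys fall back to the dataclass defaults.
        return cls(**kwargs)


defaults = NumericImputationConfig.from_dict({})
custom = NumericImputationConfig(
    per_column_max_iter={"age": 40},
    per_column_strategy={"income": ImputationStrategy.Median},
)
round_tripped = NumericImputationConfig.from_dict(custom.to_dict())
```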

### R² computation approach inside `NumericImputer.fit()`

For each model-based strategy, before fitting the final model:

1. Identify complete rows (no NaN across the relevant columns).
2. If complete row count ≥ `refit_r2_min_complete_rows`: hold back 20% as a validation set (fixed `random_state=0`).
3. Fit a temporary model on the 80% training portion.
4. Mask the target column in the held-out rows, run `transform()`, compute R².
5. Re-fit the final model on **all** complete rows — this is the model stored in `FittedImputer`.

For MICE, step 4 is repeated once per MICE column (masking each column individually in the held-out rows). The final stored MICE model is fit on all data once.

`r2_train = None` when complete row count < `refit_r2_min_complete_rows`.
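The five steps can be illustrated with a small standard-library sketch. It assumes a single predictor and uses an ordinary least-squares line as a stand-in for the real estimator; `fit_line`, `r2_score`, and `held_out_r2` are hypothetical names, and returning 0.0 when the held-out targets have no variance is an assumption. For MICE the masking step would be repeated once per column, which this sketch does not show.

```python
import random
from statistics import fmean
from typing import List, Optional, Tuple

Row = Tuple[float, float]  # (predictor, target), both observed


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept (stand-in model)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def r2_score(y_true: List[float], y_pred: List[float]) -> float:
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    # Assumption: report 0.0 when the held-out targets are constant.
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def held_out_r2(
    complete: List[Row],
    min_complete_rows: int = 25,
    holdout_fraction: float = 0.2,
    seed: int = 0,
) -> Optional[float]:
    # Step 2: too few complete rows means no R² at all.
    if len(complete) < min_complete_rows:
        return None
    shuffled = list(complete)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = max(1, int(len(shuffled) * holdout_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]
    # Step 3: temporary model on the training portion only.
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    # Step 4: the target is "masked" by predicting from the predictor alone.
    preds = [slope * x + intercept for x, _ in val]
    return r2_score([y for _, y in val], preds)


rows = [(float(x), 3.0 * x + 1.0) for x in range(40)]
r2_train = held_out_r2(rows)
# Step 5: the stored model is re-fit separately on every complete row.
final_slope, final_intercept = fit_line([x for x, _ in rows], [y for _, y in rows])
```

Keeping the evaluation (`held_out_r2`) and the final fit (`fit_line` on all rows) as separate calls mirrors the separation the contributor story asks for: each can be tested on its own.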

### Modified: `NumericImputer.fit()` — per-column override consumption

Before strategy routing for each column:
- If `config.per_column_strategy.get(col)` is set, use that strategy directly — skip the routing decision tree.

Before fitting model-based strategies:
- If `config.per_column_max_iter.get(col)` is set, use that value instead of the dynamically-computed `max_iter`.
- If `config.per_column_n_neighbors.get(col)` is set, use that value instead of the dynamically-computed `n_neighbors`.
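The override lookups reduce to two small resolution helpers, sketched here. `resolve_strategy` and `resolve_param` are hypothetical names, strategies are represented as plain strings for brevity, and `route` stands in for the routing decision tree.

```python
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def resolve_strategy(
    col: str,
    per_column_strategy: Mapping[str, str],
    route: Callable[[str], str],
) -> str:
    """Return the override if present; otherwise fall back to routing."""
    override = per_column_strategy.get(col)
    if override is not None:
        return override  # the decision tree is never consulted
    return route(col)


def resolve_param(col: str, overrides: Mapping[str, T], computed: T) -> T:
    """Prefer a per-column override over the dynamically-computed value."""
    value = overrides.get(col)
    return computed if value is None else value


def routing_tree(col: str) -> str:
    # Stand-in for the real routing logic.
    return "MICE"


strategy = resolve_strategy("income", {"income": "Median"}, routing_tree)
max_iter = resolve_param("age", {"age": 40}, computed=10)
n_neighbors = resolve_param("bmi", {}, computed=5)
```

Because the override is checked before `route` is called, a column forced to Median never enters the MICE block, which is the behaviour the per-column override tests assert.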

### New: `suggest_refit_config` standalone function

Importable from `dataforge_ml`. Signature:

```python
def suggest_refit_config(records: dict[str, ColumnImputationRecord],
                         config: NumericImputationConfig) -> NumericImputationConfig: ...
```

Three rules applied per column with a non-None diagnostic:

1. `converged=False` → `per_column_max_iter[col] = current_computed_max_iter × config.refit_max_iter_multiplier`
2. `r2_train < config.refit_r2_threshold` (and `r2_train is not None`) → `per_column_strategy[col] = ImputationStrategy.Median`
3. `variance_ratio < config.refit_variance_ratio_threshold` → append a flag to `signals` only; no automatic strategy change

Returns a new `NumericImputationConfig` with these overrides set. Does not mutate the input config.
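A self-contained sketch of the three rules follows, using trimmed stand-in dataclasses (`Diagnostic`, `Record`, `Config`) rather than the real types. Several points are assumptions: the current computed `max_iter` is taken from `n_iter` (which equals `max_iter` when the imputer hits its limit), the multiplied value is rounded up to an integer, strategies are plain strings, and the variance-ratio flag is returned by a separate hypothetical helper because the spec does not pin down here where the flag is written.

```python
import math
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass
class Diagnostic:
    r2_train: Optional[float]
    converged: Optional[bool]
    n_iter: Optional[int]
    variance_ratio: float


@dataclass
class Record:
    diagnostic: Optional[Diagnostic] = None


@dataclass
class Config:
    per_column_max_iter: Dict[str, int] = field(default_factory=dict)
    per_column_strategy: Dict[str, str] = field(default_factory=dict)
    refit_r2_threshold: float = 0.1
    refit_variance_ratio_threshold: float = 0.3
    refit_max_iter_multiplier: float = 2.0


def suggest_refit_config_sketch(records: Dict[str, Record], config: Config) -> Config:
    # Copy the override dicts so the input config is never mutated.
    max_iter = dict(config.per_column_max_iter)
    strategy = dict(config.per_column_strategy)
    for col, record in records.items():
        diag = record.diagnostic
        if diag is None:
            continue  # scalar, Passthrough, Dropped, Constant columns
        # Rule 1: non-convergence raises the iteration cap.
        if diag.converged is False and diag.n_iter is not None:
            max_iter[col] = math.ceil(diag.n_iter * config.refit_max_iter_multiplier)
        # Rule 2: a missing R² is skipped, a low R² routes to Median.
        if diag.r2_train is not None and diag.r2_train < config.refit_r2_threshold:
            strategy[col] = "Median"
    return replace(config, per_column_max_iter=max_iter, per_column_strategy=strategy)


def collapsed_columns(records: Dict[str, Record], config: Config) -> List[str]:
    """Rule 3 (flag only): columns whose imputed spread has collapsed."""
    return [
        col
        for col, record in records.items()
        if record.diagnostic is not None
        and record.diagnostic.variance_ratio < config.refit_variance_ratio_threshold
    ]


records = {
    "a": Record(Diagnostic(r2_train=0.8, converged=False, n_iter=10, variance_ratio=0.9)),
    "b": Record(Diagnostic(r2_train=0.02, converged=True, n_iter=4, variance_ratio=0.1)),
    "c": Record(Diagnostic(r2_train=None, converged=None, n_iter=None, variance_ratio=0.5)),
    "d": Record(),
}
current = Config(per_column_max_iter={"z": 7})
suggested = suggest_refit_config_sketch(records, current)
```

Overrides already present in the input (`"z"` here) carry over untouched, so the helper can be applied across several rounds without losing earlier decisions.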

### Public API exports

Both `ImputationFitDiagnostic` and `suggest_refit_config` added to the Public API, importable directly from `dataforge_ml`.

### Dependency

Hard dependency on Scopes 0, 1, and 2 (issues #89, #90, #91). `NonlinearityTag`, `RegressionEstimatorFactory`, dynamic `max_iter`/`tol`/`n_neighbors` computation, and convergence monitoring must all be in place before this scope is implemented.

## Testing Decisions

**What makes a good test here:** test external behaviour through public outputs — `ColumnImputationRecord.diagnostic` fields, `suggest_refit_config` return values, and per-column override effects on strategy routing. Do not test the internal split fractions or which temporary model was constructed during R² evaluation. Mirror the assertion style of `test_model_strategies.py` (verify `.strategy`, `.signals`, and model presence in the bundle) and `test_imputation_config.py` (verify defaults, `to_dict()` keys, and `from_dict()` round-trips).

**Modules with tests:**

- **`NumericImputer` (diagnostic computation) — `test_model_strategies.py`**, new cases:
  - A Regression column on a dataset with strong linear signal produces `r2_train > 0` in its diagnostic.
  - A Regression column where the model predicts near-mean (no signal) produces `r2_train` near zero.
  - A MICE block produces a separate `r2_train` per column in the block (not one shared score).
  - A KNN column produces `r2_train` in its diagnostic, and `converged` and `n_iter` are `None`.
  - A column with fewer than 25 complete rows produces `r2_train = None` in its diagnostic.
  - A column routed to Mean/Median/Mode has `diagnostic = None`.
  - A column routed to Passthrough/Dropped/Constant has `diagnostic = None`.
  - `variance_ratio` is present and positive for any model-based column with imputed values.
  - `converged = False` and `n_iter == max_iter` are set together when the IterativeImputer hits its limit.

- **`NumericImputer` (per-column overrides) — `test_model_strategies.py`**, new cases:
  - A column with `per_column_strategy = Median` is routed to Median regardless of its missingness flags.
  - A column with `per_column_strategy = Median` that would otherwise be routed to MICE does not appear in the MICE block.
  - `per_column_max_iter` is consumed: the recorded `n_iter` cap in signals reflects the override value, not the dynamically-computed value.

- **`suggest_refit_config` — new test file**, cases:
  - A column with `converged=False` in its diagnostic produces a `per_column_max_iter` entry in the returned config.
  - The multiplied `max_iter` value equals `computed_max_iter × refit_max_iter_multiplier`.
  - A column with `r2_train < refit_r2_threshold` produces a `per_column_strategy[col] = Median` entry.
  - A column with `r2_train = None` does NOT produce a `per_column_strategy` Median entry.
  - A column with `variance_ratio < refit_variance_ratio_threshold` does NOT produce a strategy override (flag only).
  - A column with `diagnostic = None` is silently skipped.
  - The returned config is a new object — the input config is not mutated.
  - Calling `suggest_refit_config` on a config already containing per-column overrides from a prior call preserves existing overrides not targeted by new rules.

- **`NumericImputationConfig` — `test_imputation_config.py`**, new cases:
  - All seven new fields have correct defaults.
  - `to_dict()` key set includes all seven new field names.
  - `from_dict()` round-trip preserves non-default values for all seven fields.
  - `from_dict({})` produces correct defaults for all seven new fields.

- **`FittedImputer` (diagnostic serialisation) — `test_fitted_imputer.py`**, new cases:
  - `to_dict()` / `from_dict()` round-trip preserves `ColumnImputationRecord.diagnostic` fields including `r2_train`, `variance_ratio`, `converged`, and `n_iter`.
  - A `FittedImputer` round-tripped through serialisation produces identical `transform()` output.

- **Integration test — `test_imputation_end_to_end.py`**, one new case:
  - Dataset with strong linear structure: Regression column diagnostic shows `r2_train > 0.5` and `converged = True`.
  - Call `suggest_refit_config`, apply returned config, call `fit()` again — second fit completes without error and produces no nulls.

## Out of Scope

- **Scope D (holdout error estimation using external test data or cross-validation)** — `r2_train` here uses only training data internal to `fit()`. Evaluating imputation quality against held-out test-set labels is a separate scope.
- **Automated internal re-fit loop** — `fit()` remains a single-pass operation. The library never re-fits automatically in response to diagnostics; that decision always belongs to the user.
- **`per_column_tol` override** — `tol` is derived from the column's IQR and is hard for users to set meaningfully. Excluded deliberately; `max_iter` is the right lever for convergence problems.
- **Distribution comparison beyond variance_ratio** — KS statistic or Jensen-Shannon divergence between imputed and observed distributions are deferred.
- **Scopes 0, 1, 2 (#89, #90, #91)** — prerequisites, separate issues.

## Further Notes

- The held-back R² evaluation runs each model twice: once on 80% of complete rows (diagnostic), once on 100% (final stored model). This is deliberate — see ADR 0013. The stored `FittedImputer` always learns from all available training data.
- For MICE, the O(n_mice_cols) transform calls for per-column R² are the accepted cost of column-level diagnostic precision, consistent with the accuracy-over-speed principle.
- All design decisions from this session are recorded in `CONTEXT.md` under `ImputationFitDiagnostic`, `suggest_refit_config`, and `Per-Column Imputation Override`, and in ADR 0013 (held-back evaluation rationale).
- `suggest_refit_config` is a pure function with no side effects — same inputs always produce the same output. It is straightforward to unit-test exhaustively.
