Scope 6: Strategy Decision Engine — Routing Signals and Distribution Shape Escalation

## Problem Statement

When a data scientist uses DataForgeML to impute numeric columns, the strategy routing engine makes systematically suboptimal decisions in several interconnected ways:

1. **MAR High with no detected missingness correlations falls back to Median.** A column flagged as MAR (Missing At Random) with High severity (5–20% missing) but an empty `correlated_with` list routes directly to Median. This discards the model-based imputation that the MAR mechanism specifically warrants — the absence of detected missingness correlations does not mean numeric features are non-predictive.

2. **Distribution shape is not a universal routing signal.** Skewness is consulted in exactly one place: the MCAR Minor split between Mean and Median. Kurtosis — already computed and stored in `NumericStats.kurtosis_tag` — is never consulted anywhere in routing. A Leptokurtic column (frequent extreme values, heavy tails) receives the same Median fill at Moderate severity as a normally distributed column, even though Median systematically underestimates the extreme values that define a fat-tailed distribution.

3. **NonlinearityTag is not a routing guard.** `NonlinearityTag.Unpredictable` indicates that no model family achieves R² > 0.1 against the available numeric predictors. A column with this tag can still be routed to Regression or KNN — producing a model indistinguishable from a scalar fill, at full compute cost, with no signal to the user.

4. **MCAR routing cannot see whether features are value-predictive.** For MCAR columns, the routing uses only severity and dataset size as signals. The `CorrelationProfiler` already computes a value-level Pearson correlation matrix across all numeric columns, but routing never reads it. An MCAR High column with no numeric predictor above `|r| = 0.2` is routed to KNN anyway, even though the feature set contains no useful predictive signal.

5. **The missingness correlation threshold is hardcoded.** `MARSuspect` detection uses a fixed Pearson threshold of 0.6 inside `MissingnessProfiler`. Datasets with correlated missingness patterns that fall below 0.6 are silently classified as MCAR, leading to systematically weaker imputation strategies for those columns.

## Solution

Extend the strategy decision engine with a complete set of routing signals, a universal distribution shape escalation principle, and a configurable missingness correlation threshold:

1. **Route MAR High with empty correlations to the MCAR High fallback chain** (KNN → Regression → Median under size guards) rather than jumping directly to Median. The absence of detected missingness correlations does not preclude value-level predictability.

2. **Make distribution shape a universal routing signal** applied at every severity level and for both MCAR and MAR paths. Wherever the routing would produce a scalar fill, check `KurtosisTag` and `SkewSeverity` first: `Leptokurtic` or `SkewSeverity.Severe` escalates to KNN (under size guards); `Platykurtic` de-escalates (scalar fills are more representative for thin-tailed distributions). `NumericFlag.NearConstant` caps escalation, because model-based imputation is wasteful when 90%+ of values share the mode.

3. **Add `NonlinearityTag.Unpredictable` as an unconditional pre-routing guard.** Before any mechanism-based routing, check the tag. If `Unpredictable`, route to Median and record the reason in `signals`, regardless of whether the column is MAR or MCAR and regardless of severity.

4. **Add a value-level feature-predictability check for MCAR routing.** Consume the `CorrelationProfiler` Pearson matrix already in `StructuralProfileResult`. When max `|r| < 0.2` against all available numeric features, skip model-based strategies and route directly to Median — features contain no useful predictive signal.

5. **Make the MAR correlation threshold configurable.** Move the hardcoded 0.6 threshold from `MissingnessProfiler` into `ProfileConfig` as `mar_correlation_threshold` (default 0.6), so users working with datasets where correlated missingness patterns fall below 0.6 can tune detection sensitivity.

## User Stories

1. As a data scientist, I want a MAR High column with no detected missingness correlations to attempt KNN or Regression before falling back to Median, so that the model-based strategies warranted by the MAR mechanism are always tried when features are available.
2. As a data scientist, I want a Leptokurtic column at MCAR Minor severity to escalate to KNN rather than receiving Mean, so that the heavy-tailed distribution's extreme values are not systematically underestimated by a scalar fill.
3. As a data scientist, I want a Leptokurtic column at MCAR Moderate severity to escalate to KNN rather than defaulting directly to Median, so that the 1–5% of missing values in a fat-tailed column are imputed using row-level feature similarity.
4. As a data scientist, I want a Leptokurtic MAR column at Minor or Moderate severity to escalate to KNN rather than receiving Median, so that distribution shape is treated consistently across MCAR and MAR paths.
5. As a data scientist, I want a column with SkewSeverity.Severe at Minor or Moderate severity to escalate to KNN, so that severely skewed distributions at low missingness receive model-based treatment independently of whether the kurtosis is also extreme.
6. As a data scientist, I want SkewSeverity and KurtosisTag to act as independent escalation signals, so that a column with both severe skew and Leptokurtic tails does not require both conditions to be met before escalation fires.
7. As a data scientist, I want a Platykurtic column to be treated as a de-escalation signal, so that thin-tailed distributions where Mean or Median are highly representative are not unnecessarily routed to model-based strategies.
8. As a data scientist, I want KurtosisTag.Mesokurtic to leave the existing severity-based routing unchanged, so that normally distributed columns are not affected by the new distribution shape logic.
9. As a data scientist, I want a NearConstant column (mode frequency > 90%) to be capped at Median regardless of kurtosis or skewness escalation signals, so that model-based strategies are not applied when the column has essentially no variation to model.
10. As a data scientist, I want NonlinearityTag.Unpredictable to prevent model-based routing unconditionally, so that a column where no model achieves meaningful R² never receives a KNN or Regression fill that performs no better than Median.
11. As a data scientist, I want the Unpredictable guard to apply even when the column is flagged MARSuspect with High or Severe missingness, so that the MAR mechanism does not silently override statistical evidence that modeling is futile.
12. As a data scientist, I want the audit log to record both the MAR flag and the Unpredictable guard when both are present, so that I understand exactly why a model-based strategy was not attempted.
13. As a data scientist, I want MCAR routing to check whether value-level Pearson correlations support model-based imputation before routing to KNN or Regression, so that I do not receive a model-based fill on a column where no numeric predictor carries useful signal.
14. As a data scientist, I want the feature-predictability check to use a configurable correlation threshold (default 0.2; the check fails when max |r| falls below it), so that I can tune the sensitivity for datasets where all features are moderately correlated.
15. As a data scientist, I want the feature-predictability check to apply only to MCAR paths and not to MAR paths, so that MAR columns confirmed by missingness correlation evidence still attempt model-based imputation even when value-level correlations are weak.
16. As a data scientist, I want to configure `mar_correlation_threshold` in `ProfileConfig`, so that I can lower it for datasets with diffuse missingness correlation patterns that fall below the default 0.6.
17. As a data scientist, I want `mar_correlation_threshold` to default to 0.6, so that existing pipelines continue to behave identically without any configuration change.
18. As a data scientist, I want every new routing decision — distribution shape escalation, Unpredictable guard, NearConstant cap, feature-predictability skip — recorded in `ColumnImputationRecord.signals`, so that I can understand the exact reason for every strategy assignment without reading source code.
19. As a data scientist, I want a MAR column whose missingness correlates only with categorical columns to still attempt model-based imputation using available numeric features, so that cross-type missingness evidence is not wasted.
20. As a data scientist, I want all routing decisions to remain overridable via `per_column_strategy` in `NumericImputationConfig`, so that I can always force a specific strategy when I have domain knowledge that overrides the decision engine.
21. As a library contributor, I want the routing logic to be extractable into a dedicated, testable module with a simple input/output interface, so that routing decisions can be unit-tested in complete isolation from DataFrame fitting.

## Implementation Decisions

### Modified: `ProfileConfig`
Add one new field:
- `mar_correlation_threshold: float = 0.6` — the Pearson correlation threshold above which a column's missingness indicator is considered correlated with another column's, triggering `MissingnessFlag.MARSuspect`. Serialised in `to_dict()` / `from_dict()`. Default preserves existing behaviour.
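
A minimal sketch of the new field and its round-trip, assuming `ProfileConfig` is a dataclass. Only the new field is shown. The range validation and the tolerant `from_dict()` are illustrative assumptions, not confirmed behaviour of the existing class.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class ProfileConfig:
    # Only the new field is shown; the real class carries other fields.
    mar_correlation_threshold: float = 0.6

    def __post_init__(self) -> None:
        # Assumed validation: a correlation threshold outside [0, 1] is rejected.
        if not 0.0 <= self.mar_correlation_threshold <= 1.0:
            raise ValueError("mar_correlation_threshold must be in [0, 1]")

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ProfileConfig":
        # A missing key falls back to the default, so configs saved before
        # this field existed still load with unchanged behaviour.
        return cls(
            mar_correlation_threshold=data.get("mar_correlation_threshold", 0.6)
        )
```

Falling back to 0.6 on a missing key is one way to keep previously serialised configs loading without a migration step.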

### Modified: `MissingnessProfiler`
Remove the hardcoded `_MAR_CORRELATION_THRESHOLD = 0.60` module-level constant. Accept `ProfileConfig` (or the threshold value directly) and use `profile_config.mar_correlation_threshold` in the correlation matrix scan. No change to the correlation computation itself — only the threshold that gates `MARSuspect` assignment.
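
The effect of the threshold can be sketched without the real profiler. The helper names below (`pearson`, `mar_suspects`) are hypothetical, columns are plain lists with `None` for missing values, and the comparison uses `abs(r) > threshold`. Whether the real scan uses the absolute value or a strict inequality is an assumption here.

```python
from math import sqrt
from typing import Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def mar_suspects(
    columns: dict[str, list[Optional[float]]], threshold: float = 0.6
) -> dict[str, list[str]]:
    """Map each column to the columns whose missingness indicator correlates with it."""
    # 1.0 where the value is missing, 0.0 where it is present.
    indicators = {
        name: [1.0 if v is None else 0.0 for v in values]
        for name, values in columns.items()
    }
    result: dict[str, list[str]] = {}
    for name, ind in indicators.items():
        partners = [
            other
            for other, other_ind in indicators.items()
            if other != name and abs(pearson(ind, other_ind)) > threshold
        ]
        if partners:
            result[name] = partners
    return result


# Two columns whose missingness indicators correlate at 14/24 (about 0.58).
cols = {
    "a": [None, None, None, None, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "b": [None, None, None, 1.0, None, 1.0, 1.0, 1.0, 1.0, 1.0],
}
print(mar_suspects(cols, threshold=0.6))  # {}
print(mar_suspects(cols, threshold=0.4))  # {'a': ['b'], 'b': ['a']}
```

With the default threshold the pair goes undetected. Lowering the threshold to 0.4 surfaces it, which is the case the configurable field is meant to cover.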

### Modified: `_fit_one` in `NumericImputer`
Two new checks are inserted into the existing priority chain, and two existing MCAR branches are updated:

**Priority 3 — `NonlinearityTag.Unpredictable` pre-routing guard** (inserted after MNAR, before all mechanism-based routing):
- Read `NonlinearityTag` from `cp.stats.nonlinearity_tag` (requires Scope 0 / Issue #89).
- If `Unpredictable`: route to Median immediately, append both the original missingness flag and the Unpredictable guard to `signals`. Do not proceed to MAR, MCAR, or any model-based path.

**Priority 6 — `NumericFlag.NearConstant` de-escalation cap** (inserted after Discrete / Mode, before MCAR severity routing):
- If `NumericFlag.NearConstant` is in `cp.flags`: route to Median, record in `signals`. Prevents kurtosis/skewness escalation from triggering KNN on a near-constant column.
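
The two new checks can be sketched as early-return helpers. This is illustrative only: strategies are plain strings, flags are a set of names, the signal wording is invented, and `NonlinearityTag.Linear` is a placeholder member that the source does not confirm. A `None` return means "no decision, continue down the priority chain".

```python
from enum import Enum, auto
from typing import Optional


class NonlinearityTag(Enum):
    Linear = auto()          # placeholder member, not confirmed by the source
    Unpredictable = auto()


Decision = tuple[str, list[str]]  # (strategy name, signals)


def unpredictable_guard(
    missingness_flag: str, tag: Optional[NonlinearityTag]
) -> Optional[Decision]:
    """Priority 3: runs before any MAR/MCAR routing."""
    if tag is None:
        # Tag not computed yet (Issue #89 not shipped): behave as a no-op.
        return None
    if tag is NonlinearityTag.Unpredictable:
        # Record both the original flag and the guard reason.
        return "median", [
            f"flag: {missingness_flag}",
            "median: NonlinearityTag.Unpredictable, model-based routing skipped",
        ]
    return None


def near_constant_cap(flags: set[str]) -> Optional[Decision]:
    """Priority 6: runs after Discrete/Mode and before MCAR severity routing."""
    if "NearConstant" in flags:
        return "median", [
            "median: NumericFlag.NearConstant cap, shape escalation suppressed"
        ]
    return None
```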

**Updated MCAR Minor routing** (currently lines 207–214):
- After the existing Normal skew → Mean branch, add a Leptokurtic check: if `kurtosis_tag == Leptokurtic`, escalate to `_mcar_model_strategy` with `severity=Minor` rather than returning Mean.
- The existing "Minor + skewed → Median" branch also becomes an escalation candidate when `kurtosis_tag == Leptokurtic` or `skewness_severity == Severe`.

**Updated MCAR Moderate path** (currently the fallthrough at line 217):
- Before falling to Median, check distribution shape: if `KurtosisTag.Leptokurtic` OR `SkewSeverity.Severe`, delegate to `_mcar_model_strategy`.
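
A hedged sketch of the updated MCAR Minor and Moderate branches. The `knn_allowed` boolean stands in for the size guards inside `_mcar_model_strategy`, the signal strings are invented, and `SkewSeverity.Moderate` is a placeholder for whatever levels sit between `Normal` and `Severe`.

```python
from enum import Enum, auto


class Severity(Enum):
    Minor = auto()
    Moderate = auto()


class KurtosisTag(Enum):
    Leptokurtic = auto()
    Mesokurtic = auto()
    Platykurtic = auto()


class SkewSeverity(Enum):
    Normal = auto()
    Moderate = auto()  # placeholder level, not confirmed by the source
    Severe = auto()


def route_mcar_low_severity(
    severity: Severity,
    kurtosis: KurtosisTag,
    skew: SkewSeverity,
    knn_allowed: bool,
) -> tuple[str, list[str]]:
    """Route an MCAR Minor/Moderate column, checking distribution shape first."""
    # Either shape signal escalates on its own; both are not required.
    if kurtosis is KurtosisTag.Leptokurtic or skew is SkewSeverity.Severe:
        reason = "leptokurtic" if kurtosis is KurtosisTag.Leptokurtic else "severe skew"
        if knn_allowed:
            return "knn", [f"knn: MCAR {severity.name} escalated ({reason})"]
        return "median", [
            f"median: MCAR {severity.name} escalation ({reason}) blocked by size guard"
        ]
    # No escalation: the existing severity-based routing applies unchanged.
    if severity is Severity.Minor and skew is SkewSeverity.Normal:
        return "mean", ["mean: MCAR Minor, normal skew"]
    return "median", [f"median: MCAR {severity.name} default"]
```

Checking shape before the Mean/Median split keeps `Mesokurtic` and `Platykurtic` columns on their current paths, which matches user stories 7 and 8.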

### Modified: `_mar_strategy`
Two changes:

**MAR High + empty `corrs`** (currently the final return at line 258):
- When `severity == High` and `corrs` is empty, delegate to `_mcar_model_strategy(severity=High, ...)` instead of returning Median. Records signal: `"knn/regression: MAR high + no missingness correlations detected, applying MCAR High fallback chain"`.

**Distribution shape escalation for MAR Minor/Moderate** (currently falls through to the final Median return):
- Before the final Median return, check distribution shape: if `KurtosisTag.Leptokurtic` OR `SkewSeverity.Severe`, and severity is Minor or Moderate, attempt KNN under size guards before falling to Median.
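
The two new MAR branches might look like the sketch below. The function name `mar_new_branches`, the string-typed severity, and the `mcar_high_chain` callable (a stand-in for `_mcar_model_strategy(severity=High, ...)`) are all hypothetical. Returning `None` means the existing `_mar_strategy` logic continues unchanged.

```python
from typing import Callable, Optional

Decision = tuple[str, list[str]]  # (strategy name, signals)

# Signal text taken from the decision above.
NO_CORRS_SIGNAL = (
    "knn/regression: MAR high + no missingness correlations detected, "
    "applying MCAR High fallback chain"
)


def mar_new_branches(
    severity: str,
    corrs: list[str],
    shape_escalates: bool,
    knn_allowed: bool,
    mcar_high_chain: Callable[[], Decision],
) -> Optional[Decision]:
    """Return a decision for the two new MAR branches, or None to fall through."""
    if severity == "High" and not corrs:
        # Delegate instead of returning Median; keep the delegate's own signals.
        strategy, signals = mcar_high_chain()
        return strategy, [NO_CORRS_SIGNAL, *signals]
    if severity in ("Minor", "Moderate") and shape_escalates and knn_allowed:
        return "knn", [f"knn: MAR {severity} escalated by distribution shape"]
    return None
```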

### Modified: `_mcar_model_strategy`
**Feature-predictability check** (new, MCAR paths only):
- Accept the `pearson_matrix` from `StructuralProfileResult.dataset.feature_correlation.pearson_matrix` as a new parameter.
- Before routing to KNN or Regression: compute `max_abs_r = max(|r| for all other numeric columns)`. If `max_abs_r < config.mcar_feature_predictability_threshold` (new configurable field in `NumericImputationConfig`, default 0.2): return Median with signal `"median: feature-predictability check failed (max |r|={max_abs_r:.2f} < threshold)"`.
- This check applies only inside `_mcar_model_strategy`, not in `_mar_strategy`.
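
A sketch of the check itself, assuming the Pearson matrix can be viewed as a nested mapping `{column: {other_column: r}}` (the real `pearson_matrix` type is not specified here). Skipping NaN entries and treating a column with no other numeric features as `max_abs_r = 0.0` are assumptions of this sketch.

```python
import math
from typing import Optional

PearsonMatrix = dict[str, dict[str, float]]


def max_abs_r(col: str, pearson_matrix: PearsonMatrix) -> float:
    """Largest |r| between `col` and any other numeric column."""
    values = [
        abs(r)
        for other, r in pearson_matrix.get(col, {}).items()
        if other != col and not math.isnan(r)  # ignore self and undefined r
    ]
    # Assumption: no usable predictors is treated as zero predictability.
    return max(values, default=0.0)


def predictability_skip_signal(
    col: str, pearson_matrix: PearsonMatrix, threshold: float = 0.2
) -> Optional[str]:
    """Return the Median signal when the check fails, else None."""
    strongest = max_abs_r(col, pearson_matrix)
    if strongest < threshold:
        return (
            "median: feature-predictability check failed "
            f"(max |r|={strongest:.2f} < threshold)"
        )
    return None


matrix = {"income": {"income": 1.0, "age": 0.12, "tenure": -0.15}}
print(predictability_skip_signal("income", matrix))
# median: feature-predictability check failed (max |r|=0.15 < threshold)
```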

### Modified: `NumericImputationConfig`
One new field:
- `mcar_feature_predictability_threshold: float = 0.2` — maximum absolute Pearson correlation below which MCAR model-based routing is skipped. Serialised in `to_dict()` / `from_dict()`.

### Modified: `NumericImputer.fit()` signature
Pass `StructuralProfileResult` correlation data into the routing functions that need it (`_mcar_model_strategy` for the feature-predictability check, `_fit_one` for `NonlinearityTag`). The `profile` parameter already exists on `fit()`; routing functions need access to the column's `NumericStats` (for `NonlinearityTag` and `KurtosisTag`) and to `dataset.feature_correlation.pearson_matrix` (for the predictability check). Both are reachable from the existing `profile` parameter.

### Candidate deep module: `_StrategyRouter`
Extract `_fit_one`, `_mar_strategy`, and `_mcar_model_strategy` into a `_StrategyRouter` class or module. Interface: takes `(col, ColumnProfile, NumericImputationConfig, n_rows, n_features, multi_mar, pearson_matrix)` and returns `(ImputationStrategy, list[str])`. No DataFrame access — pure signal-in, decision-out. This makes the routing logic independently testable without constructing fitting infrastructure.
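
One possible shape for the interface, sketched with stand-in types. `ColumnProfile` and `NumericImputationConfig` are reduced to the fields this example needs, the `ImputationStrategy` member names are hypothetical, and `SkeletonRouter` implements only the override and the `Unpredictable` guard. Applying `per_column_strategy` inside the router is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Protocol


class ImputationStrategy(Enum):  # member names are hypothetical
    MEAN = "mean"
    MEDIAN = "median"
    KNN = "knn"
    REGRESSION = "regression"


@dataclass
class ColumnProfile:
    # Stand-in: the real ColumnProfile carries stats and flags objects.
    severity: str
    nonlinearity_tag: Optional[str] = None  # None until Issue #89 ships


@dataclass
class NumericImputationConfig:
    per_column_strategy: dict[str, ImputationStrategy] = field(default_factory=dict)
    mcar_feature_predictability_threshold: float = 0.2


Decision = tuple[ImputationStrategy, list[str]]


class StrategyRouter(Protocol):
    def route(
        self, col: str, profile: ColumnProfile, config: NumericImputationConfig,
        n_rows: int, n_features: int, multi_mar: bool, pearson_matrix: object,
    ) -> Decision: ...


class SkeletonRouter:
    """Deliberately incomplete router: signals in, decision out, no DataFrame."""

    def route(self, col, profile, config, n_rows, n_features, multi_mar, pearson_matrix) -> Decision:
        forced = config.per_column_strategy.get(col)
        if forced is not None:
            return forced, [f"{forced.value}: per_column_strategy override"]
        if profile.nonlinearity_tag == "Unpredictable":
            return ImputationStrategy.MEDIAN, ["median: NonlinearityTag.Unpredictable guard"]
        # Remaining priorities (MAR, Discrete, NearConstant, MCAR) omitted here.
        return ImputationStrategy.MEDIAN, ["median: skeleton fallthrough"]
```

Because every input is a plain value, a test can call `route()` directly with hand-built profiles and assert on the returned strategy and signals.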

### Dependency on Scope 0 (Issue #89)
`NonlinearityTag.Unpredictable` as a pre-routing guard requires `NonlinearityTag` to be computed and stored in `NumericStats` by `NonlinearityProfiler`, so the guard (and only the guard) has a **hard dependency on Issue #89**. All other signals (`KurtosisTag`, `SkewSeverity`, `NearConstant`, `pearson_matrix`) are already available in Phase 1 and can be wired into routing before Issue #89 ships. Until Issue #89 is complete, implement the `Unpredictable` guard as a no-op stub (or guard it with `if nonlinearity_tag is not None`).

### Relationship to Scope 2 (Issue #91)
Scope 6 changes which columns are routed to MICE (more columns may qualify at Minor/Moderate severity via distribution shape escalation). Scope 2 changes MICE internals (estimator selection, dynamic `max_iter`). Neither blocks the other, but Scope 2's dynamic `max_iter` computation should be validated against the updated column sets produced by Scope 6 routing.

## Testing Decisions

**What makes a good test:** test routing decisions through the external public interface — `ColumnImputationRecord.strategy` and `ColumnImputationRecord.signals`. Do not assert on internal branching, private method calls, or intermediate routing state. Construct minimal synthetic DataFrames with known distribution properties (controlled skewness/kurtosis via scipy) and mock or inject profile objects with specific flag combinations. Each test should isolate one routing signal at a time, varying only the signal under test while holding others constant.
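
As a dependency-free illustration of "known distribution properties", the sketch below generates three seeded samples with different tail weights using only the standard library. The equivalent scipy generators and `scipy.stats.kurtosis` could be substituted. Function names are hypothetical, and which `KurtosisTag` each sample receives depends on the profiler's cutoffs, which are not restated here.

```python
import random
from statistics import fmean


def excess_kurtosis(xs: list[float]) -> float:
    """Population excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m4 = fmean([(x - m) ** 4 for x in xs])
    return m4 / (m2 ** 2) - 3.0


def make_samples(n: int = 20000, seed: int = 0) -> dict[str, list[float]]:
    """Seeded samples with heavy, normal, and thin tails."""
    rng = random.Random(seed)
    return {
        # Difference of two exponentials is Laplace: theoretical excess kurtosis 3.
        "heavy_tailed": [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)],
        # Gaussian: theoretical excess kurtosis 0.
        "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
        # Uniform: theoretical excess kurtosis -1.2.
        "thin_tailed": [rng.uniform(-1.0, 1.0) for _ in range(n)],
    }


samples = make_samples()
for name, xs in samples.items():
    print(name, round(excess_kurtosis(xs), 2))
```

Holding the seed and sample size fixed keeps the fixtures deterministic, so a routing test varies only the distribution shape under test.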

**Prior art:** `test_numeric_imputer.py` (strategy assertions via `ColumnImputationRecord`), `test_model_strategies.py` (model-based strategy cases with signal inspection).

### Module 1: `_StrategyRouter` (if extracted) or `NumericImputer` routing path
- MAR High + empty `corrs` routes to KNN or Regression (not Median).
- MAR High + empty `corrs` on a small dataset (below KNN/Regression size guards) routes to Median.
- MCAR Moderate + `KurtosisTag.Leptokurtic` routes to KNN.
- MCAR Moderate + `KurtosisTag.Mesokurtic` routes to Median (unchanged).
- MCAR Minor + `KurtosisTag.Leptokurtic` routes to KNN (not Mean).
- MCAR Minor + `SkewSeverity.Normal` + `KurtosisTag.Platykurtic` routes to Mean (Platykurtic does not prevent Mean).
- MAR Minor + `KurtosisTag.Leptokurtic` routes to KNN.
- MAR Moderate + `SkewSeverity.Severe` routes to KNN.
- `NumericFlag.NearConstant` + `KurtosisTag.Leptokurtic` routes to Median (NearConstant cap overrides escalation).
- `NonlinearityTag.Unpredictable` + MAR High routes to Median (Unpredictable overrides MAR).
- `NonlinearityTag.Unpredictable` + MCAR Severe routes to Median (Unpredictable overrides Severe).
- Signals for `Unpredictable` guard include both the original missingness flag and the guard reason.
- MCAR High + `max |r| < 0.2` routes to Median (feature-predictability check).
- MCAR High + `max |r| >= 0.2` routes to KNN or Regression (feature-predictability check passes).
- MAR High + `max |r| < 0.2` still routes to model-based (feature-predictability check does not apply to MAR).

### Module 2: `MissingnessProfiler` — configurable threshold
- Lowering `mar_correlation_threshold` to 0.4 in `ProfileConfig` causes columns with missingness correlation of 0.45 to receive `MARSuspect` where they previously did not.
- Raising the threshold to 0.8 causes columns previously flagged `MARSuspect` at 0.65 to lose the flag.
- Default threshold (0.6) produces identical output to the current hardcoded behaviour.

### Module 3: `ProfileConfig` and `NumericImputationConfig`
- `mar_correlation_threshold` defaults to 0.6; `to_dict()` / `from_dict()` round-trip preserves non-default values.
- `mcar_feature_predictability_threshold` defaults to 0.2; same round-trip requirement.

## Out of Scope

- **Bimodality detection** — a bimodal distribution where Mean and Median fall in the valley between two peaks is a meaningful routing signal, but requires new Phase 1 computation (Hartigan's dip test or GMM). Deferred to a future scope.
- **Tail asymmetry ratio** — `(p99 - p95) / (p5 - p1)` from existing percentile data. More precise than skewness alone, but deferred alongside bimodality to keep this scope focused on wiring existing signals.
- **Outlier density** — fraction of values beyond 3σ. Partially captured by `KurtosisTag.Leptokurtic`; deferred.
- **`max_iter` for MICE** — hardcoded at 10 today. Fully addressed by Scope 2 (Issue #91). Scope 6 does not touch MICE internals.
- **KNN and MICE hyperparameter selection** — dynamic `n_neighbors`, estimator selection, `tol` computation. Scopes 1 and 2 (Issues #90 and #91).
- **Numeric sentinel support** — Scope 5 (Issue #94).

## Further Notes

- All signals consumed by Scope 6 routing except `NonlinearityTag` (that is, `KurtosisTag`, `SkewSeverity`, `NumericFlag.NearConstant`, and `CorrelationProfiler.pearson_matrix`) are already computed in Phase 1 and stored in `StructuralProfileResult`. `NonlinearityTag` is the only new Phase 1 computation required, and it is delivered by Scope 0 / Issue #89.
- The `CorrelationProfiler.pearson_matrix` is already available under `StructuralProfileResult.dataset.feature_correlation.pearson_matrix`. The imputer's `fit()` method already receives `profile: StructuralProfileResult`, so wiring the matrix into routing functions is a parameter-threading exercise, not an architectural change.
- The `mar_correlation_threshold` is a definitional threshold (it determines the `MARSuspect` label) and therefore belongs in `ProfileConfig` (Phase 1) rather than `NumericImputationConfig` (Phase 2). This is consistent with the existing design principle in `CONTEXT.md`: definitional thresholds live in the phase that produces the label.
- ADR 0016 documents the decision that `NonlinearityTag.Unpredictable` unconditionally overrides MAR mechanism routing.
- ADR 0017 documents the decision that distribution shape is a universal routing signal across all paths.
- `CONTEXT.md` has been updated to reflect all routing signal changes, the updated `Numeric Imputation Decision Priority`, and the configurable `mar_correlation_threshold`.
