Scope 6: Strategy Decision Engine — Routing Signals and Distribution Shape Escalation #95

@DEVunderdog

Problem Statement

When a data scientist uses DataForgeML to impute numeric columns, the strategy routing engine makes systematically suboptimal decisions in several interconnected ways:

  1. MAR High with no detected missingness correlations falls back to Median. A column flagged as MAR (Missing At Random) with High severity (5–20% missing) but an empty correlated_with list routes directly to Median. This discards the model-based imputation that the MAR mechanism specifically warrants — the absence of detected missingness correlations does not mean numeric features are non-predictive.

  2. Distribution shape is not a universal routing signal. Skewness is consulted in exactly one place: the MCAR Minor split between Mean and Median. Kurtosis — already computed and stored in NumericStats.kurtosis_tag — is never consulted anywhere in routing. A Leptokurtic column (frequent extreme values, heavy tails) receives the same Median fill at Moderate severity as a normally distributed column, even though Median systematically underestimates the extreme values that define a fat-tailed distribution.

  3. NonlinearityTag is not a routing guard. NonlinearityTag.Unpredictable indicates that no model family achieves R² > 0.1 against the available numeric predictors. A column with this tag can still be routed to Regression or KNN — producing a model indistinguishable from a scalar fill, at full compute cost, with no signal to the user.

  4. MCAR routing cannot see whether features are value-predictive. For MCAR columns, the routing uses only severity and dataset size as signals. The CorrelationProfiler already computes a value-level Pearson correlation matrix across all numeric columns. A MCAR High column with no numeric predictor above |r| = 0.2 is routed to KNN anyway, even though the feature set contains no useful predictive signal.

  5. The missingness correlation threshold is hardcoded. MARSuspect detection uses a fixed Pearson threshold of 0.6 inside MissingnessProfiler. Datasets with correlated missingness patterns that fall below 0.6 are silently classified as MCAR, leading to systematically weaker imputation strategies for those columns.

Solution

Extend the strategy decision engine with a complete set of routing signals, a universal distribution shape escalation principle, and a configurable missingness correlation threshold:

  1. Route MAR High with empty correlations to the MCAR High fallback chain (KNN → Regression → Median under size guards) rather than jumping directly to Median. The absence of detected missingness correlations does not preclude value-level predictability.

  2. Make distribution shape a universal routing signal applied at every severity level and for both MCAR and MAR paths. Wherever the routing would produce a scalar fill, check KurtosisTag and SkewSeverity first: Leptokurtic or SkewSeverity.Severe escalates to KNN (under size guards); Platykurtic de-escalates (scalar fills are more representative for thin-tailed distributions). NumericFlag.NearConstant caps escalation — model-based is wasteful when 90%+ of values share the mode.

  3. Add NonlinearityTag.Unpredictable as an unconditional pre-routing guard. Before any mechanism-based routing, check the tag. If Unpredictable, route to Median and record the reason in signals, regardless of whether the column is MAR or MCAR and regardless of severity.

  4. Add a value-level feature-predictability check for MCAR routing. Consume the CorrelationProfiler Pearson matrix already in StructuralProfileResult. When max |r| < 0.2 against all available numeric features, skip model-based strategies and route directly to Median — features contain no useful predictive signal.

  5. Make the MAR correlation threshold configurable. Move the hardcoded 0.6 threshold from MissingnessProfiler into ProfileConfig as mar_correlation_threshold (default 0.6), so users working with datasets where correlated missingness patterns fall below 0.6 can tune detection sensitivity.

User Stories

  1. As a data scientist, I want a MAR High column with no detected missingness correlations to attempt KNN or Regression before falling back to Median, so that the model-based strategies warranted by the MAR mechanism are always tried when features are available.
  2. As a data scientist, I want a Leptokurtic column at MCAR Minor severity to escalate to KNN rather than receiving Mean, so that the heavy-tailed distribution's extreme values are not systematically underestimated by a scalar fill.
  3. As a data scientist, I want a Leptokurtic column at MCAR Moderate severity to escalate to KNN rather than defaulting directly to Median, so that the 1–5% of missing values in a fat-tailed column are imputed using row-level feature similarity.
  4. As a data scientist, I want a Leptokurtic MAR column at Minor or Moderate severity to escalate to KNN rather than receiving Median, so that distribution shape is treated consistently across MCAR and MAR paths.
  5. As a data scientist, I want a column with SkewSeverity.Severe at Minor or Moderate severity to escalate to KNN, so that severely skewed distributions at low missingness receive model-based treatment independently of whether the kurtosis is also extreme.
  6. As a data scientist, I want SkewSeverity and KurtosisTag to act as independent escalation signals, so that a column with both severe skew and Leptokurtic tails does not require both conditions to be met before escalation fires.
  7. As a data scientist, I want a Platykurtic column to be treated as a de-escalation signal, so that thin-tailed distributions where Mean or Median are highly representative are not unnecessarily routed to model-based strategies.
  8. As a data scientist, I want KurtosisTag.Mesokurtic to leave the existing severity-based routing unchanged, so that normally distributed columns are not affected by the new distribution shape logic.
  9. As a data scientist, I want a NearConstant column (mode frequency > 90%) to be capped at Median regardless of kurtosis or skewness escalation signals, so that model-based strategies are not applied when the column has essentially no variation to model.
  10. As a data scientist, I want NonlinearityTag.Unpredictable to prevent model-based routing unconditionally, so that a column where no model achieves meaningful R² never receives a KNN or Regression fill that performs no better than Median.
  11. As a data scientist, I want the Unpredictable guard to apply even when the column is flagged MARSuspect with High or Severe missingness, so that the MAR mechanism does not silently override statistical evidence that modeling is futile.
  12. As a data scientist, I want the audit log to record both the MAR flag and the Unpredictable guard when both are present, so that I understand exactly why a model-based strategy was not attempted.
  13. As a data scientist, I want MCAR routing to check whether value-level Pearson correlations support model-based imputation before routing to KNN or Regression, so that I do not receive a model-based fill on a column where no numeric predictor carries useful signal.
  14. As a data scientist, I want the feature-predictability check to use a configurable correlation threshold (default 0.2, applied as max |r| < threshold), so that I can tune the sensitivity for datasets where all features are moderately correlated.
  15. As a data scientist, I want the feature-predictability check to apply only to MCAR paths and not to MAR paths, so that MAR columns confirmed by missingness correlation evidence still attempt model-based imputation even when value-level correlations are weak.
  16. As a data scientist, I want to configure mar_correlation_threshold in ProfileConfig, so that I can lower it for datasets with diffuse missingness correlation patterns that fall below the default 0.6.
  17. As a data scientist, I want mar_correlation_threshold to default to 0.6, so that existing pipelines continue to behave identically without any configuration change.
  18. As a data scientist, I want every new routing decision — distribution shape escalation, Unpredictable guard, NearConstant cap, feature-predictability skip — recorded in ColumnImputationRecord.signals, so that I can understand the exact reason for every strategy assignment without reading source code.
  19. As a data scientist, I want a MAR column whose missingness correlates only with categorical columns to still attempt model-based imputation using available numeric features, so that cross-type missingness evidence is not wasted.
  20. As a data scientist, I want all routing decisions to remain overridable via per_column_strategy in NumericImputationConfig, so that I can always force a specific strategy when I have domain knowledge that overrides the decision engine.
  21. As a library contributor, I want the routing logic to be extractable into a dedicated, testable module with a simple input/output interface, so that routing decisions can be unit-tested in complete isolation from DataFrame fitting.

Implementation Decisions

Modified: ProfileConfig

Add one new field:

  • mar_correlation_threshold: float = 0.6 — the Pearson correlation threshold above which a column's missingness indicator is considered correlated with another column's, triggering MissingnessFlag.MARSuspect. Serialised in to_dict() / from_dict(). Default preserves existing behaviour.
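
A minimal sketch of how the new field and its round-trip could look, assuming ProfileConfig is a dataclass (the issue does not say so). The class name ProfileConfigSketch, the range validation, and the fallback to the default in from_dict() are assumptions made for illustration, not the library's actual implementation.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class ProfileConfigSketch:
    """Hypothetical stand-in for ProfileConfig; only the new field is shown."""

    mar_correlation_threshold: float = 0.6

    def __post_init__(self) -> None:
        # Assumption: the threshold is compared against a correlation
        # magnitude, so values outside [0, 1] are rejected early.
        if not 0.0 <= self.mar_correlation_threshold <= 1.0:
            raise ValueError("mar_correlation_threshold must be within [0, 1]")

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ProfileConfigSketch":
        # A missing key falls back to 0.6 so that configs serialised
        # before this field existed still load with unchanged behaviour.
        return cls(mar_correlation_threshold=data.get("mar_correlation_threshold", 0.6))


restored = ProfileConfigSketch.from_dict(ProfileConfigSketch(0.4).to_dict())
```

Falling back to the default on a missing key is one way to keep existing pipelines behaving identically without any configuration change.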

Modified: MissingnessProfiler

Remove the hardcoded _MAR_CORRELATION_THRESHOLD = 0.60 module-level constant. Accept ProfileConfig (or the threshold value directly) and use profile_config.mar_correlation_threshold in the correlation matrix scan. No change to the correlation computation itself — only the threshold that gates MARSuspect assignment.
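
The effect of the threshold can be sketched with a dependency-free toy version of the scan. Everything here is an assumption for illustration: the function names pearson and mar_suspects, the dict-of-lists input with None for missing values, and the use of |r| > threshold as the comparison. The real MissingnessProfiler may differ on all of these.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def mar_suspects(
    columns: Dict[str, List[Optional[float]]], threshold: float = 0.6
) -> Dict[str, List[str]]:
    """Map each column to the columns whose missingness indicator correlates with its own."""
    # 1.0 where the value is missing, 0.0 where it is present.
    indicators = {
        name: [1.0 if v is None else 0.0 for v in values]
        for name, values in columns.items()
    }
    result: Dict[str, List[str]] = {}
    for name, ind in indicators.items():
        partners = [
            other
            for other, other_ind in indicators.items()
            if other != name and abs(pearson(ind, other_ind)) > threshold
        ]
        if partners:
            result[name] = partners
    return result


# Two columns whose missingness indicators correlate at roughly 0.58.
data = {
    "a": [None] * 4 + [1.0] * 6,
    "b": [None] * 3 + [1.0] + [None] + [1.0] * 5,
}
strict = mar_suspects(data, threshold=0.6)   # no column flagged
relaxed = mar_suspects(data, threshold=0.4)  # both columns flagged
```

With the default of 0.6 this pair stays unflagged; lowering the threshold to 0.4 flags both columns, which is the tuning this change is meant to allow.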

Modified: _fit_one in NumericImputer

Two new checks inserted into the existing priority chain:

Priority 3 — NonlinearityTag.Unpredictable pre-routing guard (inserted after MNAR, before all mechanism-based routing):

  • If the column's NonlinearityTag is Unpredictable: route to Median, record the reason in signals. Applies regardless of whether the column is MAR or MCAR and regardless of severity.

Priority 6 — NumericFlag.NearConstant de-escalation cap (inserted after Discrete / Mode, before MCAR severity routing):

  • If NumericFlag.NearConstant is in cp.flags: route to Median, record in signals. Prevents kurtosis/skewness escalation from triggering KNN on a near-constant column.
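
The two early exits can be sketched as a single helper that either returns a decision or lets routing continue. This is an illustration under assumptions: the enums are stand-ins for the library's own, the function name early_exit and the signal wording are hypothetical, and the priorities that sit between 3 and 6 in the real chain are omitted. Accepting None for the tag also covers the period before NonlinearityTag is available.

```python
from enum import Enum, auto
from typing import List, Optional, Set, Tuple


class NonlinearityTag(Enum):
    Unpredictable = auto()
    Other = auto()  # hypothetical placeholder for every other tag value


class NumericFlag(Enum):
    NearConstant = auto()


def early_exit(
    nonlinearity_tag: Optional[NonlinearityTag],
    flags: Set[NumericFlag],
    missingness_label: str,
) -> Optional[Tuple[str, List[str]]]:
    """Return (strategy, signals) when a guard fires, else None so routing continues."""
    # Priority 3: a None tag (not yet computed) simply never matches.
    if nonlinearity_tag is NonlinearityTag.Unpredictable:
        return "median", [
            f"median: Unpredictable guard applied, column was {missingness_label}"
        ]
    # Priority 6: checked before any kurtosis/skewness escalation can run.
    if NumericFlag.NearConstant in flags:
        return "median", ["median: NearConstant cap, shape escalation skipped"]
    return None


guarded = early_exit(NonlinearityTag.Unpredictable, set(), "MARSuspect")
capped = early_exit(None, {NumericFlag.NearConstant}, "MCAR")
passthrough = early_exit(None, set(), "MCAR")
```

Including the missingness label in the guard's signal is one way to record both the MAR flag and the guard reason in the same audit entry.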

Updated MCAR Minor routing (currently lines 207–214):

  • After the existing Normal skew → Mean branch, add a Leptokurtic check: if kurtosis_tag == Leptokurtic, escalate to _mcar_model_strategy with severity=Minor rather than returning Mean.
  • The existing "Minor + skewed → Median" branch also becomes an escalation candidate when kurtosis_tag == Leptokurtic or skewness_severity == Severe.

Updated MCAR Moderate path (currently the fallthrough at line 217):

  • Before falling to Median, check distribution shape: if KurtosisTag.Leptokurtic OR SkewSeverity.Severe, delegate to _mcar_model_strategy.
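
The updated MCAR Minor and Moderate branches can be sketched as one small decision function. The enums below are stand-ins, SkewSeverity.Skewed is a hypothetical placeholder for whatever levels exist between Normal and Severe, and the string return values replace the real ImputationStrategy. One further assumption: the issue does not say what happens when a column is both Platykurtic and severely skewed, so this sketch lets severe skew escalate on its own.

```python
from enum import Enum, auto


class KurtosisTag(Enum):
    Leptokurtic = auto()
    Mesokurtic = auto()
    Platykurtic = auto()


class SkewSeverity(Enum):
    Normal = auto()
    Skewed = auto()  # hypothetical placeholder for levels between Normal and Severe
    Severe = auto()


def mcar_low_severity_route(
    severity: str, skew: SkewSeverity, kurtosis: KurtosisTag
) -> str:
    """Route an MCAR column at 'minor' or 'moderate' severity.

    Returns 'model' where the issue delegates to _mcar_model_strategy,
    otherwise the scalar fill ('mean' or 'median').
    """
    # Either shape signal is sufficient on its own to escalate.
    if kurtosis is KurtosisTag.Leptokurtic or skew is SkewSeverity.Severe:
        return "model"
    # Mesokurtic and Platykurtic columns keep the severity-based scalar fill.
    if severity == "minor" and skew is SkewSeverity.Normal:
        return "mean"
    return "median"
```

For example, mcar_low_severity_route("minor", SkewSeverity.Normal, KurtosisTag.Platykurtic) still yields "mean", while swapping in KurtosisTag.Leptokurtic yields "model".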

Modified: _mar_strategy

Two changes:

MAR High + empty corrs (currently the final return at line 258):

  • When severity == High and corrs is empty, delegate to _mcar_model_strategy(severity=High, ...) instead of returning Median. Records signal: "knn/regression: MAR high + no missingness correlations detected, applying MCAR High fallback chain".

Distribution shape escalation for MAR Minor/Moderate (currently falls through to the final Median return):

  • Before the final Median return, check distribution shape: if KurtosisTag.Leptokurtic OR SkewSeverity.Severe, and severity is Minor or Moderate, attempt KNN under size guards before falling to Median.
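
A sketch of the two changed branches at the tail of _mar_strategy, under assumptions. The function name mar_tail, the string severities, the boolean inputs, and the last two signal strings are hypothetical; the MCAR High chain is injected as a callable so the sketch stays self-contained. Branches that run earlier in the real function (for example when corrs is non-empty) are not shown.

```python
from typing import Callable, List, Sequence, Tuple

Decision = Tuple[str, List[str]]  # stand-in for (ImputationStrategy, list[str])


def mar_tail(
    severity: str,
    corrs: Sequence[str],
    shape_escalates: bool,
    knn_within_size_guards: bool,
    mcar_high_chain: Callable[[], Decision],
) -> Decision:
    """Cover the part of MAR routing that previously ended in a bare Median return."""
    if severity == "high" and not corrs:
        # Delegate instead of returning Median; keep the delegate's signals.
        strategy, signals = mcar_high_chain()
        note = (
            "knn/regression: MAR high + no missingness correlations detected, "
            "applying MCAR High fallback chain"
        )
        return strategy, [note, *signals]
    if severity in ("minor", "moderate") and shape_escalates and knn_within_size_guards:
        return "knn", ["knn: MAR distribution shape escalation"]
    return "median", ["median: MAR fallback"]


high = mar_tail("high", [], False, True, lambda: ("knn", ["knn: size guards passed"]))
minor = mar_tail("minor", [], True, True, lambda: ("median", []))
small = mar_tail("moderate", [], True, False, lambda: ("median", []))
```

Because the delegate decides the final strategy, a small dataset that fails the size guards inside the chain still ends up at Median, which matches the second routing test listed under Module 1.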

Modified: _mcar_model_strategy

Feature-predictability check (new, MCAR paths only):

  • Accept the pearson_matrix from StructuralProfileResult.dataset.feature_correlation.pearson_matrix as a new parameter.
  • Before routing to KNN or Regression: compute max_abs_r = max(|r| for all other numeric columns). If max_abs_r < config.mcar_feature_predictability_threshold (new configurable field in NumericImputationConfig, default 0.2): return Median with signal "median: feature-predictability check failed (max |r|={max_abs_r:.2f} < threshold)".
  • This check applies only inside _mcar_model_strategy, not in _mar_strategy.
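
The check itself is small. The sketch below assumes the Pearson matrix can be read as a nested mapping of column name to column name to r; the real pearson_matrix type is not stated in this issue. The function name predictability_check and the handling of NaN entries are also assumptions. The signal string follows the format given above.

```python
from typing import Dict, List, Optional, Tuple


def predictability_check(
    col: str,
    pearson_matrix: Dict[str, Dict[str, float]],
    threshold: float = 0.2,
) -> Optional[Tuple[str, List[str]]]:
    """Return a Median decision when no other numeric column reaches the threshold."""
    abs_rs = [
        abs(r)
        for other, r in pearson_matrix.get(col, {}).items()
        if other != col and r == r  # r == r is False for NaN, so NaN entries are skipped
    ]
    # With no usable predictors at all, treat the column as unpredictable.
    max_abs_r = max(abs_rs, default=0.0)
    if max_abs_r < threshold:
        return "median", [
            f"median: feature-predictability check failed (max |r|={max_abs_r:.2f} < threshold)"
        ]
    return None  # check passed; the caller continues to KNN / Regression routing


weak = predictability_check("x", {"x": {"x": 1.0, "y": 0.05, "z": -0.15}})
strong = predictability_check("x", {"x": {"x": 1.0, "y": -0.45}})
```

Here weak is a Median decision reporting max |r|=0.15, and strong is None because a negative correlation of 0.45 in magnitude clears the default threshold.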

Modified: NumericImputationConfig

One new field:

  • mcar_feature_predictability_threshold: float = 0.2. MCAR model-based routing is skipped when the column's maximum absolute Pearson correlation with the other numeric columns falls below this value. Serialised in to_dict() / from_dict().

Modified: NumericImputer.fit() signature

Pass StructuralProfileResult correlation data into the routing functions that need it (_mcar_model_strategy for the feature-predictability check, _fit_one for NonlinearityTag). The profile parameter already exists on fit(); routing functions need access to the column's NumericStats (for NonlinearityTag and KurtosisTag) and to dataset.feature_correlation.pearson_matrix (for the predictability check). Both are reachable from the existing profile parameter.

Candidate deep module: _StrategyRouter

Extract _fit_one, _mar_strategy, and _mcar_model_strategy into a _StrategyRouter class or module. Interface: takes (col, ColumnProfile, NumericImputationConfig, n_rows, n_features, multi_mar, pearson_matrix) and returns (ImputationStrategy, list[str]). No DataFrame access — pure signal-in, decision-out. This makes the routing logic independently testable without constructing fitting infrastructure.
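
One possible shape for such a module, sketched under assumptions: the inputs are bundled into a frozen dataclass, and the router walks an ordered list of rule callables where the first rule to return a decision wins. The names RoutingInput, Rule, and StrategyRouterSketch, the bool type for multi_mar, the nested-dict matrix type, and the rule-list design are all hypothetical; the issue only fixes the inputs and the return type.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Decision = Tuple[str, List[str]]  # stand-in for (ImputationStrategy, list[str])


@dataclass(frozen=True)
class RoutingInput:
    col: str
    profile: Any  # stand-in for ColumnProfile
    config: Any   # stand-in for NumericImputationConfig
    n_rows: int
    n_features: int
    multi_mar: bool  # assumed to be a bool
    pearson_matrix: Dict[str, Dict[str, float]] = field(default_factory=dict)


Rule = Callable[[RoutingInput], Optional[Decision]]


class StrategyRouterSketch:
    """Signal-in, decision-out: no DataFrame is ever touched."""

    def __init__(self, rules: List[Rule]) -> None:
        self._rules = list(rules)  # order encodes routing priority

    def route(self, inp: RoutingInput) -> Decision:
        for rule in self._rules:
            decision = rule(inp)
            if decision is not None:
                return decision
        return "median", ["median: no rule matched"]


def enough_rows(inp: RoutingInput) -> Optional[Decision]:
    # Toy rule used only to exercise the router.
    return ("knn", ["knn: enough rows"]) if inp.n_rows >= 100 else None


router = StrategyRouterSketch([enough_rows])
big = router.route(RoutingInput("x", None, None, 500, 4, False))
small = router.route(RoutingInput("x", None, None, 20, 4, False))
```

A unit test then only needs to build a RoutingInput and compare the returned tuple, with no fitting infrastructure involved.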

Dependency on Scope 0 (Issue #89)

NonlinearityTag.Unpredictable as a pre-routing guard requires NonlinearityTag to be computed and stored in NumericStats by NonlinearityProfiler, so the guard has a hard dependency on Issue #89. All other signals (KurtosisTag, NearConstant, pearson_matrix) are already available in Phase 1 and can be wired into routing before Issue #89 ships. Until Issue #89 is complete, the Unpredictable guard should be implemented as a no-op stub (or guarded by if nonlinearity_tag is not None).

Relationship to Scope 2 (Issue #91)

Scope 6 changes which columns are routed to MICE (more columns may qualify at Minor/Moderate severity via distribution shape escalation). Scope 2 changes MICE internals (estimator selection, dynamic max_iter). Neither blocks the other, but Scope 2's dynamic max_iter computation should be validated against the updated column sets produced by Scope 6 routing.

Testing Decisions

What makes a good test: test routing decisions through the external public interface — ColumnImputationRecord.strategy and ColumnImputationRecord.signals. Do not assert on internal branching, private method calls, or intermediate routing state. Construct minimal synthetic DataFrames with known distribution properties (controlled skewness/kurtosis via scipy) and mock or inject profile objects with specific flag combinations. Each test should isolate one routing signal at a time, varying only the signal under test while holding others constant.
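
As an illustration of building columns with known tail behaviour, the sketch below draws a heavy-tailed sample (Laplace, theoretical excess kurtosis 3) and a thin-tailed one (uniform, theoretical excess kurtosis -1.2) and measures both. It uses only the standard library so it runs anywhere; in the actual test suite, scipy's distributions and scipy.stats.kurtosis would be the natural replacements. The helper names are hypothetical, and the cut-offs DataForgeML uses to assign KurtosisTag are not stated in this issue.

```python
import random
from typing import List


def excess_kurtosis(xs: List[float]) -> float:
    """Population excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) - 3.0


def laplace_sample(n: int, rng: random.Random) -> List[float]:
    # The difference of two i.i.d. exponentials follows a Laplace distribution.
    return [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]


def uniform_sample(n: int, rng: random.Random) -> List[float]:
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


rng = random.Random(0)  # fixed seed keeps the fixture reproducible
heavy_tailed = laplace_sample(20_000, rng)
thin_tailed = uniform_sample(20_000, rng)

heavy_k = excess_kurtosis(heavy_tailed)  # clearly positive
thin_k = excess_kurtosis(thin_tailed)    # clearly negative
```

Holding the sample size and seed fixed while swapping only the distribution keeps each test focused on a single routing signal.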

Prior art: test_numeric_imputer.py (strategy assertions via ColumnImputationRecord), test_model_strategies.py (model-based strategy cases with signal inspection).

Module 1: _StrategyRouter (if extracted) or NumericImputer routing path

  • MAR High + empty corrs routes to KNN or Regression (not Median).
  • MAR High + empty corrs on a small dataset (below KNN/Regression size guards) routes to Median.
  • MCAR Moderate + KurtosisTag.Leptokurtic routes to KNN.
  • MCAR Moderate + KurtosisTag.Mesokurtic routes to Median (unchanged).
  • MCAR Minor + KurtosisTag.Leptokurtic routes to KNN (not Mean).
  • MCAR Minor + SkewSeverity.Normal + KurtosisTag.Platykurtic routes to Mean (Platykurtic does not prevent Mean).
  • MAR Minor + KurtosisTag.Leptokurtic routes to KNN.
  • MAR Moderate + SkewSeverity.Severe routes to KNN.
  • NumericFlag.NearConstant + KurtosisTag.Leptokurtic routes to Median (NearConstant cap overrides escalation).
  • NonlinearityTag.Unpredictable + MAR High routes to Median (Unpredictable overrides MAR).
  • NonlinearityTag.Unpredictable + MCAR Severe routes to Median (Unpredictable overrides Severe).
  • Signals for Unpredictable guard include both the original missingness flag and the guard reason.
  • MCAR High + max |r| < 0.2 routes to Median (feature-predictability check).
  • MCAR High + max |r| >= 0.2 routes to KNN or Regression (feature-predictability check passes).
  • MAR High + max |r| < 0.2 still routes to model-based (feature-predictability check does not apply to MAR).

Module 2: MissingnessProfiler — configurable threshold

  • Lowering mar_correlation_threshold to 0.4 in ProfileConfig causes columns with missingness correlation of 0.45 to receive MARSuspect where they previously did not.
  • Raising the threshold to 0.8 causes columns previously flagged MARSuspect at 0.65 to lose the flag.
  • Default threshold (0.6) produces identical output to the current hardcoded behaviour.

Module 3: ProfileConfig and NumericImputationConfig

  • mar_correlation_threshold defaults to 0.6; to_dict() / from_dict() round-trip preserves non-default values.
  • mcar_feature_predictability_threshold defaults to 0.2; same round-trip requirement.

Out of Scope

Further Notes

  • All signals consumed by Scope 6 routing (KurtosisTag, NumericFlag.NearConstant, CorrelationProfiler.pearson_matrix) are already computed in Phase 1 and stored in StructuralProfileResult. No new Phase 1 computation is required except for NonlinearityTag, which depends on the Scope 0 regression imputer overhaul (Issue #89: IterativeImputer, NonlinearityTag, dynamic convergence).
  • The CorrelationProfiler.pearson_matrix is already available under StructuralProfileResult.dataset.feature_correlation.pearson_matrix. The imputer's fit() method already receives profile: StructuralProfileResult, so wiring the matrix into routing functions is a parameter-threading exercise, not an architectural change.
  • The mar_correlation_threshold is a definitional threshold (it determines the MARSuspect label) and therefore belongs in ProfileConfig (Phase 1) rather than NumericImputationConfig (Phase 2). This is consistent with the existing design principle in CONTEXT.md: definitional thresholds live in the phase that produces the label.
  • ADR 0016 documents the decision that NonlinearityTag.Unpredictable unconditionally overrides MAR mechanism routing.
  • ADR 0017 documents the decision that distribution shape is a universal routing signal across all paths.
  • CONTEXT.md has been updated to reflect all routing signal changes, the updated Numeric Imputation Decision Priority, and the configurable mar_correlation_threshold.
