Scope 8: MNAR Treatment — data-derived fill replaces sentinel constant #97

@DEVunderdog

Problem Statement

When a data scientist uses DataForgeML to impute a column declared as MNAR (Missing Not At Random), the library fills every null with a hardcoded sentinel constant (-1 by default) and appends a binary missingness indicator column. This produces three interconnected problems:

  1. The sentinel is almost always an outlier. If an income column ranges from 20,000 to 200,000, filling MNAR-missing rows with -1 places them far outside the entire distribution. Linear models and distance-based models treat -1 as a real extreme value, creating a systematic directional bias for the missing rows that has nothing to do with the MNAR mechanism.

  2. The sentinel is redundant but misleading for tree-based models. A tree model learns a split at col == -1, which is semantically equivalent to col_missing == 1. The indicator already carries this signal — the sentinel fill adds no information but pollutes the feature space with a phantom mode.

  3. mnar_constant_fill is a global override applied identically to all MNAR columns. A column ranging 0–1 and a column ranging 0–1,000,000 both receive -1, so the sentinel's distance from each column's natural centre depends on the column's scale rather than on anything meaningful about its distribution.
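
The distortion described in point 1 can be illustrated with a few lines of arithmetic. This is a minimal sketch using only the standard library; the income values are made up for the example and do not come from any DataForgeML dataset.

```python
from statistics import mean, median

# Illustrative income values (invented for this example).
observed = [20_000, 45_000, 60_000, 85_000, 200_000]
n_missing = 3

# Old behaviour: every MNAR-missing row receives the sentinel -1.
sentinel_filled = observed + [-1] * n_missing

# Data-derived alternative: every missing row receives the observed median.
median_filled = observed + [median(observed)] * n_missing

print("observed mean:          ", mean(observed))
print("mean after -1 fill:     ", mean(sentinel_filled))
print("mean after median fill: ", mean(median_filled))
print("min after -1 fill:      ", min(sentinel_filled))
print("min after median fill:  ", min(median_filled))
```

With the sentinel, the column minimum drops below every real observation and the mean is pulled sharply downward; with the median fill, the filled rows stay inside the observed range.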

The ColumnImputationRecord records the strategy as Constant, which gives no indication that this is an MNAR-specific treatment rather than a deliberate user-specified fill.

Solution

Replace the sentinel constant fill with a data-derived fill computed from the non-missing rows of each MNAR column individually, and rename the strategy to MNAR to make the audit log self-documenting:

  1. Compute the fill from observed data. For each MNAR column, compute the fill value from the non-missing rows only: observed mean when SkewSeverity.Normal, observed median for any other severity. This reuses the same skew-driven scalar logic already applied to MCAR columns, keeping the behaviour consistent across the library.

  2. Always add the missingness indicator. The binary {col}_missing column remains unchanged — it is the primary MNAR signal. The fill value serves only to keep Phase 2's output null-free so downstream phases receive clean input without null-awareness.

  3. Rename the strategy from Constant to MNAR. The serialised strategy becomes "mnar". Old serialised FittedImputer objects whose strategy field reads "constant" are migrated transparently in from_dict().

  4. Remove mnar_constant_fill from NumericImputationConfig. The fill is always data-derived and is not user-configurable at the strategy level. No safe migration value exists — the default of -1 was itself the problem being fixed.
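
As a rough illustration of steps 1 and 2, the sketch below derives a separate fill for two columns with very different ranges and builds the missingness indicator for one of them. It uses plain Python lists rather than the library's DataFrame types, and the helper names (is_missing, derive_fill, fill_with_indicator) are hypothetical, not DataForgeML functions.

```python
import math
from statistics import mean, median

def is_missing(value):
    """Treat None and NaN as missing (a simplification for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def derive_fill(column, skew_is_normal):
    """Mean of the observed rows when skew is Normal, median otherwise."""
    observed = [v for v in column if not is_missing(v)]
    return mean(observed) if skew_is_normal else median(observed)

def fill_with_indicator(column, fill):
    """Return the null-free column plus a 0/1 missingness indicator."""
    indicator = [1 if is_missing(v) else 0 for v in column]
    filled = [fill if is_missing(v) else v for v in column]
    return filled, indicator

# Two MNAR columns on very different scales (values invented for the example).
ratio = [0.2, None, 0.4, 0.6, None]          # assumed Normal skew
revenue = [100_000, 250_000, None, 900_000]  # assumed non-Normal skew

ratio_fill = derive_fill(ratio, skew_is_normal=True)
revenue_fill = derive_fill(revenue, skew_is_normal=False)

ratio_filled, ratio_missing = fill_with_indicator(ratio, ratio_fill)
print(ratio_fill, revenue_fill)
print(ratio_filled, ratio_missing)
```

Each column gets a fill on its own scale, and the indicator list plays the role of the {col}_missing column.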

User Stories

  1. As a data scientist, I want MNAR column fill values to be computed from the column's own observed data rather than a hardcoded sentinel, so that downstream models do not receive phantom outlier values for MNAR-missing rows.
  2. As a data scientist, I want the MNAR fill to use the observed mean when the column's skew is Normal and the observed median otherwise, so that the fill is consistent with how MCAR scalar fills are chosen across the rest of the library.
  3. As a data scientist, I want the fill statistic for each MNAR column to be computed from that column's non-missing rows only, so that sentinel values and effective nulls resolved upstream do not corrupt the fill computation.
  4. As a data scientist, I want a different fill value computed per MNAR column, so that a column ranging 0–1 and a column ranging 0–1,000,000 each receive a fill appropriate to their own distribution rather than a single global value.
  5. As a data scientist, I want the binary missingness indicator ({col}_missing) to still be appended for every MNAR column, so that downstream models can identify which rows had structurally missing values and weight them accordingly.
  6. As a data scientist, I want Phase 2 to still emit a null-free DataFrame after MNAR treatment, so that downstream phases (Outlier Detection, Normalization, Encoding, Scaling) receive clean input without requiring null-awareness of MNAR columns.
  7. As a data scientist, I want the MNAR strategy to appear as "mnar" in the ColumnImputationRecord, so that audit logs unambiguously identify MNAR-treated columns and do not confuse them with arbitrary constant fills.
  8. As a data scientist, I want the computed fill value and the skew severity that drove the mean/median choice to both appear in ColumnImputationRecord.signals, so that I can inspect exactly what fill was applied to each MNAR column without reading source code.
  9. As a data scientist, I want MNAR fill to round correctly to the nearest integer for integer-typed columns, so that the fill is type-consistent with the column dtype and does not introduce fractional values into a discrete column.
  10. As a data scientist, I want saved FittedImputer objects serialised before this change to still load and transform correctly, so that I do not lose previously fitted imputers when upgrading the library.
  11. As a data scientist, I want mnar_constant_fill to no longer exist in NumericImputationConfig, so that the config does not expose a parameter whose only effect was producing biased sentinel fills.
  12. As a data scientist, I want the MNAR treatment to still fire at the highest priority after DropCandidate, so that a user-declared MNAR column is never accidentally routed to MICE, Regression, or KNN regardless of its missingness severity or correlation structure.
  13. As a library contributor, I want the ImputationStrategy enum to have an MNAR member with serialised value "mnar", so that the strategy name is self-documenting in code, logs, and persisted artefacts.
  14. As a library contributor, I want FittedImputer.from_dict() to silently migrate a strategy field of "constant" to "mnar", so that old serialised imputers load correctly without manual intervention.
  15. As a library contributor, I want the MNAR fill computation to live in the same _fit_one function that handles all other numeric strategy routing, so that the skew read and scalar computation are co-located with the rest of Priority 2 logic.

Implementation Decisions

Modified: ImputationStrategy enum

Add MNAR = "mnar". Remove Constant = "constant" or keep it deprecated-only for the migration shim. The serialised form is "mnar" going forward.

Modified: NumericImputationConfig

Remove mnar_constant_fill: float = -1 entirely, including its to_dict() and from_dict() entries. No replacement parameter — the fill is always data-derived.

Modified: _fit_one — Priority 2 block

When col in mnar_columns:

  • Read SkewSeverity from the column's NumericStats (already available at this priority via cp.stats). If stats are absent, fall back to median.
  • Compute fill: _compute_mean(train_df, col) when SkewSeverity.Normal; _compute_median(train_df, col) otherwise.
  • Return strategy=ImputationStrategy.MNAR, fill_value=<computed>, indicator_added=True.
  • Append two signal entries: "declared MNAR by user configuration" and "mnar_fill: mean/median (skew=<severity>)".
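
The Priority 2 block above might look roughly like the following. This is a sketch under stated assumptions, not the library's code: the enum values other than "mnar", the record's field types, the fit_mnar_column name, and the "unknown" label used when stats are absent are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean, median
from typing import List, Optional

class SkewSeverity(Enum):
    # Member names follow the issue text; the string values are assumed.
    Normal = "normal"
    Moderate = "moderate"
    Severe = "severe"

class ImputationStrategy(Enum):
    MNAR = "mnar"

@dataclass
class ColumnImputationRecord:
    # Simplified stand-in for the real record type.
    column: str
    strategy: ImputationStrategy
    fill_value: float
    indicator_added: bool
    signals: List[str] = field(default_factory=list)

def fit_mnar_column(col, values, skew: Optional[SkewSeverity]):
    """Build the MNAR record; skew=None models absent stats (median fallback)."""
    observed = [v for v in values if v is not None]
    use_mean = skew is SkewSeverity.Normal
    statistic = "mean" if use_mean else "median"
    fill = mean(observed) if use_mean else median(observed)
    skew_label = skew.value if skew is not None else "unknown"
    return ColumnImputationRecord(
        column=col,
        strategy=ImputationStrategy.MNAR,
        fill_value=fill,
        indicator_added=True,
        signals=[
            "declared MNAR by user configuration",
            f"mnar_fill: {statistic} (skew={skew_label})",
        ],
    )

normal_rec = fit_mnar_column("income", [10, None, 20, 60], SkewSeverity.Normal)
fallback_rec = fit_mnar_column("income", [10, None, 20, 60], None)
print(normal_rec)
print(fallback_rec)
```

Keeping the statistic name and the skew label in the second signal entry means the audit log explains both which fill was chosen and why.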

No change: FittedImputer.transform() scalar fill path

The existing scalar fill loop already handles any strategy not in its skip list. ImputationStrategy.MNAR must not appear in that skip list — the loop will apply rec.fill_value normally. No logic change required.
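
To make the skip-list point concrete, here is a minimal sketch of a scalar fill loop. The contents of the skip list, the row representation (a list of dicts), and the apply_scalar_fills name are assumptions for illustration; the real FittedImputer.transform() is not shown in this issue.

```python
from enum import Enum

class ImputationStrategy(Enum):
    MNAR = "mnar"
    KNN = "knn"    # serialised value assumed
    MICE = "mice"  # serialised value assumed

# Hypothetical skip list: strategies filled by a dedicated model-based path.
# ImputationStrategy.MNAR is deliberately NOT in it.
SKIP_SCALAR_FILL = {ImputationStrategy.KNN, ImputationStrategy.MICE}

def apply_scalar_fills(rows, records):
    """rows: list of dicts; records: {column: (strategy, fill_value)}."""
    out = [dict(row) for row in rows]  # copy so the input is not mutated
    for col, (strategy, fill_value) in records.items():
        if strategy in SKIP_SCALAR_FILL:
            continue  # left for the model-based path
        for row in out:
            if row[col] is None:
                row[col] = fill_value
    return out

rows = [{"income": None, "age": None}, {"income": 50.0, "age": 30}]
records = {
    "income": (ImputationStrategy.MNAR, 42.0),
    "age": (ImputationStrategy.KNN, None),
}
filled_rows = apply_scalar_fills(rows, records)
print(filled_rows)
```

Because MNAR is absent from the skip list, the stored fill is applied by the same loop that serves every other scalar strategy.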

Modified: FittedImputer.from_dict()

Before constructing ImputationStrategy(rec_data["strategy"]), remap the string: if the value is "constant", treat it as "mnar". This one-line migration handles all old serialised FittedImputer objects transparently.
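
A minimal sketch of that remap, assuming the strategy is stored as a plain string in the serialised record. The parse_strategy helper, the _LEGACY_STRATEGY_NAMES table, and the extra Median member are hypothetical; only the "constant" to "mnar" mapping comes from this issue.

```python
from enum import Enum

class ImputationStrategy(Enum):
    MNAR = "mnar"
    Median = "median"  # hypothetical second member, for contrast

# Legacy serialised names mapped to their current equivalents.
_LEGACY_STRATEGY_NAMES = {"constant": "mnar"}

def parse_strategy(raw: str) -> ImputationStrategy:
    """Remap legacy names before the enum lookup; other values pass through."""
    return ImputationStrategy(_LEGACY_STRATEGY_NAMES.get(raw, raw))

old_record = {"strategy": "constant", "fill_value": -1}
new_record = {"strategy": "mnar", "fill_value": 61250.0}

print(parse_strategy(old_record["strategy"]))
print(parse_strategy(new_record["strategy"]))
```

Note that the remap only changes the strategy name: an old record still carries whatever fill_value it was serialised with. Unknown strings still raise ValueError from the enum lookup, so the shim does not mask corrupted files.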

Updated: CONTEXT.md

  • MNAR mechanism entry: updated to describe data-derived fill and skew-driven mean/median choice.
  • Imputation Strategy: Constant entry replaced with MNAR entry with the new definition.
  • Numeric Imputation Decision Priority: Priority 2 updated to ImputationStrategy.MNAR (observed mean/median fill, skew-driven) + missingness indicator.
  • ImputationFitDiagnostic: Constant replaced with MNAR in the list of strategies where diagnostic is None.

New: ADR 0019

Documents why the sentinel constant was replaced with a data-derived fill, why mnar_constant_fill was removed with no replacement, and the serialisation migration approach.

Testing Decisions

What makes a good test here: test external behaviour through the public interface. Do not assert on internal variable names or intermediate computations. For _fit_one, test that the returned ColumnImputationRecord has the correct strategy, fill_value, indicator_added, and signals for known inputs. For FittedImputer, test that transform() produces a null-free output and that the indicator column is present. Do not test which internal branch was taken.

Modules with tests:

  • _fit_one (MNAR path) — unit tests covering:

    • MNAR column with SkewSeverity.Normal → strategy == MNAR, fill_value equals the column mean of non-missing rows, indicator_added == True.
    • MNAR column with SkewSeverity.Moderate or Severe → fill_value equals the column median of non-missing rows.
    • MNAR column where cp.stats is absent → falls back to median without error.
    • signals list contains both the declaration entry and the skew/fill-choice entry.
    • Mirror the pattern of existing test_numeric_imputer.py MNAR test cases.
  • FittedImputer (MNAR path) — unit tests covering:

    • transform() on a DataFrame with MNAR nulls produces no nulls in the output.
    • The {col}_missing indicator column is present and correctly populated.
    • to_dict() / from_dict() round-trip preserves strategy == MNAR and the computed fill_value.
    • from_dict() with a legacy "constant" strategy string deserialises correctly as MNAR.
    • Mirror the pattern of existing test_fitted_imputer.py.
  • NumericImputationConfig — unit test that instantiating with a mnar_constant_fill keyword argument raises a TypeError, confirming the parameter has been removed.
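
The last test in the list could be sketched as follows, assuming NumericImputationConfig is a dataclass (or otherwise has a fixed keyword signature). The stand-in class, its placeholder field, and the accepts_kwargs helper are hypothetical; a real test would import the config from the library and would likely use the project's existing test framework.

```python
from dataclasses import dataclass

@dataclass
class NumericImputationConfig:
    # Hypothetical placeholder; the real config's fields are not listed here.
    placeholder_option: int = 0

def accepts_kwargs(cls, **kwargs) -> bool:
    """Return True if cls(**kwargs) constructs without a TypeError."""
    try:
        cls(**kwargs)
    except TypeError:
        return False
    return True

def test_mnar_constant_fill_is_rejected():
    # The removed parameter must no longer be accepted...
    assert not accepts_kwargs(NumericImputationConfig, mnar_constant_fill=-1)
    # ...while default construction keeps working.
    assert accepts_kwargs(NumericImputationConfig)

test_mnar_constant_fill_is_rejected()
```

Asserting on the TypeError checks the public constructor signature only, which matches the guidance above about not testing internals.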

Out of Scope

  • Categorical MNAR treatment — only numeric MNAR columns (handled by NumericImputer) are covered by this scope.
  • Per-column MNAR fill override — if a user needs a column-level fill override, that belongs in a future per_column_mnar_fill dict in NumericImputationConfig, consistent with the per-column override pattern introduced in Scope 3 (Scope 3: Multi-round Iteration and Feedback Loop — ImputationFitDiagnostic and suggest_refit_config #92). Not implemented here.
  • MNAR detection from data — MNAR is always user-declared via ImputationConfig.mnar_columns. Auto-detection is out of scope.
  • Multiple imputation for MNAR (Rubin's rules) — deferred to a future scope.
  • Backfilling or migrating existing serialised FittedImputer objects on disk — the from_dict() migration handles loading; no file-system migration tool is provided.

Dependencies

This scope has no hard dependencies on any open issue. The MNAR priority block (Priority 2 in _fit_one) fires before all other routing logic and is self-contained. Scopes 0–7 (#89–#96) can be implemented in any order relative to this scope.

Soft relationship: Scope 3 (#92) introduces per_column_strategy as the per-column override pattern. If a future per_column_mnar_fill override is ever added, it should follow that same pattern.

Further Notes

  • The observed mean/median is computed from non-missing rows only. Effective nulls (sentinels, NaN, Inf, empty strings) are resolved upstream by _resolve_effective_nulls before fit() is called, so the fill computation never sees them as real observations.
  • The fill value is stored in ColumnImputationRecord.fill_value and therefore survives the FittedImputer round-trip. transform() uses the stored value directly — no re-computation at transform time.
  • The {col}_missing indicator name convention is unchanged.
  • All design decisions from this scope are recorded in CONTEXT.md and ADR 0019.
