You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When a data scientist uses DataForgeML to impute a column declared as MNAR (Missing Not At Random), the library fills every null with a hardcoded sentinel constant (-1 by default) and appends a binary missingness indicator column. This produces two interconnected problems:
The sentinel is almost always an outlier. If an income column ranges from 20,000 to 200,000, filling MNAR-missing rows with -1 places them far outside the entire distribution. Linear models and distance-based models treat -1 as a real extreme value, creating a systematic directional bias for the missing rows that has nothing to do with the MNAR mechanism.
The sentinel is redundant but misleading for tree-based models. A tree model learns a split at col == -1, which is semantically equivalent to col_missing == 1. The indicator already carries this signal — the sentinel fill adds no information but pollutes the feature space with a phantom mode.
mnar_constant_fill is a global override applied identically to all MNAR columns. A column ranging 0–1 and a column ranging 0–1,000,000 both receive -1, producing inconsistent distances from the column's natural centre regardless of the column's own distribution.
The ColumnImputationRecord records the strategy as Constant, which gives no indication that this is an MNAR-specific treatment rather than a deliberate user-specified fill.
Solution
Replace the sentinel constant fill with a data-derived fill computed from the non-missing rows of each MNAR column individually, and rename the strategy to MNAR to make the audit log self-documenting:
Compute the fill from observed data. For each MNAR column, compute the fill value from the non-missing rows only: observed mean when SkewSeverity.Normal, observed median for any other severity. This reuses the same skew-driven scalar logic already applied to MCAR columns, keeping the behaviour consistent across the library.
Always add the missingness indicator. The binary {col}_missing column remains unchanged — it is the primary MNAR signal. The fill value serves only to keep Phase 2's output null-free so downstream phases receive clean input without null-awareness.
Rename the strategy from Constant to MNAR. The serialised strategy becomes "mnar". Old serialised FittedImputer objects whose strategy field reads "constant" are migrated transparently in from_dict().
Remove mnar_constant_fill from NumericImputationConfig. The fill is always data-derived and is not user-configurable at the strategy level. No safe migration value exists — the default of -1 was itself the problem being fixed.
User Stories
As a data scientist, I want MNAR column fill values to be computed from the column's own observed data rather than a hardcoded sentinel, so that downstream models do not receive phantom outlier values for MNAR-missing rows.
As a data scientist, I want the MNAR fill to use the observed mean when the column's skew is Normal and the observed median otherwise, so that the fill is consistent with how MCAR scalar fills are chosen across the rest of the library.
As a data scientist, I want the fill statistic for each MNAR column to be computed from that column's non-missing rows only, so that sentinel values and effective nulls resolved upstream do not corrupt the fill computation.
As a data scientist, I want a different fill value computed per MNAR column, so that a column ranging 0–1 and a column ranging 0–1,000,000 each receive a fill appropriate to their own distribution rather than a single global value.
As a data scientist, I want the binary missingness indicator ({col}_missing) to still be appended for every MNAR column, so that downstream models can identify which rows had structurally missing values and weight them accordingly.
As a data scientist, I want Phase 2 to still emit a null-free DataFrame after MNAR treatment, so that downstream phases (Outlier Detection, Normalization, Encoding, Scaling) receive clean input without requiring null-awareness of MNAR columns.
As a data scientist, I want the MNAR strategy to appear as "mnar" in the ColumnImputationRecord, so that audit logs unambiguously identify MNAR-treated columns and do not confuse them with arbitrary constant fills.
As a data scientist, I want the computed fill value and the skew severity that drove the mean/median choice to both appear in ColumnImputationRecord.signals, so that I can inspect exactly what fill was applied to each MNAR column without reading source code.
As a data scientist, I want MNAR fill to round correctly to the nearest integer for integer-typed columns, so that the fill is type-consistent with the column dtype and does not introduce fractional values into a discrete column.
As a data scientist, I want saved FittedImputer objects serialised before this change to still load and transform correctly, so that I do not lose previously fitted imputers when upgrading the library.
As a data scientist, I want mnar_constant_fill to no longer exist in NumericImputationConfig, so that the config does not expose a parameter whose only effect was producing biased sentinel fills.
As a data scientist, I want the MNAR treatment to still fire at the highest priority after DropCandidate, so that a user-declared MNAR column is never accidentally routed to MICE, Regression, or KNN regardless of its missingness severity or correlation structure.
As a library contributor, I want the ImputationStrategy enum to have an MNAR member with serialised value "mnar", so that the strategy name is self-documenting in code, logs, and persisted artefacts.
As a library contributor, I want FittedImputer.from_dict() to silently migrate a strategy field of "constant" to "mnar", so that old serialised imputers load correctly without manual intervention.
As a library contributor, I want the MNAR fill computation to live in the same _fit_one function that handles all other numeric strategy routing, so that the skew read and scalar computation are co-located with the rest of Priority 2 logic.
Implementation Decisions
Modified: ImputationStrategy enum
Add MNAR = "mnar". Remove Constant = "constant" or keep it deprecated-only for the migration shim. The serialised form is "mnar" going forward.
Modified: NumericImputationConfig
Remove mnar_constant_fill: float = -1 entirely, including its to_dict() and from_dict() entries. No replacement parameter — the fill is always data-derived.
Modified: _fit_one — Priority 2 block
When col in mnar_columns:
Read SkewSeverity from the column's NumericStats (already available at this priority via cp.stats). If stats are absent, fall back to median.
Compute fill: _compute_mean(train_df, col) when SkewSeverity.Normal; _compute_median(train_df, col) otherwise.
Append two signal entries: "declared MNAR by user configuration" and "mnar_fill: mean/median (skew=<severity>)".
No change: FittedImputer.transform() scalar fill path
The existing scalar fill loop already handles any strategy not in its skip list. ImputationStrategy.MNAR must not appear in that skip list — the loop will apply rec.fill_value normally. No logic change required.
Modified: FittedImputer.from_dict()
Before constructing ImputationStrategy(rec_data["strategy"]), remap the string: if the value is "constant", treat it as "mnar". This one-line migration handles all old serialised FittedImputer objects transparently.
Updated: CONTEXT.md
MNAR mechanism entry: updated to describe data-derived fill and skew-driven mean/median choice.
Imputation Strategy: Constant entry replaced with MNAR entry with the new definition.
ImputationFitDiagnostic: Constant replaced with MNAR in the list of strategies where diagnostic is None.
New: ADR 0019
Documents why the sentinel constant was replaced with a data-derived fill, why mnar_constant_fill was removed with no replacement, and the serialisation migration approach.
Testing Decisions
What makes a good test here: test external behaviour through the public interface. Do not assert on internal variable names or intermediate computations. For _fit_one, test that the returned ColumnImputationRecord has the correct strategy, fill_value, indicator_added, and signals for known inputs. For FittedImputer, test that transform() produces a null-free output and that the indicator column is present. Do not test which internal branch was taken.
Modules with tests:
_fit_one (MNAR path) — unit tests covering:
MNAR column with SkewSeverity.Normal → strategy == MNAR, fill_value equals the column mean of non-missing rows, indicator_added == True.
MNAR column with SkewSeverity.Moderate or Severe → fill_value equals the column median of non-missing rows.
MNAR column where cp.stats is absent → falls back to median without error.
signals list contains both the declaration entry and the skew/fill-choice entry.
Mirror the pattern of existing test_numeric_imputer.py MNAR test cases.
FittedImputer (MNAR path) — unit tests covering:
transform() on a DataFrame with MNAR nulls produces no nulls in the output.
The {col}_missing indicator column is present and correctly populated.
to_dict() / from_dict() round-trip preserves strategy == MNAR and the computed fill_value.
from_dict() with a legacy "constant" strategy string deserialises correctly as MNAR.
Mirror the pattern of existing test_fitted_imputer.py.
NumericImputationConfig — unit test that instantiating with a mnar_constant_fill keyword argument raises a TypeError, confirming the parameter has been removed.
Out of Scope
Categorical MNAR treatment — only numeric MNAR columns (handled by NumericImputer) are covered by this scope.
MNAR detection from data — MNAR is always user-declared via ImputationConfig.mnar_columns. Auto-detection is out of scope.
Multiple imputation for MNAR (Rubin's rules) — deferred to a future scope.
Backfilling or migrating existing serialised FittedImputer objects on disk — the from_dict() migration handles loading; no file-system migration tool is provided.
Dependencies
This scope has no hard dependencies on any open issue. The MNAR priority block (Priority 2 in _fit_one) fires before all other routing logic and is self-contained. Scopes 0–7 (#89–#96) can be implemented in any order relative to this scope.
Soft relationship: Scope 3 (#92) introduces per_column_strategy as the per-column override pattern. If a future per_column_mnar_fill override is ever added, it should follow that same pattern.
Further Notes
The observed mean/median is computed from non-missing rows only. Effective nulls (sentinels, NaN, Inf, empty strings) are resolved upstream by _resolve_effective_nulls before fit() is called, so the fill computation never sees them as real observations.
The fill value is stored in ColumnImputationRecord.fill_value and therefore survives the FittedImputer round-trip. transform() uses the stored value directly — no re-computation at transform time.
The {col}_missing indicator name convention is unchanged.
All design decisions from this scope are recorded in CONTEXT.md and ADR 0019.
Problem Statement
When a data scientist uses DataForgeML to impute a column declared as MNAR (Missing Not At Random), the library fills every null with a hardcoded sentinel constant (
-1by default) and appends a binary missingness indicator column. This produces two interconnected problems:The sentinel is almost always an outlier. If an
incomecolumn ranges from 20,000 to 200,000, filling MNAR-missing rows with-1places them far outside the entire distribution. Linear models and distance-based models treat-1as a real extreme value, creating a systematic directional bias for the missing rows that has nothing to do with the MNAR mechanism.The sentinel is redundant but misleading for tree-based models. A tree model learns a split at
col == -1, which is semantically equivalent tocol_missing == 1. The indicator already carries this signal — the sentinel fill adds no information but pollutes the feature space with a phantom mode.mnar_constant_fillis a global override applied identically to all MNAR columns. A column ranging 0–1 and a column ranging 0–1,000,000 both receive-1, producing inconsistent distances from the column's natural centre regardless of the column's own distribution.The
ColumnImputationRecordrecords the strategy asConstant, which gives no indication that this is an MNAR-specific treatment rather than a deliberate user-specified fill.Solution
Replace the sentinel constant fill with a data-derived fill computed from the non-missing rows of each MNAR column individually, and rename the strategy to
MNARto make the audit log self-documenting:Compute the fill from observed data. For each MNAR column, compute the fill value from the non-missing rows only: observed mean when
SkewSeverity.Normal, observed median for any other severity. This reuses the same skew-driven scalar logic already applied to MCAR columns, keeping the behaviour consistent across the library.Always add the missingness indicator. The binary
{col}_missingcolumn remains unchanged — it is the primary MNAR signal. The fill value serves only to keep Phase 2's output null-free so downstream phases receive clean input without null-awareness.Rename the strategy from
ConstanttoMNAR. The serialised strategy becomes"mnar". Old serialisedFittedImputerobjects whose strategy field reads"constant"are migrated transparently infrom_dict().Remove
mnar_constant_fillfromNumericImputationConfig. The fill is always data-derived and is not user-configurable at the strategy level. No safe migration value exists — the default of-1was itself the problem being fixed.User Stories
{col}_missing) to still be appended for every MNAR column, so that downstream models can identify which rows had structurally missing values and weight them accordingly."mnar"in theColumnImputationRecord, so that audit logs unambiguously identify MNAR-treated columns and do not confuse them with arbitrary constant fills.ColumnImputationRecord.signals, so that I can inspect exactly what fill was applied to each MNAR column without reading source code.FittedImputerobjects serialised before this change to still load and transform correctly, so that I do not lose previously fitted imputers when upgrading the library.mnar_constant_fillto no longer exist inNumericImputationConfig, so that the config does not expose a parameter whose only effect was producing biased sentinel fills.DropCandidate, so that a user-declared MNAR column is never accidentally routed to MICE, Regression, or KNN regardless of its missingness severity or correlation structure.ImputationStrategyenum to have anMNARmember with serialised value"mnar", so that the strategy name is self-documenting in code, logs, and persisted artefacts.FittedImputer.from_dict()to silently migrate a strategy field of"constant"to"mnar", so that old serialised imputers load correctly without manual intervention._fit_onefunction that handles all other numeric strategy routing, so that the skew read and scalar computation are co-located with the rest of Priority 2 logic.Implementation Decisions
Modified:
ImputationStrategyenumAdd
MNAR = "mnar". RemoveConstant = "constant"or keep it deprecated-only for the migration shim. The serialised form is"mnar"going forward.Modified:
NumericImputationConfigRemove
mnar_constant_fill: float = -1entirely, including itsto_dict()andfrom_dict()entries. No replacement parameter — the fill is always data-derived.Modified:
_fit_one— Priority 2 blockWhen
col in mnar_columns:SkewSeverityfrom the column'sNumericStats(already available at this priority viacp.stats). If stats are absent, fall back to median._compute_mean(train_df, col)whenSkewSeverity.Normal;_compute_median(train_df, col)otherwise.strategy=ImputationStrategy.MNAR,fill_value=<computed>,indicator_added=True."declared MNAR by user configuration"and"mnar_fill: mean/median (skew=<severity>)".No change:
FittedImputer.transform()scalar fill pathThe existing scalar fill loop already handles any strategy not in its skip list.
ImputationStrategy.MNARmust not appear in that skip list — the loop will applyrec.fill_valuenormally. No logic change required.Modified:
FittedImputer.from_dict()Before constructing
ImputationStrategy(rec_data["strategy"]), remap the string: if the value is"constant", treat it as"mnar". This one-line migration handles all old serialisedFittedImputerobjects transparently.Updated:
CONTEXT.mdConstantentry replaced withMNARentry with the new definition.ImputationStrategy.MNAR (observed mean/median fill, skew-driven) + missingness indicator.Constantreplaced withMNARin the list of strategies where diagnostic isNone.New: ADR 0019
Documents why the sentinel constant was replaced with a data-derived fill, why
mnar_constant_fillwas removed with no replacement, and the serialisation migration approach.Testing Decisions
What makes a good test here: test external behaviour through the public interface. Do not assert on internal variable names or intermediate computations. For
_fit_one, test that the returnedColumnImputationRecordhas the correctstrategy,fill_value,indicator_added, andsignalsfor known inputs. ForFittedImputer, test thattransform()produces a null-free output and that the indicator column is present. Do not test which internal branch was taken.Modules with tests:
_fit_one(MNAR path) — unit tests covering:SkewSeverity.Normal→strategy == MNAR,fill_valueequals the column mean of non-missing rows,indicator_added == True.SkewSeverity.ModerateorSevere→fill_valueequals the column median of non-missing rows.cp.statsis absent → falls back to median without error.signalslist contains both the declaration entry and the skew/fill-choice entry.test_numeric_imputer.pyMNAR test cases.FittedImputer(MNAR path) — unit tests covering:transform()on a DataFrame with MNAR nulls produces no nulls in the output.{col}_missingindicator column is present and correctly populated.to_dict()/from_dict()round-trip preservesstrategy == MNARand the computedfill_value.from_dict()with a legacy"constant"strategy string deserialises correctly asMNAR.test_fitted_imputer.py.NumericImputationConfig— unit test that instantiating with amnar_constant_fillkeyword argument raises aTypeError, confirming the parameter has been removed.Out of Scope
NumericImputer) are covered by this scope.per_column_mnar_filldict inNumericImputationConfig, consistent with the per-column override pattern introduced in Scope 3 (Scope 3: Multi-round Iteration and Feedback Loop — ImputationFitDiagnostic and suggest_refit_config #92). Not implemented here.ImputationConfig.mnar_columns. Auto-detection is out of scope.FittedImputerobjects on disk — thefrom_dict()migration handles loading; no file-system migration tool is provided.Dependencies
This scope has no hard dependencies on any open issue. The MNAR priority block (
Priority 2in_fit_one) fires before all other routing logic and is self-contained. Scopes 0–7 (#89–#96) can be implemented in any order relative to this scope.Soft relationship: Scope 3 (#92) introduces
per_column_strategyas the per-column override pattern. If a futureper_column_mnar_filloverride is ever added, it should follow that same pattern.Further Notes
_resolve_effective_nullsbeforefit()is called, so the fill computation never sees them as real observations.ColumnImputationRecord.fill_valueand therefore survives theFittedImputerround-trip.transform()uses the stored value directly — no re-computation at transform time.{col}_missingindicator name convention is unchanged.CONTEXT.mdand ADR 0019.