Scope 8: MNAR Treatment — data-derived fill replaces sentinel constant

## Problem Statement

When a data scientist uses DataForgeML to impute a column declared as MNAR (Missing Not At Random), the library fills every null with a hardcoded sentinel constant (`-1` by default) and appends a binary missingness indicator column. This produces two interconnected problems:

1. **The sentinel is almost always an outlier.** If an `income` column ranges from 20,000 to 200,000, filling MNAR-missing rows with `-1` places them far outside the entire distribution. Linear models and distance-based models treat `-1` as a real extreme value, creating a systematic directional bias for the missing rows that has nothing to do with the MNAR mechanism.

2. **The sentinel is redundant but misleading for tree-based models.** A tree model learns a split at `col == -1`, which is semantically equivalent to `col_missing == 1`. The indicator already carries this signal — the sentinel fill adds no information but pollutes the feature space with a phantom mode.

3. **`mnar_constant_fill` is a global override applied identically to all MNAR columns.** A column ranging 0–1 and a column ranging 0–1,000,000 both receive `-1`, producing inconsistent distances from the column's natural centre regardless of the column's own distribution.

The `ColumnImputationRecord` records the strategy as `Constant`, which gives no indication that this is an MNAR-specific treatment rather than a deliberate user-specified fill.

## Solution

Replace the sentinel constant fill with a **data-derived fill** computed from the non-missing rows of each MNAR column individually, and rename the strategy to `MNAR` to make the audit log self-documenting:

1. **Compute the fill from observed data.** For each MNAR column, compute the fill value from the non-missing rows only: observed mean when `SkewSeverity.Normal`, observed median for any other severity. This reuses the same skew-driven scalar logic already applied to MCAR columns, keeping the behaviour consistent across the library.

2. **Always add the missingness indicator.** The binary `{col}_missing` column remains unchanged — it is the primary MNAR signal. The fill value serves only to keep Phase 2's output null-free so downstream phases receive clean input without null-awareness.

3. **Rename the strategy from `Constant` to `MNAR`.** The serialised strategy becomes `"mnar"`. Old serialised `FittedImputer` objects whose strategy field reads `"constant"` are migrated transparently in `from_dict()`.

4. **Remove `mnar_constant_fill` from `NumericImputationConfig`.** The fill is always data-derived and is not user-configurable at the strategy level. No safe migration value exists — the default of `-1` was itself the problem being fixed.

## User Stories

1. As a data scientist, I want MNAR column fill values to be computed from the column's own observed data rather than a hardcoded sentinel, so that downstream models do not receive phantom outlier values for MNAR-missing rows.
2. As a data scientist, I want the MNAR fill to use the observed mean when the column's skew is Normal and the observed median otherwise, so that the fill is consistent with how MCAR scalar fills are chosen across the rest of the library.
3. As a data scientist, I want the fill statistic for each MNAR column to be computed from that column's non-missing rows only, so that sentinel values and effective nulls resolved upstream do not corrupt the fill computation.
4. As a data scientist, I want a different fill value computed per MNAR column, so that a column ranging 0–1 and a column ranging 0–1,000,000 each receive a fill appropriate to their own distribution rather than a single global value.
5. As a data scientist, I want the binary missingness indicator (`{col}_missing`) to still be appended for every MNAR column, so that downstream models can identify which rows had structurally missing values and weight them accordingly.
6. As a data scientist, I want Phase 2 to still emit a null-free DataFrame after MNAR treatment, so that downstream phases (Outlier Detection, Normalization, Encoding, Scaling) receive clean input without requiring null-awareness of MNAR columns.
7. As a data scientist, I want the MNAR strategy to appear as `"mnar"` in the `ColumnImputationRecord`, so that audit logs unambiguously identify MNAR-treated columns and do not confuse them with arbitrary constant fills.
8. As a data scientist, I want the computed fill value and the skew severity that drove the mean/median choice to both appear in `ColumnImputationRecord.signals`, so that I can inspect exactly what fill was applied to each MNAR column without reading source code.
9. As a data scientist, I want MNAR fill to round correctly to the nearest integer for integer-typed columns, so that the fill is type-consistent with the column dtype and does not introduce fractional values into a discrete column.
10. As a data scientist, I want saved `FittedImputer` objects serialised before this change to still load and transform correctly, so that I do not lose previously fitted imputers when upgrading the library.
11. As a data scientist, I want `mnar_constant_fill` to no longer exist in `NumericImputationConfig`, so that the config does not expose a parameter whose only effect was producing biased sentinel fills.
12. As a data scientist, I want the MNAR treatment to still fire at the highest priority after `DropCandidate`, so that a user-declared MNAR column is never accidentally routed to MICE, Regression, or KNN regardless of its missingness severity or correlation structure.
13. As a library contributor, I want the `ImputationStrategy` enum to have an `MNAR` member with serialised value `"mnar"`, so that the strategy name is self-documenting in code, logs, and persisted artefacts.
14. As a library contributor, I want `FittedImputer.from_dict()` to silently migrate a strategy field of `"constant"` to `"mnar"`, so that old serialised imputers load correctly without manual intervention.
15. As a library contributor, I want the MNAR fill computation to live in the same `_fit_one` function that handles all other numeric strategy routing, so that the skew read and scalar computation are co-located with the rest of Priority 2 logic.

## Implementation Decisions

### Modified: `ImputationStrategy` enum
Add `MNAR = "mnar"`. Remove `Constant = "constant"` or keep it deprecated-only for the migration shim. The serialised form is `"mnar"` going forward.

### Modified: `NumericImputationConfig`
Remove `mnar_constant_fill: float = -1` entirely, including its `to_dict()` and `from_dict()` entries. No replacement parameter — the fill is always data-derived.

### Modified: `_fit_one` — Priority 2 block
When `col in mnar_columns`:
- Read `SkewSeverity` from the column's `NumericStats` (already available at this priority via `cp.stats`). If stats are absent, fall back to median.
- Compute fill: `_compute_mean(train_df, col)` when `SkewSeverity.Normal`; `_compute_median(train_df, col)` otherwise.
- Return `strategy=ImputationStrategy.MNAR`, `fill_value=<computed>`, `indicator_added=True`.
- Append two signal entries: `"declared MNAR by user configuration"` and `"mnar_fill: mean/median (skew=<severity>)"`.

### No change: `FittedImputer.transform()` scalar fill path
The existing scalar fill loop already handles any strategy not in its skip list. `ImputationStrategy.MNAR` must not appear in that skip list — the loop will apply `rec.fill_value` normally. No logic change required.

### Modified: `FittedImputer.from_dict()`
Before constructing `ImputationStrategy(rec_data["strategy"])`, remap the string: if the value is `"constant"`, treat it as `"mnar"`. This one-line migration handles all old serialised `FittedImputer` objects transparently.

### Updated: `CONTEXT.md`
- **MNAR** mechanism entry: updated to describe data-derived fill and skew-driven mean/median choice.
- **Imputation Strategy**: `Constant` entry replaced with `MNAR` entry with the new definition.
- **Numeric Imputation Decision Priority**: Priority 2 updated to `ImputationStrategy.MNAR (observed mean/median fill, skew-driven) + missingness indicator`.
- **ImputationFitDiagnostic**: `Constant` replaced with `MNAR` in the list of strategies where diagnostic is `None`.

### New: ADR 0019
Documents why the sentinel constant was replaced with a data-derived fill, why `mnar_constant_fill` was removed with no replacement, and the serialisation migration approach.

## Testing Decisions

**What makes a good test here:** test external behaviour through the public interface. Do not assert on internal variable names or intermediate computations. For `_fit_one`, test that the returned `ColumnImputationRecord` has the correct `strategy`, `fill_value`, `indicator_added`, and `signals` for known inputs. For `FittedImputer`, test that `transform()` produces a null-free output and that the indicator column is present. Do not test which internal branch was taken.

**Modules with tests:**

- **`_fit_one` (MNAR path)** — unit tests covering:
  - MNAR column with `SkewSeverity.Normal` → `strategy == MNAR`, `fill_value` equals the column mean of non-missing rows, `indicator_added == True`.
  - MNAR column with `SkewSeverity.Moderate` or `Severe` → `fill_value` equals the column median of non-missing rows.
  - MNAR column where `cp.stats` is absent → falls back to median without error.
  - `signals` list contains both the declaration entry and the skew/fill-choice entry.
  - Mirror the pattern of existing `test_numeric_imputer.py` MNAR test cases.

- **`FittedImputer` (MNAR path)** — unit tests covering:
  - `transform()` on a DataFrame with MNAR nulls produces no nulls in the output.
  - The `{col}_missing` indicator column is present and correctly populated.
  - `to_dict()` / `from_dict()` round-trip preserves `strategy == MNAR` and the computed `fill_value`.
  - `from_dict()` with a legacy `"constant"` strategy string deserialises correctly as `MNAR`.
  - Mirror the pattern of existing `test_fitted_imputer.py`.

- **`NumericImputationConfig`** — unit test that instantiating with a `mnar_constant_fill` keyword argument raises a `TypeError`, confirming the parameter has been removed.

## Out of Scope

- Categorical MNAR treatment — only numeric MNAR columns (handled by `NumericImputer`) are covered by this scope.
- Per-column MNAR fill override — if a user needs a column-level fill override, that belongs in a future `per_column_mnar_fill` dict in `NumericImputationConfig`, consistent with the per-column override pattern introduced in Scope 3 (#92). Not implemented here.
- MNAR detection from data — MNAR is always user-declared via `ImputationConfig.mnar_columns`. Auto-detection is out of scope.
- Multiple imputation for MNAR (Rubin's rules) — deferred to a future scope.
- Backfilling or migrating existing serialised `FittedImputer` objects on disk — the `from_dict()` migration handles loading; no file-system migration tool is provided.

## Dependencies

This scope has **no hard dependencies** on any open issue. The MNAR priority block (`Priority 2` in `_fit_one`) fires before all other routing logic and is self-contained. Scopes 0–7 (#89–#96) can be implemented in any order relative to this scope.

Soft relationship: Scope 3 (#92) introduces `per_column_strategy` as the per-column override pattern. If a future `per_column_mnar_fill` override is ever added, it should follow that same pattern.

## Further Notes

- The observed mean/median is computed from non-missing rows only. Effective nulls (sentinels, NaN, Inf, empty strings) are resolved upstream by `_resolve_effective_nulls` before `fit()` is called, so the fill computation never sees them as real observations.
- The fill value is stored in `ColumnImputationRecord.fill_value` and therefore survives the `FittedImputer` round-trip. `transform()` uses the stored value directly — no re-computation at transform time.
- The `{col}_missing` indicator name convention is unchanged.
- All design decisions from this scope are recorded in `CONTEXT.md` and ADR 0019.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scope 8: MNAR Treatment — data-derived fill replaces sentinel constant #97

Problem Statement

Solution

User Stories

Implementation Decisions

Modified: `ImputationStrategy` enum

Modified: `NumericImputationConfig`

Modified: `_fit_one` — Priority 2 block

No change: `FittedImputer.transform()` scalar fill path

Modified: `FittedImputer.from_dict()`

Updated: `CONTEXT.md`

New: ADR 0019

Testing Decisions

Out of Scope

Dependencies

Further Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Scope 8: MNAR Treatment — data-derived fill replaces sentinel constant #97

Description

Problem Statement

Solution

User Stories

Implementation Decisions

Modified: ImputationStrategy enum

Modified: NumericImputationConfig

Modified: _fit_one — Priority 2 block

No change: FittedImputer.transform() scalar fill path

Modified: FittedImputer.from_dict()

Updated: CONTEXT.md

New: ADR 0019

Testing Decisions

Out of Scope

Dependencies

Further Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Modified: `ImputationStrategy` enum

Modified: `NumericImputationConfig`

Modified: `_fit_one` — Priority 2 block

No change: `FittedImputer.transform()` scalar fill path

Modified: `FittedImputer.from_dict()`

Updated: `CONTEXT.md`