Scope 5: Numeric Sentinel Support — end-to-end effective null coverage for user-declared sentinel values

## Problem Statement

Users working with real-world datasets — particularly in healthcare, finance, and survey data — frequently encode missing values as domain-specific numeric sentinels such as `-999`, `9999`, or `0`. These values are semantically null but are not Polars-native nulls, so DataForgeML's profiler does not count them as missing and the imputer does not treat them as missing.

The consequence is silent data corruption: in a column that encodes missingness with sentinels, those sentinel values are treated as real observations during mean, median, mode, and model-based imputation. No error is raised and no warning is emitted, so the imputed output is subtly wrong with no indication to the user.
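
To make the size of the distortion concrete, here is a small worked example using only the standard library. The `ages` values are invented for illustration and are not taken from any DataForgeML dataset; the point is only to show how far a mean drifts when `-999` is treated as an observation.

```python
from statistics import mean

# Hypothetical `age` column in which -999 encodes "missing".
ages = [34, 41, -999, 29, -999, 52]
SENTINEL = -999

# Naive fill value: the two sentinel rows are averaged in as if they were real ages.
naive_mean = mean(ages)

# Intended fill value: sentinel rows are excluded before the statistic is computed.
observed = [a for a in ages if a != SENTINEL]
clean_mean = mean(observed)

print(naive_mean)  # -307
print(clean_mean)  # 39
```

A fill value of `-307` would be written into every missing row without any error, which is the failure mode this scope is meant to remove.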

The CONTEXT.md glossary already defines **Numeric sentinel** as a first-class effective null type, but no implementation exists in any layer of the codebase.

---

## Solution

Allow users to declare per-column numeric sentinel values in `ProfileConfig`. Once declared, the sentinel values are treated identically to Polars-native nulls throughout the entire pipeline:

- **Phase 1 (Profiling):** sentinel rows are counted in `effective_null_count` and `effective_null_ratio`, contributing to severity classification and MAR correlation detection.
- **Phase 2 (Imputation):** sentinel values are normalized to Polars-native nulls before any fit or transform operation, so mean/median/model computations never see them as real observations.
- **Serialization:** sentinel configuration survives `FittedImputer.to_dict()` / `from_dict()` round-trips so a saved imputer can correctly transform new data.

---

## User Stories

1. As a data scientist, I want to declare that `-999` means missing in my `age` column, so that the profiler counts those rows in the missingness rate rather than treating `-999` as a valid age.
2. As a data scientist, I want to declare multiple sentinel values per column (e.g. `-999` and `9999`), so that I can cover datasets where multiple codes were used for different missing-value reasons.
3. As a data scientist, I want sentinel declarations to apply to integer-typed columns as well as float-typed columns, so that I am not forced to cast my data before profiling.
4. As a data scientist, I want sentinel values to be counted in `effective_null_count` and `effective_null_ratio` in the missingness profile, so that severity classification (Minor / Moderate / High / Severe) reflects true missingness.
5. As a data scientist, I want sentinel-heavy columns (>50% sentinel rows) to receive a `DropCandidate` flag, so that columns that are mostly sentinel-null are handled the same as columns that are mostly Polars-null.
6. As a data scientist, I want sentinel rows to participate in MAR correlation detection, so that columns whose sentinel missingness correlates with other columns' missingness are correctly flagged as `MARSuspect`.
7. As a data scientist, I want sentinel values to be normalized to Polars-native nulls before imputation fits, so that mean, median, and mode computations are never contaminated by sentinel values.
8. As a data scientist, I want sentinel values in model-based imputation columns (KNN, MICE, Regression) to be normalized to NaN before the model sees the data, so that the model trains on real observations only.
9. As a data scientist, I want sentinel values in Passthrough columns on the test set to raise `UnfittedColumnError`, so that I am notified when the test set has sentinel-based missingness that the training split did not see.
10. As a data scientist, I want `SplitImbalanceWarning` to fire when all sentinel rows land in the test split and the training split has no missingness, so that I am warned about unsafe splits even when the missingness is sentinel-based.
11. As a data scientist, I want missingness indicators for MNAR columns to be set to `1` for sentinel rows, so that the indicator correctly marks which rows were originally missing regardless of whether the missingness was sentinel-based or native-null-based.
12. As a data scientist, I want to declare sentinels at the `PipelineConfig` level so that I have a single place to configure the entire pipeline without touching profiler or imputer internals.
13. As a data scientist, I want sentinel configuration to survive `FittedImputer.to_dict()` / `from_dict()` serialization, so that a saved imputer correctly normalizes sentinel values when transforming new batches of data in production.
14. As a data scientist, I want columns with no declared sentinels to be completely unaffected by this feature, so that existing pipelines continue to behave identically.
15. As a data scientist, I want sentinel declarations to work alongside the existing string sentinel and float NaN/Inf detection, so that a column can be covered by multiple effective-null rules simultaneously without conflict.

---

## Implementation Decisions

### Module changes

**`ProfileConfig`**
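- Add a `numeric_sentinels` field: a mapping from column name to a list of sentinel values. Values are stored as floats so that a single type can be compared against both integer and float columns; integer codes such as `-999` are represented exactly.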
- Update `to_dict()` and `from_dict()` to serialize and restore this field.
- Default: empty mapping (no sentinels declared — zero change to existing behaviour).
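
A minimal sketch of what the new field and its serialization might look like. The class below is a reduced stand-in that models only `numeric_sentinels`; the real `ProfileConfig` has other fields that are not shown, and the exact shape of its `to_dict()` payload is an assumption here.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ProfileConfig:
    """Reduced, hypothetical stand-in: only the new field is modelled."""

    # Column name -> sentinel values. Empty by default, so existing configs are unaffected.
    numeric_sentinels: dict[str, list[float]] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        # Coerce to plain floats so the payload is JSON-compatible.
        return {
            "numeric_sentinels": {
                col: [float(v) for v in values]
                for col, values in self.numeric_sentinels.items()
            }
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> ProfileConfig:
        # A missing key is treated as "no sentinels declared".
        raw = data.get("numeric_sentinels", {})
        return cls(
            numeric_sentinels={col: [float(v) for v in values] for col, values in raw.items()}
        )


config = ProfileConfig(numeric_sentinels={"age": [-999, 9999]})
restored = ProfileConfig.from_dict(json.loads(json.dumps(config.to_dict())))
print(restored.numeric_sentinels)  # {'age': [-999.0, 9999.0]}
```

Users can pass integer literals; the sketch coerces them to floats on the way out so that the stored form is uniform.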

**`StructuralProfileResult`**
- Mirror `numeric_sentinels` as a field on the result object.
- `StructuralProfiler` copies the value from `ProfileConfig` when building the result, so Phase 2 can read sentinels from the profile without holding a config reference.
- This preserves the existing contract: sub-processors receive `(DataFrame, list[str], StructuralProfileResult)` and need nothing else.

**`_null_detection.py`**
- Add `_numeric_sentinel_eligible(dtype)` — returns `True` for all integer and float numeric dtypes.
- This is the single authority for which dtypes may receive sentinel normalization, consistent with the existing `_sentinel_eligible` and `_inf_eligible` functions.

**`_null_normalization.py` / `_resolve_effective_nulls`**
- Add an optional `numeric_sentinels: dict[str, list[float]] | None` parameter (default `None`).
- For each column present in both the DataFrame and the sentinels dict, if the column's dtype passes `_numeric_sentinel_eligible`, add a Polars expression that replaces each matching sentinel value with null.
- Sentinel expressions are appended to the existing expression list; the function remains a pure, side-effect-free utility.
- When `numeric_sentinels` is `None` or empty, the function behaves identically to its current form.

**`MissingnessProfiler`**
- Update `_profile_column` to accept an optional `sentinels: list[float] | None` parameter.
- When sentinels are provided and the column dtype passes `_numeric_sentinel_eligible`, extend the `eff_null` boolean mask to also flag rows whose value is in the sentinel list.
- `_run()` passes `profile_config.numeric_sentinels.get(col_name)` for each column.
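
The counting rule can be illustrated without Polars. In the sketch below a plain Python list stands in for a column, and `effective_null_stats` is a hypothetical helper rather than the real `_profile_column`. Treating NaN and Inf as effective nulls mirrors the existing float detection mentioned elsewhere in this issue, and the 50% `DropCandidate` threshold is taken from user story 5.

```python
from __future__ import annotations

import math
from typing import Optional, Sequence


def effective_null_stats(
    values: Sequence[Optional[float]],
    sentinels: Sequence[float] | None = None,
) -> tuple[int, float]:
    """Return (effective_null_count, effective_null_ratio) for one column."""
    sentinel_set = set(sentinels or [])

    def is_effective_null(v: Optional[float]) -> bool:
        if v is None:
            return True  # native null
        if isinstance(v, float) and (math.isnan(v) or math.isinf(v)):
            return True  # existing NaN/Inf rule
        return v in sentinel_set  # new: declared numeric sentinel

    count = sum(is_effective_null(v) for v in values)
    ratio = count / len(values) if values else 0.0
    return count, ratio


ages = [34, None, -999, 52, -999, 41, 29, 60]

before = effective_null_stats(ages)           # sentinels not declared
after = effective_null_stats(ages, [-999.0])  # -999 declared as a sentinel

print(before)  # (1, 0.125)
print(after)   # (3, 0.375)

# DropCandidate applies above 50% effective nulls (user story 5).
drop_candidate = after[1] > 0.5
```

Native nulls and sentinel rows are additive here: one native null plus two sentinel rows gives an effective null count of three.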

**`ImputationOrchestrator.fit()`**
- When calling `_resolve_effective_nulls(train_df)`, also pass `numeric_sentinels=profile.numeric_sentinels` so that declared sentinels are normalized before fitting.
- Pass the sentinel dict to the `FittedImputer` constructor so it is stored for use in `transform()`.

**`FittedImputer`**
- Add `numeric_sentinels: dict[str, list[float]]` as a dataclass field (default: empty dict).
- In `transform()`, pass `numeric_sentinels=self.numeric_sentinels` to `_resolve_effective_nulls`.
- In `to_dict()`, serialize the field as a plain JSON-compatible dict.
- In `from_dict()`, restore the field; missing key defaults to empty dict for backwards compatibility with imputers saved before this feature.
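
The round-trip and the backwards-compatibility rule can be sketched with a toy imputer. Everything except the `numeric_sentinels` field is a stand-in: `fill_values` and the list-based `transform()` are hypothetical simplifications of the real fitted state, used only to show that a restored imputer behaves like the original and that a legacy payload still loads.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FittedImputer:
    """Toy stand-in: one constant fill value per column."""

    fill_values: dict[str, float] = field(default_factory=dict)  # hypothetical fitted state
    numeric_sentinels: dict[str, list[float]] = field(default_factory=dict)

    def transform(self, columns: dict[str, list[Optional[float]]]) -> dict[str, list]:
        out = {}
        for name, values in columns.items():
            sentinels = set(self.numeric_sentinels.get(name, []))
            fill = self.fill_values.get(name)
            filled = []
            for v in values:
                # Sentinel rows are treated exactly like native nulls.
                missing = v is None or v in sentinels
                filled.append(fill if missing and fill is not None else v)
            out[name] = filled
        return out

    def to_dict(self) -> dict[str, Any]:
        return {
            "fill_values": dict(self.fill_values),
            "numeric_sentinels": {c: list(v) for c, v in self.numeric_sentinels.items()},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> FittedImputer:
        return cls(
            fill_values=dict(data.get("fill_values", {})),
            # Payloads saved before this feature have no such key: default to empty.
            numeric_sentinels={c: list(v) for c, v in data.get("numeric_sentinels", {}).items()},
        )


imputer = FittedImputer(fill_values={"age": 39.0}, numeric_sentinels={"age": [-999.0]})
restored = FittedImputer.from_dict(json.loads(json.dumps(imputer.to_dict())))
legacy = FittedImputer.from_dict({"fill_values": {"age": 39.0}})  # old format

batch = {"age": [30, -999, None]}
print(restored.transform(batch))  # {'age': [30, 39.0, 39.0]}
print(legacy.transform(batch))    # {'age': [30, -999, 39.0]}
```

The legacy imputer leaves `-999` untouched, which matches its behaviour before this feature existed.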

### Key architectural decisions

- **Sentinels are declared once in `ProfileConfig` and flow through the system via `StructuralProfileResult`.** Phase 2 reads them from the profile, not from a separate config object. This is consistent with how Phase 2 already consumes all other Phase 1 signals.
- **`_resolve_effective_nulls` remains a pure function.** It accepts a plain `dict[str, list[float]]` rather than a domain type, keeping it decoupled from `StructuralProfileResult` and easy to test in isolation.
- **Sentinels apply to any numeric dtype**, including integers. A value of `-999` on an `Int64` column is a common real-world pattern; requiring a cast to float before profiling would be a significant ergonomic gap.
- **No auto-detection.** Sentinel values are never inferred from data distributions. This keeps Phase 1 deterministic and avoids false positives on legitimate extreme values.
- **Backwards compatibility.** `FittedImputer.from_dict()` treats a missing `numeric_sentinels` key as an empty dict, so imputers serialized before this feature load and transform correctly.

---

## Testing Decisions

### What makes a good test

Tests verify observable output behaviour given specific inputs — they do not assert on internal state, private method calls, or implementation structure. Each test constructs a minimal DataFrame or config, runs the public interface, and asserts on the returned value or raised exception.

Prior art: `test_null_normalization.py` (pure-function DataFrame-in / DataFrame-out tests), `test_fitted_imputer.py` (FittedImputer constructed directly with known records, transform output asserted), `test_missingness_profiler.py` (profiler called with a DataFrame, result counts asserted).

### Modules with full unit test coverage

**`_resolve_effective_nulls` (extended)**
- Declared sentinel on an `Int64` column is converted to null; non-sentinel values are unchanged.
- Declared sentinel on a `Float64` column is converted to null; existing NaN/Inf rules still apply independently.
- Multiple sentinels declared for one column — all are converted; any other value is unchanged.
- Sentinel declared for a column not present in the DataFrame — no error, rest of DataFrame unchanged.
- Sentinel value that matches no rows — no rows changed (no-op, no error).
- Column with no declared sentinel — completely unchanged even if it contains values that happen to equal a sentinel declared for a different column.
- `numeric_sentinels=None` — function behaves identically to current behaviour.

**`MissingnessProfiler._profile_column` (extended)**
- Sentinel values are counted in `effective_null_count` alongside standard nulls.
- `effective_null_ratio` reflects sentinel rows correctly.
- Severity classification uses the sentinel-inclusive ratio.
- When both standard nulls and sentinel values are present, counts are additive.
- When `sentinels=None`, output is identical to current behaviour.

**`FittedImputer` round-trip serialization**
- `to_dict()` includes the `numeric_sentinels` field.
- `from_dict()` restores `numeric_sentinels` correctly.
- A `FittedImputer` deserialized from a dict without `numeric_sentinels` (old format) defaults to empty dict.
- Transform output of original and deserialized imputer is identical when sentinel columns are present.

### Integration coverage

One integration test covering the full fit-transform cycle: a DataFrame with an integer column containing `-999` sentinel values is profiled (sentinels declared in `ProfileConfig`), the orchestrator fits on the training split, and `FittedImputer.transform()` on the test split produces a result where no `-999` values remain and the imputed fill value is derived from non-sentinel observations only.

---

## Out of Scope

- **Auto-detection of numeric sentinels from data.** Sentinel values must always be user-declared. Inferring sentinels from distribution gaps or extreme-value patterns is a separate, higher-risk feature.
- **Sentinel support for non-numeric dtypes.** String sentinels (`"NA"`, `""`, etc.) are already handled by the existing string branch in `_resolve_effective_nulls`. This scope adds only numeric sentinel support.
- **Per-sentinel-value metadata.** Each sentinel value is treated identically — there is no distinction between "missing because unknown" vs "missing because not applicable." That level of MNAR annotation is a future concern.
- **Sentinel values in the Public API result types.** `ColumnImputationRecord` and `ImputationResult` do not expose which rows were sentinel-null vs native-null — that distinction is erased at normalization time and is not recoverable downstream.

---

## Further Notes

- The CONTEXT.md **Effective Null** glossary entry already documents the Numeric sentinel type. Once this scope ships, the code will match the documented design contract.
- The existing three gaps described in **Scope 5** (un-normalized sentinels in `_df_to_numpy`, Passthrough check, split imbalance check) were closed by earlier commits that added `_resolve_effective_nulls` to `ImputationOrchestrator.fit()` and `FittedImputer.transform()`. Those fixes covered string and float sentinels. This scope extends coverage to user-declared numeric sentinels, which those fixes could not address because the declaration mechanism did not yet exist.
- This scope should be implemented **before Scopes 0–4** (issues #89–#93). Those scopes add new code paths to the fit and transform logic. Implementing sentinel support first ensures all new paths inherit correct normalization rather than requiring a second pass.
