Scope 5: Numeric Sentinel Support — end-to-end effective null coverage for user-declared sentinel values #94

@DEVunderdog

Problem Statement

Users working with real-world datasets — particularly in healthcare, finance, and survey data — frequently encode missing values as domain-specific numeric sentinels such as -999, 9999, or 0. These values are semantically null but are not Polars-native nulls, so DataForgeML's profiler does not count them as missing and the imputer does not treat them as missing.

The consequence is silent data corruption: a column declared by the user to have sentinel-based missingness will have those sentinel values treated as real observations during mean, median, mode, and model-based imputation. No error is raised and no warning is emitted — the imputed output is subtly wrong with no indication to the user.
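To make the corruption concrete, here is a small arithmetic sketch in plain Python. The ages and the -999 sentinel are made-up illustration values, not data from the project. It compares the mean fill value computed with and without the sentinel rows:

```python
from statistics import mean

# Hypothetical "age" column where -999 encodes "missing".
ages = [34, 41, -999, 29, -999, 52]
SENTINEL = -999

# Naive mean: sentinel rows are treated as real observations.
contaminated_mean = mean(ages)  # (156 - 1998) / 6 = -307

# Intended mean: sentinel rows are excluded before computing the statistic.
observed = [a for a in ages if a != SENTINEL]
clean_mean = mean(observed)  # 156 / 4 = 39

print(contaminated_mean, clean_mean)
```

Any native null in this column would be filled with -307 rather than 39, and nothing in the output signals the problem.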

The CONTEXT.md glossary already defines Numeric sentinel as a first-class effective null type, but no implementation exists in any layer of the codebase.


Solution

Allow users to declare per-column numeric sentinel values in ProfileConfig. Once declared, the sentinel values are treated identically to Polars-native nulls throughout the entire pipeline:

  • Phase 1 (Profiling): sentinel rows are counted in effective_null_count and effective_null_ratio, contributing to severity classification and MAR correlation detection.
  • Phase 2 (Imputation): sentinel values are normalized to Polars-native nulls before any fit or transform operation, so mean/median/model computations never see them as real observations.
  • Serialization: sentinel configuration survives FittedImputer.to_dict() / from_dict() round-trips so a saved imputer can correctly transform new data.

User Stories

  1. As a data scientist, I want to declare that -999 means missing in my age column, so that the profiler counts those rows in the missingness rate rather than treating -999 as a valid age.
  2. As a data scientist, I want to declare multiple sentinel values per column (e.g. -999 and 9999), so that I can cover datasets where multiple codes were used for different missing-value reasons.
  3. As a data scientist, I want sentinel declarations to apply to integer-typed columns as well as float-typed columns, so that I am not forced to cast my data before profiling.
  4. As a data scientist, I want sentinel values to be counted in effective_null_count and effective_null_ratio in the missingness profile, so that severity classification (Minor / Moderate / High / Severe) reflects true missingness.
  5. As a data scientist, I want sentinel-heavy columns (>50% sentinel rows) to receive a DropCandidate flag, so that columns that are mostly sentinel-null are handled the same as columns that are mostly Polars-null.
  6. As a data scientist, I want sentinel rows to participate in MAR correlation detection, so that columns whose sentinel missingness correlates with other columns' missingness are correctly flagged as MARSuspect.
  7. As a data scientist, I want sentinel values to be normalized to Polars-native nulls before imputation fits, so that mean, median, and mode computations are never contaminated by sentinel values.
  8. As a data scientist, I want sentinel values in model-based imputation columns (KNN, MICE, Regression) to be normalized to NaN before the model sees the data, so that the model trains on real observations only.
  9. As a data scientist, I want sentinel values in Passthrough columns on the test set to raise UnfittedColumnError, so that I am notified when the test set has sentinel-based missingness that the training split did not see.
  10. As a data scientist, I want SplitImbalanceWarning to fire when all sentinel rows land in the test split and the training split has no missingness, so that I am warned about unsafe splits even when the missingness is sentinel-based.
  11. As a data scientist, I want missingness indicators for MNAR columns to be set to 1 for sentinel rows, so that the indicator correctly marks which rows were originally missing regardless of whether the missingness was sentinel-based or native-null-based.
  12. As a data scientist, I want to declare sentinels at the PipelineConfig level so that I have a single place to configure the entire pipeline without touching profiler or imputer internals.
  13. As a data scientist, I want sentinel configuration to survive FittedImputer.to_dict() / from_dict() serialization, so that a saved imputer correctly normalizes sentinel values when transforming new batches of data in production.
  14. As a data scientist, I want columns with no declared sentinels to be completely unaffected by this feature, so that existing pipelines continue to behave identically.
  15. As a data scientist, I want sentinel declarations to work alongside the existing string sentinel and float NaN/Inf detection, so that a column can be covered by multiple effective-null rules simultaneously without conflict.

Implementation Decisions

Module changes

ProfileConfig

  • Add a numeric_sentinels field: a mapping from column name to a list of sentinel values. Values are stored as floats so that a single list type can be compared against both integer and float columns.
  • Update to_dict() and from_dict() to serialize and restore this field.
  • Default: empty mapping (no sentinels declared — zero change to existing behaviour).

StructuralProfileResult

  • Mirror numeric_sentinels as a field on the result object.
  • StructuralProfiler copies the value from ProfileConfig when building the result, so Phase 2 can read sentinels from the profile without holding a config reference.
  • This preserves the existing contract: sub-processors receive (DataFrame, list[str], StructuralProfileResult) and need nothing else.

_null_detection.py

  • Add _numeric_sentinel_eligible(dtype) — returns True for all integer and float numeric dtypes.
  • This is the single authority for which dtypes may receive sentinel normalization, consistent with the existing _sentinel_eligible and _inf_eligible functions.

_null_normalization.py / _resolve_effective_nulls

  • Add an optional numeric_sentinels: dict[str, list[float]] | None parameter (default None).
  • For each column present in both the DataFrame and the sentinels dict, if the column's dtype passes _numeric_sentinel_eligible, add a Polars expression that replaces each matching sentinel value with null.
  • Sentinel expressions are appended to the existing expression list; the function remains a pure, side-effect-free utility.
  • When numeric_sentinels is None or empty, the function behaves identically to its current form.

MissingnessProfiler

  • Update _profile_column to accept an optional sentinels: list[float] | None parameter.
  • When sentinels are provided and the column dtype passes _numeric_sentinel_eligible, extend the eff_null boolean mask to also flag rows whose value is in the sentinel list.
  • _run() passes profile_config.numeric_sentinels.get(col_name) for each column.

ImputationOrchestrator.fit()

  • When calling _resolve_effective_nulls(train_df), pass numeric_sentinels=profile.numeric_sentinels as the second argument.
  • Pass the sentinel dict to the FittedImputer constructor so it is stored for use in transform().

FittedImputer

  • Add numeric_sentinels: dict[str, list[float]] as a dataclass field (default: empty dict).
  • In transform(), pass numeric_sentinels=self.numeric_sentinels to _resolve_effective_nulls.
  • In to_dict(), serialize the field as a plain JSON-compatible dict.
  • In from_dict(), restore the field; missing key defaults to empty dict for backwards compatibility with imputers saved before this feature.
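The serialization and backwards-compatibility behaviour could look roughly like this. The class below is a hypothetical stand-in that carries only the new field; the real FittedImputer has additional fitted state that is left out here.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FittedImputer:
    # Hypothetical stand-in: only the field added by this scope is modelled.
    numeric_sentinels: dict[str, list[float]] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        # Plain dict / list / float values only, so json.dumps works directly.
        return {
            "numeric_sentinels": {
                col: [float(v) for v in values]
                for col, values in self.numeric_sentinels.items()
            }
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FittedImputer":
        # Payloads saved before this feature have no key: default to empty.
        raw = data.get("numeric_sentinels", {})
        return cls(
            numeric_sentinels={col: [float(v) for v in vals] for col, vals in raw.items()}
        )


original = FittedImputer(numeric_sentinels={"age": [-999, 9999]})
payload = json.dumps(original.to_dict())
restored = FittedImputer.from_dict(json.loads(payload))
legacy = FittedImputer.from_dict({})  # old-format payload without the key
print(restored, legacy)
```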

Key architectural decisions

  • Sentinels are declared once in ProfileConfig and flow through the system via StructuralProfileResult. Phase 2 reads them from the profile, not from a separate config object. This is consistent with how Phase 2 already consumes all other Phase 1 signals.
  • _resolve_effective_nulls remains a pure function. It accepts a plain dict[str, list[float]] rather than a domain type, keeping it decoupled from StructuralProfileResult and easy to test in isolation.
  • Sentinels apply to any numeric dtype, including integers. A value of -999 on an Int64 column is a common real-world pattern; requiring Float casts before profiling would be a breaking ergonomic gap.
  • No auto-detection. Sentinel values are never inferred from data distributions. This keeps Phase 1 deterministic and avoids false positives on legitimate extreme values.
  • Backwards compatibility. FittedImputer.from_dict() treats a missing numeric_sentinels key as an empty dict, so imputers serialized before this feature load and transform correctly.

Testing Decisions

What makes a good test

Tests verify observable output behaviour given specific inputs — they do not assert on internal state, private method calls, or implementation structure. Each test constructs a minimal DataFrame or config, runs the public interface, and asserts on the returned value or raised exception.

Prior art: test_null_normalization.py (pure-function DataFrame-in / DataFrame-out tests), test_fitted_imputer.py (FittedImputer constructed directly with known records, transform output asserted), test_missingness_profiler.py (profiler called with a DataFrame, result counts asserted).

Modules with full unit test coverage

_resolve_effective_nulls (extended)

  • Declared sentinel on an Int64 column is converted to null; non-sentinel values are unchanged.
  • Declared sentinel on a Float64 column is converted to null; existing NaN/Inf rules still apply independently.
  • Multiple sentinels declared for one column — all are converted; any other value is unchanged.
  • Sentinel declared for a column not present in the DataFrame — no error, rest of DataFrame unchanged.
  • Sentinel value that matches no rows — no rows changed (no-op, no error).
  • Column with no declared sentinel — completely unchanged even if it contains values that happen to equal a sentinel declared for a different column.
  • numeric_sentinels=None — function behaves identically to current behaviour.

MissingnessProfiler._profile_column (extended)

  • Sentinel values are counted in effective_null_count alongside standard nulls.
  • effective_null_ratio reflects sentinel rows correctly.
  • Severity classification uses the sentinel-inclusive ratio.
  • When both standard nulls and sentinel values are present, counts are additive.
  • When sentinels=None, output is identical to current behaviour.

FittedImputer round-trip serialization

  • to_dict() includes the numeric_sentinels field.
  • from_dict() restores numeric_sentinels correctly.
  • A FittedImputer deserialized from a dict without numeric_sentinels (old format) defaults to empty dict.
  • Transform output of original and deserialized imputer is identical when sentinel columns are present.

Integration coverage

One integration test covering the full fit-transform cycle: a DataFrame with an integer column containing -999 sentinel values is profiled (sentinels declared in ProfileConfig), the orchestrator fits on the training split, and FittedImputer.transform() on the test split produces a result where no -999 values remain and the imputed fill value is derived from non-sentinel observations only.


Out of Scope

  • Auto-detection of numeric sentinels from data. Sentinel values must always be user-declared. Inferring sentinels from distribution gaps or extreme-value patterns is a separate, higher-risk feature.
  • Sentinel support for non-numeric dtypes. String sentinels ("NA", "", etc.) are already handled by the existing string branch in _resolve_effective_nulls. This scope adds only numeric sentinel support.
  • Per-sentinel-value metadata. Each sentinel value is treated identically — there is no distinction between "missing because unknown" vs "missing because not applicable." That level of MNAR annotation is a future concern.
  • Sentinel values in the Public API result types. ColumnImputationRecord and ImputationResult do not expose which rows were sentinel-null vs native-null — that distinction is erased at normalization time and is not recoverable downstream.

Further Notes

  • The CONTEXT.md Effective Null glossary entry already documents the Numeric sentinel type. Once this scope ships, the code will match the documented design contract.
  • The existing three gaps described in Scope 5 (un-normalized sentinels in _df_to_numpy, Passthrough check, split imbalance check) were closed by earlier commits that added _resolve_effective_nulls to ImputationOrchestrator.fit() and FittedImputer.transform(). Those fixes covered string and float sentinels. This scope extends coverage to user-declared numeric sentinels, which those fixes could not address because the declaration mechanism did not yet exist.
  • This scope should be implemented before Scopes 0–4: issues #89 (Scope 0: Regression Imputer Overhaul, covering IterativeImputer, NonlinearityTag, and dynamic convergence) through #93 (Scope 4: Imputation Quality Evaluation, covering RMSE and MAE via shared holdback). Those scopes add new code paths to the fit and transform logic. Implementing sentinel support first ensures all new paths inherit correct normalization rather than requiring a second pass.
