Problem Statement
Users working with real-world datasets — particularly in healthcare, finance, and survey data — frequently encode missing values as domain-specific numeric sentinels such as -999, 9999, or 0. These values are semantically null but are not Polars-native nulls, so DataForgeML's profiler does not count them as missing and the imputer does not treat them as missing.
The consequence is silent data corruption: a column declared by the user to have sentinel-based missingness will have those sentinel values treated as real observations during mean, median, mode, and model-based imputation. No error is raised and no warning is emitted — the imputed output is subtly wrong with no indication to the user.
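To make the scale of the problem concrete, here is a small arithmetic sketch using only the standard library. The ages and the choice of -999 as the sentinel are invented for illustration; nothing here is DataForgeML code.

```python
from statistics import mean, median

# Hypothetical ages; -999 stands for "missing" in this made-up column.
ages = [34, 41, -999, 29, -999, 52]

# Treating the sentinels as real observations drags both statistics off.
naive_mean = mean(ages)
naive_median = median(ages)

# Dropping the sentinels first gives the values an imputer should be using.
observed = [a for a in ages if a != -999]
clean_mean = mean(observed)
clean_median = median(observed)

print("mean:  ", naive_mean, "vs", clean_mean)
print("median:", naive_median, "vs", clean_median)
```

With these six values the contaminated mean is -307 while the mean of the real observations is 39, and a mean imputer would write the former into every missing row without complaint.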
The CONTEXT.md glossary already defines Numeric sentinel as a first-class effective null type, but no implementation exists in any layer of the codebase.
Solution
Allow users to declare per-column numeric sentinel values in ProfileConfig. Once declared, the sentinel values are treated identically to Polars-native nulls throughout the entire pipeline:
Phase 1 (Profiling): sentinel rows are counted in effective_null_count and effective_null_ratio, contributing to severity classification and MAR correlation detection.
Phase 2 (Imputation): sentinel values are normalized to Polars-native nulls before any fit or transform operation, so mean/median/model computations never see them as real observations.
Serialization: sentinel configuration survives FittedImputer.to_dict() / from_dict() round-trips so a saved imputer can correctly transform new data.
User Stories
As a data scientist, I want to declare that -999 means missing in my age column, so that the profiler counts those rows in the missingness rate rather than treating -999 as a valid age.
As a data scientist, I want to declare multiple sentinel values per column (e.g. -999 and 9999), so that I can cover datasets where multiple codes were used for different missing-value reasons.
As a data scientist, I want sentinel declarations to apply to integer-typed columns as well as float-typed columns, so that I am not forced to cast my data before profiling.
As a data scientist, I want sentinel values to be counted in effective_null_count and effective_null_ratio in the missingness profile, so that severity classification (Minor / Moderate / High / Severe) reflects true missingness.
As a data scientist, I want sentinel-heavy columns (>50% sentinel rows) to receive a DropCandidate flag, so that columns that are mostly sentinel-null are handled the same as columns that are mostly Polars-null.
As a data scientist, I want sentinel rows to participate in MAR correlation detection, so that columns whose sentinel missingness correlates with other columns' missingness are correctly flagged as MARSuspect.
As a data scientist, I want sentinel values to be normalized to Polars-native nulls before imputation fits, so that mean, median, and mode computations are never contaminated by sentinel values.
As a data scientist, I want sentinel values in model-based imputation columns (KNN, MICE, Regression) to be normalized to NaN before the model sees the data, so that the model trains on real observations only.
As a data scientist, I want sentinel values in Passthrough columns on the test set to raise UnfittedColumnError, so that I am notified when the test set has sentinel-based missingness that the training split did not see.
As a data scientist, I want SplitImbalanceWarning to fire when all sentinel rows land in the test split and the training split has no missingness, so that I am warned about unsafe splits even when the missingness is sentinel-based.
As a data scientist, I want missingness indicators for MNAR columns to be set to 1 for sentinel rows, so that the indicator correctly marks which rows were originally missing regardless of whether the missingness was sentinel-based or native-null-based.
As a data scientist, I want to declare sentinels at the PipelineConfig level so that I have a single place to configure the entire pipeline without touching profiler or imputer internals.
As a data scientist, I want sentinel configuration to survive FittedImputer.to_dict() / from_dict() serialization, so that a saved imputer correctly normalizes sentinel values when transforming new batches of data in production.
As a data scientist, I want columns with no declared sentinels to be completely unaffected by this feature, so that existing pipelines continue to behave identically.
As a data scientist, I want sentinel declarations to work alongside the existing string sentinel and float NaN/Inf detection, so that a column can be covered by multiple effective-null rules simultaneously without conflict.
Implementation Decisions
Module changes
ProfileConfig
Add a numeric_sentinels field: a mapping from column name to a list of sentinel values (floats, which are type-compatible with any numeric dtype).
Update to_dict() and from_dict() to serialize and restore this field.
Default: empty mapping (no sentinels declared — zero change to existing behaviour).
StructuralProfileResult
Mirror numeric_sentinels as a field on the result object.
StructuralProfiler copies the value from ProfileConfig when building the result, so Phase 2 can read sentinels from the profile without holding a config reference.
This preserves the existing contract: sub-processors receive (DataFrame, list[str], StructuralProfileResult) and need nothing else.
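The hand-off from config to profile result described above could look roughly like the following. This is a minimal sketch: `ProfileConfigSketch`, `StructuralProfileResultSketch`, and `build_profile_result` are hypothetical stand-ins that show only the new field, not the real classes or the real `StructuralProfiler`.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileConfigSketch:
    """Hypothetical stand-in for ProfileConfig; only the new field is shown."""

    # Column name -> values that should count as missing. Empty by default,
    # so pipelines that declare nothing behave as before.
    numeric_sentinels: dict[str, list[float]] = field(default_factory=dict)


@dataclass
class StructuralProfileResultSketch:
    """Hypothetical stand-in for StructuralProfileResult."""

    numeric_sentinels: dict[str, list[float]] = field(default_factory=dict)


def build_profile_result(config: ProfileConfigSketch) -> StructuralProfileResultSketch:
    # Copy the lists so later edits to the config do not leak into the result.
    copied = {col: list(values) for col, values in config.numeric_sentinels.items()}
    return StructuralProfileResultSketch(numeric_sentinels=copied)


config = ProfileConfigSketch(numeric_sentinels={"age": [-999.0, 9999.0]})
profile = build_profile_result(config)

# Mutating the config afterwards leaves the profile untouched.
config.numeric_sentinels["age"].append(0.0)
print(profile.numeric_sentinels)
```

Copying rather than sharing the mapping is an assumption of this sketch; it keeps the result self-contained, which matches the stated goal that Phase 2 needs no config reference.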
_null_detection.py
Add _numeric_sentinel_eligible(dtype) — returns True for all integer and float numeric dtypes.
This is the single authority for which dtypes may receive sentinel normalization, consistent with the existing _sentinel_eligible and _inf_eligible functions.
_null_normalization.py / _resolve_effective_nulls
Add a numeric_sentinels: dict[str, list[float]] | None parameter (default None).
For each column present in both the DataFrame and the sentinels dict, if the column's dtype passes _numeric_sentinel_eligible, add a Polars expression that replaces each matching sentinel value with null.
Sentinel expressions are appended to the existing expression list; the function remains a pure, side-effect-free utility.
When numeric_sentinels is None or empty, the function behaves identically to its current form.
MissingnessProfiler
Update _profile_column to accept an optional sentinels: list[float] | None parameter.
When sentinels are provided and the column dtype passes _numeric_sentinel_eligible, extend the eff_null boolean mask to also flag rows whose value is in the sentinel list.
_run() passes profile_config.numeric_sentinels.get(col_name) for each column.
ImputationOrchestrator.fit()
When calling _resolve_effective_nulls(train_df), pass numeric_sentinels=profile.numeric_sentinels as the second argument.
Pass the sentinel dict to the FittedImputer constructor so it is stored for use in transform().
FittedImputer
Add numeric_sentinels: dict[str, list[float]] as a dataclass field (default: empty dict).
In transform(), pass numeric_sentinels=self.numeric_sentinels to _resolve_effective_nulls.
In to_dict(), serialize the field as a plain JSON-compatible dict.
In from_dict(), restore the field; missing key defaults to empty dict for backwards compatibility with imputers saved before this feature.
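The serialization rules above, including the old-format fallback, might be sketched like this. `FittedImputerSketch` is a hypothetical, heavily reduced stand-in, and its `fill_values` field is invented purely so there is something else to serialize.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FittedImputerSketch:
    """Hypothetical stand-in for FittedImputer with one invented extra field."""

    fill_values: dict[str, float] = field(default_factory=dict)  # assumed field
    numeric_sentinels: dict[str, list[float]] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        # Plain dicts, lists, and floats only, so json.dumps works directly.
        return {
            "fill_values": dict(self.fill_values),
            "numeric_sentinels": {
                col: [float(v) for v in values]
                for col, values in self.numeric_sentinels.items()
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FittedImputerSketch":
        # Payloads saved before this feature have no key: fall back to {}.
        raw = data.get("numeric_sentinels", {})
        return cls(
            fill_values=dict(data.get("fill_values", {})),
            numeric_sentinels={col: [float(v) for v in vals] for col, vals in raw.items()},
        )


imputer = FittedImputerSketch(fill_values={"age": 39.0}, numeric_sentinels={"age": [-999, 9999]})
restored = FittedImputerSketch.from_dict(json.loads(json.dumps(imputer.to_dict())))

# An old-format payload without the new key still loads.
legacy = FittedImputerSketch.from_dict({"fill_values": {"age": 39.0}})
print(restored.numeric_sentinels, legacy.numeric_sentinels)
```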
Key architectural decisions
Sentinels are declared once in ProfileConfig and flow through the system via StructuralProfileResult. Phase 2 reads them from the profile, not from a separate config object. This is consistent with how Phase 2 already consumes all other Phase 1 signals.
_resolve_effective_nulls remains a pure function. It accepts a plain dict[str, list[float]] rather than a domain type, keeping it decoupled from StructuralProfileResult and easy to test in isolation.
Sentinels apply to any numeric dtype, including integers. A value of -999 on an Int64 column is a common real-world pattern; requiring Float casts before profiling would be a significant ergonomic gap.
No auto-detection. Sentinel values are never inferred from data distributions. This keeps Phase 1 deterministic and avoids false positives on legitimate extreme values.
Backwards compatibility. FittedImputer.from_dict() treats a missing numeric_sentinels key as an empty dict, so imputers serialized before this feature load and transform correctly.
Testing Decisions
What makes a good test
Tests verify observable output behaviour given specific inputs — they do not assert on internal state, private method calls, or implementation structure. Each test constructs a minimal DataFrame or config, runs the public interface, and asserts on the returned value or raised exception.
Prior art: test_null_normalization.py (pure-function DataFrame-in / DataFrame-out tests), test_fitted_imputer.py (FittedImputer constructed directly with known records, transform output asserted), test_missingness_profiler.py (profiler called with a DataFrame, result counts asserted).
Modules with full unit test coverage
_resolve_effective_nulls (extended)
Declared sentinel on an Int64 column is converted to null; non-sentinel values are unchanged.
Declared sentinel on a Float64 column is converted to null; existing NaN/Inf rules still apply independently.
Multiple sentinels declared for one column — all are converted; any other value is unchanged.
Sentinel declared for a column not present in the DataFrame — no error, rest of DataFrame unchanged.
Sentinel value that matches no rows — no rows changed (no-op, no error).
Column with no declared sentinel — completely unchanged even if it contains values that happen to equal a sentinel declared for a different column.
numeric_sentinels=None — function behaves identically to current behaviour.
MissingnessProfiler._profile_column (extended)
Sentinel values are counted in effective_null_count alongside standard nulls.
effective_null_ratio reflects sentinel rows correctly.
Severity classification uses the sentinel-inclusive ratio.
When both standard nulls and sentinel values are present, counts are additive.
When sentinels=None, output is identical to current behaviour.
FittedImputer round-trip serialization
to_dict() includes the numeric_sentinels field.
from_dict() restores numeric_sentinels correctly.
A FittedImputer deserialized from a dict without numeric_sentinels (old format) defaults to empty dict.
Transform output of original and deserialized imputer is identical when sentinel columns are present.
Integration coverage
One integration test covering the full fit-transform cycle: a DataFrame with an integer column containing -999 sentinel values is profiled (sentinels declared in ProfileConfig), the orchestrator fits on the training split, and FittedImputer.transform() on the test split produces a result where no -999 values remain and the imputed fill value is derived from non-sentinel observations only.
Out of Scope
Auto-detection of numeric sentinels from data. Sentinel values must always be user-declared. Inferring sentinels from distribution gaps or extreme-value patterns is a separate, higher-risk feature.
Sentinel support for non-numeric dtypes. String sentinels ("NA", "", etc.) are already handled by the existing string branch in _resolve_effective_nulls. This scope adds only numeric sentinel support.
Per-sentinel-value metadata. Each sentinel value is treated identically — there is no distinction between "missing because unknown" vs "missing because not applicable." That level of MNAR annotation is a future concern.
Sentinel values in the Public API result types. ColumnImputationRecord and ImputationResult do not expose which rows were sentinel-null vs native-null — that distinction is erased at normalization time and is not recoverable downstream.
Further Notes
The CONTEXT.md Effective Null glossary entry already documents the Numeric sentinel type. Once this scope ships, the code will match the documented design contract.
The existing three gaps described in Scope 5 (un-normalized sentinels in _df_to_numpy, Passthrough check, split imbalance check) were closed by earlier commits that added _resolve_effective_nulls to ImputationOrchestrator.fit() and FittedImputer.transform(). Those fixes covered string and float sentinels. This scope extends coverage to user-declared numeric sentinels, which those fixes could not address because the declaration mechanism did not yet exist.