Problem Statement
When a data scientist uses DataForgeML to split a dataset and then run imputation, the library fails to detect four categories of unsafe split conditions that silently degrade imputation quality:
Proportional missingness underrepresentation in train goes undetected. The current imbalance check is binary: it warns only when train has zero missing values for a column the full-dataset profile says should have some. A column that is 20% missing in the full dataset but only 2% missing in train passes silently — the imputation model is trained on a nearly-clean slice and learns fill statistics that do not reflect the population.
Test-side missingness imbalance is never checked. There is no mechanism to warn when the test split has proportionally far less missingness than expected. When this happens, imputation quality cannot be evaluated on test — the fitted model runs but the test set does not exercise the missingness distribution it was trained on.
Globally sparse rows are not protected by the stratified split. Rows that are missing in many columns simultaneously — the hardest inputs for MICE — are not individually protected by the per-column missingness stratification signals. A standard multilabel stratified split can still concentrate these rows disproportionately in one partition.
fit_transform discards the FittedImputer, enabling accidental re-fit on test. The current return type is ImputationResult only. A caller who wants to impute the test set has no path other than calling fit() a second time — or, worse, calling fit_transform(test_df), which re-fits on test data. Neither path is safe, and the API gives no signal that either is wrong.
Additionally, the Joint MAR missingness signal in build_label_matrix uses raw Polars is_null() for both correlated columns instead of the effective null mask — meaning string sentinel rows are missed in the joint signal even though Phase 1 detected the MAR correlation using effective nulls.
Dependencies
Scope 5 (issue #94, Numeric Sentinel Support: end-to-end effective null coverage for user-declared sentinel values) is a partial dependency. User-declared numeric sentinels (e.g. -999) are not yet normalised by _resolve_effective_nulls, so the proportional imbalance check will undercount effective nulls for sentinel-heavy columns until Scope 5 ships. Scope 9 documents this explicitly: numeric sentinel columns are exempt from the proportional check until Scope 5 is implemented. String sentinels and float sentinels (NaN, Inf) are already normalised by the existing _resolve_effective_nulls call in fit() and are not affected by this dependency.
Solution
Extend Phase 2 imbalance checking and Phase 1 splitting to detect and surface all four classes of unsafe split conditions:
Upgrade the train-side imbalance check from binary to proportional. Replace the == 0 check with: warn when train_missing_ratio < split_imbalance_ratio_threshold × profile_effective_null_ratio. Default threshold: 0.5. Configurable in NumericImputationConfig as split_imbalance_ratio_threshold: float = 0.5. Rename SplitImbalanceWarning to TrainSplitImbalanceWarning.
Add a test-side proportional imbalance check inside FittedImputer.transform(). Apply the same proportional rule against each column's stored profile_missing_ratio (persisted in ColumnImputationRecord at fit time). Emit a new TestSplitImbalanceWarning. This fires universally regardless of which split method was used.
Implement the Compound missingness row signal in build_label_matrix. Add a RowMissingnessDistribution dataclass to MissingnessProfileResult (computed by MissingnessProfiler). The row_missingness_p90 field holds the 90th-percentile count of missing columns per row. Use it in build_label_matrix to emit a single binary label: 1 for rows missing in more columns than row_missingness_p90.
Fix the Joint MAR pair signal to use the effective null mask. Replace raw is_null() with the same dtype-driven effective null check used in the per-column missingness signal for both columns in each correlated pair.
Change fit_transform to return tuple[FittedImputer, ImputationResult]. Forces callers to receive the FittedImputer. The natural next step — fitted_imputer.transform(test_df) — is then self-evident from the unpacked tuple.
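The difference between the binary and the proportional rule can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the guard for a zero profile ratio is an assumption rather than documented behaviour.

```python
def warns_binary(split_missing_ratio: float, profile_missing_ratio: float) -> bool:
    """Old rule: warn only when the split has no missing values at all."""
    return profile_missing_ratio > 0.0 and split_missing_ratio == 0.0


def is_underrepresented(
    split_missing_ratio: float,
    profile_missing_ratio: float,
    threshold: float = 0.5,
) -> bool:
    """Proportional rule: warn when split ratio < threshold x profile ratio."""
    if profile_missing_ratio <= 0.0:
        # Assumption: a column with no expected missingness is never flagged.
        return False
    return split_missing_ratio < threshold * profile_missing_ratio


# The example from the problem statement: 20% missing overall, 2% in train.
binary_result = warns_binary(0.02, 0.20)
proportional_result = is_underrepresented(0.02, 0.20)
```

With the default threshold of 0.5, the 2% train slice is flagged by the proportional rule because 0.02 is below 0.5 × 0.20, while the binary rule stays silent.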
User Stories
As a data scientist, I want to receive a warning when my training split has proportionally far less missingness than the full dataset for any column, so that I know my imputation model was trained on an unrepresentative slice before I draw conclusions from its output.
As a data scientist, I want the missingness imbalance warning to tell me the actual train missing ratio and the profile missing ratio for each affected column, so that I can judge how severe the imbalance is without inspecting the data manually.
As a data scientist, I want the train-side imbalance threshold to be configurable, so that I can tighten it (e.g. 0.8) on high-stakes datasets or relax it (e.g. 0.3) for known-skewed splits like time-based partitions.
As a data scientist, I want to receive a warning when my test split has proportionally far less missingness than expected for any imputed column, so that I know before I evaluate imputation quality that the test set may not exercise the imputer's real-world behaviour.
As a data scientist, I want the test-side warning to fire regardless of how I split my data — whether I used profile_stratified_split, random_split, or my own custom split — so that the check is not opt-in.
As a data scientist, I want TrainSplitImbalanceWarning and TestSplitImbalanceWarning as distinct warning types, so that I can silence one without the other using warnings.filterwarnings.
As a data scientist, I want rows that are missing in many columns simultaneously to be proportionally represented in both train and test when I call profile_stratified_split, so that MICE has representative globally-sparse rows available during training.
As a data scientist, I want profile_stratified_split to include a compound missingness row signal automatically — no extra configuration required — so that globally-sparse row protection is always on.
As a data scientist, I want fit_transform to return both the FittedImputer and the imputed train ImputationResult as a tuple, so that I cannot accidentally discard the fitted imputer and am naturally guided to use it for test-set imputation.
As a data scientist, I want the Joint MAR missingness signal in the stratified split to correctly identify correlated missing pairs when missingness is expressed as string sentinels, so that MAR structure is preserved in both partitions even for string-sentinel columns.
As a data scientist, I want the train-side imbalance check to correctly count string sentinel rows (e.g. "NA", "") and float sentinel rows (NaN, Inf) as missing, so that I do not receive false warnings for columns whose missingness is expressed through sentinels rather than Polars nulls.
As a data scientist, I want the imbalance check to document explicitly that user-declared numeric sentinels (e.g. -999) are not yet covered by the proportional check, so that I understand the limitation without having to dig into the source.
As a data scientist, I want RowMissingnessDistribution to be accessible on MissingnessProfileResult after Phase 1 profiling, so that I can inspect the row-wise missingness distribution for my dataset independently of the split.
As a data scientist, I want the compound missingness row signal to be omitted from the label matrix when row_missingness_p90 == 0 (roughly 90% or more of rows have no missing values, which includes a fully populated dataset), so that the signal slot is not wasted on a dataset with little or no row-level missingness.
As a data scientist, I want TrainSplitImbalanceWarning and TestSplitImbalanceWarning to be importable directly from dataforge_ml, so that I can catch them programmatically without importing from internal submodules.
As a data scientist, I want profile_missing_ratio to appear on ColumnImputationRecord for every column with a fitted strategy, so that I can compare train missingness against profile missingness in my own post-fit analysis.
Implementation Decisions
Modules modified
MissingnessProfiler and MissingnessProfileResult
Add a new RowMissingnessDistribution dataclass with at minimum row_missingness_p90: int — the 90th-percentile count of per-row missing columns across the dataset.
Add row_distribution: RowMissingnessDistribution as a field on MissingnessProfileResult.
MissingnessProfiler computes this by calculating, for each row, the count of columns with effective nulls, then taking the 90th percentile of that row-wise distribution.
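A minimal sketch of the row-wise computation follows, using plain Python lists instead of a Polars frame so that it runs without dependencies. The nearest-rank percentile, the sentinel set ("NA" and ""), and the helper names are assumptions for illustration; the percentile method the profiler actually uses is not specified here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RowMissingnessDistribution:
    row_missingness_p90: int


def is_effective_null(value, string_sentinels=("NA", "")) -> bool:
    """True for None, NaN/Inf floats, and the assumed string sentinels."""
    if value is None:
        return True
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return isinstance(value, str) and value in string_sentinels


def row_missingness_distribution(rows) -> RowMissingnessDistribution:
    """Count effective nulls per row, then take the nearest-rank 90th percentile."""
    counts = sorted(sum(is_effective_null(v) for v in row) for row in rows)
    if not counts:
        return RowMissingnessDistribution(row_missingness_p90=0)
    rank = (9 * len(counts) + 9) // 10  # integer form of ceil(0.9 * n)
    return RowMissingnessDistribution(row_missingness_p90=counts[rank - 1])


sample_rows = [
    (1.0, "a", 3),
    (None, "NA", 4),
    (float("nan"), "b", 5),
    (2.0, "c", 6),
]
sample_distribution = row_missingness_distribution(sample_rows)
```

In the sample, the per-row counts are 0, 2, 1 and 0, so the nearest-rank 90th percentile is the largest count, 2.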
_profile_signals.build_label_matrix
Add the Compound missingness row signal using row_distribution.row_missingness_p90 from the profile. One binary label for the whole dataset: 1 if a row's effective null count exceeds p90. Signal is omitted when p90 == 0.
Fix the Joint MAR pair signal: replace raw is_null() with the same dtype-driven effective null check (string sentinels + float sentinels) used in the per-column missingness signal, applied to both columns in each correlated pair.
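Both label-matrix changes can be illustrated with plain lists. This sketch assumes that the joint label is 1 when both columns of a correlated pair are effectively null, and that "NA" and "" are the string sentinels in play; the function names are hypothetical and do not reflect the internals of build_label_matrix.

```python
import math


def is_effective_null(value, string_sentinels=("NA", "")) -> bool:
    """True for None, NaN/Inf floats, and the assumed string sentinels."""
    if value is None:
        return True
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return isinstance(value, str) and value in string_sentinels


def compound_missingness_labels(rows, p90: int):
    """One binary label per row; the signal is omitted (None) when p90 == 0."""
    if p90 == 0:
        return None
    return [1 if sum(is_effective_null(v) for v in row) > p90 else 0 for row in rows]


def joint_labels_raw(col_a, col_b):
    """Current behaviour: only true nulls count, so sentinels are missed."""
    return [1 if a is None and b is None else 0 for a, b in zip(col_a, col_b)]


def joint_labels_effective(col_a, col_b):
    """Fixed behaviour: both columns use the effective null check."""
    return [
        1 if is_effective_null(a) and is_effective_null(b) else 0
        for a, b in zip(col_a, col_b)
    ]


rows = [(1.0, "x", 3), (None, "NA", 4), (float("nan"), "", None), (2.0, "y", 5)]
compound = compound_missingness_labels(rows, p90=2)

col_a = ["NA", None, "x", ""]
col_b = [None, "", "y", "z"]
raw = joint_labels_raw(col_a, col_b)
fixed = joint_labels_effective(col_a, col_b)
```

Here the raw check labels no row as jointly missing, while the effective check labels the first two rows, which is the gap the fix closes for string-sentinel columns.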
NumericImputationConfig
Add split_imbalance_ratio_threshold: float = 0.5. Applies to both train-side and test-side checks.
ColumnImputationRecord
Add profile_missing_ratio: float. Populated during fit() from cp.missingness.effective_null_ratio for every column with a fitted strategy. Used by FittedImputer.transform() for the test-side check.
ImputationOrchestrator
Rename SplitImbalanceWarning to TrainSplitImbalanceWarning.
Upgrade _check_split_imbalance from binary to proportional: warn when train_ratio < split_imbalance_ratio_threshold × profile_ratio. The function reads the threshold from the config.
Change fit_transform return type from ImputationResult to tuple[FittedImputer, ImputationResult].
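The intended calling pattern after the return-type change might look like the sketch below. The classes are toy stand-ins that share names with the library's types; their bodies (a mean fill over a single list) and method signatures are assumptions for illustration and do not reflect the real implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ImputationResult:
    values: list


@dataclass(frozen=True)
class FittedImputer:
    fill_value: float

    def transform(self, column: list) -> ImputationResult:
        # Applies the statistic learned at fit time; never re-fits.
        return ImputationResult([self.fill_value if v is None else v for v in column])


class ImputationOrchestrator:
    def fit(self, column: list) -> FittedImputer:
        observed = [v for v in column if v is not None]
        return FittedImputer(fill_value=mean(observed))

    def fit_transform(self, column: list) -> tuple[FittedImputer, ImputationResult]:
        fitted = self.fit(column)
        return fitted, fitted.transform(column)


train = [1.0, None, 3.0]
test = [None, 10.0]

# Unpacking makes the fitted imputer impossible to drop silently.
fitted_imputer, train_result = ImputationOrchestrator().fit_transform(train)
test_result = fitted_imputer.transform(test)
```

The test column is filled with the train mean (2.0) rather than a statistic computed from test data, which is the leak the tuple return is meant to prevent.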
FittedImputer.transform()
After applying imputations, iterate over columns with a fitted strategy. For each column, compare the pre-imputation ratio test_df[col].null_count() / len(test_df) (counted on the input frame after _resolve_effective_nulls, since the imputed output no longer carries those nulls) against record.profile_missing_ratio. Emit TestSplitImbalanceWarning when the ratio falls below split_imbalance_ratio_threshold × profile_missing_ratio. The threshold is stored alongside the records (passed in from NumericImputationConfig at fit time).
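A rough sketch of the test-side check, written against plain dictionaries of lists rather than a Polars frame. The helper name check_test_split, the UserWarning base class, the record's column field, and the message wording are assumptions; only profile_missing_ratio and the proportional rule come from the description above.

```python
import warnings
from dataclasses import dataclass


class TestSplitImbalanceWarning(UserWarning):
    """Assumed base class; the real warning hierarchy may differ."""


@dataclass(frozen=True)
class ColumnImputationRecord:
    column: str
    profile_missing_ratio: float


def check_test_split(test_columns: dict, records, threshold: float = 0.5) -> list:
    """Warn for each column whose test missing ratio is below threshold x profile ratio."""
    flagged = []
    for record in records:
        values = test_columns[record.column]
        if not values or record.profile_missing_ratio <= 0.0:
            continue
        # Ratio is measured on the input values, before any fill is applied.
        test_ratio = sum(v is None for v in values) / len(values)
        if test_ratio < threshold * record.profile_missing_ratio:
            warnings.warn(
                f"{record.column}: test missing ratio {test_ratio:.3f} is below "
                f"{threshold} x profile ratio {record.profile_missing_ratio:.3f}",
                TestSplitImbalanceWarning,
                stacklevel=2,
            )
            flagged.append(record.column)
    return flagged


records = [
    ColumnImputationRecord("age", 0.20),
    ColumnImputationRecord("income", 0.10),
]
test_columns = {
    "age": [1] * 49 + [None],    # 2% missing against a 20% profile
    "income": [None] + [5] * 9,  # 10% missing against a 10% profile
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged_columns = check_test_split(test_columns, records)
```

Because the check reads only the stored records and the incoming data, it works the same way regardless of how the split was produced.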
dataforge_ml/__init__.py
Remove SplitImbalanceWarning from exports.
Add TrainSplitImbalanceWarning and TestSplitImbalanceWarning.
Warning content
Both warnings include: column name(s), their profile missing ratio, their observed split missing ratio, and the threshold that was violated.
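Keeping the two warnings as separate classes lets a caller silence one without the other through the standard warnings module. The sketch below defines local stand-in classes (assumed to derive from UserWarning) so that it runs on its own; in practice they would be imported from dataforge_ml.

```python
import warnings


class TrainSplitImbalanceWarning(UserWarning):
    """Stand-in for the train-side warning."""


class TestSplitImbalanceWarning(UserWarning):
    """Stand-in for the test-side warning."""


def emit_both_with_train_silenced() -> list:
    """Emit both warnings while ignoring only the train-side category."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Later filters take precedence, so this overrides "always" for one class.
        warnings.filterwarnings("ignore", category=TrainSplitImbalanceWarning)
        warnings.warn("train split is unrepresentative", TrainSplitImbalanceWarning)
        warnings.warn("test split is unrepresentative", TestSplitImbalanceWarning)
    return [w.category.__name__ for w in caught]


recorded = emit_both_with_train_silenced()
```

Only the test-side warning is recorded, which is useful for known-skewed training splits such as time-based partitions.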
Numeric sentinel exemption
The proportional check compares null_count() (post-_resolve_effective_nulls) against effective_null_ratio from the profile. _resolve_effective_nulls normalises string and float sentinels but not user-declared numeric sentinels. For columns with numeric sentinels, null_count() will undercount relative to effective_null_ratio. This is documented in the warning message and in ColumnImputationRecord. Full numeric sentinel support is deferred to Scope 5 (issue #94).
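The undercount can be made concrete with a small example. The resolve_effective_nulls function below is a simplified stand-in for _resolve_effective_nulls, and the profile ratio of 0.3 is an assumed value that treats the -999 entries as missing.

```python
import math


def resolve_effective_nulls(values, string_sentinels=("NA", "")):
    """Stand-in: maps string and float sentinels to None, leaves numeric sentinels alone."""
    resolved = []
    for v in values:
        if isinstance(v, float) and (math.isnan(v) or math.isinf(v)):
            resolved.append(None)
        elif isinstance(v, str) and v in string_sentinels:
            resolved.append(None)
        else:
            resolved.append(v)
    return resolved


def missing_ratio(values) -> float:
    return sum(v is None for v in values) / len(values)


# Three of ten values are effectively missing, two of them as -999.
column = [-999, -999, None, 4, 5, 6, 7, 8, 9, 10]
observed_ratio = missing_ratio(resolve_effective_nulls(column))

profile_ratio = 0.3  # assumed: the profile counts -999 as missing
would_warn = observed_ratio < 0.5 * profile_ratio
```

The split is representative (30% effectively missing), yet the observed ratio is 0.1 and the proportional rule would fire, which is why numeric sentinel columns are exempt until Scope 5 lands.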
ADR references
ADR 0020 — TestSplitImbalanceWarning lives in FittedImputer.transform() not DataSplitter
ADR [number missing] — fit_transform returns tuple[FittedImputer, ImputationResult]
ADR 0022 — Split imbalance check is proportional, not binary
Testing Decisions
A good test for this scope checks observable behaviour through public or package-internal interfaces — not internal implementation state. Specifically: does the right warning fire at the right threshold? Does build_label_matrix include the compound signal when it should and exclude it when it should not? Does fit_transform return both objects?
Modules to test:
_check_split_imbalance (train-side proportional check)
[…] train_ratio < 0.5 × profile_ratio.
[…] train_ratio >= 0.5 × profile_ratio.
[…] 0.8).
[…] tests/unit/imputation/test_orchestrator.py — existing SplitImbalanceWarning tests.
FittedImputer.transform() (test-side proportional check)
TestSplitImbalanceWarning fires when test_ratio < 0.5 × profile_ratio.
TestSplitImbalanceWarning does not fire when test ratio is adequate.
TrainSplitImbalanceWarning (from fit()) and TestSplitImbalanceWarning (from transform()) can fire independently in the same workflow.
[…] tests/unit/imputation/test_orchestrator.py.
RowMissingnessDistribution computation
row_missingness_p90 is zero for a fully populated dataset.
row_missingness_p90 reflects the actual 90th percentile of per-row missing-column counts.
[…] p90 == 1.
[…] tests/unit/profiling/test_missingness_profiler.py.
build_label_matrix — compound missingness row signal
[…] row_missingness_p90 > 0.
[…] row_missingness_p90 == 0.
[…] 1; others receive 0.
[…] tests/unit/splitting/test_data_splitter.py.
build_label_matrix — Joint MAR pair effective null fix
[…] 1 in the joint signal.
fit_transform tuple return
[…] FittedImputer.
[…] ImputationResult.
The FittedImputer returned is identical to what a standalone fit() call would return.
[…] tests/unit/imputation/test_orchestrator.py — existing fit_transform tests.
Out of Scope
User-declared numeric sentinels (e.g. -999) are not covered by the proportional check. This requires _resolve_effective_nulls to normalise numeric sentinels, which is a Scope 5 (issue #94) deliverable.
NearConstant numeric minority signal and Datetime future-date signal — both are described in CONTEXT.md as stratification signals but are not implemented. These are not part of Scope 9.
Automatic split correction — Scope 9 warns about bad splits but does not attempt to re-split or re-weight the data. The correction path remains DataSplitter.profile_stratified_split().
K-fold imbalance checking — TestSplitImbalanceWarning fires per transform() call, which covers single-split and k-fold use cases naturally. No fold-level aggregation is added.
Further Notes
SplitImbalanceWarning is currently exported from dataforge_ml/__init__.py and referenced in existing unit tests. Renaming it to TrainSplitImbalanceWarning is a breaking public API change. All existing test references to SplitImbalanceWarning must be updated as part of this scope.
The split_imbalance_ratio_threshold lives in NumericImputationConfig because the imbalance check is Phase 2's responsibility — it is consumed by the imputer, not the profiler. The profiler produces the profile ratio; the imputer evaluates the split against it.
RowMissingnessDistribution is a Phase 1 (profiling) output but its primary consumer is the splitting module. This follows the existing pattern where Phase 1 outputs (the full StructuralProfileResult) are passed into DataSplitter and ImputationOrchestrator without either phase owning the profile.