Scope 9: Split Design and Imbalance Checking — proportional checks, test-side warning, compound missingness signal, fit_transform tuple return #98

@DEVunderdog

Problem Statement

When a data scientist uses DataForgeML to split a dataset and then run imputation, the library fails to detect four categories of unsafe split conditions that silently degrade imputation quality:

  1. Proportional missingness underrepresentation in train goes undetected. The current imbalance check is binary: it warns only when train has zero missing values for a column the full-dataset profile says should have some. A column that is 20% missing in the full dataset but only 2% missing in train passes silently — the imputation model is trained on a nearly-clean slice and learns fill statistics that do not reflect the population.

  2. Test-side missingness imbalance is never checked. There is no mechanism to warn when the test split has proportionally far less missingness than expected. When this happens, imputation quality cannot be evaluated on test — the fitted model runs but the test set does not exercise the missingness distribution it was trained on.

  3. Globally sparse rows are not protected by the stratified split. Rows that are missing in many columns simultaneously — the hardest inputs for MICE — are not individually protected by the per-column missingness stratification signals. A standard multilabel stratified split can still concentrate these rows disproportionately in one partition.

  4. fit_transform discards the FittedImputer, enabling accidental re-fit on test. The current return type is ImputationResult only. A caller who wants to impute the test set has no path other than calling fit() a second time — or, worse, calling fit_transform(test_df), which re-fits on test data. Neither path is safe, and the API gives no signal that either is wrong.

Additionally, the Joint MAR missingness signal in build_label_matrix uses raw Polars is_null() for both correlated columns instead of the effective null mask — meaning string sentinel rows are missed in the joint signal even though Phase 1 detected the MAR correlation using effective nulls.

Dependencies

  • Scope 5 (issue #94, Numeric Sentinel Support) is a partial dependency. User-declared numeric sentinels (e.g. -999) are not yet normalised by _resolve_effective_nulls, so the proportional imbalance check will undercount effective nulls for sentinel-heavy columns until Scope 5 ships. Scope 9 documents this explicitly: numeric sentinel columns are exempt from the proportional check until Scope 5 is implemented. String sentinels and float sentinels (NaN, Inf) are already normalised by the existing _resolve_effective_nulls call in fit() and are not affected by this dependency.

Solution

Extend Phase 2 imbalance checking and Phase 1 splitting to detect and surface all four classes of unsafe split conditions, and fix the Joint MAR signal so that it uses effective nulls:

  1. Upgrade the train-side imbalance check from binary to proportional. Replace the == 0 check with: warn when train_missing_ratio < split_imbalance_ratio_threshold × profile_effective_null_ratio. Default threshold: 0.5. Configurable in NumericImputationConfig as split_imbalance_ratio_threshold: float = 0.5. Rename SplitImbalanceWarning to TrainSplitImbalanceWarning.

  2. Add a test-side proportional imbalance check inside FittedImputer.transform(). Apply the same proportional rule against each column's stored profile_missing_ratio (persisted in ColumnImputationRecord at fit time). Emit a new TestSplitImbalanceWarning. This fires universally regardless of which split method was used.

  3. Implement the Compound missingness row signal in build_label_matrix. Add a RowMissingnessDistribution dataclass to MissingnessProfileResult (computed by MissingnessProfiler). The row_missingness_p90 field holds the 90th-percentile count of missing columns per row. Use it in build_label_matrix to emit a single binary label: 1 for rows missing in more columns than row_missingness_p90.

  4. Fix the Joint MAR pair signal to use the effective null mask. Replace raw is_null() with the same dtype-driven effective null check used in the per-column missingness signal for both columns in each correlated pair.

  5. Change fit_transform to return tuple[FittedImputer, ImputationResult]. Forces callers to receive the FittedImputer. The natural next step — fitted_imputer.transform(test_df) — is then self-evident from the unpacked tuple.
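
The proportional rule in item 1 can be sketched as a small predicate. This is only an illustration of the inequality stated above: the function name `is_underrepresented` is hypothetical, and the early return for a zero profile ratio is an assumption that mirrors the "no warning when profile reports zero missingness" test case later in this issue.

```python
def is_underrepresented(
    split_ratio: float,
    profile_ratio: float,
    threshold: float = 0.5,  # default of split_imbalance_ratio_threshold
) -> bool:
    """Return True when the split carries proportionally too little missingness."""
    if profile_ratio <= 0.0:
        # Assumption: nothing to compare against when the profile has no missingness.
        return False
    return split_ratio < threshold * profile_ratio


# Example from the problem statement: 20% missing overall, 2% missing in train.
# 0.02 < 0.5 * 0.20 = 0.10, so the binary check passes but the proportional one warns.
print(is_underrepresented(0.02, 0.20))
# A train split at 12% missing clears the 10% floor.
print(is_underrepresented(0.12, 0.20))
```

Raising the threshold to 0.8 moves the floor for the same column to 16%, which is the "tighten on high-stakes datasets" case from the user stories.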

User Stories

  1. As a data scientist, I want to receive a warning when my training split has proportionally far less missingness than the full dataset for any column, so that I know my imputation model was trained on an unrepresentative slice before I draw conclusions from its output.
  2. As a data scientist, I want the missingness imbalance warning to tell me the actual train missing ratio and the profile missing ratio for each affected column, so that I can judge how severe the imbalance is without inspecting the data manually.
  3. As a data scientist, I want the train-side imbalance threshold to be configurable, so that I can tighten it (e.g. 0.8) on high-stakes datasets or relax it (e.g. 0.3) for known-skewed splits like time-based partitions.
  4. As a data scientist, I want to receive a warning when my test split has proportionally far less missingness than expected for any imputed column, so that I know before I evaluate imputation quality that the test set may not exercise the imputer's real-world behaviour.
  5. As a data scientist, I want the test-side warning to fire regardless of how I split my data — whether I used profile_stratified_split, random_split, or my own custom split — so that the check is not opt-in.
  6. As a data scientist, I want TrainSplitImbalanceWarning and TestSplitImbalanceWarning as distinct warning types, so that I can silence one without the other using warnings.filterwarnings.
  7. As a data scientist, I want rows that are missing in many columns simultaneously to be proportionally represented in both train and test when I call profile_stratified_split, so that MICE has representative globally-sparse rows available during training.
  8. As a data scientist, I want profile_stratified_split to include a compound missingness row signal automatically — no extra configuration required — so that globally-sparse row protection is always on.
  9. As a data scientist, I want fit_transform to return both the FittedImputer and the imputed train ImputationResult as a tuple, so that I cannot accidentally discard the fitted imputer and am naturally guided to use it for test-set imputation.
  10. As a data scientist, I want the Joint MAR missingness signal in the stratified split to correctly identify correlated missing pairs when missingness is expressed as string sentinels, so that MAR structure is preserved in both partitions even for string-sentinel columns.
  11. As a data scientist, I want the train-side imbalance check to correctly count string sentinel rows (e.g. "NA", "") and float sentinel rows (NaN, Inf) as missing, so that I do not receive false warnings for columns whose missingness is expressed through sentinels rather than Polars nulls.
  12. As a data scientist, I want the imbalance check to document explicitly that user-declared numeric sentinels (e.g. -999) are not yet covered by the proportional check, so that I understand the limitation without having to dig into the source.
  13. As a data scientist, I want RowMissingnessDistribution to be accessible on MissingnessProfileResult after Phase 1 profiling, so that I can inspect the row-wise missingness distribution for my dataset independently of the split.
  14. As a data scientist, I want the compound missingness row signal to be omitted from the label matrix when row_missingness_p90 == 0 (no row has any missing values), so that the signal slot is not wasted on a dataset with no missingness.
  15. As a data scientist, I want TrainSplitImbalanceWarning and TestSplitImbalanceWarning to be importable directly from dataforge_ml, so that I can catch them programmatically without importing from internal submodules.
  16. As a data scientist, I want profile_missing_ratio to appear on ColumnImputationRecord for every column with a fitted strategy, so that I can compare train missingness against profile missingness in my own post-fit analysis.
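
Story 6 relies on the two warnings being separate classes. The sketch below shows how a caller could silence one and keep the other using only the standard `warnings` module. The two classes defined here are stand-ins for the ones this scope would export from dataforge_ml; deriving them from `UserWarning` is an assumption, not something this issue specifies.

```python
import warnings


# Stand-ins for the classes this scope would export (base class is assumed).
class TrainSplitImbalanceWarning(UserWarning): ...
class TestSplitImbalanceWarning(UserWarning): ...


def emit_both() -> None:
    """Simulate a workflow in which both checks fire."""
    warnings.warn("train split imbalance", TrainSplitImbalanceWarning)
    warnings.warn("test split imbalance", TestSplitImbalanceWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Silence only the train-side warning; the test-side one still surfaces.
    warnings.filterwarnings("ignore", category=TrainSplitImbalanceWarning)
    emit_both()

caught_types = [w.category for w in caught]
print(caught_types)
```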

Implementation Decisions

Modules modified

MissingnessProfiler and MissingnessProfileResult

  • Add a new RowMissingnessDistribution dataclass with at minimum row_missingness_p90: int — the 90th-percentile count of per-row missing columns across the dataset.
  • Add row_distribution: RowMissingnessDistribution as a field on MissingnessProfileResult.
  • MissingnessProfiler computes this by calculating, for each row, the count of columns with effective nulls, then taking the 90th percentile of that row-wise distribution.
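
A minimal sketch of the computation described above, written against plain Python rows rather than Polars so that it stays self-contained. Two things here are assumptions: effective nulls are reduced to `None` for brevity, and the percentile uses the nearest-rank method (the issue does not say which interpolation MissingnessProfiler should use). The helper name `compute_row_distribution` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowMissingnessDistribution:
    row_missingness_p90: int  # 90th-percentile count of missing columns per row


def compute_row_distribution(rows: list[tuple]) -> RowMissingnessDistribution:
    """Count missing columns per row, then take the nearest-rank 90th percentile."""
    counts = sorted(sum(value is None for value in row) for row in rows)
    if not counts:
        return RowMissingnessDistribution(0)
    rank = (9 * len(counts) + 9) // 10  # ceil(0.9 * n) in integer arithmetic
    return RowMissingnessDistribution(counts[rank - 1])


rows = (
    [(1, 2, 3)] * 6               # six complete rows
    + [(None, 2, 3)] * 2          # two rows missing one column
    + [(None, None, 3)]           # one row missing two columns
    + [(None, None, None)]        # one row missing all three
)
distribution = compute_row_distribution(rows)
print(distribution.row_missingness_p90)
```

With ten rows the nearest rank is 9, so the value returned is the ninth-smallest per-row count.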

_profile_signals.build_label_matrix

  • Add the Compound missingness row signal using row_distribution.row_missingness_p90 from the profile. One binary label for the whole dataset: 1 if a row's effective null count exceeds p90. Signal is omitted when p90 == 0.
  • Fix the Joint MAR pair signal: replace raw is_null() with the same dtype-driven effective null check (string sentinels + float sentinels) used in the per-column missingness signal, applied to both columns in each correlated pair.
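
The compound signal reduces to one comparison per row. The sketch below assumes the per-row effective null counts have already been computed; the function name `compound_missingness_labels` and the use of `None` to represent "signal omitted" are illustrative choices, not part of build_label_matrix.

```python
from typing import Optional


def compound_missingness_labels(
    row_null_counts: list[int], p90: int
) -> Optional[list[int]]:
    """Return one binary label per row, or None when the signal is omitted."""
    if p90 == 0:
        # Per this scope, the signal is left out of the label matrix when p90 == 0.
        return None
    # Strictly greater than p90: rows exactly at the percentile get label 0.
    return [1 if count > p90 else 0 for count in row_null_counts]


counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
labels = compound_missingness_labels(counts, p90=2)
print(labels)
```

Because the comparison is strict, only the row missing three columns is labelled 1 in this example.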

NumericImputationConfig

  • Add split_imbalance_ratio_threshold: float = 0.5. Applies to both train-side and test-side checks.

ColumnImputationRecord

  • Add profile_missing_ratio: float. Populated during fit() from cp.missingness.effective_null_ratio for every column with a fitted strategy. Used by FittedImputer.transform() for the test-side check.

ImputationOrchestrator

  • Rename SplitImbalanceWarning to TrainSplitImbalanceWarning.
  • Upgrade _check_split_imbalance from binary to proportional: warn when train_ratio < split_imbalance_ratio_threshold × profile_ratio. The function reads the threshold from the config.
  • Change fit_transform return type from ImputationResult to tuple[FittedImputer, ImputationResult].
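
To show what the tuple return means for callers, here is a toy stand-in that fills missing values with the train mean. Every class in it (`ToyOrchestrator`, `ToyFittedImputer`, `ToyImputationResult`) is hypothetical and far simpler than the real ImputationOrchestrator; only the shape of the return value follows this issue.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Optional


@dataclass(frozen=True)
class ToyFittedImputer:
    fill_value: float  # statistic learned from train only

    def transform(self, values: list[Optional[float]]) -> list[float]:
        return [self.fill_value if v is None else v for v in values]


@dataclass(frozen=True)
class ToyImputationResult:
    values: list[float]


class ToyOrchestrator:
    def fit(self, train: list[Optional[float]]) -> ToyFittedImputer:
        observed = [v for v in train if v is not None]
        return ToyFittedImputer(fill_value=fmean(observed))

    def fit_transform(
        self, train: list[Optional[float]]
    ) -> tuple[ToyFittedImputer, ToyImputationResult]:
        fitted = self.fit(train)
        return fitted, ToyImputationResult(fitted.transform(train))


# Unpacking makes the fitted object impossible to discard by accident.
fitted_imputer, train_result = ToyOrchestrator().fit_transform([1.0, None, 3.0])
# The test set is filled with the train statistic; nothing is re-fit.
test_values = fitted_imputer.transform([None, 10.0])
print(train_result.values, test_values)
```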

FittedImputer.transform()

  • After applying imputations, iterate over columns with a fitted strategy. For each column, compare test_df[col].null_count() / len(test_df) against record.profile_missing_ratio. Emit TestSplitImbalanceWarning when the ratio falls below split_imbalance_ratio_threshold × profile_missing_ratio. The threshold is stored alongside the records (passed in from NumericImputationConfig at fit time).
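
The test-side check could look roughly like the following. It is a sketch under several assumptions: null counts are passed in as a plain dict instead of being read from a Polars frame, the helper name `check_test_split` is hypothetical, the warning class is a local stand-in, and emitting a single warning that names every affected column mirrors the train-side behaviour rather than anything stated for transform().

```python
import warnings


class TestSplitImbalanceWarning(UserWarning):
    """Stand-in for the warning class this scope adds (base class assumed)."""


def check_test_split(
    null_counts: dict[str, int],
    n_rows: int,
    profile_ratios: dict[str, float],  # record.profile_missing_ratio per column
    threshold: float = 0.5,
) -> dict[str, tuple[float, float]]:
    """Warn once, naming each column whose observed ratio is below the floor."""
    flagged: dict[str, tuple[float, float]] = {}
    for col, profile_ratio in profile_ratios.items():
        if n_rows == 0 or profile_ratio <= 0.0:
            continue  # nothing to compare
        observed = null_counts.get(col, 0) / n_rows
        if observed < threshold * profile_ratio:
            flagged[col] = (observed, profile_ratio)
    if flagged:
        details = "; ".join(
            f"{col}: observed={obs:.3f}, profile={prof:.3f}"
            for col, (obs, prof) in flagged.items()
        )
        warnings.warn(
            f"Test split under-represents missingness (threshold={threshold}): {details}",
            TestSplitImbalanceWarning,
            stacklevel=2,
        )
    return flagged


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = check_test_split(
        null_counts={"age": 1, "income": 30},
        n_rows=100,
        profile_ratios={"age": 0.20, "income": 0.25},
    )
print(flagged)
```

Here "age" is flagged (1% observed against a 10% floor) while "income" is not (30% observed against a 12.5% floor).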

dataforge_ml/__init__.py

  • Remove SplitImbalanceWarning from exports.
  • Add TrainSplitImbalanceWarning and TestSplitImbalanceWarning.

Warning content

Both warnings include: column name(s), their profile missing ratio, their observed split missing ratio, and the threshold that was violated.

Numeric sentinel exemption

The proportional check compares null_count() (post-_resolve_effective_nulls) against effective_null_ratio from the profile. _resolve_effective_nulls normalises string and float sentinels but not user-declared numeric sentinels. For columns with numeric sentinels, null_count() will undercount relative to effective_null_ratio. This is documented in the warning message and in ColumnImputationRecord. Full numeric sentinel support is deferred to Scope 5 (issue #94).
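
The undercount can be illustrated with a toy effective-null predicate. This is not the real _resolve_effective_nulls: the function `is_effective_null` and the sentinel set are hypothetical, built only from the examples named in this issue ("NA", "", NaN, Inf), and the column values are made up for the illustration.

```python
import math

# String sentinels mentioned in this issue; the real set is an assumption here.
STRING_SENTINELS = {"NA", ""}


def is_effective_null(value) -> bool:
    """Toy version of the string/float sentinel normalisation."""
    if value is None:
        return True
    if isinstance(value, str):
        return value in STRING_SENTINELS
    if isinstance(value, float):
        return math.isnan(value) or math.isinf(value)
    return False


column = [1.5, None, float("nan"), -999.0, -999.0, 2.0, float("inf"), 3.0, -999.0, 4.0]
declared_numeric_sentinel = -999.0  # user-declared, not yet normalised

# What the proportional check counts today: None, NaN and Inf only.
resolved = sum(is_effective_null(v) for v in column)
# What it would count once numeric sentinels are normalised (Scope 5).
intended = resolved + sum(v == declared_numeric_sentinel for v in column)
print(resolved / len(column), intended / len(column))
```

The gap between the two ratios is the reason numeric sentinel columns are exempt from the proportional check for now.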

ADR references

  • ADR 0020 — TestSplitImbalanceWarning lives in FittedImputer.transform() not DataSplitter
  • ADR 0021 — fit_transform returns tuple[FittedImputer, ImputationResult]
  • ADR 0022 — Split imbalance check is proportional, not binary

Testing Decisions

A good test for this scope checks observable behaviour through public or package-internal interfaces — not internal implementation state. Specifically: does the right warning fire at the right threshold? Does build_label_matrix include the compound signal when it should and exclude it when it should not? Does fit_transform return both objects?

Modules to test:

  • _check_split_imbalance (train-side proportional check)

    • Warning fires when train_ratio < 0.5 × profile_ratio.
    • Warning does not fire when train_ratio >= 0.5 × profile_ratio.
    • Warning fires with a custom threshold (e.g. 0.8).
    • Warning names all affected columns in a single emission.
    • No warning when profile reports zero missingness for a column.
    • String sentinel rows in train are counted correctly (no false warning).
    • Float sentinel rows in train are counted correctly (no false warning).
    • Prior art: tests/unit/imputation/test_orchestrator.py — existing SplitImbalanceWarning tests.
  • FittedImputer.transform() (test-side proportional check)

    • TestSplitImbalanceWarning fires when test_ratio < 0.5 × profile_ratio.
    • TestSplitImbalanceWarning does not fire when test ratio is adequate.
    • Both TrainSplitImbalanceWarning (from fit()) and TestSplitImbalanceWarning (from transform()) can fire independently in the same workflow.
    • Prior art: tests/unit/imputation/test_orchestrator.py.
  • RowMissingnessDistribution computation

    • row_missingness_p90 is zero for a fully populated dataset.
    • row_missingness_p90 reflects the actual 90th percentile of per-row missing-column counts.
    • Single-column dataset with all values missing produces p90 == 1.
    • Prior art: tests/unit/profiling/test_missingness_profiler.py.
  • build_label_matrix — compound missingness row signal

    • Signal column is present in the label matrix when row_missingness_p90 > 0.
    • Signal column is absent when row_missingness_p90 == 0.
    • Rows above the p90 threshold receive label 1; others receive 0.
    • Prior art: tests/unit/splitting/test_data_splitter.py.
  • build_label_matrix — Joint MAR pair effective null fix

    • A row with a string sentinel in one column of a MAR pair receives 1 in the joint signal.
    • Prior art: same as above.
  • fit_transform tuple return

    • Return value is a tuple of length 2.
    • First element is a FittedImputer.
    • Second element is an ImputationResult.
    • The FittedImputer returned is identical to what a standalone fit() call would return.
    • Prior art: tests/unit/imputation/test_orchestrator.py — existing fit_transform tests.

Out of Scope

  • Numeric sentinel imbalance detection: columns where missingness is expressed through user-declared numeric sentinels (e.g. -999) are not covered by the proportional check. This requires _resolve_effective_nulls to normalise numeric sentinels, which is a Scope 5 (issue #94) deliverable.
  • NearConstant numeric minority signal and Datetime future-date signal: both are described in CONTEXT.md as stratification signals but are not implemented. These are not part of Scope 9.
  • Automatic split correction: Scope 9 warns about bad splits but does not attempt to re-split or re-weight the data. The correction path remains DataSplitter.profile_stratified_split().
  • K-fold imbalance checking: TestSplitImbalanceWarning fires per transform() call, which covers single-split and k-fold use cases naturally. No fold-level aggregation is added.

Further Notes

  • SplitImbalanceWarning is currently exported from dataforge_ml/__init__.py and referenced in existing unit tests. Renaming it to TrainSplitImbalanceWarning is a breaking public API change. All existing test references to SplitImbalanceWarning must be updated as part of this scope.
  • The split_imbalance_ratio_threshold lives in NumericImputationConfig because the imbalance check is Phase 2's responsibility — it is consumed by the imputer, not the profiler. The profiler produces the profile ratio; the imputer evaluates the split against it.
  • RowMissingnessDistribution is a Phase 1 (profiling) output but its primary consumer is the splitting module. This follows the existing pattern where Phase 1 outputs (the full StructuralProfileResult) are passed into DataSplitter and ImputationOrchestrator without either phase owning the profile.
