Scope 9: Split Design and Imbalance Checking — proportional checks, test-side warning, compound missingness signal, fit_transform tuple return #98

## Problem Statement

When a data scientist uses DataForgeML to split a dataset and then run imputation, the library fails to detect four categories of unsafe split conditions that silently degrade imputation quality:

1. **Proportional missingness underrepresentation in train goes undetected.** The current imbalance check is binary: it warns only when train has *zero* missing values for a column the full-dataset profile says should have some. A column that is 20% missing in the full dataset but only 2% missing in train passes silently — the imputation model is trained on a nearly-clean slice and learns fill statistics that do not reflect the population.

2. **Test-side missingness imbalance is never checked.** There is no mechanism to warn when the test split has proportionally far less missingness than expected. When this happens, imputation quality cannot be evaluated on test — the fitted model runs but the test set does not exercise the missingness distribution it was trained on.

3. **Globally sparse rows are not protected by the stratified split.** Rows that are missing in many columns simultaneously — the hardest inputs for MICE — are not individually protected by the per-column missingness stratification signals. A standard multilabel stratified split can still concentrate these rows disproportionately in one partition.

4. **`fit_transform` discards the `FittedImputer`, enabling accidental re-fit on test.** The current return type is `ImputationResult` only. A caller who wants to impute the test set has no path other than calling `fit()` a second time — or, worse, calling `fit_transform(test_df)`, which re-fits on test data. Neither path is safe, and the API gives no signal that either is wrong.

Additionally, the Joint MAR missingness signal in `build_label_matrix` uses raw Polars `is_null()` for both correlated columns instead of the effective null mask — meaning string sentinel rows are missed in the joint signal even though Phase 1 detected the MAR correlation using effective nulls.

## Dependencies

- **Scope 5 (issue #94) — Numeric Sentinel Support** is a partial dependency. User-declared numeric sentinels (e.g. `-999`) are not yet normalised by `_resolve_effective_nulls`, so the proportional imbalance check will undercount effective nulls for sentinel-heavy columns until Scope 5 ships. Scope 9 documents this explicitly: numeric sentinel columns are exempt from the proportional check until Scope 5 is implemented. String sentinels and float sentinels (`NaN`, `Inf`) are already normalised by the existing `_resolve_effective_nulls` call in `fit()` and are not affected by this dependency.

## Solution

Extend Phase 2 imbalance checking and Phase 1 splitting to detect and surface all four classes of unsafe split conditions, and correct the Joint MAR signal so that it honours effective nulls:

1. **Upgrade the train-side imbalance check from binary to proportional.** Replace the `== 0` check with: warn when `train_missing_ratio < split_imbalance_ratio_threshold × profile_effective_null_ratio`. Default threshold: `0.5`. Configurable in `NumericImputationConfig` as `split_imbalance_ratio_threshold: float = 0.5`. Rename `SplitImbalanceWarning` to `TrainSplitImbalanceWarning`.

2. **Add a test-side proportional imbalance check inside `FittedImputer.transform()`.** Apply the same proportional rule against each column's stored `profile_missing_ratio` (persisted in `ColumnImputationRecord` at fit time). Emit a new `TestSplitImbalanceWarning`. This fires universally regardless of which split method was used.

3. **Implement the Compound missingness row signal in `build_label_matrix`.** Add a `RowMissingnessDistribution` dataclass to `MissingnessProfileResult` (computed by `MissingnessProfiler`). The `row_missingness_p90` field holds the 90th-percentile count of missing columns per row. Use it in `build_label_matrix` to emit a single binary label: `1` for rows missing in more columns than `row_missingness_p90`.

4. **Fix the Joint MAR pair signal to use the effective null mask.** Replace raw `is_null()` with the same dtype-driven effective null check used in the per-column missingness signal for both columns in each correlated pair.

5. **Change `fit_transform` to return `tuple[FittedImputer, ImputationResult]`.** Forces callers to receive the `FittedImputer`. The natural next step — `fitted_imputer.transform(test_df)` — is then self-evident from the unpacked tuple.
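
The proportional rule in item 1 is small enough to state as a predicate. The sketch below is illustrative only: the function name `is_split_imbalanced` is hypothetical, and the real check operates on profile and split objects rather than bare floats.

```python
def is_split_imbalanced(
    split_missing_ratio: float,
    profile_missing_ratio: float,
    threshold: float = 0.5,
) -> bool:
    """Return True when the split has proportionally too little missingness.

    Hypothetical helper: mirrors the rule
    split_ratio < threshold * profile_ratio from the Solution list.
    """
    # When the profile ratio is 0 the right-hand side is 0, and a ratio
    # cannot be negative, so such columns never trigger the check.
    return split_missing_ratio < threshold * profile_missing_ratio


# The case from the Problem Statement: 20% missing overall, 2% in train.
print(is_split_imbalanced(0.02, 0.20))                 # 0.02 < 0.10
print(is_split_imbalanced(0.12, 0.20))                 # 0.12 is not < 0.10
print(is_split_imbalanced(0.15, 0.20, threshold=0.8))  # 0.15 < 0.16
```

The old binary check corresponds to the special case where only `split_missing_ratio == 0` triggers; the proportional form catches the 2%-versus-20% case as well.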

## User Stories

1. As a data scientist, I want to receive a warning when my training split has proportionally far less missingness than the full dataset for any column, so that I know my imputation model was trained on an unrepresentative slice before I draw conclusions from its output.
2. As a data scientist, I want the missingness imbalance warning to tell me the actual train missing ratio and the profile missing ratio for each affected column, so that I can judge how severe the imbalance is without inspecting the data manually.
3. As a data scientist, I want the train-side imbalance threshold to be configurable, so that I can tighten it (e.g. 0.8) on high-stakes datasets or relax it (e.g. 0.3) for known-skewed splits like time-based partitions.
4. As a data scientist, I want to receive a warning when my test split has proportionally far less missingness than expected for any imputed column, so that I know before I evaluate imputation quality that the test set may not exercise the imputer's real-world behaviour.
5. As a data scientist, I want the test-side warning to fire regardless of how I split my data — whether I used `profile_stratified_split`, `random_split`, or my own custom split — so that the check is not opt-in.
6. As a data scientist, I want `TrainSplitImbalanceWarning` and `TestSplitImbalanceWarning` as distinct warning types, so that I can silence one without the other using `warnings.filterwarnings`.
7. As a data scientist, I want rows that are missing in many columns simultaneously to be proportionally represented in both train and test when I call `profile_stratified_split`, so that MICE has representative globally-sparse rows available during training.
8. As a data scientist, I want `profile_stratified_split` to include a compound missingness row signal automatically — no extra configuration required — so that globally-sparse row protection is always on.
9. As a data scientist, I want `fit_transform` to return both the `FittedImputer` and the imputed train `ImputationResult` as a tuple, so that I cannot accidentally discard the fitted imputer and am naturally guided to use it for test-set imputation.
10. As a data scientist, I want the Joint MAR missingness signal in the stratified split to correctly identify correlated missing pairs when missingness is expressed as string sentinels, so that MAR structure is preserved in both partitions even for string-sentinel columns.
11. As a data scientist, I want the train-side imbalance check to correctly count string sentinel rows (e.g. `"NA"`, `""`) and float sentinel rows (`NaN`, `Inf`) as missing, so that I do not receive false warnings for columns whose missingness is expressed through sentinels rather than Polars nulls.
12. As a data scientist, I want the imbalance check to document explicitly that user-declared numeric sentinels (e.g. `-999`) are not yet covered by the proportional check, so that I understand the limitation without having to dig into the source.
13. As a data scientist, I want `RowMissingnessDistribution` to be accessible on `MissingnessProfileResult` after Phase 1 profiling, so that I can inspect the row-wise missingness distribution for my dataset independently of the split.
14. As a data scientist, I want the compound missingness row signal to be omitted from the label matrix when `row_missingness_p90 == 0` (roughly 90% or more of rows have no missing values, which includes fully populated datasets), so that the signal slot is not wasted on a dataset with little or no row-level missingness.
15. As a data scientist, I want `TrainSplitImbalanceWarning` and `TestSplitImbalanceWarning` to be importable directly from `dataforge_ml`, so that I can catch them programmatically without importing from internal submodules.
16. As a data scientist, I want `profile_missing_ratio` to appear on `ColumnImputationRecord` for every column with a fitted strategy, so that I can compare train missingness against profile missingness in my own post-fit analysis.
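
Story 6 relies on the two warnings being separate classes. The sketch below shows the filtering behaviour with the standard `warnings` module. The two classes are local stand-ins defined here for illustration (their `UserWarning` base is an assumption); in real use they would be imported from `dataforge_ml`.

```python
import warnings


class TrainSplitImbalanceWarning(UserWarning):
    """Stand-in for the train-side warning class (hypothetical definition)."""


class TestSplitImbalanceWarning(UserWarning):
    """Stand-in for the test-side warning class (hypothetical definition)."""


def emit_both() -> None:
    # Simulates a workflow in which both checks are violated.
    warnings.warn("train split under-represents missingness", TrainSplitImbalanceWarning)
    warnings.warn("test split under-represents missingness", TestSplitImbalanceWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Silence only the train-side warning; the test-side one still surfaces.
    warnings.filterwarnings("ignore", category=TrainSplitImbalanceWarning)
    emit_both()

caught_categories = [w.category for w in caught]
print(caught_categories)
```

With a single shared warning class this selective filter would not be possible without matching on message text.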

## Implementation Decisions

### Modules modified

**`MissingnessProfiler` and `MissingnessProfileResult`**
- Add a new `RowMissingnessDistribution` dataclass with at minimum `row_missingness_p90: int` — the 90th-percentile count of per-row missing columns across the dataset.
- Add `row_distribution: RowMissingnessDistribution` as a field on `MissingnessProfileResult`.
- `MissingnessProfiler` computes this by calculating, for each row, the count of columns with effective nulls, then taking the 90th percentile of that row-wise distribution.
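
A dependency-free sketch of the row-wise computation follows. It uses plain dictionaries with `None` standing in for an effective null, and it assumes the nearest-rank percentile convention; the issue does not specify which percentile method `MissingnessProfiler` should use, and the helper names are hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RowMissingnessDistribution:
    row_missingness_p90: int


def row_missing_counts(rows: list[dict[str, object]]) -> list[int]:
    """Count missing columns per row; None stands in for an effective null."""
    return [sum(1 for value in row.values() if value is None) for row in rows]


def nearest_rank_percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile (an assumed convention, not taken from the issue)."""
    if not values:
        return 0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def profile_row_missingness(rows: list[dict[str, object]]) -> RowMissingnessDistribution:
    counts = row_missing_counts(rows)
    return RowMissingnessDistribution(row_missingness_p90=nearest_rank_percentile(counts, 90))


# Ten rows: seven complete, then rows missing 1, 2 and 3 columns.
rows = [{"a": 1, "b": 2, "c": 3} for _ in range(7)] + [
    {"a": None, "b": 2, "c": 3},
    {"a": None, "b": None, "c": 3},
    {"a": None, "b": None, "c": None},
]
sparse = profile_row_missingness(rows)
full = profile_row_missingness([{"a": 1, "b": 2}] * 5)
single_all_missing = profile_row_missingness([{"a": None}] * 4)
print(sparse, full, single_all_missing)
```

Under this convention the ten-row example yields a p90 of 2, so only the row missing all three columns would exceed it.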

**`_profile_signals.build_label_matrix`**
- Add the Compound missingness row signal using `row_distribution.row_missingness_p90` from the profile. One binary label for the whole dataset: `1` if a row's effective null count exceeds `p90`. Signal is omitted when `p90 == 0`.
- Fix the Joint MAR pair signal: replace raw `is_null()` with the same dtype-driven effective null check (string sentinels + float sentinels) used in the per-column missingness signal, applied to both columns in each correlated pair.
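
The two label-matrix changes can be sketched without Polars. Everything below is illustrative: the sentinel set, the helper names, and the reading of the joint signal as "both columns effectively null" are assumptions, not the library's actual code.

```python
from __future__ import annotations

import math

STRING_SENTINELS = {"NA", ""}  # hypothetical sentinel set for illustration


def is_effective_null(value: object) -> bool:
    """Dtype-driven null check: real nulls, string sentinels, NaN and Inf."""
    if value is None:
        return True
    if isinstance(value, str):
        return value in STRING_SENTINELS
    if isinstance(value, float):
        return math.isnan(value) or math.isinf(value)
    return False


def compound_label(rows: list[dict[str, object]], p90: int) -> list[int] | None:
    """One binary label per row; the signal is omitted (None) when p90 == 0."""
    if p90 == 0:
        return None
    return [int(sum(is_effective_null(v) for v in row.values()) > p90) for row in rows]


def joint_mar_label(rows: list[dict[str, object]], col_a: str, col_b: str) -> list[int]:
    """Assumed semantics: 1 when both columns of the pair are effectively null."""
    return [int(is_effective_null(r[col_a]) and is_effective_null(r[col_b])) for r in rows]


rows = [
    {"age": 31.0, "income": 52000.0, "city": "Oslo"},
    {"age": None, "income": float("nan"), "city": "NA"},
    {"age": None, "income": 48000.0, "city": "Bergen"},
    {"age": 45.0, "income": None, "city": ""},
]
compound = compound_label(rows, p90=2)
joint = joint_mar_label(rows, "income", "city")
# What a raw null check sees: sentinels and NaN are not nulls.
raw_joint = [int(r["income"] is None and r["city"] is None) for r in rows]
print(compound, joint, raw_joint)
```

The raw check finds no jointly missing rows in this data, while the effective-null check finds two, which is the gap the Joint MAR fix closes.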

**`NumericImputationConfig`**
- Add `split_imbalance_ratio_threshold: float = 0.5`. Applies to both train-side and test-side checks.

**`ColumnImputationRecord`**
- Add `profile_missing_ratio: float`. Populated during `fit()` from `cp.missingness.effective_null_ratio` for every column with a fitted strategy. Used by `FittedImputer.transform()` for the test-side check.

**`ImputationOrchestrator`**
- Rename `SplitImbalanceWarning` to `TrainSplitImbalanceWarning`.
- Upgrade `_check_split_imbalance` from binary to proportional: warn when `train_ratio < split_imbalance_ratio_threshold × profile_ratio`. The function reads the threshold from the config.
- Change `fit_transform` return type from `ImputationResult` to `tuple[FittedImputer, ImputationResult]`.
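
The intended calling pattern after the return-type change can be shown with a toy mean imputer. The classes below reuse the issue's names but are minimal stand-ins written for this example; they are not the real `ImputationOrchestrator`, `FittedImputer`, or `ImputationResult`.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class ImputationResult:
    rows: list[dict[str, float | None]]


@dataclass(frozen=True)
class FittedImputer:
    fill_values: dict[str, float]

    def transform(self, rows: list[dict[str, float | None]]) -> ImputationResult:
        # Fill missing values using statistics learned at fit time only.
        return ImputationResult(
            [
                {col: (self.fill_values.get(col) if val is None else val) for col, val in row.items()}
                for row in rows
            ]
        )


class ImputationOrchestrator:
    def fit(self, rows: list[dict[str, float | None]]) -> FittedImputer:
        columns = list(rows[0].keys()) if rows else []
        fill: dict[str, float] = {}
        for col in columns:
            observed = [row[col] for row in rows if row[col] is not None]
            if observed:
                fill[col] = fmean(observed)
        return FittedImputer(fill)

    def fit_transform(self, rows: list[dict[str, float | None]]) -> tuple[FittedImputer, ImputationResult]:
        fitted = self.fit(rows)
        return fitted, fitted.transform(rows)


train = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
test = [{"x": None}, {"x": 10.0}]

fitted, train_result = ImputationOrchestrator().fit_transform(train)
test_result = fitted.transform(test)  # no re-fit on test data
print(train_result.rows, test_result.rows)
```

Because the caller has to unpack two values, the fitted object is in hand at the point where test-set imputation is needed, and the test rows are filled with the train mean rather than a statistic computed from test.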

**`FittedImputer.transform()`**
- For each column with a fitted strategy, compare `test_df[col].null_count() / len(test_df)` against `record.profile_missing_ratio`. The null count must be taken from the incoming `test_df`, before imputed values are written, because an imputed column has no nulls left to count. Emit `TestSplitImbalanceWarning` when the observed ratio falls below `split_imbalance_ratio_threshold × profile_missing_ratio`; the warning itself may be raised after imputations are applied. The threshold is stored alongside the records (passed in from `NumericImputationConfig` at fit time).
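
A minimal sketch of the test-side check follows, using plain dictionaries in place of a Polars frame. The record fields beyond `profile_missing_ratio`, the free function `transform`, and the early return for an empty frame are assumptions made for this example.

```python
from __future__ import annotations

import warnings
from dataclasses import dataclass


class TestSplitImbalanceWarning(UserWarning):
    """Stand-in for the test-side warning class (hypothetical definition)."""


@dataclass(frozen=True)
class ColumnImputationRecord:
    column: str
    fill_value: float
    profile_missing_ratio: float


def transform(
    rows: list[dict[str, float | None]],
    records: list[ColumnImputationRecord],
    threshold: float = 0.5,
) -> list[dict[str, float | None]]:
    if not rows:
        return []  # nothing to impute or check (assumed behaviour)
    flagged = []
    for record in records:
        # Measure missingness on the input, before any value is filled.
        missing = sum(1 for row in rows if row[record.column] is None)
        observed_ratio = missing / len(rows)
        if observed_ratio < threshold * record.profile_missing_ratio:
            flagged.append(record.column)
    fills = {record.column: record.fill_value for record in records}
    imputed = [
        {col: (fills.get(col) if val is None else val) for col, val in row.items()}
        for row in rows
    ]
    if flagged:
        warnings.warn(
            "test split under-represents missingness in: " + ", ".join(flagged),
            TestSplitImbalanceWarning,
            stacklevel=2,
        )
    return imputed


records = [ColumnImputationRecord("income", fill_value=50000.0, profile_missing_ratio=0.20)]
clean_test = [{"income": 40000.0 + i} for i in range(10)]
balanced_test = [{"income": None}, {"income": None}] + [{"income": 1.0}] * 8

with warnings.catch_warnings(record=True) as caught_clean:
    warnings.simplefilter("always")
    transform(clean_test, records)
with warnings.catch_warnings(record=True) as caught_balanced:
    warnings.simplefilter("always")
    imputed_balanced = transform(balanced_test, records)
print(len(caught_clean), len(caught_balanced))
```

The check depends only on the stored records and the incoming rows, so it behaves the same whichever split method produced the test set.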

**`dataforge_ml/__init__.py`**
- Remove `SplitImbalanceWarning` from exports.
- Add `TrainSplitImbalanceWarning` and `TestSplitImbalanceWarning`.

### Warning content
Both warnings include: column name(s), their profile missing ratio, their observed split missing ratio, and the threshold that was violated.
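
One possible shape for such a message is sketched below. The `ImbalanceViolation` container, the function name, and the exact wording are hypothetical; only the four pieces of information come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ImbalanceViolation:
    column: str
    profile_ratio: float
    observed_ratio: float


def format_imbalance_message(
    split_name: str, violations: list[ImbalanceViolation], threshold: float
) -> str:
    """Build one message that covers every affected column."""
    lines = [f"{split_name} split under-represents missingness (threshold={threshold:.2f}):"]
    for v in violations:
        lines.append(
            f"  {v.column}: profile={v.profile_ratio:.1%}, "
            f"observed={v.observed_ratio:.1%}, "
            f"minimum expected={threshold * v.profile_ratio:.1%}"
        )
    return "\n".join(lines)


message = format_imbalance_message(
    "train",
    [ImbalanceViolation("income", 0.20, 0.02), ImbalanceViolation("age", 0.30, 0.05)],
    threshold=0.5,
)
print(message)
```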

### Numeric sentinel exemption
The proportional check compares `null_count()` (post-`_resolve_effective_nulls`) against `effective_null_ratio` from the profile. `_resolve_effective_nulls` normalises string and float sentinels but not user-declared numeric sentinels. For columns with numeric sentinels, `null_count()` will undercount relative to `effective_null_ratio`. This is documented in the warning message and in `ColumnImputationRecord`. Full numeric sentinel support is deferred to Scope 5 (issue #94).
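
The undercount can be seen on a single column. In the sketch below the two helper functions are hypothetical stand-ins: `null_ratio` plays the role of `null_count()` after string and float sentinels are resolved, and `effective_null_ratio` plays the role of the profile's ratio once a numeric sentinel is declared.

```python
from __future__ import annotations


def null_ratio(values: list[int | None]) -> float:
    """Share of real nulls only; numeric sentinels are not recognised."""
    return sum(v is None for v in values) / len(values)


def effective_null_ratio(values: list[int | None], numeric_sentinels: set[int]) -> float:
    """Share of values that are null or match a declared numeric sentinel."""
    return sum(v is None or v in numeric_sentinels for v in values) / len(values)


# One null and three -999 sentinels out of eight values.
income = [-999, 52000, None, -999, 61000, 48000, -999, 55000]

observed = null_ratio(income)
profiled = effective_null_ratio(income, {-999})
would_trip_rule = observed < 0.5 * profiled
print(observed, profiled, would_trip_rule)
```

Here the same column is compared with itself, so there is no real imbalance, yet the proportional rule would be tripped by the counting mismatch alone. That is the reason numeric sentinel columns are exempted until Scope 5.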

### ADR references
- ADR 0020 — TestSplitImbalanceWarning lives in `FittedImputer.transform()` not `DataSplitter`
- ADR 0021 — `fit_transform` returns `tuple[FittedImputer, ImputationResult]`
- ADR 0022 — Split imbalance check is proportional, not binary

## Testing Decisions

A good test for this scope checks observable behaviour through public or package-internal interfaces — not internal implementation state. Specifically: does the right warning fire at the right threshold? Does `build_label_matrix` include the compound signal when it should and exclude it when it should not? Does `fit_transform` return both objects?

**Modules to test:**

- **`_check_split_imbalance` (train-side proportional check)**
  - Warning fires when `train_ratio < 0.5 × profile_ratio`.
  - Warning does not fire when `train_ratio >= 0.5 × profile_ratio`.
  - Warning fires with a custom threshold (e.g. `0.8`).
  - Warning names all affected columns in a single emission.
  - No warning when profile reports zero missingness for a column.
  - String sentinel rows in train are counted correctly (no false warning).
  - Float sentinel rows in train are counted correctly (no false warning).
  - Prior art: `tests/unit/imputation/test_orchestrator.py` — existing `SplitImbalanceWarning` tests.

- **`FittedImputer.transform()` (test-side proportional check)**
  - `TestSplitImbalanceWarning` fires when `test_ratio < 0.5 × profile_ratio`.
  - `TestSplitImbalanceWarning` does not fire when test ratio is adequate.
  - Both `TrainSplitImbalanceWarning` (from `fit()`) and `TestSplitImbalanceWarning` (from `transform()`) can fire independently in the same workflow.
  - Prior art: `tests/unit/imputation/test_orchestrator.py`.

- **`RowMissingnessDistribution` computation**
  - `row_missingness_p90` is zero for a fully populated dataset.
  - `row_missingness_p90` reflects the actual 90th percentile of per-row missing-column counts.
  - Single-column dataset with all values missing produces `p90 == 1`.
  - Prior art: `tests/unit/profiling/test_missingness_profiler.py`.

- **`build_label_matrix` — compound missingness row signal**
  - Signal column is present in the label matrix when `row_missingness_p90 > 0`.
  - Signal column is absent when `row_missingness_p90 == 0`.
  - Rows above the p90 threshold receive label `1`; others receive `0`.
  - Prior art: `tests/unit/splitting/test_data_splitter.py`.

- **`build_label_matrix` — Joint MAR pair effective null fix**
  - A row with a string sentinel in one column of a MAR pair receives `1` in the joint signal.
  - Prior art: same as above.

- **`fit_transform` tuple return**
  - Return value is a tuple of length 2.
  - First element is a `FittedImputer`.
  - Second element is an `ImputationResult`.
  - The `FittedImputer` returned is identical to what a standalone `fit()` call would return.
  - Prior art: `tests/unit/imputation/test_orchestrator.py` — existing `fit_transform` tests.
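
The train-side cases listed above could be written along these lines. `check_split_imbalance` here is a simplified stand-in that takes precomputed ratios, not the real `_check_split_imbalance`, and the sketch uses `warnings.catch_warnings` from the standard library so that it runs without a test framework.

```python
from __future__ import annotations

import warnings


class TrainSplitImbalanceWarning(UserWarning):
    """Stand-in for the train-side warning class (hypothetical definition)."""


def check_split_imbalance(
    train_ratios: dict[str, float], profile_ratios: dict[str, float], threshold: float = 0.5
) -> None:
    """Simplified stand-in: emits one warning that names every affected column."""
    flagged = sorted(
        col for col, profile in profile_ratios.items()
        if train_ratios.get(col, 0.0) < threshold * profile
    )
    if flagged:
        warnings.warn(
            "train split under-represents missingness in: " + ", ".join(flagged),
            TrainSplitImbalanceWarning,
            stacklevel=2,
        )


def collect(train_ratios, profile_ratios, threshold=0.5):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        check_split_imbalance(train_ratios, profile_ratios, threshold)
    return [w for w in caught if issubclass(w.category, TrainSplitImbalanceWarning)]


def test_fires_below_threshold():
    assert len(collect({"income": 0.02}, {"income": 0.20})) == 1


def test_silent_at_or_above_threshold():
    assert collect({"income": 0.10}, {"income": 0.20}) == []


def test_custom_threshold():
    assert collect({"income": 0.15}, {"income": 0.20}) == []
    assert len(collect({"income": 0.15}, {"income": 0.20}, threshold=0.8)) == 1


def test_single_emission_names_all_columns():
    caught = collect({"income": 0.01, "tenure": 0.0}, {"income": 0.20, "tenure": 0.30})
    assert len(caught) == 1
    assert "income" in str(caught[0].message) and "tenure" in str(caught[0].message)


def test_no_warning_for_zero_profile_missingness():
    assert collect({"income": 0.0}, {"income": 0.0}) == []


for test in (
    test_fires_below_threshold,
    test_silent_at_or_above_threshold,
    test_custom_threshold,
    test_single_emission_names_all_columns,
    test_no_warning_for_zero_profile_missingness,
):
    test()
```

Each test asserts only on whether a warning is emitted and what it names, which keeps the tests on observable behaviour rather than internal state.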

## Out of Scope

- **Numeric sentinel imbalance detection** — columns where missingness is expressed through user-declared numeric sentinels (e.g. `-999`) are not covered by the proportional check. This requires `_resolve_effective_nulls` to normalise numeric sentinels, which is a Scope 5 (issue #94) deliverable.
- **NearConstant numeric minority signal** and **Datetime future-date signal** — both are described in `CONTEXT.md` as stratification signals but are not implemented. These are not part of Scope 9.
- **Automatic split correction** — Scope 9 warns about bad splits but does not attempt to re-split or re-weight the data. The correction path remains `DataSplitter.profile_stratified_split()`.
- **K-fold imbalance checking** — `TestSplitImbalanceWarning` fires per `transform()` call, which covers single-split and k-fold use cases naturally. No fold-level aggregation is added.

## Further Notes

- `SplitImbalanceWarning` is currently exported from `dataforge_ml/__init__.py` and referenced in existing unit tests. Renaming it to `TrainSplitImbalanceWarning` is a **breaking public API change**. All existing test references to `SplitImbalanceWarning` must be updated as part of this scope.
- The `split_imbalance_ratio_threshold` lives in `NumericImputationConfig` because the imbalance check is Phase 2's responsibility — it is consumed by the imputer, not the profiler. The profiler produces the profile ratio; the imputer evaluates the split against it.
- `RowMissingnessDistribution` is a Phase 1 (profiling) output but its primary consumer is the splitting module. This follows the existing pattern where Phase 1 outputs (the full `StructuralProfileResult`) are passed into `DataSplitter` and `ImputationOrchestrator` without either phase owning the profile.
