Scope 7: NumericKind.BoundedDiscrete — Compound Classification and Mode Imputation Fix (Issue #96)

## Problem Statement

When a data scientist uses DataForgeML to impute numeric columns, the library misclassifies two structurally different column types under the same label — `NumericKind.Discrete` — and applies Mode imputation to both. This produces silently wrong results for one of them.

The current rule is: any column that is an integer dtype OR has fewer than 20 unique values is classified `Discrete` and receives Mode imputation. This catches two very different column types:

1. **Truly bounded discrete columns** — columns whose values form a closed, finite set. A 5-level Likert scale `{1,2,3,4,5}`, a binary flag `{0,1}`, or a day-of-week encoding `{0,...,6}`. Mode imputation is semantically correct here.

2. **Low-cardinality integers by accident** — integer columns whose observed range happens to be narrow in this sample, but whose underlying domain is unbounded. An age column in a 50-row dataset, a count of rare events, or a year column. Mode imputation is wrong here: the most common age in a small sample is an accident of sampling, and using it as the fill value is harder to justify than the mean or median.

Both column types pass the same test. But for the second type, Mode produces actively wrong imputed values with no warning to the user.
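The collision can be reproduced with a few lines of plain Python. This is a minimal sketch of the rule as described above, not the library's code: `classify_old` and the `is_integer_dtype` parameter are hypothetical names, and the enum only mirrors the two labels this document mentions.

```python
from enum import Enum


class NumericKind(Enum):
    Discrete = "discrete"
    Continuous = "continuous"


def classify_old(values, is_integer_dtype):
    """Legacy single-condition rule: integer dtype OR fewer than 20 unique values."""
    observed = [v for v in values if v is not None]
    if is_integer_dtype or len(set(observed)) < 20:
        return NumericKind.Discrete
    return NumericKind.Continuous


# A closed-vocabulary Likert column and a small sample of ages.
likert = [1, 2, 3, 4, 5, 3, None, 4]
ages = [23, 31, 31, 47, None, 58, 62]

likert_kind = classify_old(likert, is_integer_dtype=True)
ages_kind = classify_old(ages, is_integer_dtype=True)
```

Both calls return `NumericKind.Discrete`, so both columns would be routed to Mode even though only the first has a finite domain.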

There is a second, related gap: the `Discrete → Mode` routing fires at Priority 5, before the MCAR severity chain. This means a low-cardinality integer column with high missingness or a MAR pattern gets Mode instead of model-based imputation, an outcome that was never intended.

A third gap: unlike `SemanticType`, which users can override per-column via `PipelineConfig.column_overrides`, there is currently no way to override `NumericKind`. When auto-detection misclassifies a column — whether to `BoundedDiscrete` or `Continuous` — the user has no recourse without hacking internal state.

## Solution

Replace the single-condition `NumericKind.Discrete` classification with a compound four-signal test that distinguishes truly bounded discrete columns from low-cardinality integers. Rename `NumericKind.Discrete` to `NumericKind.BoundedDiscrete` to reflect the stricter, more precise definition.

A column is `BoundedDiscrete` only when **all four** of the following signals pass:

1. **Tight sequence** — observed values fill every integer slot between min and max (`range_span == n_unique`, where `range_span = max - min + 1`). Eliminates columns with gaps (sparse sampling of a continuous domain).
2. **Small range** — `max - min ≤ 20`. Eliminates wide consecutive ranges like years `{2000,...,2024}` or large ordinal encodings.
3. **Low cardinality** — `n_unique / n_rows < 0.05` OR `n_unique ≤ 10`. Eliminates small-dataset accidents where a continuous column has few observed values. The absolute floor `≤ 10` protects small datasets where the ratio inflates.
4. **Standard origin** — `min == 0 or min == 1`. Eliminates continuous variables like age or year whose minimum is non-standard.

Float columns require an additional pre-check: all non-null values must be whole numbers (`value % 1 == 0`). Float columns with any fractional values are always `Continuous` — the tight-sequence check assumes integer steps and is undefined for decimal spacing.

The classification is **conservative by design**: a column must pass all four signals. Failing any one → `Continuous`. A false `Continuous` classification (bounded discrete column gets Mean/Median) is suboptimal. A false `BoundedDiscrete` classification (continuous column gets Mode) is actively wrong.
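The four signals and the float pre-check can be sketched as one pure function. This is an illustration under stated assumptions, not the final `_classify_numeric_kind`: it takes a plain list instead of a `pl.Series`, the constant names are hypothetical, and returning `Continuous` for an all-null column is an assumption chosen to match the conservative default.

```python
from enum import Enum


class NumericKind(Enum):
    BoundedDiscrete = "bounded_discrete"
    Continuous = "continuous"


MAX_RANGE = 20           # signal 2: max - min <= 20
MAX_UNIQUE_RATIO = 0.05  # signal 3: n_unique / n_rows < 0.05
MAX_UNIQUE_ABS = 10      # signal 3: or n_unique <= 10


def classify_numeric_kind(values, n_rows):
    """Return BoundedDiscrete only when the pre-check and all four signals pass."""
    observed = [v for v in values if v is not None]
    if not observed or n_rows <= 0:
        return NumericKind.Continuous  # assumption: nothing to test, stay conservative

    # Float pre-check: any fractional value makes the column Continuous.
    if any(v % 1 != 0 for v in observed):
        return NumericKind.Continuous

    unique = {int(v) for v in observed}
    lo, hi, n_unique = min(unique), max(unique), len(unique)

    tight_sequence = (hi - lo + 1) == n_unique
    small_range = (hi - lo) <= MAX_RANGE
    low_cardinality = (n_unique / n_rows) < MAX_UNIQUE_RATIO or n_unique <= MAX_UNIQUE_ABS
    standard_origin = lo in (0, 1)

    if tight_sequence and small_range and low_cardinality and standard_origin:
        return NumericKind.BoundedDiscrete
    return NumericKind.Continuous


likert_kind = classify_numeric_kind([1, 2, 3, 4, 5, None, 3], n_rows=1000)
float_likert_kind = classify_numeric_kind([1.0, 2.0, 3.0, 4.0, 5.0], n_rows=1000)
years_kind = classify_numeric_kind(list(range(2000, 2025)), n_rows=1000)
decimal_kind = classify_numeric_kind([0.1, 0.5, 1.0, 1.5, 2.0], n_rows=1000)
```

Under these assumptions the two Likert columns come back `BoundedDiscrete`, while the year column (fails signals 2 and 4) and the decimal column (fails the pre-check) come back `Continuous`.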

Alongside the improved classifier, a **NumericKind override mechanism** is added to `PipelineConfig`, parallel to the existing `column_overrides` for `SemanticType`. Users can explicitly declare a column's `NumericKind` when auto-detection is insufficient.

## User Stories

1. As a data scientist, I want a column of 5-level Likert ratings `{1,2,3,4,5}` to be classified as bounded discrete so that missing ratings are filled with the most common rating, not a mean like `3.2`.
2. As a data scientist, I want a binary flag column `{0,1}` to be classified as bounded discrete so that missing flags are filled with the mode (the dominant class), not the mean (a fractional value).
3. As a data scientist, I want an age column in a small dataset — which happens to have only 6 unique values — to be classified as continuous so that missing ages are filled with the median or a model-based estimate, not the modal age.
4. As a data scientist, I want a count column `{0,1,2,3}` for rare events in a 2,000-row dataset to be classified as continuous if it fails cardinality checks, so that its imputation uses the full MCAR severity chain.
5. As a data scientist, I want a float column `{1.0, 2.0, 3.0, 4.0, 5.0}` — a Likert scale stored as float because of null values — to be classified as bounded discrete, so that it receives the same Mode imputation as its integer equivalent.
6. As a data scientist, I want a float column `{0.1, 0.5, 1.0, 1.5, 2.0}` with decimal values to be classified as continuous regardless of cardinality, so that it is never mis-treated as a fixed-vocabulary column.
7. As a data scientist, I want a day-of-week column `{0,...,6}` to be classified as bounded discrete so that missing days are filled with the most common day.
8. As a data scientist, I want a year column `{2000,...,2024}` to be classified as continuous despite being an integer, so that missing years receive median or model-based imputation rather than the modal year.
9. As a data scientist, I want a 5-level rating column in a 20-row dataset to still be classified as bounded discrete, so that small datasets are not penalised by the cardinality ratio check when the column is genuinely fixed-vocabulary.
10. As a data scientist, I want a MAR-suspect column that was previously classified `Discrete` only because it had few unique values, and is now `Continuous`, to receive model-based imputation instead of Mode, so that its missingness mechanism is actually respected.
11. As a data scientist, I want the `ColumnImputationRecord.signals` field to record the classification outcome so that I can see why a column was treated as bounded discrete or continuous.
12. As a data scientist, I want the `NumericKind` enum value for bounded discrete columns to be named `BoundedDiscrete` — not `Discrete` — so that the label is self-documenting and unambiguous in audit logs.
13. As a data scientist, I want to override the auto-detected `NumericKind` for a specific column so that I can correct misclassifications without changing the column's `SemanticType`.
14. As a data scientist, I want to force a numeric column to `NumericKind.BoundedDiscrete` so that it receives Mode imputation even when the four-signal test did not classify it as bounded discrete.
15. As a data scientist, I want to force a numeric column to `NumericKind.Continuous` so that it falls into the MCAR severity chain instead of receiving Mode imputation when I know the column's domain is not finite.
16. As a data scientist, I want an explicit, descriptive error when I declare a `NumericKind` override for a non-Numeric column so that I immediately know the configuration is invalid and why.
17. As a data scientist, I want a `TypeFlag.NumericKindOverride` annotation in `ColumnProfile.type_flags` so that I can see in the audit JSON that a `NumericKind` was set manually rather than auto-detected.
18. As a data scientist, I want convenience methods `set_numeric_kind` and `set_columns_numeric_kind` on `PipelineConfig` so that I can declare overrides without manipulating the dict directly.

## Implementation Decisions

### Modules modified

**`NumericKind` enum**
Rename the `Discrete` value to `BoundedDiscrete`. No third value is added — columns that previously were `Discrete` but fail the new test become `Continuous`.

**`_classify_numeric_kind` (Phase 1 — type detector)**
This is the single location where the compound test lives. Replace the existing single-condition rule (`int dtype OR < 20 unique`) with the four-signal compound test. Accept `n_rows` as an additional parameter (needed for signal 3, the cardinality ratio check). This function is a deep module: pure signal-in → `NumericKind` decision-out, no side effects, independently testable. The `EncodedCategory` early-exit guard remains unchanged — those columns are `SemanticType.Categorical` and out of scope.

**`_fit_one` (Phase 2 — numeric imputer)**
Update the Priority 5 routing check from `NumericKind.Discrete` to `NumericKind.BoundedDiscrete`. No other routing logic changes — `BoundedDiscrete → Mode` remains unconditional.
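A rough sketch of what "unconditional" means for this check, together with the Mode fill itself. The function names, the string strategy labels, and the `"SeverityChain"` placeholder are hypothetical; the real `_fit_one` priority chain is not modelled here.

```python
from collections import Counter
from enum import Enum


class NumericKind(Enum):
    BoundedDiscrete = "bounded_discrete"
    Continuous = "continuous"


def route(kind, severity, mar_suspect):
    """Return a strategy label for one column (labels are illustrative only)."""
    # BoundedDiscrete short-circuits: severity and mechanism are never consulted.
    if kind is NumericKind.BoundedDiscrete:
        return "Mode"
    # Placeholder for the MCAR severity chain, which this sketch does not model.
    return "SeverityChain"


def mode_fill(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


ratings = [4, 4, 5, None, 3, 4, None, 1]
strategy = route(NumericKind.BoundedDiscrete, severity="high", mar_suspect=True)
filled = mode_fill(ratings)

# For comparison: the mean of the observed ratings is not a valid rating.
observed = [v for v in ratings if v is not None]
mean_fill = sum(observed) / len(observed)
```

Here `filled` contains only values from the original rating vocabulary, whereas `mean_fill` is `3.5`, which lies outside the finite domain.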

**`TypeFlag` enum**
Add `NumericKindOverride` as a distinct value. This flag is set when a `NumericKind` override is applied in the orchestrator. `TypeFlag.UserOverride` remains unchanged — it means only that a `SemanticType` was explicitly set by the caller. Both flags can coexist on the same column when the user overrides both.

**`PipelineConfig`**
Add `numeric_kind_overrides: dict[str, NumericKind]` (default empty dict), parallel to `column_overrides`. Add convenience methods:
- `set_numeric_kind(column, kind)` — single column, accepts `NumericKind` enum or raw string (`"continuous"`, `"bounded_discrete"`), with validation matching the pattern of `set_column_type`.
- `set_columns_numeric_kind(columns, kind)` — batch variant.

Update `to_dict` / `from_dict` / `to_json` / `from_json` to round-trip `numeric_kind_overrides`.
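One possible shape for the new field and its helpers, shown on a stand-in class. `PipelineConfigSketch` is hypothetical and carries only the new field; the real `PipelineConfig` has other members, and its validation should follow `set_column_type`, whose details are not reproduced here. Serialising enum members by their string value is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class NumericKind(Enum):
    BoundedDiscrete = "bounded_discrete"
    Continuous = "continuous"


@dataclass
class PipelineConfigSketch:
    numeric_kind_overrides: dict = field(default_factory=dict)

    def set_numeric_kind(self, column, kind):
        """Accept a NumericKind member or its raw string value."""
        if isinstance(kind, str):
            try:
                kind = NumericKind(kind)
            except ValueError:
                valid = ", ".join(k.value for k in NumericKind)
                raise ValueError(
                    f"Unknown NumericKind {kind!r}; expected one of: {valid}"
                ) from None
        elif not isinstance(kind, NumericKind):
            raise ValueError(f"Expected NumericKind or str, got {type(kind).__name__}")
        self.numeric_kind_overrides[column] = kind

    def set_columns_numeric_kind(self, columns, kind):
        """Batch variant: apply the same kind to every listed column."""
        for column in columns:
            self.set_numeric_kind(column, kind)

    def to_dict(self):
        # Assumption: enum members are stored as their string values.
        return {
            "numeric_kind_overrides": {
                col: kind.value for col, kind in self.numeric_kind_overrides.items()
            }
        }

    @classmethod
    def from_dict(cls, data):
        cfg = cls()
        for col, raw in data.get("numeric_kind_overrides", {}).items():
            cfg.set_numeric_kind(col, raw)
        return cfg


cfg = PipelineConfigSketch()
cfg.set_numeric_kind("rating", "bounded_discrete")
cfg.set_columns_numeric_kind(["age", "year"], NumericKind.Continuous)
restored = PipelineConfigSketch.from_dict(cfg.to_dict())
```

Routing `from_dict` through `set_numeric_kind` means a hand-edited config file with an invalid kind fails with the same `ValueError` as a bad call in code.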

**`StructuralProfiler.profile` (orchestrator — Step 5)**
Apply `numeric_kind_overrides` in Step 5, after `column_overrides` (SemanticType overrides are applied first). For each column in `numeric_kind_overrides`:
- If the column is absent from `result.columns` (excluded or non-existent), silently ignore.
- If `cp.semantic_type != SemanticType.Numeric`, raise `ValueError` with an explicit message naming the column and its actual `SemanticType`. Example: _"NumericKind override for column 'price' is invalid — column has SemanticType.Categorical. NumericKind only applies to SemanticType.Numeric columns."_
- Otherwise set `cp.numeric_kind` and append `TypeFlag.NumericKindOverride` if not already present.

SemanticType overrides are applied first within Step 5 so the guard checks the user's declared type — not the detector's type. This matters when the same column has both a SemanticType override (e.g. to `Categorical`) and a NumericKind override: the error fires based on the final SemanticType.
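The ordering and the guard can be sketched as follows. All class and function names ending in `Sketch` or starting with `apply_` are hypothetical stand-ins; skipping absent columns in the `SemanticType` loop is an assumption, since this document only specifies that behaviour for `NumericKind` overrides.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SemanticType(Enum):
    Numeric = "numeric"
    Categorical = "categorical"


class NumericKind(Enum):
    BoundedDiscrete = "bounded_discrete"
    Continuous = "continuous"


class TypeFlag(Enum):
    UserOverride = "user_override"
    NumericKindOverride = "numeric_kind_override"


@dataclass
class ColumnProfileSketch:
    semantic_type: SemanticType
    numeric_kind: Optional[NumericKind] = None
    type_flags: list = field(default_factory=list)


def _add_flag(cp, flag):
    if flag not in cp.type_flags:
        cp.type_flags.append(flag)


def apply_step5_overrides(columns, column_overrides, numeric_kind_overrides):
    # SemanticType overrides run first so the guard below sees the final type.
    for name, semantic_type in column_overrides.items():
        cp = columns.get(name)
        if cp is None:
            continue  # assumption: absent columns are skipped here as well
        cp.semantic_type = semantic_type
        _add_flag(cp, TypeFlag.UserOverride)

    for name, kind in numeric_kind_overrides.items():
        cp = columns.get(name)
        if cp is None:
            continue  # excluded or non-existent column: silently ignore
        if cp.semantic_type != SemanticType.Numeric:
            raise ValueError(
                f"NumericKind override for column '{name}' is invalid: "
                f"column has SemanticType.{cp.semantic_type.name}. "
                "NumericKind only applies to SemanticType.Numeric columns."
            )
        cp.numeric_kind = kind
        _add_flag(cp, TypeFlag.NumericKindOverride)


columns = {
    "rating": ColumnProfileSketch(SemanticType.Numeric, NumericKind.Continuous),
    "price": ColumnProfileSketch(SemanticType.Numeric, NumericKind.Continuous),
}
apply_step5_overrides(
    columns,
    column_overrides={},
    numeric_kind_overrides={
        "rating": NumericKind.BoundedDiscrete,
        "not_in_frame": NumericKind.Continuous,
    },
)

# Conflicting overrides on the same column: the guard sees the declared type.
try:
    apply_step5_overrides(
        columns,
        column_overrides={"price": SemanticType.Categorical},
        numeric_kind_overrides={"price": NumericKind.BoundedDiscrete},
    )
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

In this sketch `rating` ends up `BoundedDiscrete` with only `TypeFlag.NumericKindOverride` set, the unknown column is ignored, and the conflicting `price` overrides raise an error that names the column and its declared `SemanticType`.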

### Architectural decisions

- The compound test lives in Phase 1 (`_classify_numeric_kind`), not in Phase 2. Type classification belongs in Phase 1; Phase 2 consumes decisions, it does not make type judgements. See ADR 0018.
- Mode fires unconditionally for `BoundedDiscrete` at Priority 5 regardless of severity or mechanism. A finite-domain column must receive a finite-domain fill value. Model-based strategies are semantically invalid for this column type. See ADR 0018.
- The conservative rule (all four signals required) is intentional. A false `Continuous` classification is recoverable (suboptimal scalar fill). A false `BoundedDiscrete` classification is actively wrong (mode of a continuous variable). When ambiguous, default to `Continuous`.
- `numeric_kind_overrides` lives on `PipelineConfig` (not `ProfileConfig`) because `NumericKind` is consumed cross-phase: Phase 1 writes it, Phase 2 reads it for imputation routing. Placing it on `ProfileConfig` would require Phase 2 to reach into the profiling config, violating the phase boundary.
- `TypeFlag.NumericKindOverride` is distinct from `TypeFlag.UserOverride` so the audit log distinguishes "user changed SemanticType" from "user changed NumericKind."
- `EncodedCategory` columns (`SemanticType.Categorical`) are out of scope. They do not reach `NumericImputer`. Their missing-value handling belongs to a future `CategoricalImputer` scope.

### Dependencies

**Internal (within this scope)**
Signal 3 (cardinality ratio) requires `n_rows` to be threaded into `_classify_numeric_kind`. This is a small change to the call site in the type detector but is a prerequisite for the full compound test to be correct.

**On other scopes**
- This scope has **no hard dependency** on any other scope. All four classification signals are already computable from the raw `Series` and `n_rows` — no new Phase 1 profilers are required.
- **Scope 6 (Issue #95)** changes which columns reach MCAR routing (adding distribution shape escalation and feature-predictability checks). When Scope 6 and Scope 7 are both shipped, columns that were previously `Discrete → Mode` but now become `Continuous` under Scope 7 will be subject to the updated MCAR routing introduced by Scope 6. The combined behaviour should be validated in integration tests.
- **Scopes 0, 1, 2 (Issues #89, #90, #91)** — Regression, KNN, and MICE improvements. Columns that migrate from `Discrete` to `Continuous` under this scope will naturally benefit from those improvements once they ship. No coordination required at implementation time.

## Testing Decisions

**What makes a good test:** test classification and routing decisions through the external observable interface — `NumericKind` on `ColumnProfile` for Phase 1 tests, and `ColumnImputationRecord.strategy` plus `ColumnImputationRecord.signals` for Phase 2 tests. Do not assert on internal branching, private method state, or intermediate variables. Construct minimal synthetic `pl.Series` and `pl.DataFrame` objects with known properties. Each test should vary exactly one signal at a time, holding all others constant.

### Module 1: `_classify_numeric_kind` (deep module — test in isolation)

This is the most important module to test because it is a pure function with a well-defined contract: given a `Series` and `n_rows`, return a `NumericKind`. Every signal combination is independently testable without constructing fitting infrastructure.

- Integer series `{1,2,3,4,5}` with large `n_rows` → `BoundedDiscrete`
- Integer series `{0,1}` → `BoundedDiscrete`
- Integer series `{0,...,6}` (day of week) → `BoundedDiscrete`
- Integer series `{18,22,35,42,55}` (gaps) → `Continuous` (fails signal 1)
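- Integer series `{18,19,20,21,22}` (tight but non-standard origin) → `Continuous` (fails signal 4)
- Integer series `{2000,...,2010}` (tight but non-standard origin; with 11 unique values, use `n_rows > 220` so that signal 3 passes via the ratio and only signal 4 fails) → `Continuous` (fails signal 4)
- Integer series `{1,...,25}` (tight, origin=1, but range > 20; with 25 unique values, use `n_rows > 500` so that signal 3 passes via the ratio and only signal 2 fails) → `Continuous` (fails signal 2)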
- Integer series `{1,2,3,4,5}` in a 20-row dataset → `BoundedDiscrete` (floor `≤ 10` saves it despite failing ratio)
- Integer series `{1,2,3,4,5,6,7,8,9,10,11}` (11 unique values, fails floor) in a 20-row dataset → `Continuous` (fails signal 3 on both ratio and floor)
- Float series `{1.0,2.0,3.0,4.0,5.0}` (whole-number floats) → `BoundedDiscrete`
- Float series `{0.1,0.5,1.0,1.5,2.0}` (decimal values) → `Continuous` (fails whole-number pre-check)
- Float series `{0.0,1.0}` → `BoundedDiscrete`
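When building these fixtures it can help to check which signals a candidate series actually fails, since several of the examples above fail more than one signal or depend on `n_rows`. The helper below is a hypothetical test utility, not part of the library; it assumes whole-number input with no nulls and hard-codes the thresholds from the Solution section.

```python
def failed_signals(values, n_rows):
    """Return the numbers (1-4) of the signals that a whole-number series fails."""
    unique = sorted(set(values))
    lo, hi, n_unique = unique[0], unique[-1], len(unique)
    checks = {
        1: (hi - lo + 1) == n_unique,                   # tight sequence
        2: (hi - lo) <= 20,                             # small range
        3: (n_unique / n_rows) < 0.05 or n_unique <= 10,  # low cardinality
        4: lo in (0, 1),                                # standard origin
    }
    return [signal for signal, passed in checks.items() if not passed]


likert_small = failed_signals(range(1, 6), n_rows=20)
gappy_ages = failed_signals([18, 22, 35, 42, 55], n_rows=1000)
tight_ages = failed_signals(range(18, 23), n_rows=1000)
years_large = failed_signals(range(2000, 2011), n_rows=1000)
years_small = failed_signals(range(2000, 2011), n_rows=20)
wide_large = failed_signals(range(1, 26), n_rows=1000)
eleven_small = failed_signals(range(1, 12), n_rows=20)
```

With these inputs, `gappy_ages` fails signals 1, 2, and 4 at once, and `years_small` fails both 3 and 4. A fixture meant to isolate a single signal needs values and an `n_rows` for which this helper reports exactly one entry, as `tight_ages`, `years_large`, `wide_large`, and `eleven_small` do.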

**Prior art:** `test_type_detector.py` for classification tests using synthetic `pl.Series`.

### Module 2: `NumericImputer` routing (Priority 5)

Test that the imputer's routing decision respects the new `BoundedDiscrete` classification. Test through `ColumnImputationRecord.strategy` — do not assert on internal branching.

- A column classified `BoundedDiscrete` → strategy is `Mode`
- A column classified `BoundedDiscrete` with `MissingSeverity.High` → still `Mode` (unconditional)
- A column classified `BoundedDiscrete` with `MARSuspect` flag → still `Mode` (unconditional, Priority 5 fires before MAR routing would)
- A column with integer dtype that fails signal 1 (gaps) → strategy is NOT `Mode`; falls to MCAR chain
- A column with integer dtype that fails signal 4 (non-standard origin: min is neither 0 nor 1) → strategy is NOT `Mode`; falls to MCAR chain

**Prior art:** `test_numeric_imputer.py` for strategy assertions via `ColumnImputationRecord`.

### Module 3: NumericKind override mechanism

Test through `ColumnProfile.numeric_kind` and `ColumnProfile.type_flags` — the external observable state produced by the orchestrator after Step 5.

- `set_numeric_kind` with a valid string (`"bounded_discrete"`) sets the correct enum value
- `set_numeric_kind` with an invalid string raises `ValueError`
- A Numeric column with a `numeric_kind_overrides` entry → `cp.numeric_kind` equals the declared kind, `TypeFlag.NumericKindOverride` is present, `TypeFlag.UserOverride` is absent
- A column with both a SemanticType override (to `Categorical`) and a NumericKind override → `ValueError` with message naming the column and actual `SemanticType`
- A column with both a SemanticType override (to `Numeric`) and a NumericKind override → override is applied; both `TypeFlag.UserOverride` and `TypeFlag.NumericKindOverride` are present
- A NumericKind override for a column absent from the DataFrame → silently ignored, no error
- `PipelineConfig.to_dict` / `from_dict` round-trips with `numeric_kind_overrides` populated
- Override flows through to Phase 2 routing: a column forced to `BoundedDiscrete` via override → `ColumnImputationRecord.strategy == Mode`; a column forced to `Continuous` → falls to MCAR chain

**Prior art:** `test_type_detector.py` for orchestrator-level type flag assertions; `test_pipeline_config.py` for `PipelineConfig` round-trip tests.

## Out of Scope

- **Bimodality detection** — a `BoundedDiscrete` column with a bimodal distribution (e.g. 40% rating 1, 40% rating 5) receives Mode imputation even though Mode is not representative of the full distribution. Fixing this requires new Phase 1 computation (Hartigan's Dip Test or GMM) and a new `Stochastic` imputation strategy (random sampling from the observed distribution). Deferred to a future scope.
- **`EncodedCategory` / `CategoricalImputer`** — columns carrying `TypeFlag.EncodedCategory` are `SemanticType.Categorical` and do not reach `NumericImputer`. Their missing values are currently silently ignored. Fixing this requires a new `CategoricalImputer` registered under `SemanticType.Categorical` in `_IMPUTATION_REGISTRY`. Separate future scope.
- **Configurable signal thresholds** — the thresholds used in the compound test (`max - min ≤ 20`, `n_unique / n_rows < 0.05`, `n_unique ≤ 10`) are not exposed in `ProfileConfig`. They are definitional thresholds for what constitutes a bounded discrete column, not operational parameters. If evidence emerges that a specific threshold causes systematic misclassification, they can be made configurable in a later scope.
- **Stochastic imputation for discrete columns** — replacing Mode with distribution-proportional random sampling for any discrete column. Deferred alongside bimodality detection.

## Further Notes

- The rename from `Discrete` to `BoundedDiscrete` is a breaking change for any user code that checks `numeric_kind == "discrete"` directly. `NumericKind` is documented in CONTEXT.md as an internal type accessible via submodule import but not part of the Public API, so exposure is minimal. The rename is preferable to keeping `Discrete` as an imprecise label.
- CONTEXT.md has been updated: the `NumericKind` section, the `Mode` strategy definition, and Priority 5 in the Numeric Imputation Decision Priority chain all reflect `BoundedDiscrete`.
- ADR 0018 documents the compound classification rule, the conservative design choice, and the Phase 1 placement decision.
- The `_DISCRETE_NUNIQUE_THRESHOLD = 20` constant in `_type_detector.py` and the `_DISCRETE_MAX_UNIQUE = 20` constant in `_numeric_profiler.py` should be removed or renamed when this scope ships — they represent the old single-condition rule and are no longer accurate.
- The `Column Override` term in CONTEXT.md refers specifically to `SemanticType` overrides via `PipelineConfig.column_overrides`. `NumericKind` overrides are a distinct mechanism stored in `PipelineConfig.numeric_kind_overrides` and should not be conflated with Column Overrides in documentation or conversation.

