Scope 7: NumericKind.BoundedDiscrete — Compound Classification and Mode Imputation Fix #96

@DEVunderdog

Problem Statement

When a data scientist uses DataForgeML to impute numeric columns, the library misclassifies two structurally different column types under the same label — NumericKind.Discrete — and applies Mode imputation to both. This produces silently wrong results for one of them.

The current rule is: any column that is an integer dtype OR has fewer than 20 unique values is classified Discrete and receives Mode imputation. This catches two very different column types:

  1. Truly bounded discrete columns — columns whose values form a closed, finite set. A 5-level Likert scale {1,2,3,4,5}, a binary flag {0,1}, or a day-of-week encoding {0,...,6}. Mode imputation is semantically correct here.

  2. Low-cardinality integers by accident — integer columns whose observed range happens to be narrow in this sample, but whose underlying domain is unbounded. An age column in a 50-row dataset, a count of rare events, or a year column. Mode imputation is wrong here — filling missing ages with the most common age is no better than filling with the mean or median.

Both column types pass the same test. But for the second type, Mode produces actively wrong imputed values with no warning to the user.
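To make the overlap concrete, here is a minimal sketch of the old single-condition rule in plain Python. The function name classify_numeric_kind_old, its signature, and the use of plain lists in place of the library's Series type are assumptions for illustration; only the rule itself (integer dtype OR fewer than 20 unique values) comes from the description above.

```python
from typing import Optional, Sequence


def classify_numeric_kind_old(
    values: Sequence[Optional[float]], is_integer_dtype: bool
) -> str:
    """Hypothetical stand-in for the old rule: int dtype OR < 20 unique values."""
    n_unique = len({v for v in values if v is not None})
    if is_integer_dtype or n_unique < 20:
        return "discrete"
    return "continuous"


# A closed 5-level Likert scale: Mode is a sensible fill value.
likert = [1, 2, 3, 4, 5, 3, 3, None, 4, 2]
# Ages in a small sample: the domain is open-ended, only the sample is narrow.
ages = [23, 31, 31, 47, None, 58, 64]

likert_kind = classify_numeric_kind_old(likert, is_integer_dtype=True)
ages_kind = classify_numeric_kind_old(ages, is_integer_dtype=True)
print(likert_kind, ages_kind)  # both come out as "discrete"
```

Nothing in the rule's inputs separates the two columns, which is why the fix has to add signals rather than tune the existing threshold.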

There is a second, related gap: the Discrete → Mode routing fires at Priority 5, before the MCAR severity chain. This means a low-cardinality integer column with high missingness or a MAR pattern gets Mode instead of model-based imputation, an outcome that was never intended.

A third gap: unlike SemanticType, which users can override per-column via PipelineConfig.column_overrides, there is currently no way to override NumericKind. When auto-detection misclassifies a column — whether to BoundedDiscrete or Continuous — the user has no recourse without hacking internal state.

Solution

Replace the single-condition NumericKind.Discrete classification with a compound four-signal test that distinguishes truly bounded discrete columns from low-cardinality integers. Rename NumericKind.Discrete to NumericKind.BoundedDiscrete to reflect the stricter, more precise definition.

A column is BoundedDiscrete only when all four of the following signals pass:

  1. Tight sequence: observed values fill every integer slot between min and max (range_span == n_unique, where range_span counts the integer slots from min to max inclusive, i.e. max - min + 1). Eliminates columns with gaps (sparse sampling of a continuous domain).
  2. Small range: max - min ≤ 20. Eliminates wide consecutive ranges like years {2000,...,2024} or large ordinal encodings.
  3. Low cardinality: n_unique / n_rows < 0.05 OR n_unique ≤ 10. Eliminates small-dataset accidents where a continuous column has few observed values. The absolute floor ≤ 10 protects small datasets where the ratio inflates.
  4. Standard origin: min == 0 or min == 1. Eliminates continuous variables like age or year whose minimum is non-standard.

Float columns require an additional pre-check: all non-null values must be whole numbers (value % 1 == 0). Float columns with any fractional values are always Continuous — the tight-sequence check assumes integer steps and is undefined for decimal spacing.

The classification is conservative by design: a column must pass all four signals. Failing any one → Continuous. A false Continuous classification (bounded discrete column gets Mean/Median) is suboptimal. A false BoundedDiscrete classification (continuous column gets Mode) is actively wrong.
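The four signals and the float pre-check can be sketched as a pure function. This is a rough illustration under stated assumptions, not the actual _classify_numeric_kind: it takes a plain list instead of a Series, it treats range_span as max - min + 1, and its handling of an all-null column (default to Continuous) is a guess at a reasonable conservative behaviour.

```python
from enum import Enum
from typing import Optional, Sequence


class NumericKind(Enum):
    Continuous = "continuous"
    BoundedDiscrete = "bounded_discrete"


def classify_numeric_kind(
    values: Sequence[Optional[float]], n_rows: int
) -> NumericKind:
    """Sketch of the compound test; any failed signal falls back to Continuous."""
    observed = [v for v in values if v is not None]
    if not observed or n_rows <= 0:
        # Assumption: with nothing to inspect, take the conservative default.
        return NumericKind.Continuous

    # Float pre-check: any fractional value rules out BoundedDiscrete.
    if any(v % 1 != 0 for v in observed):
        return NumericKind.Continuous

    unique = {int(v) for v in observed}
    lo, hi, n_unique = min(unique), max(unique), len(unique)

    tight_sequence = (hi - lo + 1) == n_unique                      # signal 1
    small_range = (hi - lo) <= 20                                   # signal 2
    low_cardinality = (n_unique / n_rows < 0.05) or n_unique <= 10  # signal 3
    standard_origin = lo in (0, 1)                                  # signal 4

    if tight_sequence and small_range and low_cardinality and standard_origin:
        return NumericKind.BoundedDiscrete
    return NumericKind.Continuous


# 5-level rating in a 20-row dataset: the ratio fails, the n_unique <= 10 floor passes.
print(classify_numeric_kind([1, 2, 3, 4, 5] * 4, n_rows=20))
# Tight sequence, but the origin is 18: fails signal 4.
print(classify_numeric_kind([18, 19, 20, 21, 22], n_rows=1000))
# Likert scale stored as float because of nulls: passes the pre-check.
print(classify_numeric_kind([1.0, 2.0, None, 4.0, 3.0, 5.0], n_rows=1000))
# Fractional floats: rejected by the pre-check before any signal runs.
print(classify_numeric_kind([0.1, 0.5, 1.0, 1.5, 2.0], n_rows=1000))
```

Keeping each signal as a named boolean makes it straightforward to record which one failed, should the signals field later need that level of detail.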

Alongside the improved classifier, a NumericKind override mechanism is added to PipelineConfig, parallel to the existing column_overrides for SemanticType. Users can explicitly declare a column's NumericKind when auto-detection is insufficient.

User Stories

  1. As a data scientist, I want a column of 5-level Likert ratings {1,2,3,4,5} to be classified as bounded discrete so that missing ratings are filled with the most common rating, not a mean like 3.2.
  2. As a data scientist, I want a binary flag column {0,1} to be classified as bounded discrete so that missing flags are filled with the mode (the dominant class), not the mean (a fractional value).
  3. As a data scientist, I want an age column in a small dataset — which happens to have only 6 unique values — to be classified as continuous so that missing ages are filled with the median or a model-based estimate, not the modal age.
  4. As a data scientist, I want a count column {0,1,2,3} for rare events in a 2,000-row dataset to be classified as continuous if it fails cardinality checks, so that its imputation uses the full MCAR severity chain.
  5. As a data scientist, I want a float column {1.0, 2.0, 3.0, 4.0, 5.0} — a Likert scale stored as float because of null values — to be classified as bounded discrete, so that it receives the same Mode imputation as its integer equivalent.
  6. As a data scientist, I want a float column {0.1, 0.5, 1.0, 1.5, 2.0} with decimal values to be classified as continuous regardless of cardinality, so that it is never mis-treated as a fixed-vocabulary column.
  7. As a data scientist, I want a day-of-week column {0,...,6} to be classified as bounded discrete so that missing days are filled with the most common day.
  8. As a data scientist, I want a year column {2000,...,2024} to be classified as continuous despite being an integer, so that missing years receive median or model-based imputation rather than the modal year.
  9. As a data scientist, I want a 5-level rating column in a 20-row dataset to still be classified as bounded discrete, so that small datasets are not penalised by the cardinality ratio check when the column is genuinely fixed-vocabulary.
  10. As a data scientist, I want a MAR-suspect continuous column that previously had few unique values to now receive model-based imputation instead of Mode, so that its missingness mechanism is actually respected.
  11. As a data scientist, I want the ColumnImputationRecord.signals field to record the classification outcome so that I can see why a column was treated as bounded discrete or continuous.
  12. As a data scientist, I want the NumericKind enum value for bounded discrete columns to be named BoundedDiscrete — not Discrete — so that the label is self-documenting and unambiguous in audit logs.
  13. As a data scientist, I want to override the auto-detected NumericKind for a specific column so that I can correct misclassifications without changing the column's SemanticType.
  14. As a data scientist, I want to force a numeric column to NumericKind.BoundedDiscrete so that it receives Mode imputation even when the four-signal test did not classify it as bounded discrete.
  15. As a data scientist, I want to force a numeric column to NumericKind.Continuous so that it falls into the MCAR severity chain instead of receiving Mode imputation when I know the column's domain is not finite.
  16. As a data scientist, I want an explicit, descriptive error when I declare a NumericKind override for a non-Numeric column so that I immediately know the configuration is invalid and why.
  17. As a data scientist, I want a TypeFlag.NumericKindOverride annotation in ColumnProfile.type_flags so that I can see in the audit JSON that a NumericKind was set manually rather than auto-detected.
  18. As a data scientist, I want convenience methods set_numeric_kind and set_columns_numeric_kind on PipelineConfig so that I can declare overrides without manipulating the dict directly.

Implementation Decisions

Modules modified

NumericKind enum
Rename the Discrete value to BoundedDiscrete. No third value is added — columns that previously were Discrete but fail the new test become Continuous.

_classify_numeric_kind (Phase 1 — type detector)
This is the single location where the compound test lives. Replace the existing single-condition rule (int dtype OR < 20 unique) with the four-signal compound test. Accept n_rows as an additional parameter (needed for signal 3, the cardinality ratio check). This function is a deep module: pure signal-in → NumericKind decision-out, no side effects, independently testable. The EncodedCategory early-exit guard remains unchanged — those columns are SemanticType.Categorical and out of scope.

_fit_one (Phase 2 — numeric imputer)
Update the Priority 5 routing check from NumericKind.Discrete to NumericKind.BoundedDiscrete. No other routing logic changes — BoundedDiscrete → Mode remains unconditional.
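As a rough illustration of what "unconditional" means here, the sketch below routes on NumericKind before looking at anything else. The function route_strategy, its parameters, and the string labels are hypothetical; the real _fit_one has more priorities and richer inputs that this issue does not spell out.

```python
from enum import Enum


class NumericKind(Enum):
    Continuous = "continuous"
    BoundedDiscrete = "bounded_discrete"


def route_strategy(
    kind: NumericKind, high_missingness: bool, mar_suspect: bool
) -> str:
    """Hypothetical routing: the BoundedDiscrete check precedes severity/MAR logic."""
    if kind is NumericKind.BoundedDiscrete:
        # Unconditional: severity and mechanism are deliberately not consulted.
        return "mode"
    # Placeholder for the later priorities, which this scope leaves untouched.
    return "mcar_severity_chain"


print(route_strategy(NumericKind.BoundedDiscrete, high_missingness=True, mar_suspect=True))
print(route_strategy(NumericKind.Continuous, high_missingness=True, mar_suspect=True))
```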

TypeFlag enum
Add NumericKindOverride as a distinct value. This flag is set when a NumericKind override is applied in the orchestrator. TypeFlag.UserOverride remains unchanged — it means only that a SemanticType was explicitly set by the caller. Both flags can coexist on the same column when the user overrides both.

PipelineConfig
Add numeric_kind_overrides: dict[str, NumericKind] (default empty dict), parallel to column_overrides. Add convenience methods:

  • set_numeric_kind(column, kind) — single column, accepts NumericKind enum or raw string ("continuous", "bounded_discrete"), with validation matching the pattern of set_column_type.
  • set_columns_numeric_kind(columns, kind) — batch variant.

Update to_dict / from_dict / to_json / from_json to round-trip numeric_kind_overrides.
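A minimal sketch of the override storage and convenience methods might look like the following. The class name PipelineConfigSketch, the helper _coerce_kind, the error wording, and the serialized shape are assumptions; the real PipelineConfig carries many other fields, and set_column_type's exact validation pattern is not reproduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, Union


class NumericKind(Enum):
    Continuous = "continuous"
    BoundedDiscrete = "bounded_discrete"


def _coerce_kind(kind: Union[NumericKind, str]) -> NumericKind:
    """Accept an enum member or its raw string; reject anything else."""
    if isinstance(kind, NumericKind):
        return kind
    try:
        return NumericKind(kind)
    except ValueError:
        valid = ", ".join(repr(k.value) for k in NumericKind)
        raise ValueError(
            f"Unknown NumericKind {kind!r}; expected one of {valid}"
        ) from None


@dataclass
class PipelineConfigSketch:
    numeric_kind_overrides: Dict[str, NumericKind] = field(default_factory=dict)

    def set_numeric_kind(self, column: str, kind: Union[NumericKind, str]) -> None:
        self.numeric_kind_overrides[column] = _coerce_kind(kind)

    def set_columns_numeric_kind(
        self, columns: Iterable[str], kind: Union[NumericKind, str]
    ) -> None:
        resolved = _coerce_kind(kind)  # validate once, before touching any column
        for column in columns:
            self.numeric_kind_overrides[column] = resolved

    def to_dict(self) -> dict:
        # Store raw strings so the result is JSON-serializable.
        return {
            "numeric_kind_overrides": {
                c: k.value for c, k in self.numeric_kind_overrides.items()
            }
        }

    @classmethod
    def from_dict(cls, data: dict) -> "PipelineConfigSketch":
        raw = data.get("numeric_kind_overrides", {})
        return cls(numeric_kind_overrides={c: _coerce_kind(k) for c, k in raw.items()})


config = PipelineConfigSketch()
config.set_numeric_kind("satisfaction", "bounded_discrete")
config.set_columns_numeric_kind(["age", "tenure_years"], NumericKind.Continuous)
restored = PipelineConfigSketch.from_dict(config.to_dict())
print(restored == config)
```

Serializing the enum by value keeps the dict form identical to the raw strings users are allowed to pass in, so a hand-edited JSON config and a programmatic one validate through the same path.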

StructuralProfiler.profile (orchestrator — Step 5)
Apply numeric_kind_overrides in Step 5, after column_overrides (SemanticType overrides are applied first). For each column in numeric_kind_overrides:

  • If the column is absent from result.columns (excluded or non-existent), silently ignore.
  • If cp.semantic_type != SemanticType.Numeric, raise ValueError with an explicit message naming the column and its actual SemanticType. Example: "NumericKind override for column 'price' is invalid — column has SemanticType.Categorical. NumericKind only applies to SemanticType.Numeric columns."
  • Otherwise set cp.numeric_kind and append TypeFlag.NumericKindOverride if not already present.

SemanticType overrides are applied first within Step 5 so the guard checks the user's declared type — not the detector's type. This matters when the same column has both a SemanticType override (e.g. to Categorical) and a NumericKind override: the error fires based on the final SemanticType.
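The ordering and the three outcomes described above can be sketched as follows. Everything here is a simplified stand-in: the ColumnProfile dataclass, the enum string values, the function apply_step5_overrides, and the way SemanticType overrides are applied are assumptions made so the example runs on its own. The error text is modelled on the example message.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class SemanticType(Enum):
    Numeric = "numeric"
    Categorical = "categorical"


class NumericKind(Enum):
    Continuous = "continuous"
    BoundedDiscrete = "bounded_discrete"


class TypeFlag(Enum):
    UserOverride = "user_override"
    NumericKindOverride = "numeric_kind_override"


@dataclass
class ColumnProfile:
    semantic_type: SemanticType
    numeric_kind: Optional[NumericKind] = None
    type_flags: List[TypeFlag] = field(default_factory=list)


def _add_flag(cp: ColumnProfile, flag: TypeFlag) -> None:
    if flag not in cp.type_flags:
        cp.type_flags.append(flag)


def apply_step5_overrides(
    columns: Dict[str, ColumnProfile],
    column_overrides: Dict[str, SemanticType],
    numeric_kind_overrides: Dict[str, NumericKind],
) -> None:
    # SemanticType overrides first, so the guard below sees the final type.
    for name, semantic_type in column_overrides.items():
        cp = columns.get(name)
        if cp is None:
            continue
        cp.semantic_type = semantic_type
        _add_flag(cp, TypeFlag.UserOverride)

    for name, kind in numeric_kind_overrides.items():
        cp = columns.get(name)
        if cp is None:
            continue  # excluded or non-existent column: silently ignored
        if cp.semantic_type is not SemanticType.Numeric:
            raise ValueError(
                f"NumericKind override for column {name!r} is invalid: "
                f"column has SemanticType.{cp.semantic_type.name}. "
                "NumericKind only applies to SemanticType.Numeric columns."
            )
        cp.numeric_kind = kind
        _add_flag(cp, TypeFlag.NumericKindOverride)


columns = {"rating": ColumnProfile(SemanticType.Numeric, NumericKind.Continuous)}
apply_step5_overrides(
    columns,
    column_overrides={},
    numeric_kind_overrides={
        "rating": NumericKind.BoundedDiscrete,
        "not_in_frame": NumericKind.Continuous,  # ignored without error
    },
)
print(columns["rating"].numeric_kind, columns["rating"].type_flags)
```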

Architectural decisions

  • The compound test lives in Phase 1 (_classify_numeric_kind), not in Phase 2. Type classification belongs in Phase 1; Phase 2 consumes decisions, it does not make type judgements. See ADR 0018.
  • Mode fires unconditionally for BoundedDiscrete at Priority 5 regardless of severity or mechanism. A finite-domain column must receive a finite-domain fill value. Model-based strategies are semantically invalid for this column type. See ADR 0018.
  • The conservative rule (all four signals required) is intentional. A false Continuous classification is recoverable (suboptimal scalar fill). A false BoundedDiscrete classification is actively wrong (mode of a continuous variable). When ambiguous, default to Continuous.
  • numeric_kind_overrides lives on PipelineConfig (not ProfileConfig) because NumericKind is consumed cross-phase: Phase 1 writes it, Phase 2 reads it for imputation routing. Placing it on ProfileConfig would require Phase 2 to reach into the profiling config, violating the phase boundary.
  • TypeFlag.NumericKindOverride is distinct from TypeFlag.UserOverride so the audit log distinguishes "user changed SemanticType" from "user changed NumericKind."
  • EncodedCategory columns (SemanticType.Categorical) are out of scope. They do not reach NumericImputer. Their missing-value handling belongs to a future CategoricalImputer scope.
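The "finite-domain fill value" argument in the decisions above is easy to check by hand. The sketch below uses only the standard library and a made-up ratings column; the tie-breaking rule (smallest mode wins) is an assumption, since this issue does not say how the library resolves ties.

```python
from statistics import mean, multimode

# Hypothetical 5-level rating column with two missing entries.
ratings = [1, 2, 3, 3, 3, 4, 5, 5, None, None]
observed = [r for r in ratings if r is not None]

mean_fill = mean(observed)            # 26 / 8 = 3.25, not a valid rating
mode_fill = min(multimode(observed))  # assumption: break ties by smallest value

filled = [mode_fill if r is None else r for r in ratings]
print(mean_fill, mode_fill, filled)
```

The mean lands between two levels of the scale, while the mode is by construction a member of the observed domain, which is the property the Priority 5 rule relies on.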

Dependencies

Internal (within this scope)
Signal 3 (cardinality ratio) requires n_rows to be threaded into _classify_numeric_kind. This is a small change to the call site in the type detector but is a prerequisite for the full compound test to be correct.

On other scopes

Testing Decisions

What makes a good test: test classification and routing decisions through the external observable interface — NumericKind on ColumnProfile for Phase 1 tests, and ColumnImputationRecord.strategy plus ColumnImputationRecord.signals for Phase 2 tests. Do not assert on internal branching, private method state, or intermediate variables. Construct minimal synthetic pl.Series and pl.DataFrame objects with known properties. Each test should vary exactly one signal at a time, holding all others constant.

Module 1: _classify_numeric_kind (deep module — test in isolation)

This is the most important module to test because it is a pure function with a well-defined contract: given a Series and n_rows, return a NumericKind. Every signal combination is independently testable without constructing fitting infrastructure.

  • Integer series {1,2,3,4,5} with large n_rows → BoundedDiscrete
  • Integer series {0,1} → BoundedDiscrete
  • Integer series {0,...,6} (day of week) → BoundedDiscrete
  • Integer series {18,22,35,42,55} (gaps) → Continuous (fails signal 1)
  • Integer series {18,19,20,21,22} (tight but non-standard origin) → Continuous (fails signal 4)
  • Integer series {2000,...,2010} (tight but non-standard origin) → Continuous (fails signal 4)
  • Integer series {1,...,25} (tight, origin=1, but range > 20) → Continuous (fails signal 2)
  • Integer series {1,2,3,4,5} in a 20-row dataset → BoundedDiscrete (floor ≤ 10 saves it despite failing ratio)
  • Integer series {1,2,3,4,5,6,7,8,9,10,11} (11 unique values, fails floor) in a 20-row dataset → Continuous (fails signal 3 on both ratio and floor)
  • Float series {1.0,2.0,3.0,4.0,5.0} (whole-number floats) → BoundedDiscrete
  • Float series {0.1,0.5,1.0,1.5,2.0} (decimal values) → Continuous (fails whole-number pre-check)
  • Float series {0.0,1.0} → BoundedDiscrete

Prior art: test_type_detector.py for classification tests using synthetic pl.Series.

Module 2: NumericImputer routing (Priority 5)

Test that the imputer's routing decision respects the new BoundedDiscrete classification. Test through ColumnImputationRecord.strategy — do not assert on internal branching.

  • A column classified BoundedDiscrete → strategy is Mode
  • A column classified BoundedDiscrete with MissingSeverity.High → still Mode (unconditional)
  • A column classified BoundedDiscrete with MARSuspect flag → still Mode (unconditional, Priority 5 fires before MAR routing would)
  • A column with integer dtype that fails signal 1 (gaps) → strategy is NOT Mode; falls to MCAR chain
  • A column with integer dtype that fails signal 4 (non-standard origin, min is neither 0 nor 1) → strategy is NOT Mode; falls to MCAR chain

Prior art: test_numeric_imputer.py for strategy assertions via ColumnImputationRecord.

Module 3: NumericKind override mechanism

Test through ColumnProfile.numeric_kind and ColumnProfile.type_flags — the external observable state produced by the orchestrator after Step 5.

  • set_numeric_kind with a valid string ("bounded_discrete") sets the correct enum value
  • set_numeric_kind with an invalid string raises ValueError
  • A Numeric column with a numeric_kind_overrides entry → cp.numeric_kind equals the declared kind, TypeFlag.NumericKindOverride is present, TypeFlag.UserOverride is absent
  • A column with both a SemanticType override (to Categorical) and a NumericKind override → ValueError with message naming the column and actual SemanticType
  • A column with both a SemanticType override (to Numeric) and a NumericKind override → override is applied; both TypeFlag.UserOverride and TypeFlag.NumericKindOverride are present
  • A NumericKind override for a column absent from the DataFrame → silently ignored, no error
  • PipelineConfig.to_dict / from_dict round-trips with numeric_kind_overrides populated
  • Override flows through to Phase 2 routing: a column forced to BoundedDiscrete via override → ColumnImputationRecord.strategy == Mode; a column forced to Continuous → falls to MCAR chain

Prior art: test_type_detector.py for orchestrator-level type flag assertions; test_pipeline_config.py for PipelineConfig round-trip tests.

Out of Scope

  • Bimodality detection — a BoundedDiscrete column with a bimodal distribution (e.g. 40% rating 1, 40% rating 5) receives Mode imputation even though Mode is not representative of the full distribution. Fixing this requires new Phase 1 computation (Hartigan's Dip Test or GMM) and a new Stochastic imputation strategy (random sampling from the observed distribution). Deferred to a future scope.
  • EncodedCategory / CategoricalImputer — columns carrying TypeFlag.EncodedCategory are SemanticType.Categorical and do not reach NumericImputer. Their missing values are currently silently ignored. Fixing this requires a new CategoricalImputer registered under SemanticType.Categorical in _IMPUTATION_REGISTRY. Separate future scope.
  • Configurable signal thresholds — the thresholds used in the compound test (max - min ≤ 20, n_unique / n_rows < 0.05, n_unique ≤ 10) are not exposed in ProfileConfig. They are definitional thresholds for what constitutes a bounded discrete column, not operational parameters. If evidence emerges that a specific threshold causes systematic misclassification, they can be made configurable in a later scope.
  • Stochastic imputation for discrete columns — replacing Mode with distribution-proportional random sampling for any discrete column. Deferred alongside bimodality detection.

Further Notes

  • The rename from Discrete to BoundedDiscrete is a breaking change for any user code that checks numeric_kind == "discrete" directly. NumericKind is documented in CONTEXT.md as an internal type accessible via submodule import but not part of the Public API, so exposure is minimal. The rename is preferable to keeping Discrete as an imprecise label.
  • CONTEXT.md has been updated: the NumericKind section, the Mode strategy definition, and Priority 5 in the Numeric Imputation Decision Priority chain all reflect BoundedDiscrete.
  • ADR 0018 documents the compound classification rule, the conservative design choice, and the Phase 1 placement decision.
  • The _DISCRETE_NUNIQUE_THRESHOLD = 20 constant in _type_detector.py and the _DISCRETE_MAX_UNIQUE = 20 constant in _numeric_profiler.py should be removed or renamed when this scope ships — they represent the old single-condition rule and are no longer accurate.
  • The Column Override term in CONTEXT.md refers specifically to SemanticType overrides via PipelineConfig.column_overrides. NumericKind overrides are a distinct mechanism stored in PipelineConfig.numeric_kind_overrides and should not be conflated with Column Overrides in documentation or conversation.
