Problem Statement
When a data scientist uses DataForgeML to impute numeric columns, the library misclassifies two structurally different column types under the same label, NumericKind.Discrete, and applies Mode imputation to both. This produces silently wrong results for one of them.
The current rule is: any column that is an integer dtype OR has fewer than 20 unique values is classified Discrete and receives Mode imputation. This catches two very different column types:
Truly bounded discrete columns — columns whose values form a closed, finite set. A 5-level Likert scale {1,2,3,4,5}, a binary flag {0,1}, or a day-of-week encoding {0,...,6}. Mode imputation is semantically correct here.
Low-cardinality integers by accident: integer columns whose observed range happens to be narrow in this sample, but whose underlying domain is unbounded. An age column in a 50-row dataset, a count of rare events, or a year column. Mode imputation is wrong here: the most common age in a small sample is an accident of which rows were observed, so it is a worse fill value than the mean or median.
Both column types pass the same test. But for the second type, Mode produces actively wrong imputed values with no warning to the user.
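To make the failure concrete, here is a small worked example using Python's standard statistics module. The ages are invented for illustration and do not come from any DataForgeML dataset.

```python
from statistics import median, mode

# Hypothetical observed ages from a small sample (illustrative values only).
ages = [19, 19, 34, 41, 47, 52, 58, 63]

mode_fill = mode(ages)      # most frequent observed value
median_fill = median(ages)  # midpoint of the sorted values

print(f"mode fill:   {mode_fill}")
print(f"median fill: {median_fill}")
```

The mode is 19 only because two rows happen to share that age, while the median is 44.0. Every missing age would be filled with a value at the edge of the observed range.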
There is a second, related gap: the Discrete → Mode routing fires at Priority 5, before the MCAR severity chain. This means a low-cardinality integer column with high missingness or a MAR pattern gets Mode instead of model-based imputation, an outcome that was never intended.
A third gap: unlike SemanticType, which users can override per-column via PipelineConfig.column_overrides, there is currently no way to override NumericKind. When auto-detection misclassifies a column — whether to BoundedDiscrete or Continuous — the user has no recourse without hacking internal state.
Solution
Replace the single-condition NumericKind.Discrete classification with a compound four-signal test that distinguishes truly bounded discrete columns from low-cardinality integers. Rename NumericKind.Discrete to NumericKind.BoundedDiscrete to reflect the stricter, more precise definition.
A column is BoundedDiscrete only when all four of the following signals pass:
1. Tight sequence: observed values fill every integer slot between min and max (range_span == n_unique, where range_span = max - min + 1). Eliminates columns with gaps (sparse sampling of a continuous domain).
2. Small range: max - min ≤ 20. Eliminates wide consecutive ranges like years {2000,...,2024} or large ordinal encodings.
3. Low cardinality: n_unique / n_rows < 0.05 OR n_unique ≤ 10. Eliminates small-dataset accidents where a continuous column has few observed values. The absolute floor ≤ 10 protects genuinely fixed-vocabulary columns in small datasets, where the ratio inflates (5 unique values in 20 rows gives 0.25).
4. Standard origin: min == 0 or min == 1. Eliminates continuous variables like age or year whose minimum is non-standard.
Float columns require an additional pre-check: all non-null values must be whole numbers (value % 1 == 0). Float columns with any fractional values are always Continuous — the tight-sequence check assumes integer steps and is undefined for decimal spacing.
The classification is conservative by design: a column must pass all four signals. Failing any one → Continuous. A false Continuous classification (bounded discrete column gets Mean/Median) is suboptimal. A false BoundedDiscrete classification (continuous column gets Mode) is actively wrong.
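The rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name classify_numeric_kind_sketch, the use of a plain sequence instead of a pl.Series, the enum string values, and the handling of empty input are all assumptions made for the example.

```python
from enum import Enum
from typing import Optional, Sequence


class NumericKind(Enum):
    # Member names follow the issue; the string values are assumed.
    Continuous = "continuous"
    BoundedDiscrete = "bounded_discrete"


def classify_numeric_kind_sketch(
    values: Sequence[Optional[float]], n_rows: int
) -> NumericKind:
    """Return BoundedDiscrete only if the pre-check and all four signals pass."""
    observed = [v for v in values if v is not None]
    if not observed or n_rows <= 0:
        # Assumption: with nothing to inspect, stay on the conservative side.
        return NumericKind.Continuous

    # Pre-check: any fractional (or NaN) value makes the column Continuous.
    if any(v % 1 != 0 for v in observed):
        return NumericKind.Continuous

    unique = {int(v) for v in observed}
    n_unique = len(unique)
    lo, hi = min(unique), max(unique)

    tight_sequence = (hi - lo + 1) == n_unique                        # signal 1
    small_range = (hi - lo) <= 20                                     # signal 2
    low_cardinality = (n_unique / n_rows < 0.05) or (n_unique <= 10)  # signal 3
    standard_origin = lo in (0, 1)                                    # signal 4

    if tight_sequence and small_range and low_cardinality and standard_origin:
        return NumericKind.BoundedDiscrete
    return NumericKind.Continuous


# A Likert scale in a 20-row dataset versus an age-like column with gaps.
print(classify_numeric_kind_sketch([1, 2, 3, 4, 5] * 4, n_rows=20))
print(classify_numeric_kind_sketch([18, 22, 35, 42, 55], n_rows=50))
```

Computing each signal as its own named boolean keeps the "vary exactly one signal at a time" testing approach easy to follow, since a failing case can be traced to a single line.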
Alongside the improved classifier, a NumericKind override mechanism is added to PipelineConfig, parallel to the existing column_overrides for SemanticType. Users can explicitly declare a column's NumericKind when auto-detection is insufficient.
User Stories
As a data scientist, I want a column of 5-level Likert ratings {1,2,3,4,5} to be classified as bounded discrete so that missing ratings are filled with the most common rating, not a mean like 3.2.
As a data scientist, I want a binary flag column {0,1} to be classified as bounded discrete so that missing flags are filled with the mode (the dominant class), not the mean (a fractional value).
As a data scientist, I want an age column in a small dataset — which happens to have only 6 unique values — to be classified as continuous so that missing ages are filled with the median or a model-based estimate, not the modal age.
As a data scientist, I want a count column {0,1,2,3} for rare events in a 2,000-row dataset to be classified as continuous if it fails cardinality checks, so that its imputation uses the full MCAR severity chain.
As a data scientist, I want a float column {1.0, 2.0, 3.0, 4.0, 5.0} — a Likert scale stored as float because of null values — to be classified as bounded discrete, so that it receives the same Mode imputation as its integer equivalent.
As a data scientist, I want a float column {0.1, 0.5, 1.0, 1.5, 2.0} with decimal values to be classified as continuous regardless of cardinality, so that it is never mis-treated as a fixed-vocabulary column.
As a data scientist, I want a day-of-week column {0,...,6} to be classified as bounded discrete so that missing days are filled with the most common day.
As a data scientist, I want a year column {2000,...,2024} to be classified as continuous despite being an integer, so that missing years receive median or model-based imputation rather than the modal year.
As a data scientist, I want a 5-level rating column in a 20-row dataset to still be classified as bounded discrete, so that small datasets are not penalised by the cardinality ratio check when the column is genuinely fixed-vocabulary.
As a data scientist, I want a MAR-suspect continuous column that previously had few unique values to now receive model-based imputation instead of Mode, so that its missingness mechanism is actually respected.
As a data scientist, I want the ColumnImputationRecord.signals field to record the classification outcome so that I can see why a column was treated as bounded discrete or continuous.
As a data scientist, I want the NumericKind enum value for bounded discrete columns to be named BoundedDiscrete — not Discrete — so that the label is self-documenting and unambiguous in audit logs.
As a data scientist, I want to override the auto-detected NumericKind for a specific column so that I can correct misclassifications without changing the column's SemanticType.
As a data scientist, I want to force a numeric column to NumericKind.BoundedDiscrete so that it receives Mode imputation even when the four-signal test did not classify it as bounded discrete.
As a data scientist, I want to force a numeric column to NumericKind.Continuous so that it falls into the MCAR severity chain instead of receiving Mode imputation when I know the column's domain is not finite.
As a data scientist, I want an explicit, descriptive error when I declare a NumericKind override for a non-Numeric column so that I immediately know the configuration is invalid and why.
As a data scientist, I want a TypeFlag.NumericKindOverride annotation in ColumnProfile.type_flags so that I can see in the audit JSON that a NumericKind was set manually rather than auto-detected.
As a data scientist, I want convenience methods set_numeric_kind and set_columns_numeric_kind on PipelineConfig so that I can declare overrides without manipulating the dict directly.
Implementation Decisions
Modules modified
NumericKind enum
Rename the Discrete value to BoundedDiscrete. No third value is added — columns that previously were Discrete but fail the new test become Continuous.
_classify_numeric_kind (Phase 1 — type detector)
This is the single location where the compound test lives. Replace the existing single-condition rule (int dtype OR < 20 unique) with the four-signal compound test. Accept n_rows as an additional parameter (needed for signal 3, the cardinality ratio check). This function is a deep module: pure signal-in → NumericKind decision-out, no side effects, independently testable. The EncodedCategory early-exit guard remains unchanged — those columns are SemanticType.Categorical and out of scope.
_fit_one (Phase 2 — numeric imputer)
Update the Priority 5 routing check from NumericKind.Discrete to NumericKind.BoundedDiscrete. No other routing logic changes — BoundedDiscrete → Mode remains unconditional.
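As a rough illustration of what "unconditional" means here, the sketch below routes a column based only on its NumericKind. The function name fit_one_sketch, its parameters, the string labels, and the "mcar_chain" placeholder are hypothetical; earlier priorities and the MCAR severity chain itself are not modelled.

```python
from enum import Enum
from statistics import mode
from typing import Optional, Sequence, Tuple


class NumericKind(Enum):
    Continuous = "continuous"
    BoundedDiscrete = "bounded_discrete"


def fit_one_sketch(
    values: Sequence[Optional[float]],
    numeric_kind: NumericKind,
    high_severity: bool = False,
    mar_suspect: bool = False,
) -> Tuple[str, Optional[float]]:
    """Return (strategy label, fill value) for one column."""
    observed = [v for v in values if v is not None]

    # Priority 5: BoundedDiscrete always maps to Mode. The severity and
    # MAR arguments are deliberately ignored on this branch.
    if numeric_kind is NumericKind.BoundedDiscrete:
        fill = mode(observed) if observed else None
        return ("mode", fill)

    # Everything else falls through to the MCAR severity chain,
    # represented here by a placeholder label only.
    return ("mcar_chain", None)


print(fit_one_sketch([1, 2, 2, 3, None], NumericKind.BoundedDiscrete,
                     high_severity=True, mar_suspect=True))
```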
TypeFlag enum
Add NumericKindOverride as a distinct value. This flag is set when a NumericKind override is applied in the orchestrator. TypeFlag.UserOverride remains unchanged — it means only that a SemanticType was explicitly set by the caller. Both flags can coexist on the same column when the user overrides both.
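PipelineConfig
Add numeric_kind_overrides: dict[str, NumericKind] (default empty dict), parallel to column_overrides. Add convenience methods:
set_numeric_kind(column, kind): single column, accepts NumericKind enum or raw string ("continuous", "bounded_discrete"), with validation matching the pattern of set_column_type.
set_columns_numeric_kind(columns, kind): batch variant.
Update to_dict / from_dict / to_json / from_json to round-trip numeric_kind_overrides.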
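A minimal sketch of how the override methods on PipelineConfig might look is shown below. The class name PipelineConfigSketch, the error messages, and the serialised dictionary layout are assumptions for illustration. The real PipelineConfig has other fields, and its validation should follow set_column_type, which is not reproduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, Union


class NumericKind(Enum):
    Continuous = "continuous"
    BoundedDiscrete = "bounded_discrete"


@dataclass
class PipelineConfigSketch:
    # Hypothetical stand-in holding only the new field.
    numeric_kind_overrides: Dict[str, NumericKind] = field(default_factory=dict)

    def set_numeric_kind(self, column: str, kind: Union[NumericKind, str]) -> None:
        """Declare the NumericKind for one column (enum or raw string)."""
        if isinstance(kind, str):
            try:
                kind = NumericKind(kind)
            except ValueError:
                valid = ", ".join(k.value for k in NumericKind)
                raise ValueError(
                    f"Unknown NumericKind {kind!r}; expected one of: {valid}"
                ) from None
        elif not isinstance(kind, NumericKind):
            raise TypeError("kind must be a NumericKind or a string")
        self.numeric_kind_overrides[column] = kind

    def set_columns_numeric_kind(
        self, columns: Iterable[str], kind: Union[NumericKind, str]
    ) -> None:
        """Batch variant: apply the same kind to several columns."""
        for column in columns:
            self.set_numeric_kind(column, kind)

    def to_dict(self) -> dict:
        # Enums are stored by value so the result is JSON-friendly.
        return {
            "numeric_kind_overrides": {
                col: kind.value for col, kind in self.numeric_kind_overrides.items()
            }
        }

    @classmethod
    def from_dict(cls, data: dict) -> "PipelineConfigSketch":
        cfg = cls()
        for col, kind in data.get("numeric_kind_overrides", {}).items():
            cfg.set_numeric_kind(col, kind)
        return cfg


cfg = PipelineConfigSketch()
cfg.set_numeric_kind("rating", "bounded_discrete")
cfg.set_columns_numeric_kind(["age", "year"], NumericKind.Continuous)
restored = PipelineConfigSketch.from_dict(cfg.to_dict())
print(restored == cfg)
```

Routing from_dict through set_numeric_kind means deserialised strings get the same validation as values supplied directly by the caller.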
StructuralProfiler.profile (orchestrator — Step 5)
Apply numeric_kind_overrides in Step 5, after column_overrides (SemanticType overrides are applied first). For each column in numeric_kind_overrides:
If the column is absent from result.columns (excluded or non-existent), silently ignore.
If cp.semantic_type != SemanticType.Numeric, raise ValueError with an explicit message naming the column and its actual SemanticType. Example: "NumericKind override for column 'price' is invalid — column has SemanticType.Categorical. NumericKind only applies to SemanticType.Numeric columns."
Otherwise set cp.numeric_kind and append TypeFlag.NumericKindOverride if not already present.
SemanticType overrides are applied first within Step 5 so the guard checks the user's declared type — not the detector's type. This matters when the same column has both a SemanticType override (e.g. to Categorical) and a NumericKind override: the error fires based on the final SemanticType.
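The ordering and the guard described above can be sketched as follows. This is an illustration under stated assumptions: ColumnProfileSketch, apply_overrides_sketch, the enum string values, and the way SemanticType overrides set TypeFlag.UserOverride are hypothetical stand-ins for the orchestrator's real Step 5.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class SemanticType(Enum):
    Numeric = "numeric"
    Categorical = "categorical"


class NumericKind(Enum):
    Continuous = "continuous"
    BoundedDiscrete = "bounded_discrete"


class TypeFlag(Enum):
    UserOverride = "user_override"
    NumericKindOverride = "numeric_kind_override"


@dataclass
class ColumnProfileSketch:
    semantic_type: SemanticType
    numeric_kind: Optional[NumericKind] = None
    type_flags: List[TypeFlag] = field(default_factory=list)


def apply_overrides_sketch(
    columns: Dict[str, ColumnProfileSketch],
    column_overrides: Dict[str, SemanticType],
    numeric_kind_overrides: Dict[str, NumericKind],
) -> None:
    # SemanticType overrides go first so the guard below sees the final type.
    for name, semantic_type in column_overrides.items():
        cp = columns.get(name)
        if cp is None:
            continue
        cp.semantic_type = semantic_type
        if TypeFlag.UserOverride not in cp.type_flags:
            cp.type_flags.append(TypeFlag.UserOverride)

    for name, kind in numeric_kind_overrides.items():
        cp = columns.get(name)
        if cp is None:
            continue  # excluded or non-existent column: silently ignored
        if cp.semantic_type is not SemanticType.Numeric:
            raise ValueError(
                f"NumericKind override for column '{name}' is invalid: "
                f"column has SemanticType.{cp.semantic_type.name}. "
                "NumericKind only applies to SemanticType.Numeric columns."
            )
        cp.numeric_kind = kind
        if TypeFlag.NumericKindOverride not in cp.type_flags:
            cp.type_flags.append(TypeFlag.NumericKindOverride)


columns = {"price": ColumnProfileSketch(SemanticType.Numeric)}
apply_overrides_sketch(
    columns,
    column_overrides={},
    numeric_kind_overrides={
        "price": NumericKind.BoundedDiscrete,
        "ghost": NumericKind.Continuous,  # absent column, ignored
    },
)
print(columns["price"].numeric_kind, columns["price"].type_flags)
```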
Architectural decisions
The compound test lives in Phase 1 (_classify_numeric_kind), not in Phase 2. Type classification belongs in Phase 1; Phase 2 consumes decisions, it does not make type judgements. See ADR 0018.
Mode fires unconditionally for BoundedDiscrete at Priority 5 regardless of severity or mechanism. A finite-domain column must receive a finite-domain fill value. Model-based strategies are semantically invalid for this column type. See ADR 0018.
The conservative rule (all four signals required) is intentional. A false Continuous classification is recoverable (suboptimal scalar fill). A false BoundedDiscrete classification is actively wrong (mode of a continuous variable). When ambiguous, default to Continuous.
numeric_kind_overrides lives on PipelineConfig (not ProfileConfig) because NumericKind is consumed cross-phase: Phase 1 writes it, Phase 2 reads it for imputation routing. Placing it on ProfileConfig would require Phase 2 to reach into the profiling config, violating the phase boundary.
TypeFlag.NumericKindOverride is distinct from TypeFlag.UserOverride so the audit log distinguishes "user changed SemanticType" from "user changed NumericKind."
EncodedCategory columns (SemanticType.Categorical) are out of scope. They do not reach NumericImputer. Their missing-value handling belongs to a future CategoricalImputer scope.
Dependencies
Internal (within this scope)
Signal 3 (cardinality ratio) requires n_rows to be threaded into _classify_numeric_kind. This is a small change to the call site in the type detector but is a prerequisite for the full compound test to be correct.
On other scopes
This scope has no hard dependency on any other scope. All four classification signals are already computable from the raw Series and n_rows — no new Phase 1 profilers are required.
Scope 6 (Strategy Decision Engine: Routing Signals and Distribution Shape Escalation, #95) changes which columns reach MCAR routing (adding distribution shape escalation and feature-predictability checks). When Scope 6 and Scope 7 are both shipped, columns that were previously Discrete → Mode but now become Continuous under Scope 7 will be subject to the updated MCAR routing introduced by Scope 6. The combined behaviour should be validated in integration tests.
Testing Decisions
What makes a good test: test classification and routing decisions through the external observable interface, that is, NumericKind on ColumnProfile for Phase 1 tests, and ColumnImputationRecord.strategy plus ColumnImputationRecord.signals for Phase 2 tests. Do not assert on internal branching, private method state, or intermediate variables. Construct minimal synthetic pl.Series and pl.DataFrame objects with known properties. Each test should vary exactly one signal at a time, holding all others constant.
Module 1: _classify_numeric_kind (deep module — test in isolation)
This is the most important module to test because it is a pure function with a well-defined contract: given a Series and n_rows, return a NumericKind. Every signal combination is independently testable without constructing fitting infrastructure.
Integer series {1,2,3,4,5} with large n_rows → BoundedDiscrete
Integer series {0,1} → BoundedDiscrete
Integer series {0,...,6} (day of week) → BoundedDiscrete
Integer series {18,22,35,42,55} (gaps) → Continuous (fails signal 1)
Integer series {18,19,20,21,22} (tight but non-standard origin: min is neither 0 nor 1) → Continuous (fails signal 4)
Integer series {2000,...,2010} (tight but non-standard origin) → Continuous (fails signal 4)
Integer series {1,...,25} (tight, origin=1, but range > 20) → Continuous (fails signal 2)
Integer series {1,2,3,4,5} in a 20-row dataset → BoundedDiscrete (floor ≤ 10 saves it despite failing ratio)
Integer series {1,2,3,4,5,6,7,8,9,10,11} (11 unique values, fails floor) in a 20-row dataset → Continuous (fails signal 3 on both ratio and floor)
Float series {1.0,2.0,3.0,4.0,5.0} (whole-number floats) → BoundedDiscrete
Float series {0.1,0.5,1.0,1.5,2.0} (decimal values) → Continuous (fails whole-number pre-check)
Float series {0.0,1.0} → BoundedDiscrete
Prior art: test_type_detector.py for classification tests using synthetic pl.Series.
Module 2: NumericImputer routing (Priority 5)
Test that the imputer's routing decision respects the new BoundedDiscrete classification. Test through ColumnImputationRecord.strategy — do not assert on internal branching.
A column classified BoundedDiscrete → strategy is Mode
A column classified BoundedDiscrete with MissingSeverity.High → still Mode (unconditional)
A column classified BoundedDiscrete with MARSuspect flag → still Mode (unconditional, Priority 5 fires before MAR routing would)
A column with integer dtype that fails signal 1 (gaps) → strategy is NOT Mode; falls to MCAR chain
A column with integer dtype that fails signal 4 (non-standard origin: min is neither 0 nor 1) → strategy is NOT Mode; falls to MCAR chain
Prior art: test_numeric_imputer.py for strategy assertions via ColumnImputationRecord.
Module 3: NumericKind override mechanism
Test through ColumnProfile.numeric_kind and ColumnProfile.type_flags — the external observable state produced by the orchestrator after Step 5.
set_numeric_kind with a valid string ("bounded_discrete") sets the correct enum value
set_numeric_kind with an invalid string raises ValueError
A Numeric column with a numeric_kind_overrides entry → cp.numeric_kind equals the declared kind, TypeFlag.NumericKindOverride is present, TypeFlag.UserOverride is absent
A column with both a SemanticType override (to Categorical) and a NumericKind override → ValueError with message naming the column and actual SemanticType
A column with both a SemanticType override (to Numeric) and a NumericKind override → override is applied; both TypeFlag.UserOverride and TypeFlag.NumericKindOverride are present
A NumericKind override for a column absent from the DataFrame → silently ignored, no error
PipelineConfig.to_dict / from_dict round-trips with numeric_kind_overrides populated
Override flows through to Phase 2 routing: a column forced to BoundedDiscrete via override → ColumnImputationRecord.strategy == Mode; a column forced to Continuous → falls to MCAR chain
Prior art: test_type_detector.py for orchestrator-level type flag assertions; test_pipeline_config.py for PipelineConfig round-trip tests.
Out of Scope
Bimodality detection — a BoundedDiscrete column with a bimodal distribution (e.g. 40% rating 1, 40% rating 5) receives Mode imputation even though Mode is not representative of the full distribution. Fixing this requires new Phase 1 computation (Hartigan's Dip Test or GMM) and a new Stochastic imputation strategy (random sampling from the observed distribution). Deferred to a future scope.
EncodedCategory / CategoricalImputer — columns carrying TypeFlag.EncodedCategory are SemanticType.Categorical and do not reach NumericImputer. Their missing values are currently silently ignored. Fixing this requires a new CategoricalImputer registered under SemanticType.Categorical in _IMPUTATION_REGISTRY. Separate future scope.
Configurable signal thresholds — the thresholds used in the compound test (max - min ≤ 20, n_unique / n_rows < 0.05, n_unique ≤ 10) are not exposed in ProfileConfig. They are definitional thresholds for what constitutes a bounded discrete column, not operational parameters. If evidence emerges that a specific threshold causes systematic misclassification, they can be made configurable in a later scope.
Stochastic imputation for discrete columns — replacing Mode with distribution-proportional random sampling for any discrete column. Deferred alongside bimodality detection.
Further Notes
The rename from Discrete to BoundedDiscrete is a breaking change for any user code that checks numeric_kind == "discrete" directly. NumericKind is documented in CONTEXT.md as an internal type accessible via submodule import but not part of the Public API, so exposure is minimal. The rename is preferable to keeping Discrete as an imprecise label.
CONTEXT.md has been updated: the NumericKind section, the Mode strategy definition, and Priority 5 in the Numeric Imputation Decision Priority chain all reflect BoundedDiscrete.
ADR 0018 documents the compound classification rule, the conservative design choice, and the Phase 1 placement decision.
The _DISCRETE_NUNIQUE_THRESHOLD = 20 constant in _type_detector.py and the _DISCRETE_MAX_UNIQUE = 20 constant in _numeric_profiler.py should be removed or renamed when this scope ships — they represent the old single-condition rule and are no longer accurate.
The Column Override term in CONTEXT.md refers specifically to SemanticType overrides via PipelineConfig.column_overrides. NumericKind overrides are a distinct mechanism stored in PipelineConfig.numeric_kind_overrides and should not be conflated with Column Overrides in documentation or conversation.