Problem Statement
When a data scientist uses DataForgeML to impute a column routed to ImputationStrategy.KNN, the library hard-codes KNNImputer(n_neighbors=5) with uniform weighting and no feature scaling. This silently degrades imputation quality in three interconnected ways:
- Fixed k regardless of dataset structure: k=5 is appropriate for neither high-dimensional spaces (where five neighbors may all be far away, making their average no better than the column mean) nor dense low-dimensional spaces (where k=5 makes imputed values sensitive to one or two outlier rows). The signals needed to choose k adaptively are available at fit time but unused.
- Uniform weighting regardless of distance reliability: weights="uniform" averages all k neighbors equally. When distances are reliable (low missingness, low dimensionality), weights="distance" is generally more informative, because genuinely closer rows contribute more. When distances are unreliable (high missingness, high dimensionality), distance-weighting amplifies noise. The library uses uniform unconditionally.
- No feature scaling before KNN: nan_euclidean distances are scale-sensitive. With a column ranging 0–1 next to a column ranging 0–1,000,000, distances are dominated almost entirely by the large-magnitude column. The imputed value is driven by whichever column has the largest numeric range, not by genuine row similarity.
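To make the scale-dominance problem concrete, the sketch below re-implements a NaN-aware Euclidean distance in plain Python, following the formula documented for scikit-learn's nan_euclidean metric (squared differences over coordinates present in both rows, rescaled by total/present coordinates). The helper name and the sample rows are illustrative assumptions, not DataForgeML code.

```python
import math

def nan_euclidean(a, b):
    """NaN-aware Euclidean distance between two equal-length rows.

    Coordinates missing in either row are skipped, and the squared sum is
    rescaled by (total coordinates / present coordinates).
    """
    present = [(x, y) for x, y in zip(a, b)
               if not (math.isnan(x) or math.isnan(y))]
    if not present:
        return math.nan
    squared = sum((x - y) ** 2 for x, y in present)
    return math.sqrt(len(a) / len(present) * squared)

# Column 0 ranges 0-1, column 1 ranges 0-1,000,000 (illustrative values).
query = [0.10, 500_000.0]
row_a = [0.12, 520_000.0]  # differs by 2% of each column's range
row_b = [0.95, 501_000.0]  # differs by 85% of column 0's range, 0.1% of column 1's

d_a = nan_euclidean(query, row_a)
d_b = nan_euclidean(query, row_b)

# Unscaled, row_b counts as the nearer neighbor even though it is far away
# in column 0: the large-magnitude column decides the ranking.
print(f"d(query, row_a) = {d_a:.2f}, d(query, row_b) = {d_b:.2f}")
```

After per-column scaling, the 85% gap in column 0 would no longer be invisible next to column 1.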
Solution
Replace the hard-coded KNNImputer(n_neighbors=5) with data-adaptive parameter selection and NaN-safe feature scaling:
- Compute n_neighbors dynamically from three signals at Phase 2 fit time (dimensionality, feature matrix missingness fraction, and complete row fraction) via a multiplicative formula bounded by configurable min/max values.
- Derive weights from the same reliability assessment used to compute k. When distances are reliable (low missingness and low dimensionality), use "distance"; otherwise "uniform". No separate diagnostic is required: the weights decision is the inverse of the k decision.
- Scale feature columns before KNN fit using NaN-safe standardisation (StandardScaler-style z-scoring with nanmean/nanstd computed from the KNN training matrix). Inverse-scale imputed outputs. Store scaling parameters alongside the fitted model in a new _FittedKNN structure.
User Stories
- As a data scientist, I want n_neighbors to increase automatically for high-dimensional feature spaces, so that the imputed value is not driven by a handful of accidentally nearby rows.
- As a data scientist, I want n_neighbors to increase when the feature matrix has high missingness, so that distances computed over varying feature subsets are averaged over enough neighbors to be stable.
- As a data scientist, I want n_neighbors to increase when few complete rows exist in the KNN matrix, so that imputation is stable when the training set has sparse structure.
- As a data scientist, I want weights="distance" applied automatically when distances are reliable, so that genuinely closer neighbors contribute more without any manual configuration.
- As a data scientist, I want weights="uniform" applied automatically when distances are unreliable, so that noisy distances are not amplified by down-weighting farther neighbors.
- As a data scientist, I want feature columns scaled to comparable ranges before KNN, so that imputed values reflect genuine row similarity rather than whichever column has the largest numeric magnitude.
- As a data scientist, I want to see the chosen n_neighbors, weights, and scaling decision in the audit log, so that I understand what the library chose and why.
- As a data scientist, I want the minimum and maximum allowed n_neighbors and the weights reliability thresholds to be configurable, so that I can tune KNN behaviour for domain-specific datasets.
Implementation Decisions
Modified: KNN fitting in NumericImputer (Phase 2)
Replace the hard-coded KNNImputer(n_neighbors=5) with the following at fit time:
Signal computation (from train_df at fit time, no Phase 1 changes required):
n_features = count of KNN columns
miss_frac = fraction of NaN cells in the KNN training matrix
complete_frac = fraction of rows with no NaN across all KNN columns
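The three signals can be read directly off the KNN training matrix. The following is a minimal NumPy sketch, assuming the KNN columns of train_df have already been extracted into a 2-D float array; the function name knn_signals is hypothetical.

```python
import numpy as np

def knn_signals(X):
    """Return (n_features, miss_frac, complete_frac) for a 2-D float matrix."""
    X = np.asarray(X, dtype=float)
    nan_mask = np.isnan(X)
    n_features = X.shape[1]
    # Fraction of all cells that are NaN.
    miss_frac = float(nan_mask.mean())
    # Fraction of rows with no NaN in any KNN column.
    complete_frac = float((~nan_mask.any(axis=1)).mean())
    return n_features, miss_frac, complete_frac

X_train = np.array([
    [1.0,    2.0,  np.nan],
    [4.0,    5.0,  6.0],
    [np.nan, 8.0,  9.0],
    [10.0,   11.0, 12.0],
])
n_features, miss_frac, complete_frac = knn_signals(X_train)
```

Note that miss_frac counts cells while complete_frac counts rows, so a matrix with few NaN cells spread over many rows can have a low miss_frac and a low complete_frac at the same time.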
n_neighbors formula — multiplicative, so signals compound each other's effect (both high missingness and low completeness arise from the same underlying data sparsity and their combined effect on distance reliability is multiplicative):
base_k = max(knn_min_neighbors, int(sqrt(n_features)))
k = base_k × (1 + miss_frac) × sqrt(1 / max(complete_frac, 0.1))
n_neighbors = min(max(knn_min_neighbors, int(k)), n_rows − 1, knn_max_neighbors)
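As a rough illustration of the three formula lines above, the sketch below evaluates them in plain Python. The function name choose_n_neighbors and the sample inputs are hypothetical; the defaults mirror the config fields proposed in this document, and the square root is taken over the completeness factor only.

```python
import math

def choose_n_neighbors(n_features, miss_frac, complete_frac, n_rows,
                       knn_min_neighbors=5, knn_max_neighbors=25):
    """Evaluate the multiplicative k formula and clamp the result."""
    base_k = max(knn_min_neighbors, int(math.sqrt(n_features)))
    # complete_frac is floored at 0.1 so the factor never exceeds sqrt(10).
    k = base_k * (1 + miss_frac) * math.sqrt(1 / max(complete_frac, 0.1))
    return min(max(knn_min_neighbors, int(k)), n_rows - 1, knn_max_neighbors)

# Clean, narrow matrix: k stays at the floor.
k_clean = choose_n_neighbors(n_features=9, miss_frac=0.0, complete_frac=1.0, n_rows=1000)

# Half the cells missing, a quarter of rows complete: 5 * 1.5 * 2 = 15.
k_sparse = choose_n_neighbors(n_features=9, miss_frac=0.5, complete_frac=0.25, n_rows=1000)

# Wide and very sparse: the raw value (about 47) is capped at knn_max_neighbors.
k_capped = choose_n_neighbors(n_features=100, miss_frac=0.5, complete_frac=0.05, n_rows=1000)

# Tiny training set: n_rows - 1 wins even over the configured floor.
k_tiny = choose_n_neighbors(n_features=9, miss_frac=0.0, complete_frac=1.0, n_rows=4)
```

With the default knn_min_neighbors of 5, int(sqrt(n_features)) only exceeds the floor once n_features reaches 36, so for narrower matrices k is driven by miss_frac and complete_frac alone.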
weights formula — binary threshold, both conditions must pass for distance weighting:
reliability_high = (miss_frac < knn_distance_weight_max_null_ratio) AND (n_features ≤ knn_distance_weight_max_features)
weights = "distance" if reliability_high else "uniform"
The weights decision is the inverse of the k decision: when you increase k because distances are unreliable, uniform weighting is safer. When k stays small because space is dense, distance weighting adds value.
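The threshold rule is small enough to show in full. This is a sketch under the assumption that the two thresholds are passed in as plain numbers; the function name choose_weights is hypothetical, and the defaults are the ones proposed for NumericImputationConfig.

```python
def choose_weights(miss_frac, n_features,
                   knn_distance_weight_max_null_ratio=0.15,
                   knn_distance_weight_max_features=30):
    """Return "distance" only when both reliability conditions hold."""
    reliability_high = (
        miss_frac < knn_distance_weight_max_null_ratio          # strict <
        and n_features <= knn_distance_weight_max_features      # inclusive <=
    )
    return "distance" if reliability_high else "uniform"

w_reliable = choose_weights(miss_frac=0.05, n_features=10)
w_too_sparse = choose_weights(miss_frac=0.15, n_features=10)   # at the threshold
w_too_wide = choose_weights(miss_frac=0.05, n_features=31)
```

The asymmetry at the boundaries follows the formula as written: a miss_frac exactly equal to the threshold falls back to "uniform", while n_features exactly equal to its threshold still qualifies for "distance".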
NaN-safe feature scaling:
Compute col_means and col_stds using nanmean/nanstd on the KNN training matrix (set zero stds to 1.0). Scale the matrix — NaN cells remain NaN after scaling. Fit KNNImputer on the scaled matrix.
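The scaling step described above might look like the following NumPy sketch. The function names are hypothetical, and the sketch assumes every column has at least one observed value (np.nanmean returns NaN and warns for an all-NaN column).

```python
import numpy as np

def fit_nan_safe_scaler(X):
    """Per-column mean and std that ignore NaN cells."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    col_stds = np.nanstd(X, axis=0)
    col_stds[col_stds == 0] = 1.0  # constant columns: avoid division by zero
    return col_means, col_stds

def scale(X, col_means, col_stds):
    """Standardise; NaN cells stay NaN because NaN propagates through arithmetic."""
    return (np.asarray(X, dtype=float) - col_means) / col_stds

def inverse_scale(Z, col_means, col_stds):
    """Map scaled values back to the original units."""
    return np.asarray(Z, dtype=float) * col_stds + col_means

X_train = np.array([
    [0.2,    100_000.0, 7.0],
    [0.4,    np.nan,    7.0],
    [np.nan, 900_000.0, 7.0],
    [0.9,    500_000.0, 7.0],
])
col_means, col_stds = fit_nan_safe_scaler(X_train)
Z = scale(X_train, col_means, col_stds)
X_back = inverse_scale(Z, col_means, col_stds)
```

The same col_means and col_stds must be reused at transform time; recomputing them on new data would put training and inference rows in different coordinate systems.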
Signals recorded — two entries appended to each KNN column's ColumnImputationRecord.signals:
"knn_params: n_neighbors={k}, weights={w} | n_features={f}, miss_frac={m:.2f}, complete_frac={c:.2f}"
"knn_scaling: applied StandardScaler (nanmean/nanstd) across {n_features} feature columns"
New: _FittedKNN dataclass (in FittedImputer module)
Stores model (the fitted KNNImputer), col_means, and col_stds. Replaces the bare KNNImputer previously stored under "knn" in FittedImputer.models. The existing joblib serialisation path handles the dataclass without modification.
New: _apply_knn transform helper (in FittedImputer module)
Dedicated function replacing the generic block-model application for KNN. Applies: scale input → KNNImputer.transform() → inverse-scale output. The generic _apply_block_model is retained for MICE only.
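A minimal sketch of how the _FittedKNN structure and the _apply_knn helper could fit together is shown below. It assumes array inputs rather than DataFrames, and it uses a stub in place of a fitted KNNImputer so the example runs without scikit-learn; the stub class and the sample numbers are illustrative only.

```python
from dataclasses import dataclass
from typing import Any

import numpy as np

@dataclass
class _FittedKNN:
    model: Any             # fitted KNNImputer (here: any object with .transform)
    col_means: np.ndarray  # per-column nanmean from the training matrix
    col_stds: np.ndarray   # per-column nanstd, zeros replaced by 1.0

def _apply_knn(fitted, X):
    """Scale input, impute in scaled space, then inverse-scale the output.

    Assumes no column was entirely NaN at fit time, so the model returns
    a matrix with the same number of columns it received.
    """
    X = np.asarray(X, dtype=float)
    scaled = (X - fitted.col_means) / fitted.col_stds
    imputed_scaled = fitted.model.transform(scaled)
    return imputed_scaled * fitted.col_stds + fitted.col_means

class _ZeroFillStub:
    """Stand-in for a fitted imputer: fills NaN with 0.0 in scaled space."""
    def transform(self, Z):
        return np.where(np.isnan(Z), 0.0, Z)

fitted = _FittedKNN(
    model=_ZeroFillStub(),
    col_means=np.array([10.0, 1000.0]),
    col_stds=np.array([2.0, 500.0]),
)
X_new = np.array([[12.0, np.nan],
                  [np.nan, 1500.0]])
X_imputed = _apply_knn(fitted, X_new)
```

Because the stub fills with zero in scaled space, each missing cell comes back as its column mean in original units, which makes the inverse-scaling step easy to check by eye.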
Modified: NumericImputationConfig
Four new fields alongside existing KNN routing guards (knn_max_rows, knn_max_features):
knn_min_neighbors (default 5) — floor on computed k
knn_max_neighbors (default 25) — cap on computed k
knn_distance_weight_max_null_ratio (default 0.15) — miss_frac threshold for distance weighting
knn_distance_weight_max_features (default 30) — dimensionality threshold for distance weighting
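For orientation, the four new fields could be declared as below. This sketch covers only the fields proposed here, under a hypothetical class name; the real NumericImputationConfig has further fields, and the validation in __post_init__ is an assumption of the sketch rather than part of the design.

```python
from dataclasses import dataclass

@dataclass
class KNNConfigSketch:
    """Only the four proposed KNN fields, with the proposed defaults."""
    knn_min_neighbors: int = 5                        # floor on computed k
    knn_max_neighbors: int = 25                       # cap on computed k
    knn_distance_weight_max_null_ratio: float = 0.15  # miss_frac threshold
    knn_distance_weight_max_features: int = 30        # dimensionality threshold

    def __post_init__(self):
        # Assumed sanity checks, not stated in the design.
        if not 1 <= self.knn_min_neighbors <= self.knn_max_neighbors:
            raise ValueError("expected 1 <= knn_min_neighbors <= knn_max_neighbors")
        if not 0.0 <= self.knn_distance_weight_max_null_ratio <= 1.0:
            raise ValueError("knn_distance_weight_max_null_ratio must lie in [0, 1]")

default_cfg = KNNConfigSketch()
# Example override for a wide, domain-specific dataset.
wide_cfg = KNNConfigSketch(knn_max_neighbors=50, knn_distance_weight_max_features=60)
```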
No Phase 1 changes
All signals are computed from train_df at Phase 2 fit time. No new sub-processor, no new NumericStats fields.
Why manual NaN-safe scaling rather than a sklearn Pipeline
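Since scikit-learn 0.20, sklearn.preprocessing.StandardScaler treats NaN as missing values: they are ignored in fit() and preserved by transform(), so a pipeline such as make_pipeline(StandardScaler(), KNNImputer()) does fit on a matrix containing NaN. It does not cover the whole job, though: Pipeline.transform() returns the imputed matrix in scaled units, so the output would still have to be inverse-scaled outside the pipeline before it can be written back to the original columns. Manual nanmean/nanstd scaling keeps the forward and inverse steps side by side in _apply_knn, stores only col_means and col_stds next to the model in _FittedKNN, and does not depend on the NaN-handling behaviour of a particular scikit-learn release. See ADR 0010.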
Testing Decisions
What makes a good test here: test observable behaviour through public outputs, not the exact computed k value. Verify that k responds in the correct direction to each signal (relative comparisons), and verify weights and scaling decisions via ColumnImputationRecord.signals. Do not assert specific numeric k values — the formula should be free to tune empirically without breaking tests.
Modules with tests:
- NumericImputer (KNN path): new cases mirroring the pattern in test_model_strategies.py:
  - High-dimensional KNN column set produces a larger reported n_neighbors in signals than a low-dimensional set, all else equal.
  - High feature-matrix missingness produces a larger reported n_neighbors than low missingness.
  - Low miss_frac + few features → signals contain weights=distance.
  - High miss_frac OR many features → signals contain weights=uniform.
  - Signals contain a knn_scaling: applied entry for any KNN-routed column.
- FittedImputer (KNN path): new cases mirroring the pattern in test_fitted_imputer.py:
  - transform() on a _FittedKNN-backed imputer produces no nulls in the output.
  - Scale-sensitive case: two KNN columns with magnitude ratio ~1000:1; imputed values are not dominated by the large-scale column.
  - to_dict() / from_dict() round-trip preserves model, col_means, and col_stds correctly.
- Integration test (test_imputation_end_to_end.py): one new case. Dataset with KNN columns at mixed scales and partial missingness; final imputed output has no nulls; ColumnImputationRecord.signals for each KNN column contains both knn_params and knn_scaling entries.
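One way to keep these tests independent of the exact k value is to parse the reported parameters out of the signal string and compare them relatively. The sketch below assumes the knn_params format given under Implementation Decisions; the helper name is hypothetical, and the two signal lists are hand-written stand-ins, not captured library output.

```python
import re

# Template copied from the knn_params signal format in this document.
PARAMS_TEMPLATE = ("knn_params: n_neighbors={k}, weights={w} | n_features={f}, "
                   "miss_frac={m:.2f}, complete_frac={c:.2f}")
_PARAMS_RE = re.compile(r"knn_params: n_neighbors=(\d+), weights=(\w+)")

def parse_knn_params(signals):
    """Return (n_neighbors, weights) from the first knn_params entry."""
    for entry in signals:
        match = _PARAMS_RE.search(entry)
        if match:
            return int(match.group(1)), match.group(2)
    raise AssertionError("no knn_params entry in signals")

# Hand-written stand-ins for two ColumnImputationRecord.signals lists.
low_missing = [PARAMS_TEMPLATE.format(k=5, w="distance", f=9, m=0.02, c=0.90)]
high_missing = [PARAMS_TEMPLATE.format(k=15, w="uniform", f=9, m=0.5, c=0.25)]

k_low, w_low = parse_knn_params(low_missing)
k_high, w_high = parse_knn_params(high_missing)

# Relative comparison only: the test survives retuning of the k formula.
assert k_high > k_low
assert (w_low, w_high) == ("distance", "uniform")
```

In a real test the two lists would come from fitting NumericImputer on a low-missingness and a high-missingness fixture.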
Out of Scope
- KNN metric selection (alternatives to nan_euclidean): deferred
- Weighted KNN beyond sklearn's "uniform" / "distance" built-in options: deferred
- KNN for categorical columns: separate phase
- Scope 0 (Regression Imputer Overhaul): separate session
- Scope A (MICE estimator selection): separate session
Further Notes
- The weights decision is intentionally derived from the same reliability signals as k, not from a separate local-variance diagnostic. This avoids adding a new Phase 1 computation while still capturing the key insight: when distances are unreliable enough to require a larger k, they are also unreliable enough that distance-weighting amplifies rather than reduces noise.
- The knn_distance_weight_max_features threshold (default 30) is more generous than a naive curse-of-dimensionality argument would suggest, because NaN-safe scaling removes the scale-dominance problem that would otherwise worsen with dimensionality. The remaining concern is pure geometric density.
- A full design walkthrough and all decisions are recorded in docs/prd-scope-1-knn-hyperparameter-selection.md and ADR 0010.