Scope 1: KNNImputer Hyperparameter Selection — adaptive k, reliability-based weights, NaN-safe scaling #90

@DEVunderdog

Problem Statement

When a data scientist uses DataForgeML to impute a column routed to ImputationStrategy.KNN, the library hard-codes KNNImputer(n_neighbors=5) with uniform weighting and no feature scaling. This silently degrades imputation quality in three interconnected ways:

  1. Fixed k regardless of dataset structure — k=5 is appropriate for neither high-dimensional spaces (where five neighbors may all be far away, making their average no better than the column mean) nor dense low-dimensional spaces (where k=5 makes imputed values sensitive to one or two outlier rows). The signals needed to choose k adaptively are available at fit time but unused.

  2. Uniform weighting regardless of distance reliability: weights="uniform" averages all k neighbors equally. When distances are reliable (low missingness, low dimensionality), weights="distance" uses more of the available information because closer neighbors count for more. When distances are unreliable (high missingness, high dimensionality), distance-weighting amplifies noise. The library uses uniform unconditionally.

  3. No feature scaling before KNN: nan_euclidean distances are scale-sensitive. With a column ranging 0–1 and a column ranging 0–1,000,000, distances are dominated almost entirely by the large-magnitude column. The imputed value is driven by whichever column has the largest numeric range, not by genuine row similarity.
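
The scale-dominance effect in point 3 can be reproduced in a few lines. This is a minimal sketch with made-up rows; the values and the range-based rescaling are illustrative assumptions, not the scaling the library applies.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Hypothetical rows: column 0 spans 0-1, column 1 spans 0-1,000,000.
query = np.array([[0.10, 500_000.0]])
candidates = np.array([
    [0.90, 500_100.0],  # far apart in column 0, almost identical in column 1
    [0.10, 510_000.0],  # identical in column 0, 1% of the range away in column 1
])

# Raw distances: the large-magnitude column decides which row looks closest.
raw = nan_euclidean_distances(query, candidates)[0]

# Divide each column by its (assumed) range so both contribute comparably.
ranges = np.array([1.0, 1_000_000.0])
scaled = nan_euclidean_distances(query / ranges, candidates / ranges)[0]

nearest_raw = int(np.argmin(raw))        # row 0: column 1 dominates
nearest_scaled = int(np.argmin(scaled))  # row 1 once the columns are comparable
```

Without scaling, the row that differs by 0.8 in the 0–1 column is still treated as the nearest neighbor, because a difference of 100 in the large column is small next to 10,000.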

Solution

Replace the hard-coded KNNImputer(n_neighbors=5) with data-adaptive parameter selection and NaN-safe feature scaling:

  1. Compute n_neighbors dynamically from three signals at Phase 2 fit time — dimensionality, feature matrix missingness fraction, and complete row fraction — via a multiplicative formula bounded by configurable min/max values.

  2. Derive weights from the same reliability assessment used to compute k. When distances are reliable (low missingness and low dimensionality), use "distance"; otherwise "uniform". No separate diagnostic required — the weights decision is the inverse of the k decision.

  3. Scale feature columns before KNN fit using NaN-safe StandardScaler (nanmean/nanstd computed from the KNN training matrix). Inverse-scale imputed outputs. Store scaling parameters alongside the fitted model in a new _FittedKNN structure.

User Stories

  1. As a data scientist, I want n_neighbors to increase automatically for high-dimensional feature spaces, so that the imputed value is not driven by a handful of accidentally nearby rows.
  2. As a data scientist, I want n_neighbors to increase when the feature matrix has high missingness, so that distances computed over varying feature subsets are averaged over enough neighbors to be stable.
  3. As a data scientist, I want n_neighbors to increase when few complete rows exist in the KNN matrix, so that imputation is stable when the training set has sparse structure.
  4. As a data scientist, I want weights="distance" applied automatically when distances are reliable, so that genuinely closer neighbors contribute more without any manual configuration.
  5. As a data scientist, I want weights="uniform" applied automatically when distances are unreliable, so that noisy distances are not amplified by down-weighting farther neighbors.
  6. As a data scientist, I want feature columns scaled to comparable ranges before KNN, so that imputed values reflect genuine row similarity rather than whichever column has the largest numeric magnitude.
  7. As a data scientist, I want to see the chosen n_neighbors, weights, and scaling decision in the audit log, so that I understand what the library chose and why.
  8. As a data scientist, I want the minimum and maximum allowed n_neighbors and the weights reliability thresholds to be configurable, so that I can tune KNN behaviour for domain-specific datasets.

Implementation Decisions

Modified: KNN fitting in NumericImputer (Phase 2)

Replace the hard-coded KNNImputer(n_neighbors=5) with the following at fit time:

Signal computation (from train_df at fit time, no Phase 1 changes required):

  • n_features = count of KNN columns
  • miss_frac = fraction of NaN cells in the KNN training matrix
  • complete_frac = fraction of rows with no NaN across all KNN columns
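
A minimal sketch of how the three signals could be read off the KNN training matrix with NumPy. The names `KNNFitSignals` and `compute_knn_signals` are hypothetical, and the matrix is assumed to be a non-empty 2-D float array.

```python
from typing import NamedTuple

import numpy as np


class KNNFitSignals(NamedTuple):
    """Hypothetical container for the three fit-time signals."""
    n_features: int
    miss_frac: float
    complete_frac: float


def compute_knn_signals(knn_matrix: np.ndarray) -> KNNFitSignals:
    """Compute the signals from a non-empty 2-D float matrix of KNN columns."""
    nan_mask = np.isnan(knn_matrix)
    return KNNFitSignals(
        n_features=knn_matrix.shape[1],
        # Fraction of all cells that are NaN.
        miss_frac=float(nan_mask.mean()),
        # Fraction of rows with no NaN in any KNN column.
        complete_frac=float((~nan_mask.any(axis=1)).mean()),
    )


example = np.array([
    [1.0, 2.0, 3.0],
    [np.nan, 2.0, 3.0],
    [1.0, np.nan, np.nan],
    [4.0, 5.0, 6.0],
])
signals = compute_knn_signals(example)
```

For this example matrix, 3 of 12 cells are missing and 2 of 4 rows are complete.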

n_neighbors formula — multiplicative, so signals compound each other's effect (both high missingness and low completeness arise from the same underlying data sparsity and their combined effect on distance reliability is multiplicative):

base_k  = max(knn_min_neighbors, int(sqrt(n_features)))
k = base_k × (1 + miss_frac) × sqrt(1 / max(complete_frac, 0.1))
n_neighbors = min(max(knn_min_neighbors, int(k)), n_rows − 1, knn_max_neighbors)
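
The formula translates fairly directly into Python. The sketch below assumes the exponent 0.5 applies only to the completeness term, and the function name `select_n_neighbors` is hypothetical; the default bounds are the config defaults listed in this issue.

```python
import math


def select_n_neighbors(
    n_features: int,
    miss_frac: float,
    complete_frac: float,
    n_rows: int,
    knn_min_neighbors: int = 5,
    knn_max_neighbors: int = 25,
) -> int:
    """Adaptive k, assuming the square root covers only the completeness term."""
    base_k = max(knn_min_neighbors, int(math.sqrt(n_features)))
    # complete_frac is floored at 0.1, so this factor is at most sqrt(10).
    completeness_factor = math.sqrt(1.0 / max(complete_frac, 0.1))
    k = base_k * (1.0 + miss_frac) * completeness_factor
    # n_rows - 1 can pull the result below knn_min_neighbors on tiny datasets.
    return min(max(knn_min_neighbors, int(k)), n_rows - 1, knn_max_neighbors)


k_dense = select_n_neighbors(n_features=9, miss_frac=0.0, complete_frac=1.0, n_rows=1000)
k_mixed = select_n_neighbors(n_features=16, miss_frac=0.2, complete_frac=0.5, n_rows=1000)
k_sparse = select_n_neighbors(n_features=100, miss_frac=0.3, complete_frac=0.25, n_rows=1000)
```

As a worked case: with 16 features, miss_frac = 0.2 and complete_frac = 0.5, base_k is 5 (the floor wins over sqrt(16) = 4), k is about 5 × 1.2 × 1.414 ≈ 8.49, and n_neighbors is 8.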

weights formula — binary threshold, both conditions must pass for distance weighting:

reliability_high = (miss_frac < knn_distance_weight_max_null_ratio) AND (n_features ≤ knn_distance_weight_max_features)
weights = "distance" if reliability_high else "uniform"

The weights decision is the inverse of the k decision: when you increase k because distances are unreliable, uniform weighting is safer. When k stays small because space is dense, distance weighting adds value.
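
The threshold rule can be sketched as a small function. The name `select_weights` is hypothetical; the default thresholds are the config defaults listed in this issue.

```python
def select_weights(
    miss_frac: float,
    n_features: int,
    knn_distance_weight_max_null_ratio: float = 0.15,
    knn_distance_weight_max_features: int = 30,
) -> str:
    """Return "distance" only when both reliability conditions hold."""
    reliability_high = (
        miss_frac < knn_distance_weight_max_null_ratio       # strict inequality
        and n_features <= knn_distance_weight_max_features   # inclusive bound
    )
    return "distance" if reliability_high else "uniform"
```

Note the asymmetry in the bounds: a miss_frac exactly at the threshold falls back to "uniform", while n_features exactly at the threshold still qualifies for "distance".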

NaN-safe feature scaling:
Compute col_means and col_stds using nanmean/nanstd on the KNN training matrix (set zero stds to 1.0). Scale the matrix — NaN cells remain NaN after scaling. Fit KNNImputer on the scaled matrix.
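
A minimal sketch of the scaling step, assuming the KNN training matrix is a 2-D float array in which no column is entirely NaN (an all-NaN column would make nanmean return NaN). The function names are hypothetical.

```python
import numpy as np


def fit_nan_safe_scaler(matrix: np.ndarray):
    """Per-column mean and std that ignore NaN cells."""
    col_means = np.nanmean(matrix, axis=0)
    col_stds = np.nanstd(matrix, axis=0)
    # Constant columns have zero std; use 1.0 to avoid dividing by zero.
    col_stds = np.where(col_stds == 0.0, 1.0, col_stds)
    return col_means, col_stds


def scale(matrix, col_means, col_stds):
    # NaN cells stay NaN because arithmetic on NaN yields NaN.
    return (matrix - col_means) / col_stds


def inverse_scale(scaled, col_means, col_stds):
    return scaled * col_stds + col_means


train = np.array([
    [1.0, 100.0, 7.0],
    [3.0, np.nan, 7.0],
    [np.nan, 300.0, 7.0],
    [5.0, 500.0, 7.0],
])
col_means, col_stds = fit_nan_safe_scaler(train)
scaled = scale(train, col_means, col_stds)
restored = inverse_scale(scaled, col_means, col_stds)
```

The scaled matrix has the same NaN pattern as the input, so it can be passed to KNNImputer unchanged.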

Signals recorded — two entries appended to each KNN column's ColumnImputationRecord.signals:

  • "knn_params: n_neighbors={k}, weights={w} | n_features={f}, miss_frac={m:.2f}, complete_frac={c:.2f}"
  • "knn_scaling: applied StandardScaler (nanmean/nanstd) across {n_features} feature columns"

New: _FittedKNN dataclass (in FittedImputer module)

Stores model (the fitted KNNImputer), col_means, and col_stds. Replaces the bare KNNImputer previously stored under "knn" in FittedImputer.models. The existing joblib serialisation path handles the dataclass without modification.

New: _apply_knn transform helper (in FittedImputer module)

Dedicated function replacing the generic block-model application for KNN. Applies: scale input → KNNImputer.transform() → inverse-scale output. The generic _apply_block_model is retained for MICE only.
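
One possible shape for `_FittedKNN` and `_apply_knn`, given as a sketch rather than the actual implementation. The field types, the function signature, and the `fit_knn` helper are assumptions; the sketch also assumes no KNN column is entirely NaN at fit time, since KNNImputer drops such columns by default and the output shape would then no longer match the stored scaling parameters.

```python
from dataclasses import dataclass

import numpy as np
from sklearn.impute import KNNImputer


@dataclass
class _FittedKNN:
    model: KNNImputer       # fitted on the scaled training matrix
    col_means: np.ndarray   # nanmean per KNN column
    col_stds: np.ndarray    # nanstd per KNN column, zeros replaced by 1.0


def fit_knn(train: np.ndarray, n_neighbors: int, weights: str) -> _FittedKNN:
    """Hypothetical fit helper: scale, then fit KNNImputer on the scaled data."""
    col_means = np.nanmean(train, axis=0)
    col_stds = np.nanstd(train, axis=0)
    col_stds = np.where(col_stds == 0.0, 1.0, col_stds)
    model = KNNImputer(n_neighbors=n_neighbors, weights=weights)
    model.fit((train - col_means) / col_stds)
    return _FittedKNN(model=model, col_means=col_means, col_stds=col_stds)


def _apply_knn(fitted: _FittedKNN, matrix: np.ndarray) -> np.ndarray:
    """Scale input, impute in scaled space, return values in original units."""
    scaled = (matrix - fitted.col_means) / fitted.col_stds
    imputed = fitted.model.transform(scaled)
    return imputed * fitted.col_stds + fitted.col_means


train = np.array([
    [0.1, 100_000.0],
    [0.2, 200_000.0],
    [0.3, np.nan],
    [0.8, 800_000.0],
    [0.9, 900_000.0],
    [np.nan, 850_000.0],
])
fitted = fit_knn(train, n_neighbors=2, weights="uniform")
out = _apply_knn(fitted, train)
```

In this example the row (0.3, NaN) takes its second value from the two rows nearest in the first column, (0.2, 200,000) and (0.1, 100,000), and the result comes back in original units.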

Modified: NumericImputationConfig

Four new fields alongside existing KNN routing guards (knn_max_rows, knn_max_features):

  • knn_min_neighbors (default 5) — floor on computed k
  • knn_max_neighbors (default 25) — cap on computed k
  • knn_distance_weight_max_null_ratio (default 0.15) — miss_frac threshold for distance weighting
  • knn_distance_weight_max_features (default 30) — dimensionality threshold for distance weighting
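
For illustration, the four fields could look like this as a standalone dataclass. `KNNSelectionConfig` is a hypothetical stand-in for the relevant slice of NumericImputationConfig, and the validation in `__post_init__` is an assumption, not something this issue specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KNNSelectionConfig:
    """Hypothetical stand-in for the new NumericImputationConfig fields."""
    knn_min_neighbors: int = 5                        # floor on computed k
    knn_max_neighbors: int = 25                       # cap on computed k
    knn_distance_weight_max_null_ratio: float = 0.15  # miss_frac threshold
    knn_distance_weight_max_features: int = 30        # dimensionality threshold

    def __post_init__(self) -> None:
        # Assumed sanity checks; the issue does not define validation rules.
        if self.knn_min_neighbors < 1:
            raise ValueError("knn_min_neighbors must be at least 1")
        if self.knn_max_neighbors < self.knn_min_neighbors:
            raise ValueError("knn_max_neighbors must be >= knn_min_neighbors")
        if not 0.0 <= self.knn_distance_weight_max_null_ratio <= 1.0:
            raise ValueError("knn_distance_weight_max_null_ratio must be in [0, 1]")


default_config = KNNSelectionConfig()
wide_config = KNNSelectionConfig(knn_min_neighbors=3, knn_max_neighbors=50)
```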

No Phase 1 changes

All signals are computed from train_df at Phase 2 fit time. No new sub-processor, no new NumericStats fields.

Why manual NaN-safe scaling rather than a sklearn Pipeline

Whether a scaler-plus-KNNImputer Pipeline fits at all depends on the scikit-learn version: older releases of sklearn.preprocessing.StandardScaler reject NaN inputs, while releases from 0.20 onward treat NaN as missing values (ignored in fit, preserved in transform). Even where the Pipeline fits, its transform() returns imputed values in scaled units, so the inverse scaling still has to be applied outside the Pipeline. Manual nanmean/nanstd scaling keeps the scale, impute, and inverse-scale steps and their parameters together in _FittedKNN and does not rely on version-specific NaN handling. See ADR 0010.

Testing Decisions

What makes a good test here: test observable behaviour through public outputs, not the exact computed k value. Verify that k responds in the correct direction to each signal (relative comparisons), and verify weights and scaling decisions via ColumnImputationRecord.signals. Do not assert specific numeric k values — the formula should be free to tune empirically without breaking tests.
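
A relative-comparison test might look like the sketch below. The helper `reported_knn_params` is hypothetical, and the two hand-written signal lists stand in for ColumnImputationRecord.signals from two real fits; their numbers are illustrative only.

```python
import re

# Matches the start of the knn_params signal format described in this issue.
_KNN_PARAMS = re.compile(r"knn_params: n_neighbors=(\d+), weights=(\w+)")


def reported_knn_params(signals):
    """Return (n_neighbors, weights) from the first knn_params entry."""
    for entry in signals:
        match = _KNN_PARAMS.search(entry)
        if match:
            return int(match.group(1)), match.group(2)
    raise AssertionError("no knn_params entry found in signals")


# Stand-ins for the signals of a low- and a high-dimensional KNN column set.
low_dim_signals = [
    "knn_params: n_neighbors=5, weights=distance | n_features=9, miss_frac=0.05, complete_frac=0.80",
    "knn_scaling: applied StandardScaler (nanmean/nanstd) across 9 feature columns",
]
high_dim_signals = [
    "knn_params: n_neighbors=11, weights=uniform | n_features=100, miss_frac=0.05, complete_frac=0.80",
    "knn_scaling: applied StandardScaler (nanmean/nanstd) across 100 feature columns",
]

k_low, w_low = reported_knn_params(low_dim_signals)
k_high, w_high = reported_knn_params(high_dim_signals)

# Assert direction and categorical decisions, never the exact k.
assert k_high > k_low
assert (w_low, w_high) == ("distance", "uniform")
assert any(s.startswith("knn_scaling: applied") for s in high_dim_signals)
```

Because only the ordering of k is asserted, the formula's constants can be retuned without touching the test.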

Modules with tests:

  • NumericImputer (KNN path) — new cases mirroring the pattern in test_model_strategies.py:

    • High-dimensional KNN column set produces a larger reported n_neighbors in signals than a low-dimensional set, all else equal.
    • High feature-matrix missingness produces a larger reported n_neighbors than low missingness.
    • Low miss_frac + few features → signals contain weights=distance.
    • High miss_frac OR many features → signals contain weights=uniform.
    • Signals contain a knn_scaling: applied entry for any KNN-routed column.
  • FittedImputer (KNN path) — new cases mirroring the pattern in test_fitted_imputer.py:

    • transform() on a _FittedKNN-backed imputer produces no nulls in the output.
    • Scale-sensitive case: two KNN columns with magnitude ratio ~1000:1 — imputed values are not dominated by the large-scale column.
    • to_dict() / from_dict() round-trip preserves model, col_means, and col_stds correctly.
  • Integration test (test_imputation_end_to_end.py) — one new case: dataset with KNN columns at mixed scales and partial missingness; final imputed output has no nulls; ColumnImputationRecord.signals for each KNN column contains both knn_params and knn_scaling entries.

Out of Scope

  • KNN metric selection (alternatives to nan_euclidean) — deferred
  • Weighted KNN beyond sklearn's "uniform" / "distance" built-in options — deferred
  • KNN for categorical columns — separate phase
  • Scope 0 (Regression Imputer Overhaul) — separate session
  • Scope A (MICE estimator selection) — separate session

Further Notes

  • The weights decision is intentionally derived from the same reliability signals as k, not from a separate local-variance diagnostic. This avoids adding a new Phase 1 computation while still capturing the key insight: when distances are unreliable enough to require a larger k, they are also unreliable enough that distance-weighting amplifies rather than reduces noise.
  • The knn_distance_weight_max_features threshold (default 30) is more generous than a naive curse-of-dimensionality argument would suggest, because NaN-safe scaling removes the scale-dominance problem that would otherwise worsen with dimensionality. The remaining concern is pure geometric density.
  • A full design walkthrough and all decisions are recorded in docs/prd-scope-1-knn-hyperparameter-selection.md and ADR 0010.
