Scope 1: KNNImputer Hyperparameter Selection — adaptive k, reliability-based weights, NaN-safe scaling

## Problem Statement

When a data scientist uses DataForgeML to impute a column routed to `ImputationStrategy.KNN`, the library hard-codes `KNNImputer(n_neighbors=5)` with uniform weighting and no feature scaling. This silently degrades imputation quality in three interconnected ways:

1. **Fixed k regardless of dataset structure** — k=5 is appropriate for neither high-dimensional spaces (where five neighbors may all be far away, making their average no better than the column mean) nor dense low-dimensional spaces (where k=5 makes imputed values sensitive to one or two outlier rows). The signals needed to choose k adaptively are available at fit time but unused.

2. **Uniform weighting regardless of distance reliability** — `weights="uniform"` averages all k neighbors equally. When distances are reliable (low missingness, low dimensionality), `weights="distance"` lets closer neighbors contribute more and typically gives a better estimate. When distances are unreliable (high missingness, high dimensionality), distance weighting amplifies noise. The library applies uniform weighting unconditionally.

3. **No feature scaling before KNN** — `nan_euclidean` distances are scale-sensitive. A column ranging 0–1 and a column ranging 0–1,000,000 produce distances dominated entirely by the large-magnitude column. The imputed value is driven by whichever column has the largest numeric range, not by genuine row similarity.
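The scale-dominance effect in point 3 can be checked with a few lines of arithmetic. The sketch below uses three invented rows and plain Euclidean distance, which is what `nan_euclidean` reduces to when no values are missing. The `rescale` helper and its assumed column ranges are illustrative only (the proposal itself standardises with nanmean/nanstd).

```python
import math

# Three rows over two features: (small-range column, large-range column).
# The values are invented for illustration.
a = (0.10, 500_000.0)
b = (0.11, 900_000.0)  # close to `a` in the small column, far in the large one
c = (0.95, 510_000.0)  # far from `a` in the small column, close in the large one

# Raw distances: the large-magnitude column decides the ranking on its own.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)


def rescale(row, ranges=(1.0, 1_000_000.0)):
    """Divide each feature by its assumed column range so both span about 0 to 1."""
    return tuple(value / width for value, width in zip(row, ranges))


# Rescaled distances: both columns now contribute on comparable terms.
scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values `c` is the nearest neighbor of `a`, even though the two rows sit at opposite ends of the small-range column. After rescaling, the ranking flips and `b` is nearer.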

## Solution

Replace the hard-coded `KNNImputer(n_neighbors=5)` with data-adaptive parameter selection and NaN-safe feature scaling:

1. **Compute `n_neighbors` dynamically** from three signals at Phase 2 fit time — dimensionality, feature matrix missingness fraction, and complete row fraction — via a multiplicative formula bounded by configurable min/max values.

2. **Derive `weights` from the same reliability signals** used to compute k, specifically `miss_frac` and `n_features`. When distances are reliable (low missingness and low dimensionality), use `"distance"`; otherwise `"uniform"`. No separate diagnostic is required: the conditions that push k up are the same ones that switch weighting to uniform.

3. **Scale feature columns before KNN fit** using NaN-safe standardisation (per-column nanmean/nanstd computed from the KNN training matrix). Inverse-scale the imputed outputs back to original units. Store the scaling parameters alongside the fitted model in a new `_FittedKNN` structure.

## User Stories

1. As a data scientist, I want `n_neighbors` to increase automatically for high-dimensional feature spaces, so that the imputed value is not driven by a handful of accidentally nearby rows.
2. As a data scientist, I want `n_neighbors` to increase when the feature matrix has high missingness, so that distances computed over varying feature subsets are averaged over enough neighbors to be stable.
3. As a data scientist, I want `n_neighbors` to increase when few complete rows exist in the KNN matrix, so that imputation is stable when the training set has sparse structure.
4. As a data scientist, I want `weights="distance"` applied automatically when distances are reliable, so that genuinely closer neighbors contribute more without any manual configuration.
5. As a data scientist, I want `weights="uniform"` applied automatically when distances are unreliable, so that noisy distances are not amplified by down-weighting farther neighbors.
6. As a data scientist, I want feature columns scaled to comparable ranges before KNN, so that imputed values reflect genuine row similarity rather than whichever column has the largest numeric magnitude.
7. As a data scientist, I want to see the chosen `n_neighbors`, `weights`, and scaling decision in the audit log, so that I understand what the library chose and why.
8. As a data scientist, I want the minimum and maximum allowed `n_neighbors` and the weights reliability thresholds to be configurable, so that I can tune KNN behaviour for domain-specific datasets.

## Implementation Decisions

### Modified: KNN fitting in `NumericImputer` (Phase 2)

Replace the hard-coded `KNNImputer(n_neighbors=5)` with the following at fit time:

**Signal computation** (from `train_df` at fit time, no Phase 1 changes required):
- `n_features` = count of KNN columns
- `miss_frac` = fraction of NaN cells in the KNN training matrix
- `complete_frac` = fraction of rows with no NaN across all KNN columns
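As a rough illustration of the three signals, the sketch below computes them from a NumPy matrix. The function name `knn_signals` is hypothetical, and the sketch assumes the KNN columns have already been extracted from `train_df` into a non-empty 2-D float array.

```python
import numpy as np


def knn_signals(X):
    """Return (n_features, miss_frac, complete_frac) for a 2-D float matrix."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    nan_mask = np.isnan(X)
    # Fraction of all cells that are NaN.
    miss_frac = float(nan_mask.mean())
    # Fraction of rows that contain no NaN in any KNN column.
    complete_frac = float((~nan_mask.any(axis=1)).mean())
    return n_features, miss_frac, complete_frac


X = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, 6.0],
    [np.nan, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])
n_features, miss_frac, complete_frac = knn_signals(X)
print(n_features, round(miss_frac, 3), complete_frac)
```

Here 2 of 12 cells are missing and 2 of 4 rows are complete, so a small `miss_frac` can coexist with a much lower `complete_frac`. That is why the two are tracked separately.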

**`n_neighbors` formula** — multiplicative, so the signals compound rather than add. High missingness and low completeness both stem from the same underlying data sparsity, and when both are present distances become less reliable than either signal alone would suggest:
```
from math import sqrt

# Start from sqrt(dimensionality), never below the configured floor.
base_k = max(knn_min_neighbors, int(sqrt(n_features)))
k = base_k * (1 + miss_frac) * (1 / max(complete_frac, 0.1)) ** 0.5
n_neighbors = min(max(knn_min_neighbors, int(k)), n_rows - 1, knn_max_neighbors)
```
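The formula can be wrapped in a small function to see how each signal moves k. This is a sketch only: the name `choose_n_neighbors` is hypothetical, and the default arguments mirror the config defaults proposed in this issue.

```python
import math


def choose_n_neighbors(n_rows, n_features, miss_frac, complete_frac,
                       knn_min_neighbors=5, knn_max_neighbors=25):
    """Apply the multiplicative k formula and clamp the result."""
    base_k = max(knn_min_neighbors, int(math.sqrt(n_features)))
    # Completeness is floored at 0.1 so this factor never exceeds sqrt(10).
    k = base_k * (1 + miss_frac) * (1 / max(complete_frac, 0.1)) ** 0.5
    return min(max(knn_min_neighbors, int(k)), n_rows - 1, knn_max_neighbors)


dense = choose_n_neighbors(n_rows=1000, n_features=16, miss_frac=0.0, complete_frac=1.0)
sparse_rows = choose_n_neighbors(n_rows=1000, n_features=16, miss_frac=0.0, complete_frac=0.25)
wide = choose_n_neighbors(n_rows=1000, n_features=400, miss_frac=0.0, complete_frac=1.0)
print(dense, sparse_rows, wide)
```

With the default floor of 5, `base_k` only rises above the floor once `n_features` reaches 36, because `int(sqrt(n_features))` must exceed 5. Below that, any increase in k comes from the missingness and completeness factors.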

**`weights` formula** — binary threshold, both conditions must pass for distance weighting:
```
reliability_high = (miss_frac < knn_distance_weight_max_null_ratio) and (n_features <= knn_distance_weight_max_features)
weights = "distance" if reliability_high else "uniform"
```

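The weights decision mirrors the k decision: the conditions that raise k (high missingness, high dimensionality) indicate unreliable distances, so uniform weighting is the safer choice. When k stays small because the space is dense and well observed, distance weighting adds value. Note that `complete_frac` influences k only; it does not enter the weights rule.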
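A minimal sketch of the weights rule, with the boundary behaviour made explicit. The function name `choose_weights` is hypothetical; the defaults mirror the proposed config fields.

```python
def choose_weights(miss_frac, n_features,
                   knn_distance_weight_max_null_ratio=0.15,
                   knn_distance_weight_max_features=30):
    """Return "distance" only when both reliability conditions hold."""
    reliability_high = (
        miss_frac < knn_distance_weight_max_null_ratio          # strict inequality
        and n_features <= knn_distance_weight_max_features      # inclusive bound
    )
    return "distance" if reliability_high else "uniform"


print(choose_weights(miss_frac=0.05, n_features=10))
print(choose_weights(miss_frac=0.15, n_features=10))
print(choose_weights(miss_frac=0.05, n_features=31))
```

As written in the formula, the missingness threshold is exclusive and the feature threshold is inclusive, so a matrix with exactly 15% missing cells gets uniform weighting while one with exactly 30 features can still get distance weighting.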

**NaN-safe feature scaling:**
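Compute per-column `col_means` and `col_stds` using nanmean/nanstd on the KNN training matrix, replacing zero stds with 1.0 so that constant columns do not cause division by zero. Scale the matrix as `(X - col_means) / col_stds`; NaN cells remain NaN after scaling. Fit `KNNImputer` on the scaled matrix.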
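The scaling step might look like the following NumPy sketch. The function names are hypothetical, and the sketch assumes no column is entirely NaN (in that case `np.nanmean` would return NaN for the column and emit a warning).

```python
import numpy as np


def nan_safe_standardize(X):
    """Standardise columns using statistics over observed (non-NaN) cells only."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    col_stds = np.nanstd(X, axis=0)
    # Constant columns have zero std; use 1.0 so the division is a no-op.
    col_stds = np.where(col_stds == 0.0, 1.0, col_stds)
    return (X - col_means) / col_stds, col_means, col_stds


def inverse_standardize(X_scaled, col_means, col_stds):
    """Map scaled values back to the original units."""
    return X_scaled * col_stds + col_means


X = np.array([
    [1.0, 10.0, 7.0],
    [2.0, np.nan, 7.0],
    [3.0, 30.0, 7.0],
])
X_scaled, col_means, col_stds = nan_safe_standardize(X)
print(X_scaled)
```

NaN propagates through the subtraction and division, so missing cells stay missing and `KNNImputer` still sees exactly the cells it needs to fill.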

**Signals recorded** — two entries appended to each KNN column's `ColumnImputationRecord.signals`:
- `"knn_params: n_neighbors={k}, weights={w} | n_features={f}, miss_frac={m:.2f}, complete_frac={c:.2f}"`
- `"knn_scaling: applied StandardScaler (nanmean/nanstd) across {n_features} feature columns"`
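The two signal strings can be produced with ordinary f-strings, and a test can read `n_neighbors` back out of them. Both helper names below are hypothetical; only the string templates come from this issue.

```python
import re


def knn_signal_entries(n_neighbors, weights, n_features, miss_frac, complete_frac):
    """Build the two audit-log entries for one KNN-routed column."""
    return [
        f"knn_params: n_neighbors={n_neighbors}, weights={weights} | "
        f"n_features={n_features}, miss_frac={miss_frac:.2f}, "
        f"complete_frac={complete_frac:.2f}",
        f"knn_scaling: applied StandardScaler (nanmean/nanstd) "
        f"across {n_features} feature columns",
    ]


def parse_n_neighbors(signals):
    """Return the n_neighbors recorded in a list of signals, or None if absent."""
    for entry in signals:
        match = re.search(r"knn_params: n_neighbors=(\d+)", entry)
        if match:
            return int(match.group(1))
    return None


signals = knn_signal_entries(7, "uniform", 12, 0.2, 0.5)
print("\n".join(signals))
```

A parser of this kind lets tests compare the reported k between two fits without asserting a specific numeric value.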

### New: `_FittedKNN` dataclass (in `FittedImputer` module)

Stores `model` (the fitted `KNNImputer`), `col_means`, and `col_stds`. Replaces the bare `KNNImputer` previously stored under `"knn"` in `FittedImputer.models`. The existing joblib serialisation path handles the dataclass without modification.

### New: `_apply_knn` transform helper (in `FittedImputer` module)

Dedicated function replacing the generic block-model application for KNN. Applies: scale input → `KNNImputer.transform()` → inverse-scale output. The generic `_apply_block_model` is retained for MICE only.
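The dataclass and the transform helper could be sketched together as follows. The names `FittedKNNSketch`, `fit_knn_sketch`, and `apply_knn_sketch` are hypothetical stand-ins rather than the library's actual code, and the sketch assumes no column is entirely NaN at fit time (by default `KNNImputer` drops such columns from its output).

```python
from dataclasses import dataclass

import numpy as np
from sklearn.impute import KNNImputer


@dataclass
class FittedKNNSketch:
    """Hypothetical stand-in for `_FittedKNN`: model plus scaling parameters."""
    model: KNNImputer
    col_means: np.ndarray
    col_stds: np.ndarray


def fit_knn_sketch(X, n_neighbors, weights):
    """Scale with NaN-safe statistics, then fit KNNImputer on the scaled matrix."""
    col_means = np.nanmean(X, axis=0)
    col_stds = np.nanstd(X, axis=0)
    col_stds = np.where(col_stds == 0.0, 1.0, col_stds)
    model = KNNImputer(n_neighbors=n_neighbors, weights=weights)
    model.fit((X - col_means) / col_stds)
    return FittedKNNSketch(model, col_means, col_stds)


def apply_knn_sketch(fitted, X):
    """Scale input, impute in scaled space, then inverse-scale the output."""
    scaled = (X - fitted.col_means) / fitted.col_stds
    imputed = fitted.model.transform(scaled)
    return imputed * fitted.col_stds + fitted.col_means


X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, np.nan],
    [4.0, 4000.0],
])
fitted = fit_knn_sketch(X, n_neighbors=2, weights="uniform")
result = apply_knn_sketch(fitted, X)
print(result)
```

Because the output passes through a scale and inverse-scale round trip, observed cells come back equal to their inputs only up to floating-point error, so comparisons should use a tolerance.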

### Modified: `NumericImputationConfig`

Four new fields alongside existing KNN routing guards (`knn_max_rows`, `knn_max_features`):
- `knn_min_neighbors` (default 5) — floor on computed k
- `knn_max_neighbors` (default 25) — cap on computed k
- `knn_distance_weight_max_null_ratio` (default 0.15) — miss_frac threshold for distance weighting
- `knn_distance_weight_max_features` (default 30) — dimensionality threshold for distance weighting
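For illustration, the four fields could be expressed as a frozen dataclass. The class name `KNNSelectionConfigSketch` is hypothetical and stands apart from the real `NumericImputationConfig`, whose other fields are not reproduced here. The validation in `__post_init__` is an assumption, not something this issue specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KNNSelectionConfigSketch:
    """Hypothetical container for the four new KNN fields and their defaults."""
    knn_min_neighbors: int = 5
    knn_max_neighbors: int = 25
    knn_distance_weight_max_null_ratio: float = 0.15
    knn_distance_weight_max_features: int = 30

    def __post_init__(self):
        # Assumed sanity checks; the issue does not define validation rules.
        if not 1 <= self.knn_min_neighbors <= self.knn_max_neighbors:
            raise ValueError("require 1 <= knn_min_neighbors <= knn_max_neighbors")
        if not 0.0 <= self.knn_distance_weight_max_null_ratio <= 1.0:
            raise ValueError("knn_distance_weight_max_null_ratio must be in [0, 1]")


default_config = KNNSelectionConfigSketch()
strict_config = KNNSelectionConfigSketch(knn_min_neighbors=3, knn_max_neighbors=15)
print(default_config)
```

Checking the min/max ordering at construction time would surface a misconfiguration before fit time, where it would otherwise show up only as a silently clamped k.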

### No Phase 1 changes

All signals are computed from `train_df` at Phase 2 fit time. No new sub-processor, no new `NumericStats` fields.

### Why manual NaN-safe scaling rather than a sklearn Pipeline

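`sklearn.preprocessing.StandardScaler` does accept NaN in the scikit-learn versions that ship `KNNImputer`: since 0.20 it ignores NaN in `fit()` and preserves it in `transform()`, so `make_pipeline(StandardScaler(), KNNImputer())` can be fitted. The pipeline still does not cover what this scope needs. Its output stays in scaled units, and `KNNImputer` has no `inverse_transform()`, so the pipeline cannot map imputed values back to the original units; the inverse step would have to be written by hand anyway. Manual nanmean/nanstd scaling keeps `col_means` and `col_stds` as plain arrays in `_FittedKNN`, makes the scale → impute → inverse-scale sequence explicit in `_apply_knn`, and does not depend on the input-validation behaviour of any particular scikit-learn version. See ADR 0010.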

## Testing Decisions

**What makes a good test here:** test observable behaviour through public outputs, not the exact computed k value. Verify that k responds in the correct *direction* to each signal (relative comparisons), and verify weights and scaling decisions via `ColumnImputationRecord.signals`. Do not assert specific numeric k values — the formula should be free to tune empirically without breaking tests.

**Modules with tests:**

- **`NumericImputer` (KNN path)** — new cases mirroring the pattern in `test_model_strategies.py`:
  - High-dimensional KNN column set produces a larger reported `n_neighbors` in signals than a low-dimensional set, all else equal.
  - High feature-matrix missingness produces a larger reported `n_neighbors` than low missingness.
  - Low `miss_frac` + few features → signals contain `weights=distance`.
  - High `miss_frac` OR many features → signals contain `weights=uniform`.
  - Signals contain a `knn_scaling: applied` entry for any KNN-routed column.

- **`FittedImputer` (KNN path)** — new cases mirroring the pattern in `test_fitted_imputer.py`:
  - `transform()` on a `_FittedKNN`-backed imputer produces no nulls in the output.
  - Scale-sensitive case: two KNN columns with magnitude ratio ~1000:1 — imputed values are not dominated by the large-scale column.
  - `to_dict()` / `from_dict()` round-trip preserves model, `col_means`, and `col_stds` correctly.

- **Integration test** (`test_imputation_end_to_end.py`) — one new case: dataset with KNN columns at mixed scales and partial missingness; final imputed output has no nulls; `ColumnImputationRecord.signals` for each KNN column contains both `knn_params` and `knn_scaling` entries.

## Out of Scope

- KNN metric selection (alternatives to `nan_euclidean`) — deferred
- Weighted KNN beyond sklearn's `"uniform"` / `"distance"` built-in options — deferred
- KNN for categorical columns — separate phase
- Scope 0 (Regression Imputer Overhaul) — separate session
- Scope A (MICE estimator selection) — separate session

## Further Notes

- The weights decision is intentionally derived from the same reliability signals as k, not from a separate local-variance diagnostic. This avoids adding a new Phase 1 computation while still capturing the key insight: when distances are unreliable enough to require a larger k, they are also unreliable enough that distance-weighting amplifies rather than reduces noise.
- The `knn_distance_weight_max_features` threshold (default 30) is more generous than a naive curse-of-dimensionality argument would suggest, because NaN-safe scaling removes the scale-dominance problem that would otherwise worsen with dimensionality. The remaining concern is pure geometric density.
- A full design walkthrough and all decisions are recorded in `docs/prd-scope-1-knn-hyperparameter-selection.md` and ADR 0010.
