feat: n_nearest_features selection for large MICE blocks via Pearson correlation #161

@DEVunderdog

Parent

#91

What to build

For large MICE blocks, compute n_nearest_features from value-level Pearson correlations already available in the CorrelationProfiler output, rather than leaving it unset (which forces IterativeImputer to use all predictors). For small blocks, leave n_nearest_features=None.

Algorithm:

  • If the MICE block has ≤ mice_n_nearest_features_min_cols columns: set n_nearest_features=None (use all predictors)
  • Otherwise: for each MICE column, count how many other MICE columns have |Pearson r| > mice_correlation_threshold (from CorrelationProfiler). Take the median count across all MICE columns, cap at mice_max_nearest_features, and pass the result as n_nearest_features
  • Record the decision (and the computed value, or "all predictors") in every MICE column's signals

All three threshold fields (mice_n_nearest_features_min_cols, mice_max_nearest_features, mice_correlation_threshold) come from NumericImputationConfig (added in #157).
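The selection rule above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: `NearestFeaturesThresholds` is a hypothetical stand-in for the three `NumericImputationConfig` fields, the `pearson` mapping (keyed by unordered column pairs) is an assumed shape for the CorrelationProfiler output, and `select_n_nearest_features` is an illustrative name. Two details go beyond what the issue specifies and are assumptions: a fractional median is truncated to an integer, and the result is floored at 1, since scikit-learn's `IterativeImputer` expects `n_nearest_features` to be `None` or a positive integer.

```python
from dataclasses import dataclass
from statistics import median
from typing import FrozenSet, Mapping, Optional, Sequence


@dataclass(frozen=True)
class NearestFeaturesThresholds:
    """Hypothetical stand-in for the NumericImputationConfig fields."""

    mice_n_nearest_features_min_cols: int
    mice_max_nearest_features: int
    mice_correlation_threshold: float


def select_n_nearest_features(
    mice_columns: Sequence[str],
    pearson: Mapping[FrozenSet[str], float],
    cfg: NearestFeaturesThresholds,
) -> Optional[int]:
    """Return n_nearest_features for a MICE block, or None for all predictors.

    Parameters
    ----------
    mice_columns : sequence of str
        Names of the columns in the MICE block.
    pearson : mapping of frozenset of str to float
        Pairwise Pearson r, keyed by unordered column pair (assumed shape).
    cfg : NearestFeaturesThresholds
        The three threshold values.

    Returns
    -------
    int or None
        Capped median count of informative predictors, or None for small blocks.
    """
    # Small block: keep every predictor.
    if len(mice_columns) <= cfg.mice_n_nearest_features_min_cols:
        return None

    counts = []
    for col in mice_columns:
        # Count the other MICE columns whose |r| exceeds the threshold.
        # Assumption: a pair missing from the mapping counts as r = 0.
        informative = sum(
            1
            for other in mice_columns
            if other != col
            and abs(pearson.get(frozenset((col, other)), 0.0))
            > cfg.mice_correlation_threshold
        )
        counts.append(informative)

    # Assumption: truncate a fractional median, then cap, then floor at 1.
    n_nearest = min(int(median(counts)), cfg.mice_max_nearest_features)
    return max(n_nearest, 1)
```

Keying the correlations by `frozenset` pairs keeps the lookup symmetric without storing both orderings; a pandas correlation matrix would work equally well if that is what the profiler actually returns.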

Acceptance criteria

  • MICE blocks larger than mice_n_nearest_features_min_cols columns have n_nearest_features computed from CorrelationProfiler value-level Pearson correlations
  • MICE blocks at or below mice_n_nearest_features_min_cols have n_nearest_features=None
  • Computed n_nearest_features is capped at mice_max_nearest_features
  • Only columns with |Pearson r| > mice_correlation_threshold count as informative predictors
  • Every MICE column's signals records the n_nearest_features decision (value used, or "all predictors used — block below min_cols threshold")
  • Unit test: MICE block exceeding mice_n_nearest_features_min_cols produces an n_nearest_features signal entry on all MICE column records
  • Unit test: MICE block at or below the threshold produces an "all predictors" signal and n_nearest_features=None is passed to IterativeImputer
  • Numpy-style docstrings on any new helper functions
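The signal-recording criteria and the two unit tests could look roughly like the sketch below. Everything here is illustrative: the issue does not show how column records store their signals, so a plain dict mapping column names to lists of strings is assumed, and `record_n_nearest_decision` and the message wording are hypothetical. The tests are written pytest-style but only use bare `assert`.

```python
from typing import Dict, List, Optional, Sequence


def record_n_nearest_decision(
    signals: Dict[str, List[str]],
    mice_columns: Sequence[str],
    n_nearest: Optional[int],
) -> None:
    """Append the n_nearest_features decision to every MICE column's signals.

    Parameters
    ----------
    signals : dict of str to list of str
        Per-column signal log (assumed structure), modified in place.
    mice_columns : sequence of str
        Names of the columns in the MICE block.
    n_nearest : int or None
        Value passed to the imputer; None means all predictors were used.
    """
    if n_nearest is None:
        message = "n_nearest_features: all predictors used"
    else:
        message = f"n_nearest_features: {n_nearest}"
    for col in mice_columns:
        # setdefault creates the log for columns that have no signals yet.
        signals.setdefault(col, []).append(message)


def test_large_block_records_value_on_every_column() -> None:
    signals: Dict[str, List[str]] = {}
    record_n_nearest_decision(signals, ["a", "b", "c"], 2)
    assert all("n_nearest_features: 2" in signals[c] for c in ["a", "b", "c"])


def test_small_block_records_all_predictors() -> None:
    signals: Dict[str, List[str]] = {"a": ["existing signal"]}
    record_n_nearest_decision(signals, ["a", "b"], None)
    assert all(
        any("all predictors" in entry for entry in signals[c]) for c in ["a", "b"]
    )
    # Earlier signals on a column are preserved.
    assert signals["a"][0] == "existing signal"
```

A real test for the small-block case would additionally check that `n_nearest_features=None` reaches `IterativeImputer`, for example by inspecting the constructed imputer's parameters.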

Blocked by
