Scope 16 / Phase 1: Compute outlier density signal in NumericProfiler #232

Description

@DEVunderdog

Parent

#105

What to build

Compute outlier_density and set NumericFlag.HighOutlierDensity in NumericProfiler. Outlier density is the fraction of non-null values that lie beyond outlier_sigma_threshold standard deviations from the column mean.

Computation (_numeric_profiler.py)

After mean and std are computed for a column:

outlier_density = count(|value − mean| > outlier_sigma_threshold × std) / n_non_null
  • Store as NumericStats.outlier_density
  • When outlier_density > high_outlier_density_threshold, append NumericFlag.HighOutlierDensity to NumericStats.flags
  • Both thresholds are read from NumericProfileConfig (outlier_sigma_threshold, high_outlier_density_threshold)
  • Skip (leave outlier_density = None, no flag) when std is None or zero (constant column)
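
The steps above can be sketched as follows. This is a minimal illustration, not the actual contents of _numeric_profiler.py: the dataclass layouts, the enum value, the default thresholds (taken from the 3σ / 5% figures in the acceptance criteria) and the helper name `apply_outlier_density` are all assumptions, and population standard deviation is used only because the issue does not say which estimator the profiler computes.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import fmean, pstdev
from typing import List, Optional, Sequence


class NumericFlag(Enum):
    HighOutlierDensity = "high_outlier_density"  # value is a placeholder


@dataclass
class NumericProfileConfig:
    # Assumed defaults, mirroring the 3 sigma / 5% acceptance criterion.
    outlier_sigma_threshold: float = 3.0
    high_outlier_density_threshold: float = 0.05


@dataclass
class NumericStats:
    mean: Optional[float] = None
    std: Optional[float] = None
    outlier_density: Optional[float] = None
    flags: List[NumericFlag] = field(default_factory=list)


def apply_outlier_density(
    values: Sequence[Optional[float]],
    stats: NumericStats,
    config: NumericProfileConfig,
) -> NumericStats:
    """Hypothetical helper: fill outlier_density and the flag on `stats`."""
    non_null = [v for v in values if v is not None]
    # Skip when mean/std are unavailable or the column is constant.
    if not non_null or stats.mean is None or not stats.std:
        return stats
    cutoff = config.outlier_sigma_threshold * stats.std
    n_outliers = sum(1 for v in non_null if abs(v - stats.mean) > cutoff)
    stats.outlier_density = n_outliers / len(non_null)
    # Strict comparison: a column exactly at the threshold is not flagged.
    if stats.outlier_density > config.high_outlier_density_threshold:
        stats.flags.append(NumericFlag.HighOutlierDensity)
    return stats


# Usage: 94 zeros, six spikes at +/-10, and some nulls that are ignored.
column = [0.0] * 94 + [10.0] * 3 + [-10.0] * 3 + [None] * 10
present = [v for v in column if v is not None]
stats = NumericStats(mean=fmean(present), std=pstdev(present))
apply_outlier_density(column, stats, NumericProfileConfig())
```

Nulls are excluded from both the numerator and the denominator, so the result stays a fraction of n_non_null as the formula specifies.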

Acceptance criteria

  • A column where more than 5% of values lie beyond 3σ from the mean sets NumericFlag.HighOutlierDensity and stores the correct fraction in NumericStats.outlier_density
  • A column at or below the threshold does not set the flag; outlier_density is still populated
  • outlier_sigma_threshold and high_outlier_density_threshold are read from NumericProfileConfig, not hard-coded
  • A constant column (zero std) skips the computation without raising — outlier_density = None, no flag
  • HighOutlierDensity fires independently of KurtosisTag — a Mesokurtic column with high outlier fraction correctly receives the flag
  • All existing NumericProfiler tests pass (no regressions)
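
For the first, second and fourth criteria, test columns with a known outlier fraction can be built by hand. The sketch below is one possible way to do that; `spike_column` and `reference_density` are hypothetical helpers, not part of the profiler, and the reference calculation again assumes population standard deviation. How the columns are fed to NumericProfiler is left out because the issue does not show that API.

```python
from statistics import fmean, pstdev
from typing import List, Optional


def spike_column(n_total: int, n_outliers: int, magnitude: float = 10.0) -> List[float]:
    """Mostly-zero column with n_outliers spikes split between +/- magnitude."""
    half = n_outliers // 2
    spikes = [magnitude] * half + [-magnitude] * (n_outliers - half)
    return [0.0] * (n_total - n_outliers) + spikes


def reference_density(values: List[float], sigma_threshold: float = 3.0) -> Optional[float]:
    """Independent recomputation of the fraction beyond sigma_threshold * std."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return None  # constant column: computation is skipped
    cutoff = sigma_threshold * std
    return sum(1 for v in values if abs(v - mean) > cutoff) / len(values)


above = spike_column(100, 6)   # 6 of 100 values beyond 3 sigma -> should be flagged
below = spike_column(100, 2)   # 2 of 100 values beyond 3 sigma -> density set, no flag
constant = [7.0] * 100         # zero std -> outlier_density stays None
```

Keeping the spikes few and large matters here: every spike also inflates std, so adding too many of them pushes the 3σ cutoff past the spike magnitude and the fraction drops to zero.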

Blocked by
