Scope 16 / Phase 1: Bimodality detection — Hartigan's Dip Test + 2-component GMM in NumericProfiler #233

Description

@DEVunderdog

Parent

#105

What to build

Implement bimodality detection in NumericProfiler. Hartigan's Dip Test (non-parametric, no assumption about the distribution's shape) decides whether a column is bimodal; a 2-component Gaussian Mixture Model then locates the two peaks. Add diptest as a new package dependency.

Detection logic (_numeric_profiler.py)

For each numeric column, after existing distribution statistics are computed:

  1. Mutual exclusion: when NumericFlag.NearConstant is already set for the column, skip the dip test entirely (a column dominated by a single value, the 90%-mode case, cannot be bimodal)
  2. Otherwise, run Hartigan's Dip Test (diptest.diptest(values)) on the non-null values to obtain dip_statistic and dip_p_value
  3. When dip_p_value < bimodal_dip_p_value_threshold (from NumericProfileConfig):
    • Fit a 2-component GMM (sklearn.mixture.GaussianMixture(n_components=2)) on the non-null values, reshaped to a single-feature 2-D array as scikit-learn requires
    • Extract center1, center2 as the two component means, sorted ascending so that center1 <= center2
    • Construct BimodalStats(dip_statistic, dip_p_value, center1, center2)
    • Set NumericStats.bimodal_stats to that instance
    • Append NumericFlag.Bimodal to NumericStats.flags
  4. When the dip test is skipped or dip_p_value >= bimodal_dip_p_value_threshold, leave bimodal_stats = None and do not append the flag
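
The detection steps above could be sketched roughly as follows. This is only an illustration, not the final NumericProfiler code: the function name detect_bimodality, the helper names, the injectable dip_test / fit_centers parameters, the minimum-size guard, and random_state=0 are all assumptions. Only diptest.diptest, GaussianMixture(n_components=2), and the BimodalStats field names come from this issue.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BimodalStats:
    # Field names follow the issue text; the dataclass layout is assumed.
    dip_statistic: float
    dip_p_value: float
    center1: float
    center2: float


def _dip_test(values: Sequence[float]) -> Tuple[float, float]:
    # diptest.diptest returns (dip statistic, p-value) for a 1-D array.
    import diptest
    import numpy as np

    dip, p_value = diptest.diptest(np.asarray(values, dtype=float))
    return float(dip), float(p_value)


def _gmm_centers(values: Sequence[float]) -> Tuple[float, float]:
    # scikit-learn expects a 2-D (n_samples, n_features) array.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    data = np.asarray(values, dtype=float).reshape(-1, 1)
    # random_state is an assumption, added so repeated runs agree.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    low, high = sorted(float(m) for m in gmm.means_.ravel())
    return low, high


def detect_bimodality(
    values: Sequence[Optional[float]],
    p_value_threshold: float,
    near_constant: bool = False,
    dip_test: Callable[[Sequence[float]], Tuple[float, float]] = _dip_test,
    fit_centers: Callable[[Sequence[float]], Tuple[float, float]] = _gmm_centers,
) -> Optional[BimodalStats]:
    """Return BimodalStats when the dip test fires, otherwise None."""
    if near_constant:
        # Mutual exclusion: never run the dip test on a NearConstant column.
        return None
    # Drop None and NaN (NaN is the only float not equal to itself).
    clean = [v for v in values if v is not None and v == v]
    if len(clean) < 2:
        # Assumption: too few values to test or to fit two components.
        return None
    dip, p_value = dip_test(clean)
    if p_value >= p_value_threshold:
        return None
    center1, center2 = fit_centers(clean)
    return BimodalStats(dip, p_value, center1, center2)
```

Passing the dip test and the GMM fit in as callables is a design choice made for this sketch so the branching can be unit-tested without the heavy dependencies; the real profiler may simply call the libraries directly.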

The bidirectional invariant must hold after profiling: NumericFlag.Bimodal present ↔ bimodal_stats is not None.
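
One way to keep this invariant from drifting is to update the flag and the stats through a single helper and to assert the equivalence in tests. The sketch below is a hypothetical, reduced model: the enum values, the trimmed NumericStats, and the helper names set_bimodal and invariant_holds are assumptions, not existing code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NumericFlag(Enum):
    # Member names follow the issue; the string values are placeholders.
    NearConstant = "near_constant"
    Bimodal = "bimodal"


@dataclass
class NumericStats:
    # Reduced to the two fields that the invariant talks about.
    flags: List[NumericFlag] = field(default_factory=list)
    bimodal_stats: Optional[object] = None


def set_bimodal(stats: NumericStats, bimodal_stats: Optional[object]) -> None:
    """Update the stats and the flag together so they cannot disagree."""
    stats.bimodal_stats = bimodal_stats
    has_flag = NumericFlag.Bimodal in stats.flags
    if bimodal_stats is not None and not has_flag:
        stats.flags.append(NumericFlag.Bimodal)
    elif bimodal_stats is None and has_flag:
        stats.flags.remove(NumericFlag.Bimodal)


def invariant_holds(stats: NumericStats) -> bool:
    """True when Bimodal is flagged exactly when bimodal_stats is set."""
    return (NumericFlag.Bimodal in stats.flags) == (stats.bimodal_stats is not None)
```

A test suite could then call invariant_holds on every profiled column instead of checking the flag and the stats separately.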

Dependency

Add diptest to project dependencies (e.g. pyproject.toml).

Acceptance criteria

  • A column with two clearly separated peaks sets NumericFlag.Bimodal and populates NumericStats.bimodal_stats with non-None center1 and center2
  • A unimodal column (including a skewed one) does not set NumericFlag.Bimodal and has bimodal_stats = None
  • A column with NumericFlag.NearConstant already set is skipped: no dip test runs and no Bimodal flag is added
  • bimodal_dip_p_value_threshold is read from NumericProfileConfig, not hard-coded
  • The bidirectional invariant holds: every profiled column either has both flag and stats, or neither
  • NumericStats.to_dict() / from_dict() correctly serialise and deserialise BimodalStats (a nested dict when present, None otherwise)
  • diptest package is declared as a dependency
  • All existing NumericProfiler tests pass (no regressions)
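
The serialisation criterion above could be met along these lines. This is a minimal sketch under assumptions: NumericStats is reduced to the one new field, the "bimodal_stats" dict key is hypothetical, and the real classes may serialise differently.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class BimodalStats:
    dip_statistic: float
    dip_p_value: float
    center1: float
    center2: float

    def to_dict(self) -> Dict[str, float]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "BimodalStats":
        keys = ("dip_statistic", "dip_p_value", "center1", "center2")
        return cls(**{key: float(data[key]) for key in keys})


@dataclass
class NumericStats:
    # Only the new field is modelled; the real class has many more.
    bimodal_stats: Optional[BimodalStats] = None

    def to_dict(self) -> Dict[str, Any]:
        nested = self.bimodal_stats.to_dict() if self.bimodal_stats is not None else None
        return {"bimodal_stats": nested}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "NumericStats":
        # .get() tolerates payloads written before this field existed.
        raw = data.get("bimodal_stats")
        return cls(bimodal_stats=BimodalStats.from_dict(raw) if raw is not None else None)
```

Reading the field with .get() rather than indexing is a choice made here so that dicts serialised before this change still load with bimodal_stats = None.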

Blocked by

Labels: enhancement
