Parent
#105
What to build
Implement bimodality detection in NumericProfiler using Hartigan's Dip Test (non-parametric, no shape assumption) combined with a 2-component Gaussian Mixture Model for peak location. Add diptest as a new package dependency.
Detection logic (_numeric_profiler.py)
For each numeric column, after existing distribution statistics are computed:
- Run Hartigan's Dip Test (diptest.diptest(values)) on non-null values
- When dip_p_value < bimodal_dip_p_value_threshold (from NumericProfileConfig):
  - Fit a 2-component GMM (sklearn.mixture.GaussianMixture(n_components=2)) on the non-null values
  - Extract center1, center2 as the two component means (ordered ascending)
  - Construct BimodalStats(dip_statistic, dip_p_value, center1, center2)
  - Set NumericStats.bimodal_stats = BimodalStats(...)
  - Append NumericFlag.Bimodal to NumericStats.flags
- Mutual exclusion: when NumericFlag.NearConstant is already set for a column, skip the dip test entirely (a 90%-mode column cannot be bimodal)
- When the dip test does not fire, leave bimodal_stats = None and do not append the flag
The bidirectional invariant must hold after profiling: NumericFlag.Bimodal present ↔ bimodal_stats is not None.
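The detection steps and the invariant above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class shapes, the helper names (run_dip_test, fit_two_centers, detect_bimodality, invariant_holds), the injectable callables, and random_state=0 are all assumptions made for the example. Only diptest.diptest and sklearn.mixture.GaussianMixture are real library calls, and they are imported lazily so the control flow can be exercised without them.

```python
import math
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class NumericFlag(Enum):  # hypothetical stand-in for the real enum
    NearConstant = "near_constant"
    Bimodal = "bimodal"


@dataclass
class BimodalStats:  # field names taken from the issue text
    dip_statistic: float
    dip_p_value: float
    center1: float
    center2: float


@dataclass
class NumericStats:  # reduced to the two fields this sketch needs
    flags: list = field(default_factory=list)
    bimodal_stats: Optional[BimodalStats] = None


def run_dip_test(values):
    """Hartigan's Dip Test; diptest.diptest returns (dip statistic, p-value)."""
    import numpy as np
    import diptest
    dip, p_value = diptest.diptest(np.asarray(values, dtype=float))
    return float(dip), float(p_value)


def fit_two_centers(values):
    """Means of a 2-component Gaussian mixture, ordered ascending."""
    import numpy as np
    from sklearn.mixture import GaussianMixture
    column = np.asarray(values, dtype=float).reshape(-1, 1)  # sklearn expects 2-D input
    # random_state is an assumption here, added only to make the fit repeatable.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(column)
    low, high = sorted(float(m) for m in gmm.means_.ravel())
    return low, high


def detect_bimodality(stats, values, threshold,
                      dip_test=run_dip_test, fit_centers=fit_two_centers):
    """Apply the detection steps to one column's values, mutating stats."""
    # Mutual exclusion: a near-constant column never reaches the dip test.
    if NumericFlag.NearConstant in stats.flags:
        return stats
    non_null = [v for v in values if v is not None and not math.isnan(v)]
    dip, p_value = dip_test(non_null)
    if p_value < threshold:
        center1, center2 = fit_centers(non_null)
        stats.bimodal_stats = BimodalStats(dip, p_value, center1, center2)
        stats.flags.append(NumericFlag.Bimodal)
    # Otherwise bimodal_stats stays None and no flag is appended.
    return stats


def invariant_holds(stats):
    """NumericFlag.Bimodal present <-> bimodal_stats is not None."""
    return (NumericFlag.Bimodal in stats.flags) == (stats.bimodal_stats is not None)
```

Passing the dip test and the centre fit as callables is a design choice of this sketch only: it lets a unit test replace them with stubs and check the branching (fires, does not fire, skipped for NearConstant) without depending on the numerical behaviour of either library.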
Dependency
Add diptest to project dependencies (e.g. pyproject.toml).
Acceptance criteria
- A bimodal column receives NumericFlag.Bimodal and populates NumericStats.bimodal_stats with non-None center1 and center2
- A unimodal column does not receive NumericFlag.Bimodal and has bimodal_stats = None
- A column with NumericFlag.NearConstant already set is skipped: no dip test runs, no Bimodal flag
- bimodal_dip_p_value_threshold is read from NumericProfileConfig, not hard-coded
- NumericStats.to_dict() / from_dict() correctly serialises and deserialises BimodalStats (nested dict when present, None otherwise)
- diptest package is declared as a dependency
- Existing NumericProfiler tests pass (no regressions)
Blocked by
- Prerequisite: BimodalStats, NumericFlag.Bimodal, and NumericStats.bimodal_stats must exist
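For the serialisation criterion (nested dict when present, None otherwise), a round trip might look roughly like the sketch below. The method bodies are an assumption for illustration. The issue names to_dict() and from_dict() but does not show them, and the real NumericStats carries more fields than the single one modelled here.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BimodalStats:  # field names taken from the issue text
    dip_statistic: float
    dip_p_value: float
    center1: float
    center2: float


@dataclass
class NumericStats:  # reduced to the one field relevant to this criterion
    bimodal_stats: Optional[BimodalStats] = None

    def to_dict(self):
        # Nested dict when bimodal stats are present, None otherwise.
        nested = asdict(self.bimodal_stats) if self.bimodal_stats is not None else None
        return {"bimodal_stats": nested}

    @classmethod
    def from_dict(cls, data):
        raw = data.get("bimodal_stats")
        return cls(bimodal_stats=BimodalStats(**raw) if raw is not None else None)


# Round trip for both the populated and the empty case.
populated = NumericStats(bimodal_stats=BimodalStats(0.08, 0.001, 1.5, 8.5))
assert NumericStats.from_dict(populated.to_dict()) == populated
assert NumericStats.from_dict(NumericStats().to_dict()) == NumericStats()
```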