Compact Profile Report: human-readable to_markdown() with two-tier rendering #268

Description

@DEVunderdog

Problem Statement

A developer profiling a dataset with StructuralProfiler and calling result.to_markdown() receives a Markdown document that is ~1 MB and ~13 000 lines for an 82-column dataset. The method is documented as "lossless" — it dumps every field of every dataclass including histogram bins, full N×N correlation matrices, per-column memory breakdowns, and top-10 value frequency lists. No human can read this. The developer either ignores the method entirely or is forced to parse to_json() manually to find the signals they care about.

Solution

Replace to_markdown() with a compact, human-oriented view that covers every information category but depth-limits list-heavy fields and drops fields with no human interpretive value. Rename the current lossless implementation to to_full_markdown() so machine consumers that want the full dump still have it. The complete machine-readable output remains in to_dict() / to_json(); downstream phases always use those, never Markdown.

User Stories

  1. As a developer, I want result.to_markdown() to produce a document I can read in under five minutes, so that I can understand the dataset's structure and quality at a glance without parsing JSON.
  2. As a developer, I want every column to appear in a summary table at the top, so that I can scan all 80+ columns for type, missingness, and flags in one pass.
  3. As a developer, I want flagged columns (those with anomalies, meaningful missingness, or nonlinearity signals) to receive a full detail section, so that I can investigate problems without wading through clean columns.
  4. As a developer, I want clean columns to appear only in the summary table with no body section, so that the report is not dominated by columns that need no attention.
  5. As a developer, I want flagged columns ordered by severity (most urgent first), so that the most actionable columns appear at the top of the detail section.
  6. As a developer, I want correlation information shown as top-5 highest absolute Pearson and top-5 highest absolute Spearman per column, so that I can see a column's strongest relationships without reading an N×N matrix.
  7. As a developer, I want target correlations shown as top-5 Pearson and top-5 Spearman per feature per target, so that I can identify the most predictive features quickly.
  8. As a developer, I want all scalar stats (mean, median, std, variance, skewness, kurtosis, min, max, percentiles, null ratios, etc.) preserved in the compact view, so that I can read individual numbers without switching to JSON.
  9. As a developer, I want top_values capped at 3 entries per column, so that frequent-value lists do not dominate the per-column section.
  10. As a developer, I want histogram bins excluded from the compact view, so that the distribution shape is communicated through tags (SkewSeverity, KurtosisTag, TailAsymmetryTag) rather than unreadable bin tables.
  11. As a developer, I want the missingness correlation matrix excluded from the compact view, so that the MAR signal is communicated through MARSuspect flags and correlated_with column names rather than an N×N matrix.
  12. As a developer, I want per-column memory breakdown excluded from the compact view, so that the dataset memory section shows only the total memory_bytes scalar.
  13. As a developer, I want total_rows excluded from per-column missingness sections, so that the redundant dataset-level value is not repeated for every column.
  14. As a developer, I want correlated_with preserved in full in the compact view, so that I can see every column whose missingness correlates with this one (these are just column names — no bulk).
  15. As a developer, I want to_full_markdown() available for the lossless dump, so that I can still access every field when debugging or archiving a profile.
  16. As a developer, I want the compact view to use the same domain vocabulary as the rest of the library (severity labels, flag names, tag names), so that reading the report feels consistent with the docstrings and CONTEXT.md.
  17. As a developer, I want the Dataset Overview section to show only scalar dataset stats, so that dataset-level bulk data (memory breakdown, missingness matrix) does not inflate the header section.
  18. As a developer, I want sentinel declarations (numeric and string) included in the compact view, so that I can verify my configured sentinels are being applied without switching to to_json().
  19. As a developer, I want the Target Analysis section to show top-5 correlations per feature column per target, so that I can identify the most predictive features for each target in a readable table.
  20. As a developer calling to_full_markdown(), I want the lossless output to be identical in content to the previous to_markdown(), so that nothing is lost for debugging use cases.
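
The capping rules in stories 6, 7, and 9 can be sketched roughly as below. This is an illustration only, not the library's implementation: `top_k_abs`, `cap_top_values`, and the mapping-of-coefficients input shape are hypothetical names and assumptions made for the example, and dropping NaN coefficients is likewise an assumption the issue does not specify.

```python
from typing import List, Mapping, Optional, Sequence, Tuple


def top_k_abs(
    correlations: Mapping[str, float],
    k: int = 5,
    exclude: Optional[str] = None,
) -> List[Tuple[str, float]]:
    """Return the k entries with the largest absolute coefficient, strongest first.

    `correlations` is assumed to be one row of a correlation matrix, keyed by
    the other column's name. `exclude` lets the caller drop the column's own
    self-correlation entry.
    """
    candidates = [
        (name, r)
        for name, r in correlations.items()
        if name != exclude and r == r  # r == r is False only for NaN
    ]
    # Sort by descending |r|; break ties alphabetically for stable output.
    return sorted(candidates, key=lambda item: (-abs(item[1]), item[0]))[:k]


def cap_top_values(top_values: Sequence[Tuple[str, int]], limit: int = 3) -> list:
    """Keep only the first `limit` entries of an already-ranked top_values list."""
    return list(top_values[:limit])


# Hypothetical Pearson row for a single column.
pearson_row = {"a": 0.1, "b": -0.9, "c": 0.5, "d": -0.3, "e": 0.7, "f": 0.2}
strongest = top_k_abs(pearson_row)
```

Ranking by absolute value keeps strong negative relationships (such as `b` here) visible, which a plain descending sort would push to the bottom.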

Implementation Decisions

  • to_markdown() becomes compact; to_full_markdown() takes over the lossless role. The existing implementation of to_markdown() is renamed to to_full_markdown(), and a new to_markdown() is written from scratch to implement the compact view. This is a breaking change to a public API method, and no compatibility shim is added (consumers that need lossless machine-readable output should use to_json()).

  • Field inclusion rules:

    • Dropped from compact view: histogram (bins), missingness_matrix (N×N), memory_breakdown (per-column bytes), total_rows on ColumnMissingnessProfile.
    • Capped: top_values → 3 entries; feature/target correlation matrices → top-5 absolute Pearson + top-5 absolute Spearman per column.
    • Kept in full: all scalar fields including redundant pairs (std+variance, both null ratios, mean_median_ratio); correlated_with (column names only); all enum tags and flags; all PercentileSnapshot values; BimodalStats fields; RowMissingnessDistribution fields; sentinel dicts.
  • Clean threshold for two-tier rendering: A column is clean (summary table only) when ALL of: no MissingnessFlag, no NumericFlag, MissingSeverity is None or Minor, NonlinearityTag is None or Linear. Any other column is flagged and receives a full detail section.

  • Flagged column ordering: DropCandidate/FullyNull first, then MissingSeverity.Severe → High → Moderate → Minor-but-flagged → columns flagged only for NumericFlag/NonlinearityTag. Alphabetical within each tier.

  • Document structure:

    # Structural Profile Report (Compact)
    ## Dataset Overview        ← scalars only; no memory_breakdown
    ## Column Summary          ← one table row per column, all columns
    ## Flagged Columns         ← full detail sections, severity-first
    ## Target Analysis         ← top-5 Pearson + Spearman per feature, per target
    ## Sentinels               ← numeric_sentinels + string_sentinels unchanged
    
  • to_full_markdown() is lossless. Its output must be content-equivalent to the previous to_markdown() — same fields, same structure, no omissions. It is not part of the human-facing compact API; it exists for debugging and archival.

  • CONTEXT.md gains three new terms under a Profile Serialization heading: Compact Profile Report, Full Profile Report, Two-Tier Column Rendering. ADR-0040 records the decision and the rejected alternatives (add to_compact_markdown() alongside; to_markdown(compact=False) parameter).
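
The clean threshold and flagged-column ordering described above could be expressed along these lines. This is a minimal sketch under stated assumptions: `ColumnSignals` is a hypothetical stand-in for whatever the real column profile exposes, flags and tags are modelled as plain strings rather than the library's enums, and the `"Outliers"` flag name in the usage lines is invented for illustration. Placing a column that has a flag but no MissingSeverity in the last tier is also an assumption; the issue does not spell that case out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class ColumnSignals:
    """Hypothetical per-column view of the fields the threshold looks at."""
    name: str
    missingness_flags: FrozenSet[str] = frozenset()
    numeric_flags: FrozenSet[str] = frozenset()
    missing_severity: Optional[str] = None   # None, "Minor", "Moderate", "High", "Severe"
    nonlinearity_tag: Optional[str] = None   # None, "Linear", or another tag


def is_clean(col: ColumnSignals) -> bool:
    """A column is clean only when ALL four conditions hold."""
    return (
        not col.missingness_flags
        and not col.numeric_flags
        and col.missing_severity in (None, "Minor")
        and col.nonlinearity_tag in (None, "Linear")
    )


_SEVERITY_TIER = {"Severe": 1, "High": 2, "Moderate": 3, "Minor": 4}


def severity_tier(col: ColumnSignals) -> int:
    """Lower tier means more urgent; tier 0 is DropCandidate/FullyNull."""
    if {"DropCandidate", "FullyNull"} & (col.missingness_flags | col.numeric_flags):
        return 0
    # Columns without a missing severity fall into the last tier (assumption).
    return _SEVERITY_TIER.get(col.missing_severity, 5)


def flagged_in_order(columns: Iterable[ColumnSignals]) -> List[ColumnSignals]:
    """Return only flagged columns, most urgent first, alphabetical within a tier."""
    flagged = [c for c in columns if not is_clean(c)]
    return sorted(flagged, key=lambda c: (severity_tier(c), c.name))


columns = [
    ColumnSignals("zip", numeric_flags=frozenset({"Outliers"})),
    ColumnSignals("age", missing_severity="Minor"),
    ColumnSignals("income", missing_severity="High"),
    ColumnSignals("notes", missingness_flags=frozenset({"FullyNull"})),
    ColumnSignals("balance", missing_severity="Severe"),
    ColumnSignals("id"),
]
detail_order = [c.name for c in flagged_in_order(columns)]
```

In this example `age` and `id` stay in the summary table only, since a Minor severity with no flags still counts as clean.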

Testing Decisions

  • What makes a good test: Test the rendered Markdown output for structural properties and presence/absence of fields — not the internal logic of which columns are flagged. Drive tests from StructuralProfileResult instances constructed directly with known data, not from running StructuralProfiler end-to-end.

  • Modules tested:

    • StructuralProfileResult.to_markdown() — compact view contract
    • StructuralProfileResult.to_full_markdown() — lossless contract (assert no fields are dropped relative to to_dict())
  • Test cases for to_markdown():

    • A result with all-clean columns: assert no per-column detail sections appear; assert Column Summary table contains all columns.
    • A result with one flagged column: assert that column has a detail section; assert clean columns do not.
    • A result with NumericStats: assert histogram is absent; assert top_values section has at most 3 entries; assert scalar fields (mean, std, skewness, etc.) are present.
    • A result with feature_correlation: assert at most 5 Pearson and 5 Spearman entries appear per column; assert no full matrix dump.
    • A result with memory_breakdown set: assert per-column bytes do not appear; assert memory_bytes total does.
    • A result with missingness_matrix set: assert matrix does not appear in compact output.
    • Flagged column ordering: assert DropCandidate column appears before Severe column which appears before High column.
    • A result with target correlations: assert top-5 Pearson and top-5 Spearman appear per feature column per target.
  • Prior art: tests/integration/test_structural_end_to_end.py works with StructuralProfileResult instances obtained from StructuralProfiler.profile(). New serialization tests belong in a new file under tests/unit/profiling/ and construct StructuralProfileResult from fixture data, consistent with the unit test pattern in that directory.
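
Structural assertions of the kind listed above might rely on small helpers like these. This is a sketch only: it assumes detail sections are rendered as `### <column>` headings under `## Flagged Columns` and that the Column Summary is a pipe table with a header and a separator row. Neither is confirmed by this issue. The helper names and the `SAMPLE` document are hypothetical; in a real test the string would come from `result.to_markdown()`.

```python
import re
from typing import Dict, List


def split_sections(markdown: str) -> Dict[str, str]:
    """Map each level-2 heading title to the text beneath it."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^## (.+?)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body) for title, body in sections.items()}


def detail_columns(markdown: str) -> List[str]:
    """Column names that received a detail section (assumed '### name' headings)."""
    body = split_sections(markdown).get("Flagged Columns", "")
    return re.findall(r"^### (.+?)\s*$", body, flags=re.MULTILINE)


def summary_row_count(markdown: str) -> int:
    """Number of data rows in the Column Summary table (header and separator excluded)."""
    body = split_sections(markdown).get("Column Summary", "")
    rows = [line for line in body.splitlines() if line.startswith("|")]
    return max(len(rows) - 2, 0)


# Hand-written stand-in for compact output, used only to exercise the helpers.
SAMPLE = """# Structural Profile Report (Compact)
## Dataset Overview
rows: 3
## Column Summary
| column | type | flags |
| --- | --- | --- |
| notes | string | FullyNull |
| age | numeric | |
| id | numeric | |
## Flagged Columns
### notes
null_ratio: 1.0
## Target Analysis
## Sentinels
"""

flagged = detail_columns(SAMPLE)
n_summary_rows = summary_row_count(SAMPLE)
```

Keeping the assertions at the level of headings and table rows matches the stated goal of testing presence and absence of fields rather than the flagging logic itself.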

Out of Scope

  • LLM agent integration or token-budget-aware serialization — the compact view is designed for human readers only; agent context is a future concern.
  • Any changes to to_dict() or to_json() — these remain lossless and are the canonical machine-readable formats.
  • ASCII sparklines or visual histogram rendering in the compact view.
  • Per-column configurable field inclusion (e.g. to_markdown(include_histogram=True)) — one compact contract, no toggles.
  • Changes to any other serialization method (ColumnProfile.to_dict(), NumericStats.to_dict(), etc.) — those remain untouched; only the top-level Markdown rendering changes.

Further Notes

ADR-0040 (docs/adr/0040-compact-profile-markdown-human-view.md) records the full decision including field inclusion rules, the two-tier rendering threshold, document structure, and the two rejected alternatives. CONTEXT.md has been updated with Compact Profile Report, Full Profile Report, and Two-Tier Column Rendering terms.

The example output at dataforgeml-examples/result.md (the previous lossless markdown, ~1 MB) serves as a reference for what to_full_markdown() must preserve.
