Compact Profile Report: human-readable to_markdown() with two-tier rendering #268

## Problem Statement

A developer profiling a dataset with `StructuralProfiler` and calling `result.to_markdown()` receives a Markdown document that is ~1 MB and ~13 000 lines for an 82-column dataset. The method is documented as "lossless": it dumps every field of every dataclass, including histogram bins, full N×N correlation matrices, per-column memory breakdowns, and top-10 value frequency lists. No human can read this. The developer either ignores the method entirely or is forced to parse `to_json()` manually to find the signals they care about.

## Solution

Replace `to_markdown()` with a compact, human-oriented view that covers every information category but depth-limits list-heavy fields and drops fields with no human interpretive value. Rename the current lossless implementation to `to_full_markdown()` so machine consumers that want the full dump still have it. The complete machine-readable output remains in `to_dict()` / `to_json()`; downstream phases always use those, never Markdown.

## User Stories

1. As a developer, I want `result.to_markdown()` to produce a document I can read in under five minutes, so that I can understand the dataset's structure and quality at a glance without parsing JSON.
2. As a developer, I want every column to appear in a summary table at the top, so that I can scan all 80+ columns for type, missingness, and flags in one pass.
3. As a developer, I want flagged columns (those with anomalies, meaningful missingness, or nonlinearity signals) to receive a full detail section, so that I can investigate problems without wading through clean columns.
4. As a developer, I want clean columns to appear only in the summary table with no body section, so that the report is not dominated by columns that need no attention.
5. As a developer, I want flagged columns ordered by severity (most urgent first), so that the most actionable columns appear at the top of the detail section.
6. As a developer, I want correlation information shown as top-5 highest absolute Pearson and top-5 highest absolute Spearman per column, so that I can see a column's strongest relationships without reading an N×N matrix.
7. As a developer, I want target correlations shown as top-5 Pearson and top-5 Spearman per feature per target, so that I can identify the most predictive features quickly.
8. As a developer, I want all scalar stats (mean, median, std, variance, skewness, kurtosis, min, max, percentiles, null ratios, etc.) preserved in the compact view, so that I can read individual numbers without switching to JSON.
9. As a developer, I want `top_values` capped at 3 entries per column, so that frequent-value lists do not dominate the per-column section.
10. As a developer, I want histogram bins excluded from the compact view, so that the distribution shape is communicated through tags (`SkewSeverity`, `KurtosisTag`, `TailAsymmetryTag`) rather than unreadable bin tables.
11. As a developer, I want the missingness correlation matrix excluded from the compact view, so that the MAR signal is communicated through `MARSuspect` flags and `correlated_with` column names rather than an N×N matrix.
12. As a developer, I want per-column memory breakdown excluded from the compact view, so that the dataset memory section shows only the total `memory_bytes` scalar.
13. As a developer, I want `total_rows` excluded from per-column missingness sections, so that the redundant dataset-level value is not repeated for every column.
14. As a developer, I want `correlated_with` preserved in full in the compact view, so that I can see every column whose missingness correlates with this one (these are just column names — no bulk).
15. As a developer, I want `to_full_markdown()` available for the lossless dump, so that I can still access every field when debugging or archiving a profile.
16. As a developer, I want the compact view to use the same domain vocabulary as the rest of the library (severity labels, flag names, tag names), so that reading the report feels consistent with the docstrings and CONTEXT.md.
17. As a developer, I want the Dataset Overview section to show only scalar dataset stats, so that dataset-level bulk data (memory breakdown, missingness matrix) does not inflate the header section.
18. As a developer, I want sentinel declarations (numeric and string) included in the compact view, so that I can verify my configured sentinels are being applied without switching to `to_json()`.
19. As a developer, I want the Target Analysis section to show top-5 correlations per feature column per target, so that I can identify the most predictive features for each target in a readable table.
20. As a developer calling `to_full_markdown()`, I want the lossless output to be identical in content to the previous `to_markdown()`, so that nothing is lost for debugging use cases.
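The top-5 selection described in stories 6, 7, and 19 can be sketched as below. This is a minimal illustration, not the library's implementation: `top_k_by_abs` and the `pearson_row` data are hypothetical, and excluding the column's self-correlation and NaN entries is an assumption the stories do not spell out.

```python
from typing import Mapping


def top_k_by_abs(
    correlations: Mapping[str, float], column: str, k: int = 5
) -> list[tuple[str, float]]:
    """Return the k partners of `column` with the largest |r|, strongest first."""
    partners = [
        (name, r)
        for name, r in correlations.items()
        # Assumption: skip the self-correlation and NaN entries (NaN != NaN).
        if name != column and r == r
    ]
    # Sort by descending magnitude; ties break alphabetically for stable output.
    partners.sort(key=lambda item: (-abs(item[1]), item[0]))
    return partners[:k]


# One row of a hypothetical Pearson matrix for the column "price".
pearson_row = {
    "price": 1.0, "area": 0.82, "age": -0.91, "rooms": 0.40,
    "floor": 0.05, "dist": -0.33, "lat": 0.12, "lon": float("nan"),
}
top = top_k_by_abs(pearson_row, "price")
print([name for name, _ in top])
```

The same helper would apply unchanged to a Spearman row, and the sign of each coefficient is kept so the reader can still tell positive from negative relationships.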

## Implementation Decisions

- **`to_markdown()` becomes compact; `to_full_markdown()` becomes lossless.** The existing implementation of `to_markdown()` is renamed `to_full_markdown()`. A new `to_markdown()` is written from scratch implementing the compact view. This is a breaking change on a public API method; no compatibility shim is added (consumers that need lossless machine-readable output should use `to_json()`).

- **Field inclusion rules:**
  - *Dropped from compact view:* `histogram` (bins), `missingness_matrix` (N×N), `memory_breakdown` (per-column bytes), `total_rows` on `ColumnMissingnessProfile`.
  - *Capped:* `top_values` → 3 entries; feature/target correlation matrices → top-5 absolute Pearson + top-5 absolute Spearman per column.
  - *Kept in full:* all scalar fields including redundant pairs (`std`+`variance`, both null ratios, `mean_median_ratio`); `correlated_with` (column names only); all enum tags and flags; all `PercentileSnapshot` values; `BimodalStats` fields; `RowMissingnessDistribution` fields; sentinel dicts.

- **Clean threshold for two-tier rendering:** A column is *clean* (summary table only) when ALL of: no `MissingnessFlag`, no `NumericFlag`, `MissingSeverity` is `None` or `Minor`, `NonlinearityTag` is `None` or `Linear`. Any other column is *flagged* and receives a full detail section.

- **Flagged column ordering:** `DropCandidate`/`FullyNull` first, then `MissingSeverity.Severe` → `High` → `Moderate` → `Minor`-but-flagged → columns flagged only for `NumericFlag`/`NonlinearityTag`. Alphabetical within each tier.

- **Document structure:**
  ```
  # Structural Profile Report (Compact)
  ## Dataset Overview        ← scalars only; no memory_breakdown
  ## Column Summary          ← one table row per column, all columns
  ## Flagged Columns         ← full detail sections, severity-first
  ## Target Analysis         ← top-5 Pearson + Spearman per feature, per target
  ## Sentinels               ← numeric_sentinels + string_sentinels unchanged
  ```

- **`to_full_markdown()` is lossless.** Its output must be content-equivalent to the previous `to_markdown()` — same fields, same structure, no omissions. It is not part of the human-facing compact API; it exists for debugging and archival.

- **CONTEXT.md** gains three new terms under a *Profile Serialization* heading: *Compact Profile Report*, *Full Profile Report*, *Two-Tier Column Rendering*. ADR-0040 records the decision and the rejected alternatives (add `to_compact_markdown()` alongside; `to_markdown(compact=False)` parameter).
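
The clean threshold and the flagged-column ordering above can be sketched as follows. `ColumnView` is a hypothetical, simplified stand-in for the library's per-column profile, with enums replaced by plain strings; `"SomeNumericFlag"` is a placeholder, not a real flag name. Columns whose only signal is a missingness flag other than `DropCandidate`/`FullyNull` are placed in the last tier here, which is an assumption the ordering rule does not cover.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ColumnView:
    """Hypothetical stand-in for a column profile (strings instead of enums)."""
    name: str
    missingness_flags: set[str] = field(default_factory=set)
    numeric_flags: set[str] = field(default_factory=set)
    missing_severity: Optional[str] = None  # None, "Minor", "Moderate", "High", "Severe"
    nonlinearity_tag: Optional[str] = None  # None, "Linear", or another tag


def is_clean(col: ColumnView) -> bool:
    """Clean means ALL four conditions of the threshold hold."""
    return (
        not col.missingness_flags
        and not col.numeric_flags
        and col.missing_severity in (None, "Minor")
        and col.nonlinearity_tag in (None, "Linear")
    )


_SEVERITY_TIER = {"Severe": 1, "High": 2, "Moderate": 3, "Minor": 4}


def severity_tier(col: ColumnView) -> int:
    """Lower tier sorts first: drop candidates, then severity, then the rest."""
    if col.missingness_flags & {"DropCandidate", "FullyNull"}:
        return 0
    return _SEVERITY_TIER.get(col.missing_severity, 5)


def flagged_in_order(columns: list[ColumnView]) -> list[ColumnView]:
    """Return only flagged columns, most urgent first, alphabetical within a tier."""
    flagged = [c for c in columns if not is_clean(c)]
    return sorted(flagged, key=lambda c: (severity_tier(c), c.name))


columns = [
    ColumnView("zip", numeric_flags={"SomeNumericFlag"}),
    ColumnView("age", missing_severity="High"),
    ColumnView("id"),
    ColumnView("notes", missingness_flags={"FullyNull"}, missing_severity="Severe"),
    ColumnView("income", missing_severity="Severe"),
    ColumnView("city", missing_severity="Minor"),
]
ordered_names = [c.name for c in flagged_in_order(columns)]
print(ordered_names)
```

In this sketch `id` and `city` stay in the summary table only, while the other four columns would each receive a detail section in the printed order.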

## Testing Decisions

- **What makes a good test:** Test the rendered Markdown output for structural properties and presence/absence of fields — not the internal logic of which columns are flagged. Drive tests from `StructuralProfileResult` instances constructed directly with known data, not from running `StructuralProfiler` end-to-end.

- **Modules tested:**
  - `StructuralProfileResult.to_markdown()` — compact view contract
  - `StructuralProfileResult.to_full_markdown()` — lossless contract (assert no fields are dropped relative to `to_dict()`)

- **Test cases for `to_markdown()`:**
  - A result with all-clean columns: assert no per-column detail sections appear; assert Column Summary table contains all columns.
  - A result with one flagged column: assert that column has a detail section; assert clean columns do not.
  - A result with `NumericStats`: assert histogram is absent; assert `top_values` section has at most 3 entries; assert scalar fields (mean, std, skewness, etc.) are present.
  - A result with `feature_correlation`: assert at most 5 Pearson and 5 Spearman entries appear per column; assert no full matrix dump.
  - A result with `memory_breakdown` set: assert per-column bytes do not appear; assert `memory_bytes` total does.
  - A result with `missingness_matrix` set: assert matrix does not appear in compact output.
  - Flagged column ordering: assert `DropCandidate` column appears before `Severe` column which appears before `High` column.
  - A result with target correlations: assert top-5 Pearson and top-5 Spearman appear per feature column per target.

- **Prior art:** `tests/integration/test_structural_end_to_end.py` obtains its `StructuralProfileResult` instances by running `StructuralProfiler.profile()`. New serialization tests belong in a new file under `tests/unit/profiling/`, constructing `StructuralProfileResult` directly from fixture data, consistent with the unit test pattern in that directory.
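
A structural assertion of the kind listed above might look like the sketch below. It works on the rendered Markdown string only. The helpers `section_body` and `detail_columns` are hypothetical, and the `### <column name>` heading format for detail sections is an assumption; only the `##` section names come from the document structure in this issue. In a real test, `sample` would be the return value of `to_markdown()` on a fixture-built result rather than a literal.

```python
import re


def section_body(markdown: str, heading: str) -> str:
    """Return the text between `## heading` and the next `## ` heading (or end)."""
    pattern = rf"^## {re.escape(heading)}[ \t]*\n(.*?)(?=^## |\Z)"
    match = re.search(pattern, markdown, flags=re.MULTILINE | re.DOTALL)
    return match.group(1) if match else ""


def detail_columns(markdown: str) -> list[str]:
    """List column names that have a detail section, in document order."""
    body = section_body(markdown, "Flagged Columns")
    # Assumption: each detail section starts with a `### <column name>` heading.
    return re.findall(r"^### (.+?)[ \t]*$", body, flags=re.MULTILINE)


# Hand-written stand-in for compact output; not produced by the library.
sample = """\
# Structural Profile Report (Compact)
## Dataset Overview
rows: 100
## Column Summary
| column | flags |
| --- | --- |
| id | |
| notes | FullyNull |
| income | |
## Flagged Columns
### notes
detail for notes
### income
detail for income
## Target Analysis
"""


def test_flagged_sections_are_ordered_and_clean_columns_have_none():
    names = detail_columns(sample)
    # Flagged columns appear in severity order; the clean column has no section.
    assert names == ["notes", "income"]
    assert "id" not in names
    # Every column still shows up in the summary table.
    summary = section_body(sample, "Column Summary")
    for column in ("id", "notes", "income"):
        assert f"| {column} |" in summary


test_flagged_sections_are_ordered_and_clean_columns_have_none()
```

Keeping the assertions on headings and table rows, rather than on exact text, lets the rendering details change without breaking the structural contract.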

## Out of Scope

- LLM agent integration or token-budget-aware serialization — the compact view is designed for human readers only; agent context is a future concern.
- Any changes to `to_dict()` or `to_json()` — these remain lossless and are the canonical machine-readable formats.
- ASCII sparklines or visual histogram rendering in the compact view.
- Per-column configurable field inclusion (e.g. `to_markdown(include_histogram=True)`) — one compact contract, no toggles.
- Changes to any other serialization method (`ColumnProfile.to_dict()`, `NumericStats.to_dict()`, etc.) — those remain untouched; only the top-level Markdown rendering changes.

## Further Notes

ADR-0040 (`docs/adr/0040-compact-profile-markdown-human-view.md`) records the full decision including field inclusion rules, the two-tier rendering threshold, document structure, and the two rejected alternatives. CONTEXT.md has been updated with *Compact Profile Report*, *Full Profile Report*, and *Two-Tier Column Rendering* terms.

The example output at `dataforgeml-examples/result.md` (the previous lossless markdown, ~1 MB) serves as a reference for what `to_full_markdown()` must preserve.
