[Data Quality] Gate splits on holdout row ratios and oversized leakage components #112

## Finding
`split_dataset` can produce leakage-free but highly degenerate train/validation/test splits without failing. On the current generated source dataset, exact-output grouping creates very large connected components. Because each component must land in a single split, the default split ratios cannot be honored, and validation ends up with only 76 of 83,080 rows, too few to be representative.

## Evidence
- `train.py:602-670` assigns leakage-connected components to splits, but its only safeguard is a best-effort repair that keeps the holdout splits non-empty.
- `train.py:918-927` fails only when leakage validation fails or when the optional coverage check (`--min-holdout-stratum-count`) fails. By default it does not enforce minimum train/val/test row ratios, a maximum component size, or a minimum number of holdout rows.
- Read-only audit command:
  ```bash
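  python - <<'PY'
  import json
  import train

  # Load the generated source dataset, skipping blank lines.
  with open('data/hf_training_dataset.jsonl') as f:
      rows = [json.loads(line) for line in f if line.strip()]

  # Request train=0.85 and val=0.10; the remaining 0.05 goes to test.
  tr, va, te = train._grouped_split(rows, 0.85, 0.10, seed=42)
  print(len(tr), len(va), len(te))
  print(train.validate_split_leakage({'train': tr, 'val': va, 'test': te}, sample_limit=3)['status'])
  PY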
  ```
  Result: `82184 76 820` with leakage status `passed`. The source has 83,080 rows, so the intended counts are approximately 70,618 / 8,308 / 4,154; validation receives under 1% of its target and test about 20%.
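
The size of the gap can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the row counts reported above and assumes the test ratio is the remainder, 1 - 0.85 - 0.10 = 0.05; the `report` structure is illustrative, not an existing manifest format.

```python
total = 83_080
# Requested ratios; the test ratio is assumed to be the remainder.
ratios = {"train": 0.85, "val": 0.10, "test": 0.05}
# Actual split sizes from the audit command.
actual = {"train": 82_184, "val": 76, "test": 820}

assert sum(actual.values()) == total

report = {}
for name, ratio in ratios.items():
    target = round(total * ratio)
    report[name] = {
        "target": target,
        "actual": actual[name],
        "delta": actual[name] - target,
        "actual_ratio": actual[name] / total,
    }

for name, row in report.items():
    print(
        f"{name:5s} target={row['target']:>6d} actual={row['actual']:>6d} "
        f"delta={row['delta']:+7d} actual_ratio={row['actual_ratio']:.4%}"
    )
```

Train absorbs 11,566 rows more than requested, while validation and test fall short by 8,232 and 3,334 rows respectively.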

## Impact
A run can silently train with almost no validation coverage and a much smaller test set than intended. Model selection, early stopping, and reported metrics become unstable even though the leakage validator reports success.

## Recommended fix
Add split-quality gates after component assignment. Fail (or require an explicit override) when a split misses a configurable minimum row count or ratio, when a leakage component is too large for the requested ratios to be satisfiable, or when duplicate output bodies would have to be capped before splitting to make the ratios achievable.
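
One possible shape for the row-count gate is sketched below. Everything in it is hypothetical: `SplitGateConfig`, `SplitQualityError`, `check_split_quality`, and the default thresholds are illustrative names and values, not existing `train.py` options, and the real thresholds would need to be chosen by the maintainers.

```python
from dataclasses import dataclass


class SplitQualityError(ValueError):
    """Raised when a split misses its configured size requirements."""


@dataclass(frozen=True)
class SplitGateConfig:
    # Hypothetical knobs; none of these exist in train.py today.
    max_ratio_shortfall: float = 0.5  # a split may fall at most 50% below its target
    min_holdout_rows: int = 500       # absolute floor for every non-train split
    allow_degenerate: bool = False    # explicit override


def check_split_quality(sizes, target_ratios, config=SplitGateConfig()):
    """Return target-vs-actual deltas; raise if any split is too small."""
    total = sum(sizes.values())
    deltas, problems = {}, []
    for name, ratio in target_ratios.items():
        target = round(total * ratio)
        actual = sizes[name]
        deltas[name] = {"target": target, "actual": actual, "delta": actual - target}
        if actual < target * (1 - config.max_ratio_shortfall):
            problems.append(f"{name}: {actual} rows, target {target}")
        if name != "train" and actual < config.min_holdout_rows:
            problems.append(
                f"{name}: {actual} rows is below the floor of {config.min_holdout_rows}"
            )
    if problems and not config.allow_degenerate:
        raise SplitQualityError(
            "degenerate split: " + "; ".join(problems)
            + " (cap duplicate outputs or pass an explicit override)"
        )
    return deltas
```

Called with the sizes from the audit (`{"train": 82184, "val": 76, "test": 820}`) and ratios of 0.85 / 0.10 / 0.05, this sketch raises `SplitQualityError`; with `allow_degenerate=True` it returns the deltas so they can still be written to the manifest.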

## Acceptance criteria
- Default split generation fails on the current degenerate 82,184 / 76 / 820 outcome unless an explicit override is supplied.
- The split manifest records largest component sizes and target-vs-actual row deltas.
- Tests cover a dataset where exact-output duplicates form a giant component and assert the splitter fails with an actionable message.
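
The last criterion could be exercised with a test along these lines. This is a sketch under stated assumptions: rows are assumed to carry `input` and `output` fields, exact-duplicate `output` values are used as a lower bound on the largest leakage component, and `assert_ratios_feasible` is a hypothetical helper rather than a function in `train.py`. The check is necessary but not sufficient: a component larger than the largest requested split can never be placed without breaking the ratios.

```python
from collections import Counter


def largest_duplicate_share(rows, key="output"):
    """Share of rows in the largest exact-duplicate group for `key`."""
    counts = Counter(row[key] for row in rows)
    return max(counts.values()) / len(rows)


def assert_ratios_feasible(rows, target_ratios, key="output"):
    """Raise if a single duplicate group cannot fit into any requested split."""
    share = largest_duplicate_share(rows, key)
    limit = max(target_ratios.values())
    if share > limit:
        raise ValueError(
            f"largest component holds {share:.1%} of rows but the largest "
            f"requested split is {limit:.1%}; cap duplicate outputs or pass "
            "an explicit override"
        )
    return share


def test_giant_duplicate_component_is_rejected():
    # 95 of 100 rows share one output body, so they form one giant component.
    rows = [{"input": f"q{i}", "output": "same body"} for i in range(95)]
    rows += [{"input": f"u{i}", "output": f"unique {i}"} for i in range(5)]
    ratios = {"train": 0.85, "val": 0.10, "test": 0.05}
    try:
        assert_ratios_feasible(rows, ratios)
    except ValueError as exc:
        # The message should name the problem and the available remedies.
        assert "largest component" in str(exc)
        assert "override" in str(exc)
    else:
        raise AssertionError("expected the feasibility check to fail")


if __name__ == "__main__":
    test_giant_duplicate_component_is_rejected()
```

The share returned by the helper is also a natural value to record in the split manifest alongside the target-vs-actual row deltas.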
