Finding
split_dataset can produce leakage-free but highly degenerate train/validation/test splits without failing. On the current generated source dataset, exact-output grouping creates very large connected components, so the default split ratios are not honored and validation becomes too small to be representative.
Evidence
- train.py:602-670 assigns leakage-connected components to splits, but its only safeguard is a best-effort repair that keeps the holdout splits non-empty.
- train.py:918-927 fails only on leakage validation or on the optional coverage check (--min-holdout-stratum-count); by default it does not enforce minimum train/val/test row ratios, a maximum component size, or a minimum number of holdout rows.
- Read-only audit command:
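python - <<'PY'
import json, train
# Load the source dataset, skipping blank lines.
with open('data/hf_training_dataset.jsonl') as f:
    rows = [json.loads(line) for line in f if line.strip()]
# Requested ratios: 0.85 train, 0.10 validation.
tr, va, te = train._grouped_split(rows, 0.85, 0.10, seed=42)
# Row counts per split, then the leakage validator's verdict.
print(len(tr), len(va), len(te))
print(train.validate_split_leakage({'train': tr, 'val': va, 'test': te}, sample_limit=3)['status'])
PY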
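- Result: 82184 76 820 with leakage status passed. The source has 83,080 rows, so the intended counts are approximately 70,618 / 8,308 / 4,154; the actual validation split holds about 0.09% of the rows and the test split about 0.99%.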
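The gap between the intended and actual splits can be reproduced from the reported numbers alone. The sketch below is illustrative only: the 0.05 test ratio is inferred from the intended counts above, and the `report` layout is a hypothetical shape for the target-vs-actual deltas, not the manifest format used by train.py.

```python
# Row counts reported by the audit command above.
total = 83_080
actual = {"train": 82_184, "val": 76, "test": 820}

# Assumption: 0.85 / 0.10 are the requested ratios and the remaining
# 0.05 is the test share (consistent with the intended counts).
ratios = {"train": 0.85, "val": 0.10, "test": 0.05}

# Sanity check: the three splits account for every source row.
assert sum(actual.values()) == total

report = {}
for name, ratio in ratios.items():
    target = round(total * ratio)
    report[name] = {
        "target": target,
        "actual": actual[name],
        "delta": actual[name] - target,       # positive means oversized
        "actual_ratio": actual[name] / total,
    }

for name, row in report.items():
    print(
        f"{name:5s} target={row['target']:>6d} actual={row['actual']:>6d} "
        f"delta={row['delta']:+7d} ratio={row['actual_ratio']:.4%}"
    )
```

Train is oversized by 11,566 rows, while validation and test are short by 8,232 and 3,334 rows respectively.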
Impact
A run can silently train with almost no validation coverage and a much smaller test set than intended. Model selection, early stopping, and reported metrics become unstable even though the leakage validator reports success.
Recommended fix
Add split-quality gates after component assignment. Fail (or require an explicit override) when a split misses configurable minimum row counts/ratios, when a leakage component is too large to satisfy the requested ratios, or when duplicate output bodies must be capped before splitting.
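One possible shape for such a gate is sketched below. Everything in it is hypothetical: `SplitQualityError`, `SplitGate`, `check_split_quality`, the threshold values, and the `allow_degenerate` override are illustrative names and defaults, not existing parts of train.py. The largest component size is assumed to be supplied by the caller after component assignment.

```python
from dataclasses import dataclass


class SplitQualityError(ValueError):
    """Raised when a split misses its configured size gates."""


@dataclass(frozen=True)
class SplitGate:
    # Illustrative defaults only; real values would need tuning.
    ratio_tolerance: float = 0.02  # allowed absolute deviation from each target ratio
    min_holdout_rows: int = 500    # floor for the val and test row counts


def check_split_quality(counts, targets, largest_component,
                        gate=SplitGate(), allow_degenerate=False):
    """Return a list of problems; raise unless the override is set.

    counts:  {"train": n, "val": n, "test": n} actual row counts
    targets: {"train": r, "val": r, "test": r} requested ratios
    """
    total = sum(counts.values())
    if total == 0:
        raise SplitQualityError("split contains no rows")

    problems = []
    for name, target in targets.items():
        actual_ratio = counts[name] / total
        if abs(actual_ratio - target) > gate.ratio_tolerance:
            problems.append(
                f"{name}: ratio {actual_ratio:.4f} deviates from target {target:.2f}"
            )

    for name in ("val", "test"):
        if counts[name] < gate.min_holdout_rows:
            problems.append(
                f"{name}: {counts[name]} rows is below the minimum {gate.min_holdout_rows}"
            )

    # A component is assigned whole, so one that exceeds the largest
    # target split (plus tolerance) cannot fit anywhere.
    max_fit = (max(targets.values()) + gate.ratio_tolerance) * total
    if largest_component > max_fit:
        problems.append(
            f"largest component ({largest_component} rows) cannot fit any split"
        )

    if problems and not allow_degenerate:
        raise SplitQualityError("; ".join(problems))
    return problems
```

Returning the problem list even when the override is set lets the caller record the violations in the split manifest instead of dropping them.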
Acceptance criteria
- Default split generation fails on the current degenerate 82,184 / 76 / 820 outcome unless an explicit override is supplied.
- The split manifest records largest component sizes and target-vs-actual row deltas.
- Tests cover a dataset where exact-output duplicates form a giant component and assert the splitter fails with an actionable message.
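A self-contained sketch of the giant-component test follows. It simplifies heavily: components are approximated by grouping rows on an exact `output` field (a hypothetical field name), whereas the real splitter builds leakage-connected components. `largest_output_component` and `assert_ratios_feasible` are illustrative helpers, not functions from train.py.

```python
from collections import Counter


def largest_output_component(rows):
    """Size of the largest group of rows sharing an identical 'output' value."""
    counts = Counter(row["output"] for row in rows)
    return max(counts.values(), default=0)


def assert_ratios_feasible(rows, train_ratio):
    """Fail when the rows outside the biggest component cannot fill the holdout."""
    total = len(rows)
    largest = largest_output_component(rows)
    holdout_target = round(total * (1 - train_ratio))
    available = total - largest
    if available < holdout_target:
        raise ValueError(
            f"largest duplicate-output component has {largest} of {total} rows; "
            f"only {available} rows remain for a holdout target of {holdout_target}. "
            "Cap duplicate outputs before splitting or supply an explicit override."
        )


def test_giant_component_fails_with_actionable_message():
    # 90 rows share one output body; only 10 rows are unique.
    rows = [{"output": "same"} for _ in range(90)]
    rows += [{"output": f"unique-{i}"} for i in range(10)]
    try:
        assert_ratios_feasible(rows, train_ratio=0.85)
    except ValueError as exc:
        assert "largest duplicate-output component" in str(exc)
        assert "Cap duplicate outputs" in str(exc)
    else:
        raise AssertionError("expected the feasibility check to fail")


def test_unique_outputs_pass():
    rows = [{"output": f"unique-{i}"} for i in range(100)]
    assert_ratios_feasible(rows, train_ratio=0.85)  # should not raise


test_giant_component_fails_with_actionable_message()
test_unique_outputs_pass()
```

The feasibility condition relies on components being indivisible: if the rows left over after setting aside the largest component cannot cover the holdout target, no assignment can honor the requested ratios.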