Skip to content

[Data Quality] Gate splits on holdout row ratios and oversized leakage components #112

Description

@agorevski

Finding

split_dataset can produce leakage-free but highly degenerate train/validation/test splits without failing. On the current generated source dataset, exact-output grouping creates very large connected components, so the default split ratios are not honored and validation becomes too small to be representative.

Evidence

  • train.py:602-670 assigns leakage-connected components to splits, but only has a best-effort non-empty holdout repair.
  • train.py:918-927 fails only on leakage validation or optional coverage (--min-holdout-stratum-count); it does not enforce minimum train/val/test row ratios, maximum component size, or minimum holdout rows by default.
  • Read-only audit command:
    python - <<'PY'
    import json, train
    rows=[json.loads(l) for l in open('data/hf_training_dataset.jsonl') if l.strip()]
    tr, va, te = train._grouped_split(rows, 0.85, 0.10, seed=42)
    print(len(tr), len(va), len(te))
    print(train.validate_split_leakage({'train': tr, 'val': va, 'test': te}, sample_limit=3)['status'])
    PY
    Result: 82184 76 820 with leakage status passed (source has 83,080 rows; intended counts are approximately 70,618 / 8,308 / 4,154).

Impact

A run can silently train with almost no validation coverage and a much smaller test set than intended. Model selection, early stopping, and reported metrics become unstable even though the leakage validator reports success.

Recommended fix

Add split-quality gates after component assignment. Fail (or require an explicit override) when a split misses configurable minimum row counts/ratios, when a leakage component is too large to satisfy the requested ratios, or when duplicate output bodies must be capped before splitting.

Acceptance criteria

  • Default split generation fails on the current degenerate 82,184 / 76 / 820 outcome unless an explicit override is supplied.
  • The split manifest records largest component sizes and target-vs-actual row deltas.
  • Tests cover a dataset where exact-output duplicates form a giant component and assert the splitter fails with an actionable message.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions