Evidence
- train.py:660 exposes only an explicit --resume checkpoint path.
- src/model_setup.py:974-977 resumes only when that explicit path is provided; otherwise it calls trainer.train() from scratch.
- Checkpoints are configured under src/model_setup.py:805, src/model_setup.py:817, src/model_setup.py:849, and limited by src/model_setup.py:830, but there is no latest-checkpoint discovery.
- src/model_setup.py:983-990 saves final model_config.json and training_metrics.json, but not the full CLI args, dataset hashes/counts, split seed, git SHA/dirty state, dependency versions, hardware, or DeepSpeed config snapshot.
Impact
Interrupted runs can accidentally restart from scratch, wasting GPU time and money. Final artifacts are also hard to reproduce or audit because the saved config omits important runtime and data provenance.
Recommended fix
Support --resume auto (or safe default auto-discovery) using the newest valid checkpoint in the output directory. Write a run manifest at startup and update it at completion with CLI args, effective training arguments, seeds, dataset paths/counts/hashes, model revision, git SHA/dirty flag, package versions, hardware, precision mode, and DeepSpeed config content.
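A minimal sketch of the discovery step, not the project's actual code. It assumes checkpoints are written as checkpoint-<step> directories containing a trainer_state.json file (the Hugging Face Trainer convention; this finding does not confirm the layout). The names find_latest_checkpoint and resolve_resume are hypothetical. If the trainer is Hugging Face's, transformers.trainer_utils.get_last_checkpoint covers the discovery part, though as far as I know it does not check that the checkpoint is complete.

```python
import logging
import re
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed naming scheme: checkpoint-<global step>
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str, marker: str = "trainer_state.json") -> Optional[str]:
    """Return the highest-step checkpoint directory that contains `marker`, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Skip directories without the marker file: they may be partially written.
        if match and child.is_dir() and (child / marker).is_file():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare steps numerically so checkpoint-100 beats checkpoint-20.
    return str(max(candidates, key=lambda item: item[0])[1])


def resolve_resume(resume: Optional[str], output_dir: str) -> Optional[str]:
    """Map the --resume value to a checkpoint path (None means train from scratch)."""
    if resume is None:
        logger.info("No --resume given; training from scratch.")
        return None
    if resume != "auto":
        return resume  # explicit path: keep the existing behavior
    latest = find_latest_checkpoint(output_dir)
    if latest is None:
        logger.warning("--resume auto: no valid checkpoint in %s; training from scratch.", output_dir)
    else:
        logger.info("--resume auto: resuming from %s", latest)
    return latest
```

The resolved value could then be passed to the existing resume branch at src/model_setup.py:974-977. Logging both outcomes keeps the no-checkpoint case explicit rather than silent.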
Acceptance criteria
- Restarting a run with an existing checkpoint resumes from the latest checkpoint when requested.
- If no checkpoint exists, the behavior is explicit and logged.
- Every run directory contains a machine-readable manifest sufficient to reproduce the run and audit the exact data/config used.
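One way the manifest criterion could be met is sketched below, using only the standard library. The file name run_manifest.json, the function names, and the field layout are all assumptions for illustration. GPU details, the model revision, and the DeepSpeed config content are omitted here and would need to be added from the training code itself.

```python
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path
from typing import Any, Dict, Iterable, Optional


def _git(*args: str) -> Optional[str]:
    """Run a git command; return stripped stdout, or None if git or the repo is unavailable."""
    try:
        out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Record installed versions; None marks a package that is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(cli_args: Dict[str, Any], dataset_paths: Iterable[str], seed: int,
                   packages: Iterable[str] = ("torch", "transformers", "deepspeed")) -> Dict[str, Any]:
    status = _git("status", "--porcelain")
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "status": "started",  # to be updated to "completed" at the end of the run
        "argv": list(sys.argv),
        "cli_args": cli_args,
        "seed": seed,
        "datasets": {p: {"sha256": sha256_file(p), "bytes": Path(p).stat().st_size}
                     for p in dataset_paths},
        # Non-empty porcelain output means uncommitted changes.
        "git": {"sha": _git("rev-parse", "HEAD"),
                "dirty": None if status is None else bool(status)},
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": package_versions(packages),
    }


def write_manifest(run_dir: str, manifest: Dict[str, Any]) -> Path:
    """Write via a temp file and rename so a crash cannot leave a truncated manifest."""
    path = Path(run_dir) / "run_manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(manifest, indent=2, sort_keys=True, default=str))
    tmp.replace(path)
    return path
```

Calling write_manifest once at startup and again at completion (with status and final metrics updated) would leave a manifest even for interrupted runs, which is when provenance is hardest to recover.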