[Training] Auto-resume checkpoints and persist run manifests #59

## Evidence
- `train.py:660` exposes only an explicit `--resume` checkpoint path.
- `src/model_setup.py:974-977` resumes only when that explicit path is provided; otherwise it calls `trainer.train()` from scratch.
- Checkpoints are configured under `src/model_setup.py:805`, `src/model_setup.py:817`, `src/model_setup.py:849`, and limited by `src/model_setup.py:830`, but there is no latest-checkpoint discovery.
- `src/model_setup.py:983-990` saves final `model_config.json` and `training_metrics.json`, but not the full CLI args, dataset hashes/counts, split seed, git SHA/dirty state, dependency versions, hardware, or DeepSpeed config snapshot.

## Impact
Interrupted runs can accidentally restart from scratch, wasting GPU time and money. Final artifacts are also hard to reproduce or audit because the saved config omits important runtime and data provenance.

## Recommended fix
Support `--resume auto` (or safe default auto-discovery) using the newest valid checkpoint in the output directory.

Write a run manifest at startup and update it at completion. The manifest should record:
- CLI args and effective training arguments
- seeds
- dataset paths, counts, and hashes
- model revision
- git SHA and dirty flag
- package versions
- hardware and precision mode
- DeepSpeed config content
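
A minimal sketch of the checkpoint discovery behind `--resume auto`, under stated assumptions rather than a description of the current code. It assumes checkpoints are written as `checkpoint-<step>` directories and treats a checkpoint as valid only if it contains `trainer_state.json`; both are the Hugging Face `Trainer` conventions and may not match what `src/model_setup.py` produces. The function names `find_latest_checkpoint` and `resolve_resume` are hypothetical.

```python
import logging
import re
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed naming convention: checkpoint-<global_step>
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(
    output_dir: str, marker: str = "trainer_state.json"
) -> Optional[str]:
    """Return the highest-step checkpoint dir that contains `marker`, else None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Skip directories without the marker file: they are probably
        # checkpoints that were interrupted mid-write.
        if match and child.is_dir() and (child / marker).is_file():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare steps numerically so checkpoint-100 beats checkpoint-50.
    return str(max(candidates, key=lambda item: item[0])[1])


def resolve_resume(resume_arg: Optional[str], output_dir: str) -> Optional[str]:
    """Map the --resume value to a checkpoint path, or None for a fresh run."""
    if resume_arg is None:
        return None
    if resume_arg != "auto":
        return resume_arg  # explicit path: keep the existing behavior
    latest = find_latest_checkpoint(output_dir)
    if latest is None:
        logger.warning(
            "--resume auto: no valid checkpoint in %s; starting from scratch",
            output_dir,
        )
    else:
        logger.info("--resume auto: resuming from %s", latest)
    return latest
```

If `trainer` is a Hugging Face `Trainer`, the result could be passed as `trainer.train(resume_from_checkpoint=resolve_resume(args.resume, args.output_dir))`. In that case `transformers.trainer_utils.get_last_checkpoint` is an existing alternative for the discovery step, although it does not check for a marker file.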

## Acceptance criteria
- Restarting a run with an existing checkpoint resumes from the latest checkpoint when requested.
- If no checkpoint exists, the run logs that fact and states explicitly whether it is starting from scratch.
- Every run directory contains a machine-readable manifest sufficient to reproduce the run and audit the exact data/config used.
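
One possible shape for the machine-readable manifest, as a sketch only. The file name `run_manifest.json`, the field names, and the helper names are all hypothetical. GPU details, model revision, precision mode, and dataset record counts are omitted here because they depend on the framework and dataset format in use.

```python
import hashlib
import json
import platform
import subprocess
import sys
import time
from importlib import metadata
from pathlib import Path
from typing import Any, Dict, Iterable, Optional


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_state(repo_dir: str = ".") -> Dict[str, Any]:
    """Return the commit SHA and dirty flag; both are None if git is unavailable."""
    def run(*args: str) -> Optional[str]:
        try:
            result = subprocess.run(
                ["git", *args], cwd=repo_dir,
                capture_output=True, text=True, check=True,
            )
            return result.stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            return None

    sha = run("rev-parse", "HEAD")
    status = run("status", "--porcelain")
    return {"sha": sha, "dirty": None if status is None else bool(status)}


def package_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Look up installed versions; packages that are not installed map to None."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def write_manifest(
    run_dir: str,
    cli_args: Dict[str, Any],
    dataset_paths: Iterable[str],
    seed: int,
    status: str = "started",
    packages: Iterable[str] = ("torch", "transformers", "deepspeed"),
    deepspeed_config: Optional[Dict[str, Any]] = None,
) -> Path:
    """Write (or overwrite) the manifest; call at startup and again at completion."""
    manifest = {
        "status": status,
        "written_at_unix": time.time(),
        "cli_args": cli_args,
        "seed": seed,
        "datasets": [
            {"path": str(p), "bytes": Path(p).stat().st_size,
             "sha256": sha256_file(Path(p))}
            for p in dataset_paths
        ],
        "git": git_state(),
        "packages": package_versions(packages),
        "python": sys.version,
        "platform": platform.platform(),
        "deepspeed_config": deepspeed_config,
    }
    path = Path(run_dir) / "run_manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file and rename so a crash never leaves a partial manifest.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(manifest, indent=2, sort_keys=True, default=str))
    tmp.replace(path)
    return path
```

Calling `write_manifest(..., status="started")` before training and `write_manifest(..., status="completed")` afterwards means an interrupted run still leaves a manifest behind, and the `status` field shows that it did not finish.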

