Evidence
train.py:703-706 exposes --enable-memory-monitoring only for memory logging.
src/model_setup.py:564-590 logs CPU/GPU memory at Trainer log events.
src/model_setup.py:823 sets report_to="none", and there is no torch.profiler or callback that records step time, dataloader wait, tokens/sec, examples/sec, or optimizer/backward timing.
Impact
When training is slow or expensive, developers cannot tell whether the bottleneck is tokenization, DataLoader workers, host-to-device transfer, attention kernels, optimizer steps, or evaluation. This makes performance regressions hard to catch and prolongs expensive GPU experiments.
Recommended fix
Add opt-in profiling flags such as --profile-steps and --profile-output, plus a lightweight timing callback. Log per-step wall time, examples/sec, approximate tokens/sec, learning rate, loss, and GPU memory, and optionally capture torch.profiler traces for a bounded number of steps.
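A minimal sketch of what the flags and timing recorder could look like, using only the standard library. The names add_profiling_args and StepTimer, and the step_begin/step_end methods, are hypothetical and do not exist in the repository. In a real integration the timer would presumably be driven from a Trainer callback's step hooks, and the flag defaults would need to match train.py's existing argument style.

```python
import argparse
import json
import time
from pathlib import Path


def add_profiling_args(parser: argparse.ArgumentParser) -> None:
    # Hypothetical flags from the recommendation; 0 steps keeps profiling off.
    parser.add_argument("--profile-steps", type=int, default=0)
    parser.add_argument("--profile-output", type=str, default=None)


class StepTimer:
    """Records wall time and throughput for a bounded number of steps."""

    def __init__(self, max_steps: int, clock=time.perf_counter):
        self.max_steps = max_steps
        self.clock = clock  # injectable so the timer can be tested
        self.records = []
        self._start = None

    @property
    def enabled(self) -> bool:
        # Stops recording once the step budget is used up.
        return len(self.records) < self.max_steps

    def step_begin(self) -> None:
        if self.enabled:
            self._start = self.clock()

    def step_end(self, step: int, examples: int, tokens: int) -> None:
        if self._start is None:
            return  # profiling disabled or budget exhausted
        elapsed = self.clock() - self._start
        self._start = None
        self.records.append({
            "step": step,
            "wall_time_s": elapsed,
            "examples_per_s": examples / elapsed if elapsed > 0 else None,
            "tokens_per_s": tokens / elapsed if elapsed > 0 else None,
        })

    def write_summary(self, path: str) -> None:
        # Writes the per-step records as JSON, creating parent directories.
        out = Path(path)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(self.records, indent=2))
```

With max_steps=0 the timer never reads the clock, so the disabled path costs one attribute check per step. The tokens count is an approximation the caller supplies, for example batch size times sequence length.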
Acceptance criteria
- Profiling is off by default and has low overhead when disabled.
- Enabling profiling writes a trace or JSON/CSV summary under the run output directory.
- Logs include enough timing breakdown to distinguish DataLoader/tokenization stalls from GPU compute bottlenecks.
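One way to get the DataLoader-versus-compute breakdown in the last criterion is to time how long the training loop blocks waiting for each batch. The sketch below wraps any batch iterable; WaitTimedIterator and data_wait_fraction are hypothetical names, and how the wrapper would be attached to the Trainer's dataloader is left open.

```python
import time
from typing import Iterable, Iterator


class WaitTimedIterator:
    """Wraps a batch iterable and accumulates time spent waiting for batches."""

    def __init__(self, batches: Iterable, clock=time.perf_counter):
        self._batches = batches
        self._clock = clock  # injectable so the wrapper can be tested
        self.wait_s = 0.0
        self.batches_seen = 0

    def __iter__(self) -> Iterator:
        it = iter(self._batches)
        while True:
            start = self._clock()
            try:
                batch = next(it)  # blocks while workers/tokenization catch up
            except StopIteration:
                return
            self.wait_s += self._clock() - start
            self.batches_seen += 1
            yield batch


def data_wait_fraction(wait_s: float, total_s: float) -> float:
    # Share of wall time spent waiting on data; 0.0 if nothing was timed.
    return wait_s / total_s if total_s > 0 else 0.0
```

A high wait fraction points at tokenization or DataLoader workers, while a low one points at compute. Because CUDA kernels launch asynchronously, compute-side timings are only meaningful if the device is synchronized (for example with torch.cuda.synchronize()) before the clock is read, which itself adds overhead and should stay behind the opt-in flag.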