[Training] Add optional throughput and profiler hooks #61

Description

@agorevski

Evidence

  • train.py:703-706 exposes --enable-memory-monitoring only for memory logging.
  • src/model_setup.py:564-590 logs CPU/GPU memory at Trainer log events.
  • src/model_setup.py:823 sets report_to="none", and there is no torch.profiler or callback that records step time, dataloader wait, tokens/sec, examples/sec, or optimizer/backward timing.

Impact

When training is slow or expensive, developers cannot tell whether the bottleneck is tokenization, DataLoader workers, host-to-device transfer, attention kernels, optimizer steps, or evaluation. This makes performance regressions hard to catch and prolongs expensive GPU experiments.

Recommended fix

Add opt-in profiling flags such as --profile-steps and --profile-output, plus a lightweight timing callback. Log per-step wall time, examples/sec, approximate tokens/sec, learning rate, loss, and GPU memory, and optionally capture torch.profiler traces for a bounded number of steps.
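A minimal sketch of what the timing part could look like, using only the standard library. The `StepTimer` class and its fields are hypothetical names, not existing code in this repo. The method names mirror the `on_step_begin` / `on_step_end` hooks of a Hugging Face `TrainerCallback`, so the same logic could presumably be moved into a real callback subclass. Tokens/sec is only approximate here because it assumes a fixed token count per example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepTimer:
    """Hypothetical helper: records wall time and throughput for each step."""

    examples_per_step: int
    tokens_per_example: int  # assumption: fixed length, e.g. the max sequence length
    clock: Callable[[], float] = time.perf_counter  # injectable so it can be tested
    records: List[dict] = field(default_factory=list)
    _start: Optional[float] = None

    def on_step_begin(self) -> None:
        # Remember when the step started.
        self._start = self.clock()

    def on_step_end(self, step: int) -> None:
        if self._start is None:
            return  # on_step_begin was never called; nothing to measure
        elapsed = self.clock() - self._start
        self._start = None
        examples_per_sec = self.examples_per_step / elapsed if elapsed > 0 else 0.0
        self.records.append(
            {
                "step": step,
                "step_seconds": elapsed,
                "examples_per_sec": examples_per_sec,
                "approx_tokens_per_sec": examples_per_sec * self.tokens_per_example,
            }
        )
```

One caveat on the design: CUDA kernels run asynchronously, so wall time measured on the host can be attributed to the wrong step unless the device is synchronized first (for example with `torch.cuda.synchronize()`). Synchronizing adds overhead of its own, which is another reason to keep this opt-in.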

Acceptance criteria

  • Profiling is off by default and has low overhead when disabled.
  • Enabling profiling writes a trace or JSON/CSV summary under the run output directory.
  • Logs include enough timing breakdown to distinguish DataLoader/tokenization stalls from GPU compute bottlenecks.
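The criteria above could be met roughly as follows. This is a sketch under assumptions, not the repo's implementation: `add_profiling_args`, `TimedBatches`, `write_summary`, and the `profile_summary.json` filename are all hypothetical, and only the two flag names come from this issue. The wrapper measures how long the training loop waits for each batch, which is one simple way to separate DataLoader/tokenization stalls from everything else in the step.

```python
import argparse
import json
import time
from pathlib import Path
from typing import Iterable, Iterator, Optional


def add_profiling_args(parser: argparse.ArgumentParser) -> None:
    # Flag names follow the issue's suggestion; the defaults keep profiling off.
    parser.add_argument("--profile-steps", type=int, default=0,
                        help="Number of steps to profile; 0 disables profiling.")
    parser.add_argument("--profile-output", type=str, default=None,
                        help="Summary path; defaults to a file under the output dir.")


class TimedBatches:
    """Wraps any batch iterable and accumulates time spent waiting for batches."""

    def __init__(self, batches: Iterable, clock=time.perf_counter):
        self._batches = batches
        self._clock = clock
        self.wait_seconds = 0.0
        self.num_batches = 0

    def __iter__(self) -> Iterator:
        it = iter(self._batches)
        while True:
            start = self._clock()
            try:
                batch = next(it)
            except StopIteration:
                return
            # Only the time blocked inside next() counts as data wait.
            self.wait_seconds += self._clock() - start
            self.num_batches += 1
            yield batch


def write_summary(output_dir: str, total_seconds: float, timed: TimedBatches,
                  path: Optional[str] = None) -> Path:
    # Write a small JSON summary under the run output directory by default.
    out = Path(path) if path else Path(output_dir) / "profile_summary.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    summary = {
        "steps": timed.num_batches,
        "total_seconds": total_seconds,
        "data_wait_seconds": timed.wait_seconds,
        "non_data_seconds": max(total_seconds - timed.wait_seconds, 0.0),
        "data_wait_fraction": (timed.wait_seconds / total_seconds
                               if total_seconds > 0 else 0.0),
    }
    out.write_text(json.dumps(summary, indent=2))
    return out
```

A high `data_wait_fraction` would point at tokenization or DataLoader workers, while a low one suggests the time is going to forward/backward, the optimizer, or evaluation. Splitting that remainder further would need torch.profiler traces or per-phase timers.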
