[Training] Add optional throughput and profiler hooks #61

## Evidence
- `train.py:703-706` exposes `--enable-memory-monitoring` only for memory logging.
- `src/model_setup.py:564-590` logs CPU/GPU memory at Trainer log events.
- `src/model_setup.py:823` sets `report_to="none"`, and there is no `torch.profiler` or callback that records step time, dataloader wait, tokens/sec, examples/sec, or optimizer/backward timing.

## Impact
When training is slow or expensive, developers cannot tell whether the bottleneck is tokenization, DataLoader workers, host-to-device transfer, attention kernels, optimizer steps, or evaluation. This makes performance regressions hard to catch and prolongs expensive GPU experiments.

## Recommended fix
Add opt-in profiling flags such as `--profile-steps` / `--profile-output` and a lightweight timing callback. Log per-step wall time, examples/sec, approximate tokens/sec, learning rate, loss, and GPU memory, and optionally capture `torch.profiler` traces for a bounded number of steps.
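
One possible shape for the flags and the timing piece, as a minimal sketch rather than a proposed final design. `add_profiling_args` and `StepTimer` are hypothetical names that do not exist in the repo today, and the flag defaults are assumptions. The timer is framework-agnostic; if the training loop uses a Trainer with step-level callbacks, `step_begin` / `step_end` would be called from those hooks.

```python
import argparse
import json
import time
from pathlib import Path


def add_profiling_args(parser: argparse.ArgumentParser) -> None:
    """Register opt-in profiling flags (hypothetical; profiling is off when steps == 0)."""
    parser.add_argument("--profile-steps", type=int, default=0,
                        help="Number of training steps to time (0 disables profiling).")
    parser.add_argument("--profile-output", type=Path, default=None,
                        help="Path for the JSON summary, e.g. under the run output directory.")


class StepTimer:
    """Records wall time per step for at most `max_steps` steps, then goes idle."""

    def __init__(self, max_steps: int, clock=time.perf_counter):
        self.max_steps = max_steps
        self.clock = clock          # injectable so the timer can be tested deterministically
        self.records = []
        self._start = None

    @property
    def active(self) -> bool:
        return len(self.records) < self.max_steps

    def step_begin(self) -> None:
        # When disabled or past the step budget, this is a single comparison.
        if self.active:
            self._start = self.clock()

    def step_end(self, num_examples: int, num_tokens: int) -> None:
        if self._start is None:
            return
        elapsed = self.clock() - self._start
        self._start = None
        self.records.append(
            {"step_time_s": elapsed, "examples": num_examples, "tokens": num_tokens}
        )

    def summary(self) -> dict:
        total = sum(r["step_time_s"] for r in self.records)
        examples = sum(r["examples"] for r in self.records)
        tokens = sum(r["tokens"] for r in self.records)
        return {
            "steps": len(self.records),
            "total_time_s": total,
            "mean_step_time_s": total / len(self.records) if self.records else 0.0,
            "examples_per_sec": examples / total if total > 0 else 0.0,
            "tokens_per_sec": tokens / total if total > 0 else 0.0,
        }

    def write(self, path: Path) -> None:
        """Write the summary and per-step records as JSON."""
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"summary": self.summary(), "steps": self.records}, indent=2))
```

The token count passed to `step_end` is only approximate unless padding is excluded, which matches the "approximate tokens/sec" wording above. On GPU, step times measured on the host can under-report compute because CUDA kernels launch asynchronously; synchronizing before reading the clock during profiled steps is one way to tighten the numbers, at some cost to throughput.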

## Acceptance criteria
- Profiling is off by default and has low overhead when disabled.
- Enabling profiling writes a trace or JSON/CSV summary under the run output directory.
- Logs include enough timing breakdown to distinguish DataLoader/tokenization stalls from GPU compute bottlenecks.
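
To illustrate the last criterion, here is a rough sketch of how time spent waiting for the next batch could be separated from time spent in the loop body. `DataWaitTimer` is a hypothetical helper, not existing code, and it assumes the training loop iterates over batches directly; a Trainer-based loop would need the equivalent measurement inside a callback or a wrapped DataLoader.

```python
import time
from typing import Iterable, Iterator


class DataWaitTimer:
    """Wraps an iterable of batches and splits wall time into data wait vs. loop body."""

    def __init__(self, batches: Iterable, clock=time.perf_counter):
        self.batches = batches
        self.clock = clock          # injectable for deterministic tests
        self.data_wait_s = 0.0      # time blocked fetching the next batch
        self.compute_s = 0.0        # time the caller spent between batches

    def __iter__(self) -> Iterator:
        it = iter(self.batches)
        while True:
            t0 = self.clock()
            try:
                batch = next(it)
            except StopIteration:
                return
            t1 = self.clock()
            self.data_wait_s += t1 - t0
            yield batch
            # Execution resumes here when the caller requests the next batch,
            # so this gap covers the loop body (forward, backward, optimizer).
            self.compute_s += self.clock() - t1

    def data_wait_fraction(self) -> float:
        total = self.data_wait_s + self.compute_s
        return self.data_wait_s / total if total > 0 else 0.0
```

A high `data_wait_fraction()` points at tokenization or DataLoader workers; a low one points at compute. Because CUDA work is asynchronous, the loop-body figure is only trustworthy on GPU if the device is synchronized before the next batch is requested.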

