Evidence
src/model_setup.py:673-700 reads LOCAL_RANK, pins quantized models to that rank with device_map={"": local_rank}, but uses device_map="auto" for non-quantized CUDA models.
train.py:836-838 disables quantization for --tiny, and future full-precision/DeepSpeed experiments may also run non-quantized models.
src/model_setup.py:625-627 accepts use_deepspeed, but the current loader does not use it to adjust device placement.
Impact
Under torchrun/DDP, each non-quantized process can try to spread the model across all visible GPUs instead of staying on its local rank. That can cause duplicated allocations, cross-rank contention, hangs, or out-of-memory failures, making smoke tests and future full-precision runs unreliable.
Recommended fix
When LOCAL_RANK is set, place non-quantized training models on the local rank instead of device_map="auto". For DeepSpeed, avoid incompatible device_map settings and let DeepSpeed/Accelerate own placement. Add a small placement test that monkeypatches CUDA/rank state.
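One way the placement decision could be isolated is sketched below. The helper name `resolve_device_map` and its parameters are hypothetical and do not exist in src/model_setup.py. The fallback to device 0 for quantized models when LOCAL_RANK is unset is also an assumption, since the current loader's default is not shown here. The returned value is what the loader would pass as `device_map`, with `None` meaning the argument is omitted.

```python
import os
from typing import Mapping, Optional, Union

DeviceMap = Union[str, dict, None]


def resolve_device_map(
    quantized: bool,
    use_deepspeed: bool,
    cuda_available: bool,
    env: Optional[Mapping[str, str]] = None,
) -> DeviceMap:
    """Hypothetical helper: choose the device_map for model loading.

    Returns None when no device_map should be passed at all.
    """
    env = os.environ if env is None else env

    # DeepSpeed (via Accelerate) owns placement, so pass no device_map.
    if use_deepspeed:
        return None

    # CPU-only runs need no device_map.
    if not cuda_available:
        return None

    # Under torchrun/DDP, pin every model (quantized or not) to this
    # process's local rank so ranks do not spread across all GPUs.
    local_rank = env.get("LOCAL_RANK")
    if local_rank is not None:
        return {"": int(local_rank)}

    # Single-process runs. Assumption: quantized models fall back to
    # device 0, while non-quantized models may still use "auto".
    if quantized:
        return {"": 0}
    return "auto"
```

Taking `env` and `cuda_available` as arguments keeps the decision a pure function, so a placement test can pass a plain dict and a boolean rather than patching `os.environ` or `torch.cuda.is_available`.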
Acceptance criteria
- torchrun with --tiny or another non-quantized model assigns one rank to one local device.
- DeepSpeed launches do not pass conflicting device_map values.
- Tests cover quantized and non-quantized placement decisions.