[Training] Fix DDP device placement for non-quantized models #60

Description

@agorevski

Evidence

  • src/model_setup.py:673-700 reads LOCAL_RANK, pins quantized models to that rank with device_map={"": local_rank}, but uses device_map="auto" for non-quantized CUDA models.
  • train.py:836-838 disables quantization for --tiny, and future full-precision/DeepSpeed experiments may also run non-quantized models.
  • src/model_setup.py:625-627 accepts use_deepspeed, but the current loader does not use it to adjust device placement.

Impact

Under torchrun/DDP, each process that loads a non-quantized model can try to spread it across all visible GPUs instead of staying on the device for its local rank. That can cause duplicated allocations, cross-rank contention, hangs, or out-of-memory failures, making smoke tests and future full-precision runs unreliable.

Recommended fix

When LOCAL_RANK is set, place non-quantized training models on the local rank's device instead of passing device_map="auto". For DeepSpeed, avoid incompatible device_map settings and let DeepSpeed/Accelerate own placement. Add a small placement test that monkeypatches CUDA/rank state.
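
A minimal sketch of the placement decision described above, not the actual code in src/model_setup.py. The function name resolve_device_map and its parameters are hypothetical. It also assumes that a missing LOCAL_RANK defaults to rank 0 for quantized models, and that "let DeepSpeed/Accelerate own placement" means passing no device_map at all.

```python
import os
from typing import Any, Mapping, Optional


def resolve_device_map(
    *,
    quantized: bool,
    use_deepspeed: bool,
    cuda_available: bool,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Any]:
    """Hypothetical helper: choose the device_map value for one process.

    Returns None when no device_map should be passed to the loader.
    """
    env = os.environ if env is None else env

    # No CUDA: nothing to pin, so leave placement to the caller.
    if not cuda_available:
        return None

    # Assumption: under DeepSpeed, omit device_map so that
    # DeepSpeed/Accelerate decide where the weights go.
    if use_deepspeed:
        return None

    local_rank = env.get("LOCAL_RANK")

    if quantized:
        # Existing behavior per the issue: pin to the local rank.
        # Defaulting to 0 without LOCAL_RANK is an assumption.
        return {"": int(local_rank) if local_rank is not None else 0}

    if local_rank is not None:
        # Proposed fix: under torchrun/DDP, pin non-quantized models too.
        return {"": int(local_rank)}

    # Single-process run: keep the current "auto" behavior.
    return "auto"
```

Taking the environment as a parameter keeps the decision a pure function, so it can be checked without GPUs or a real torchrun launch.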

Acceptance criteria

  • torchrun with --tiny or another non-quantized model places each rank's model on exactly one local device.
  • DeepSpeed launches do not pass conflicting device_map values.
  • Tests cover quantized and non-quantized placement decisions.
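
One way the placement tests could be laid out is as a small case table that patches LOCAL_RANK for each case. This is a sketch only: pick_device_map is a hypothetical stand-in for the real loader logic, CUDA availability is passed in as a plain argument rather than patched, and the expected values encode the recommended behavior rather than anything the repository guarantees today.

```python
import os
from unittest import mock


def pick_device_map(quantized, use_deepspeed, cuda_available):
    """Hypothetical stand-in for the loader's placement decision."""
    if use_deepspeed or not cuda_available:
        return None  # assumption: omit device_map entirely
    rank = os.environ.get("LOCAL_RANK")
    if quantized:
        return {"": int(rank or 0)}  # assumption: default to rank 0
    return {"": int(rank)} if rank is not None else "auto"


# (quantized, use_deepspeed, LOCAL_RANK value or None, expected device_map)
CASES = [
    (True, False, "1", {"": 1}),    # quantized under torchrun
    (False, False, "1", {"": 1}),   # non-quantized under torchrun
    (False, False, None, "auto"),   # non-quantized, single process
    (False, True, "1", None),       # DeepSpeed owns placement
]


def check_placement_cases():
    """Run every case with a patched environment; return the case count."""
    for quantized, use_deepspeed, local_rank, expected in CASES:
        env = {} if local_rank is None else {"LOCAL_RANK": local_rank}
        # clear=True hides any real LOCAL_RANK; the original
        # environment is restored when the context exits.
        with mock.patch.dict(os.environ, env, clear=True):
            actual = pick_device_map(quantized, use_deepspeed, cuda_available=True)
        assert actual == expected, (quantized, use_deepspeed, local_rank, actual)
    return len(CASES)
```

In a pytest suite the same table could feed pytest.mark.parametrize, with the monkeypatch fixture setting LOCAL_RANK and replacing torch.cuda.is_available, so the tests run on machines without GPUs.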
