[Training] Fix DDP device placement for non-quantized models #60

## Evidence
- `src/model_setup.py:673-700` reads `LOCAL_RANK`, pins quantized models to that rank with `device_map={"": local_rank}`, but uses `device_map="auto"` for non-quantized CUDA models.
- `train.py:836-838` disables quantization for `--tiny`, and future full-precision/DeepSpeed experiments may also run non-quantized models.
- `src/model_setup.py:625-627` accepts `use_deepspeed`, but the current loader does not use it to adjust device placement.

## Impact
Under `torchrun`/DDP, each non-quantized process can try to spread the model across all visible GPUs instead of staying on its local rank. That can cause duplicated allocations, cross-rank contention, hangs, or out-of-memory failures, making smoke tests and future full-precision runs unreliable.

## Recommended fix
When `LOCAL_RANK` is set, place non-quantized training models on the local rank instead of `device_map="auto"`. For DeepSpeed, avoid incompatible `device_map` settings and let DeepSpeed/Accelerate own placement. Add a small placement test that monkeypatches CUDA/rank state.
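
The placement rule above could be isolated in one small function so it is easy to test. The following is a minimal sketch, not the existing loader code: `resolve_device_map` and its parameters are hypothetical names, and the choices for DeepSpeed (`None`, meaning no `device_map` is passed) and for single-process runs are assumptions.

```python
import os
from typing import Mapping, Optional, Union

DeviceMap = Union[str, dict, None]


def resolve_device_map(
    *,
    quantized: bool,
    use_deepspeed: bool,
    cuda_available: bool,
    env: Optional[Mapping[str, str]] = None,
) -> DeviceMap:
    """Hypothetical helper: choose the `device_map` value for model loading."""
    env = os.environ if env is None else env

    # Assumption: DeepSpeed/Accelerate own placement, so no device_map is
    # passed at all. CPU-only runs also get no device_map.
    if use_deepspeed or not cuda_available:
        return None

    local_rank = env.get("LOCAL_RANK")
    if local_rank is not None:
        # Launched under torchrun/DDP: pin the whole model to this rank's GPU,
        # for quantized and non-quantized models alike.
        return {"": int(local_rank)}

    # Single-process run. Assumption: quantized models go to GPU 0, and
    # non-quantized models keep the current "auto" behavior.
    return {"": 0} if quantized else "auto"
```

With this shape, the loader would only forward `device_map` to the model-loading call when the result is not `None`, which keeps the DeepSpeed path free of conflicting placement arguments.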

## Acceptance criteria
- Under `torchrun`, a `--tiny` or other non-quantized model is loaded by each rank onto that rank's own local device only.
- DeepSpeed launches do not pass conflicting `device_map` values.
- Tests cover quantized and non-quantized placement decisions.
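
A placement test for the last criterion might look like the sketch below. It patches `LOCAL_RANK` with the standard library's `unittest.mock.patch.dict` (pytest's `monkeypatch.setenv` would serve the same purpose) and takes CUDA availability as a plain argument so no GPU is needed. `pick_device_map` is a hypothetical stand-in for the loader's decision, and the expected values in the case table are assumptions about the intended behavior.

```python
import os
from unittest import mock


def pick_device_map(quantized, cuda_available):
    """Hypothetical stand-in for the loader's placement decision."""
    if not cuda_available:
        return None
    rank = os.environ.get("LOCAL_RANK")
    if rank is not None:
        return {"": int(rank)}
    return {"": 0} if quantized else "auto"


# (LOCAL_RANK value or None, quantized, expected device_map)
CASES = [
    ("0", True, {"": 0}),
    ("1", True, {"": 1}),
    ("1", False, {"": 1}),  # the case this issue is about
    (None, False, "auto"),  # single-process run keeps "auto"
]


def run_placement_cases():
    """Return one boolean per case: True when the decision matches."""
    results = []
    for rank, quantized, expected in CASES:
        # patch.dict restores os.environ when the block exits.
        with mock.patch.dict(os.environ, {}, clear=False):
            os.environ.pop("LOCAL_RANK", None)
            if rank is not None:
                os.environ["LOCAL_RANK"] = rank
            got = pick_device_map(quantized, cuda_available=True)
        results.append(got == expected)
    return results
```

In a real test module each row would more naturally become a parametrized test case that calls the actual loader helper instead of the stand-in.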

