System Info
In-training simulation eval (eval_freq > 0 with a sim env) is not supported under parameter sharding (DeepSpeed ZeRO-3 or FSDP): each rank rolls out independently, so the sharded-param all-gathers in the eval forward desync across ranks and hang at NCCL. Use ZeRO-1/2 (params replicated), or run eval out-of-process on saved checkpoints.
File "/fss/bot/OpenTau_main/src/opentau/scripts/train.py", line 1494, in <module>
Information
Reproduction
Simply running cosmos3 on multiple nodes with ZeRO-3.
Expected behavior
In-training evals should be able to run with ZeRO-3 sharding enabled.
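To illustrate the desync described in the message above, here is a minimal, single-process sketch (not OpenTau code). It only counts how many eval forward passes each rank would issue. All names (`rollout_independent`, `rollout_lockstep`, `collectives_match`) are hypothetical, episode lengths are assumed to be at least 1, and the `any(...)` call stands in for a real collective such as `torch.distributed.all_reduce` on a "still running" flag. Under parameter sharding every forward triggers all-gathers, so the per-rank counts have to match or the job hangs.

```python
from typing import List


def rollout_independent(episode_lengths: List[int]) -> List[int]:
    """Each rank stops calling forward as soon as its own episode ends."""
    return list(episode_lengths)


def rollout_lockstep(episode_lengths: List[int]) -> List[int]:
    """Every rank keeps calling forward until the slowest rank is done.

    Ranks whose episode has already ended would feed a dummy observation,
    purely so that they still take part in the sharded-param all-gathers.
    """
    n = len(episode_lengths)
    calls = [0] * n
    done = [False] * n
    step = 0
    while True:
        # Stand-in for an all-reduce over a per-rank "still running" flag.
        any_running = any(not d for d in done)
        if not any_running:
            break
        for rank in range(n):
            calls[rank] += 1  # forward pass (real or dummy) on every rank
            if not done[rank] and step + 1 >= episode_lengths[rank]:
                done[rank] = True
        step += 1
    return calls


def collectives_match(calls: List[int]) -> bool:
    """True when all ranks issue the same number of forward passes."""
    return len(set(calls)) == 1


if __name__ == "__main__":
    lengths = [3, 5, 2]  # hypothetical per-rank episode lengths
    print(collectives_match(rollout_independent(lengths)))
    print(collectives_match(rollout_lockstep(lengths)))
```

If something like the lockstep variant were feasible in the eval loop, it would cost extra dummy forwards on the faster ranks but keep the collectives aligned. Whether that fits the current eval code is an open question.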