A 10-minute NCCL timeout tells us that there are connectivity issues between the nodes. I would first check whether the nodes can reach each other's IPs with ping. If that's OK, then try running a small script like this one pytorch/pytorch#14536 (comment) to check whether NCCL communication is working at all.
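For reference, a minimal connectivity check along these lines could look like the sketch below (not the exact script from that comment; the rendezvous endpoint in the usage line is a placeholder):

```python
# nccl_check.py -- minimal cross-node NCCL sanity check.
# Launch with torchrun on each node, e.g.:
#   torchrun --nnodes 2 --nproc_per_node 1 \
#     --rdzv_backend c10d --rdzv_endpoint <master_ip>:29500 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its own rank; after all_reduce every rank should
    # see the sum 0 + 1 + ... + (world_size - 1). If this hangs, NCCL traffic
    # between the nodes is not getting through.
    t = torch.tensor([dist.get_rank()], dtype=torch.float32, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```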
Thanks Oleh. This is an ongoing issue on the superpod. We are seeing it even on jobs like the Llama-3.1-405B endpoint, where connectivity times out even though a keep-alive message is sent regularly. This was very uncommon a few weeks ago.
Ah, I see. Thanks! I've mitigated similar issues in the past by raising the NCCL timeout from 10 to 90 minutes. That helps when the connectivity drops are short-lived, at the cost of potentially longer training time.
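If it helps, one way to set that timeout in an Accelerate-based script (a sketch, assuming the training script constructs its own `Accelerator`; the variable names here are illustrative) is via `InitProcessGroupKwargs`:

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Pass a 90-minute timeout to torch.distributed.init_process_group instead of
# the 10-minute default mentioned above, so short connectivity hiccups no
# longer kill the run outright.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(minutes=90))
accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```

Note that this only delays the failure; if a node is genuinely unreachable, the job will still time out, just later.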
Accelerate fails when launched on multi-GPU due to NCCL timeout.
```
accelerate launch --multi_gpu --num_processes 2 --mixed_precision=bf16 --config_file conf/accelerate/accelerate_base.yaml examples/rl_gsm8k/run_finetune.py --config-dir /home/toolkit/TapeAgents/outputs/simple_rl_reinforce_fork_1024_attempts_16_algo_reinforce_checkpoint_1/conf --config-name 0 finetune.train_batch_size=4 finetune.gradient_accumulation_passes=256
```