
NCCL timeout on multi-gpu #104

Open
AlexPiche opened this issue Nov 19, 2024 · 3 comments
@AlexPiche
Collaborator

Accelerate fails when launched on multi-GPU due to an NCCL timeout.

accelerate launch --multi_gpu --num_processes 2 --mixed_precision=bf16 --config_file conf/accelerate/accelerate_base.yaml examples/rl_gsm8k/run_finetune.py --config-dir /home/toolkit/TapeAgents/outputs/simple_rl_reinforce_fork_1024_attempts_16_algo_reinforce_checkpoint_1/conf --config-name 0 finetune.train_batch_size=4 finetune.gradient_accumulation_passes=256

11/19/2024 15:33:12 - INFO - tapeagents.finetune.context - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

11/19/2024 15:33:12 - INFO - tapeagents.finetune.context - Saving experiment to outputs/rl_gsm8k/finetune
11/19/2024 15:33:16 - INFO - tapeagents.finetune.context - Initializing model <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> from meta-llama/Meta-Llama-3.1-8B-Instruct
11/19/2024 15:33:16 - INFO - tapeagents.finetune.context - Loading args: {'use_safetensors': True, 'trust_remote_code': False, 'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.30it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.16it/s]
11/19/2024 15:33:17 - INFO - tapeagents.finetune.context - Instantiated preprocess function hash b6590c28080ef54a
11/19/2024 15:33:17 - INFO - tapeagents.finetune.context - Instantiated collate_fn hash c30614bd72522b74
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 49535.98it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 61837.24it/s]
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 102300.10it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 73163.11it/s]
11/19/2024 15:33:18 - INFO - tapeagents.finetune.context - Raw data part size: 1023
11/19/2024 15:33:18 - INFO - tapeagents.finetune.context - Raw data part fingerprint: 818145dad50bfb24
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1023/1023 [00:01<00:00, 1009.62 examples/s]
11/19/2024 15:33:19 - INFO - tapeagents.finetune.context - Preprocessed data part fingerprint: 39a839e90eb91a59
11/19/2024 15:33:19 - INFO - tapeagents.finetune.context - Merged data size: 1023
11/19/2024 15:33:19 - INFO - tapeagents.finetune.context - Merged data fingerprint: 693d489ba7139c2e
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1023/1023 [00:01<00:00, 1010.37 examples/s]
11/19/2024 15:33:19 - INFO - tapeagents.finetune.rl - Populate RL Data
11/19/2024 15:33:19 - INFO - tapeagents.finetune.rl - Populate RL Data
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1023/1023 [00:00<00:00, 9716.67 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1023/1023 [00:00<00:00, 9954.33 examples/s]
11/19/2024 15:33:30 - INFO - tapeagents.finetune.rl - Finish Populate RL Data
11/19/2024 15:33:30 - INFO - tapeagents.finetune.rl - Finish Populate RL Data
11/19/2024 15:33:36 - INFO - tapeagents.finetune.context - Completed steps 0: {'dataset_stats/num_sequences': '512.000', 'dataset_stats/max_seq_length': '922.000', 'dataset_stats/min_seq_length': '220.000', 'dataset_stats/avg_seq_length': '362.266'}
11/19/2024 15:33:36 - ERROR - tapeagents.finetune.context - Failed to log metrics to wandb with error: You must call wandb.init() before wandb.log()
11/19/2024 15:33:36 - INFO - tapeagents.finetune.context - Start training
/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
[rank1]:[E1119 15:43:35.814661330 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=58, OpType=BROADCAST, NumelIn=2112, NumelOut=2112, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[rank1]:[E1119 15:43:35.854969250 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 58, last enqueued NCCL work: 58, last completed NCCL work: 57.
[rank1]:[E1119 15:43:36.155998294 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 58, last enqueued NCCL work: 58, last completed NCCL work: 57.
[rank1]:[E1119 15:43:36.156021224 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1119 15:43:36.156025666 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1119 15:43:36.162898839 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=58, OpType=BROADCAST, NumelIn=2112, NumelOut=2112, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fffd7377f86 in /home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fff893708d2 in /home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fff89377313 in /home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fff893796fc in /home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7fffd6ac7bf4 in /home/toolkit/.conda/envs/tapeagents/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7ffff7a6cac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7ffff7afea40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W1119 15:43:46.444000 140737352795968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1145 closing signal SIGTERM
E1119 15:43:47.110000 140737352795968 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 1146) of binary: /home/toolkit/.conda/envs/tapeagents/bin/python
Traceback (most recent call last):
  File "/home/toolkit/.conda/envs/tapeagents/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/toolkit/.conda/envs/tapeagents/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
examples/rl_gsm8k/run_finetune.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-19_15:43:46
  host      : 51fed3f4-972b-4869-9b45-fee2e6922786
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1146)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1146
=====================================================

@AlexPiche AlexPiche self-assigned this Nov 19, 2024
@rafapi rafapi self-assigned this Nov 20, 2024
@ollmer
Collaborator

ollmer commented Nov 25, 2024

A 10-minute NCCL timeout suggests there are connectivity issues between the nodes. I would first check whether the nodes can reach each other's IPs with ping. If that's fine, then try running a small script like the one in pytorch/pytorch#14536 (comment) to check whether NCCL communication works at all.
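For reference, a minimal NCCL sanity check along those lines (a sketch, not the exact script from that thread; it assumes two GPUs on a single node and a launch via torchrun):

```python
# Minimal NCCL sanity check (sketch). Run with:
#   torchrun --nproc_per_node=2 nccl_check.py
# Adjust --nproc_per_node to the number of GPUs being tested.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun provides MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after all_reduce the value
    # should equal the world size on every rank.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()} "
          f"(expected {dist.get_world_size()})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this hangs or times out as well, the problem is in the NCCL/network setup rather than in the training code.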

@rafapi
Collaborator

rafapi commented Nov 25, 2024

> A 10-minute NCCL timeout suggests there are connectivity issues between the nodes. I would first check whether the nodes can reach each other's IPs with ping. If that's fine, then try running a small script like the one in pytorch/pytorch#14536 (comment) to check whether NCCL communication works at all.

Thanks Oleh. This is an ongoing issue on the superpod. We are seeing it on other jobs as well, such as the Llama-3.1-405B endpoint, where connectivity times out even though a keep-alive message is sent regularly. This was very uncommon a few weeks ago.

@ollmer
Collaborator

ollmer commented Nov 25, 2024

Ah, I see, thanks! I've mitigated similar issues in the past by raising the NCCL timeout from 10 minutes to 90. That can help if the connectivity drops are relatively short, though overall training time may be longer.
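One way to apply that with Accelerate (a sketch; how the Accelerator is actually constructed in run_finetune.py may differ) is to pass an InitProcessGroupKwargs handler with a longer timeout:

```python
# Sketch: raise the default 10-minute NCCL collective timeout to 90 minutes
# when creating the Accelerator.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(minutes=90))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])
```

The same effect can be had with plain torch.distributed by passing timeout=timedelta(minutes=90) to init_process_group.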
