Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Open
youkaichao opened this issue Jul 1, 2024 · 10 comments
Labels
bug Something isn't working stale

Comments

@youkaichao
Copy link
Member

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

We received quite a lot report about "Watchdog caught collective operation timeout", which is flaky and difficult to reproduce. It happens after running for some time.

To analyze the error, we need to collect enough stack trace. If you encounter a similar problem, please paste enough stack trace for us to debug.

Example: https://buildkite.com/vllm/ci-aws/builds/3548#01906e81-54c6-4713-beb7-d08f3c873200 caught one such error.

Please include the first line of error, together with the Python stack trace.

In the following example, it seems one process has illegal memory access. It dies, but the rest process is still in allreduce, and is waiting for it, causing the timeout problem. From the python level stack trace, it happens when we profile the run, and it seems to be related with moe layer.

[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
--
  | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
  |  
  | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5c7e42b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd5c7f6a718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd57bc4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd57bc4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fd57bc545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd57bc5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #7: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #8: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #9: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | terminate called after throwing an instance of 'c10::DistBackendError'
  | what():  [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
  | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
  |  
  | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5c7e42b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd5c7f6a718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd57bc4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd57bc4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fd57bc545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd57bc5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #7: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #8: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #9: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: <unknown function> + 0xe32e33 (0x7fd57b8d7e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #3: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #4: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | ERROR 07-01 13:54:43 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1159 died, exit code: -6
  | INFO 07-01 13:54:43 multiproc_worker_utils.py:123] Killing local vLLM worker processes
  | [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | [rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
  | [rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | [rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
  | [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9e8b64f1b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9e8b653fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9e8b65531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #5: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #6: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | terminate called after throwing an instance of 'c10::DistBackendError'
  | what():  [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9e8b64f1b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9e8b653fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9e8b65531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #5: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #6: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: <unknown function> + 0xe32e33 (0x7f9e8b2d7e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #3: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #4: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Fatal Python error: Aborted
  |  
  | Thread 0x00007f96975fe640 (most recent call first):
  | File "/usr/lib/python3.10/threading.py", line 320 in wait
  | File "/usr/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
  | File "/usr/lib/python3.10/threading.py", line 953 in run
  | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
  |  
  | Thread 0x00007f96acbfd640 (most recent call first):
  | File "/usr/lib/python3.10/threading.py", line 324 in wait
  | File "/usr/lib/python3.10/threading.py", line 607 in wait
  | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
  |  
  | Thread 0x00007f9f8214e480 (most recent call first):
  | File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 357 in topk_softmax
  | File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34 in wrapper
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 357 in fused_topk
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 549 in fused_moe
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 273 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 426 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 470 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 540 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1135 in execute_model
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 814 in profile_run
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 175 in determine_num_available_blocks
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 122 in _run_workers
  | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 344 in _initialize_kv_caches
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 405 in from_engine_args
  | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 144 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/models/vllm_causallms.py", line 97 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/api/model.py", line 133 in create_from_arg_string
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/evaluator.py", line 164 in simple_evaluate
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/utils.py", line 288 in _wrapper
  | File "/vllm-workspace/.buildkite/lm-eval-harness/test_lm_eval_correctness.py", line 29 in launch_lm_eval
  | File "/vllm-workspace/.buildkite/lm-eval-harness/test_lm_eval_correctness.py", line 45 in test_lm_eval_correctness
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/python.py", line 162 in pytest_pyfunc_call
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/python.py", line 1632 in runtest
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 173 in pytest_runtest_call
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 241 in <lambda>
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 341 in from_call
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 240 in call_and_report
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 135 in runtestprotocol
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 116 in pytest_runtest_protocol
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 364 in pytest_runtestloop
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 339 in _main
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 285 in wrap_session
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 178 in main
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 206 in console_main
  | File "/usr/local/bin/pytest", line 8 in <module>
@youkaichao youkaichao added the bug Something isn't working label Jul 1, 2024
@oliver-li
Copy link

[36m(RayWorkerWrapper pid=4009)�[0m [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
�[36m(RayWorkerWrapper pid=4009)�[0m [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 1] Timeout at NCCL work: 158046, last enqueued NCCL work: 158046, last completed NCCL work: 158045.
�[36m(RayWorkerWrapper pid=4009)�[0m [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
�[36m(RayWorkerWrapper pid=4009)�[0m [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
�[36m(RayWorkerWrapper pid=4009)�[0m [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
�[36m(RayWorkerWrapper pid=4009)�[0m Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
�[36m(RayWorkerWrapper pid=4009)�[0m frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9a81b4c5a2 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9a81b513c0 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9a81b5270c in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #4: <unknown function> + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #5: <unknown function> + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #6: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
�[36m(RayWorkerWrapper pid=4009)�[0m 
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,171 E 4009 4134] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
�[36m(RayWorkerWrapper pid=4009)�[0m Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
�[36m(RayWorkerWrapper pid=4009)�[0m frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9a81b4c5a2 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9a81b513c0 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9a81b5270c in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #4: <unknown function> + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #5: <unknown function> + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #6: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
�[36m(RayWorkerWrapper pid=4009)�[0m 
�[36m(RayWorkerWrapper pid=4009)�[0m Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
�[36m(RayWorkerWrapper pid=4009)�[0m frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #1: <unknown function> + 0xe083a9 (0x7f9a817d73a9 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #2: <unknown function> + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #3: <unknown function> + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
�[36m(RayWorkerWrapper pid=4009)�[0m frame #4: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
�[36m(RayWorkerWrapper pid=4009)�[0m 
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,182 E 4009 4134] logging.cc:108: Stack trace: 
�[36m(RayWorkerWrapper pid=4009)�[0m  /usr/local/lib/python3.9/site-packages/ray/_raylet.so(+0x1021b3a) [0x7fa2fa068b3a] ray::operator<<()
�[36m(RayWorkerWrapper pid=4009)�[0m /usr/local/lib/python3.9/site-packages/ray/_raylet.so(+0x10245f8) [0x7fa2fa06b5f8] ray::TerminateHandler()
�[36m(RayWorkerWrapper pid=4009)�[0m /usr/lib64/libstdc++.so.6(+0x5ea06) [0x7fa2fd000a06]
�[36m(RayWorkerWrapper pid=4009)�[0m /usr/lib64/libstdc++.so.6(+0x5ea33) [0x7fa2fd000a33]
�[36m(RayWorkerWrapper pid=4009)�[0m /usr/lib64/libstdc++.so.6(+0x5e9f4) [0x7fa2fd0009f4]
�[36m(RayWorkerWrapper pid=4009)�[0m /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so(+0xe0845a) [0x7f9a817d745a] c10d::ProcessGroupNCCL::ncclCommWatchdog()
�[36m(RayWorkerWrapper pid=4009)�[0m /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so(+0x1c220) [0x7fa2ea2b7220] execute_native_thread_routine
�[36m(RayWorkerWrapper pid=4009)�[0m /usr/lib64/libpthread.so.0(+0x7ea5) [0x7fa2fe195ea5] start_thread
�[36m(RayWorkerWrapper pid=4009)�[0m /usr/lib64/libc.so.6(clone+0x6d) [0x7fa2fd7b5b0d] clone
�[36m(RayWorkerWrapper pid=4009)�[0m 
�[36m(RayWorkerWrapper pid=4009)�[0m *** SIGABRT received at time=1720159248 on cpu 5 ***
�[36m(RayWorkerWrapper pid=4009)�[0m PC: @     0x7fa2fd6ed387  (unknown)  raise
�[36m(RayWorkerWrapper pid=4009)�[0m     @     0x7fa2fe19d630       3504  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m     @     0x7fa2fd000a06  (unknown)  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m     @     0x7f9968118048  1728156448  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m     @     0x7f9abe1a6580  (unknown)  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m     @ 0xcde907894810c083  (unknown)  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,200 E 4009 4134] logging.cc:365: *** SIGABRT received at time=1720159248 on cpu 5 ***
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,200 E 4009 4134] logging.cc:365: PC: @     0x7fa2fd6ed387  (unknown)  raise
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,201 E 4009 4134] logging.cc:365:     @     0x7fa2fe19d630       3504  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,201 E 4009 4134] logging.cc:365:     @     0x7fa2fd000a06  (unknown)  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,202 E 4009 4134] logging.cc:365:     @     0x7f9968118048  1728156448  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,214 E 4009 4134] logging.cc:365:     @     0x7f9abe1a6580  (unknown)  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m [2024-07-05 14:00:48,215 E 4009 4134] logging.cc:365:     @ 0xcde907894810c083  (unknown)  (unknown)
�[36m(RayWorkerWrapper pid=4009)�[0m Fatal Python error: Aborted
�[36m(RayWorkerWrapper pid=4009)�[0m 
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=5
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 5
*** SIGABRT received at time=1720159428 on cpu 134 ***
PC: @     0x7fe6c1e3c387  (unknown)  raise
    @     0x7fe6c28ec630  1656596208  (unknown)
    @ ... and at least 1 more frames
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365: *** SIGABRT received at time=1720159428 on cpu 134 ***
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365: PC: @     0x7fe6c1e3c387  (unknown)  raise
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365:     @     0x7fe6c28ec630  1656596208  (unknown)
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365:     @ ... and at least 1 more frames
Fatal Python error: Aborted

@haichuan1221
Copy link
Contributor

I have the same issue, do you manage to solve it?

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

We received quite a lot report about "Watchdog caught collective operation timeout", which is flaky and difficult to reproduce. It happens after running for some time.

To analyze the error, we need to collect enough stack trace. If you encounter a similar problem, please paste enough stack trace for us to debug.

Example: https://buildkite.com/vllm/ci-aws/builds/3548#01906e81-54c6-4713-beb7-d08f3c873200 caught one such error.

Please include the first line of error, together with the Python stack trace.

In the following example, it seems one process has illegal memory access. It dies, but the rest process is still in allreduce, and is waiting for it, causing the timeout problem. From the python level stack trace, it happens when we profile the run, and it seems to be related with moe layer.

[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
--
  | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
  |  
  | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5c7e42b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd5c7f6a718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd57bc4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd57bc4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fd57bc545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd57bc5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #7: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #8: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #9: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | terminate called after throwing an instance of 'c10::DistBackendError'
  | what():  [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
  | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
  |  
  | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5c7e42b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd5c7f6a718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd57bc4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd57bc4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fd57bc545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd57bc5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #7: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #8: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #9: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: <unknown function> + 0xe32e33 (0x7fd57b8d7e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #3: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #4: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | ERROR 07-01 13:54:43 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1159 died, exit code: -6
  | INFO 07-01 13:54:43 multiproc_worker_utils.py:123] Killing local vLLM worker processes
  | [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | [rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
  | [rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | [rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
  | [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9e8b64f1b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9e8b653fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9e8b65531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #5: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #6: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | terminate called after throwing an instance of 'c10::DistBackendError'
  | what():  [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9e8b64f1b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9e8b653fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9e8b65531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #5: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #6: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: <unknown function> + 0xe32e33 (0x7f9e8b2d7e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #3: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #4: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Fatal Python error: Aborted
  |  
  | Thread 0x00007f96975fe640 (most recent call first):
  | File "/usr/lib/python3.10/threading.py", line 320 in wait
  | File "/usr/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
  | File "/usr/lib/python3.10/threading.py", line 953 in run
  | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
  |  
  | Thread 0x00007f96acbfd640 (most recent call first):
  | File "/usr/lib/python3.10/threading.py", line 324 in wait
  | File "/usr/lib/python3.10/threading.py", line 607 in wait
  | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
  |  
  | Thread 0x00007f9f8214e480 (most recent call first):
  | File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 357 in topk_softmax
  | File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34 in wrapper
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 357 in fused_topk
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 549 in fused_moe
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 273 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 426 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 470 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 540 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1135 in execute_model
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 814 in profile_run
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 175 in determine_num_available_blocks
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 122 in _run_workers
  | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 344 in _initialize_kv_caches
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 405 in from_engine_args
  | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 144 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/models/vllm_causallms.py", line 97 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/api/model.py", line 133 in create_from_arg_string
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/evaluator.py", line 164 in simple_evaluate
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/utils.py", line 288 in _wrapper
  | File "/vllm-workspace/.buildkite/lm-eval-harness/test_lm_eval_correctness.py", line 29 in launch_lm_eval
  | File "/vllm-workspace/.buildkite/lm-eval-harness/test_lm_eval_correctness.py", line 45 in test_lm_eval_correctness
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/python.py", line 162 in pytest_pyfunc_call
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/python.py", line 1632 in runtest
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 173 in pytest_runtest_call
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 241 in <lambda>
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 341 in from_call
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 240 in call_and_report
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 135 in runtestprotocol
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 116 in pytest_runtest_protocol
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 364 in pytest_runtestloop
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 339 in _main
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 285 in wrap_session
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 178 in main
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 206 in console_main
  | File "/usr/local/bin/pytest", line 8 in <module>

@haichuan1221
Copy link
Contributor

for it, causing the timeout problem. From the python level stack trace, it happens when we profile the run, and it seems to be related with moe layer.

Here is the log i get when i run llama model

[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Compile with TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bfecb897 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76bfe7bb25 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76bffa3718 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f76c11a1e36 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f76c11a5f38 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f76c11ab5ac in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76c11ac31c in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f770cc5ebf4 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f770df21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f770dcec353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bfecb897 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76bfe7bb25 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76bffa3718 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f76c11a1e36 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f76c11a5f38 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f76c11ab5ac in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76c11ac31c in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f770cc5ebf4 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f770df21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f770dcec353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bfecb897 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76bfe7bb25 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76bffa3718 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f76c11a1e36 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f76c11a5f38 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f76c11ab5ac in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76c11ac31c in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f770cc5ebf4 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f770df21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f770dcec353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bfecb897 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76bfe7bb25 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76bffa3718 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f76c11a1e36 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f76c11a5f38 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f76c11ab5ac in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76c11ac31c in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f770cc5ebf4 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f770df21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f770dcec353 in /lib/x86_64-linux-gnu/libc.so.6)

/mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
^C^C^C^C^CAborted (core dumped)
`

@ZackWan
Copy link

ZackWan commented Jul 25, 2024

How to reproduce

Just execute code below.

from time import time

from vllm import LLM, SamplingParams

import contextlib
import gc
import ray
import torch
from vllm.distributed import destroy_model_parallel


def cleanup():
    destroy_model_parallel()
    with contextlib.suppress(AssertionError):
        torch.distributed.destroy_process_group()
    gc.collect()
    torch.cuda.empty_cache()
    ray.shutdown()


# Common prefix.
prefix = (
    "You are an expert school principal, skilled in effectively managing "
    "faculty and staff. Draft 10-15 questions for a potential first grade "
    "Head Teacher for my K-12, all-girls', independent school that emphasizes "
    "community, joyful discovery, and life-long learning. The candidate is "
    "coming in for a first-round panel interview for a 8th grade Math "
    "teaching role. They have 5 years of previous teaching experience "
    "as an assistant teacher at a co-ed, public school with experience "
    "in middle school math teaching. Based on these information, fulfill "
    "the following paragraph: ")

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

generating_prompts = [prefix + prompt for prompt in prompts]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)

# Create an LLM.
regular_llm = LLM(model="facebook/opt-125m", 
                    tensor_parallel_size=2,
                    gpu_memory_utilization=0.4)

print("Results without `enable_prefix_caching`")

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
start_time_regular = time()
outputs = regular_llm.generate(generating_prompts, sampling_params)
duration_regular = time() - start_time_regular

regular_generated_texts = []
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    regular_generated_texts.append(generated_text)
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

print("-" * 80)

del regular_llm
cleanup()

prefix_cached_llm = LLM(model="facebook/opt-125m",
                        tensor_parallel_size=2,
                        enable_prefix_caching=True,
                        gpu_memory_utilization=0.4,
                        disable_custom_all_reduce=True)

# Warmup so that the shared prompt's KV cache is computed.
prefix_cached_llm.generate(generating_prompts[0], sampling_params)

# Generate with prefix caching.
start_time_cached = time()
outputs = prefix_cached_llm.generate(generating_prompts, sampling_params)
duration_cached = time() - start_time_cached

print("Results with `enable_prefix_caching`")

cached_generated_texts = []
# Print the outputs. You should see the same outputs as before.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    cached_generated_texts.append(generated_text)
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

print("-" * 80)

# Compare the results and display the speedup
generated_same = all([
    regular_generated_texts[i] == cached_generated_texts[i]
    for i in range(len(prompts))
])
print(f"Generated answers are the same: {generated_same}")

speedup = round(duration_regular / duration_cached, 2)
print(f"Speed up of cached generation compared to the regular is: {speedup}")

Error log

(RayWorkerWrapper pid=489968) [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1572864, NumelOut=1572864, Timeout(ms)=600000) ran for 600021 milliseconds before timing out.
(RayWorkerWrapper pid=489968) [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: 93459313471793.
(RayWorkerWrapper pid=489968) [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerWrapper pid=489968) [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerWrapper pid=489968) [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1572864, NumelOut=1572864, Timeout(ms)=600000) ran for 600021 milliseconds before timing out.
(RayWorkerWrapper pid=489968) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
(RayWorkerWrapper pid=489968) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f33e0193897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=489968) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f04299801b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=489968) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0429984fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=489968) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f042998631c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=489968) frame #4: + 0xdc253 (0x7f33e5a72253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=489968) frame #5: + 0x94ac3 (0x7f33e79f2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=489968) frame #6: + 0x126850 (0x7f33e7a84850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=489968)
(RayWorkerWrapper pid=489968) [2024-07-25 09:26:39,687 E 489968 490249] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1572864, NumelOut=1572864, Timeout(ms)=600000) ran for 600021 milliseconds before timing out.
(RayWorkerWrapper pid=489968) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
(RayWorkerWrapper pid=489968) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f33e0193897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=489968) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f04299801b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=489968) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0429984fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=489968) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f042998631c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=489968) frame #4: + 0xdc253 (0x7f33e5a72253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=489968) frame #5: + 0x94ac3 (0x7f33e79f2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=489968) frame #6: + 0x126850 (0x7f33e7a84850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=489968)
(RayWorkerWrapper pid=489968) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
(RayWorkerWrapper pid=489968) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f33e0193897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=489968) frame #1: + 0xe32e33 (0x7f0429608e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=489968) frame #2: + 0xdc253 (0x7f33e5a72253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=489968) frame #3: + 0x94ac3 (0x7f33e79f2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=489968) frame #4: + 0x126850 (0x7f33e7a84850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=489968)
(RayWorkerWrapper pid=489968) [2024-07-25 09:26:39,696 E 489968 490249] logging.cc:108: Stack trace:
(RayWorkerWrapper pid=489968) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x101a47a) [0x7f33e6be147a] ray::operator<<()
(RayWorkerWrapper pid=489968) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x101cf38) [0x7f33e6be3f38] ray::TerminateHandler()
(RayWorkerWrapper pid=489968) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f33e5a4420c]
(RayWorkerWrapper pid=489968) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f33e5a44277]
(RayWorkerWrapper pid=489968) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f33e5a441fe]
(RayWorkerWrapper pid=489968) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe32ee4) [0x7f0429608ee4] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=489968) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f33e5a72253]
(RayWorkerWrapper pid=489968) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f33e79f2ac3]
(RayWorkerWrapper pid=489968) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f33e7a84850]
(RayWorkerWrapper pid=489968)
(RayWorkerWrapper pid=489968) *** SIGABRT received at time=1721899599 on cpu 84 ***
(RayWorkerWrapper pid=489968) PC: @ 0x7f33e79f49fc (unknown) pthread_kill
(RayWorkerWrapper pid=489968) @ 0x7f33e79a0520 (unknown) (unknown)
(RayWorkerWrapper pid=489968) [2024-07-25 09:26:39,697 E 489968 490249] logging.cc:365: *** SIGABRT received at time=1721899599 on cpu 84 ***
(RayWorkerWrapper pid=489968) [2024-07-25 09:26:39,697 E 489968 490249] logging.cc:365: PC: @ 0x7f33e79f49fc (unknown) pthread_kill
(RayWorkerWrapper pid=489968) [2024-07-25 09:26:39,697 E 489968 490249] logging.cc:365: @ 0x7f33e79a0520 (unknown) (unknown)
(RayWorkerWrapper pid=489968) Fatal Python error: Aborted
(RayWorkerWrapper pid=489968)

@flexorRegev
Copy link

+1 This happens to me also some times when running Llama3.170b on tp=2

Is there a good way to recover from this?

@flexorRegev
Copy link

@youkaichao did you find the source / a way to work around this?

@youkaichao
Copy link
Member Author

not yet :(

@Clint-chan

This comment was marked as abuse.

@Sandwiches97
Copy link

+1 This happens to me also many times when running miniCPM 2.6 on vllm=0.5.5

Copy link

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Dec 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working stale
Projects
None yet
Development

No branches or pull requests

7 participants