[Bug]: Engine iteration timed out. This should never happen! #9839

Open
xxzhang0927 opened this issue Oct 30, 2024 · 0 comments
Labels: bug (Something isn't working)

Your current environment

Hardware: A800
Driver Version: 535.54.03, CUDA Version: 12.2
vLLM commit: d3a2451
Model: Qwen 72B
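
For context, this is roughly how the engine is being driven, inferred from the wrapper traceback further down (which consumes AsyncLLMEngine.generate as an async stream). The model path, sampling parameters, and request handling below are illustrative assumptions, not the exact production code:

# Minimal sketch, assuming a standard AsyncLLMEngine setup; model path and
# sampling params are assumptions. tensor_parallel_size=4 matches ranks 0-3
# seen in the NCCL logs below.
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="Qwen/Qwen2-72B-Instruct",  # assumption: some Qwen 72B checkpoint
        tensor_parallel_size=4,
    )
)

async def stream_infer(prompt: str, request_id: str) -> str:
    sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
    text = ""
    # The AsyncEngineDeadError reported below is raised out of this async-for
    # once the background engine loop times out.
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        text = request_output.outputs[0].text
    return text

# asyncio.run(stream_infer("hello", "541ca4832eb9436180e721ef069baedb"))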

Model Input Dumps

No response

🐛 Describe the bug

INFO 10-30 11:46:47 async_llm_engine.py:173] Added request 541ca4832eb9436180e721ef069baedb.
ERROR 10-30 11:47:32 async_llm_engine.py:656] Engine iteration timed out. This should never happen!
ERROR 10-30 11:47:32 async_llm_engine.py:56] Engine background task failed
ERROR 10-30 11:47:32 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
ERROR 10-30 11:47:32 async_llm_engine.py:56] done, _ = await asyncio.wait(
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
ERROR 10-30 11:47:32 async_llm_engine.py:56] return await _wait(fs, timeout, return_when, loop)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
ERROR 10-30 11:47:32 async_llm_engine.py:56] await waiter
ERROR 10-30 11:47:32 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 10-30 11:47:32 async_llm_engine.py:56]
ERROR 10-30 11:47:32 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 10-30 11:47:32 async_llm_engine.py:56]
ERROR 10-30 11:47:32 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 10-30 11:47:32 async_llm_engine.py:56] return_value = task.result()
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
ERROR 10-30 11:47:32 async_llm_engine.py:56] await asyncio.sleep(0)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
ERROR 10-30 11:47:32 async_llm_engine.py:56] self._do_exit(exc_type)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 10-30 11:47:32 async_llm_engine.py:56] raise asyncio.TimeoutError
ERROR 10-30 11:47:32 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
2024-10-30 11:47:32,282 - asyncio:default_exception_handler:1753 - ERROR: Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
await asyncio.sleep(0)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 10-30 11:47:32 async_llm_engine.py:180] Aborted request 6838fbb7076948a7a1f8071d4095c740.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/ailab/inference_wrapper/huggingface/lora/nlp/wrapper_vllm.py", line 621, in _process_stream_infence
async for request_output in results_generator:
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 770, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 886, in _process_request
raise e
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 882, in _process_request
async for request_output in stream:
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 93, in anext
raise result
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
await asyncio.sleep(0)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2024-10-30 11:47:32,282 - wrapper:_process_stream_infence:645 - ERROR: streaming inference exception, 6838fbb7076948a7a1f8071d4095c740
(VllmWorkerProcess pid=198) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 2] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 3] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 1] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)

ERROR 10-30 11:56:48 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 198 died, exit code: -6
INFO 10-30 11:56:48 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 2 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=9
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 2 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 2 Rank 0] [PG 2 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 9
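
The watchdog message above suggests either raising the heartbeat timeout or disabling the monitor while debugging. A minimal sketch of how those variables could be set before the engine and its workers start (values are assumptions; this only delays the abort and does not address the underlying GATHER hang):

# Possible mitigation sketch, assuming env vars are set in the serving process
# before torch/vLLM initialize, so worker processes inherit them.
import os

# Either raise the NCCL heartbeat timeout (value is an arbitrary example)...
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"
# ...or disable the heartbeat monitor entirely (pick one of the two).
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"

# vLLM's own "Engine iteration timed out" check can, I believe, be relaxed via
# VLLM_ENGINE_ITERATION_TIMEOUT_S (default 60); treat this as an assumption and
# verify against your vLLM version.
# os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"] = "180"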

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
xxzhang0927 added the bug (Something isn't working) label on Oct 30, 2024