
[Bug]: llama3-405b-fp8 NCCL communication #7775

Open
wangwensuo opened this issue Aug 22, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@wangwensuo

Your current environment

The output of `python collect_env.py`

vllm 0.5.4

🐛 Describe the bug

Currently running inference on 8 × A800 GPUs, serving the 405B FP8 model with vLLM at tp=8. Inference works at first, but after a few test cases the NCCL communication fails.
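A launch command along these lines matches the setup described above (a sketch only: the model path and port are placeholders, and the flags are standard vLLM 0.5.x options for the OpenAI-compatible server):

```bash
# Hypothetical launch command for the reported setup:
# 8 x A800, tensor-parallel size 8, FP8 Llama 3.1 405B checkpoint.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Meta-Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --port 8000
```

The server log leading up to the failure: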
INFO 08-22 06:23:22 metrics.py:406] Avg prompt throughput: 36.2 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:32 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:42 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=161) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=158) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
ERROR 08-22 06:24:42 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-22 06:24:42 async_llm_engine.py:57] Engine background task failed
ERROR 08-22 06:24:42 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-22 06:24:42 async_llm_engine.py:57] done, _ = await asyncio.wait(
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-22 06:24:42 async_llm_engine.py:57] return await _wait(fs, timeout, return_when, loop)
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-22 06:24:42 async_llm_engine.py:57] await waiter
ERROR 08-22 06:24:42 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-22 06:24:42 async_llm_engine.py:57]
ERROR 08-22 06:24:42 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-22 06:24:42 async_llm_engine.py:57]
ERROR 08-22 06:24:42 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-22 06:24:42 async_llm_engine.py:57] return_value = task.result()
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-22 06:24:42 async_llm_engine.py:57] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in aexit
ERROR 08-22 06:24:42 async_llm_engine.py:57] self._do_exit(exc_type)
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-22 06:24:42 async_llm_engine.py:57] raise asyncio.TimeoutError
ERROR 08-22 06:24:42 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 08-22 06:24:42 async_llm_engine.py:181] Aborted request chat-0d19f933a5304a3890a8ffa2c2bc69e9.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in call
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f45dbd9bf10

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in call
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in aexit
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 08-22 06:24:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=162) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=161) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=158) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
INFO 08-22 06:25:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=158) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=161) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
INFO 08-22 06:26:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=163) WARNING 08-22 06:27:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:27:42 shm_broadcast.py:386] No available block found in 60 second.
[rank1]:[E822 06:33:42.457707445 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x7f46409eaa84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E822 06:33:42.472676771 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank7]:[E822 06:33:42.472710001 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank2]:[E822 06:33:42.472732829 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank2]:[E822 06:33:42.472743993 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 2] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank2]:[E822 06:33:42.472748107 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E822 06:33:42.472750861 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E822 06:33:42.472758671 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank7]:[E822 06:33:42.472765960 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 7] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank7]:[E822 06:33:42.472770740 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E822 06:33:42.472775007 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E822 06:33:42.473328595 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600090 milliseconds before timing out.
[rank6]:[E822 06:33:42.473378363 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank6]:[E822 06:33:42.473386694 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 6] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank6]:[E822 06:33:42.473392166 ProcessGroupNCCL.cpp:621] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E822 06:33:42.473396865 ProcessGroupNCCL.cpp:627] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E822 06:33:42.473871273 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

@wangwensuo wangwensuo added the bug Something isn't working label Aug 22, 2024
@youkaichao
Member

Can you reproduce it stably?

@wangwensuo
Author

It happens every time: after a few inference requests, the error is raised and GPU utilization sits at 100%. Running the 8B model on the same node works fine.

@youkaichao
Member

Try --disable-custom-all-reduce?
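For reference, the flag goes on the same server command line (same hypothetical entry point and placeholder model path as in the sketch above); it disables vLLM's custom all-reduce kernel so tensor-parallel collectives go through NCCL instead:

```bash
# Hypothetical relaunch with the custom all-reduce kernel disabled.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Meta-Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --disable-custom-all-reduce
```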

@TangJiakai

Try --disable-custom-all-reduce?

Still does not work.

@vlasenkoalexey

Same issue here; it happens after the model has been running for a while.

Startup arguments:
- --host=0.0.0.0
- --port=7080
- --model=gs://...
- --tensor-parallel-size=8
- --swap-space=16
- --gpu-memory-utilization=0.95
- --disable-log-stats
- --enable-chunked-prefill
- --max-num-seqs=12
- --max-model-len=131072
- --model-type=llama3.1

@TTTdas

TTTdas commented Oct 9, 2024

Same issue here, and it was solved by --disable-custom-all-reduce True.

@youkaichao
Member

@TTTdas I think #8558 should fix it. Can you try the latest wheel? See https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code for installing the latest wheel.

@TTTdas

TTTdas commented Oct 14, 2024

@TTTdas I think #8558 should fix it. Can you try the latest wheel? See https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code for installing the latest wheel.

Yes, it works! Thanks.
