
[Bug]: llama3-405b-fp8 NCCL communication #7775

Open
wangwensuo opened this issue Aug 22, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@wangwensuo

Your current environment

The output of `python collect_env.py`

vllm 0.5.4

🐛 Describe the bug

Currently running inference on 8 × A800 GPUs, serving the 405B FP8 model with vLLM at tp=8. Inference works at first, but after a few test cases the NCCL communication fails.
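A launch command along these lines matches the setup described above (a sketch only: the model path and port are placeholders, and the flags are standard vLLM 0.5.x options for the OpenAI-compatible server):

```bash
# Hypothetical launch command for the reported setup:
# 8 x A800, tensor-parallel size 8, FP8 Llama 3.1 405B checkpoint.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Meta-Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --port 8000
```

The server log leading up to the failure: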
INFO 08-22 06:23:22 metrics.py:406] Avg prompt throughput: 36.2 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:32 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:42 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=161) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=158) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
ERROR 08-22 06:24:42 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-22 06:24:42 async_llm_engine.py:57] Engine background task failed
ERROR 08-22 06:24:42 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-22 06:24:42 async_llm_engine.py:57] done, _ = await asyncio.wait(
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-22 06:24:42 async_llm_engine.py:57] return await _wait(fs, timeout, return_when, loop)
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-22 06:24:42 async_llm_engine.py:57] await waiter
ERROR 08-22 06:24:42 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-22 06:24:42 async_llm_engine.py:57]
ERROR 08-22 06:24:42 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-22 06:24:42 async_llm_engine.py:57]
ERROR 08-22 06:24:42 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-22 06:24:42 async_llm_engine.py:57] return_value = task.result()
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-22 06:24:42 async_llm_engine.py:57] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in aexit
ERROR 08-22 06:24:42 async_llm_engine.py:57] self._do_exit(exc_type)
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-22 06:24:42 async_llm_engine.py:57] raise asyncio.TimeoutError
ERROR 08-22 06:24:42 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 08-22 06:24:42 async_llm_engine.py:181] Aborted request chat-0d19f933a5304a3890a8ffa2c2bc69e9.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in call
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f45dbd9bf10

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in call
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in aexit
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 08-22 06:24:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=162) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=161) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=158) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
INFO 08-22 06:25:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=158) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=161) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
INFO 08-22 06:26:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=163) WARNING 08-22 06:27:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:27:42 shm_broadcast.py:386] No available block found in 60 second.
[rank1]:[E822 06:33:42.457707445 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x7f46409eaa84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E822 06:33:42.472676771 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank7]:[E822 06:33:42.472710001 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank2]:[E822 06:33:42.472732829 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank2]:[E822 06:33:42.472743993 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 2] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank2]:[E822 06:33:42.472748107 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E822 06:33:42.472750861 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E822 06:33:42.472758671 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank7]:[E822 06:33:42.472765960 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 7] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank7]:[E822 06:33:42.472770740 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E822 06:33:42.472775007 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E822 06:33:42.473328595 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600090 milliseconds before timing out.
[rank6]:[E822 06:33:42.473378363 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank6]:[E822 06:33:42.473386694 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 6] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank6]:[E822 06:33:42.473392166 ProcessGroupNCCL.cpp:621] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E822 06:33:42.473396865 ProcessGroupNCCL.cpp:627] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E822 06:33:42.473871273 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

@wangwensuo wangwensuo added the bug Something isn't working label Aug 22, 2024
@youkaichao
Member

Can you reproduce it stably?

@wangwensuo
Author

It happens every time: after a few inference requests, the error is raised and GPU utilization sits at 100%. Running the 8B model on the same node works fine.

@youkaichao
Member

Try --disable-custom-all-reduce?
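For reference, the flag goes on the same server command line (same hypothetical entry point and placeholder model path as in the sketch above); it disables vLLM's custom all-reduce kernel so tensor-parallel collectives go through NCCL instead:

```bash
# Hypothetical relaunch with the custom all-reduce kernel disabled.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Meta-Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --disable-custom-all-reduce
```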

@TangJiakai

Try --disable-custom-all-reduce?

Still does not work.

@vlasenkoalexey

Same issue here; it happens after the model has been running for a while.

Startup arguments:
- --host=0.0.0.0
- --port=7080
- --model=gs://...
- --tensor-parallel-size=8
- --swap-space=16
- --gpu-memory-utilization=0.95
- --disable-log-stats
- --enable-chunked-prefill
- --max-num-seqs=12
- --max-model-len=131072
- --model-type=llama3.1

@TTTdas

TTTdas commented Oct 9, 2024

Same issue here, and it was solved by --disable-custom-all-reduce True.

@youkaichao
Member

@TTTdas I think #8558 should fix it. Can you try the latest wheel? See https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code for installing the latest wheel.

@TTTdas

TTTdas commented Oct 14, 2024

@TTTdas I think #8558 should fix it. Can you try the latest wheel? See https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code for installing the latest wheel.

Yes, it works! Thanks.
