
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. #5060

Open
Tracked by #5901
heungson opened this issue May 26, 2024 · 40 comments
Labels
bug Something isn't working

Comments

@heungson

heungson commented May 26, 2024

Your current environment

docker image: vllm/vllm-openai:0.4.2
Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ
GPUs: RTX8000 * 2

🐛 Describe the bug

The model works fine until the following error is raised.

INFO 05-26 22:28:18 async_llm_engine.py:529] Received request cmpl-10dff83cb4b6422ba8c64213942a7e46: prompt: '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>"Question: Is Korea the name of a Nation?\nGuideline: No explanation.\nFormat: {"Answer": "<your yes/no answer>"}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['---'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [5, 5, 255000, 255006, 9, 60478, 33, 3294, 13489, 1690, 2773, 1719, 1671, 20611, 38, 206, 46622, 7609, 33, 3679, 33940, 21, 206, 8961, 33, 19586, 61664, 2209, 31614, 28131, 20721, 22, 3598, 11205, 37, 22631, 255001, 255000, 255007], lora_request: None.
INFO 05-26 22:28:18 async_llm_engine.py:154] Aborted request cmpl-10dff83cb4b6422ba8c64213942a7e46.
INFO: 10.11.3.150:6231 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 221, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 110, in execute_model_async
all_outputs = await self._run_workers_async("execute_model",
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 326, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 660, in generate
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in anext
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 650, in generate
stream = await self.add_request(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 537, in add_request
self.start_background_loop()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 411, in start_background_loop
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already

@heungson heungson added the bug Something isn't working label May 26, 2024
@Warrior0x1

I encountered a similar error, which is a serious bug in production.

@BKsirius

Hi! I got the same error!

@agahEbrahimi

I'm facing the same error in production.


@mavericb

mavericb commented May 31, 2024

I also encountered this serious bug. It's impossible to deploy in prod since it fails unexpectedly and doesn't even restart the system. I tried the 0.4.3 pre-release, but the bug still persists 😭

Just adding some more info: I can call the endpoint from three terminals at the same time and it seems to survive, but the bug comes back when calling the endpoint from four terminals. So it's problematic to deploy something like this in production, where multiple simultaneous calls can happen.

Edit: additional info

@Penglikai

I am facing the same error. I wonder if it's solved in v0.5.0; has anyone tested it?

@tommyil

tommyil commented Jun 12, 2024

I am experiencing a similar issue with Llama 3.

@albertsokol

I am facing the same error. I wonder if it's solved in v0.5.0; has anyone tested it?

I've also been experiencing the same issue, using Llama 3 70b, in v0.5.0.

@tmostak

tmostak commented Jun 13, 2024

Also hitting this with Llama 3 70B in v0.5.0. So far it has only been triggered when using guided_regex (via the OpenAI API), fwiw, where it happens very frequently.

EDIT: Actually just hit it without the guided_regex argument.

@valeriylo

valeriylo commented Jun 15, 2024

Facing the same error, but it seems to be related to long context length. When the context is set to around 98k and over, the Avg generation throughput stays at 0.0 tokens/s, and after 6 messages the engine raises the loop error.
It looks like it needs more time to process the long request and the API itself cuts it off.

@tommyil

tommyil commented Jun 22, 2024

Update: Turns out the image I used had an outdated version installed.
I upgraded vLLM to version 0.5.0.post1, and the error hasn't recurred.

More background/log info on this:
This is on an Nvidia A10, with a Llama 3 8B base model and a fine-tuned QLoRA adapter (trained with Unsloth).

The symptoms are a stopped background loop, without the ability to recover.

Log:

- 2024-06-22T05:51:06.490+00:00 INFO:     10.42.20.211:38860 - "GET /health HTTP/1.1" 500 Internal Server Error
- 2024-06-22T05:51:06.490+00:00 ERROR:    Exception in ASGI application
- 2024-06-22T05:51:06.490+00:00 Traceback (most recent call last):
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
- 2024-06-22T05:51:06.490+00:00     result = await app(  # type: ignore[func-returns-value]
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
- 2024-06-22T05:51:06.490+00:00     return await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
- 2024-06-22T05:51:06.490+00:00     await super().__call__(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, _send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
- 2024-06-22T05:51:06.490+00:00     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     await app(scope, receive, sender)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
- 2024-06-22T05:51:06.490+00:00     await route.handle(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
- 2024-06-22T05:51:06.490+00:00     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     await app(scope, receive, sender)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
- 2024-06-22T05:51:06.490+00:00     response = await func(request)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
- 2024-06-22T05:51:06.491+00:00     raw_response = await run_endpoint_function(
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
- 2024-06-22T05:51:06.491+00:00     return await dependant.call(**values)
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 71, in health
- 2024-06-22T05:51:06.491+00:00     await openai_serving_chat.engine.check_health()
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 711, in check_health
- 2024-06-22T05:51:06.491+00:00     raise AsyncEngineDeadError("Background loop is stopped.")
- 2024-06-22T05:51:06.491+00:00 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
- 2024-06-22T05:51:08.799+00:00 INFO 06-22 05:51:08 metrics.py:229] Avg prompt throughput: 149.9 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 43.4%, CPU KV cache usage: 0.0%

@tommyil

tommyil commented Jun 22, 2024

Here's the up-to-date error, which occurs sporadically, also on version 0.5.0.post1.

This is on an Nvidia A10, with a Llama 3 8B base model and a fine-tuned QLoRA adapter (trained with Unsloth).
The symptoms are a stopped background loop, without the ability to recover.

- 2024-06-22T20:17:30.186+00:00 INFO:     10.42.15.50:60988 - "GET /health HTTP/1.1" 500 Internal Server Error
- 2024-06-22T20:17:30.187+00:00 ERROR:    Exception in ASGI application
- 2024-06-22T20:17:30.187+00:00 Traceback (most recent call last):
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
- 2024-06-22T20:17:30.187+00:00     result = await app(  # type: ignore[func-returns-value]
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
- 2024-06-22T20:17:30.187+00:00     return await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
- 2024-06-22T20:17:30.187+00:00     await super().__call__(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, _send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
- 2024-06-22T20:17:30.187+00:00     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     await app(scope, receive, sender)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
- 2024-06-22T20:17:30.187+00:00     await route.handle(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
- 2024-06-22T20:17:30.187+00:00     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     await app(scope, receive, sender)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
- 2024-06-22T20:17:30.187+00:00     response = await func(request)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
- 2024-06-22T20:17:30.187+00:00     raw_response = await run_endpoint_function(
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
- 2024-06-22T20:17:30.187+00:00     return await dependant.call(**values)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 84, in health
- 2024-06-22T20:17:30.187+00:00     await openai_serving_chat.engine.check_health()
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 842, in check_health
- 2024-06-22T20:17:30.187+00:00     raise AsyncEngineDeadError("Background loop is stopped.")
- 2024-06-22T20:17:30.187+00:00 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.

And here is the cause of that, a CUDA out-of-memory error:

- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] Engine background task failed
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] Traceback (most recent call last):
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return_value = task.result()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return fut.result()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = await make_async(self.driver_worker.execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return func(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return func(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = model_executable(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 371, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = self.model(input_ids, positions, kv_caches,
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 288, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states, residual = layer(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 237, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = self.mlp(hidden_states)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 80, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     x = self.act_fn(gate_up)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 13, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._forward_method(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/activation.py", line 36, in forward_cuda
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 216.00 MiB. GPU 
- 2024-06-22T20:16:34.700+00:00 Exception in callback functools.partial(<function _log_task_completion at 0x7f10e052ecb0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f10c8a0e530>>)

Could there be a memory leak?

My card uses 21GB out of 24GB.

@DarkLight1337
Member

DarkLight1337 commented Jun 23, 2024

Here's the up-to-date error, that occurs sporadically, also on version 0.5.0post1.

This is from Nvidia A10, with a Llama3 8B foundation, and a fine-tuned Qlora adapter (trained with unsloth). The symptoms are a stopped background loop, without the ability to recover.

And here is the cause for that - Cuda out of memory:

Could there be a memory leak?

My card uses 21GB out of 24GB.

#5355 may fix your particular problem. To temporarily circumvent this issue, you can set gpu_memory_utilization to a lower value (the default is 0.9).
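
If it helps, here is a minimal sketch of lowering it when constructing the engine in Python; the model name and the 0.8 value are placeholders, and the equivalent server flag is --gpu-memory-utilization:

from vllm import LLM

# Hypothetical example: reserve less of the GPU for vLLM so other allocations
# have more headroom. Tune the value for your hardware and model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.8,
)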

@trislee02

Is there any update on this?

@robertgshaw2-neuralmagic
Collaborator

Hello! I am systematically tracking AsyncEngineDeadError in #5901

To help us understand what is going on, I need to reproduce the errors on my side.

If you can share:

  • Server launch command
  • Example requests that cause the server to crash

That will make it much easier for me to look into what is going on.


@trislee02

Thank you for your prompt reply. Here's what I ran:
1. Server launch command

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-AWQ --tensor-parallel-size 2 --enforce-eager --quantization awq --gpu-memory-utilization 0.98 --max-model-len 77500

In particular, I enabled YaRN as instructed here to process long context (beyond the original 32K).

2. Example requests that cause the server to crash
A long request of approximately 76,188 tokens (using OpenAI tokenizer) is attached below.
long_request.txt

The error I got:

INFO:     129.126.125.252:12223 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
    request_outputs = await self.engine.step_async()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 282, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 482, in chat_completion_full_generator
    async for res in result_generator:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 780, in _process_request
    raise e
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 776, in _process_request
    async for request_output in stream:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 89, in __anext__
    raise result
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 282, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 482, in chat_completion_full_generator
    async for res in result_generator:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 767, in _process_request
    stream = await self.add_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 572, in add_request
    self.start_background_loop()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 443, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

Before this error, it logged the input tokens, as in this excerpt:

1030, 894, 1290, 311, 6286, 70164, 594, 829, 382, 61428, 49056, 0, 70164, 0, 70164, 8958, 43443, 17618, 13, 17426, 13, 330, 40, 3278, 1977, 432, 15356, 358, 1366, 311, 0, 70164, 0, 79123, 2293, 1837, 42246, 264, 2805, 707, 83, 7203, 8364, 84190, 14422, 1059, 19142, 448, 806, 1787, 1424, 382, 12209, 1052, 1033, 35177, 52884, 5193, 279, 14852, 6422, 11, 323, 3198, 594, 23314, 1136, 14995, 11, 323, 1550, 916, 279, 21340, 264, 1293, 10865, 289, 604, 315, 6646, 13, 4392, 13, 25636, 2127, 264, 95328, 504, 806, 653, 2986, 323, 3855, 304, 264, 294, 9832, 8841, 279, 6006, 13, 3197, 566, 1030, 8048, 4279, 1616, 566, 6519, 2163, 323, 44035, 518, 279, 6109, 2293, 25235, 7403, 323, 41563, 1136, 14995, 323, 1585, 84569, 438, 807, 49057, 1588, 323, 1052, 4221, 279, 38213, 14549, 448, 9709, 315, 12296, 11, 323, 279, 45896, 287, 7071, 389, 279, 26148, 34663, 19660, 4402, 323, 4460, 311, 8865, 264, 2975, 315, 330, 68494, 350, 4626, 1, 916, 279, 16971, 4617, 16065, 315, 24209, 86355, 13, 5005, 4392, 13, 25636, 2127, 6519, 323, 8570, 389, 700, 279, 6006, 13, 35825, 847, 8896, 504, 279, 521, 78121, 358, 8110, 382, 1, 28851, 311, 15786, 1045, 1899, 1335, 566, 11827, 11, 438, 582, 10487, 51430, 1495, 304, 279, 38636, 382, 1, 9064, 11974, 1, 73325, 2217, 1, 19434, 697, 6078, 1007, 279, 27505, 1335, 47010, 279, 38636, 8171, 382, 7044, 2148, 697, 64168, 1335, 1053, 4392, 13, 25636, 2127, 448, 37829, 11, 330, 40, 3207, 944, 1414, 358, 572, 30587, 432, 2217, 67049, 1290, 1335, 358, 7230, 11, 330, 40, 3278, 387, 15713, 311, 2217, 13, 659, 659, 358, 572, 11259, 29388, 806, 4845, 323, 566, 572, 11699, 705, 1948, 279, 24140, 11, 82758, 304, 806, 54144, 11, 448, 264, 2244, 19565, 304, 806, 6078, 382, 1, 93809, 323, 279, 33182, 659, 659, 659, 48181, 54698, 659, 659, 659, 10621, 95581, 33292, 659, 659, 659, 15605, 43786, 19874, 659, 659, 659, 659, 1837, 12209, 358, 572, 20446, 4279, 32073, 304, 279, 9255, 4722, 2188, 315, 279, 19771, 16629, 11, 36774, 518, 279, 6556, 330, 51, 1897, 2886, 1, 323, 8580, 369, 279, 3040, 297, 62410, 5426, 382, 13874, 19324, 4340, 1657, 31365, 4278, 525, 1052, 304, 419, 2197, 30, 10479, 6139, 1105, 30, 151645, 198, 151644, 77091, 198], lora_request: None.

After that, it kept showing Running: 1 reqs, but no generation.

INFO 06-28 15:41:57 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:07 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:17 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:27 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.

Thank you in advance!

@richarddli

I can trigger this error reliably when sending requests with larger amounts of tokens. I've reproduced this on both meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.1.

In my situation, I'm deploying vLLM on a CPU-only 32GB Intel system, and then running inference through the OpenAI endpoint.

@Aillian

Aillian commented Jul 12, 2024

Getting the same error; here is my launch command:
python -m vllm.entrypoints.openai.api_server --model Model_Files --dtype bfloat16 --chat-template chat_template.jinja --device cuda --enable-prefix-caching

@richarddli

I did some additional experimentation:

  • On a 64GB VM, CPU only, I was able to successfully trigger the error with a 351 token prompt.
  • On a 128GB VM, CPU only, the 351 token prompt did not trigger an error.
  • On the 128GB VM, CPU only, a 604 token prompt does trigger the error.

I'm using the same Docker image of vLLM, reasonably close to tip, and built using the Dockerfile.cpu, with meta-llama/Meta-Llama-3-8B-Instruct.

@trislee02

Is there any update on this? Thanks.

@shangzyu

Same problem, but when I reduced max-num-seqs from 256 to 16, the error disappeared.
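
For reference, a minimal sketch of applying the same cap via the Python API; the model name is a placeholder, and the matching server flag is --max-num-seqs:

from vllm import LLM

# Hypothetical example: cap how many sequences are batched per engine
# iteration so a single step is less likely to run into the engine timeout.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_num_seqs=16,
)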

@richarddli

I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S.

It appears the offending code is here: (see https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630). When the engine takes too long, it times out, but then leaves the engine in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix.

That said, if someone does have an idea of how to fix it, I'm happy to try to implement a fix.
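
In case it is useful, a rough sketch of raising the timeout before launching the server. This is a hypothetical wrapper, and the exact environment variable name (ENGINE_ITERATION_TIMEOUT_S vs. VLLM_ENGINE_ITERATION_TIMEOUT_S) depends on the vLLM version, so check vllm/envs.py for your install:

import os
import subprocess

# Assumption: the engine iteration timeout is read from the environment at
# startup; 180 seconds and the model name below are placeholder values.
env = dict(os.environ, VLLM_ENGINE_ITERATION_TIMEOUT_S="180")

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    ],
    env=env,
    check=True,
)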

@TangJiakai

Same Error!!!

@mangomatrix

mangomatrix commented Aug 8, 2024

Same error: #6689 (comment)
I only see it with Llama 3.1 70B and 405B-FP8; Llama 3 70B is fine.
Using the vllm package directly has no problem:

from vllm import LLM

llm = LLM("/mnt/models/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8, max_model_len=10600)

But running the server goes wrong:

vllm serve /mnt/models/Meta-Llama-3-70B-Instruct --tensor-parallel-size 8 --api-key token-abc123 --host 0.0.0.0 --port 8899 --max-model-len 81920

@yitianlian

same error!!

@caoxu915683474

same error!!

@yckbilly1929

I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S.

It appears the offending code is here: (see https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630). When the engine takes too long, it times out, but then leaves the engine in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix.

That said, if someone does have an idea of how to fix it, I'm happy to try to implement a fix.

This works. I set ENGINE_ITERATION_TIMEOUT_S to 180 to align with GraphRAG's default (timeout=configuration.request_timeout or 180.0). The default value of 60 is sometimes not enough.

@endNone

endNone commented Sep 7, 2024

I have met the same issue. From my point of view, it happens in two scenarios: one is under heavy request pressure (like GraphRAG); the other is uncertain. I deployed the service to the production environment, and this kind of error appeared after about 20 days, even though there was no such pressure at the time. Such errors are hard to reproduce. Initially, I suspected unstable network connections, but I quickly ruled out that possibility. I believe that setting ENGINE_ITERATION_TIMEOUT_S is effective in the first situation, but it may not necessarily work in the second.

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.9/site-packages/starlette/responses.py", line 250, in stream_response
    async for chunk in self.body_iterator:
  File "/root/Futuregene/FastChat/fastchat/serve/vllm_worker.py", line 196, in generate_stream
    async for request_output in results_generator:
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 767, in _process_request
    stream = await self.add_request(
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 572, in add_request
    self.start_background_loop()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 443, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

@mangomatrix

It seems to be a VRAM leak issue. When your VRAM is insufficient to run the model inference, the server will stop functioning, but the client won’t detect it.

@richarddli

It seems to be a VRAM leak issue. When your VRAM is insufficient to run the model inference, the server will stop functioning, but the client won’t detect it.

I read a bunch of the code a couple of months ago. When you send a request to vLLM, it gets queued for processing. There is a timeout associated with this request, which is governed by ENGINE_ITERATION_TIMEOUT_S. When a request exceeds the timeout, an AsyncEngineDeadError is thrown. I put together a hacky patch that simply removed the request from the queue, returning an error to the caller. This way the caller can choose how it wants to handle a 500 response (retry, ignore, etc.). I did ping a few vLLM folks to review my patch, but never heard back from them.

So hopefully someone who is more familiar with vLLM internals than me can investigate. I'm not sure if there is a VRAM leak issue or not (I certainly got the error frequently enough on new instances, which suggests it's not a leak), but I do think the semantics of the queue are incorrect.
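
In the meantime, caller-side handling can be sketched roughly like this. This is hypothetical client code assuming the OpenAI-compatible endpoint and the openai Python client; the base_url, api_key, model name, and retry policy are placeholders, and once the engine is actually dead every retry will keep failing until the server is restarted:

import asyncio
import openai
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

async def chat_with_retry(messages, retries=3):
    # Retry on 5xx responses (e.g. an engine timeout surfacing as a 500) and
    # give up after a few attempts so the caller can decide what to do next.
    for attempt in range(retries):
        try:
            return await client.chat.completions.create(
                model="llama3.1-8b",  # placeholder served model name
                messages=messages,
            )
        except openai.APIStatusError as e:
            if e.status_code < 500 or attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff

# Example: asyncio.run(chat_with_retry([{"role": "user", "content": "ping"}]))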

@ashwin-js

Hi, I am using Triton server to host my engine and I am getting the same issue. Can someone explain how to set ENGINE_ITERATION_TIMEOUT_S for the Triton server?

@khayamgondal

@ashwin-js were you able to figure it out?

@Silas-Xu

same error👀

@ashwin-js

@khayamgondal No, but I have a workaround: I exported the Triton metrics and keep GPU utilization below 85%, and I am not seeing any errors.
In model.json I kept gpu_memory_utilization at 0.90.

@Silas-Xu

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.

My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vLLM version is 0.6.1.post2, and the requests are concurrent batch requests (using from openai import AsyncOpenAI). The GPU is a 4090.

@ericalduo

ericalduo commented Sep 21, 2024

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.

My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.

@Silas-Xu Same as you. Have you resolved it?

@Silas-Xu

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.
My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.

@Silas-Xu Same as you. Have you resolved it?

I have tried various methods, including upgrading to the latest version and using different parameters, but none have been successful. It is said that using a smaller gpu_memory_utilization can solve this problem, but that may cause the server to fail to start, with a message indicating insufficient GPU memory.

@ericalduo

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.
My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.

@Silas-Xu Same as you. Have you resolved it?

I have tried various methods, including upgrading to the latest version and using different parameters, but none have been successful. It is said that using a smaller gpu_memory_utilization parameter can solve this problem, but it may cause an inability to start, with a message indicating insufficient GPU memory.

I am running the ‘glm-4-9b-gptq-int4’ model on an RTX 4090 with gpu_memory_utilization=0.5, but the model still reports this error.

@TweedBeetle

TweedBeetle commented Oct 25, 2024

I am encountering this issue running an OpenAI-compatible server with the following engine args:

        gpu_memory_utilization=0.7,
        enforce_eager=False,  # capture the graph for faster inference, but slower cold starts (30s > 20s)
        num_scheduler_steps=4,
        max_num_seqs=64,
        block_size=32,
        dtype="bfloat16",
        enable_chunked_prefill=True,
        trust_remote_code=True,

I am running Llama 3.1 8B on vLLM 0.6.3.post1 on an H100.
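
For completeness, roughly how those args would be assembled in Python, assuming they are passed to AsyncEngineArgs; this is a sketch only, and the model name is a placeholder:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Sketch: the same settings reported above, built into an async engine.
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.7,
    enforce_eager=False,
    num_scheduler_steps=4,
    max_num_seqs=64,
    block_size=32,
    dtype="bfloat16",
    enable_chunked_prefill=True,
    trust_remote_code=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)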
