
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. #5060

Open
Tracked by #5901
heungson opened this issue May 26, 2024 · 40 comments
Labels
bug Something isn't working

Comments

@heungson

heungson commented May 26, 2024

Your current environment

docker image: vllm/vllm-openai:0.4.2
Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ
GPUs: RTX8000 * 2

🐛 Describe the bug

The model works fine until the following error is raised.

INFO 05-26 22:28:18 async_llm_engine.py:529] Received request cmpl-10dff83cb4b6422ba8c64213942a7e46: prompt: '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>"Question: Is Korea the name of a Nation?\nGuideline: No explanation.\nFormat: {"Answer": "<your yes/no answer>"}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['---'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [5, 5, 255000, 255006, 9, 60478, 33, 3294, 13489, 1690, 2773, 1719, 1671, 20611, 38, 206, 46622, 7609, 33, 3679, 33940, 21, 206, 8961, 33, 19586, 61664, 2209, 31614, 28131, 20721, 22, 3598, 11205, 37, 22631, 255001, 255000, 255007], lora_request: None.
INFO 05-26 22:28:18 async_llm_engine.py:154] Aborted request cmpl-10dff83cb4b6422ba8c64213942a7e46.
INFO: 10.11.3.150:6231 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 221, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 110, in execute_model_async
all_outputs = await self._run_workers_async("execute_model",
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 326, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 660, in generate
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in anext
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 650, in generate
stream = await self.add_request(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 537, in add_request
self.start_background_loop()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 411, in start_background_loop
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already

@heungson heungson added the bug Something isn't working label May 26, 2024
@Warrior0x1

I encountered a similar error, which is a serious bug in production.

@BKsirius

Hi! I got the same error!

@agahEbrahimi

I'm facing the same error in production.


@mavericb

mavericb commented May 31, 2024

I also encountered this serious bug. It's impossible to deploy in prod since it fails unexpectedly and doesn't even restart the system. I tried the 0.4.3 pre-release, but the bug still persists 😭

Just adding some more info: I can call the endpoint from three terminals at the same time and it seems to survive, but the bug comes back when calling the endpoint from four terminals. So it's problematic to deploy something like this in production, where multiple simultaneous calls can happen.

Edit: additional info

@Penglikai

I am facing the same error. I wonder if it's solved in v0.5.0; has anyone tested it?

@tommyil

tommyil commented Jun 12, 2024

I am experiencing a similar issue with Llama 3.

@albertsokol

I am facing the same error. I wonder if it's solved in v0.5.0; has anyone tested it?

I've also been experiencing the same issue, using Llama 3 70b, in v0.5.0.

@tmostak

tmostak commented Jun 13, 2024

Also hitting this with Llama 3 70B in v0.5.0. So far it has only been triggered when using guided_regex (via the OpenAI API), fwiw, where it happens very frequently.

EDIT: Actually just hit it without the guided_regex argument.

@valeriylo

valeriylo commented Jun 15, 2024

Facing the same error, but it seems to be related to long context length. When the context is set to around 98k and over, the Avg generation throughput stays at 0.0 tokens/s, and after 6 messages the engine raises the loop error.
It looks like it needs more time to process the long request and the API itself cuts it off.

@tommyil

tommyil commented Jun 22, 2024

Update: Turns out the image I used had an outdated version installed.
I upgraded vLLM to version 0.5.0.post1, and the error hasn't recurred.

More background/log info on this:
This is on an Nvidia A10, with a Llama 3 8B base model and a fine-tuned QLoRA adapter (trained with Unsloth).

The symptoms are a stopped background loop, without the ability to recover.

Log:

- 2024-06-22T05:51:06.490+00:00 INFO:     10.42.20.211:38860 - "GET /health HTTP/1.1" 500 Internal Server Error
- 2024-06-22T05:51:06.490+00:00 ERROR:    Exception in ASGI application
- 2024-06-22T05:51:06.490+00:00 Traceback (most recent call last):
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
- 2024-06-22T05:51:06.490+00:00     result = await app(  # type: ignore[func-returns-value]
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
- 2024-06-22T05:51:06.490+00:00     return await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
- 2024-06-22T05:51:06.490+00:00     await super().__call__(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, _send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
- 2024-06-22T05:51:06.490+00:00     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     await app(scope, receive, sender)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
- 2024-06-22T05:51:06.490+00:00     await route.handle(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
- 2024-06-22T05:51:06.490+00:00     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     await app(scope, receive, sender)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
- 2024-06-22T05:51:06.490+00:00     response = await func(request)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
- 2024-06-22T05:51:06.491+00:00     raw_response = await run_endpoint_function(
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
- 2024-06-22T05:51:06.491+00:00     return await dependant.call(**values)
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 71, in health
- 2024-06-22T05:51:06.491+00:00     await openai_serving_chat.engine.check_health()
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 711, in check_health
- 2024-06-22T05:51:06.491+00:00     raise AsyncEngineDeadError("Background loop is stopped.")
- 2024-06-22T05:51:06.491+00:00 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
- 2024-06-22T05:51:08.799+00:00 INFO 06-22 05:51:08 metrics.py:229] Avg prompt throughput: 149.9 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 43.4%, CPU KV cache usage: 0.0%

@tommyil

tommyil commented Jun 22, 2024

Here's the up-to-date error, which occurs sporadically, also on version 0.5.0.post1.

This is on an Nvidia A10, with a Llama 3 8B base model and a fine-tuned QLoRA adapter (trained with Unsloth).
The symptoms are a stopped background loop, without the ability to recover.

- 2024-06-22T20:17:30.186+00:00 INFO:     10.42.15.50:60988 - "GET /health HTTP/1.1" 500 Internal Server Error
- 2024-06-22T20:17:30.187+00:00 ERROR:    Exception in ASGI application
- 2024-06-22T20:17:30.187+00:00 Traceback (most recent call last):
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
- 2024-06-22T20:17:30.187+00:00     result = await app(  # type: ignore[func-returns-value]
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
- 2024-06-22T20:17:30.187+00:00     return await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
- 2024-06-22T20:17:30.187+00:00     await super().__call__(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, _send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
- 2024-06-22T20:17:30.187+00:00     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     await app(scope, receive, sender)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
- 2024-06-22T20:17:30.187+00:00     await route.handle(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
- 2024-06-22T20:17:30.187+00:00     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     await app(scope, receive, sender)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
- 2024-06-22T20:17:30.187+00:00     response = await func(request)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
- 2024-06-22T20:17:30.187+00:00     raw_response = await run_endpoint_function(
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
- 2024-06-22T20:17:30.187+00:00     return await dependant.call(**values)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 84, in health
- 2024-06-22T20:17:30.187+00:00     await openai_serving_chat.engine.check_health()
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 842, in check_health
- 2024-06-22T20:17:30.187+00:00     raise AsyncEngineDeadError("Background loop is stopped.")
- 2024-06-22T20:17:30.187+00:00 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.

And here is the cause of that, a CUDA out-of-memory error:

- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] Engine background task failed
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] Traceback (most recent call last):
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return_value = task.result()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return fut.result()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = await make_async(self.driver_worker.execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return func(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return func(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = model_executable(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 371, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = self.model(input_ids, positions, kv_caches,
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 288, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states, residual = layer(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 237, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = self.mlp(hidden_states)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 80, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     x = self.act_fn(gate_up)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 13, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._forward_method(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/activation.py", line 36, in forward_cuda
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 216.00 MiB. GPU 
- 2024-06-22T20:16:34.700+00:00 Exception in callback functools.partial(<function _log_task_completion at 0x7f10e052ecb0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f10c8a0e530>>)

Could there be a memory leak?

My card uses 21GB out of 24GB.

@DarkLight1337
Member

DarkLight1337 commented Jun 23, 2024

Here's the up-to-date error, that occurs sporadically, also on version 0.5.0post1.

This is from Nvidia A10, with a Llama3 8B foundation, and a fine-tuned Qlora adapter (trained with unsloth). The symptoms are a stopped background loop, without the ability to recover.

And here is the cause for that - Cuda out of memory:

Could there be a memory leak?

My card uses 21GB out of 24GB.

#5355 may fix your particular problem. To temporarily circumvent this issue, you can set gpu_memory_utilization to a lower value (the default is 0.9).
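
If it helps, here is a minimal sketch of lowering it when constructing the engine in Python; the model name and the 0.8 value are placeholders, and the equivalent server flag is --gpu-memory-utilization:

from vllm import LLM

# Hypothetical example: reserve less of the GPU for vLLM so other allocations
# have more headroom. Tune the value for your hardware and model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.8,
)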

@trislee02

Is there any update on this?

@robertgshaw2-neuralmagic
Collaborator

Hello! I am systematically tracking AsyncEngineDeadError in #5901

To help us understand what is going on, I need to reproduce the errors on my side.

If you can share:

  • Server launch command
  • Example requests that cause the server to crash

That will make it much easier for me to look into what is going on.


@trislee02

Thank you for your prompt reply. Here's what I ran:
1. Server launch command

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-AWQ --tensor-parallel-size 2 --enforce-eager --quantization awq --gpu-memory-utilization 0.98 --max-model-len 77500

In particular, I enabled YaRN as instructed here to process long context (beyond the original 32K).

2. Example requests that cause the server to crash
A long request of approximately 76,188 tokens (using OpenAI tokenizer) is attached below.
long_request.txt

The error I got:

INFO:     129.126.125.252:12223 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
    request_outputs = await self.engine.step_async()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 282, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 482, in chat_completion_full_generator
    async for res in result_generator:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 780, in _process_request
    raise e
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 776, in _process_request
    async for request_output in stream:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 89, in __anext__
    raise result
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 282, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 482, in chat_completion_full_generator
    async for res in result_generator:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 767, in _process_request
    stream = await self.add_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 572, in add_request
    self.start_background_loop()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 443, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

Before this error, it logged the input tokens, as in this excerpt:

1030, 894, 1290, 311, 6286, 70164, 594, 829, 382, 61428, 49056, 0, 70164, 0, 70164, 8958, 43443, 17618, 13, 17426, 13, 330, 40, 3278, 1977, 432, 15356, 358, 1366, 311, 0, 70164, 0, 79123, 2293, 1837, 42246, 264, 2805, 707, 83, 7203, 8364, 84190, 14422, 1059, 19142, 448, 806, 1787, 1424, 382, 12209, 1052, 1033, 35177, 52884, 5193, 279, 14852, 6422, 11, 323, 3198, 594, 23314, 1136, 14995, 11, 323, 1550, 916, 279, 21340, 264, 1293, 10865, 289, 604, 315, 6646, 13, 4392, 13, 25636, 2127, 264, 95328, 504, 806, 653, 2986, 323, 3855, 304, 264, 294, 9832, 8841, 279, 6006, 13, 3197, 566, 1030, 8048, 4279, 1616, 566, 6519, 2163, 323, 44035, 518, 279, 6109, 2293, 25235, 7403, 323, 41563, 1136, 14995, 323, 1585, 84569, 438, 807, 49057, 1588, 323, 1052, 4221, 279, 38213, 14549, 448, 9709, 315, 12296, 11, 323, 279, 45896, 287, 7071, 389, 279, 26148, 34663, 19660, 4402, 323, 4460, 311, 8865, 264, 2975, 315, 330, 68494, 350, 4626, 1, 916, 279, 16971, 4617, 16065, 315, 24209, 86355, 13, 5005, 4392, 13, 25636, 2127, 6519, 323, 8570, 389, 700, 279, 6006, 13, 35825, 847, 8896, 504, 279, 521, 78121, 358, 8110, 382, 1, 28851, 311, 15786, 1045, 1899, 1335, 566, 11827, 11, 438, 582, 10487, 51430, 1495, 304, 279, 38636, 382, 1, 9064, 11974, 1, 73325, 2217, 1, 19434, 697, 6078, 1007, 279, 27505, 1335, 47010, 279, 38636, 8171, 382, 7044, 2148, 697, 64168, 1335, 1053, 4392, 13, 25636, 2127, 448, 37829, 11, 330, 40, 3207, 944, 1414, 358, 572, 30587, 432, 2217, 67049, 1290, 1335, 358, 7230, 11, 330, 40, 3278, 387, 15713, 311, 2217, 13, 659, 659, 358, 572, 11259, 29388, 806, 4845, 323, 566, 572, 11699, 705, 1948, 279, 24140, 11, 82758, 304, 806, 54144, 11, 448, 264, 2244, 19565, 304, 806, 6078, 382, 1, 93809, 323, 279, 33182, 659, 659, 659, 48181, 54698, 659, 659, 659, 10621, 95581, 33292, 659, 659, 659, 15605, 43786, 19874, 659, 659, 659, 659, 1837, 12209, 358, 572, 20446, 4279, 32073, 304, 279, 9255, 4722, 2188, 315, 279, 19771, 16629, 11, 36774, 518, 279, 6556, 330, 51, 1897, 2886, 1, 323, 8580, 369, 279, 3040, 297, 62410, 5426, 382, 13874, 19324, 4340, 1657, 31365, 4278, 525, 1052, 304, 419, 2197, 30, 10479, 6139, 1105, 30, 151645, 198, 151644, 77091, 198], lora_request: None.

After that, it kept showing Running: 1 reqs, but no generation.

INFO 06-28 15:41:57 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:07 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:17 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:27 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.

Thank you in advance!

@richarddli

I can trigger this error reliably when sending requests with larger amounts of tokens. I've reproduced this on both meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.1.

In my situation, I'm deploying vLLM on a CPU-only 32GB Intel system, and then running inference through the OpenAI endpoint.

@Aillian

Aillian commented Jul 12, 2024

Getting the same error; here is my launch command:
python -m vllm.entrypoints.openai.api_server --model Model_Files --dtype bfloat16 --chat-template chat_template.jinja --device cuda --enable-prefix-caching

@richarddli

I did some additional experimentation:

  • On a 64GB VM, CPU only, I was able to successfully trigger the error with a 351 token prompt.
  • On a 128GB VM, CPU only, the 351 token prompt did not trigger an error.
  • On the 128GB VM, CPU only, a 604 token prompt does trigger the error.

I'm using the same Docker image of vLLM, reasonably close to tip, and built using the Dockerfile.cpu, with meta-llama/Meta-Llama-3-8B-Instruct.

@trislee02

Is there any update on this? Thanks.

@shangzyu

Same problem, but when I reduced max-num-seqs from 256 to 16, the error disappeared.
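
For reference, a minimal sketch of applying the same cap via the Python API; the model name is a placeholder, and the matching server flag is --max-num-seqs:

from vllm import LLM

# Hypothetical example: cap how many sequences are batched per engine
# iteration so a single step is less likely to run into the engine timeout.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_num_seqs=16,
)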

@richarddli

I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S.

It appears the offending code is here: (see https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630). When the engine takes too long, it times out, but then leaves the engine in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix.

That said, if someone does have an idea of how to fix it, I'm happy to try to implement a fix.
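
In case it is useful, a rough sketch of raising the timeout before launching the server. This is a hypothetical wrapper, and the exact environment variable name (ENGINE_ITERATION_TIMEOUT_S vs. VLLM_ENGINE_ITERATION_TIMEOUT_S) depends on the vLLM version, so check vllm/envs.py for your install:

import os
import subprocess

# Assumption: the engine iteration timeout is read from the environment at
# startup; 180 seconds and the model name below are placeholder values.
env = dict(os.environ, VLLM_ENGINE_ITERATION_TIMEOUT_S="180")

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    ],
    env=env,
    check=True,
)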

@TangJiakai

Same Error!!!

@mangomatrix

mangomatrix commented Aug 8, 2024

Same error: #6689 (comment)
I only see it with Llama 3.1 70B and 405B-FP8; Llama 3 70B is fine.
Using the vllm package directly has no problem:

from vllm import LLM

llm = LLM("/mnt/models/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8, max_model_len=10600)

But running the server goes wrong:

vllm serve /mnt/models/Meta-Llama-3-70B-Instruct --tensor-parallel-size 8 --api-key token-abc123 --host 0.0.0.0 --port 8899 --max-model-len 81920

@yitianlian

same error!!

@caoxu915683474

same error!!

@yckbilly1929

I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S.

It appears the offending code is here: (see https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630). When the engine takes too long, it times out, but then leaves the engine in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix.

That said, if someone does have an idea of how to fix it, I'm happy to try to implement a fix.

This works. I set ENGINE_ITERATION_TIMEOUT_S to 180 to align with GraphRAG's default (timeout=configuration.request_timeout or 180.0). The default value of 60 is sometimes not enough.

@endNone

endNone commented Sep 7, 2024

I have met the same issue. From my point of view, it happens in two scenarios: one is under heavy request pressure (like GraphRAG); the other is uncertain. I deployed the service to the production environment, and this kind of error appeared after about 20 days, even though there was no such pressure at the time. Such errors are hard to reproduce. Initially, I suspected unstable network connections, but I quickly ruled out that possibility. I believe that setting ENGINE_ITERATION_TIMEOUT_S is effective in the first situation, but it may not necessarily work in the second.

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.9/site-packages/starlette/responses.py", line 250, in stream_response
    async for chunk in self.body_iterator:
  File "/root/Futuregene/FastChat/fastchat/serve/vllm_worker.py", line 196, in generate_stream
    async for request_output in results_generator:
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 767, in _process_request
    stream = await self.add_request(
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 572, in add_request
    self.start_background_loop()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 443, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

@mangomatrix

It seems to be a VRAM leak issue. When your VRAM is insufficient to run the model inference, the server will stop functioning, but the client won’t detect it.

@richarddli

It seems to be a VRAM leak issue. When your VRAM is insufficient to run the model inference, the server will stop functioning, but the client won’t detect it.

I read a bunch of the code a couple of months ago. When you send a request to vLLM, it gets queued for processing. There is a timeout associated with this request, which is governed by ENGINE_ITERATION_TIMEOUT_S. When a request exceeds the timeout, an AsyncEngineDeadError is thrown. I put together a hacky patch that simply removed the request from the queue, returning an error to the caller. This way the caller can choose how it wants to handle a 500 response (retry, ignore, etc.). I did ping a few vLLM folks to review my patch, but never heard back from them.

So hopefully someone who is more familiar with vLLM internals than me can investigate. I'm not sure if there is a VRAM leak issue or not (I certainly got the error frequently enough on new instances, which suggests it's not a leak), but I do think the semantics of the queue are incorrect.
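
In the meantime, caller-side handling can be sketched roughly like this. This is hypothetical client code assuming the OpenAI-compatible endpoint and the openai Python client; the base_url, api_key, model name, and retry policy are placeholders, and once the engine is actually dead every retry will keep failing until the server is restarted:

import asyncio
import openai
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

async def chat_with_retry(messages, retries=3):
    # Retry on 5xx responses (e.g. an engine timeout surfacing as a 500) and
    # give up after a few attempts so the caller can decide what to do next.
    for attempt in range(retries):
        try:
            return await client.chat.completions.create(
                model="llama3.1-8b",  # placeholder served model name
                messages=messages,
            )
        except openai.APIStatusError as e:
            if e.status_code < 500 or attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff

# Example: asyncio.run(chat_with_retry([{"role": "user", "content": "ping"}]))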

@ashwin-js

Hi, I am using Triton server to host my engine and I am getting the same issue. Can someone explain how to set ENGINE_ITERATION_TIMEOUT_S for the Triton server?

@khayamgondal

@ashwin-js were you able to figure it out?

@Silas-Xu

same error👀

@ashwin-js

@khayamgondal No, but I have a workaround: I exported the Triton metrics and keep GPU utilization below 85%, and I am not seeing any errors.
In model.json I kept gpu_memory_utilization at 0.90.

@Silas-Xu

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.

My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vLLM version is 0.6.1.post2, and the requests are concurrent batch requests (using from openai import AsyncOpenAI). The GPU is a 4090.

@ericalduo

ericalduo commented Sep 21, 2024

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.

My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.

@Silas-Xu Same as you. Have you resolved it?

@Silas-Xu

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.
My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.

@Silas-Xu Same as you. Have you resolved it?

I have tried various methods, including upgrading to the latest version and using different parameters, but none have been successful. It is said that using a smaller gpu_memory_utilization can solve this problem, but that may cause the server to fail to start, with a message indicating insufficient GPU memory.

@ericalduo

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.
My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.

@Silas-Xu Same as you. Have you resolved it?

I have tried various methods, including upgrading to the latest version and using different parameters, but none have been successful. It is said that using a smaller gpu_memory_utilization parameter can solve this problem, but it may cause an inability to start, with a message indicating insufficient GPU memory.

I am running the ‘glm-4-9b-gptq-int4’ model on an RTX 4090 with gpu_memory_utilization=0.5, but the model still reports this error.

@TweedBeetle

TweedBeetle commented Oct 25, 2024

I am encountering this issue running an OpenAI-compatible server with the following engine args:

        gpu_memory_utilization=0.7,
        enforce_eager=False,  # capture the graph for faster inference, but slower cold starts (30s > 20s)
        num_scheduler_steps=4,
        max_num_seqs=64,
        block_size=32,
        dtype="bfloat16",
        enable_chunked_prefill=True,
        trust_remote_code=True,

I am running Llama 3.1 8B on vLLM 0.6.3.post1 on an H100.
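
For completeness, roughly how those args would be assembled in Python, assuming they are passed to AsyncEngineArgs; this is a sketch only, and the model name is a placeholder:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Sketch: the same settings reported above, built into an async engine.
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.7,
    enforce_eager=False,
    num_scheduler_steps=4,
    max_num_seqs=64,
    block_size=32,
    dtype="bfloat16",
    enable_chunked_prefill=True,
    trust_remote_code=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)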
