[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. #5060
Comments
I encountered a similar error, which is a serious bug in production. Related issues:
I also encountered this serious bug. It's impossible to deploy in prod since it fails unexpectedly and the system doesn't even restart itself. I tried the 0.4.3 pre-release, but the bug still persists 😭 Just adding some more info: I can call the endpoint from three terminals at the same time and it seems to survive, but the bug reappears when calling the endpoint from four terminals. So it's problematic to deploy something like this in production, where multiple simultaneous calls can happen. Edit: additional info.
I am experiencing a similar issue with Llama 3.
I've also been experiencing the same issue, using Llama 3 70B, in v0.5.0.
Also hitting this with Llama 3 70B in v0.5.0. All the times it has been triggered have been when using a particular option. EDIT: actually, just hit it without that option as well.
Facing the same error, but it seems to be related to long context length. When setting the context to around 98k or more, the average generation throughput stays at 0.0 tokens/s, and after 6 messages the loop error comes from the engine.
Update: turns out the image I used had an outdated version installed.
Here's the up-to-date error, which occurs sporadically, also on version 0.5.0.post1. This is from an NVIDIA A10, with a Llama 3 8B base model and a fine-tuned QLoRA adapter (trained with Unsloth).
And here is the cause for that: CUDA out of memory.
Could there be a memory leak? My card uses 21 GB out of 24 GB.
#5355 may fix your particular problem. To temporarily circumvent this issue, you can set a larger engine iteration timeout.
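For anyone embedding the engine in their own Python process, here is a minimal sketch of that workaround. The variable name `VLLM_ENGINE_ITERATION_TIMEOUT_S`, its reported 60 s default, and the need to set it before the engine module is imported are assumptions drawn from the comments further down in this thread, not something this comment confirms:

```python
# Hedged sketch, not an official recipe: raise the engine-iteration timeout
# before vLLM reads it. The variable name and import-time behaviour are
# assumptions based on the later comments in this thread.
import os

os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"] = "180"  # default is reportedly 60 s

from vllm.engine.arg_utils import AsyncEngineArgs        # noqa: E402
from vllm.engine.async_llm_engine import AsyncLLMEngine  # noqa: E402

# Illustrative engine args only; the model name and memory fraction are placeholders.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        gpu_memory_utilization=0.90,
    )
)
```

When running the `vllm/vllm-openai` container instead, the same environment variable can be passed through the container runtime (e.g. `docker run -e VLLM_ENGINE_ITERATION_TIMEOUT_S=180 ...`).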
Is there any update on this?
Hello! I am systematically tracking these errors. To help us understand what is going on, I need to reproduce them on my side. If you can share (1) your server launch command and (2) example requests that cause the server to crash, it is much easier for me to look into what is going on.
Thank you for your prompt reply. Here's what I ran:

1. Server launch command: in particular, I enabled YaRN as instructed here to process long context (beyond the original 32K).
2. Example requests that cause the server to crash (see the sketch after this comment).

The error I got:

Before this error, it showed the input tokens, as in this excerpt:

After that, it kept showing:

Thank you in advance!
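To make the reproduction concrete, here is a hedged sketch of the kind of long-context request described above, sent through the OpenAI Python client against a vLLM server. The base URL, API key, served model name, prompt size, and sampling settings are illustrative placeholders, not the reporter's actual values:

```python
# Hedged reproduction sketch: a single chat completion with a very long prompt,
# of the kind reported to stall generation throughput and eventually kill the
# engine loop. All concrete values below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Build a prompt well past the model's original 32K context (assumes the server
# was started with YaRN rope scaling and a larger --max-model-len).
long_context = "The quick brown fox jumps over the lazy dog. " * 12_000

response = client.chat.completions.create(
    model="<served-model-name>",  # placeholder
    messages=[
        {"role": "user", "content": long_context + "\n\nSummarize the text above."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```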
I can trigger this error reliably when sending requests with larger amounts of tokens; I've reproduced this on both versions I tested. In my situation, I'm deploying vLLM on a CPU-only 32 GB Intel system and running inference through the OpenAI endpoint.
Getting the same error; here is my launch command:
I did some additional experimentation. I'm using the same Docker image of vLLM, reasonably close to tip.
Is there any update on this? Thanks.
Same problem, but when I reduced max-num-seqs from 256 to 16, the error disappeared.
I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S. It appears the offending code is here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630. When the engine takes too long, it times out but is then left in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix; that said, if someone has an idea of how to fix it, I'm happy to try to implement it.
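For readers who have not opened that file, here is a minimal, simplified asyncio sketch of the pattern the comment describes. It is not vLLM's actual code, just an illustration of how a per-iteration `wait_for` timeout can leave a background loop permanently dead:

```python
# Minimal sketch (not vLLM's actual code) of the failure mode described above:
# a background loop guarded by asyncio.wait_for that is marked dead on timeout.
import asyncio

ITERATION_TIMEOUT_S = 60  # analogous to ENGINE_ITERATION_TIMEOUT_S


class ToyEngine:
    def __init__(self) -> None:
        self.errored = False

    async def step(self) -> None:
        await asyncio.sleep(0.01)  # stand-in for one model-execution step

    async def run_loop(self) -> None:
        while True:
            try:
                # If one step exceeds the timeout (large batch, long context,
                # slow hardware), TimeoutError escapes the loop...
                await asyncio.wait_for(self.step(), ITERATION_TIMEOUT_S)
            except asyncio.TimeoutError:
                # ...and the engine is flagged dead instead of recovering.
                self.errored = True
                raise

    async def add_request(self, prompt: str) -> None:
        if self.errored:
            # Every later request then fails with the equivalent of
            # "Background loop has errored already".
            raise RuntimeError("Background loop has errored already.")
```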
Same error!
Same error: #6689 (comment)
Server launch command: vllm serve /mnt/models/Meta-Llama-3-70B-Instruct --tensor-parallel-size 8 --api-key token-abc123 --host 0.0.0.0 --port 8899 --max-model-len 81920
Same error!
This works. I set ENGINE_ITERATION_TIMEOUT_S to 180 to align with the GraphRAG default value.
I have met the same issue. From my point of view, it happens in two scenarios: one is under heavy request pressure (like GraphRAG); the other is uncertain. I deployed the service to the production environment, and this kind of error appeared after about 20 days, even though there was no such pressure at the time. Such errors are hard to reproduce. Initially I suspected unstable network connections, but I quickly ruled out that possibility. I believe that setting ENGINE_ITERATION_TIMEOUT_S is effective in the first scenario, but it may not necessarily work in the second.
It seems to be a VRAM leak issue. When your VRAM is insufficient to run model inference, the server stops functioning, but the client won't detect it.
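One way for a client to notice a dead engine is to probe the server rather than trust it. Below is a hedged client-side sketch that polls a `/health` endpoint between batches of requests; the route name, port, and timings are assumptions for illustration, not something this comment confirms:

```python
# Hedged client-side sketch: detect a dead or unresponsive server by polling a
# health endpoint instead of assuming it is alive. The /health route, port,
# and timings are assumptions.
import time
import requests

BASE_URL = "http://localhost:8000"


def wait_until_healthy(timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll GET /health until it returns 200, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # connection refused / reset: server down or restarting
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    if not wait_until_healthy():
        # Alert or restart the serving process from an external supervisor.
        raise SystemExit("vLLM server did not become healthy in time")
```

If the check fails, recovery has to come from outside the process (a container restart policy, systemd, or a Kubernetes liveness probe), since the dead loop does not restart itself.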
I read a bunch of the code a couple of months ago. When you send a request to vLLM, it gets queued for processing, and there is a timeout associated with that request, governed by ENGINE_ITERATION_TIMEOUT_S. So hopefully someone who is more familiar with vLLM internals than me can investigate. I'm not sure whether there is a VRAM leak (I certainly got the error frequently enough on fresh instances, which suggests it's not a leak), but I do think the semantics of the queue are incorrect.
Hi, I am using Triton server to host my engine and I am getting the same issue. Can someone explain how to set ENGINE_ITERATION_TIMEOUT_S in that setup?
@ashwin-js were you able to figure it out?
Same error 👀
@khayamgondal No, but I have a workaround: I exported the Triton metrics and keep GPU utilization below 85%. I am not seeing any errors.
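As a rough illustration of that workaround, here is a hedged sketch that samples the GPU with NVML and gates new work above a threshold. The 85% figure matches the comment above, but whether the relevant signal is compute utilization or memory pressure is an assumption, so the sketch checks both:

```python
# Hedged sketch of the "keep GPU pressure below ~85%" workaround: sample the
# GPU with NVML and stop admitting new requests above a threshold. Whether the
# relevant signal is compute utilization or memory usage is an assumption.
import pynvml

THRESHOLD = 0.85  # per the comment above


def gpu_pressure(device_index: int = 0) -> tuple[float, float]:
    """Return (compute_utilization, memory_fraction) for one GPU, both in 0..1."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu / 100.0
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return util, mem.used / mem.total
    finally:
        pynvml.nvmlShutdown()


def should_admit_request() -> bool:
    util, mem_frac = gpu_pressure()
    return util < THRESHOLD and mem_frac < THRESHOLD
```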
@Silas-Xu Same as you. Have you resolved it?
I have tried various methods, including upgrading to the latest version and using different parameters, but none have been successful. It is said that using a smaller max-num-seqs may help.
I'm running the 'glm-4-9b-gptq-int4' model on an RTX 4090 with gpu_memory_utilization=0.5, but it still reports this error.
I am encountering this issue running an OpenAI-compatible server with the following engine args:
I am running Llama 3.1 8B on 0.6.3.post1 on an H100.
Your current environment
docker image: vllm/vllm-openai:0.4.2
Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ
GPUs: 2 × RTX 8000
🐛 Describe the bug
The model works fine until the following error is raised.
INFO 05-26 22:28:18 async_llm_engine.py:529] Received request cmpl-10dff83cb4b6422ba8c64213942a7e46: prompt: '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>"Question: Is Korea the name of a Nation?\nGuideline: No explanation.\nFormat: {"Answer": "<your yes/no answer>"}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['---'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [5, 5, 255000, 255006, 9, 60478, 33, 3294, 13489, 1690, 2773, 1719, 1671, 20611, 38, 206, 46622, 7609, 33, 3679, 33940, 21, 206, 8961, 33, 19586, 61664, 2209, 31614, 28131, 20721, 22, 3598, 11205, 37, 22631, 255001, 255000, 255007], lora_request: None.
INFO 05-26 22:28:18 async_llm_engine.py:154] Aborted request cmpl-10dff83cb4b6422ba8c64213942a7e46.
INFO: 10.11.3.150:6231 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 221, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 110, in execute_model_async
all_outputs = await self._run_workers_async("execute_model",
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 326, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 660, in generate
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in anext
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 650, in generate
stream = await self.add_request(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 537, in add_request
self.start_background_loop()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 411, in start_background_loop
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already