[Bug]: with --enable-prefix-caching, /completions crashes server with echo=True above certain prompt length #5344
Comments
Saw this one also on the LLM entrypoint with large batches.
@KuntaiDu would you have bandwidth to take a look at this?
Same. Could be a problem caused by 40-series graphics cards and flash-attn.
Updating that this issue (resulting in a similar stacktrace) still exists in v0.6.0, also when using the chat endpoint. Tried also to enable
I posted a simpler example to reproduce the error in #8268. It seems that when you have a prefix longer than the block size, this assertion is hit. I reproduced this error on an A100.
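For reference, a minimal offline reproduction along the lines of that comment might look like the sketch below. It is not the snippet from #8268; the model name and block size are assumptions, and prompt_logprobs stands in for what echo=True requests through the server.

```python
# Sketch only: request prompt logprobs for a prompt longer than one KV-cache
# block, with prefix caching enabled, and run it twice so the second call
# hits the cached prefix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",   # assumed small model, for illustration only
    enable_prefix_caching=True,
    block_size=16,               # default block size; prompt below exceeds it
)

long_prompt = " ".join(["hello"] * 64)           # well over one 16-token block
params = SamplingParams(max_tokens=8, prompt_logprobs=1)

for _ in range(2):
    outputs = llm.generate([long_prompt], params)
    print(outputs[0].outputs[0].text)
```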
I wonder if the issue has been fixed in 0.6.3.post1.
Hi @drubinstein, thanks for the suggestion. I tested now with 0.6.3.post1, the exact same snippet from the first message here, and it responds OK on the first request, but running the snippet again crashes vLLM with:
INFO: 127.0.0.1:43588 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-25 14:56:14 engine.py:158] AssertionError('Error in model execution: ')
ERROR 10-25 14:56:14 engine.py:158] Traceback (most recent call last):
ERROR 10-25 14:56:14 engine.py:158] File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-25 14:56:14 engine.py:158] return func(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158] File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1701, in execute_model
ERROR 10-25 14:56:14 engine.py:158] output: SamplerOutput = self.model.sample(
ERROR 10-25 14:56:14 engine.py:158] File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 573, in sample
ERROR 10-25 14:56:14 engine.py:158] next_tokens = self.sampler(logits, sampling_metadata)
ERROR 10-25 14:56:14 engine.py:158] File "/home/user/code/debug/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-25 14:56:14 engine.py:158] return self._call_impl(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158] File "/home/user/code/debug/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-25 14:56:14 engine.py:158] return forward_call(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158] File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 304, in forward
ERROR 10-25 14:56:14 engine.py:158] prompt_logprobs, sample_logprobs = get_logprobs(
ERROR 10-25 14:56:14 engine.py:158] File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 984, in get_logprobs
ERROR 10-25 14:56:14 engine.py:158] assert len(next_token_ids) == len(query_indices)
ERROR 10-25 14:56:14 engine.py:158] AssertionError
This stacktrace is very similar to the one in the original issue report.
Rats, the issue wasn't fixed yet I guess.
Your current environment
🐛 Describe the bug
Hi,
When the server is started with:
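(The original command was not preserved in this page; the sketch below assumes a Llama-family model and shows only the flag that matters for this issue. The --gpu-memory-utilization 0.5 variant discussed further below would be added to the same command.)

```bash
# Sketch of a launch matching the flags discussed in this issue; the exact
# model name and any remaining flags from the original report are assumptions.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-prefix-caching
```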
and this client code is run:
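(Likewise, the original snippet was not preserved here; this is a minimal sketch of a /v1/completions request with echo=True and a long prompt, with the model name and prompt contents assumed.)

```python
# Sketch only: hit the OpenAI-compatible completions endpoint with echo=True
# and a prompt long enough to cross the reported length threshold.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model name
    "prompt": "word " * 600,   # long prompt; the crash appears above a threshold
    "max_tokens": 16,
    "echo": True,              # return the prompt back along with the completion
    "logprobs": 1,
}

resp = requests.post("http://localhost:8000/v1/completions", json=payload)
print(resp.status_code)
print(resp.json())
```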
it triggers the assertion discussed above (assert len(next_token_ids) == len(query_indices) in the sampler's get_logprobs, as shown in the stack trace quoted in the comments), and then the server enters a dead state:
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
Given that the error triggers above a certain prompt-length threshold, I suspect it's an OOM shadowed by this assert.
If I give the server more memory headroom by adding --gpu-memory-utilization 0.5, which leaves 12 GB of my RTX 4090's 24 GB free, the error happens when increasing the prompt size to 512 tokens. This doesn't happen without echo=True. Without --enable-prefix-caching in the above example, it can handle the max prompt size of 2047. Thanks!