
[Bug]: with --enable-prefix-caching, /completions crashes server with echo=True above certain prompt length #5344

Open
hibukipanim opened this issue Jun 7, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@hibukipanim

hibukipanim commented Jun 7, 2024

Your current environment

vLLM 0.4.3

RTX 4090 24GB (reproduces also on A100)

🐛 Describe the bug

Hi,

When the server is started with:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --enable-prefix-caching

and the following client code is run:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="foo"
)

prompt = [1] * 256  # prompt passed as a list of 256 token IDs
out = client.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    prompt=prompt,
    max_tokens=1,
    logprobs=5,
    echo=True
)
print(out)

this assertion is triggered:

INFO:     127.0.0.1:39724 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 118, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_completion.py", line 166, in create_completion
    async for i, res in result_generator:
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 244, in consumer
    raise e
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 235, in consumer
    raise item
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 219, in producer
    async for item in iterator:
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 662, in generate
    async for output in self._process_request(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
    raise e
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
    async for request_output in stream:
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 80, in __anext__
    raise result
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
    task.result()
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
    output = await self.model_executor.execute_model_async(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 738, in execute_model
    output = self.model.sample(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 378, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 112, in forward
    prompt_logprobs, sample_logprobs = _get_logprobs(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 760, in _get_logprobs
    assert len(next_token_ids) == len(query_indices)
AssertionError

After that, the server enters a dead state: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

Given that the error only triggers above a certain prompt-length threshold, I suspect it's an OOM shadowed by this assert.
If I give the server more memory headroom by adding --gpu-memory-utilization 0.5, which leaves 12GB of my RTX 4090's 24GB free, the error instead happens when the prompt size is increased to 512 tokens.
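For reference, the full launch command for that test would look like the following (same model as above, with the reduced memory-utilization flag added):

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --enable-prefix-caching --gpu-memory-utilization 0.5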

This doesn't happen without echo=True.

Without --enable-prefix-caching, the above example handles the maximum prompt size of 2047 tokens.

Thanks!

@hibukipanim hibukipanim added the bug Something isn't working label Jun 7, 2024
@flexorRegev

Saw this one also on the LLM entrypoint with large batches.

@simon-mo
Collaborator

@KuntaiDu would you have bandwidth to take a look at this?

@dongxiaolong

Same. Could be a problem caused by 40-series graphics cards and flash-attn.
#5678
#5537
#5376
#5376 (comment)

@hibukipanim
Author

hibukipanim commented Sep 11, 2024

Updating that this issue (resulting in a similar stacktrace) still exists in v0.6.0, also when using the chat endpoint with prompt_logprobs (if the server is started with --enable-prefix-caching=True).
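A minimal sketch of that kind of chat request, for reference (illustrative only: passing prompt_logprobs through extra_body and the exact prompt and parameters are assumptions, not copied from the request that was actually sent):

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="foo"
)

# Long user message so the prompt exceeds the problematic length threshold,
# with prompt logprobs requested via vLLM's extra_body passthrough.
out = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Hello " * 300}],
    max_tokens=1,
    extra_body={"prompt_logprobs": 5},
)
print(out)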

I also tried enabling VLLM_USE_FLASHINFER_SAMPLER=1 and it still happens.

@drubinstein

drubinstein commented Oct 18, 2024

I posted a simpler example to reproduce the error in #8268. It seems that when the prefix is longer than the block size, this assertion is hit.

I reproduced this error on an A100.
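A rough sketch of that shape of reproduction, using the offline LLM entrypoint (this is a paraphrase of the pattern, not the exact code from #8268; the default block size of 16 tokens is an assumption):

from vllm import LLM, SamplingParams

# Prefix caching enabled; block size left at its default (assumed 16 tokens).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", enable_prefix_caching=True)

# Request prompt logprobs so the prompt tokens go through _get_logprobs,
# and make the prompt longer than one block.
params = SamplingParams(max_tokens=1, prompt_logprobs=5)
prompts = [{"prompt_token_ids": [1] * 32}]  # 32 tokens > one 16-token block

# The first call populates the prefix cache; the second call hits the cached
# prefix and, per this report, trips the assertion.
llm.generate(prompts, params)
llm.generate(prompts, params)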

@drubinstein

drubinstein commented Oct 22, 2024

#8268 (comment)

I wonder if the issue has been fixed in 0.6.3.post1.

@hibukipanim
Author

hibukipanim commented Oct 25, 2024

Hi @drubinstein thanks for the suggestion.

I tested now with 0.6.3.post1, using the exact same snippet from the first message here. It responds OK on the first request, but running the snippet again crashes vLLM with:

INFO:     127.0.0.1:43588 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-25 14:56:14 engine.py:158] AssertionError('Error in model execution: ')
ERROR 10-25 14:56:14 engine.py:158] Traceback (most recent call last):
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-25 14:56:14 engine.py:158]     return func(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1701, in execute_model
ERROR 10-25 14:56:14 engine.py:158]     output: SamplerOutput = self.model.sample(
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 573, in sample
ERROR 10-25 14:56:14 engine.py:158]     next_tokens = self.sampler(logits, sampling_metadata)
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-25 14:56:14 engine.py:158]     return self._call_impl(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-25 14:56:14 engine.py:158]     return forward_call(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 304, in forward
ERROR 10-25 14:56:14 engine.py:158]     prompt_logprobs, sample_logprobs = get_logprobs(
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 984, in get_logprobs
ERROR 10-25 14:56:14 engine.py:158]     assert len(next_token_ids) == len(query_indices)
ERROR 10-25 14:56:14 engine.py:158] AssertionError

This stacktrace is very similar to the one in the original report.

@drubinstein

Rats, the issue wasn't fixed yet I guess.
