
[Bug]: with --enable-prefix-caching, /completions crashes server with echo=True above certain prompt length #5344

Open
hibukipanim opened this issue Jun 7, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@hibukipanim

hibukipanim commented Jun 7, 2024

Your current environment

vLLM 0.4.3

RTX 4090 24GB (reproduces also on A100)

🐛 Describe the bug

Hi,

When the server is started with:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --enable-prefix-caching

and the following client code is run:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="foo"
)

prompt = [1] * 256  # prompt passed as a list of 256 token IDs
out = client.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    prompt=prompt,
    max_tokens=1,
    logprobs=5,
    echo=True
)
print(out)

this assertion is triggered:

INFO:     127.0.0.1:39724 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 118, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_completion.py", line 166, in create_completion
    async for i, res in result_generator:
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 244, in consumer
    raise e
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 235, in consumer
    raise item
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 219, in producer
    async for item in iterator:
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 662, in generate
    async for output in self._process_request(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
    raise e
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
    async for request_output in stream:
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 80, in __anext__
    raise result
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
    task.result()
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
    output = await self.model_executor.execute_model_async(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 738, in execute_model
    output = self.model.sample(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 378, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 112, in forward
    prompt_logprobs, sample_logprobs = _get_logprobs(
  File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 760, in _get_logprobs
    assert len(next_token_ids) == len(query_indices)
AssertionError

After that, the server enters a dead state: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

Given that the error only triggers above a certain prompt-length threshold, I suspect it's an OOM shadowed by this assert.
If I give the server more memory headroom by adding --gpu-memory-utilization 0.5, which leaves 12GB of my RTX 4090's 24GB free, the error instead happens when the prompt size is increased to 512 tokens.
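For reference, the full launch command for that test would look like the following (same model as above, with the reduced memory-utilization flag added):

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --enable-prefix-caching --gpu-memory-utilization 0.5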

This doesn't happen without echo=True.

Without --enable-prefix-caching, the above example handles the maximum prompt size of 2047 tokens.

Thanks!

@hibukipanim hibukipanim added the bug Something isn't working label Jun 7, 2024
@flexorRegev

Saw this one also on the LLM entrypoint with large batches.

@simon-mo
Collaborator

@KuntaiDu would you have bandwidth to take a look at this?

@dongxiaolong

Same. Could be a problem caused by 40-series graphics cards and flash-attn.
#5678
#5537
#5376
#5376 (comment)

@hibukipanim
Author

hibukipanim commented Sep 11, 2024

Updating that this issue (resulting in a similar stacktrace) still exists in v0.6.0, also when using the chat endpoint with prompt_logprobs (if the server is started with --enable-prefix-caching=True).
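A minimal sketch of that kind of chat request, for reference (illustrative only: passing prompt_logprobs through extra_body and the exact prompt and parameters are assumptions, not copied from the request that was actually sent):

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="foo"
)

# Long user message so the prompt exceeds the problematic length threshold,
# with prompt logprobs requested via vLLM's extra_body passthrough.
out = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Hello " * 300}],
    max_tokens=1,
    extra_body={"prompt_logprobs": 5},
)
print(out)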

I also tried enabling VLLM_USE_FLASHINFER_SAMPLER=1 and it still happens.

@drubinstein

drubinstein commented Oct 18, 2024

I posted a simpler example to reproduce the error in #8268. It seems that when the prefix is longer than the block size, this assertion is hit.

I reproduced this error on an A100.
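A rough sketch of that shape of reproduction, using the offline LLM entrypoint (this is a paraphrase of the pattern, not the exact code from #8268; the default block size of 16 tokens is an assumption):

from vllm import LLM, SamplingParams

# Prefix caching enabled; block size left at its default (assumed 16 tokens).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", enable_prefix_caching=True)

# Request prompt logprobs so the prompt tokens go through _get_logprobs,
# and make the prompt longer than one block.
params = SamplingParams(max_tokens=1, prompt_logprobs=5)
prompts = [{"prompt_token_ids": [1] * 32}]  # 32 tokens > one 16-token block

# The first call populates the prefix cache; the second call hits the cached
# prefix and, per this report, trips the assertion.
llm.generate(prompts, params)
llm.generate(prompts, params)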

@drubinstein

drubinstein commented Oct 22, 2024

#8268 (comment)

I wonder if the issue has been fixed in 0.6.3.post1.

@hibukipanim
Author

hibukipanim commented Oct 25, 2024

Hi @drubinstein thanks for the suggestion.

I tested now with 0.6.3.post1, using the exact same snippet from the first message here. It responds OK on the first request, but running the snippet again crashes vLLM with:

INFO:     127.0.0.1:43588 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-25 14:56:14 engine.py:158] AssertionError('Error in model execution: ')
ERROR 10-25 14:56:14 engine.py:158] Traceback (most recent call last):
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-25 14:56:14 engine.py:158]     return func(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1701, in execute_model
ERROR 10-25 14:56:14 engine.py:158]     output: SamplerOutput = self.model.sample(
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 573, in sample
ERROR 10-25 14:56:14 engine.py:158]     next_tokens = self.sampler(logits, sampling_metadata)
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-25 14:56:14 engine.py:158]     return self._call_impl(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-25 14:56:14 engine.py:158]     return forward_call(*args, **kwargs)
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 304, in forward
ERROR 10-25 14:56:14 engine.py:158]     prompt_logprobs, sample_logprobs = get_logprobs(
ERROR 10-25 14:56:14 engine.py:158]   File "/home/user/code/debug/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 984, in get_logprobs
ERROR 10-25 14:56:14 engine.py:158]     assert len(next_token_ids) == len(query_indices)
ERROR 10-25 14:56:14 engine.py:158] AssertionError

This stacktrace is very similar to the one in the original report.

@drubinstein

Rats, the issue wasn't fixed yet I guess.
