[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs #5907
Comments
Hi @robertgshaw2-neuralmagic, I was wondering if there were any ongoing attempts to resolve this issue, or if #5355 seems like an acceptable fix? Having this fixed would be very helpful to us!
In my case, with a large vocab size (200k+) and long sequence length (8k+), the logits sort is the most memory-consuming part and can easily trigger OOM on a single GPU. Is it possible to do some kind of sequence parallelism to distribute it across all the TP workers?
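To get a feel for the scale being described here, a back-of-the-envelope sketch; the sequence length, vocabulary size, and dtype sizes below are illustrative assumptions rather than measurements from vLLM:

```python
# Rough estimate of the memory needed just to hold and sort the logits for a
# single long prompt; all numbers are illustrative assumptions.
seq_len = 8192          # assumed prompt length
vocab_size = 200_000    # assumed vocabulary size
fp32_bytes = 4          # logits are commonly upcast to float32

logits_bytes = seq_len * vocab_size * fp32_bytes
# A full sort additionally materializes the sorted values and int64 indices.
sort_extra_bytes = seq_len * vocab_size * (fp32_bytes + 8)

print(f"logits tensor:       {logits_bytes / 2**30:.1f} GiB")      # ~6.1 GiB
print(f"extra for full sort: {sort_extra_bytes / 2**30:.1f} GiB")  # ~18.3 GiB
```

On top of the model weights and the KV cache, tens of GiB of transient tensors like this can easily exhaust a single GPU.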
@robertgshaw2-neuralmagic Any update? Thanks.
For now running with
Apologies for the delay. I agree this is a significant blemish in vLLM right now.
@tjohnson31415 is looking into this.
I measured the peak GPU memory usage during the logprobs processing. The main challenge is limiting the number of tokens processed at a time; doing this with chunked prefill seems like the right solution to me. However, for cases where chunked prefill is not supported, we may still need "chunked logits processing" to limit the number of tokens in flight from hidden states -> logprobs output. This may be difficult to implement, though. @robertgshaw2-neuralmagic: What do you think about this approach of "chunked logits processing"?
Is there any fundamental reason we need to make all the copies? Otherwise, it would make sense to me that chunking could work.
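A minimal sketch of what such "chunked logits processing" could look like (purely illustrative, not vLLM's Sampler code; the function name, tensor shapes, and chunk size are assumptions):

```python
import torch

def chunked_topk_logprobs(hidden_states: torch.Tensor,
                          lm_head_weight: torch.Tensor,
                          k: int = 5,
                          chunk_size: int = 256):
    """Compute top-k logprobs a chunk of tokens at a time so the full
    [num_tokens, vocab_size] logprobs tensor is never materialized at once.
    Illustrative sketch only, not vLLM's actual implementation."""
    topk_vals, topk_ids = [], []
    for start in range(0, hidden_states.shape[0], chunk_size):
        chunk = hidden_states[start:start + chunk_size]       # [chunk, hidden]
        logits = chunk @ lm_head_weight.t()                    # [chunk, vocab]
        logprobs = torch.log_softmax(logits.float(), dim=-1)   # upcast copy
        vals, ids = logprobs.topk(k, dim=-1)                   # keep only top-k
        topk_vals.append(vals)
        topk_ids.append(ids)
    return torch.cat(topk_vals), torch.cat(topk_ids)
```

Only the top-k values and indices survive each iteration, so peak memory scales with `chunk_size * vocab_size` instead of `num_tokens * vocab_size`, at the cost of splitting the output projection into several smaller matmuls.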
Also running into this issue with small models (2B) when returning logprobs.
Your current environment
🐛 Describe the bug
vLLM has an issue where we can go OOM if too many `logprobs` are requested. The reason this happens is that there are three sources of memory usage:

When determining the KV cache size, we calculate peak memory by running a long prefill *without logprobs*.

If a prompt requests many logprobs, however, this is an additional source of memory usage that is not considered during warmup and can cause OOM, because there is nothing in the scheduler to prevent it.
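To make the interaction concrete, here is a simplified sketch of the sizing logic described above; it is not vLLM's actual profiling code, the utilization value and tensor shapes are assumptions, and it assumes a CUDA device is present:

```python
import torch

# 1. Warmup/profiling: run one long prefill *without* logprobs and record
#    the peak memory it needed (the forward pass itself is elided here).
torch.cuda.reset_peak_memory_stats()
# ... run a representative prefill forward pass ...
peak_without_logprobs = torch.cuda.max_memory_allocated()

# 2. Hand everything left under the utilization target to the KV cache.
total_memory = torch.cuda.get_device_properties(0).total_memory
gpu_memory_utilization = 0.9  # assumed config value
kv_cache_budget = int(total_memory * gpu_memory_utilization) - peak_without_logprobs

# 3. At serving time, a request asking for many (prompt) logprobs materializes
#    an extra [prompt_len, vocab_size] float tensor that was never part of the
#    profiled peak, so it has to fit in whatever slack remains, or we OOM.
prompt_len, vocab_size = 8192, 128_000   # assumed request/model sizes
unaccounted_bytes = prompt_len * vocab_size * 4
```

Because nothing in the scheduler accounts for `unaccounted_bytes`, a single request can push the worker past the budget that was established during warmup.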
We have received several examples of this:
AsyncEngineDeadError
Attempt to fix this:
I am working on a design to address this issue.