[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs #5907

Open
robertgshaw2-neuralmagic opened this issue Jun 27, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@robertgshaw2-neuralmagic
Collaborator

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

vLLM has an issue where we can go OOM if too many logprobs are requested.

The reason that this happens is that there are three sources of memory usage:

  • Model weights
  • KV caches
  • Activations

When determining the KV cache size, we calculate peak memory by running a long prefill *without logprobs*.

If a request asks for many logprobs, however, that is an additional source of memory usage which is not accounted for during profiling, and nothing in the scheduler prevents it, so we can go OOM.
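A minimal sketch of the kind of request that can hit this path, using the offline `LLM` API; the model name, prompt, and logprob counts below are purely illustrative:

```python
# Illustrative only: a long prompt with prompt_logprobs requested forces vLLM to
# materialize logits/logprob tensors for every prompt token, which is not part
# of the profiling run described above.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(
    max_tokens=1,
    logprobs=20,          # top logprobs per generated token
    prompt_logprobs=20,   # logprobs for every prompt token
)
outputs = llm.generate(["word " * 8000], params)  # long prompt -> large logits tensor
```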

We have received several examples of this:

Attempt to fix this:

I am working on a design to address this issue

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic added the bug Something isn't working label Jun 27, 2024
@robertgshaw2-neuralmagic robertgshaw2-neuralmagic changed the title [Bug]: TRACKING ISSUE CUDA OOM with Logprobs [Bug]: TRACKING ISSUE: CUDA OOM with Logprobs Jun 27, 2024
@neubig
Contributor

neubig commented Jul 18, 2024

Hi @robertgshaw2-neuralmagic, I was wondering if there were any ongoing attempts to resolve this issue, or if #5355 seems like an acceptable fix? Having this fixed would be very helpful to us!

@binxuan

binxuan commented Jul 20, 2024

In my case, with a large vocab size (200k+) and long sequence length (8k+), the logits sort is the most memory-consuming part and can easily trigger OOM on a single GPU. Is it possible to do some kind of sequence parallelism to distribute it across all TP workers?
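For a rough sense of scale, a back-of-the-envelope estimate using the numbers above (the vocab size and sequence length are illustrative, and the sort is assumed to run on float32 logits):

```python
# Rough size of a single float32 logits tensor for the scenario described above.
vocab_size = 200_000   # "200k+" vocab
seq_len = 8_192        # "8k+" prompt tokens
bytes_fp32 = 4
print(vocab_size * seq_len * bytes_fp32 / 2**30)  # ~6.1 GiB, before sort buffers
# torch.sort also allocates value and index buffers of the same shape, so the
# transient peak on a single GPU is a small multiple of this.
```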

@cjfcsjt

cjfcsjt commented Sep 11, 2024

@robertgshaw2-neuralmagic Any update? Thanks.

@robertgshaw2-neuralmagic
Collaborator Author

For now, running with --enable-chunked-prefill should avoid the issue. I have been focused on the performance side of vLLM, since that is currently the biggest priority. I will return to this once we finalize the wave of optimizations.

Apologies for the delay. I agree this is a significant blemish in vllm right now.
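For reference, a sketch of the suggested workaround. The `--enable-chunked-prefill` server flag comes from the comment above; the `enable_chunked_prefill` keyword argument in the offline API is assumed to mirror it:

```python
# Workaround sketch: enable chunked prefill so long prompts are processed in
# smaller chunks. Server flag (from the comment above): --enable-chunked-prefill
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model
    enable_chunked_prefill=True,                    # assumed kwarg name
)
outputs = llm.generate(
    ["word " * 8000],
    SamplingParams(max_tokens=16, logprobs=5, prompt_logprobs=5),
)
```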

@njhill
Member

njhill commented Sep 12, 2024

@tjohnson31415 is looking into this.

@tjohnson31415
Contributor

I measured the peak GPU memory usage during the processing in Sampler. I found that the memory usage balloons to 9x the size of the input logits tensor. The increase comes from upcasting the logits to float32 (+2x), copying the tensor for probs and logprobs (+4x), and another temporary copy created in _sample (+2x). I'm sure there are ways to reduce this spike in memory, but we'd still need to limit the size of the input tensor.
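A small sketch of where those multiples come from, with illustrative shapes; the dtype handling mirrors the description above rather than vLLM's exact Sampler code:

```python
import torch

num_tokens, vocab_size = 256, 128_256                              # illustrative shapes
logits = torch.randn(num_tokens, vocab_size, dtype=torch.float16)  # 1x: input logits
logits_fp32 = logits.float()                                       # +2x: upcast to float32
probs = torch.softmax(logits_fp32, dim=-1)                         # +2x: probs copy
logprobs = torch.log_softmax(logits_fp32, dim=-1)                  # +2x: logprobs copy
sampled = torch.multinomial(probs, num_samples=1)                  # sampling adds further
                                                                   # temporaries (~+2x)
# Transient footprint is roughly 9x the fp16 logits tensor, matching the
# measurement above.
```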

The main challenge is when prompt_logprobs is requested, since the logits tensor then contains logits for every token in the prompt. For a model with a large context and vocab size, the logits tensor alone will hit memory limits even before any processing in Sampler. With the Llama 3.1 models (vocab_size of 128256 and a max sequence length of 131072), a single request with prompt_logprobs requested could produce a logits tensor of 128256 * 131072 * 2 B ≈ 31 GiB.
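Checking that arithmetic (values from the comment above, assuming fp16 logits):

```python
vocab_size, max_seq_len, bytes_fp16 = 128_256, 131_072, 2
print(vocab_size * max_seq_len * bytes_fp16 / 2**30)  # ~31.3 GiB of fp16 logits
# ...and the Sampler processing above would multiply this several times over.
```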

Limiting the number of tokens processed at a time with chunked prefill seems like the right solution to me. However, for cases where chunked prefill is not supported, we may still need "chunked logits processing" to limit the number of tokens on the path from hidden states -> logprobs output. This may be difficult to implement, though.
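A very rough sketch of what "chunked logits processing" could look like, assuming it is acceptable to gather only the requested per-token logprobs chunk by chunk; the function and argument names are hypothetical, not vLLM code:

```python
# Hypothetical sketch: compute hidden_states -> logprobs in slices along the
# token dimension so the full (num_tokens, vocab_size) float32 tensor never
# materializes at once.
import torch

def chunked_prompt_logprobs(hidden_states: torch.Tensor,
                            lm_head_weight: torch.Tensor,
                            token_ids: torch.Tensor,
                            chunk_size: int = 1024) -> torch.Tensor:
    """hidden_states: (num_tokens, hidden), lm_head_weight: (vocab, hidden),
    token_ids: (num_tokens,) ids to gather logprobs for."""
    out = []
    for start in range(0, hidden_states.shape[0], chunk_size):
        hs = hidden_states[start:start + chunk_size]
        logits = hs @ lm_head_weight.t()                      # (chunk, vocab)
        logprobs = torch.log_softmax(logits.float(), dim=-1)  # fp32 only per chunk
        ids = token_ids[start:start + chunk_size].unsqueeze(-1)
        out.append(logprobs.gather(-1, ids).squeeze(-1))
        # this chunk's logits/logprobs are freed before the next iteration
    return torch.cat(out)
```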

@robertgshaw2-neuralmagic: What do you think about this approach of "chunked logits processing"?

@robertgshaw2-neuralmagic
Collaborator Author

Is there any fundamental reason we need to make all of the copies? If not, it would make sense to me that chunking could work.

@patrickvonplaten
Contributor

Also running into this issue here with small models (2B) when returning logprobs.
