[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs #5907

Open
robertgshaw2-neuralmagic opened this issue Jun 27, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@robertgshaw2-neuralmagic
Collaborator

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

vLLM has an issue where we can go OOM if too many logprobs are requested.

The reason that this happens is that there are three sources of memory usage:

  • Model weights
  • KV caches
  • Activations

When determining the KV cache size, we calculate peak memory by running a long prefill *without logprobs*.

If a request asks for many logprobs, however, that is an additional source of memory usage which is not accounted for during profiling, and nothing in the scheduler prevents it, so we can go OOM.
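A minimal sketch of the kind of request that can hit this path, using the offline `LLM` API; the model name, prompt, and logprob counts below are purely illustrative:

```python
# Illustrative only: a long prompt with prompt_logprobs requested forces vLLM to
# materialize logits/logprob tensors for every prompt token, which is not part
# of the profiling run described above.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(
    max_tokens=1,
    logprobs=20,          # top logprobs per generated token
    prompt_logprobs=20,   # logprobs for every prompt token
)
outputs = llm.generate(["word " * 8000], params)  # long prompt -> large logits tensor
```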

We have received several examples of this:

Attempt to fix this:

I am working on a design to address this issue

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic added the bug Something isn't working label Jun 27, 2024
@robertgshaw2-neuralmagic robertgshaw2-neuralmagic changed the title [Bug]: TRACKING ISSUE CUDA OOM with Logprobs [Bug]: TRACKING ISSUE: CUDA OOM with Logprobs Jun 27, 2024
@neubig
Contributor

neubig commented Jul 18, 2024

Hi @robertgshaw2-neuralmagic, I was wondering if there were any ongoing attempts to resolve this issue, or if #5355 seems like an acceptable fix? Having this fixed would be very helpful to us!

@binxuan

binxuan commented Jul 20, 2024

In my case, with a large vocab size (200k+) and long sequence length (8k+), the logits sort is the most memory-consuming part and can easily trigger OOM on a single GPU. Is it possible to do some kind of sequence parallelism to distribute it across all TP workers?
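For a rough sense of scale, a back-of-the-envelope estimate using the numbers above (the vocab size and sequence length are illustrative, and the sort is assumed to run on float32 logits):

```python
# Rough size of a single float32 logits tensor for the scenario described above.
vocab_size = 200_000   # "200k+" vocab
seq_len = 8_192        # "8k+" prompt tokens
bytes_fp32 = 4
print(vocab_size * seq_len * bytes_fp32 / 2**30)  # ~6.1 GiB, before sort buffers
# torch.sort also allocates value and index buffers of the same shape, so the
# transient peak on a single GPU is a small multiple of this.
```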

@cjfcsjt

cjfcsjt commented Sep 11, 2024

@robertgshaw2-neuralmagic Any update? Thanks.

@robertgshaw2-neuralmagic
Collaborator Author

For now, running with --enable-chunked-prefill should avoid the issue. I have been focused on the performance side of vLLM, since that is currently the biggest priority. I will return to this once we finalize the wave of optimizations.

Apologies for the delay. I agree this is a significant blemish in vllm right now.
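For reference, a sketch of the suggested workaround. The `--enable-chunked-prefill` server flag comes from the comment above; the `enable_chunked_prefill` keyword argument in the offline API is assumed to mirror it:

```python
# Workaround sketch: enable chunked prefill so long prompts are processed in
# smaller chunks. Server flag (from the comment above): --enable-chunked-prefill
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model
    enable_chunked_prefill=True,                    # assumed kwarg name
)
outputs = llm.generate(
    ["word " * 8000],
    SamplingParams(max_tokens=16, logprobs=5, prompt_logprobs=5),
)
```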

@njhill
Member

njhill commented Sep 12, 2024

@tjohnson31415 is looking into this.

@tjohnson31415
Contributor

I measured the peak GPU memory usage during the processing in Sampler. I found that the memory usage balloons to 9x the size of the input logits tensor. The increase comes from upcasting the logits to float32 (+2x), copying the tensor for probs and logprobs (+4x), and another temporary copy created in _sample (+2x). I'm sure there are ways to reduce this spike in memory, but we'd still need to limit the size of the input tensor.
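A small sketch of where those multiples come from, with illustrative shapes; the dtype handling mirrors the description above rather than vLLM's exact Sampler code:

```python
import torch

num_tokens, vocab_size = 256, 128_256                              # illustrative shapes
logits = torch.randn(num_tokens, vocab_size, dtype=torch.float16)  # 1x: input logits
logits_fp32 = logits.float()                                       # +2x: upcast to float32
probs = torch.softmax(logits_fp32, dim=-1)                         # +2x: probs copy
logprobs = torch.log_softmax(logits_fp32, dim=-1)                  # +2x: logprobs copy
sampled = torch.multinomial(probs, num_samples=1)                  # sampling adds further
                                                                   # temporaries (~+2x)
# Transient footprint is roughly 9x the fp16 logits tensor, matching the
# measurement above.
```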

The main challenge is when prompt_logprobs is requested, since the logits tensor then contains logits for every token in the prompt. For a model with a large context and vocab size, the logits tensor alone will hit memory limits even before any processing in Sampler. With the Llama 3.1 models (vocab_size of 128256 and a max sequence length of 131072), a single request with prompt_logprobs requested could produce a logits tensor of 128256 * 131072 * 2 B ≈ 31 GiB.
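Checking that arithmetic (values from the comment above, assuming fp16 logits):

```python
vocab_size, max_seq_len, bytes_fp16 = 128_256, 131_072, 2
print(vocab_size * max_seq_len * bytes_fp16 / 2**30)  # ~31.3 GiB of fp16 logits
# ...and the Sampler processing above would multiply this several times over.
```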

Limiting the number of tokens processed at a time with chunked prefill seems like the right solution to me. However, for cases where chunked prefill is not supported, we may still need "chunked logits processing" to limit the number of tokens on the path from hidden states -> logprobs output. This may be difficult to implement, though.
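A very rough sketch of what "chunked logits processing" could look like, assuming it is acceptable to gather only the requested per-token logprobs chunk by chunk; the function and argument names are hypothetical, not vLLM code:

```python
# Hypothetical sketch: compute hidden_states -> logprobs in slices along the
# token dimension so the full (num_tokens, vocab_size) float32 tensor never
# materializes at once.
import torch

def chunked_prompt_logprobs(hidden_states: torch.Tensor,
                            lm_head_weight: torch.Tensor,
                            token_ids: torch.Tensor,
                            chunk_size: int = 1024) -> torch.Tensor:
    """hidden_states: (num_tokens, hidden), lm_head_weight: (vocab, hidden),
    token_ids: (num_tokens,) ids to gather logprobs for."""
    out = []
    for start in range(0, hidden_states.shape[0], chunk_size):
        hs = hidden_states[start:start + chunk_size]
        logits = hs @ lm_head_weight.t()                      # (chunk, vocab)
        logprobs = torch.log_softmax(logits.float(), dim=-1)  # fp32 only per chunk
        ids = token_ids[start:start + chunk_size].unsqueeze(-1)
        out.append(logprobs.gather(-1, ids).squeeze(-1))
        # this chunk's logits/logprobs are freed before the next iteration
    return torch.cat(out)
```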

@robertgshaw2-neuralmagic: What do you think about this approach of "chunked logits processing"?

@robertgshaw2-neuralmagic
Collaborator Author

Is there any fundamental reason we need to make all of the copies? If not, it would make sense to me that chunking could work.

@patrickvonplaten
Contributor

Also running into this issue here with small models (2B) when returning logprobs.
