[V1] Remove pre-allocation for KV cache #16941
Conversation
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
What's the reason for this change? I thought pre-allocation was a feature you heavily promoted when introducing V1?
# Test case 1: Requires additional lookahead tokens
kv_cache_manager = KVCacheManager(kv_cache_config=config,
                                  max_model_len=100,
                                  num_preallocate_tokens=0)
What is `num_preallocate_tokens`?
@yinghai I've updated the PR description.
The idea was that, when allocating KV cache blocks for a request, rather than allocating exactly the number of blocks needed, we allocated a few extra blocks ahead of time. Originally, this amortized the overhead of block allocation.
However, I recently found that this is no longer the case (for the reasons described in the PR), and pre-allocation only complicates the logic without any performance advantage. So I'm deleting it in this PR.
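For readers unfamiliar with the mechanism, here is a minimal sketch of the block-count math behind pre-allocation. The names (`BLOCK_SIZE`, `blocks_needed`) and the 16-token block size are illustrative assumptions, not vLLM's actual `KVCacheManager` API.

```python
# Illustrative sketch only; hypothetical helper, not vLLM's KVCacheManager.
from math import ceil

BLOCK_SIZE = 16  # tokens per KV cache block (assumed value)


def blocks_needed(num_tokens: int, num_preallocate_tokens: int = 0) -> int:
    """Blocks to allocate for a request that currently holds `num_tokens` tokens."""
    return ceil((num_tokens + num_preallocate_tokens) / BLOCK_SIZE)


# Exact allocation: a new block is needed roughly every BLOCK_SIZE decode steps.
print(blocks_needed(100))                             # -> 7
# Pre-allocation: grab blocks for ~64 future tokens now, so the next several
# decode steps can append tokens without touching the block pool again.
print(blocks_needed(100, num_preallocate_tokens=64))  # -> 11
```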
@comaniac Good question. I was actually writing the PR description 😂 PTAL.
LGTM
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Frieda (Jingying) Huang <jingyingfhuang@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
This PR removes the pre-allocation logic in the KV cache manager, which seems to be unnecessary in current vLLM.
The pre-allocation logic was introduced in one of the first few PRs of the V1 re-architecture.
Originally, it had two purposes:
However, it seems these two ideas are no longer valid:
- … `allocate_slots` as a whole. However, pre-allocation does not save much of them.

Moreover, pre-allocation does not go very well with the new hybrid memory allocator and the KV cache connector. As it only complicates the logic without performance advantages, I delete it in this PR.
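To make the simplification concrete, below is a rough before/after sketch of the kind of block-count arithmetic the pre-allocation path added. All names and the 16-token block size are assumptions for illustration; this is not the actual `allocate_slots` code.

```python
# Hypothetical sketch of the allocation math, not vLLM's real implementation.
from math import ceil

BLOCK_SIZE = 16  # tokens per KV cache block (assumed value)


def new_blocks_with_preallocation(num_tokens: int,
                                  num_allocated_blocks: int,
                                  num_preallocate_tokens: int,
                                  max_model_len: int) -> int:
    # Old behavior (sketch): round the request's need up by a pre-allocation
    # margin, while never allocating past the model's maximum length.
    target_tokens = min(num_tokens + num_preallocate_tokens, max_model_len)
    return max(0, ceil(target_tokens / BLOCK_SIZE) - num_allocated_blocks)


def new_blocks_exact(num_tokens: int, num_allocated_blocks: int) -> int:
    # New behavior (sketch): allocate exactly as many blocks as the tokens need.
    return max(0, ceil(num_tokens / BLOCK_SIZE) - num_allocated_blocks)


# A request with 100 tokens that already holds 7 blocks:
print(new_blocks_with_preallocation(100, 7, 64, max_model_len=2048))  # -> 4
print(new_blocks_exact(100, 7))                                       # -> 0
```

With the extra rounding gone, only the exact-size path remains, which is consistent with the benchmark results below showing no throughput change against main.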
Performance
`python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B --dataset <sharegpt> --num-prompts 5000`
main: 53.42 reqs/s (0 preemptions), this PR: 53.41 reqs/s (0 preemptions)
`python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B --input-len 1000 --output-len 10000 --num-prompts 100`
main: 0.22 reqs/s (96 preemptions), this PR: 0.22 reqs/s (95 preemptions)