
Conversation

@WoosukKwon
Collaborator

@WoosukKwon WoosukKwon commented Apr 21, 2025

This PR removes the pre-allocation logic in the KV cache manager, which seems to be unnecessary in current vLLM.

The pre-allocation logic was introduced in one of the first few PRs for the V1 re-architecture.
Originally, it had two purposes:

  1. As written in the code comment, we wanted to reduce the overhead of memory allocation (more precisely, of allocating new blocks). With pre-allocation, we could amortize this overhead by allocating new blocks for a request roughly every 64 steps instead of on every single step (see the sketch after this list).
  2. While not written in the comment, its other purpose was to avoid frequent preemptions, where new requests are scheduled but then preempted within the next few steps. We thought pre-allocation could reduce the chance of this happening.
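
For illustration, here is a minimal sketch of the pre-allocation rule described in point 1 (the names BLOCK_SIZE, NUM_PREALLOCATE_TOKENS, cdiv, and num_blocks_to_allocate are illustrative assumptions, not the actual vLLM code): the token count is padded by a pre-allocation window so that a request needs new blocks only about once every 64 decode steps.

BLOCK_SIZE = 16              # tokens per KV cache block (assumed value)
NUM_PREALLOCATE_TOKENS = 64  # pre-allocate up to 64 extra tokens' worth of blocks


def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return -(-a // b)


def num_blocks_to_allocate(num_tokens: int, preallocate: bool) -> int:
    if preallocate:
        # Pad the token count by the pre-allocation window, so that new blocks
        # are needed only about once every NUM_PREALLOCATE_TOKENS decode steps.
        num_tokens += NUM_PREALLOCATE_TOKENS
    return cdiv(num_tokens, BLOCK_SIZE)


print(num_blocks_to_allocate(17, preallocate=False))  # 2 blocks: exactly what is needed
print(num_blocks_to_allocate(17, preallocate=True))   # 6 blocks: 4 extra blocks reserved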

However, it seems these two ideas are no longer valid:

  1. Since @comaniac introduced a doubly linked list for the free blocks, the overhead of block allocation has become marginal. Prior to that PR, we used inefficient list operations to manage the free blocks, which incurred high overhead whenever block allocation happened. With the doubly linked list, the per-allocation overhead is small, and so is the saving from pre-allocation (a rough sketch of this data structure follows this list). That said, allocate_slots as a whole still has substantial overhead; pre-allocation just does not save much of it.
  2. It turns out that pre-allocation does not meaningfully reduce the chance of preemption. We still need to 1) investigate how much preemption affects overall performance, and 2) find a better way to mitigate it.
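
As a rough sketch of why point 1 holds, the snippet below shows an illustrative doubly linked free-block queue (Block and FreeBlockQueue are made-up names, not vLLM's actual FreeKVCacheBlockQueue). Both taking a block from the front and removing an arbitrary block, e.g. a cached block that gets reused, are O(1), so the per-step cost of allocating blocks on demand is marginal.

from typing import List, Optional


class Block:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.prev: Optional["Block"] = None
        self.next: Optional["Block"] = None


class FreeBlockQueue:
    def __init__(self, blocks: List[Block]):
        self.head: Optional[Block] = None
        self.tail: Optional[Block] = None
        for block in blocks:
            self.append(block)

    def append(self, block: Block) -> None:
        # Link the block at the tail of the queue.
        block.prev, block.next = self.tail, None
        if self.tail is None:
            self.head = block
        else:
            self.tail.next = block
        self.tail = block

    def popleft(self) -> Block:
        # Take the block that was freed longest ago from the front.
        block = self.head
        assert block is not None, "out of free KV cache blocks"
        self.remove(block)
        return block

    def remove(self, block: Block) -> None:
        # O(1) unlink; no list scan is needed.
        if block.prev is None:
            self.head = block.next
        else:
            block.prev.next = block.next
        if block.next is None:
            self.tail = block.prev
        else:
            block.next.prev = block.prev
        block.prev = block.next = None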

Moreover, pre-allocation does not play well with the new hybrid memory allocator or the KV cache connector. Since it only complicates the logic without any performance advantage, this PR removes it.

Performance

  • python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B --dataset <sharegpt> --num-prompts 5000
    main: 53.42 reqs/s (0 preemption), this PR: 53.41 reqs/s (0 preemption)
  • python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B --input-len 1000 --output-len 10000 --num-prompts 100
    main: 0.22 reqs/s (96 preemptions), this PR: 0.22 reqs/s (95 preemptions)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 21, 2025
@WoosukKwon WoosukKwon marked this pull request as ready for review April 22, 2025 05:28
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 22, 2025
@comaniac
Collaborator

What's the reason for this change? I thought pre-allocation was a feature that you heavily promoted when introducing V1?

# Test case 1: Requires additional lookahead tokens
kv_cache_manager = KVCacheManager(kv_cache_config=config,
                                  max_model_len=100,
                                  num_preallocate_tokens=0)
Contributor

What is num_preallocate_tokens?

Collaborator Author

@yinghai I've updated the PR description.

The idea is that, when allocating KV cache blocks for a request, rather than allocating exactly the number of blocks needed, we allocated a few extra blocks ahead of time. Originally, this amortized the overhead of block allocation.
However, I recently found that this is no longer the case (for the reasons described in the PR), and pre-allocation only complicates the logic without any performance advantage. So I'm removing it in this PR (a simplified sketch of the resulting allocation path follows).
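
As a hedged sketch of the simplified behavior once pre-allocation is gone (num_new_blocks_needed, BLOCK_SIZE, and cdiv are illustrative names, not the actual allocate_slots code), the manager simply allocates the exact number of new blocks needed for the tokens being scheduled:

BLOCK_SIZE = 16  # tokens per KV cache block (assumed value)


def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return -(-a // b)


def num_new_blocks_needed(num_computed_tokens: int, num_new_tokens: int,
                          num_allocated_blocks: int) -> int:
    # Without pre-allocation, only the blocks required to hold the scheduled
    # tokens are requested from the free-block queue.
    total_tokens = num_computed_tokens + num_new_tokens
    required_blocks = cdiv(total_tokens, BLOCK_SIZE)
    return max(0, required_blocks - num_allocated_blocks)


# A request holding 100 tokens in 7 blocks and scheduling 1 decode token needs
# no new block; once it reaches 112 tokens, the next token needs 1 more block.
print(num_new_blocks_needed(100, 1, 7))  # 0
print(num_new_blocks_needed(112, 1, 7))  # 1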

@WoosukKwon
Collaborator Author

What's the reason for this change? I thought pre-allocation was a feature that you heavily promoted when introducing V1?

@comaniac Good question. I was actually just writing the PR description 😂 PTAL.
I will also add benchmark results.

Collaborator

@comaniac comaniac left a comment

LGTM

@WoosukKwon WoosukKwon merged commit c4ab9f3 into main Apr 22, 2025
65 checks passed
@WoosukKwon WoosukKwon deleted the no-preallocate branch April 22, 2025 07:52
frieda-huang pushed a commit to frieda-huang/vllm that referenced this pull request Apr 23, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Frieda (Jingying) Huang <jingyingfhuang@gmail.com>
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>