[V1] Remove pre-allocation for KV cache #16941
Conversation
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
What's the reason for this change? I thought pre-allocation was a feature you heavily promoted when introducing V1?
# Test case 1: Requires additional lookahead tokens
kv_cache_manager = KVCacheManager(kv_cache_config=config,
                                  max_model_len=100,
                                  num_preallocate_tokens=0)
What is `num_preallocate_tokens`?
@yinghai I've updated the PR description.
The idea was that, when allocating KV cache blocks for a request, rather than allocating exactly the number of blocks needed, we allocated a few extra blocks ahead of time. Originally, this amortized the overhead of block allocation.
However, I recently found that this is no longer the case (for the reasons described in the PR), and pre-allocation only complicates the logic without any performance advantage. So I'm deleting it in this PR.
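For readers unfamiliar with the mechanism, here is a minimal sketch of the block-count math behind pre-allocation. The names (`BLOCK_SIZE`, `blocks_needed`) and the 16-token block size are illustrative assumptions, not vLLM's actual `KVCacheManager` API.

```python
# Illustrative sketch only; hypothetical helper, not vLLM's KVCacheManager.
from math import ceil

BLOCK_SIZE = 16  # tokens per KV cache block (assumed value)


def blocks_needed(num_tokens: int, num_preallocate_tokens: int = 0) -> int:
    """Blocks to allocate for a request that currently holds `num_tokens` tokens."""
    return ceil((num_tokens + num_preallocate_tokens) / BLOCK_SIZE)


# Exact allocation: a new block is needed roughly every BLOCK_SIZE decode steps.
print(blocks_needed(100))                             # -> 7
# Pre-allocation: grab blocks for ~64 future tokens now, so the next several
# decode steps can append tokens without touching the block pool again.
print(blocks_needed(100, num_preallocate_tokens=64))  # -> 11
```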
@comaniac Good question. I was actually writing the PR description 😂 PTAL.
LGTM
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Frieda (Jingying) Huang <jingyingfhuang@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
This PR removes the pre-allocation logic in the KV cache manager, which seems to be unnecessary in current vLLM.
The pre-allocation logic was introduced in one of the first few PRs of the V1 re-architecture.
Originally, it had two purposes:
However, it seems these two ideas are no longer valid:
- … `allocate_slots` as a whole. However, pre-allocation does not save much of them.

Moreover, pre-allocation does not go very well with the new hybrid memory allocator and the KV cache connector. As it only complicates the logic without performance advantages, I delete it in this PR.
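To make the simplification concrete, below is a rough before/after sketch of the kind of block-count arithmetic the pre-allocation path added. All names and the 16-token block size are assumptions for illustration; this is not the actual `allocate_slots` code.

```python
# Hypothetical sketch of the allocation math, not vLLM's real implementation.
from math import ceil

BLOCK_SIZE = 16  # tokens per KV cache block (assumed value)


def new_blocks_with_preallocation(num_tokens: int,
                                  num_allocated_blocks: int,
                                  num_preallocate_tokens: int,
                                  max_model_len: int) -> int:
    # Old behavior (sketch): round the request's need up by a pre-allocation
    # margin, while never allocating past the model's maximum length.
    target_tokens = min(num_tokens + num_preallocate_tokens, max_model_len)
    return max(0, ceil(target_tokens / BLOCK_SIZE) - num_allocated_blocks)


def new_blocks_exact(num_tokens: int, num_allocated_blocks: int) -> int:
    # New behavior (sketch): allocate exactly as many blocks as the tokens need.
    return max(0, ceil(num_tokens / BLOCK_SIZE) - num_allocated_blocks)


# A request with 100 tokens that already holds 7 blocks:
print(new_blocks_with_preallocation(100, 7, 64, max_model_len=2048))  # -> 4
print(new_blocks_exact(100, 7))                                       # -> 0
```

With the extra rounding gone, only the exact-size path remains, which is consistent with the benchmark results below showing no throughput change against main.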
Performance
`python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B --dataset <sharegpt> --num-prompts 5000`
main: 53.42 reqs/s (0 preemptions), this PR: 53.41 reqs/s (0 preemptions)
`python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B --input-len 1000 --output-len 10000 --num-prompts 100`
main: 0.22 reqs/s (96 preemptions), this PR: 0.22 reqs/s (95 preemptions)