[V1][Core] Add async kv cache offload #16159
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
Force-pushed from d838f72 to 9515240.
```python
for i in range(len(self.kv_caches)):
    layer_cache = self.kv_caches[i]
    key_cache = layer_cache[0]
    val_cache = layer_cache[1]
```
Could you please support MLA, since DeepSeek is powerful and popular?
Let's wait for the community's feedback; it can be implemented later.
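For illustration, here is a minimal sketch of how the per-layer loop from the diff above might be generalized to MLA models, which keep a single compressed latent cache per layer instead of a separate key/value pair. The `mla_layer_ids` argument and the key-naming scheme are hypothetical and not part of this PR.

```python
import torch


def iter_offload_tensors(kv_caches: list[torch.Tensor], mla_layer_ids: set[int]):
    """Yield (cache_key_suffix, tensor) pairs to offload for each layer.

    Standard attention layers stack key/value along dim 0 (as in the diff
    above); MLA layers store one compressed latent cache, so there is
    nothing to split and the tensor is offloaded as a single unit.
    """
    for i, layer_cache in enumerate(kv_caches):
        if i in mla_layer_ids:
            yield f"layer{i}.latent", layer_cache
        else:
            yield f"layer{i}.key", layer_cache[0]
            yield f"layer{i}.value", layer_cache[1]
```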
This pull request has merge conflicts that must be resolved before it can be merged.
```python
key_cache_bytes = self.swapper.get(key_cache_key)
val_cache_bytes = self.swapper.get(val_cache_key)

gpu_key_cache = tensor_from_bytes(key_cache_bytes).to(
```
Hi there, it seems you start two Python threads for swapping in and out, respectively. I think there is a chance that performance will be heavily affected by the GIL. Perhaps consider starting two native threads to trigger the tensor loading/offloading and get rid of the GIL?
Yes, it can be done in C or Rust threads.
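To make the concern concrete, here is a rough sketch of the two-worker structure under discussion, assuming a hypothetical `swapper` object with blocking `get`/`put` methods (the queue and worker names are illustrative, not the PR's). Under CPython these threads only overlap with the forward pass while the underlying socket I/O or CUDA copies release the GIL, which is why native (C/Rust) worker threads are being suggested.

```python
import queue
import threading

swap_in_q = queue.Queue()   # items: (cache_key, on_ready callback)
swap_out_q = queue.Queue()  # items: (cache_key, serialized bytes)


def swap_in_worker(swapper):
    # This Python-level loop holds the GIL except while swapper.get()
    # (socket I/O) or a subsequent CUDA H2D copy releases it; any
    # pure-Python (de)serialization here serializes against the main thread.
    while True:
        cache_key, on_ready = swap_in_q.get()
        payload = swapper.get(cache_key)  # blocking read from the KV store
        on_ready(payload)                 # hand the bytes back to the runner


def swap_out_worker(swapper):
    while True:
        cache_key, payload = swap_out_q.get()
        swapper.put(cache_key, payload)   # blocking write to the KV store


def start_workers(swapper):
    # Moving these loops into a C/Rust extension (or a separate process)
    # removes the GIL contention entirely, at the cost of a more complex build.
    for target in (swap_in_worker, swap_out_worker):
        threading.Thread(target=target, args=(swapper,), daemon=True).start()
```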
@zeroorhero Thanks for introducing this feature! Offloading is useful for reducing TTFT in the multi-round QA scenario. I have some high-level questions.
Integrating the swapper into the connector is entirely feasible, but the community currently doesn’t show much support for KV cache offloading. I believe the most effective approach in the PD separation scenario is to directly transfer the KV cache between PD nodes via the RDMA network, as implemented in Dynamo or SGLang. Meanwhile, the KV store can be integrated into the decoder or prefill nodes to expand the available VRAM for use.
@zeroorhero Yeah, I agree! The best way is to reuse the existing KVConnector for the swapper, so we avoid duplicating similar code.
After these two problems are solved, the changed code would shrink a lot. Any thoughts?
Alright, I'll take a look at how to solve the above problems.
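To sketch the direction agreed on above, one possible shape for folding the swapper behind a connector-style interface is shown below. The class and method names are purely illustrative and are not taken from vLLM's actual KV connector API.

```python
from abc import ABC, abstractmethod

import torch


class KVSwapperConnector(ABC):
    """Hypothetical connector-style facade over a remote KV store."""

    @abstractmethod
    def lookup(self, block_hashes: list[str]) -> int:
        """Return how many leading blocks are already present in the store."""

    @abstractmethod
    def load_async(self, block_hashes: list[str], dst: list[torch.Tensor]) -> None:
        """Kick off asynchronous swap-in of the given blocks into GPU tensors."""

    @abstractmethod
    def save_async(self, block_hashes: list[str], src: list[torch.Tensor]) -> None:
        """Kick off asynchronous swap-out of newly produced blocks."""

    @abstractmethod
    def wait_for_load(self, request_id: str) -> bool:
        """Return True once every cache-hit block of the request has landed."""
```

Keeping the scheduler-side lookup separate from the worker-side asynchronous load/save mirrors the split in this PR between deciding what to swap and actually moving the data.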
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
TL;DR
In V1, we've already implemented storing the KV cache in CPU memory (#13377). However, when multiple vLLM nodes are involved, the router needs to forward requests based on cache awareness. If one node has too many active requests, requests that should have hit the cache might get forwarded to other nodes for recomputation. To avoid this, we can pool the CPU memory across nodes, but performing cross-node KV cache swap-in and swap-out operations can hurt inference performance. Inspired by NVIDIA's open-source project Dynamo (https://github.com/ai-dynamo/dynamo), this PR implements an asynchronous KV cache transmission scheme. In our tests, the performance of asynchronous KV cache transfer is on par with, or even better than, storing the cache locally in CPU memory (#13377).
Swap Strategy
Memory Pool -> GPU: happens when a block in a certain request hits the cache.
GPU -> Memory Pool: happens when a new block that does not exist in the memory pool is generated.
During swap-in, operations are performed at the granularity of a single request. Once all cache-hit blocks in a request have been swapped in, the request can be scheduled normally for inference. Meanwhile, each newly generated block is immediately swapped out.
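A minimal sketch of this strategy, assuming a hypothetical `swapper` with asynchronous load/save and content-hash lookup; it illustrates the flow described above rather than the PR's actual scheduler code.

```python
def on_request_arrival(request, swapper, scheduler):
    # Memory pool -> GPU: every prefix block of the request that hits the
    # remote cache is loaded asynchronously. The request stays unschedulable
    # until the scheduler has been notified for each pending swap-in.
    hit_blocks = [b for b in request.blocks if swapper.contains(b.hash)]
    request.pending_swap_ins = len(hit_blocks)
    for block in hit_blocks:
        swapper.load_async(block.hash, dst=block.gpu_tensor,
                           on_done=lambda: scheduler.on_swap_in_done(request))


def on_block_full(block, swapper):
    # GPU -> memory pool: each newly filled block that the pool has not seen
    # yet is swapped out immediately, off the decoding critical path.
    if not swapper.contains(block.hash):
        swapper.save_async(block.hash, src=block.gpu_tensor)
```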
Implementation
Benchmark
This PR currently only implements Redis and Valkey support, but the observed performance was suboptimal. However, our internal benchmarking shows it achieves comparable (or even superior) performance to the local CPU implementation in vLLM PR #13377. Future iterations could extend support to other open-source distributed storage systems for further optimization.
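For reference, here is a bare-bones sketch of the kind of Redis/Valkey backend benchmarked here, using the standard `redis` Python client. The key naming and tensor (de)serialization are illustrative assumptions, not the PR's actual wire format.

```python
import redis
import torch


class RedisSwapperBackend:
    """Minimal get/put backend over Redis (Valkey is wire-compatible)."""

    def __init__(self, host: str = "localhost", port: int = 6379) -> None:
        self._client = redis.Redis(host=host, port=port)

    def put(self, cache_key: str, tensor: torch.Tensor) -> None:
        # Store raw bytes; in practice dtype/shape metadata would be stored
        # alongside, and dtypes numpy cannot represent (e.g. bfloat16) would
        # need a byte-level reinterpretation first.
        cpu = tensor.detach().to("cpu").contiguous()
        self._client.set(cache_key, cpu.numpy().tobytes())

    def get(self, cache_key: str, like: torch.Tensor) -> torch.Tensor | None:
        # `like` supplies the expected dtype/shape/device for reconstruction.
        raw = self._client.get(cache_key)
        if raw is None:
            return None
        flat = torch.frombuffer(bytearray(raw), dtype=like.dtype)
        return flat.view(like.shape).to(like.device)
```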