Conversation

@zeroorhero commented Apr 7, 2025

TL;DR

In V1, we have already implemented storing the KV cache in CPU memory (#13377). However, when multiple vLLM nodes are involved, the router needs to forward requests based on cache awareness: if one node has too many active requests, requests that would have hit its cache may be forwarded to other nodes and recomputed. To avoid this, we can pool CPU memory across nodes, but performing cross-node KV cache swap-in and swap-out operations can hurt inference performance. Inspired by NVIDIA's open-source project Dynamo (https://github.com/ai-dynamo/dynamo), we implemented an asynchronous KV cache transmission scheme. In our tests, the performance of asynchronous KV cache transfer is on par with, or even better than, storing the cache locally in CPU memory (#13377).

Swap Strategy

  • Memory Pool -> GPU: happens when a block of a request hits the cache.
  • GPU -> Memory Pool: happens when a newly generated block does not yet exist in the memory pool.

Swap-in is performed at the granularity of a single request: only after all cache-hit blocks of a request have been swapped in can the request be scheduled normally for inference. Swap-out, by contrast, happens immediately for each newly generated block.
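A rough sketch of this policy (illustrative names only; Block and MemoryPool here are not the actual classes in this PR):

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    hash: str            # content hash identifying a KV block

@dataclass
class MemoryPool:
    blocks: set = field(default_factory=set)

    def contains(self, block_hash: str) -> bool:
        return block_hash in self.blocks

def plan_swap_in(request_blocks: list[Block], pool: MemoryPool) -> list[Block]:
    """Blocks of a request that hit the pooled cache and must be copied
    Memory Pool -> GPU before the request can be scheduled."""
    return [b for b in request_blocks if pool.contains(b.hash)]

def needs_swap_out(new_block: Block, pool: MemoryPool) -> bool:
    """A newly generated block is copied GPU -> Memory Pool only if the
    pool does not already hold it."""
    return not pool.contains(new_block.hash)
```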

Implementation

  1. Replace the previous step method with async_step. This method first retrieves the pending swap-in and swap-out requests and blocks, checks whether they have completed, and records the completed ones.
  2. Pass the requests that have completed swap-in, together with the blocks that have been swapped out, to the schedule method. During scheduling, prioritize requests that have finished swap-in, then schedule requests from the waiting queue. In addition, collect the requests that require swap-in and the blocks to be swapped out, and return them via schedule_out.
  3. For the requests in schedule_out that require swap-in, send an asynchronous swap-in request to the model runner. Since Python threads cannot fully exploit multiple cores, a better approach is to spawn a dedicated thread inside each swapper implementation to handle the transfer; here, we simply put the request into a queue and return immediately (see the sketch after this list).
  4. Perform the model inference step.
  5. Swap out the newly generated blocks. This stage also runs asynchronously, handled by the swapper's dedicated thread.
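A minimal sketch of the queue-plus-worker pattern described in steps 3 and 5 (hypothetical names such as AsyncSwapper and backend; not this PR's actual classes):

```python
import queue
import threading

class AsyncSwapper:
    """Hands swap requests to a dedicated worker thread so the scheduling
    loop only enqueues work and returns immediately (hypothetical sketch)."""

    def __init__(self, backend):
        self.backend = backend            # key-value store with put()/get(), e.g. Redis-backed
        self.tasks: queue.Queue = queue.Queue()
        self.finished: dict = {}          # task_id -> fetched bytes (swap-in) or None (swap-out)
        self.lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, task_id, op, key, payload=None):
        """Called from the scheduler / async_step: enqueue and return at once."""
        self.tasks.put((task_id, op, key, payload))

    def poll_finished(self) -> dict:
        """Called at the start of async_step to collect completed transfers."""
        with self.lock:
            done, self.finished = self.finished, {}
        return done

    def _worker(self):
        while True:
            task_id, op, key, payload = self.tasks.get()
            if op == "swap_out":
                self.backend.put(key, payload)    # GPU -> memory pool
                result = None
            else:
                result = self.backend.get(key)    # memory pool -> GPU (bytes to deserialize)
            with self.lock:
                self.finished[task_id] = result
```

In practice the worker could equally be a native (C or Rust) thread to sidestep the GIL.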

Benchmark

This PR currently implements only Redis and Valkey backends, and their observed performance was suboptimal. Our internal benchmarking nevertheless shows the approach achieving performance comparable to (or even better than) the local CPU implementation in vLLM PR #13377. Future iterations could extend support to other open-source distributed storage systems for further optimization.
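For reference, a minimal Redis/Valkey-backed block store could look roughly like the following (a sketch using the redis-py client; the class and method names are illustrative, not this PR's actual interface):

```python
import redis  # Valkey speaks the Redis protocol, so the same client can be used

class RedisSwapperBackend:
    """Minimal key-value backend: serialized block bytes in, block bytes out."""

    def __init__(self, host: str = "localhost", port: int = 6379):
        self.client = redis.Redis(host=host, port=port)

    def put(self, key: str, value: bytes) -> None:
        self.client.set(key, value)

    def get(self, key: str) -> bytes | None:
        return self.client.get(key)

    def exists(self, key: str) -> bool:
        return bool(self.client.exists(key))
```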

github-actions bot commented Apr 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot commented Apr 7, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @zeroorhero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Apr 7, 2025
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
for i in range(len(self.kv_caches)):
layer_cache = self.kv_caches[i]
key_cache = layer_cache[0]
val_cache = layer_cache[1]
Contributor:
Could you please support MLA, since DeepSeek is powerful and popular?

Author:

We'll wait for the community's feedback; it can be implemented later.

mergify bot added the tpu (Related to Google TPUs) label Apr 9, 2025
mergify bot commented Apr 9, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @zeroorhero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Apr 9, 2025
key_cache_bytes = self.swapper.get(key_cache_key)
val_cache_bytes = self.swapper.get(val_cache_key)

gpu_key_cache = tensor_from_bytes(key_cache_bytes).to(
@singzhou commented Apr 10, 2025:
Hi there, it seems you start two Python threads for swapping in and out, respectively. I think there is a chance that performance will be significantly affected by the GIL. Perhaps consider starting two native threads to trigger the tensor loading/offloading, to get rid of the GIL?

Author:

Yes, that could be done in C or Rust threads.
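For context, the excerpted code above reads serialized tensors back from the swapper via tensor_from_bytes. A pair of such helpers could be sketched like this (hypothetical implementations matching the names in the excerpt, using torch.save / torch.load over an in-memory buffer):

```python
import io
import torch

def tensor_to_bytes(tensor: torch.Tensor) -> bytes:
    """Serialize a tensor (moved to CPU first) so it can be stored in the memory pool."""
    buffer = io.BytesIO()
    torch.save(tensor.cpu(), buffer)
    return buffer.getvalue()

def tensor_from_bytes(data: bytes) -> torch.Tensor:
    """Rebuild a CPU tensor from its serialized form; the caller then copies it to the GPU."""
    return torch.load(io.BytesIO(data))
```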

mergify bot removed the tpu (Related to Google TPUs) label Apr 10, 2025
@maobaolong (Contributor) commented:

@zeroorhero Thanks for introducing this feature! Offloading is useful for reducing TTFT in the multi-round QA scenario.

I have some high-level questions:

  • Could you reuse the existing KV transfer connectors, mapping swap-in -> recv and swap-out -> store?
  • Could you show the whole picture when P/D disaggregation and KV cache offloading are both enabled?

@zeroorhero (Author) replied:

> @zeroorhero Thanks for introducing this feature! Offloading is useful for reducing TTFT in the multi-round QA scenario.
>
> I have some high-level questions:
>
> • Could you reuse the existing KV transfer connectors, mapping swap-in -> recv and swap-out -> store?
> • Could you show the whole picture when P/D disaggregation and KV cache offloading are both enabled?

Integrating the swapper into the connector is entirely feasible, but the community currently doesn't show much support for KV cache offloading. I believe the most effective approach in the P/D separation scenario is to transfer the KV cache directly between prefill and decode nodes over the RDMA network, as implemented in Dynamo or SGLang. Meanwhile, the KV store can be integrated into the decode or prefill nodes to effectively expand the available VRAM.

@maobaolong (Contributor) replied:

> > @zeroorhero Thanks for introducing this feature! Offloading is useful for reducing TTFT in the multi-round QA scenario.
> >
> > I have some high-level questions:
> >
> > • Could you reuse the existing KV transfer connectors, mapping swap-in -> recv and swap-out -> store?
> > • Could you show the whole picture when P/D disaggregation and KV cache offloading are both enabled?
>
> Integrating the swapper into the connector is entirely feasible, but the community currently doesn't show much support for KV cache offloading. I believe the most effective approach in the P/D separation scenario is to transfer the KV cache directly between prefill and decode nodes over the RDMA network, as implemented in Dynamo or SGLang. Meanwhile, the KV store can be integrated into the decode or prefill nodes to effectively expand the available VRAM.

@zeroorhero Yeah, I agree! The best way is to reuse the existing KV connector for the swapper purpose, to avoid duplicating code along the lines of swap-in -> recv and swap-out -> store. That leaves two things to think about:

  1. How to integrate the swapper into the connectors.
  2. How a swapper-purpose connector and a P/D-transfer-purpose connector can coexist.

Once these two problems are solved, the changed code would shrink considerably. I believe the community could then support this feature, since it would introduce few changes rather than 1000+ changed lines.

Any thoughts?

@zeroorhero (Author) replied:

> > > @zeroorhero Thanks for introducing this feature! Offloading is useful for reducing TTFT in the multi-round QA scenario.
> > >
> > > I have some high-level questions:
> > >
> > > • Could you reuse the existing KV transfer connectors, mapping swap-in -> recv and swap-out -> store?
> > > • Could you show the whole picture when P/D disaggregation and KV cache offloading are both enabled?
> >
> > Integrating the swapper into the connector is entirely feasible, but the community currently doesn't show much support for KV cache offloading. I believe the most effective approach in the P/D separation scenario is to transfer the KV cache directly between prefill and decode nodes over the RDMA network, as implemented in Dynamo or SGLang. Meanwhile, the KV store can be integrated into the decode or prefill nodes to effectively expand the available VRAM.
>
> @zeroorhero Yeah, I agree! The best way is to reuse the existing KV connector for the swapper purpose, to avoid duplicating code along the lines of swap-in -> recv and swap-out -> store. That leaves two things to think about:
>
> 1. How to integrate the swapper into the connectors.
> 2. How a swapper-purpose connector and a P/D-transfer-purpose connector can coexist.
>
> Once these two problems are solved, the changed code would shrink considerably. I believe the community could then support this feature, since it would introduce few changes rather than 1000+ changed lines.
>
> Any thoughts?

Alright, I'll take a look at how to solve the problems above.
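As a rough illustration of the mapping discussed above (swap-in -> recv, swap-out -> store), a connector-style adapter might look like the following sketch; the interface and names here are hypothetical and not vLLM's actual connector API:

```python
from abc import ABC, abstractmethod

class KVTransferInterface(ABC):
    """Stand-in for a connector-style interface (hypothetical, not vLLM's actual API)."""

    @abstractmethod
    def recv(self, key: str) -> bytes | None: ...

    @abstractmethod
    def store(self, key: str, value: bytes) -> None: ...

class SwapperOverConnector:
    """Thin adapter expressing the swapper in connector terms:
    swap-in maps to recv, swap-out maps to store."""

    def __init__(self, connector: KVTransferInterface):
        self.connector = connector

    def swap_in(self, key: str) -> bytes | None:
        return self.connector.recv(key)       # memory pool / remote store -> local

    def swap_out(self, key: str, value: bytes) -> None:
        self.connector.store(key, value)      # local -> memory pool / remote store
```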

github-actions bot:

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale (Over 90 days of inactivity) label Aug 13, 2025
github-actions bot:

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

github-actions bot closed this Sep 13, 2025
Labels: ci/build, needs-rebase, stale (Over 90 days of inactivity), v1