
Conversation

orozery
Contributor

@orozery orozery commented Sep 4, 2025

This is the final PR enabling CPU offloading in v1.

Concludes RFC #19854.
Depends on #20075, #21448, #22595.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: CPU offloading for v1. The implementation is extensive, adding a new offloading framework with managers, specs, handlers, and a dedicated KV connector. The code is well-structured, with a clear separation of concerns between scheduler and worker logic. My review focuses on critical aspects of reliability and resource management, and I've identified a few high-impact issues that should be addressed to ensure the robustness of this new feature.

finished_recving = set()
for job_id, success in self.worker.get_finished():
    # we currently do not support job failures
    assert success
Contributor

critical

Using assert success is risky as it will crash the worker process if an offloading transfer fails for any reason (e.g., I/O error, out of space). This can bring down the entire system. Failures should be handled more gracefully, for instance by logging a critical error and cleaning up the state for the failed job, without crashing the worker. While the comment indicates failures are not supported, using an assert is not a robust way to enforce this in production code.
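For illustration, a rough sketch of more graceful handling (the _abort_job cleanup helper is hypothetical, not something this PR provides):

finished_recving = set()
for job_id, success in self.worker.get_finished():
    if not success:
        # hypothetical cleanup: log the failure and drop state for the failed job
        logger.error("Offloading transfer for job %s failed", job_id)
        self._abort_job(job_id)
        continue
    finished_recving.add(job_id)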

Comment on lines 284 to 282
if store_output is None:
    logger.warning("Cannot store %s blocks", num_new_blocks)
    break
Contributor

high

Using break here will prematurely exit the loop that iterates over scheduled requests. If prepare_store fails for one request (by returning None), subsequent requests in the same scheduling step will not be considered for offloading. This could lead to offloading starvation for other requests. You should use continue to proceed to the next request in the loop.

Suggested change
if store_output is None:
    logger.warning("Cannot store %s blocks", num_new_blocks)
    break
if store_output is None:
    logger.warning("Cannot store %s blocks", num_new_blocks)
    continue

@ApostaC
Collaborator

ApostaC commented Sep 12, 2025

Is this PR the combination of #19848, #20075, #21448, and #22595?
I've reviewed the above 4 PRs, and I'm just wondering whether there's anything new in this PR?

@orozery
Contributor Author

orozery commented Sep 12, 2025

Is this PR the combination of #19848, #20075, #21448, and #22595? I've reviewed the above 4 PRs, and I'm just wondering whether there's anything new in this PR?

So each PR actually introduces a single new commit.
For this PR, the new commit is just the registration of the CPU implementation for the offloading connector.

Member

@njhill njhill left a comment


LGTM .. needs the block hash type change of course (and I assume that affects the other PRs too...)

Comment on lines 112 to 67
attn_backend = get_attn_backend(
    self.vllm_config.model_config.get_head_size(),
    self.vllm_config.model_config.dtype,
    self.vllm_config.cache_config.cache_dtype,
    self.gpu_block_size,
    self.vllm_config.model_config.is_attention_free,
    use_mla=self.vllm_config.model_config.use_mla)
Member

I feel like we should add a get_attn_backend_from_config(VllmConfig) and use that both here and in the NixlConnector.
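For reference, a rough sketch of the suggested helper (the name, and taking the block size from cache_config, are assumptions rather than existing vLLM API):

def get_attn_backend_from_config(vllm_config: VllmConfig):
    model_config = vllm_config.model_config
    return get_attn_backend(
        model_config.get_head_size(),
        model_config.dtype,
        vllm_config.cache_config.cache_dtype,
        vllm_config.cache_config.block_size,
        model_config.is_attention_free,
        use_mla=model_config.use_mla)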

Member

@njhill njhill left a comment


@orozery sorry, I think some of the comments here actually apply to an earlier commit

Comment on lines 53 to 57
# allocate fresh blocks
blocks: list[BlockStatus] = []
for _ in range(num_fresh_blocks):
    blocks.append(CPUBlockStatus(self.num_allocated_blocks))
    self.num_allocated_blocks += 1
Member

So there will only be "fresh" blocks temporarily until the cache is full?

Contributor Author

Right

Comment on lines 168 to 169
for blk_hash in request.block_hashes[self.block_size_factor -
                                     1::self.block_size_factor]
Member

use islice?

return 0, False

start_block_idx = num_computed_tokens // self.offloaded_block_size
hits = self.manager.lookup(block_hashes[start_block_idx:])
Member

use islice?

Comment on lines 224 to 231
block_hashes = [
    blk_hash.hash_value
    for blk_hash in request.block_hashes[self.block_size_factor -
                                         1::self.block_size_factor]
]
assert len(block_hashes) >= num_blocks

block_hashes = block_hashes[start_block_idx:num_blocks]
Member

Suggested change
block_hashes = [
    blk_hash.hash_value
    for blk_hash in request.block_hashes[self.block_size_factor -
                                         1::self.block_size_factor]
]
assert len(block_hashes) >= num_blocks
block_hashes = block_hashes[start_block_idx:num_blocks]
step = self.block_size_factor
block_hashes = [
    blk_hash.hash_value for blk_hash in itertools.islice(
        request.block_hashes,
        (start_block_idx + 1) * step - 1,
        (num_blocks + 1) * step - 1,
        step)
]

src_specs = self.manager.prepare_load(block_hashes)
dst_specs = [
    GPULoadStoreSpec(gpu_block_id)
    for gpu_block_id in block_ids[num_computed_gpu_blocks:]
Member

use islice?

Contributor Author

I'm changing GPULoadStoreSpec to construct a tensor of block IDs. It needs a list as input, so islice can't be used here.

Member

I may be misunderstanding but this comprehension is creating a list of GPULoadStoreSpec objects, which can't be used to create a tensor directly?

Member

oh sorry, I think I understand: you are saying this is n/a after the latest refactor

Contributor Author

I changed it to create a single GPULoadStoreSpec object, wrapping a tensor.
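For illustration only, a minimal sketch of the shape described above (the dataclass definition and field name are assumptions, not the PR's actual class):

import torch
from dataclasses import dataclass

@dataclass
class GPULoadStoreSpec:
    block_ids: torch.Tensor  # one spec wraps the IDs of all GPU blocks to transfer

block_ids = [3, 7, 9, 12]
num_computed_gpu_blocks = 1
dst_spec = GPULoadStoreSpec(
    torch.tensor(block_ids[num_computed_gpu_blocks:], dtype=torch.int64))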


pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@orozery
Contributor Author

orozery commented Sep 15, 2025

@njhill I switched to using islice.
The downside is you get a one-time iterator.
Because I use it more than once, I need to create several instances.
So this makes the code a bit more complex.
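A quick illustration of the single-use behavior being described (plain Python, unrelated to the PR's code):

from itertools import islice

data = list(range(10))
view = islice(data, 4, None)
print(list(view))  # [4, 5, 6, 7, 8, 9]
print(list(view))  # [] - the iterator is exhausted, so each pass needs a fresh islice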

@njhill
Member

njhill commented Sep 15, 2025

@njhill I switched to using islice. The downside is you get a one-time iterator. Because I use it more than once, I need to create several instances. So this makes the code a bit more complex.

Thanks @orozery I guess I don't follow the downside / extra complexity. E.g.

hits = self.manager.lookup(block_hashes[start_block_idx:])

just becomes

hits = self.manager.lookup(islice(block_hashes, start_block_idx, None))

If/where the sliced list is iterated over multiple times then I agree there may be more to consider.

@orozery
Contributor Author

orozery commented Sep 16, 2025

@njhill I switched to using islice. The downside is you get a one-time iterator. Because I use it more than once, I need to create several instances. So this makes the code a bit more complex.

Thanks @orozery I guess I don't follow the downside / extra complexity. E.g.

hits = self.manager.lookup(block_hashes[start_block_idx:])

just becomes

hits = self.manager.lookup(islice(block_hashes, start_block_idx, None))

If/where the sliced list is iterated over multiple times then I agree there may be more to consider.

For example, this:

https://github.com/vllm-project/vllm/blob/94a0405bfaa32932cdb0e8362250f68587ebfa95/vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py#L237-L251

And:

https://github.com/vllm-project/vllm/blob/94a0405bfaa32932cdb0e8362250f68587ebfa95/vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py#L283-L303

Member
@njhill njhill left a comment

Argh sorry these comments were sitting in pending, just realized I didn't submit it yesterday

src_specs = self.manager.prepare_load(block_hashes)
dst_specs = [
    GPULoadStoreSpec(gpu_block_id)
    for gpu_block_id in block_ids[num_computed_gpu_blocks:]
Member

oh sorry, I think I understand: you are saying this is n/a after the latest refactor

@mergify mergify bot added the kv-connector label Sep 18, 2025
@njhill njhill changed the title v1: CPU offloading [KV offload][5/N] Add CPUOffloadingSpec Sep 18, 2025

mergify bot commented Sep 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 19, 2025
@mergify mergify bot added documentation Improvements or additions to documentation and removed needs-rebase labels Sep 21, 2025
This commit registers a new OffloadingSpec to add CPU offloading support
to the OffloadingConnector.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery
Contributor Author

orozery commented Sep 21, 2025

@njhill I've added a small e2e test + small example in the docs



@pytest.mark.parametrize("cpu_block_size", CPU_BLOCK_SIZES)
def test_cpu_offloading(cpu_block_size: int) -> None:
Member

Is there a way to add some assertions to the test such that it will fail if the offload is not working? Should probably also verify correctness in conjunction with this.

Contributor Author
@orozery orozery Sep 21, 2025

There is a correctness unit test for the transfer function (test_cpu_gpu.py).
There is also a correctness unit test checking that the offloading connector generates the correct transfer addresses for the GPU and the offloaded medium.
I don't know how we can test correctness e2e.

Currently the test here just checks that prompt generation does not crash when using CPU offloading.
It does not verify that any offloading actually occurs.

One way we could verify this is by adding a kv_events_config (like in test_kv_cache_events) and checking for KVEvents with the CPU medium.
I actually started coding that but found it a bit cumbersome, so I decided to defer it until we see whether others think it's worthwhile.

Another option is to verify that latency decreases when we're supposed to hit the CPU cache (after resetting the GPU prefix cache).
We can reduce the variance by, say, repeating this 100 times and verifying that the latency decreased in at least 70 of them.
This would actually be easy to implement (compared to the KVEvents test).
My concern is that even when repeating the test many (e.g. 100) times, it can still be flaky.
Your thoughts?
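For reference, a rough sketch of the latency-comparison idea (the fixtures, helper names, and thresholds are assumptions, not the PR's test):

import time

def _measure(llm, prompt, sampling_params):
    start = time.perf_counter()
    llm.generate(prompt, sampling_params)
    return time.perf_counter() - start

def test_cpu_cache_hit_is_faster(llm, long_prompt, sampling_params):
    cold = _measure(llm, long_prompt, sampling_params)  # first run: nothing cached yet
    attempts = 5
    wins = 0
    for _ in range(attempts):
        llm.reset_prefix_cache()  # drop the GPU prefix cache; the CPU offload cache remains
        wins += _measure(llm, long_prompt, sampling_params) < cold
    assert wins > attempts // 2  # tolerate some timing noise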

Member

Thanks @orozery, yes I was thinking perhaps at least some kind of latency comparison, but I agree timing tests are fragile / generally not a good idea. If the magnitude of the difference is large enough, perhaps it wouldn't need so many attempts, maybe just a handful?

Member

Merging this to ensure that it makes the release but we might want to think a bit more about the e2e CI tests.

Thanks again for all of your hard work @orozery!

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 22, 2025
@njhill njhill merged commit 8db2939 into vllm-project:main Sep 22, 2025
50 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: charlifu <charlifu@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@alew3

alew3 commented Oct 7, 2025

@orozery how can I debug if the offload connector is being used?

--kv-transfer-config '{"kv_connector":"OffloadingConnector", "kv_role":"kv_both", "kv_connector_extra_config":{"num_cpu_blocks": 5000}}'

@orozery
Contributor Author

orozery commented Oct 8, 2025

@orozery how can I debug if the offload connector is being used?

--kv-transfer-config '{"kv_connector":"OffloadingConnector", "kv_role":"kv_both", "kv_connector_extra_config":{"num_cpu_blocks": 5000}}'

You can check the vLLM logs, which will include logs from the connector.
Control the log level with e.g. VLLM_LOGGING_LEVEL=DEBUG.

gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: gaojc <1055866782@qq.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>

Labels

ci/build, documentation (Improvements or additions to documentation), kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1
