[Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector #24690
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for CUDA to CPU KV cache offloading via the NixlConnector, which is a valuable feature for handling large models. The changes include updates to the NixlConnector to recognize 'cpu' as a valid buffer device for CUDA, and new integration tests to validate this functionality. My review identified a critical bug in the block copying logic within the CUDA platform code, which would cause incorrect behavior or errors. Additionally, I've pointed out some high-severity issues in the new test scripts, such as the use of eval which poses a security risk, and incorrect variable usage in log messages.
if [ -n "$model_args" ]; then
    FULL_CMD="$BASE_CMD $model_args"
else
    FULL_CMD="$BASE_CMD"
fi

eval "$FULL_CMD &"
The use of eval to execute the command is dangerous as it can lead to arbitrary code execution if variables in the command string are not properly sanitized. This occurs for both prefill (line 101) and decode (line 133) instance startup. While the risk is lower in a test script with controlled inputs, it's a bad practice. Please consider refactoring the command execution to use arrays, which is safer. This would involve changing get_model_args to output an array of arguments and then executing the command array directly.
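The array-based refactor suggested above can be sketched as follows. This is a minimal, hypothetical example: `echo` stands in for the real server command, and the flag emitted by `get_model_args` is a placeholder, not the script's actual arguments.

```shell
#!/usr/bin/env bash
# Sketch: build the command as an array and execute it directly,
# instead of eval-ing a string. 'echo serving' stands in for the
# real vllm serve invocation; the flag below is a placeholder.
BASE_CMD=(echo serving)

get_model_args() {
    # Emit one argument per line so callers can read them into an
    # array; arguments containing spaces survive intact.
    printf '%s\n' --max-model-len 2048
}

# Read the extra args into an array.
mapfile -t model_args < <(get_model_args)

if [ "${#model_args[@]}" -gt 0 ]; then
    FULL_CMD=("${BASE_CMD[@]}" "${model_args[@]}")
else
    FULL_CMD=("${BASE_CMD[@]}")
fi

# Direct execution: no eval, so the shell never re-parses the
# command string, and no injection is possible via the variables.
"${FULL_CMD[@]}"
```

Backgrounding works the same way: `"${FULL_CMD[@]}" &` launches the array-built command without any `eval`.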
if [ -n "$model_args" ]; then
    FULL_CMD="$BASE_CMD $model_args"
else
    FULL_CMD="$BASE_CMD"
fi

eval "$FULL_CMD &"
The use of eval to execute the command is dangerous as it can lead to arbitrary code execution if variables in the command string are not properly sanitized. This occurs for both prefill (line 69) and decode (line 87) instance startup. It's a best practice to avoid eval. Please consider refactoring the command execution to use arrays, which is safer. This would involve changing get_model_args to output an array of arguments and then executing the command array directly.
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_edge_case_test.sh
This pull request has merge conflicts that must be resolved before it can be merged.
Will review asap, sorry!
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_accuracy_test.sh
Thanks @chenxi-yang! It looks good at first glance, but would you mind reverting all of the formatting changes? They make it difficult to see what the real changes are, and they're unrelated to the purpose of the PR. Feel free to open a separate PR with formatting improvements!
Cleaned up the formatting edits from auto-save!
Looks clean, thanks @chenxi-yang !
I just left a few minor comments.
Also, would you mind updating the docstring in KVTransferConfig.kv_buffer_device? We indeed support cpu now.
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_accuracy_test.sh
Are you opening a new one @chenxi-yang ?
Sorry! I was swamped with other tasks. I am not opening a new one and have fixed the requested changes. Let me know what you think. Thank you!
    copy_kv_blocks)
kv_transfer_group = get_kv_transfer_group()
kv_transfer_group.register_kv_caches(kv_caches)
kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)
How about moving the `if self.kv_buffer_device != "cpu"` condition here? I think it would make the code more straightforward to read.
Suggestion:
if self.vllm_config.kv_transfer_config.kv_buffer_device == "cpu":
kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)
Resolved in @NickLucche's comments below, please ignore.
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
    KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
    KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
I added backends support in this PR: #25121
Could you also provide these options in run_accuracy_test?
Suggested codes:
VLLM_NIXL_BACKEND=${VLLM_NIXL_BACKEND:-"[\"UCX\"]"}
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"${KV_BUFFER_DEVICE}\", \"kv_connector_extra_config\":{\"backends\":${VLLM_NIXL_BACKEND}}}'"
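The interpolation suggested above is easy to get wrong because of the nested quoting. A minimal sketch of one way to build the same JSON with `printf`, assuming the same variable names (the config shape follows the suggestion above; this is illustrative, not the script's actual code):

```shell
#!/usr/bin/env bash
# Sketch: build the kv-transfer config JSON with printf instead of
# hand-escaped nested quotes. Variable names mirror the script's;
# defaults are assumptions for illustration.
KV_BUFFER_DEVICE=${KV_BUFFER_DEVICE:-cpu}
VLLM_NIXL_BACKEND=${VLLM_NIXL_BACKEND:-'["UCX"]'}

# printf substitutes the two %s placeholders, so no backslash
# escaping is needed inside the format string.
KV_CONFIG=$(printf '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"%s","kv_connector_extra_config":{"backends":%s}}' \
    "$KV_BUFFER_DEVICE" "$VLLM_NIXL_BACKEND")

echo "$KV_CONFIG"
```

The resulting string can then be passed as a single argument to `--kv-transfer-config`.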
I think we should keep the scope of the PR focused here; we can do that in a separate PR.
Ok, will take the task
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
    KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
    KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
same ask for adding
"kv_connector_extra_config\":{\"backends\":${VLLM_NIXL_BACKEND}}
ditto
Will add through a different PR, please ignore my comments above.
@@ -672,6 +675,9 @@ def initialize_host_xfer_buffer(

    def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp):
        """Assign copy (d2h, h2d) operations when host buffer is used."""
        # Set a no-op if the host buffer is not cpu.
        if self.kv_buffer_device != "cpu":
            return
Ok to put the guard here; however, it is not very straightforward to me that for any non-cpu buffer it will always call set_host_xfer_buffer_ops in GPU_MODEL_RUNNER. Is it OK to move the condition check to gpu_model_runner?
You can refer to @njhill's earlier comment, but more broadly this pair of functions makes sense when the selected buffer device is cpu, not when we're running on a particular platform.
And the gpu model runner arguably doesn't need to be aware of the selected buffer device, since it's a kv connector spec.
Resolved in @NickLucche's comments below, please ignore.
LGTM, thanks for contributing @chenxi-yang! Let's get the tests fixed (sync to main) and fix the DCO check.
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
    KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
    KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
ditto
Accidentally closed #24619
Support PD disaggregation for CUDA:CPU based on #22436, utilizing CPU memory as a buffer and performing point-to-point transmission via NIXL.
Use case: we want to support multi-turn conversations across multiple hosts. When the decode instance's KV cache is larger than GPU memory, or we need to pass another prefill's KV cache to the current prefill through decode, we may want to use CPU memory as a cache buffer.
Purpose
Currently, cuda:cpu is not supported yet (#18293 (comment)). This PR adds support for a host buffer on CUDA devices.
Test Plan
bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh --kv_buffer_device cpu for the accuracy test.
bash tests/v1/kv_connector/nixl_integration/run_edge_case_test.sh --kv_buffer_device cpu for the edge case test.
Test Result
All tests completed!