[Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector #24690
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for CUDA to CPU KV cache offloading via the NixlConnector, which is a valuable feature for handling large models. The changes include updates to the NixlConnector to recognize 'cpu' as a valid buffer device for CUDA, and new integration tests to validate this functionality. My review identified a critical bug in the block copying logic within the CUDA platform code, which would cause incorrect behavior or errors. Additionally, I've pointed out some high-severity issues in the new test scripts, such as the use of eval which poses a security risk, and incorrect variable usage in log messages.
if [ -n "$model_args" ]; then
    FULL_CMD="$BASE_CMD $model_args"
else
    FULL_CMD="$BASE_CMD"
fi

eval "$FULL_CMD &"
The use of eval to execute the command is dangerous as it can lead to arbitrary code execution if variables in the command string are not properly sanitized. This occurs for both prefill (line 101) and decode (line 133) instance startup. While the risk is lower in a test script with controlled inputs, it's a bad practice. Please consider refactoring the command execution to use arrays, which is safer. This would involve changing get_model_args to output an array of arguments and then executing the command array directly.
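The array-based refactor suggested above can be sketched as follows. This is a minimal, hypothetical example: `echo` stands in for the real server command, and the flag emitted by `get_model_args` is a placeholder, not the script's actual arguments.

```shell
#!/usr/bin/env bash
# Sketch: build the command as an array and execute it directly,
# instead of eval-ing a string. 'echo serving' stands in for the
# real vllm serve invocation; the flag below is a placeholder.
BASE_CMD=(echo serving)

get_model_args() {
    # Emit one argument per line so callers can read them into an
    # array; arguments containing spaces survive intact.
    printf '%s\n' --max-model-len 2048
}

# Read the extra args into an array.
mapfile -t model_args < <(get_model_args)

if [ "${#model_args[@]}" -gt 0 ]; then
    FULL_CMD=("${BASE_CMD[@]}" "${model_args[@]}")
else
    FULL_CMD=("${BASE_CMD[@]}")
fi

# Direct execution: no eval, so the shell never re-parses the
# command string, and no injection is possible via the variables.
"${FULL_CMD[@]}"
```

Backgrounding works the same way: `"${FULL_CMD[@]}" &` launches the array-built command without any `eval`.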
if [ -n "$model_args" ]; then
    FULL_CMD="$BASE_CMD $model_args"
else
    FULL_CMD="$BASE_CMD"
fi

eval "$FULL_CMD &"
The use of eval to execute the command is dangerous as it can lead to arbitrary code execution if variables in the command string are not properly sanitized. This occurs for both prefill (line 69) and decode (line 87) instance startup. It's a best practice to avoid eval. Please consider refactoring the command execution to use arrays, which is safer. This would involve changing get_model_args to output an array of arguments and then executing the command array directly.
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_edge_case_test.sh
This pull request has merge conflicts that must be resolved before it can be merged.
Will review asap, sorry!
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_accuracy_test.sh
Thanks @chenxi-yang! It looks good at first glance, but would you mind reverting all of the formatting changes? They make it difficult to see what the real changes are, and they're unrelated to the purpose of the PR. Feel free to open a separate PR with formatting improvements!
Cleaned up the formatting edits from auto-save!
Looks clean, thanks @chenxi-yang !
I just left a few minor comments.
Also, would you mind updating the docstring in KVTransferConfig.kv_buffer_device? We indeed support cpu now.
tests/v1/kv_connector/nixl_integration/run_cuda2cpu_accuracy_test.sh
Are you opening a new one @chenxi-yang ?
Sorry! I was swamped with other tasks. I am not opening a new one and have fixed the requested changes. Let me know what you think. Thank you!
    copy_kv_blocks)
kv_transfer_group = get_kv_transfer_group()
kv_transfer_group.register_kv_caches(kv_caches)
kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)
How about moving the `if self.kv_buffer_device != "cpu"` condition here? I think it would make the code more straightforward to read.
Suggestion:
if self.vllm_config.kv_transfer_config.kv_buffer_device == "cpu":
kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)
Resolved in @NickLucche's comments below, please ignore.
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
    KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
    KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
I added backends support in this PR: #25121
Could you also provide these options in run_accuracy_test?
Suggested codes:
VLLM_NIXL_BACKEND=${VLLM_NIXL_BACKEND:-"[\"UCX\"]"}
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"${KV_BUFFER_DEVICE}\", \"kv_connector_extra_config\":{\"backends\":${VLLM_NIXL_BACKEND}}}'"
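The interpolation suggested above is easy to get wrong because of the nested quoting. A minimal sketch of one way to build the same JSON with `printf`, assuming the same variable names (the config shape follows the suggestion above; this is illustrative, not the script's actual code):

```shell
#!/usr/bin/env bash
# Sketch: build the kv-transfer config JSON with printf instead of
# hand-escaped nested quotes. Variable names mirror the script's;
# defaults are assumptions for illustration.
KV_BUFFER_DEVICE=${KV_BUFFER_DEVICE:-cpu}
VLLM_NIXL_BACKEND=${VLLM_NIXL_BACKEND:-'["UCX"]'}

# printf substitutes the two %s placeholders, so no backslash
# escaping is needed inside the format string.
KV_CONFIG=$(printf '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"%s","kv_connector_extra_config":{"backends":%s}}' \
    "$KV_BUFFER_DEVICE" "$VLLM_NIXL_BACKEND")

echo "$KV_CONFIG"
```

The resulting string can then be passed as a single argument to `--kv-transfer-config`.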
I think we should keep the scope of the PR focused here; we can do that in a separate PR.
Ok, will take the task
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
    KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
    KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
same ask for adding
"kv_connector_extra_config\":{\"backends\":${VLLM_NIXL_BACKEND}}
ditto
Will add through a different PR, please ignore my comments above.
@@ -672,6 +675,9 @@ def initialize_host_xfer_buffer(

    def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp):
        """Assign copy (d2h, h2d) operations when host buffer is used."""
        # Set a no-op if the host buffer is not cpu.
        if self.kv_buffer_device != "cpu":
            return
Ok to put the guard here; however, it is not very straightforward to me that for any non-cpu buffer it will always call set_host_xfer_buffer_ops in GPU_MODEL_RUNNER. Is it OK to move the condition check to gpu_model_runner?
You can refer to @njhill's earlier comment, but more broadly this pair of functions makes sense when the selected buffer device is cpu, not when we're running on a particular platform.
And the gpu model runner arguably doesn't need to be aware of the selected buffer device, since it's a kv connector spec.
Resolved in @NickLucche's comments below, please ignore.
LGTM, thanks for contributing @chenxi-yang! Let's get the tests fixed (sync to main) and fix the DCO check.
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
    KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
    KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
ditto
Accidentally closed #24619
Support PD disaggregation for CUDA:CPU based on #22436, utilizing CPU memory as a buffer and performing point-to-point transmission via NIXL.
Use case: we want to support multi-turn conversations across multiple hosts. When the decode instance's KV cache is larger than GPU memory, or we need to pass another prefill's KV cache to the current prefill through decode, we may want to use CPU memory as a cache buffer.
Purpose
Currently, cuda:cpu is not supported yet (#18293 (comment)). This PR adds support for a host buffer on CUDA devices.
Test Plan
bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh --kv_buffer_device cpu for the accuracy test.
bash tests/v1/kv_connector/nixl_integration/run_edge_case_test.sh --kv_buffer_device cpu for the edge case test.
Test Result
All tests completed!