[Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector #24690
Changes from all commits: 710964b, b1c265f, fbba1a6, a9a84a6, 3c7e53d
Edge-case test script (bash):

```diff
@@ -1,6 +1,33 @@
 #!/bin/bash
 set -xe
 
+# Parse command line arguments
+KV_BUFFER_DEVICE="cuda"  # Default to cuda
+PREFILL_GPU_ID=4  # Default GPU IDs
+DECODE_GPU_ID=5
+while [[ $# -gt 0 ]]; do
+  case $1 in
+    --kv_buffer_device)
+      KV_BUFFER_DEVICE="$2"
+      shift 2
+      ;;
+    *)
+      echo "Unknown option $1"
+      echo "Usage: $0 [--kv_buffer_device <cuda|cpu>]"
+      exit 1
+      ;;
+  esac
+done
+
+echo "Running edge case tests with kv_buffer_device=$KV_BUFFER_DEVICE (GPUs: $PREFILL_GPU_ID, $DECODE_GPU_ID)"
+
+# Build the kv-transfer-config once
+if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
+  KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
+else
+  KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
```
Contributor: same ask for adding

Collaborator: ditto

Contributor: will add through a different PR, please ignore my comments above
```diff
+fi
+
 # Models to run
 MODELS=(
     "Qwen/Qwen3-0.6B"
```
```diff
@@ -50,15 +77,15 @@ run_tests_for_model() {
 
   # Get model-specific arguments
   local model_args=$(get_model_args "$model_name")
 
   # Start prefill instance
   PREFILL_PORT=8001
 
-  BASE_CMD="CUDA_VISIBLE_DEVICES=0 VLLM_NIXL_SIDE_CHANNEL_PORT=5559 vllm serve $model_name \
+  BASE_CMD="CUDA_VISIBLE_DEVICES=$PREFILL_GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=5559 vllm serve $model_name \
     --port $PREFILL_PORT \
     --enforce-eager \
     --gpu-memory-utilization 0.2 \
-    --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
+    --kv-transfer-config '$KV_CONFIG'"
 
   if [ -n "$model_args" ]; then
     FULL_CMD="$BASE_CMD $model_args"
```
```diff
@@ -72,11 +99,11 @@ run_tests_for_model() {
   DECODE_PORT=8002
 
   # Build the command with or without model-specific args
-  BASE_CMD="CUDA_VISIBLE_DEVICES=1 VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve $model_name \
+  BASE_CMD="CUDA_VISIBLE_DEVICES=$DECODE_GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve $model_name \
     --port $DECODE_PORT \
     --enforce-eager \
     --gpu-memory-utilization 0.2 \
-    --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
+    --kv-transfer-config '$KV_CONFIG'"
 
   if [ -n "$model_args" ]; then
     FULL_CMD="$BASE_CMD $model_args"
```
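As defined above, running the edge-case script with no arguments keeps the previous behavior (KV buffer on cuda, prefill on GPU 4 and decode on GPU 5 by default), while passing `--kv_buffer_device cpu` builds a kv-transfer-config containing `"kv_buffer_device":"cpu"` so both the prefill and decode instances stage KV blocks through a host buffer; any other value prints the usage message and exits.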
NixlConnector (Python):

```diff
@@ -67,7 +67,10 @@
 # Supported platforms and types of kv transfer buffer.
 # {device: tuple of supported kv buffer types}
 _NIXL_SUPPORTED_DEVICE = {
-    "cuda": ("cuda", ),
+    "cuda": (
+        "cuda",
+        "cpu",
+    ),
     "tpu": ("cpu", ),
     "xpu": ("cpu", ),
 }
```
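For context, here is a minimal sketch of how such a table can be consulted to accept or reject a configured `kv_buffer_device`; the function name and error message are illustrative only, not code from this PR:

```python
# Sketch only (not the actual NixlConnector code): checking the configured
# kv_buffer_device against the supported-device table above.
_NIXL_SUPPORTED_DEVICE = {
    "cuda": ("cuda", "cpu"),
    "tpu": ("cpu",),
    "xpu": ("cpu",),
}


def validate_kv_buffer_device(device_type: str, kv_buffer_device: str) -> None:
    supported = _NIXL_SUPPORTED_DEVICE.get(device_type, ())
    if kv_buffer_device not in supported:
        raise ValueError(
            f"kv_buffer_device={kv_buffer_device!r} is not supported on "
            f"{device_type!r}; supported values: {supported}")


validate_kv_buffer_device("cuda", "cpu")    # accepted after this PR
# validate_kv_buffer_device("cuda", "hbm")  # would raise ValueError
```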
```diff
@@ -687,6 +690,9 @@ def initialize_host_xfer_buffer(
 
     def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp):
         """Assign copy (d2h, h2d) operations when host buffer is used."""
+        # Set a no-op if the host buffer is not cpu.
+        if self.kv_buffer_device != "cpu":
+            return
```
Contributor: OK to put the guard here; however, it is not very straightforward to me that for any non-cpu buffer it will always call set_host_xfer_buffer_ops in gpu_model_runner. Is it OK to move the condition check to gpu_model_runner?

Collaborator: You can refer to @njhill's earlier comment, but more broadly this pair of functions makes sense when the selected buffer device is cpu, not when we're running on a particular platform.

Contributor: Resolved in @NickLucche's comments below, please ignore.
```diff
         assert self.use_host_buffer
         self.copy_blocks = copy_operation
```
GPU model runner (Python):

```diff
@@ -3973,10 +3973,9 @@ def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None:
             self.drafter.validate_same_kv_cache_group(kv_cache_config)
 
         if has_kv_transfer_group():
-            get_kv_transfer_group().register_kv_caches(kv_caches)
-            if self.device.type == 'xpu':
-                get_kv_transfer_group().set_host_xfer_buffer_ops(
-                    copy_kv_blocks)
+            kv_transfer_group = get_kv_transfer_group()
+            kv_transfer_group.register_kv_caches(kv_caches)
+            kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)
```
Contributor: How about moving the `if self.kv_buffer_device != "cpu"` condition here? I think it would make the code more straightforward to read.

Contributor: Resolved in @NickLucche's comments below, please ignore.
```diff
 
         if self.dcp_world_size > 1:
             layer_names = self.attn_groups[0][0].layer_names
```
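Taken together, the runner now registers the KV caches and hands the copy ops to the connector unconditionally, and the connector decides whether to keep them. A minimal sketch of that interaction, where the class and helper names are illustrative stand-ins rather than the actual vLLM types:

```python
# Simplified, illustrative sketch of the resulting control flow; the real
# NixlConnector worker and GPU model runner are far more involved.
from typing import Callable, Optional


class HostBufferConnectorSketch:
    """Stand-in for the connector-side behavior shown in the diff above."""

    def __init__(self, kv_buffer_device: str):
        self.kv_buffer_device = kv_buffer_device
        # A host (CPU) staging buffer is only meaningful for "cpu".
        self.use_host_buffer = kv_buffer_device == "cpu"
        self.copy_blocks: Optional[Callable[..., None]] = None

    def set_host_xfer_buffer_ops(self, copy_operation: Callable[..., None]):
        # Mirror of the new guard: no-op unless the buffer device is cpu.
        if self.kv_buffer_device != "cpu":
            return
        assert self.use_host_buffer
        self.copy_blocks = copy_operation


def copy_kv_blocks_stub(*args, **kwargs) -> None:
    """Placeholder for the real d2h/h2d block-copy op the runner passes in."""


# Runner side: the call is now unconditional; non-cpu buffers simply ignore it.
for device in ("cuda", "cpu"):
    conn = HostBufferConnectorSketch(kv_buffer_device=device)
    conn.set_host_xfer_buffer_ops(copy_kv_blocks_stub)
    print(device, "->", "copy ops assigned" if conn.copy_blocks else "no-op")
```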
I added backends support in this PR: #25121. Could you also provide these options in run_accuracy_test? Suggested code:
I think we should keep the scope of the PR focused here; we can do that in a separate PR.
Ok, will take the task