31 changes: 28 additions & 3 deletions tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh
@@ -1,6 +1,31 @@
#!/bin/bash
set -xe

# Parse command line arguments
KV_BUFFER_DEVICE="cuda" # Default to cuda
while [[ $# -gt 0 ]]; do
    case $1 in
        --kv_buffer_device)
            KV_BUFFER_DEVICE="$2"
            shift 2
            ;;
        *)
            echo "Unknown option $1"
            echo "Usage: $0 [--kv_buffer_device <cuda|cpu>]"
            exit 1
            ;;
    esac
done

echo "Running accuracy tests with kv_buffer_device=$KV_BUFFER_DEVICE"

# Build the kv-transfer-config once
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
xuechendi (Contributor), Sep 24, 2025:
I added backends support in this PR: #25121. Could you also provide that option in run_accuracy_test?

Suggested code:

VLLM_NIXL_BACKEND=${VLLM_NIXL_BACKEND:-"[\"UCX\"]"}

--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"${KV_BUFFER_DEVICE}\", \"kv_connector_extra_config\":{\"backends\":${VLLM_NIXL_BACKEND}}}'"

Collaborator:
I think we should keep the scope of this PR focused; we can do that in a separate PR.

Contributor:
OK, I'll take that task.
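As a rough sketch, not part of this PR, the reviewer's suggestion could be folded into the KV_CONFIG construction below; the VLLM_NIXL_BACKEND variable and the kv_connector_extra_config field come from the comment above and #25121, not from this change:

# Hypothetical extension: select NIXL backends via an env var (default UCX)
# and pass them through kv_connector_extra_config next to kv_buffer_device.
VLLM_NIXL_BACKEND=${VLLM_NIXL_BACKEND:-"[\"UCX\"]"}
KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\",\"kv_connector_extra_config\":{\"backends\":$VLLM_NIXL_BACKEND}}"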

fi

# Models to run
MODELS=(
"Qwen/Qwen3-0.6B"
@@ -79,7 +104,7 @@ run_tests_for_model() {

# Calculate port number (base port + instance number)
PORT=$((8100 + i))
# Calculate side channel port. Avoid clash with TP workers.
SIDE_CHANNEL_PORT=$((5559 + i))

echo "Starting prefill instance $i on GPU $GPU_ID, port $PORT"
@@ -93,7 +118,7 @@ run_tests_for_model() {
--enforce-eager \
--gpu-memory-utilization 0.2 \
--tensor-parallel-size $PREFILLER_TP_SIZE \
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
--kv-transfer-config '$KV_CONFIG'"

if [ -n "$model_args" ]; then
FULL_CMD="$BASE_CMD $model_args"
@@ -128,7 +153,7 @@ run_tests_for_model() {
--enforce-eager \
--gpu-memory-utilization 0.2 \
--tensor-parallel-size $DECODER_TP_SIZE \
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
--kv-transfer-config '$KV_CONFIG'"

if [ -n "$model_args" ]; then
FULL_CMD="$BASE_CMD $model_args"
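For reference, a hypothetical invocation of the updated script; passing cpu routes KV transfers through a host buffer via the config built above:

# Default: keep the KV transfer buffer on the GPU
bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh

# Stage KV transfers through a host (CPU) buffer
bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh --kv_buffer_device cpu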
37 changes: 32 additions & 5 deletions tests/v1/kv_connector/nixl_integration/run_edge_case_test.sh
File mode changed: 100644 → 100755
@@ -1,6 +1,33 @@
#!/bin/bash
set -xe

# Parse command line arguments
KV_BUFFER_DEVICE="cuda" # Default to cuda
PREFILL_GPU_ID=4 # Default GPU IDs
DECODE_GPU_ID=5
while [[ $# -gt 0 ]]; do
    case $1 in
        --kv_buffer_device)
            KV_BUFFER_DEVICE="$2"
            shift 2
            ;;
        *)
            echo "Unknown option $1"
            echo "Usage: $0 [--kv_buffer_device <cuda|cpu>]"
            exit 1
            ;;
    esac
done

echo "Running edge case tests with kv_buffer_device=$KV_BUFFER_DEVICE (GPUs: $PREFILL_GPU_ID, $DECODE_GPU_ID)"

# Build the kv-transfer-config once
if [[ "$KV_BUFFER_DEVICE" == "cuda" ]]; then
KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
else
KV_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_buffer_device\":\"$KV_BUFFER_DEVICE\"}"
Contributor:
Same ask here: please also add

"kv_connector_extra_config\":{\"backends\":${VLLM_NIXL_BACKEND}}

Collaborator:
Ditto.

Contributor:
Will add it through a different PR; please ignore my comments above.
fi

# Models to run
MODELS=(
"Qwen/Qwen3-0.6B"
@@ -50,15 +77,15 @@ run_tests_for_model() {

# Get model-specific arguments
local model_args=$(get_model_args "$model_name")

# Start prefill instance
PREFILL_PORT=8001

BASE_CMD="CUDA_VISIBLE_DEVICES=0 VLLM_NIXL_SIDE_CHANNEL_PORT=5559 vllm serve $model_name \
BASE_CMD="CUDA_VISIBLE_DEVICES=$PREFILL_GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=5559 vllm serve $model_name \
--port $PREFILL_PORT \
--enforce-eager \
--gpu-memory-utilization 0.2 \
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
--kv-transfer-config '$KV_CONFIG'"

if [ -n "$model_args" ]; then
FULL_CMD="$BASE_CMD $model_args"
@@ -72,11 +99,11 @@ run_tests_for_model() {
DECODE_PORT=8002

# Build the command with or without model-specific args
BASE_CMD="CUDA_VISIBLE_DEVICES=1 VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve $model_name \
BASE_CMD="CUDA_VISIBLE_DEVICES=$DECODE_GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve $model_name \
--port $DECODE_PORT \
--enforce-eager \
--gpu-memory-utilization 0.2 \
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
--kv-transfer-config '$KV_CONFIG'"

if [ -n "$model_args" ]; then
FULL_CMD="$BASE_CMD $model_args"
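For reference, when either script is run with --kv_buffer_device cpu, the escaped KV_CONFIG string expands so that each vllm serve instance receives a flag equivalent to the following (reconstructed from the scripts above):

--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cpu"}'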
4 changes: 2 additions & 2 deletions vllm/config/kv_transfer.py
@@ -28,8 +28,8 @@ class KVTransferConfig:
"""The engine id for KV transfers."""

kv_buffer_device: Optional[str] = "cuda"
"""The device used by kv connector to buffer the KV cache.
Currently only support 'cuda'."""
"""The device used by kv connector to buffer the KV cache. Choices are
'cuda' and 'cpu'."""

kv_buffer_size: float = 1e9
"""The buffer size for TorchDistributedConnector. Measured in number of
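The widened choice can also be exercised programmatically; a minimal sketch, assuming KVTransferConfig is importable from vllm.config (the test scripts above express the same thing as a JSON string on the CLI):

from vllm.config import KVTransferConfig

# "cpu" is now accepted for kv_buffer_device alongside the default "cuda".
config = KVTransferConfig(
    kv_connector="NixlConnector",
    kv_role="kv_both",
    kv_buffer_device="cpu",
)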
@@ -67,7 +67,10 @@
# Supported platforms and types of kv transfer buffer.
# {device: tuple of supported kv buffer types}
_NIXL_SUPPORTED_DEVICE = {
"cuda": ("cuda", ),
"cuda": (
"cuda",
"cpu",
),
"tpu": ("cpu", ),
"xpu": ("cpu", ),
}
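For illustration, a minimal sketch of how such a support table is typically consulted; the helper function is hypothetical and the connector's actual validation logic is outside this diff:

# Mirrors the table above: platform device -> supported kv buffer devices.
_NIXL_SUPPORTED_DEVICE = {
    "cuda": ("cuda", "cpu"),
    "tpu": ("cpu",),
    "xpu": ("cpu",),
}

def is_buffer_device_supported(platform_device: str, kv_buffer_device: str) -> bool:
    return kv_buffer_device in _NIXL_SUPPORTED_DEVICE.get(platform_device, ())

assert is_buffer_device_supported("cuda", "cpu")       # newly allowed by this PR
assert not is_buffer_device_supported("tpu", "cuda")   # unchanged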
@@ -687,6 +690,9 @@ def initialize_host_xfer_buffer(

def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp):
"""Assign copy (d2h, h2d) operations when host buffer is used."""
# Set a no-op if the host buffer is not cpu.
if self.kv_buffer_device != "cpu":
return
Contributor:
OK to put the guard here; however, it is not very straightforward to me that for any non-cpu buffer, set_host_xfer_buffer_ops will still always be called in GPU_MODEL_RUNNER. Would it be OK to move the condition check to gpu_model_runner?

Collaborator:
You can refer to @njhill's earlier comment, but more broadly this pair of functions makes sense when the selected buffer device is cpu, not when we're running on a particular platform. And the gpu model runner possibly doesn't need to be aware of the selected buffer device, since it's a kv connector spec.

Contributor:
Resolved in @NickLucche's comments below; please ignore.

assert self.use_host_buffer
self.copy_blocks = copy_operation

24 changes: 24 additions & 0 deletions vllm/platforms/cuda.py
@@ -494,6 +494,30 @@ def check_if_supports_dtype(cls, torch_dtype: torch.dtype):
"You can use float16 instead by explicitly setting the "
"`dtype` flag in CLI, for example: --dtype=half.")

@classmethod
def insert_blocks_to_device(
cls,
src_cache: torch.Tensor,
dst_cache: torch.Tensor,
src_block_indices: torch.Tensor,
dst_block_indices: torch.Tensor,
) -> None:
"""Copy blocks from src_cache to dst_cache on GPU."""
_src_cache = src_cache[:, src_block_indices]
dst_cache[:, dst_block_indices] = _src_cache.to(dst_cache.device)

@classmethod
def swap_out_blocks_to_host(
cls,
src_cache: torch.Tensor,
dst_cache: torch.Tensor,
src_block_indices: torch.Tensor,
dst_block_indices: torch.Tensor,
) -> None:
"""Copy blocks from GPU to host (CPU)."""
_src_cache = src_cache[:, src_block_indices]
dst_cache[:, dst_block_indices] = _src_cache.cpu()

@classmethod
def support_hybrid_kv_cache(cls) -> bool:
return True
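For illustration, a minimal self-contained sketch (not taken from the diff) of the block-copy pattern the two new helpers implement; the shapes and block indices are toy values, and the example falls back to CPU when no GPU is available:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy cache layout: [2 (K/V), num_blocks, block_size, head_size]
src_cache = torch.randn(2, 8, 16, 4, device=device)
dst_cache = torch.zeros(2, 8, 16, 4)  # host (CPU) buffer

src_block_indices = torch.tensor([0, 3, 5])
dst_block_indices = torch.tensor([1, 2, 4])

# swap_out_blocks_to_host: gather the selected blocks, then copy device -> host
dst_cache[:, dst_block_indices] = src_cache[:, src_block_indices].cpu()

# insert_blocks_to_device: the reverse direction, host -> device
src_cache[:, src_block_indices] = dst_cache[:, dst_block_indices].to(src_cache.device)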
7 changes: 3 additions & 4 deletions vllm/v1/worker/gpu_model_runner.py
@@ -3973,10 +3973,9 @@ def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None:
self.drafter.validate_same_kv_cache_group(kv_cache_config)

if has_kv_transfer_group():
get_kv_transfer_group().register_kv_caches(kv_caches)
if self.device.type == 'xpu':
get_kv_transfer_group().set_host_xfer_buffer_ops(
copy_kv_blocks)
kv_transfer_group = get_kv_transfer_group()
kv_transfer_group.register_kv_caches(kv_caches)
kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)
Contributor:
How about moving the kv_buffer_device != "cpu" condition here instead? I think it would make the code more straightforward to read.
Suggestion:

if self.vllm_config.kv_transfer_config.kv_buffer_device == "cpu":
    kv_transfer_group.set_host_xfer_buffer_ops(copy_kv_blocks)

Contributor:
Resolved in @NickLucche's comments below; please ignore.


if self.dcp_world_size > 1:
layer_names = self.attn_groups[0][0].layer_names