Commit 89aa540

panpan0000 authored and NickLucche committed
[Docs] NixlConnector quickstart guide (vllm-project#24249)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <peter.pan@daocloud.io>
Signed-off-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
1 parent 6272917 commit 89aa540

3 files changed: +168 -3 lines changed
docs/features/disagg_prefill.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ Now supports 5 types of connectors:
 
 - **SharedStorageConnector**: refer to <gh-file:examples/offline_inference/disaggregated-prefill-v1/run.sh> for the example usage of SharedStorageConnector disaggregated prefilling.
 - **LMCacheConnectorV1**: refer to <gh-file:examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh> for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
-- **NixlConnector**: refer to <gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh> for the example usage of NixlConnector disaggregated prefilling which support fully async send/recv.
+- **NixlConnector**: refer to <gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh> for the example usage of NixlConnector disaggregated prefilling which support fully async send/recv. For detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
 - **P2pNcclConnector**: refer to <gh-file:examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh> for the example usage of P2pNcclConnector disaggregated prefilling.
 - **MultiConnector**: take advantage of the kv_connector_extra_config: dict[str, Any] already present in KVTransferConfig to stash all the connectors we want in an ordered list of kwargs.such as:

docs/features/nixl_connector_usage.md

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# NixlConnector Usage Guide

NixlConnector is a high-performance KV cache transfer connector for vLLM's disaggregated prefilling feature. It provides fully asynchronous send/receive operations using the NIXL library for efficient cross-process KV cache transfer.

## Prerequisites

### Installation

For a quick start, install the NIXL library with `uv pip install nixl`.

- Refer to the [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions.
- The required NIXL version is specified in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files.

### Transport Configuration

NixlConnector uses the NIXL library for the underlying communication, which supports multiple transport backends. UCX (Unified Communication X) is the default transport library used by NIXL. Configure it through UCX environment variables:

```bash
# Example UCX configuration; adjust according to your environment
export UCX_TLS=all  # or list specific transports, e.g. "rc,ud,sm,^cuda_ipc"
export UCX_NET_DEVICES=all  # or list specific network devices, e.g. "mlx5_0:1,mlx5_1:1"
```

!!! tip
    When using UCX as the transport backend, NCCL environment variables (such as `NCCL_IB_HCA` and `NCCL_SOCKET_IFNAME`) do not apply to NixlConnector. Configure UCX-specific environment variables instead.

## Basic Usage (on the same host)

### Producer (Prefiller) Configuration

Start a prefiller instance that produces KV caches:

```bash
# 1st GPU as prefiller
CUDA_VISIBLE_DEVICES=0 \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
vllm serve Qwen/Qwen3-0.6B \
  --port 8100 \
  --enforce-eager \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```

### Consumer (Decoder) Configuration

Start a decoder instance that consumes KV caches:

```bash
# 2nd GPU as decoder
CUDA_VISIBLE_DEVICES=1 \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
vllm serve Qwen/Qwen3-0.6B \
  --port 8200 \
  --enforce-eager \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```
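
Both instances need some time to load the model before they accept requests. The following is a minimal sketch of a readiness check, assuming `curl` is installed and relying on the `/health` endpoint of vLLM's OpenAI-compatible server. The helper name `wait_for_server` and the 300-second default timeout are illustrative choices, not part of NixlConnector:

```bash
# Poll an instance's /health endpoint until it answers or the timeout expires.
# Usage: wait_for_server HOST PORT [TIMEOUT_SECONDS]
wait_for_server() {
  local host="$1" port="$2" remaining="${3:-300}"
  # curl -f turns HTTP errors into a non-zero exit status; -s silences progress output.
  until curl -sf "http://${host}:${port}/health" > /dev/null; do
    if [ "$remaining" -le 0 ]; then
      echo "Timed out waiting for ${host}:${port}" >&2
      return 1
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
}
```

For the two instances above, running `wait_for_server localhost 8100 && wait_for_server localhost 8200` before sending traffic avoids requests failing while a model is still loading.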

### Proxy Server

Use a proxy server to route requests between prefiller and decoder:

```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
  --port 8192 \
  --prefiller-hosts localhost \
  --prefiller-ports 8100 \
  --decoder-hosts localhost \
  --decoder-ports 8200
```
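
Once the proxy is running, clients talk to it instead of to the individual instances. Below is a small smoke-test sketch, assuming the proxy forwards the OpenAI-compatible `/v1/completions` route. The helper names `completion_payload` and `send_completion`, the `PROXY_URL` variable, and the prompt are made up for illustration:

```bash
# Assumption: the proxy from the previous step listens on localhost:8192.
PROXY_URL="${PROXY_URL:-http://localhost:8192}"

# Build a minimal completion request body.
# Note: the prompt is inserted verbatim, so it must not contain quotes or backslashes.
completion_payload() {
  printf '{"model":"Qwen/Qwen3-0.6B","prompt":"%s","max_tokens":%d}' "$1" "$2"
}

# Send the request to the proxy, which routes it between the prefiller and the decoder.
# Usage: send_completion PROMPT [MAX_TOKENS]
send_completion() {
  curl -sS "${PROXY_URL}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$1" "${2:-32}")"
}
```

Calling `send_completion "The capital of France is"` should return a JSON completion if the prefiller, the decoder, and the proxy are all up.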

## Environment Variables

- `VLLM_NIXL_SIDE_CHANNEL_PORT`: Port for NIXL handshake communication
    - Default: 5600
    - **Required for both prefiller and decoder instances**
    - Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
    - For TP/DP deployments, each worker's port on a node is computed as `base_port + dp_rank * tp_size + tp_rank` (e.g., with `--tensor-parallel-size=4` and `base_port=5600`, `tp_rank` 0..3 use ports 5600, 5601, 5602, and 5603 on that node)
    - Used for the initial NIXL handshake between the prefiller and the decoder

- `VLLM_NIXL_SIDE_CHANNEL_HOST`: Host for side channel communication
    - Default: "localhost"
    - Set when prefiller and decoder are on different machines
    - Connection info is passed via KVTransferParams from prefiller to decoder for handshake

- `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request (optional)
    - Default: 120
    - If a request is aborted and the decoder has not yet read the KV-cache blocks through the NIXL channel, the prefill instance releases its KV-cache blocks after this timeout to avoid holding them indefinitely
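
The side channel port formula above can be checked with a few lines of shell arithmetic. This is only a sketch: `nixl_side_channel_port` is a hypothetical helper, not a vLLM command, and it simply evaluates `base_port + dp_rank * tp_size + tp_rank`:

```bash
# Compute the side channel port of one worker.
# Usage: nixl_side_channel_port BASE_PORT DP_RANK TP_SIZE TP_RANK
nixl_side_channel_port() {
  echo $(( $1 + $2 * $3 + $4 ))
}

# Ports for one node with --tensor-parallel-size=4, base_port=5600, dp_rank=0.
for tp_rank in 0 1 2 3; do
  nixl_side_channel_port 5600 0 4 "$tp_rank"
done
# prints 5600, 5601, 5602, 5603 (one per line)
```

With two data-parallel ranks of four workers each, `nixl_side_channel_port 5600 1 4 0` gives 5604, so the second rank starts right after the first. When picking a base port, keep the whole resulting range free on that host.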

## Multi-Instance Setup

### Multiple Prefiller Instances on Different Machines

```bash
# Prefiller 1 on Machine A (example IP: ${IP1})
VLLM_NIXL_SIDE_CHANNEL_HOST=${IP1} \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
UCX_NET_DEVICES=all \
vllm serve Qwen/Qwen3-0.6B --port 8000 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Prefiller 2 on Machine B (example IP: ${IP2})
VLLM_NIXL_SIDE_CHANNEL_HOST=${IP2} \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
UCX_NET_DEVICES=all \
vllm serve Qwen/Qwen3-0.6B --port 8000 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
```

### Multiple Decoder Instances on Different Machines

```bash
# Decoder 1 on Machine C (example IP: ${IP3})
VLLM_NIXL_SIDE_CHANNEL_HOST=${IP3} \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
UCX_NET_DEVICES=all \
vllm serve Qwen/Qwen3-0.6B --port 8000 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'

# Decoder 2 on Machine D (example IP: ${IP4})
VLLM_NIXL_SIDE_CHANNEL_HOST=${IP4} \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
UCX_NET_DEVICES=all \
vllm serve Qwen/Qwen3-0.6B --port 8000 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```

### Proxy for Multiple Instances

```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
  --port 8192 \
  --prefiller-hosts ${IP1} ${IP2} \
  --prefiller-ports 8000 8000 \
  --decoder-hosts ${IP3} ${IP4} \
  --decoder-ports 8000 8000
```

### KV Role Options

- **kv_producer**: For prefiller instances that generate KV caches
- **kv_consumer**: For decoder instances that consume KV caches from the prefiller
- **kv_both**: Enables symmetric functionality where the connector can act as both producer and consumer. This provides flexibility for experimental setups and scenarios where the role distinction is not predetermined.

!!! tip
    NixlConnector currently does not distinguish `kv_role`; the actual prefiller/decoder roles are determined by the upper-level proxy (e.g., `toy_proxy_server.py` using `--prefiller-hosts` and `--decoder-hosts`).
    Therefore, `kv_role` in `--kv-transfer-config` is effectively a placeholder and does not affect NixlConnector's behavior.

## Example Scripts/Code

Refer to these example scripts in the vLLM repository:

- [run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh)
- [toy_proxy_server.py](../../tests/v1/kv_connector/nixl_integration/toy_proxy_server.py)
- [test_accuracy.py](../../tests/v1/kv_connector/nixl_integration/test_accuracy.py)

tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh

Lines changed: 8 additions & 2 deletions
@@ -85,7 +85,10 @@ run_tests_for_model() {
 echo "Starting prefill instance $i on GPU $GPU_ID, port $PORT"
 
 # Build the command with or without model-specific args
-BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
+BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID \
+UCX_NET_DEVICES=all \
+VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \
+vllm serve $model_name \
 --port $PORT \
 --enforce-eager \
 --gpu-memory-utilization 0.2 \
@@ -117,7 +120,10 @@ run_tests_for_model() {
 echo "Starting decode instance $i on GPU $GPU_ID, port $PORT"
 
 # Build the command with or without model-specific args
-BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
+BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID \
+UCX_NET_DEVICES=all \
+VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \
+vllm serve $model_name \
 --port $PORT \
 --enforce-eager \
 --gpu-memory-utilization 0.2 \
