[Docs] NixlConnector quickstart guide (vllm-project#24249)

panpan0000 · NickLucche · FeiDaLI · commit 89aa540fbe87 · 2025-09-25T18:54:22.000+08:00
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <peter.pan@daocloud.io>
Signed-off-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
diff --git a/docs/features/disagg_prefill.md b/docs/features/disagg_prefill.md
@@ -23,7 +23,7 @@ Now supports 5 types of connectors:
 
 - **SharedStorageConnector**: refer to <gh-file:examples/offline_inference/disaggregated-prefill-v1/run.sh> for the example usage of SharedStorageConnector disaggregated prefilling.
 - **LMCacheConnectorV1**: refer to <gh-file:examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh> for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
-- **NixlConnector**: refer to <gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh> for the example usage of NixlConnector disaggregated prefilling which support fully async send/recv.
+- **NixlConnector**: refer to <gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh> for the example usage of NixlConnector disaggregated prefilling, which supports fully async send/recv. For a detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
 - **P2pNcclConnector**: refer to <gh-file:examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh> for the example usage of P2pNcclConnector disaggregated prefilling.
 - **MultiConnector**: take advantage of the kv_connector_extra_config: dict[str, Any] already present in KVTransferConfig to stash all the connectors we want in an ordered list of kwargs.such as:
 
diff --git a/docs/features/nixl_connector_usage.md b/docs/features/nixl_connector_usage.md
@@ -0,0 +1,159 @@
+# NixlConnector Usage Guide
+
+NixlConnector is a high-performance KV cache transfer connector for vLLM's disaggregated prefilling feature. It provides fully asynchronous send/receive operations using the NIXL library for efficient cross-process KV cache transfer.
+
+## Prerequisites
+
+### Installation
+
+For a quick start, install the NIXL library with `uv pip install nixl`.
+
+- Refer to the [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions.
+- The required NIXL version is specified in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files.
+
+### Transport Configuration
+
+NixlConnector uses the NIXL library for underlying communication, which supports multiple transport backends. UCX (Unified Communication X) is NIXL's default transport library. Configure the transport through environment variables:
+
+```bash
+# Example UCX configuration; adjust according to your environment
+export UCX_TLS=all  # or list specific transports, e.g. "rc,ud,sm,^cuda_ipc"
+export UCX_NET_DEVICES=all  # or specify network devices like "mlx5_0:1,mlx5_1:1"
+```
+
+!!! tip
+    When using UCX as the transport backend, NCCL environment variables (like `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`) are not applicable to NixlConnector, so configure UCX-specific environment variables instead of NCCL variables.
+
+## Basic Usage (on the same host)
+
+### Producer (Prefiller) Configuration
+
+Start a prefiller instance that produces KV caches:
+
+```bash
+# 1st GPU as prefiller
+CUDA_VISIBLE_DEVICES=0 \
+UCX_NET_DEVICES=all \
+VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
+vllm serve Qwen/Qwen3-0.6B \
+  --port 8100 \
+  --enforce-eager \
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
+```
+
+### Consumer (Decoder) Configuration
+
+Start a decoder instance that consumes KV caches:
+
+```bash
+# 2nd GPU as decoder
+CUDA_VISIBLE_DEVICES=1 \
+UCX_NET_DEVICES=all \
+VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
+vllm serve Qwen/Qwen3-0.6B \
+  --port 8200 \
+  --enforce-eager \
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
+```
+
+### Proxy Server
+
+Use a proxy server to route requests between prefiller and decoder:
+
+```bash
+python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
+  --port 8192 \
+  --prefiller-hosts localhost \
+  --prefiller-ports 8100 \
+  --decoder-hosts localhost \
+  --decoder-ports 8200
+```
+
+## Environment Variables
+
+- `VLLM_NIXL_SIDE_CHANNEL_PORT`: Port for NIXL handshake communication
+    - Default: 5600
+    - **Required for both prefiller and decoder instances**
+    - Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
+    - For TP/DP deployments, each worker's port on a node is computed as: base_port + dp_rank * tp_size + tp_rank (e.g., with `--tensor-parallel-size=4` and base_port=5600, tp_rank 0..3 use ports 5600, 5601, 5602, 5603 on that node).
+    - Used for the initial NIXL handshake between the prefiller and the decoder
+
+- `VLLM_NIXL_SIDE_CHANNEL_HOST`: Host for side channel communication
+    - Default: "localhost"
+    - Set when prefiller and decoder are on different machines
+    - Connection info is passed via KVTransferParams from prefiller to decoder for handshake
+
+- `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request. (Optional)
+    - Default: 120
+    - If a request is aborted and the decoder has not yet read the KV-cache blocks through the NIXL channel, the prefill instance releases its KV-cache blocks after this timeout to avoid holding them indefinitely.
+
+## Multi-Instance Setup
+
+### Multiple Prefiller Instances on Different Machines
+
+```bash
+# Prefiller 1 on Machine A (example IP: ${IP1})
+VLLM_NIXL_SIDE_CHANNEL_HOST=${IP1} \
+VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
+UCX_NET_DEVICES=all \
+vllm serve Qwen/Qwen3-0.6B --port 8000 \
+  --tensor-parallel-size 8 \
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
+
+# Prefiller 2 on Machine B (example IP: ${IP2})
+VLLM_NIXL_SIDE_CHANNEL_HOST=${IP2} \
+VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
+UCX_NET_DEVICES=all \
+vllm serve Qwen/Qwen3-0.6B --port 8000 \
+  --tensor-parallel-size 8 \
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
+```
+
+### Multiple Decoder Instances on Different Machines
+
+```bash
+# Decoder 1 on Machine C (example IP: ${IP3})
+VLLM_NIXL_SIDE_CHANNEL_HOST=${IP3} \
+VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
+UCX_NET_DEVICES=all \
+vllm serve Qwen/Qwen3-0.6B --port 8000 \
+  --tensor-parallel-size 8 \
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
+
+# Decoder 2 on Machine D (example IP: ${IP4})
+VLLM_NIXL_SIDE_CHANNEL_HOST=${IP4} \
+VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
+UCX_NET_DEVICES=all \
+vllm serve Qwen/Qwen3-0.6B --port 8000 \
+  --tensor-parallel-size 8 \
+  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
+```
+
+### Proxy for Multiple Instances
+
+```bash
+python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
+  --port 8192 \
+  --prefiller-hosts ${IP1} ${IP2} \
+  --prefiller-ports 8000 8000 \
+  --decoder-hosts ${IP3} ${IP4} \
+  --decoder-ports 8000 8000
+```
+
+### KV Role Options
+
+- **kv_producer**: For prefiller instances that generate KV caches
+- **kv_consumer**: For decoder instances that consume KV caches from the prefiller
+- **kv_both**: Enables symmetric functionality where the connector can act as both producer and consumer. This provides flexibility for experimental setups and scenarios where the role distinction is not predetermined.
+
+!!! tip
+    NixlConnector currently does not distinguish `kv_role`; the actual prefiller/decoder roles are determined by the upper-level proxy (e.g., `toy_proxy_server.py` using `--prefiller-hosts` and `--decoder-hosts`).
+    Therefore, `kv_role` in `--kv-transfer-config` is effectively a placeholder and does not affect NixlConnector's behavior.
+
+## Example Scripts/Code
+
+Refer to these example scripts in the vLLM repository:
+
+- [run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh)
+- [toy_proxy_server.py](../../tests/v1/kv_connector/nixl_integration/toy_proxy_server.py)
+- [test_accuracy.py](../../tests/v1/kv_connector/nixl_integration/test_accuracy.py)
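
The side-channel port rule stated under `VLLM_NIXL_SIDE_CHANNEL_PORT` in the guide above (base_port + dp_rank * tp_size + tp_rank) can be checked with a small shell sketch. The function name `nixl_side_channel_port` is hypothetical and not part of vLLM; it only reproduces the arithmetic given in the guide and is meant as a planning aid, not as a description of vLLM internals.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of vLLM): reproduces the port arithmetic
# stated in the guide, base_port + dp_rank * tp_size + tp_rank.
nixl_side_channel_port() {
  local base_port=$1 dp_rank=$2 tp_size=$3 tp_rank=$4
  echo $(( base_port + dp_rank * tp_size + tp_rank ))
}

# Ports for one node running with --tensor-parallel-size=4 and base port 5600.
tp_size=4
for tp_rank in 0 1 2 3; do
  port=$(nixl_side_channel_port 5600 0 "$tp_size" "$tp_rank")
  echo "dp_rank=0 tp_rank=${tp_rank} -> port ${port}"
done
```

Assuming the formula holds as documented, two instances sharing a host should use base ports at least as far apart as the number of workers each instance runs on that host, so that their derived port ranges do not overlap.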
diff --git a/tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh b/tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh
@@ -85,7 +85,10 @@ run_tests_for_model() {
     echo "Starting prefill instance $i on GPU $GPU_ID, port $PORT"
 
     # Build the command with or without model-specific args
-    BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
+    BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID \
+    UCX_NET_DEVICES=all \
+    VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \
+    vllm serve $model_name \
     --port $PORT \
     --enforce-eager \
     --gpu-memory-utilization 0.2 \
@@ -117,7 +120,10 @@ run_tests_for_model() {
     echo "Starting decode instance $i on GPU $GPU_ID, port $PORT"
 
     # Build the command with or without model-specific args
-    BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
+    BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID \
+    UCX_NET_DEVICES=all \
+    VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT \
+    vllm serve $model_name \
     --port $PORT \
     --enforce-eager \
     --gpu-memory-utilization 0.2 \