| 1 | +# NixlConnector Usage Guide |
| 2 | + |
| 3 | +NixlConnector is a high-performance KV cache transfer connector for vLLM's disaggregated prefilling feature. It provides fully asynchronous send/receive operations using the NIXL library for efficient cross-process KV cache transfer. |
| 4 | + |
| 5 | +## Prerequisites |
| 6 | + |
| 7 | +### Installation |
| 8 | + |
| 9 | +For a quick start, install the NIXL library with `uv pip install nixl`. |
| 10 | + |
| 11 | +- Refer to the [official NIXL repository](https://github.com/ai-dynamo/nixl) for more installation instructions. |
| 12 | +- The required NIXL version is specified in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files. |
| 13 | + |
| 14 | +### Transport Configuration |
| 15 | + |
| 16 | +NixlConnector uses the NIXL library for underlying communication, which supports multiple transport backends. UCX (Unified Communication X) is the default transport library used by NIXL. Configure the transport through UCX environment variables: |
| 17 | + |
| 18 | +```bash |
| 19 | +# Example UCX configuration; adjust according to your environment |
| 20 | +export UCX_TLS=all # or list specific transports, e.g. "rc,ud,sm,^cuda_ipc" |
| 21 | +export UCX_NET_DEVICES=all # or specify network devices like "mlx5_0:1,mlx5_1:1" |
| 22 | +``` |
| 23 | + |
| 24 | +!!! tip |
| 25 | + When using UCX as the transport backend, NCCL environment variables (like `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`) are not applicable to NixlConnector, so configure UCX-specific environment variables instead of NCCL variables. |
| 26 | + |
| 27 | +## Basic Usage (on the same host) |
| 28 | + |
| 29 | +### Producer (Prefiller) Configuration |
| 30 | + |
| 31 | +Start a prefiller instance that produces KV caches: |
| 32 | + |
| 33 | +```bash |
| 34 | +# 1st GPU as prefiller |
| 35 | +CUDA_VISIBLE_DEVICES=0 \ |
| 36 | +UCX_NET_DEVICES=all \ |
| 37 | +VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \ |
| 38 | +vllm serve Qwen/Qwen3-0.6B \ |
| 39 | + --port 8100 \ |
| 40 | + --enforce-eager \ |
| 41 | + --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' |
| 42 | +``` |
| 43 | + |
| 44 | +### Consumer (Decoder) Configuration |
| 45 | + |
| 46 | +Start a decoder instance that consumes KV caches: |
| 47 | + |
| 48 | +```bash |
| 49 | +# 2nd GPU as decoder |
| 50 | +CUDA_VISIBLE_DEVICES=1 \ |
| 51 | +UCX_NET_DEVICES=all \ |
| 52 | +VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \ |
| 53 | +vllm serve Qwen/Qwen3-0.6B \ |
| 54 | + --port 8200 \ |
| 55 | + --enforce-eager \ |
| 56 | + --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' |
| 57 | +``` |
| 58 | + |
| 59 | +### Proxy Server |
| 60 | + |
| 61 | +Use a proxy server to route requests between prefiller and decoder: |
| 62 | + |
| 63 | +```bash |
| 64 | +python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \ |
| 65 | + --port 8192 \ |
| 66 | + --prefiller-hosts localhost \ |
| 67 | + --prefiller-ports 8100 \ |
| 68 | + --decoder-hosts localhost \ |
| 69 | + --decoder-ports 8200 |
| 70 | +``` |
| 71 | + |
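Once the prefiller, decoder, and proxy are running, a request sent to the proxy port exercises the whole path. The following is a minimal sketch, not part of the vLLM repository: it assumes the proxy forwards an OpenAI-style `/v1/completions` route on port 8192, and the helper names `completion_payload` and `send_completion` are hypothetical.

```bash
# Assumption: the proxy listens on port 8192 (as started above) and exposes
# an OpenAI-style /v1/completions route; adjust if your proxy differs.
PROXY_URL="http://localhost:8192"

# Build the JSON request body from: model, prompt, max_tokens.
# The prompt must not contain characters that need JSON escaping.
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$1" "$2" "$3"
}

# POST the request to the proxy, which routes it to the prefiller and decoder.
send_completion() {
  curl -s "${PROXY_URL}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$1" "$2" "$3")"
}
```

For example, `send_completion Qwen/Qwen3-0.6B "San Francisco is a" 32` would send one completion request through the proxy, provided the route assumption above holds.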
| 72 | +## Environment Variables |
| 73 | + |
| 74 | +- `VLLM_NIXL_SIDE_CHANNEL_PORT`: Port for NIXL handshake communication |
| 75 | + - Default: 5600 |
| 76 | + - **Required for both prefiller and decoder instances** |
| 77 | + - Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine |
| 78 | + - For TP/DP deployments, each worker's port on a node is computed as: base_port + dp_rank * tp_size + tp_rank (e.g., with `--tensor-parallel-size=4` and base_port=5600, tp_rank 0..3 use ports 5600, 5601, 5602, 5603 on that node). |
| 79 | + - Used for the initial NIXL handshake between the prefiller and the decoder |
| 80 | + |
| 81 | +- `VLLM_NIXL_SIDE_CHANNEL_HOST`: Host for side channel communication |
| 82 | + - Default: "localhost" |
| 83 | + - Set when prefiller and decoder are on different machines |
| 84 | + - Connection info is passed via KVTransferParams from prefiller to decoder for handshake |
| 85 | + |
| 86 | +- `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`: (Optional) Timeout, in seconds, after which the prefiller automatically releases the KV cache held for a particular request |
| 87 | + - Default: 120 |
| 88 | + - If a request is aborted and the decoder has not yet read the KV-cache blocks through the NIXL channel, the prefiller releases those blocks after this timeout to avoid holding them indefinitely. |
| 89 | + |
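The side-channel port formula above can be checked with a few lines of shell before choosing base ports. This is an illustrative sketch only: `nixl_side_channel_port` and `nixl_instance_ports` are hypothetical helper names, not vLLM commands, and they simply evaluate `base_port + dp_rank * tp_size + tp_rank` as stated in the list.

```bash
# Hypothetical helper (not part of vLLM): side-channel port of one worker.
# Args: base_port dp_rank tp_size tp_rank
nixl_side_channel_port() {
  echo $(( $1 + $2 * $3 + $4 ))
}

# Print every port one DP rank occupies on a node, one per line.
# Args: base_port dp_rank tp_size
# The body runs in a subshell so tp_rank does not leak into the caller.
nixl_instance_ports() (
  tp_rank=0
  while [ "$tp_rank" -lt "$3" ]; do
    nixl_side_channel_port "$1" "$2" "$3" "$tp_rank"
    tp_rank=$(( tp_rank + 1 ))
  done
)

# Base port 5600, dp_rank 0, --tensor-parallel-size=4:
nixl_instance_ports 5600 0 4   # prints 5600, 5601, 5602, 5603 (one per line)
```

Because each worker needs a unique port on its host, a second instance sharing that host would need a base port outside this range, for example 5604 or higher in the case above.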
| 90 | +## Multi-Instance Setup |
| 91 | + |
| 92 | +### Multiple Prefiller Instances on Different Machines |
| 93 | + |
| 94 | +```bash |
| 95 | +# Prefiller 1 on Machine A (example IP: ${IP1}) |
| 96 | +VLLM_NIXL_SIDE_CHANNEL_HOST=${IP1} \ |
| 97 | +VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \ |
| 98 | +UCX_NET_DEVICES=all \ |
| 99 | +vllm serve Qwen/Qwen3-0.6B --port 8000 \ |
| 100 | + --tensor-parallel-size 8 \ |
| 101 | + --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}' |
| 102 | + |
| 103 | +# Prefiller 2 on Machine B (example IP: ${IP2}) |
| 104 | +VLLM_NIXL_SIDE_CHANNEL_HOST=${IP2} \ |
| 105 | +VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \ |
| 106 | +UCX_NET_DEVICES=all \ |
| 107 | +vllm serve Qwen/Qwen3-0.6B --port 8000 \ |
| 108 | + --tensor-parallel-size 8 \ |
| 109 | + --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}' |
| 110 | +``` |
| 111 | + |
| 112 | +### Multiple Decoder Instances on Different Machines |
| 113 | + |
| 114 | +```bash |
| 115 | +# Decoder 1 on Machine C (example IP: ${IP3}) |
| 116 | +VLLM_NIXL_SIDE_CHANNEL_HOST=${IP3} \ |
| 117 | +VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \ |
| 118 | +UCX_NET_DEVICES=all \ |
| 119 | +vllm serve Qwen/Qwen3-0.6B --port 8000 \ |
| 120 | + --tensor-parallel-size 8 \ |
| 121 | + --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}' |
| 122 | + |
| 123 | +# Decoder 2 on Machine D (example IP: ${IP4}) |
| 124 | +VLLM_NIXL_SIDE_CHANNEL_HOST=${IP4} \ |
| 125 | +VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \ |
| 126 | +UCX_NET_DEVICES=all \ |
| 127 | +vllm serve Qwen/Qwen3-0.6B --port 8000 \ |
| 128 | + --tensor-parallel-size 8 \ |
| 129 | + --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}' |
| 130 | +``` |
| 131 | + |
| 132 | +### Proxy for Multiple Instances |
| 133 | + |
| 134 | +```bash |
| 135 | +python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \ |
| 136 | + --port 8192 \ |
| 137 | + --prefiller-hosts ${IP1} ${IP2} \ |
| 138 | + --prefiller-ports 8000 8000 \ |
| 139 | + --decoder-hosts ${IP3} ${IP4} \ |
| 140 | + --decoder-ports 8000 8000 |
| 141 | +``` |
| 142 | + |
| 143 | +### KV Role Options |
| 144 | + |
| 145 | +- **kv_producer**: For prefiller instances that generate KV caches |
| 146 | +- **kv_consumer**: For decoder instances that consume KV caches from a prefiller |
| 147 | +- **kv_both**: Enables symmetric functionality where the connector can act as both producer and consumer. This provides flexibility for experimental setups and scenarios where the role distinction is not predetermined. |
| 148 | + |
| 149 | +!!! tip |
| 150 | + NixlConnector does not currently distinguish between `kv_role` values; the actual prefiller/decoder roles are determined by the upper-level proxy (e.g., `toy_proxy_server.py` using `--prefiller-hosts` and `--decoder-hosts`). |
| 151 | + Therefore, `kv_role` in `--kv-transfer-config` is effectively a placeholder and does not affect NixlConnector's behavior. |
| 152 | + |
| 153 | +## Example Scripts/Code |
| 154 | + |
| 155 | +Refer to these example scripts in the vLLM repository: |
| 156 | + |
| 157 | +- [run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) |
| 158 | +- [toy_proxy_server.py](../../tests/v1/kv_connector/nixl_integration/toy_proxy_server.py) |
| 159 | +- [test_accuracy.py](../../tests/v1/kv_connector/nixl_integration/test_accuracy.py) |