`benchmarks/router/README.md`

First, start the vLLM worker engines in a terminal.

```bash
    --tensor-parallel-size 2
```

#### Prefill Workers

You can also launch separate decode and prefill workers for disaggregated serving. This allows you to dedicate specific GPUs to prefill (prompt processing) and decode (token generation) tasks:
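
A minimal sketch of what such a launch could look like is shown below. The entrypoint (`python -m dynamo.vllm`), the `--is-prefill-worker` flag, and the model name are assumptions/placeholders rather than commands taken from this README, so defer to the worker-launch instructions above for the authoritative invocation:

```bash
# Sketch only: the module path, flags, and model below are assumptions/placeholders.

# Decode worker (token generation) pinned to GPU 0
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm \
    --model Qwen/Qwen2.5-7B-Instruct

# Prefill worker (prompt processing) pinned to GPU 1, in another terminal
CUDA_VISIBLE_DEVICES=1 python -m dynamo.vllm \
    --model Qwen/Qwen2.5-7B-Instruct \
    --is-prefill-worker
```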
We also support running lightweight mocker engines that simulate vLLM behavior without performing actual model inference. Mocker engines are useful for testing router logic and performance without GPU requirements. Use the `--mockers` flag to run mocker engines instead of real vLLM workers.
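
For illustration, the flag is simply added to whichever worker-launch command this guide uses; the launcher name below is a hypothetical placeholder, and only `--mockers` itself comes from this README:

```bash
# Hypothetical launcher name -- substitute the actual worker-launch command from above.
# --mockers swaps the real vLLM workers for lightweight mock engines, so no GPUs are required.
python launch_workers.py --mockers
```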
For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/architecture/kv_cache_routing.md).

#### Launching a Prefill Router (Optional)

If you're using disaggregated serving with separate prefill and decode workers, you should also launch a prefill router, which routes prefill requests to the dedicated prefill workers. In that case, it's recommended to start the frontend (decode router) with `--kv-overlap-score-weight 0` for pure load balancing, since prefix-aware routing is then handled by the prefill router:

```bash
# Start the decode router with pure load balancing
python -m dynamo.frontend \
    --router-mode kv \
    --kv-cache-block-size 64 \
    --router-reset-states \
    --http-port 8000 \
    --kv-overlap-score-weight 0

# In another terminal, start the prefill router (currently only supports vLLM)
python -m dynamo.vllm_prefill_router \
    --namespace dynamo \
    --block-size 64
```
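
Once both routers are up, requests are sent to the frontend as usual. A quick smoke test might look like the following; this assumes the frontend exposes an OpenAI-compatible endpoint on port 8000, and the model name is a placeholder that must match whatever the workers are serving:

```bash
# Placeholder model name -- use the model your workers actually serve.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen2.5-7B-Instruct",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 32
        }'
```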

The prefill router will automatically coordinate with the decode router to handle request routing between prefill and decode workers.

**Note**: If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
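
The sketch below shows where the flag would go, assuming it is accepted by the frontend alongside the router arguments used earlier (everything except `--no-kv-events` is taken from the decode-router example above):

```bash
python -m dynamo.frontend \
    --router-mode kv \
    --kv-cache-block-size 64 \
    --http-port 8000 \
    --no-kv-events   # fall back to approximate KV indexing instead of engine-emitted KV events
```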