Commit f9dbe80

PeaBrane authored and kylehh committed

feat: vllm prefill router (#3155)

Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: Kyle H <kylhuang@nvidia.com>

1 parent f202213 · commit f9dbe80

19 files changed: +843 −141 lines changed

benchmarks/router/README.md
Lines changed: 39 additions & 0 deletions

````diff
@@ -66,6 +66,24 @@ First, start the vLLM worker engines in a terminal.
     --tensor-parallel-size 2
 ```
 
+#### Prefill Workers
+
+You can also launch separate decode and prefill workers for disaggregated serving. This allows you to dedicate specific GPUs to prefill (prompt processing) and decode (token generation) tasks:
+
+```bash
+# Launch 4 decode workers (GPUs 0-3)
+./run_engines.sh \
+    --num-workers 4 \
+    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+
+# Launch 4 prefill workers (GPUs 4-7)
+./run_engines.sh \
+    --prefills \
+    --num-workers 4 \
+    --base-gpu-offset 4 \
+    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+```
+
 #### Alternative: Launch vLLM Mock Workers
 
 We also support running lightweight mock engines that simulate vLLM behavior without performing actual model inference. Mocker engines are useful for testing router logic and performance without GPU requirements. Use the `--mockers` flag to run mocker engines instead of real vLLM workers.
````
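As a concrete illustration of the `--mockers` flag, a GPU-free mocker run might look like the following (a sketch, not part of the diff; the worker count and model path reuse the defaults shown in run_engines.sh below):

```bash
# Launch 8 mock engines -- no GPUs required
./run_engines.sh \
    --mockers \
    --num-workers 8 \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```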
````diff
@@ -106,6 +124,27 @@ python -m dynamo.frontend --help
 
 For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/architecture/kv_cache_routing.md).
 
+#### Launching a Prefill Router (Optional)
+
+If you're using disaggregated serving with separate prefill and decode workers, you should also launch a prefill router, which routes prefill requests to the dedicated prefill workers. When using a prefill router, it's recommended to start the frontend (decode router) with `--kv-overlap-score-weight 0` for pure load balancing, since prefix-aware routing is now handled by the prefill router:
+
+```bash
+# Start the decode router with pure load balancing
+python -m dynamo.frontend \
+    --router-mode kv \
+    --kv-cache-block-size 64 \
+    --router-reset-states \
+    --http-port 8000 \
+    --kv-overlap-score-weight 0
+
+# In another terminal, start the prefill router (currently only supports vLLM)
+python -m dynamo.vllm_prefill_router \
+    --namespace dynamo \
+    --block-size 64
+```
+
+The prefill router automatically coordinates with the decode router to route requests between prefill and decode workers.
+
 **Note**: If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
 
 ```bash
````

benchmarks/router/ping.sh
Lines changed: 2 additions & 2 deletions

```diff
@@ -3,8 +3,8 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
-# Get port from first argument, default to 8080 if not provided
-PORT=${1:-8080}
+# Get port from first argument, default to 8000 if not provided
+PORT=${1:-8000}
 
 curl -X POST http://localhost:${PORT}/v1/chat/completions \
     -H "Content-Type: application/json" \
```

benchmarks/router/prefix_ratio_benchmark.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -309,7 +309,7 @@ def main():
         "--url",
         type=str,
         nargs="+",  # Accept multiple URLs
-        default=["http://localhost:8080"],
+        default=["http://localhost:8000"],
         # default=["http://localhost:8090", "http://localhost:8090"],
         help="Server URL(s). Can specify multiple URLs for parallel benchmarking",
     )
```

benchmarks/router/real_data_benchmark.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -118,7 +118,7 @@ def main():
     parser.add_argument(
         "--url",
         type=str,
-        default="http://localhost:8080",
+        default="http://localhost:8000",
         help="Server URL",
     )
     parser.add_argument(
```

benchmarks/router/run_engines.sh
Lines changed: 34 additions & 10 deletions

```diff
@@ -8,6 +8,8 @@ NUM_WORKERS=8
 MODEL_PATH="deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
 TENSOR_PARALLEL_SIZE=1
 USE_MOCKERS=false
+USE_PREFILLS=false
+BASE_GPU_OFFSET=0
 EXTRA_ARGS=()
 
 # Parse arguments
@@ -29,6 +31,14 @@ while [[ $# -gt 0 ]]; do
             USE_MOCKERS=true
             shift
             ;;
+        --prefills)
+            USE_PREFILLS=true
+            shift
+            ;;
+        --base-gpu-offset)
+            BASE_GPU_OFFSET="$2"
+            shift 2
+            ;;
         --)
             shift
             EXTRA_ARGS+=("$@")
@@ -71,14 +81,22 @@ if ! [[ "$TENSOR_PARALLEL_SIZE" =~ ^[0-9]+$ ]] || [ "$TENSOR_PARALLEL_SIZE" -lt
     exit 1
 fi
 
+if ! [[ "$BASE_GPU_OFFSET" =~ ^[0-9]+$ ]]; then
+    echo "Error: BASE_GPU_OFFSET must be a non-negative integer"
+    exit 1
+fi
+
 # Calculate total GPUs needed
 TOTAL_GPUS_NEEDED=$((NUM_WORKERS * TENSOR_PARALLEL_SIZE))
+LAST_GPU=$((BASE_GPU_OFFSET + TOTAL_GPUS_NEEDED - 1))
 echo "Configuration:"
 echo "  Engine Type: $([ "$USE_MOCKERS" = true ] && echo "Mocker" || echo "vLLM")"
+echo "  Worker Type: $([ "$USE_PREFILLS" = true ] && echo "Prefill" || echo "Decode")"
 echo "  Workers: $NUM_WORKERS"
 echo "  Model: $MODEL_PATH"
 echo "  Tensor Parallel Size: $TENSOR_PARALLEL_SIZE"
 echo "  Total GPUs needed: $TOTAL_GPUS_NEEDED"
+echo "  GPU Range: $BASE_GPU_OFFSET-$LAST_GPU"
 echo "  Engine args: ${EXTRA_ARGS[*]}"
 echo ""
@@ -93,14 +111,15 @@ cleanup() {
 
 trap cleanup SIGINT SIGTERM
 
-echo "Starting $NUM_WORKERS workers..."
+WORKER_TYPE=$([ "$USE_PREFILLS" = true ] && echo "prefill" || echo "decode")
+echo "Starting $NUM_WORKERS $WORKER_TYPE workers..."
 
 for i in $(seq 1 $NUM_WORKERS); do
     {
-        echo "[Worker-$i] Starting..."
+        echo "[${WORKER_TYPE^} Worker-$i] Starting..."
 
-        # Calculate GPU indices for this worker
-        START_GPU=$(( (i - 1) * TENSOR_PARALLEL_SIZE ))
+        # Calculate GPU indices for this worker (with base offset)
+        START_GPU=$(( BASE_GPU_OFFSET + (i - 1) * TENSOR_PARALLEL_SIZE ))
         END_GPU=$(( START_GPU + TENSOR_PARALLEL_SIZE - 1 ))
 
         # Build CUDA_VISIBLE_DEVICES string
@@ -124,17 +143,22 @@ for i in $(seq 1 $NUM_WORKERS); do
                 --endpoint dyn://test.mocker.generate \
                 "${EXTRA_ARGS[@]}"
         else
-            echo "[Worker-$i] Using GPUs: $GPU_DEVICES"
+            echo "[${WORKER_TYPE^} Worker-$i] Using GPUs: $GPU_DEVICES"
             # Run vLLM engine with PYTHONHASHSEED=0 for deterministic event IDs in KV-aware routing
+            VLLM_ARGS=()
+            VLLM_ARGS+=("--model" "$MODEL_PATH")
+            VLLM_ARGS+=("--tensor-parallel-size" "$TENSOR_PARALLEL_SIZE")
+            if [ "$USE_PREFILLS" = true ]; then
+                VLLM_ARGS+=("--is-prefill-worker")
+            fi
+            VLLM_ARGS+=("${EXTRA_ARGS[@]}")
+
             exec env PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=$GPU_DEVICES python -m dynamo.vllm \
-                --model "$MODEL_PATH" \
-                --endpoint dyn://test.vllm.generate \
-                --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
-                "${EXTRA_ARGS[@]}"
+                "${VLLM_ARGS[@]}"
         fi
     } &
    PIDS+=($!)
-    echo "Started worker $i (PID: $!)"
+    echo "Started $WORKER_TYPE worker $i (PID: $!)"
 done
 
 echo "All workers started. Press Ctrl+C to stop."
```

components/backends/vllm/launch/agg_router.sh
Lines changed: 22 additions & 4 deletions

```diff
@@ -4,11 +4,29 @@
 set -e
 trap 'echo Cleaning up...; kill 0' EXIT
 
-# run ingress
-python -m dynamo.frontend --router-mode kv --http-port=8000 &
+# Set deterministic hash for KV event IDs
+export PYTHONHASHSEED=0
+
+# Common configuration
+MODEL="Qwen/Qwen3-0.6B"
+BLOCK_SIZE=64
+
+# run frontend + KV router
+python -m dynamo.frontend \
+    --router-mode kv \
+    --http-port 8000 \
+    --router-reset-states &
 
 # run workers
 # --enforce-eager is added for quick deployment; remove this flag for production use
-CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --connector none &
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
+    --model $MODEL \
+    --block-size $BLOCK_SIZE \
+    --enforce-eager \
+    --connector none &
 
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --connector none
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
+    --model $MODEL \
+    --block-size $BLOCK_SIZE \
+    --enforce-eager \
+    --connector none
```

components/backends/vllm/launch/disagg_router.sh
Lines changed: 36 additions & 7 deletions

```diff
@@ -2,19 +2,48 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 set -e
-
 trap 'echo Cleaning up...; kill 0' EXIT
 
-# run ingress
-python -m dynamo.frontend --router-mode kv --http-port=8000 &
+# Set deterministic hash for KV event IDs
+export PYTHONHASHSEED=0
+
+# Common configuration
+MODEL="Qwen/Qwen3-0.6B"
+BLOCK_SIZE=64
+
+# run decode router with kv-overlap-score-weight 0 for pure load balancing
+python -m dynamo.frontend \
+    --router-mode kv \
+    --http-port 8000 \
+    --kv-overlap-score-weight 0 \
+    --router-reset-states &
 
-# routing will happen between the two decode workers
+# run prefill router service
+python -m dynamo.vllm_prefill_router \
+    --namespace dynamo \
+    --block-size $BLOCK_SIZE &
+
+# two decode workers
 # --enforce-eager is added for quick deployment; remove this flag for production use
-CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
+    --model $MODEL \
+    --block-size $BLOCK_SIZE \
+    --enforce-eager &
 
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
+    --model $MODEL \
+    --block-size $BLOCK_SIZE \
+    --enforce-eager &
 
+# two prefill workers
 CUDA_VISIBLE_DEVICES=2 python3 -m dynamo.vllm \
-    --model Qwen/Qwen3-0.6B \
+    --model $MODEL \
+    --block-size $BLOCK_SIZE \
+    --enforce-eager \
+    --is-prefill-worker &
+
+CUDA_VISIBLE_DEVICES=3 python3 -m dynamo.vllm \
+    --model $MODEL \
+    --block-size $BLOCK_SIZE \
     --enforce-eager \
     --is-prefill-worker
```

components/backends/vllm/src/dynamo/vllm/handlers.py
Lines changed: 47 additions & 5 deletions

```diff
@@ -94,9 +94,13 @@ def __init__(
         engine,
         default_sampling_params,
         prefill_worker_client=None,
+        prefill_router_client=None,
+        prefill_router_free_client=None,
     ):
         super().__init__(runtime, component, engine, default_sampling_params)
         self.prefill_worker_client = prefill_worker_client
+        self.prefill_router_client = prefill_router_client
+        self.prefill_router_free_client = prefill_router_free_client
         self.can_prefill = 0
         self._prefill_check_task = None
 
@@ -143,7 +147,11 @@ async def generate(self, request, context):
             if value is not None and hasattr(sampling_params, key):
                 setattr(sampling_params, key, value)
 
-        # TODO Change to prefill queue
+        # TODO: Change to prefill queue
+        # TODO: (PeaBrane) eventually, do not use a router_client and a free_client directly.
+        # This is least intrusive for now, but quite error prone. Should consider a (major) refactoring.
+        # TODO: (PeaBrane) longer term, decode workers should not handle prefill routing at all.
+        # Prefill routing logic should potentially be integrated directly into the frontend service.
         if self.can_prefill:
             # Create a copy for prefill with specific modifications
             prefill_sampling_params = deepcopy(sampling_params)
@@ -162,12 +170,37 @@ async def generate(self, request, context):
                 "request_id": request_id,
             }
 
+            used_prefill_router = False
             try:
-                prefill_response = await anext(
-                    await self.prefill_worker_client.round_robin(
-                        prefill_request, context=context
+                prefill_worker_id = None
+                if (
+                    self.prefill_router_client is not None
+                    and self.prefill_router_client.instance_ids()
+                ):
+                    used_prefill_router = True
+                    best_worker_response = await anext(
+                        await self.prefill_router_client.generate(
+                            {
+                                "token_ids": request["token_ids"],
+                                "request_id": request_id,
+                            }
+                        )
                     )
-                )
+                    prefill_worker_id = best_worker_response.data().get("worker_id")
+
+                if prefill_worker_id is not None:
+                    prefill_response = await anext(
+                        await self.prefill_worker_client.direct(
+                            prefill_request, prefill_worker_id, context=context
+                        )
+                    )
+                else:
+                    prefill_response = await anext(
+                        await self.prefill_worker_client.round_robin(
+                            prefill_request, context=context
+                        )
+                    )
+
             except Exception as e:
                 # TODO: Cancellation does not propagate until the first token is received
                 if context.is_stopped() or context.is_killed():
@@ -176,6 +209,15 @@ async def generate(self, request, context):
                     return
                 raise e
 
+            finally:
+                if used_prefill_router:
+                    await anext(
+                        await self.prefill_router_free_client.generate(
+                            {"request_id": request_id}
+                        )
+                    )
+                    logger.debug(f"Freed router state for request {request_id}")
+
             prefill_response = MyRequestOutput.model_validate_json(
                 prefill_response.data()
             )
```
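Taken together, the new handler logic prefers the prefill router when one is registered, falls back to round-robin dispatch otherwise, and always frees the router's per-request state afterward. Below is a condensed, self-contained sketch of that control flow; the stub client and its return values are hypothetical stand-ins for Dynamo's real clients, and only the call pattern (`instance_ids()`, `generate()`, direct-vs-round-robin, free-on-finally) mirrors the diff above:

```python
import asyncio

# Hypothetical stand-in for Dynamo's prefill router client; only the
# call pattern (instance_ids / generate returning an async stream)
# mirrors the handlers.py diff above.
class StubPrefillRouterClient:
    def instance_ids(self):
        # Non-empty list means a prefill router instance is registered.
        return ["prefill-router-0"]

    async def generate(self, payload):
        async def stream():
            # Pretend the router picked a worker for these tokens.
            yield {"worker_id": len(payload["token_ids"]) % 4}
        return stream()


async def route_prefill(router_client, token_ids, request_id):
    """Condensed mirror of the routing decision in generate()."""
    used_prefill_router = False
    try:
        prefill_worker_id = None
        if router_client is not None and router_client.instance_ids():
            used_prefill_router = True
            response = await anext(
                await router_client.generate(
                    {"token_ids": token_ids, "request_id": request_id}
                )
            )
            prefill_worker_id = response.get("worker_id")

        if prefill_worker_id is not None:
            return f"direct -> prefill worker {prefill_worker_id}"
        return "round_robin -> any prefill worker"
    finally:
        if used_prefill_router:
            # The real handler calls prefill_router_free_client.generate(...)
            # here so the router can drop its per-request state.
            print(f"freed router state for {request_id}")


print(asyncio.run(route_prefill(StubPrefillRouterClient(), [1, 2, 3], "req-1")))
```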

0 commit comments