Skip to content

Commit f3e3d94

Browse files
refactor: vLLM to new Python UX (#1983)
Co-authored-by: Graham King <grahamk@nvidia.com>
1 parent 9f2356c commit f3e3d94

File tree

25 files changed

+93
-233
lines changed

25 files changed

+93
-233
lines changed

examples/vllm/README.md renamed to components/backends/vllm/README.md

Lines changed: 10 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1,23 +1,11 @@
11
<!--
22
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
33
SPDX-License-Identifier: Apache-2.0
4-
5-
Licensed under the Apache License, Version 2.0 (the "License");
6-
you may not use this file except in compliance with the License.
7-
You may obtain a copy of the License at
8-
9-
http://www.apache.org/licenses/LICENSE-2.0
10-
11-
Unless required by applicable law or agreed to in writing, software
12-
distributed under the License is distributed on an "AS IS" BASIS,
13-
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14-
See the License for the specific language governing permissions and
15-
limitations under the License.
164
-->
175

18-
# LLM Deployment Examples using vLLM
6+
# LLM Deployment using vLLM
197

20-
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
8+
This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
219

2210
## Deployment Architectures
2311

@@ -36,11 +24,11 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
3624
### Build and Run docker
3725

3826
```bash
39-
./container/build.sh
27+
./container/build.sh --framework VLLM
4028
```
4129

4230
```bash
43-
./container/run.sh -it [--mount-workspace]
31+
./container/run.sh -it --framework VLLM [--mount-workspace]
4432
```
4533

4634
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
@@ -74,31 +62,31 @@ Note: The above architecture illustrates all the components. The final component
7462

7563
```bash
7664
# requires one gpu
77-
cd examples/vllm
65+
cd components/backends/vllm
7866
bash launch/agg.sh
7967
```
8068

8169
#### Aggregated Serving with KV Routing
8270

8371
```bash
8472
# requires two gpus
85-
cd examples/vllm
73+
cd components/backends/vllm
8674
bash launch/agg_router.sh
8775
```
8876

8977
#### Disaggregated Serving
9078

9179
```bash
9280
# requires two gpus
93-
cd examples/vllm
81+
cd components/backends/vllm
9482
bash launch/disagg.sh
9583
```
9684

9785
#### Disaggregated Serving with KV Routing
9886

9987
```bash
10088
# requires three gpus
101-
cd examples/vllm
89+
cd components/backends/vllm
10290
bash launch/disagg_router.sh
10391
```
10492

@@ -108,7 +96,7 @@ This example is not meant to be performant but showcases dynamo routing to data
10896

10997
```bash
11098
# requires four gpus
111-
cd examples/vllm
99+
cd components/backends/vllm
112100
bash launch/dep.sh
113101
```
114102

@@ -146,7 +134,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
146134
Example with disagg:
147135

148136
```bash
149-
cd ~/dynamo/examples/vllm/deploy
137+
cd ~/dynamo/components/backends/vllm/deploy
150138
kubectl apply -f disagg.yaml
151139
```
152140

examples/vllm/deepseek-r1.md renamed to components/backends/vllm/deepseek-r1.md

Lines changed: 1 addition & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,6 @@
11
<!--
22
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
33
SPDX-License-Identifier: Apache-2.0
4-
5-
Licensed under the Apache License, Version 2.0 (the "License");
6-
you may not use this file except in compliance with the License.
7-
You may obtain a copy of the License at
8-
9-
http://www.apache.org/licenses/LICENSE-2.0
10-
11-
Unless required by applicable law or agreed to in writing, software
12-
distributed under the License is distributed on an "AS IS" BASIS,
13-
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14-
See the License for the specific language governing permissions and
15-
limitations under the License.
164
-->
175

186
# Running Deepseek R1 with Wide EP
@@ -51,4 +39,4 @@ curl localhost:8080/v1/chat/completions \
5139
"stream": false,
5240
"max_tokens": 30
5341
}'
54-
```
42+
```

examples/vllm/deploy/agg.yaml renamed to components/backends/vllm/deploy/agg.yaml

Lines changed: 4 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,6 @@
11
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
3-
#
4-
# Licensed under the Apache License, Version 2.0 (the "License");
5-
# you may not use this file except in compliance with the License.
6-
# You may obtain a copy of the License at
7-
#
8-
# http://www.apache.org/licenses/LICENSE-2.0
9-
#
10-
# Unless required by applicable law or agreed to in writing, software
11-
# distributed under the License is distributed on an "AS IS" BASIS,
12-
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13-
# See the License for the specific language governing permissions and
14-
# limitations under the License.
3+
154
apiVersion: nvidia.com/v1alpha1
165
kind: DynamoGraphDeployment
176
metadata:
@@ -50,7 +39,7 @@ spec:
5039
extraPodSpec:
5140
mainContainer:
5241
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
53-
workingDir: /workspace/examples/vllm
42+
workingDir: /workspace/components/backends/vllm
5443
args:
5544
- dynamo
5645
- run
@@ -94,6 +83,6 @@ spec:
9483
extraPodSpec:
9584
mainContainer:
9685
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
97-
workingDir: /workspace/examples/vllm
86+
workingDir: /workspace/components/backends/vllm
9887
args:
99-
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
88+
- "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"

examples/vllm/deploy/agg_router.yaml renamed to components/backends/vllm/deploy/agg_router.yaml

Lines changed: 3 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,6 @@
11
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
3-
#
4-
# Licensed under the Apache License, Version 2.0 (the "License");
5-
# you may not use this file except in compliance with the License.
6-
# You may obtain a copy of the License at
7-
#
8-
# http://www.apache.org/licenses/LICENSE-2.0
9-
#
10-
# Unless required by applicable law or agreed to in writing, software
11-
# distributed under the License is distributed on an "AS IS" BASIS,
12-
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13-
# See the License for the specific language governing permissions and
14-
# limitations under the License.
3+
154
apiVersion: nvidia.com/v1alpha1
165
kind: DynamoGraphDeployment
176
metadata:
@@ -50,7 +39,7 @@ spec:
5039
extraPodSpec:
5140
mainContainer:
5241
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
53-
workingDir: /workspace/examples/vllm
42+
workingDir: /workspace/components/backends/vllm
5443
args:
5544
- dynamo
5645
- run
@@ -96,6 +85,6 @@ spec:
9685
extraPodSpec:
9786
mainContainer:
9887
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
99-
workingDir: /workspace/examples/vllm
88+
workingDir: /workspace/components/backends/vllm
10089
args:
10190
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"

examples/vllm/deploy/disagg.yaml renamed to components/backends/vllm/deploy/disagg.yaml

Lines changed: 4 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,6 @@
11
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
3-
#
4-
# Licensed under the Apache License, Version 2.0 (the "License");
5-
# you may not use this file except in compliance with the License.
6-
# You may obtain a copy of the License at
7-
#
8-
# http://www.apache.org/licenses/LICENSE-2.0
9-
#
10-
# Unless required by applicable law or agreed to in writing, software
11-
# distributed under the License is distributed on an "AS IS" BASIS,
12-
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13-
# See the License for the specific language governing permissions and
14-
# limitations under the License.
3+
154
apiVersion: nvidia.com/v1alpha1
165
kind: DynamoGraphDeployment
176
metadata:
@@ -50,7 +39,7 @@ spec:
5039
extraPodSpec:
5140
mainContainer:
5241
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
53-
workingDir: /workspace/examples/vllm
42+
workingDir: /workspace/components/backends/vllm
5443
args:
5544
- dynamo
5645
- run
@@ -94,7 +83,7 @@ spec:
9483
extraPodSpec:
9584
mainContainer:
9685
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
97-
workingDir: /workspace/examples/vllm
86+
workingDir: /workspace/components/backends/vllm
9887
args:
9988
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
10089
VllmPrefillWorker:
@@ -133,6 +122,6 @@ spec:
133122
extraPodSpec:
134123
mainContainer:
135124
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
136-
workingDir: /workspace/examples/vllm
125+
workingDir: /workspace/components/backends/vllm
137126
args:
138127
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log"

examples/vllm/deploy/disagg_planner.yaml renamed to components/backends/vllm/deploy/disagg_planner.yaml

Lines changed: 4 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,6 @@
11
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
3-
#
4-
# Licensed under the Apache License, Version 2.0 (the "License");
5-
# you may not use this file except in compliance with the License.
6-
# You may obtain a copy of the License at
7-
#
8-
# http://www.apache.org/licenses/LICENSE-2.0
9-
#
10-
# Unless required by applicable law or agreed to in writing, software
11-
# distributed under the License is distributed on an "AS IS" BASIS,
12-
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13-
# See the License for the specific language governing permissions and
14-
# limitations under the License.
3+
154
apiVersion: nvidia.com/v1alpha1
165
kind: DynamoGraphDeployment
176
metadata:
@@ -50,7 +39,7 @@ spec:
5039
extraPodSpec:
5140
mainContainer:
5241
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
53-
workingDir: /workspace/examples/vllm
42+
workingDir: /workspace/components/backends/vllm
5443
args:
5544
- dynamo
5645
- run
@@ -94,7 +83,7 @@ spec:
9483
extraPodSpec:
9584
mainContainer:
9685
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
97-
workingDir: /workspace/examples/vllm
86+
workingDir: /workspace/components/backends/vllm
9887
args:
9988
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
10089
VllmPrefillWorker:
@@ -133,6 +122,6 @@ spec:
133122
extraPodSpec:
134123
mainContainer:
135124
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
136-
workingDir: /workspace/examples/vllm
125+
workingDir: /workspace/components/backends/vllm
137126
args:
138127
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log"

examples/vllm/deploy/disagg_router.yaml renamed to components/backends/vllm/deploy/disagg_router.yaml

Lines changed: 7 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,6 @@
11
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
3-
#
4-
# Licensed under the Apache License, Version 2.0 (the "License");
5-
# you may not use this file except in compliance with the License.
6-
# You may obtain a copy of the License at
7-
#
8-
# http://www.apache.org/licenses/LICENSE-2.0
9-
#
10-
# Unless required by applicable law or agreed to in writing, software
11-
# distributed under the License is distributed on an "AS IS" BASIS,
12-
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13-
# See the License for the specific language governing permissions and
14-
# limitations under the License.
3+
154
apiVersion: nvidia.com/v1alpha1
165
kind: DynamoGraphDeployment
176
metadata:
@@ -50,16 +39,9 @@ spec:
5039
extraPodSpec:
5140
mainContainer:
5241
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
53-
workingDir: /workspace/examples/vllm
42+
workingDir: /workspace/components/backends/vllm
5443
args:
55-
- dynamo
56-
- run
57-
- in=http
58-
- out=dyn
59-
- --http-port
60-
- "8000"
61-
- --router-mode
62-
- kv
44+
- "python3 -m dynamo.frontend --http-port 8080 --router-mode kv"
6345
VllmDecodeWorker:
6446
dynamoNamespace: vllm-v1-disagg-router
6547
envFromSecret: hf-token-secret
@@ -96,9 +78,9 @@ spec:
9678
extraPodSpec:
9779
mainContainer:
9880
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
99-
workingDir: /workspace/examples/vllm
81+
workingDir: /workspace/components/backends/vllm
10082
args:
101-
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
83+
- "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
10284
VllmPrefillWorker:
10385
dynamoNamespace: vllm-v1-disagg-router
10486
envFromSecret: hf-token-secret
@@ -135,6 +117,6 @@ spec:
135117
extraPodSpec:
136118
mainContainer:
137119
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
138-
workingDir: /workspace/examples/vllm
120+
workingDir: /workspace/components/backends/vllm
139121
args:
140-
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log"
122+
- "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --is-prefill-worker 2>&1 | tee /tmp/vllm.log"

examples/vllm/launch/agg.sh renamed to components/backends/vllm/launch/agg.sh

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ set -e
55
trap 'echo Cleaning up...; kill 0' EXIT
66

77
# run ingress
8-
dynamo run in=http out=dyn &
8+
python -m dynamo.frontend &
99

1010
# run worker
11-
python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager --no-enable-prefix-caching
11+
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --no-enable-prefix-caching
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
#!/bin/bash
2+
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
# SPDX-License-Identifier: Apache-2.0
4+
set -e
5+
trap 'echo Cleaning up...; kill 0' EXIT
6+
7+
# run ingress
8+
python -m dynamo.frontend --router-mode kv &
9+
10+
# run workers
11+
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
12+
13+
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager

examples/vllm/launch/dep.sh renamed to components/backends/vllm/launch/dep.sh

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,13 +5,13 @@ set -e
55
trap 'echo Cleaning up...; kill 0' EXIT
66

77
# run ingress
8-
dynamo run in=http out=dyn --router-mode kv &
8+
python -m dynamo.frontend --router-mode kv &
99

1010
# Data Parallel Attention / Expert Parallelism
1111
# Routing to DP workers managed by Dynamo
1212
# Chose Qwen3-30B because its a small MOE that can fit on smaller GPUs (L40S for example)
1313
for i in {0..3}; do
14-
CUDA_VISIBLE_DEVICES=$i python3 components/main.py \
14+
CUDA_VISIBLE_DEVICES=$i python3 -m dynamo.vllm \
1515
--model Qwen/Qwen3-30B-A3B \
1616
--data-parallel-rank $i \
1717
--data-parallel-size 4 \

0 commit comments

Comments
 (0)