Skip to content

Commit 5bf23d5

Browse files
feat: update DynamoGraphDeployments for vllm_v1 (#1890)
Co-authored-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
1 parent 9e76590 commit 5bf23d5

File tree

6 files changed

+478
-253
lines changed

6 files changed

+478
-253
lines changed

examples/vllm/README.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -116,6 +116,40 @@ bash launch/dep.sh
116116
> [!TIP]
117117
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
118118
119+
### Kubernetes Deployment
120+
121+
For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
122+
123+
- `agg.yaml` - Aggregated serving
124+
- `agg_router.yaml` - Aggregated serving with KV routing
125+
- `disagg.yaml` - Disaggregated serving
126+
- `disagg_router.yaml` - Disaggregated serving with KV routing
127+
128+
#### Prerequisites
129+
130+
- **Dynamo Cloud**: Follow the [Quickstart Guide](../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
131+
132+
- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime`. If you don't have access, build and push your own image:
133+
```bash
134+
./container/build.sh --framework VLLM_V1
135+
# Tag and push to your container registry
136+
# Update the image references in the YAML files
137+
```
138+
139+
- **Port Forwarding**: After deployment, forward the frontend service to access the API:
140+
```bash
141+
kubectl port-forward deployment/vllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
142+
```
143+
144+
#### Deploy to Kubernetes
145+
146+
Example with disagg:
147+
148+
```bash
149+
cd ~/dynamo/examples/vllm/deploy
150+
kubectl apply -f disagg.yaml
151+
```
152+
119153
### Testing the Deployment
120154

121155
Send a test request to verify your deployment:

examples/vllm/deploy/agg.yaml

Lines changed: 49 additions & 53 deletions
Original file line numberDiff line numberDiff line change
@@ -15,10 +15,28 @@
1515
apiVersion: nvidia.com/v1alpha1
1616
kind: DynamoGraphDeployment
1717
metadata:
18-
name: agg
18+
name: vllm-v1-agg
1919
spec:
2020
services:
2121
Frontend:
22+
livenessProbe:
23+
httpGet:
24+
path: /health
25+
port: 8000
26+
initialDelaySeconds: 60
27+
periodSeconds: 60
28+
timeoutSeconds: 30
29+
failureThreshold: 10
30+
readinessProbe:
31+
exec:
32+
command:
33+
- /bin/sh
34+
- -c
35+
- 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
36+
initialDelaySeconds: 60
37+
periodSeconds: 60
38+
timeoutSeconds: 30
39+
failureThreshold: 10
2240
dynamoNamespace: vllm-v1-agg
2341
componentType: main
2442
replicas: 1
@@ -31,50 +49,38 @@ spec:
3149
memory: "2Gi"
3250
extraPodSpec:
3351
mainContainer:
34-
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.3.1
35-
workingDir: /workspace/examples/vllm_v1
52+
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
53+
workingDir: /workspace/examples/vllm
3654
args:
3755
- dynamo
38-
- serve
39-
- graphs.agg:Frontend
40-
- --system-app-port
41-
- "5000"
42-
- --enable-system-app
43-
- --use-default-health-checks
44-
- --service-name
45-
- Frontend
46-
- -f
47-
- ./configs/agg.yaml
48-
SimpleLoadBalancer:
49-
envFromSecret: hf-token-secret
50-
dynamoNamespace: vllm-v1-agg
51-
replicas: 1
52-
resources:
53-
requests:
54-
cpu: "1"
55-
memory: "20Gi"
56-
limits:
57-
cpu: "1"
58-
memory: "20Gi"
59-
extraPodSpec:
60-
mainContainer:
61-
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.3.1
62-
workingDir: /workspace/examples/vllm_v1
63-
args:
64-
- dynamo
65-
- serve
66-
- graphs.agg:SimpleLoadBalancer
67-
- --system-app-port
68-
- "5000"
69-
- --enable-system-app
70-
- --use-default-health-checks
71-
- --service-name
72-
- SimpleLoadBalancer
73-
- -f
74-
- ./configs/agg.yaml
56+
- run
57+
- in=http
58+
- out=dyn
59+
- --http-port
60+
- "8000"
7561
VllmDecodeWorker:
7662
envFromSecret: hf-token-secret
63+
livenessProbe:
64+
exec:
65+
command:
66+
- /bin/sh
67+
- -c
68+
- "exit 0"
69+
periodSeconds: 60
70+
timeoutSeconds: 30
71+
failureThreshold: 10
72+
readinessProbe:
73+
exec:
74+
command:
75+
- /bin/sh
76+
- -c
77+
- 'grep "VllmWorker.*has been initialized" /tmp/vllm.log'
78+
initialDelaySeconds: 60
79+
periodSeconds: 60
80+
timeoutSeconds: 30
81+
failureThreshold: 10
7782
dynamoNamespace: vllm-v1-agg
83+
componentType: worker
7884
replicas: 1
7985
resources:
8086
requests:
@@ -87,17 +93,7 @@ spec:
8793
gpu: "1"
8894
extraPodSpec:
8995
mainContainer:
90-
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.3.1
91-
workingDir: /workspace/examples/vllm_v1
96+
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
97+
workingDir: /workspace/examples/vllm
9298
args:
93-
- dynamo
94-
- serve
95-
- graphs.agg:VllmDecodeWorker
96-
- --system-app-port
97-
- "5000"
98-
- --enable-system-app
99-
- --use-default-health-checks
100-
- --service-name
101-
- VllmDecodeWorker
102-
- -f
103-
- ./configs/agg.yaml
99+
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"
Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
# SPDX-License-Identifier: Apache-2.0
3+
#
4+
# Licensed under the Apache License, Version 2.0 (the "License");
5+
# you may not use this file except in compliance with the License.
6+
# You may obtain a copy of the License at
7+
#
8+
# http://www.apache.org/licenses/LICENSE-2.0
9+
#
10+
# Unless required by applicable law or agreed to in writing, software
11+
# distributed under the License is distributed on an "AS IS" BASIS,
12+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
# See the License for the specific language governing permissions and
14+
# limitations under the License.
15+
apiVersion: nvidia.com/v1alpha1
16+
kind: DynamoGraphDeployment
17+
metadata:
18+
name: vllm-v1-agg
19+
spec:
20+
services:
21+
Frontend:
22+
livenessProbe:
23+
httpGet:
24+
path: /health
25+
port: 8000
26+
initialDelaySeconds: 60
27+
periodSeconds: 60
28+
timeoutSeconds: 30
29+
failureThreshold: 10
30+
readinessProbe:
31+
exec:
32+
command:
33+
- /bin/sh
34+
- -c
35+
- 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
36+
initialDelaySeconds: 60
37+
periodSeconds: 60
38+
timeoutSeconds: 30
39+
failureThreshold: 10
40+
dynamoNamespace: vllm-v1-agg
41+
componentType: main
42+
replicas: 1
43+
resources:
44+
requests:
45+
cpu: "1"
46+
memory: "2Gi"
47+
limits:
48+
cpu: "1"
49+
memory: "2Gi"
50+
extraPodSpec:
51+
mainContainer:
52+
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
53+
workingDir: /workspace/examples/vllm
54+
args:
55+
- dynamo
56+
- run
57+
- in=http
58+
- out=dyn
59+
- --http-port
60+
- "8000"
61+
VllmDecodeWorker:
62+
envFromSecret: hf-token-secret
63+
livenessProbe:
64+
exec:
65+
command:
66+
- /bin/sh
67+
- -c
68+
- "exit 0"
69+
periodSeconds: 60
70+
timeoutSeconds: 30
71+
failureThreshold: 10
72+
readinessProbe:
73+
exec:
74+
command:
75+
- /bin/sh
76+
- -c
77+
- 'grep "VllmWorker.*has been initialized" /tmp/vllm.log'
78+
initialDelaySeconds: 60
79+
periodSeconds: 60
80+
timeoutSeconds: 30
81+
failureThreshold: 10
82+
dynamoNamespace: vllm-v1-agg
83+
componentType: worker
84+
replicas: 2
85+
resources:
86+
requests:
87+
cpu: "10"
88+
memory: "20Gi"
89+
gpu: "1"
90+
limits:
91+
cpu: "10"
92+
memory: "20Gi"
93+
gpu: "1"
94+
extraPodSpec:
95+
mainContainer:
96+
image: nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime:dep-216.4
97+
workingDir: /workspace/examples/vllm
98+
args:
99+
- "python3 components/main.py --model Qwen/Qwen3-0.6B --enforce-eager 2>&1 | tee /tmp/vllm.log"

0 commit comments

Comments
 (0)