Commit 6eb5ad1

tanmayv25 and biswapanda committed
feat: Add trtllm deploy examples for k8s (#2133)
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
1 parent 2a616da commit 6eb5ad1

5 files changed: 568 additions and 0 deletions

components/backends/trtllm/README.md

Lines changed: 59 additions & 0 deletions
@@ -185,6 +185,65 @@ For comprehensive instructions on multinode serving, see the [multinode-examples
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define `DynamoGraphDeployment` resources for various configurations:

- `agg.yaml` - Aggregated serving
- `agg_router.yaml` - Aggregated serving with KV routing
- `disagg.yaml` - Disaggregated serving
- `disagg_router.yaml` - Disaggregated serving with KV routing

#### Prerequisites

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:
  ```bash
  ./container/build.sh --framework tensorrtllm
  # Tag and push to your container registry
  # Update the image references in the YAML files
  ```

- **Port Forwarding**: After deployment, forward the frontend service to access the API:
  ```bash
  kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
  ```

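The "update the image references" step can be sketched as a small helper. This is a minimal sketch, not part of the repository: `rewrite_image` and the `registry.example.com/...` path are hypothetical names chosen for illustration, and the old image tag is the one that appears in the `agg.yaml` and `agg_router.yaml` manifests in this commit. Tagging and pushing the image itself depends on your container tooling and registry, so it is left out.

```bash
# Image reference used by the manifests in this commit.
OLD_IMAGE="nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17"
# Hypothetical registry path; replace it with the image you pushed.
NEW_IMAGE="${NEW_IMAGE:-registry.example.com/my-team/trtllm-runtime:my-tag}"

# Print the manifest given as $1 with every matching image reference rewritten.
# The original file is left untouched.
rewrite_image() {
  sed "s|image: ${OLD_IMAGE}|image: ${NEW_IMAGE}|" "$1"
}

# Usage (from the deploy/ directory):
#   rewrite_image agg.yaml > agg.local.yaml
#   kubectl apply -f agg.local.yaml -n "$NAMESPACE"
```

Writing the result to a separate file keeps the checked-in manifests unchanged, which makes it easier to pick up later updates to them.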
#### Deploy to Kubernetes

The following example deploys the disaggregated configuration. First, export the `NAMESPACE` you used in your Dynamo Cloud installation.

```bash
cd dynamo
cd components/backends/trtllm/deploy
kubectl apply -f disagg.yaml -n "$NAMESPACE"
```

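If `NAMESPACE` was never exported, the `kubectl apply` command will not target the namespace you intended. The sketch below wraps the apply step in a guard and then lists the pods in that namespace. `deploy_graph` is a hypothetical helper name, not something the repository provides.

```bash
# Hypothetical helper: apply a manifest only when NAMESPACE is set.
deploy_graph() {
  local manifest="$1"
  if [ -z "${NAMESPACE:-}" ]; then
    echo "NAMESPACE is not set; export the namespace used for Dynamo Cloud" >&2
    return 1
  fi
  # Apply the manifest, then list the pods in the same namespace.
  kubectl apply -f "$manifest" -n "$NAMESPACE" &&
    kubectl get pods -n "$NAMESPACE"
}

# Usage (from components/backends/trtllm/deploy):
#   export NAMESPACE=<your-namespace>
#   deploy_graph disagg.yaml
```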
To change the `DYN_LOG` level, add the following to the YAML file:

```yaml
...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
  ...
```

### Client

See the [client](../llm/README.md#client) section to learn how to send requests to the deployment.

NOTE: To send a request to a multi-node deployment, target the node that is running `dynamo-run in=http`.

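Before sending requests, it can be useful to confirm that the forwarded frontend is healthy. The sketch below reuses the `/health` check from the Frontend probes in the manifests (`jq -e ".status == \"healthy\""`). It assumes the port-forward from the prerequisites is active on local port 8080 and that `curl` and `jq` are installed locally; `is_healthy` and `check_frontend` are hypothetical helper names.

```bash
# Assumption: the port-forward maps local port 8080 to the frontend's port 8000.
FRONTEND_URL="${FRONTEND_URL:-http://localhost:8080}"

# Succeeds when the JSON document on stdin reports status "healthy",
# mirroring the jq expression used by the Frontend probes.
is_healthy() {
  jq -e '.status == "healthy"' > /dev/null
}

# Query the forwarded frontend and evaluate its health response.
check_frontend() {
  curl -s "${FRONTEND_URL}/health" | is_healthy
}

# Usage:
#   check_frontend && echo "frontend is healthy"
```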
### Benchmarking

To benchmark your deployment with GenAI-Perf, use the [perf.sh](../../benchmarks/llm/perf.sh) utility script, setting the `model` name and `host` to match your deployment.

## Disaggregation Strategy

The disaggregation strategy controls how requests are distributed between the prefill and decode workers in a disaggregated deployment.
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: trtllm-agg
spec:
  services:
    Frontend:
      dynamoNamespace: trtllm-agg
      componentType: main
      livenessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 3
        failureThreshold: 10
      replicas: 1
      resources:
        requests:
          cpu: "5"
          memory: "10Gi"
        limits:
          cpu: "5"
          memory: "10Gi"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17
          workingDir: /workspace/components/backends/trtllm
          command:
            - /bin/sh
            - -c
          args:
            - "python3 -m dynamo.frontend --http-port 8000"
    TRTLLMWorker:
      envFromSecret: hf-token-secret
      livenessProbe:
        httpGet:
          path: /live
          port: 9090
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health
          port: 9090
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 60
      dynamoNamespace: trtllm-agg
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      envs:
        - name: DYN_SYSTEM_ENABLED
          value: "true"
        - name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
          value: "[\"generate\"]"
        - name: DYN_SYSTEM_PORT
          value: "9090"
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 60
          image: nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17
          workingDir: /workspace/components/backends/trtllm
          args:
            - "python3"
            - "-m"
            - "dynamo.trtllm"
            - "--model-path"
            - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
            - "--served-model-name"
            - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
            - "--extra-engine-args"
            - "engine_configs/agg.yaml"
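A note on the `TRTLLMWorker` startup probe in the manifest above: Kubernetes allows a container roughly `failureThreshold` times `periodSeconds` to pass its startup probe before restarting it. The arithmetic below is a worked example using the two values from the manifest; the variable names are illustrative only.

```bash
# Values copied from the TRTLLMWorker startupProbe in the manifest.
period_seconds=10
failure_threshold=60

# Upper bound on startup time before the container is restarted.
startup_budget=$((period_seconds * failure_threshold))
echo "startup budget: ${startup_budget}s"   # prints "startup budget: 600s"
```

If loading the model takes longer than this budget in your environment, raising `failureThreshold` is one way to extend it.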
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: trtllm-agg-router
spec:
  services:
    Frontend:
      livenessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 3
        failureThreshold: 5
      dynamoNamespace: trtllm-agg-router
      componentType: main
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17
          workingDir: /workspace/components/backends/trtllm
          command:
            - /bin/sh
            - -c
          args:
            - "python3 -m dynamo.frontend --http-port 8000 --router-mode kv"
    TRTLLMWorker:
      envFromSecret: hf-token-secret
      livenessProbe:
        httpGet:
          path: /live
          port: 9090
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health
          port: 9090
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 60
      dynamoNamespace: trtllm-agg-router
      componentType: worker
      replicas: 2
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      envs:
        - name: DYN_SYSTEM_ENABLED
          value: "true"
        - name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
          value: "[\"generate\"]"
        - name: DYN_SYSTEM_PORT
          value: "9090"
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 60
          image: nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17
          workingDir: /workspace/components/backends/trtllm
          args:
            - "python3"
            - "-m"
            - "dynamo.trtllm"
            - "--model-path"
            - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
            - "--served-model-name"
            - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
            - "--extra-engine-args"
            - "engine_configs/agg.yaml"
            - "--publish-events-and-metrics"
