
Commit b74b887

fix: add instruction to deploy model with inference gateway (#2257)
1 parent 8f24c02 commit b74b887

7 files changed: +587 −119 lines changed

components/backends/sglang/deploy/README.md

Lines changed: 27 additions & 1 deletion
@@ -103,8 +103,34 @@ args:
```

### 3. Deploy

Use the following commands to deploy the model.

First, create a secret for the HuggingFace token.

```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

Then, deploy the model using the deployment file.

```bash
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
```
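
After applying the file, you can confirm that the resources were created and that pods are starting. A minimal check, assuming the DynamoGraphDeployment CRD is registered under the plural name `dynamographdeployments`:

```bash
# List DynamoGraphDeployment resources in the target namespace
kubectl get dynamographdeployments -n ${NAMESPACE}

# Check that the frontend and worker pods reach Running/Ready
kubectl get pods -n ${NAMESPACE}
```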

### 4. Using Custom Dynamo Frameworks Image for SGLang

To use a custom Dynamo frameworks image for SGLang, you can update the deployment file using yq:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```
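
Before applying the generated manifest, you can validate it client-side with kubectl's built-in dry-run mode:

```bash
# Render and validate the generated manifest without creating resources
kubectl apply --dry-run=client -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```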
## Model Configuration

components/backends/trtllm/README.md

Lines changed: 3 additions & 46 deletions
@@ -189,61 +189,18 @@ For comprehensive instructions on multinode serving, see the [multinode-examples

### Kubernetes Deployment

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md).

### Client

See the [client](../llm/README.md#client) section to learn how to send requests to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

### Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the `model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)

## Disaggregation Strategy
components/backends/trtllm/deploy/README.md

Lines changed: 288 additions & 1 deletion

@@ -1 +1,288 @@

# TensorRT-LLM Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying TensorRT-LLM inference graphs using the **DynamoGraphDeployment** resource.

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)

Basic deployment pattern with frontend and a single worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with kv router mode disabled)
- `TRTLLMWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)

Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with kv router mode enabled)
- `TRTLLMWorker`: Multiple workers handling both prefill and decode (2 replicas for load balancing)

### 3. **Disaggregated Deployment** (`disagg.yaml`)

High-performance deployment with separated prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker

### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)

Advanced disaggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: HTTP API server (with kv router mode enabled)
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker (2 replicas for load balancing)

## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
```

### Key Configuration Options

**Resource Management:**
```yaml
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
  mainContainer:
    image: nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17
    workingDir: /workspace/components/backends/trtllm
    args:
      - "python3"
      - "-m"
      - "dynamo.trtllm"
      # Model-specific arguments
```

## Prerequisites

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/quickstart.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for TensorRT-LLM runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)

### Container Images

The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:

```bash
./container/build.sh --framework tensorrtllm
# Tag and push to your container registry
# Update the image references in the YAML files
```
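
For example, assuming the build produces a local image tagged `dynamo:latest-tensorrtllm` (check the output of `build.sh` for the actual tag), tagging and pushing might look like:

```bash
# Hypothetical local tag -- substitute the tag reported by build.sh
docker tag dynamo:latest-tensorrtllm your-registry/trtllm-runtime:my-tag
docker push your-registry/trtllm-runtime:my-tag
```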

**Note:** TensorRT-LLM uses git-lfs, which needs to be installed in advance:
```bash
apt-get update && apt-get -y install git git-lfs
```

For ARM machines, use:
```bash
./container/build.sh --framework tensorrtllm --platform linux/arm64
```

## Usage

### 1. Choose Your Template

Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for simple testing
- Use `agg_router.yaml` for production with KV cache routing and load balancing
- Use `disagg.yaml` for maximum performance with separated workers
- Use `disagg_router.yaml` for high performance with KV cache routing and disaggregation

### 2. Customize Configuration

Edit the template to match your environment:

```yaml
# Update image registry and tag
image: your-registry/trtllm-runtime:your-tag

# Configure your model and deployment settings
args:
  - "python3"
  - "-m"
  - "dynamo.trtllm"
  # Add your model-specific arguments
```

### 3. Deploy

See the [Create Deployment Guide](../../../../docs/guides/dynamo_deploy/create_deployment.md) to learn how to apply a deployment file.

First, export the NAMESPACE you used in your Dynamo Cloud installation, then create a secret for the HuggingFace token.

```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

Then, deploy the model using the deployment file.

```bash
cd dynamo/components/backends/trtllm/deploy
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
```
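
To block until the pods are ready before sending traffic, `kubectl wait` can be used. The label selector below is illustrative; inspect your pods' labels first, since the labels applied by the operator may differ:

```bash
# Discover the labels the operator applied to the pods
kubectl get pods -n $NAMESPACE --show-labels

# Wait for matching pods to become Ready (illustrative selector)
kubectl wait pod --for=condition=Ready \
  -l <your-deployment-label> -n $NAMESPACE --timeout=600s
```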

### 4. Using Custom Dynamo Frameworks Image for TensorRT-LLM

To use a custom Dynamo frameworks image for TensorRT-LLM, you can update the deployment file using yq:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<trtllm-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```

### 5. Port Forwarding

After deployment, forward the frontend service to access the API:

```bash
kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000
```
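
With the port-forward active, the frontend exposes an OpenAI-compatible API, so a quick sanity check is to list the served models:

```bash
# Should return the model(s) registered with this deployment
curl http://localhost:8000/v1/models
```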

## Configuration Options

### Environment Variables

To change the `DYN_LOG` level, edit the yaml file by adding:

```yaml
...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
...
```
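
The same change can be scripted with yq instead of editing by hand, mirroring the image-override pattern above (this assumes `envs` sits at the top level of `spec`, as in the snippet):

```bash
# Append the DYN_LOG environment variable to the deployment spec
yq '.spec.envs += [{"name": "DYN_LOG", "value": "debug"}]' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```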

### TensorRT-LLM Worker Configuration

TensorRT-LLM workers are configured through command-line arguments in the deployment YAML. Key configuration areas include:

- **Disaggregation Strategy**: Control request flow with the `DISAGGREGATION_STRATEGY` environment variable
- **KV Cache Transfer**: Choose between UCX (default) or NIXL for disaggregated serving
- **Request Migration**: Enable graceful failure handling with `--migration-limit`

### Disaggregation Strategy

The disaggregation strategy controls how requests are distributed between prefill and decode workers:

- **`decode_first`** (default): Requests are routed to the decode worker first, then forwarded to the prefill worker
- **`prefill_first`**: Requests are routed directly to the prefill worker (used with KV routing)

Set it via an environment variable:
```yaml
envs:
  - name: DISAGGREGATION_STRATEGY
    value: "prefill_first"
```

## Testing the Deployment

Send a test request to verify your deployment. See the [client section](../../../../components/backends/llm/README.md#client) for detailed instructions.

**Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
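
As a quick smoke test, assuming the port-forward from step 5 is active and substituting the model name you deployed, a minimal OpenAI-style request looks like:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<your-model-name>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```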

## Model Configuration

The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.

### Multi-Token Prediction (MTP) Support

For models supporting Multi-Token Prediction (such as DeepSeek R1), special configuration is available. Note that MTP requires the experimental TensorRT-LLM commit:

```bash
./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
```

## Monitoring and Health

- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
- **Worker health endpoints**: `http://<worker-service>:9090/health`
- **Liveness probes**: Check process health every 5 seconds
- **Readiness probes**: Check service readiness with configurable delays
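
With port-forwards in place, these endpoints can be probed directly; a minimal sketch (the exact service or pod to forward to depends on your deployment):

```bash
# Frontend health, after forwarding the frontend on port 8000
curl http://localhost:8000/health

# Worker health, after forwarding a worker pod's port 9090
curl http://localhost:9090/health
```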

## KV Cache Transfer Methods

TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving:

- **UCX** (default): Standard method for KV cache transfer
- **NIXL** (experimental): Alternative transfer method

For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-tranfer.md).

## Request Migration

You can enable [request migration](../../../../docs/architecture/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:

```yaml
args:
  - "python3"
  - "-m"
  - "dynamo.trtllm"
  - "--migration-limit"
  - "3"
```

## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script: [perf.sh](../../../../benchmarks/llm/perf.sh)

Configure the `model` name and `host` based on your deployment.

## Further Reading

- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
- **Multinode Deployment**: [Multinode Examples](../multinode/multinode-examples.md)
- **Speculative Decoding**: [Llama 4 + Eagle Guide](../llama4_plus_eagle.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and HuggingFace token secret
2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size
5. **Port forwarding issues**: Ensure correct pod UUID in port-forward command
6. **Git LFS issues**: Ensure git-lfs is installed before building containers
7. **ARM deployment**: Use `--platform linux/arm64` when building on ARM machines
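
For most of the issues above, the standard Kubernetes diagnostics are the right first step:

```bash
# Inspect pod status, recent events, and container logs
kubectl get pods -n $NAMESPACE
kubectl describe pod <pod-name> -n $NAMESPACE
kubectl logs <pod-name> -n $NAMESPACE --all-containers
```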
For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
