# TensorRT-LLM Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying TensorRT-LLM inference graphs using the **DynamoGraphDeployment** resource.

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)
Basic deployment pattern with a frontend and a single worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with KV router mode disabled)
- `TRTLLMWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with KV router mode enabled)
- `TRTLLMWorker`: Multiple workers handling both prefill and decode (2 replicas for load balancing)

### 3. **Disaggregated Deployment** (`disagg.yaml`)
High-performance deployment with separated prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker

### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
Advanced disaggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: HTTP API server (with KV router mode enabled)
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker (2 replicas for load balancing)

## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
```
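
As a concrete illustration, a disaggregated deployment declares one service per component from the patterns above. The sketch below is illustrative only: it combines fields shown elsewhere in this README (`envFromSecret`, `resources`, `extraPodSpec.mainContainer`), the deployment name and image are placeholders, and `disagg.yaml` in this directory remains the authoritative reference.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: trtllm-disagg            # placeholder name
spec:
  services:
    Frontend:
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          image: <trtllm-runtime-image>
    TRTLLMPrefillWorker:
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: <trtllm-runtime-image>
          workingDir: /workspace/components/backends/trtllm
          args: ["python3", "-m", "dynamo.trtllm"]   # plus model-specific arguments
    TRTLLMDecodeWorker:
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: <trtllm-runtime-image>
          workingDir: /workspace/components/backends/trtllm
          args: ["python3", "-m", "dynamo.trtllm"]   # plus model-specific arguments
```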

### Key Configuration Options

**Resource Management:**
```yaml
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
  mainContainer:
    image: nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17
    workingDir: /workspace/components/backends/trtllm
    args:
      - "python3"
      - "-m"
      - "dynamo.trtllm"
      # Model-specific arguments
```

## Prerequisites

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/quickstart.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for TensorRT-LLM runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)

### Container Images

The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:

```bash
./container/build.sh --framework tensorrtllm
# Tag and push to your container registry
# Update the image references in the YAML files
```
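
For example, the tag-and-push step might look like the following. The local image name and the registry path are placeholders; substitute the image produced by `build.sh` and your own registry.

```bash
# Replace <built-image> with the image name produced by build.sh (check `docker images`),
# and registry.example.com/trtllm-runtime:my-tag with your own registry path and tag.
docker tag <built-image> registry.example.com/trtllm-runtime:my-tag
docker push registry.example.com/trtllm-runtime:my-tag
```

After pushing, update the `image:` field under `extraPodSpec.mainContainer` in the chosen YAML file to point at the pushed tag.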

**Note:** TensorRT-LLM uses git-lfs, which needs to be installed in advance:
```bash
apt-get update && apt-get -y install git git-lfs
```

For ARM machines, use:
```bash
./container/build.sh --framework tensorrtllm --platform linux/arm64
```

## Usage

### 1. Choose Your Template
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for simple testing
- Use `agg_router.yaml` for production with KV cache routing and load balancing
- Use `disagg.yaml` for maximum performance with separated workers
- Use `disagg_router.yaml` for high performance with KV cache routing and disaggregation

### 2. Customize Configuration
Edit the template to match your environment:

```yaml
# Update image registry and tag
image: your-registry/trtllm-runtime:your-tag

# Configure your model and deployment settings
args:
  - "python3"
  - "-m"
  - "dynamo.trtllm"
  # Add your model-specific arguments
```

### 3. Deploy

See the [Create Deployment Guide](../../../../docs/guides/dynamo_deploy/create_deployment.md) to learn how to apply a deployment file.

First, create a secret for the HuggingFace token:
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```
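
Optionally, confirm the secret exists in the target namespace before deploying:

```bash
kubectl get secret hf-token-secret -n ${NAMESPACE}
```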

Then deploy the model using the deployment file. Export the `NAMESPACE` you used in your Dynamo Cloud installation:

```bash
cd dynamo/components/backends/trtllm/deploy
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
```
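
After applying the file, you can watch the pods come up; the exact pod names are generated from the deployment name in the template:

```bash
# Wait until the frontend and worker pods report Running and Ready.
kubectl get pods -n $NAMESPACE
```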

### 4. Using a Custom Dynamo Frameworks Image for TensorRT-LLM

To use a custom Dynamo frameworks image for TensorRT-LLM, you can update the deployment file using yq:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<trtllm-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```

### 5. Port Forwarding

After deployment, forward the frontend service to access the API:

```bash
kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000
```
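
The frontend deployment name is generated at deploy time, so one way to find the exact name to port-forward is to list the deployments in your namespace and filter for the frontend:

```bash
kubectl get deployments -n $NAMESPACE | grep frontend
```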

## Configuration Options

### Environment Variables

To change the `DYN_LOG` level, add an `envs` entry to the YAML file:

```yaml
...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
  ...
```

### TensorRT-LLM Worker Configuration

TensorRT-LLM workers are configured through command-line arguments in the deployment YAML. Key configuration areas include:

- **Disaggregation Strategy**: Control request flow with the `DISAGGREGATION_STRATEGY` environment variable
- **KV Cache Transfer**: Choose between UCX (default) or NIXL for disaggregated serving
- **Request Migration**: Enable graceful failure handling with `--migration-limit`

### Disaggregation Strategy

The disaggregation strategy controls how requests are distributed between prefill and decode workers:

- **`decode_first`** (default): Requests are routed to the decode worker first, then forwarded to the prefill worker
- **`prefill_first`**: Requests are routed directly to the prefill worker (used with KV routing)

Set it via an environment variable:
```yaml
envs:
  - name: DISAGGREGATION_STRATEGY
    value: "prefill_first"
```

## Testing the Deployment

Send a test request to verify your deployment. See the [client section](../../../../components/backends/llm/README.md#client) for detailed instructions; a minimal example is shown below.

**Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
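
For example, with the frontend port-forwarded as shown earlier, a chat completion request can be sent to the OpenAI-compatible endpoint. The model name below is a placeholder; use the model configured in your worker arguments.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-served-model-name>",
    "messages": [{"role": "user", "content": "Hello, what is TensorRT-LLM?"}],
    "max_tokens": 64,
    "stream": false
  }'
```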

## Model Configuration

The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.

### Multi-Token Prediction (MTP) Support

For models supporting Multi-Token Prediction (such as DeepSeek R1), special configuration is available. Note that MTP requires building the container with the experimental TensorRT-LLM commit:

```bash
./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
```
| 229 | + |
| 230 | +## Monitoring and Health |
| 231 | + |
| 232 | +- **Frontend health endpoint**: `http://<frontend-service>:8000/health` |
| 233 | +- **Worker health endpoints**: `http://<worker-service>:9090/health` |
| 234 | +- **Liveness probes**: Check process health every 5 seconds |
| 235 | +- **Readiness probes**: Check service readiness with configurable delays |
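
As a quick check, with the frontend port-forwarded as described above, the health endpoint can be queried directly:

```bash
curl http://localhost:8000/health
```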

## KV Cache Transfer Methods

TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving:

- **UCX** (default): Standard method for KV cache transfer
- **NIXL** (experimental): Alternative transfer method

For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-tranfer.md).

## Request Migration

You can enable [request migration](../../../../docs/architecture/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:

```yaml
args:
  - "python3"
  - "-m"
  - "dynamo.trtllm"
  - "--migration-limit"
  - "3"
```

## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script: [perf.sh](../../../../benchmarks/llm/perf.sh)

Configure the `model` name and `host` based on your deployment.

## Further Reading

- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
- **Multinode Deployment**: [Multinode Examples](../multinode/multinode-examples.md)
- **Speculative Decoding**: [Llama 4 + Eagle Guide](../llama4_plus_eagle.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and the HuggingFace token secret
2. **GPU not allocated**: Verify the cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce the model batch size
5. **Port forwarding issues**: Ensure the correct pod UUID in the port-forward command
6. **Git LFS issues**: Ensure git-lfs is installed before building containers
7. **ARM deployment**: Use `--platform linux/arm64` when building on ARM machines

For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).