| 1 | +# SGLang Kubernetes Deployment Configurations |
| 2 | + |
| 3 | +This directory contains Kubernetes manifest templates for deploying SGLang inference graphs using the **DynamoGraphDeployment** custom resource. |
| 4 | + |
| 5 | +## Available Deployment Patterns |
| 6 | + |
| 7 | +### 1. **Aggregated Deployment** (`agg.yaml`) |
| 8 | +Basic deployment pattern with frontend and a single decode worker. |
| 9 | + |
| 10 | +**Architecture:** |
| 11 | +- `Frontend`: OpenAI-compatible API server |
| 12 | +- `SGLangDecodeWorker`: Single worker handling both prefill and decode |
| 13 | + |
| 14 | +### 2. **Aggregated Router Deployment** (`agg_router.yaml`) |
| 15 | +Enhanced aggregated deployment with KV cache routing capabilities. |
| 16 | + |
| 17 | +**Architecture:** |
| 18 | +- `Frontend`: OpenAI-compatible API server with router mode enabled (`--router-mode kv`) |
| 19 | +- `SGLangDecodeWorker`: Single worker handling both prefill and decode |
| 20 | + |
| 21 | +### 3. **Disaggregated Deployment** (`disagg.yaml`) |
| 22 | +High-performance deployment with separated prefill and decode workers. |
| 23 | + |
| 24 | +**Architecture:** |
| 25 | +- `Frontend`: HTTP API server coordinating between workers |
| 26 | +- `SGLangDecodeWorker`: Specialized decode-only worker (`--disaggregation-mode decode`) |
| 27 | +- `SGLangPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`) |
| 28 | +- Communication via NIXL transfer backend (`--disaggregation-transfer-backend nixl`) |
| 29 | + |
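As a rough sketch, the worker split above could appear in a `DynamoGraphDeployment` spec as follows. The flags are the ones listed above, and the `extraPodSpec.mainContainer.args` layout follows the container configuration shown elsewhere in this README. The deployment name is a placeholder. The entrypoint, image, resources, and model arguments are deliberately left out, so treat `disagg.yaml` itself as the source of truth.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-disagg-example   # hypothetical name, not taken from disagg.yaml
spec:
  services:
    SGLangDecodeWorker:
      extraPodSpec:
        mainContainer:
          args:
            # entrypoint and model arguments omitted in this sketch
            - "--disaggregation-mode"
            - "decode"
            - "--disaggregation-transfer-backend"
            - "nixl"
    SGLangPrefillWorker:
      extraPodSpec:
        mainContainer:
          args:
            # entrypoint and model arguments omitted in this sketch
            - "--disaggregation-mode"
            - "prefill"
            - "--disaggregation-transfer-backend"
            - "nixl"
```

Both workers name the same transfer backend so that KV cache produced by the prefill worker can be handed to the decode worker.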
| 30 | +## CRD Structure |
| 31 | + |
| 32 | +All templates use the **DynamoGraphDeployment** CRD: |
| 33 | + |
| 34 | +```yaml |
| 35 | +apiVersion: nvidia.com/v1alpha1 |
| 36 | +kind: DynamoGraphDeployment |
| 37 | +metadata: |
| 38 | + name: <deployment-name> |
| 39 | +spec: |
| 40 | + services: |
| 41 | + <ServiceName>: |
| 42 | + # Service configuration |
| 43 | +``` |
| 44 | + |
| 45 | +### Key Configuration Options |
| 46 | + |
| 47 | +**Resource Management:** |
| 48 | +```yaml |
| 49 | +resources: |
| 50 | + requests: |
| 51 | + cpu: "10" |
| 52 | + memory: "20Gi" |
| 53 | + gpu: "1" |
| 54 | + limits: |
| 55 | + cpu: "10" |
| 56 | + memory: "20Gi" |
| 57 | + gpu: "1" |
| 58 | +``` |
| 59 | + |
| 60 | +**Container Configuration:** |
| 61 | +```yaml |
| 62 | +extraPodSpec: |
| 63 | + mainContainer: |
| 64 | + image: my-registry/sglang-runtime:my-tag |
| 65 | + workingDir: /workspace/components/backends/sglang |
| 66 | + args: |
| 67 | + - "python3" |
| 68 | + - "-m" |
| 69 | + - "dynamo.sglang.worker" |
| 70 | + # Model-specific arguments |
| 71 | +``` |
| 72 | + |
| 73 | +## Prerequisites |
| 74 | + |
| 75 | +Before using these templates, ensure you have: |
| 76 | + |
| 77 | +1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md) |
| 78 | +2. **Kubernetes cluster with GPU support** |
| 79 | +3. **Container registry access** for SGLang runtime images |
| 80 | +4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`) |
| 81 | + |
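For item 4, one way to provide the token is a standard Kubernetes `Secret` like the sketch below. The secret name matches the `envFromSecret` reference above. The key name `HF_TOKEN` is an assumption, so confirm which environment variable your runtime image reads before relying on it.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret   # name referenced by envFromSecret in the templates
type: Opaque
stringData:
  # Assumed key name; replace the placeholder with your own token.
  HF_TOKEN: "<your-huggingface-token>"
```

Create the secret in the same namespace as the deployment, since pods can only reference secrets from their own namespace.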
| 82 | +## Usage |
| 83 | + |
| 84 | +### 1. Choose Your Template |
| 85 | +Select the deployment pattern that matches your requirements: |
| 86 | +- Use `agg.yaml` for development/testing |
| 87 | +- Use `agg_router.yaml` for production with KV cache-aware routing |
| 88 | +- Use `disagg.yaml` for maximum performance |
| 89 | + |
| 90 | +### 2. Customize Configuration |
| 91 | +Edit the template to match your environment: |
| 92 | + |
| 93 | +```yaml |
| 94 | +# Update image registry and tag |
| 95 | +image: your-registry/sglang-runtime:your-tag |
| 96 | + |
| 97 | +# Configure your model |
| 98 | +args: |
| 99 | + - "--model-path" |
| 100 | + - "your-org/your-model" |
| 101 | + - "--served-model-name" |
| 102 | + - "your-org/your-model" |
| 103 | +``` |
| 104 | + |
| 105 | +### 3. Deploy |
| 106 | +```bash |
| 107 | +kubectl apply -f <your-template>.yaml |
| 108 | +``` |
| 109 | + |
| 110 | +## Model Configuration |
| 111 | + |
| 112 | +All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model, but you can pass any SGLang argument or configuration. Key parameters such as `--model-path` and `--served-model-name` are set in each worker's `args` list. |
| 113 | + |
| 114 | +## Monitoring and Health |
| 115 | + |
| 116 | +- **Frontend health endpoint**: `http://<frontend-service>:8000/health` |
| 117 | +- **Liveness probes**: Check process health every 60s |
| 118 | + |
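As an illustration only, a liveness probe consistent with the two points above can be written with the standard Kubernetes probe fields. Whether the shipped templates use an HTTP check or a command-based check, where the probe block sits inside a service definition, and the `initialDelaySeconds` value are all assumptions here.

```yaml
livenessProbe:
  httpGet:
    path: /health           # frontend health endpoint noted above
    port: 8000
  initialDelaySeconds: 60   # assumed value; raise it if the model loads slowly
  periodSeconds: 60         # matches the 60s interval noted above
```

A longer `initialDelaySeconds` gives large models time to load before the first check runs.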
| 119 | +## Further Reading |
| 120 | + |
| 121 | +- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md) |
| 122 | +- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md) |
| 123 | +- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md) |
| 124 | +- **Examples**: [Deployment Examples](../../../../docs/examples/README.md) |
| 125 | +- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) |
| 126 | + |
| 127 | +## Troubleshooting |
| 128 | + |
| 129 | +Common issues and solutions: |
| 130 | + |
| 131 | +1. **Pod fails to start**: Check image registry access and HuggingFace token secret |
| 132 | +2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits |
| 133 | +3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds` |
| 134 | +4. **Out of memory**: Increase memory limits or reduce model batch size |
| 135 | + |
| 136 | +For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting). |