Commit 1ad6abe

feat: add sgl deploy readme (#2238)
1 parent 8c75ed7 commit 1ad6abe

2 files changed: +138 -2 lines changed

components/backends/sglang/README.md

Lines changed: 2 additions & 2 deletions
@@ -173,10 +173,10 @@ Below we provide a selected list of advanced examples. Please open up an issue i
 
 ## Deployment
 
-We currently provide deployment examples for Kubernetes (coming soon!) and SLURM
+We currently provide deployment examples for Kubernetes and SLURM.
 
 ## Kubernetes
-- **[Deploying Dynamo with SGLang on Kubernetes - coming soon!](.)**
+- **[Deploying Dynamo with SGLang on Kubernetes](deploy/README.md)**
 
 ## SLURM
 - **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# SGLang Kubernetes Deployment Configurations

This directory contains Kubernetes custom resource templates for deploying SGLang inference graphs with the **DynamoGraphDeployment** Custom Resource Definition (CRD).

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)
Basic deployment pattern with a frontend and a single decode worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server
- `SGLangDecodeWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server with router mode enabled (`--router-mode kv`)
- `SGLangDecodeWorker`: Single worker handling both prefill and decode

### 3. **Disaggregated Deployment** (`disagg.yaml`)
High-performance deployment with separated prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `SGLangDecodeWorker`: Specialized decode-only worker (`--disaggregation-mode decode`)
- `SGLangPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
- Communication via NIXL transfer backend (`--disaggregation-transfer-backend nixl`)

## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
```

### Key Configuration Options

**Resource Management:**
```yaml
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
  mainContainer:
    image: my-registry/sglang-runtime:my-tag
    workingDir: /workspace/components/backends/sglang
    args:
      - "python3"
      - "-m"
      - "dynamo.sglang.worker"
      # Model-specific arguments
```

## Prerequisites

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for SGLang runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
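
The secret in prerequisite 4 can be created with `kubectl create secret generic`. The sketch below assumes the token is stored under the key `HF_TOKEN`. That key, the helper name `create_hf_secret`, and the namespace argument are illustrative and not taken from the templates, so check which key your template actually expects.

```bash
# Hypothetical helper: create the secret that the templates reference via
# `envFromSecret: hf-token-secret`.
# Assumption: the worker reads the token from a key named HF_TOKEN.
create_hf_secret() {
  local namespace="$1"
  local token="$2"
  if [ -z "$namespace" ] || [ -z "$token" ]; then
    echo "usage: create_hf_secret <namespace> <hf-token>" >&2
    return 1
  fi
  kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="$token" \
    --namespace "$namespace"
}

# Usage:
#   create_hf_secret my-namespace "$HF_TOKEN"
```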

## Usage

### 1. Choose Your Template
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for development/testing
- Use `agg_router.yaml` for production with KV-cache-aware routing
- Use `disagg.yaml` for the highest performance, with separate prefill and decode workers

### 2. Customize Configuration
Edit the template to match your environment:

```yaml
# Update image registry and tag
image: your-registry/sglang-runtime:your-tag

# Configure your model
args:
  - "--model-path"
  - "your-org/your-model"
  - "--served-model-name"
  - "your-org/your-model"
```
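
One way to apply the image change without editing the file by hand is a `sed` substitution. This is a sketch that assumes the template still contains the literal placeholder `my-registry/sglang-runtime:my-tag` shown under Container Configuration. The helper name `set_image` and the example image reference are hypothetical.

```bash
# Hypothetical helper: print a template with the image placeholder replaced.
# Assumption: the template contains the literal `my-registry/sglang-runtime:my-tag`.
set_image() {
  local template="$1"
  local image="$2"
  # `|` is the sed delimiter because image references contain `/`.
  # The image must not itself contain `|` or `&`, which sed treats specially.
  sed "s|my-registry/sglang-runtime:my-tag|${image}|g" "$template"
}

# Usage (writes a customized copy and leaves the original untouched):
#   set_image agg.yaml registry.example.com/team/sglang-runtime:v1 > agg.custom.yaml
```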

### 3. Deploy
```bash
kubectl apply -f <your-template>.yaml
```
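
After applying a template, it helps to confirm that the resource was accepted and that pods are being scheduled. A minimal sketch using standard `kubectl` commands follows. The helper name `deploy_and_check` and the namespace argument are assumptions for illustration.

```bash
# Hypothetical helper: apply a template, then show what it created.
deploy_and_check() {
  local manifest="$1"
  local namespace="$2"
  kubectl apply -f "$manifest" --namespace "$namespace" || return 1
  # `kubectl get -f` looks up the objects defined in the file,
  # so no resource name has to be guessed.
  kubectl get -f "$manifest" --namespace "$namespace"
  # List pods so scheduling or image pull problems are visible early.
  kubectl get pods --namespace "$namespace"
}

# Usage:
#   deploy_and_check agg.yaml my-namespace
```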

## Model Configuration

All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model, but you can pass any SGLang argument or configuration option.

## Monitoring and Health

- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
- **Liveness probes**: Check process health every 60s
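
Model loading can take a while, so the health endpoint may fail at first. The sketch below polls a URL with `curl` until it succeeds. The helper name `wait_for_health`, the retry defaults, and the `kubectl port-forward` usage in the comments (including the `<frontend-service>` placeholder) are assumptions, not part of the templates.

```bash
# Hypothetical helper: poll a health URL until it answers successfully
# or the attempts run out.
wait_for_health() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: exit non-zero on HTTP errors; -s: no progress output.
    if curl -fs -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}

# Usage from a workstation:
#   kubectl port-forward svc/<frontend-service> 8000:8000 &
#   wait_for_health http://localhost:8000/health
```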

## Further Reading

- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and HuggingFace token secret
2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size
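
The checks above usually start from the same few `kubectl` commands. A sketch, assuming GPUs are advertised under the `nvidia.com/gpu` resource name (as with the NVIDIA device plugin). The helper name `diagnose_pod` is hypothetical.

```bash
# Hypothetical helper: gather first-pass diagnostics for one pod.
diagnose_pod() {
  local namespace="$1"
  local pod="$2"
  # Events and container state: image pull errors, missing secrets,
  # unschedulable GPU requests, OOMKilled containers.
  kubectl describe pod "$pod" --namespace "$namespace"
  # Recent container logs: model loading progress and runtime errors.
  kubectl logs "$pod" --namespace "$namespace" --tail=100
  # Allocatable GPUs per node (assumes the `nvidia.com/gpu` resource name).
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage:
#   diagnose_pod my-namespace <pod-name>
```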

For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
