Skip to content

Commit e22f84a

Browse files
feat: support sglang in sla planner (#2421)
Signed-off-by: Hongkuan Zhou <tedzhouhk@gmail.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
1 parent 9b87c89 commit e22f84a

File tree

6 files changed

+290
-7
lines changed

6 files changed

+290
-7
lines changed

README.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -58,8 +58,8 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
5858
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) ||||
5959
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
6060
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) ||||
61-
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | | 🚧 | 🚧 |
62-
| [**Load Based Planner**](/docs/architecture/load_planner.md) || 🚧 | 🚧 |
61+
| [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 |
62+
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) || | 🚧 |
6363
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |
6464

6565
To learn more about each framework and their capabilities, check out each framework's README!

components/backends/sglang/README.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
3737
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) || |
3838
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
3939
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) || |
40-
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | | Planned |
40+
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | | |
4141
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) || Planned |
4242
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) || Planned |
4343

@@ -197,7 +197,7 @@ curl localhost:8000/v1/chat/completions \
197197
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
198198
}
199199
],
200-
"stream": false,
200+
"stream": true,
201201
"max_tokens": 30
202202
}'
203203
```
Lines changed: 267 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,267 @@
1+
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
# SPDX-License-Identifier: Apache-2.0
3+
4+
apiVersion: nvidia.com/v1alpha1
5+
kind: DynamoGraphDeployment
6+
metadata:
7+
name: sglang-disagg-planner
8+
annotations:
9+
nvidia.com/enable-grove: "false"
10+
spec:
11+
envs:
12+
- name: DYNAMO_SERVICE_CONFIG
13+
value: '{"Prometheus":{"global":{"scrape_interval":"5s"},"scrape_configs":[{"job_name":"prometheus","static_configs":[{"targets":["localhost:9090"]}]},{"job_name":"frontend","static_configs":[{"targets":["sglang-disagg-planner-frontend:8000"]}]}]}}'
14+
- name: DYNAMO_NAMESPACE
15+
value: "dynamo"
16+
services:
17+
Frontend:
18+
dynamoNamespace: dynamo
19+
livenessProbe:
20+
httpGet:
21+
path: /health
22+
port: 8000
23+
initialDelaySeconds: 20
24+
periodSeconds: 5
25+
timeoutSeconds: 5
26+
failureThreshold: 3
27+
readinessProbe:
28+
exec:
29+
command:
30+
- /bin/sh
31+
- -c
32+
- 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
33+
initialDelaySeconds: 60
34+
periodSeconds: 60
35+
timeoutSeconds: 30
36+
failureThreshold: 10
37+
componentType: main
38+
replicas: 1
39+
resources:
40+
requests:
41+
cpu: "10"
42+
memory: "10Gi"
43+
limits:
44+
cpu: "32"
45+
memory: "40Gi"
46+
extraPodSpec:
47+
mainContainer:
48+
image: nvcr.io/nvidian/nim-llm-dev/sglang-runtime:hzhou-0811-1
49+
workingDir: /workspace/components/backends/sglang
50+
command: ["sh", "-c"]
51+
args:
52+
- "python3 -m dynamo.sglang.utils.clear_namespace --namespace sglang-disagg && python3 -m dynamo.frontend --http-port=8000"
53+
Planner:
54+
dynamoNamespace: dynamo
55+
envFromSecret: hf-token-secret
56+
componentType: planner
57+
replicas: 1
58+
livenessProbe:
59+
exec:
60+
command:
61+
- /bin/sh
62+
- -c
63+
- "exit 0"
64+
periodSeconds: 60
65+
timeoutSeconds: 30
66+
failureThreshold: 10
67+
readinessProbe:
68+
exec:
69+
command:
70+
- /bin/sh
71+
- -c
72+
- "exit 0"
73+
initialDelaySeconds: 60
74+
periodSeconds: 60
75+
timeoutSeconds: 30
76+
failureThreshold: 10
77+
resources:
78+
requests:
79+
cpu: "2"
80+
memory: "2Gi"
81+
limits:
82+
cpu: "8"
83+
memory: "16Gi"
84+
pvc:
85+
create: false
86+
name: profiling-pvc # Must be pre-created before deployment and SLA profiler must have been run
87+
mountPoint: /workspace/profiling_results
88+
extraPodSpec:
89+
mainContainer:
90+
image: nvcr.io/nvidian/nim-llm-dev/sglang-runtime:hzhou-0811-1
91+
workingDir: /workspace/components/planner/src/dynamo/planner
92+
args:
93+
- python
94+
- -m
95+
- planner_sla
96+
- --environment=kubernetes
97+
- --backend=sglang
98+
- --adjustment-interval=60
99+
- --profile-results-dir=/workspace/profiling_results
100+
Prometheus:
101+
dynamoNamespace: dynamo
102+
componentType: main
103+
replicas: 1
104+
envs:
105+
- name: PYTHONPATH
106+
value: "/workspace/components/planner/src"
107+
livenessProbe:
108+
exec:
109+
command:
110+
- /bin/sh
111+
- -c
112+
- "exit 0"
113+
periodSeconds: 60
114+
timeoutSeconds: 30
115+
failureThreshold: 10
116+
readinessProbe:
117+
exec:
118+
command:
119+
- /bin/sh
120+
- -c
121+
- "exit 0"
122+
initialDelaySeconds: 30
123+
periodSeconds: 60
124+
timeoutSeconds: 30
125+
failureThreshold: 10
126+
resources:
127+
requests:
128+
cpu: "2"
129+
memory: "2Gi"
130+
limits:
131+
cpu: "8"
132+
memory: "16Gi"
133+
extraPodSpec:
134+
mainContainer:
135+
image: nvcr.io/nvidian/nim-llm-dev/sglang-runtime:hzhou-0811-1
136+
workingDir: /workspace/components/backends/sglang
137+
command:
138+
- /bin/sh
139+
- -c
140+
args:
141+
- "python3 -m dynamo.planner.prometheus"
142+
SGLangDecodeWorker:
143+
dynamoNamespace: dynamo
144+
envFromSecret: hf-token-secret
145+
livenessProbe:
146+
httpGet:
147+
path: /live
148+
port: 9090
149+
periodSeconds: 5
150+
timeoutSeconds: 30
151+
failureThreshold: 1
152+
readinessProbe:
153+
httpGet:
154+
path: /health
155+
port: 9090
156+
periodSeconds: 10
157+
timeoutSeconds: 30
158+
failureThreshold: 60
159+
componentType: worker
160+
replicas: 2
161+
resources:
162+
requests:
163+
cpu: "10"
164+
memory: "20Gi"
165+
gpu: "1"
166+
limits:
167+
cpu: "32"
168+
memory: "80Gi"
169+
gpu: "1"
170+
envs:
171+
- name: DYN_SYSTEM_ENABLED
172+
value: "true"
173+
- name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
174+
value: "[\"generate\"]"
175+
- name: DYN_SYSTEM_PORT
176+
value: "9090"
177+
extraPodSpec:
178+
mainContainer:
179+
startupProbe:
180+
httpGet:
181+
path: /live
182+
port: 9090
183+
periodSeconds: 10
184+
failureThreshold: 60
185+
image: nvcr.io/nvidian/nim-llm-dev/sglang-runtime:hzhou-0811-1
186+
workingDir: /workspace/components/backends/sglang
187+
args:
188+
- "python3"
189+
- "-m"
190+
- "dynamo.sglang.decode_worker"
191+
- "--model-path"
192+
- "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
193+
- "--served-model-name"
194+
- "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
195+
- "--page-size"
196+
- "16"
197+
- "--tp"
198+
- "1"
199+
- "--trust-remote-code"
200+
- "--skip-tokenizer-init"
201+
- "--disaggregation-mode"
202+
- "decode"
203+
- "--disaggregation-transfer-backend"
204+
- "nixl"
205+
SGLangPrefillWorker:
206+
dynamoNamespace: dynamo
207+
envFromSecret: hf-token-secret
208+
livenessProbe:
209+
httpGet:
210+
path: /live
211+
port: 9090
212+
periodSeconds: 5
213+
timeoutSeconds: 30
214+
failureThreshold: 1
215+
readinessProbe:
216+
httpGet:
217+
path: /health
218+
port: 9090
219+
periodSeconds: 10
220+
timeoutSeconds: 30
221+
failureThreshold: 60
222+
componentType: worker
223+
replicas: 2
224+
resources:
225+
requests:
226+
cpu: "10"
227+
memory: "20Gi"
228+
gpu: "1"
229+
limits:
230+
cpu: "32"
231+
memory: "80Gi"
232+
gpu: "1"
233+
envs:
234+
- name: DYN_SYSTEM_ENABLED
235+
value: "true"
236+
- name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
237+
value: "[\"generate\"]"
238+
- name: DYN_SYSTEM_PORT
239+
value: "9090"
240+
extraPodSpec:
241+
mainContainer:
242+
startupProbe:
243+
httpGet:
244+
path: /health
245+
port: 9090
246+
periodSeconds: 10
247+
failureThreshold: 60
248+
image: nvcr.io/nvidian/nim-llm-dev/sglang-runtime:hzhou-0811-1
249+
workingDir: /workspace/components/backends/sglang
250+
args:
251+
- "python3"
252+
- "-m"
253+
- "dynamo.sglang.worker"
254+
- "--model-path"
255+
- "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
256+
- "--served-model-name"
257+
- "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
258+
- "--page-size"
259+
- "16"
260+
- "--tp"
261+
- "1"
262+
- "--trust-remote-code"
263+
- "--skip-tokenizer-init"
264+
- "--disaggregation-mode"
265+
- "prefill"
266+
- "--disaggregation-transfer-backend"
267+
- "nixl"

components/planner/src/dynamo/planner/planner_sla.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -62,7 +62,7 @@ async def generate(request: RequestType):
6262
parser.add_argument(
6363
"--backend",
6464
default=SLAPlannerDefaults.backend,
65-
choices=["vllm"],
65+
choices=["vllm", "sglang"],
6666
help="Backend type",
6767
)
6868
parser.add_argument(

container/Dockerfile.sglang

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -231,6 +231,20 @@ ARG CARGO_BUILD_JOBS
231231
# which might exceed the number of opened files limit.
232232
ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS:-16}
233233

234+
# Install prometheus
235+
ARG PROM_VERSION=3.4.1
236+
RUN ARCH=$(dpkg --print-architecture) && \
237+
case "$ARCH" in \
238+
amd64) PLATFORM=linux-amd64 ;; \
239+
arm64) PLATFORM=linux-arm64 ;; \
240+
*) echo "Unsupported architecture: $ARCH" && exit 1 ;; \
241+
esac && \
242+
curl -fsSL https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz \
243+
| tar -xz -C /tmp && \
244+
mv /tmp/prometheus-${PROM_VERSION}.${PLATFORM}/prometheus /usr/local/bin/ && \
245+
chmod +x /usr/local/bin/prometheus && \
246+
rm -rf /tmp/prometheus-${PROM_VERSION}.${PLATFORM}
247+
234248
#######################################
235249
########## Local Development ##########
236250
#######################################

docs/guides/dynamo_deploy/sla_planner_deployment.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# SLA Planner Deployment Guide
22

3-
Quick deployment guide for the vLLM disaggregated planner with automatic scaling.
3+
Quick deployment guide for the disaggregated planner with automatic scaling.
44

55
> [!NOTE]
66
> For high-level architecture and concepts, see [SLA-based Planner](../../architecture/sla_planner.md).
@@ -34,9 +34,11 @@ export NAMESPACE=your-namespace
3434

3535
## 1. Deploy the System
3636

37+
We use vllm as the backend engine in this guide. SLA planner also supports SGLang and will support TensorRT-LLM. Checkout `disagg_planner.yaml` in their example deployment folders for more details. The deployment is the same for all backends.
38+
3739
```bash
3840
# Apply the disaggregated planner deployment
39-
kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
41+
kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE # for vllm
4042

4143
# Check deployment status
4244
kubectl get pods -n $NAMESPACE

0 commit comments

Comments
 (0)