
Commit ccf304a

feat: add comprehensive SLA planner scaling tests
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
1 parent e3619ce commit ccf304a

10 files changed: +2119 -1 lines changed

docs/architecture/sla_planner.md

Lines changed: 1 addition & 1 deletion
@@ -115,4 +115,4 @@ kubectl apply -f disagg_planner.yaml -n {$NAMESPACE}
```diff
 > [!NOTE]
-> The SLA planner requires a frontend that reports metrics at `/metrics` HTTP endpoint with number of requests, ISL, OSL, TTFT, ITL in the correct format. The dynamo frontend provides these metrics automatically.
+> The SLA planner requires a frontend that reports metrics at `/metrics` HTTP endpoint with number of requests, ISL, OSL, TTFT, ITL in the correct format. The dynamo frontend provides these metrics automatically.
```
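
A quick way to sanity-check that a frontend exposes such metrics is to poll the endpoint directly. Below is a minimal sketch; the URL and the keyword substrings it searches for are illustrative assumptions, not metric names taken from this commit.

```python
# Minimal sketch: confirm the frontend's /metrics endpoint is reachable and
# spot-check for planner-related series. FRONTEND_URL and the keywords below
# are illustrative assumptions, not names from this commit.
import urllib.request

FRONTEND_URL = "http://localhost:8000"  # hypothetical port-forwarded frontend

with urllib.request.urlopen(f"{FRONTEND_URL}/metrics", timeout=5) as resp:
    text = resp.read().decode("utf-8")

# Look for hints that request counts and latency-related series are exported.
for keyword in ("request", "ttft", "itl", "input_sequence", "output_sequence"):
    matching = [line for line in text.splitlines() if keyword in line.lower()]
    print(f"{keyword}: {len(matching)} matching metric lines")
```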

tests/planner/.gitignore

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
```
# E2E test results - don't commit test artifacts to git
e2e_scaling_results/

# Temporary files
*.tmp
*.log

# Python cache
__pycache__/
*.pyc
*.pyo
```

tests/planner/README.md

Lines changed: 62 additions & 0 deletions
@@ -86,6 +86,7 @@ python benchmarks/sin_load_generator/sin_synth.py \

The dataset starts at 12 requests/s, increases to 36 requests/s at t=300s, decreases back to 12 requests/s at t=600s, and repeats.
The total duration is 30 minutes or 1800 seconds.

## Planner Dry Run

Before testing the SLA planner on real deployments, we provide a dry run feature to test the autoscaling behavior on a given dataset. Specifically, in dry run mode,
@@ -129,3 +130,64 @@ The second plot shows the actual ISL/OSL and the predicted ISL/OSL. The first tw
The third plot shows the actual prefill throughput, the number of prefill workers that the planner scales to, and the safe throughput limit for that number of prefill workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the TTFT SLA. Note that in a real deployment, due to other factors such as queueing, load balancing, KV cache transfer latency, and ISL variance, it is not guaranteed that the actual deployment can adhere to the TTFT SLA.

The fourth plot, similar to the third plot, shows the actual decode throughput, the number of decode workers that the planner scales to, and the safe throughput limit for that number of decode workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the ITL SLA. Note that in a real deployment, due to other factors such as load balancing and OSL variance, it is not guaranteed that the actual deployment can adhere to the ITL SLA.

## Scaling Tests

This directory contains comprehensive tests for validating the SLA planner's scaling behavior, covering both the replica calculation logic and end-to-end scaling. The scaling test uses a graduated load approach rather than dataset files, as this proved more reliable for metric generation and scaling triggers.

### Test Types

1. **Unit Tests** (`test_replica_calculation.py`) - Test the mathematical formulas for calculating prefill and decode replicas in isolation (see the sketch after this list)
2. **End-to-End Tests** (`run_scaling_test.sh`) - Test the complete workflow, including Kubernetes deployment, load generation, and pod scaling validation

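As a rough illustration of the math these unit tests target (required replicas are the observed load divided by the profiled safe per-worker throughput, rounded up), here is a minimal sketch. The helper name and the throughput/load numbers are assumptions for this example, not the repo's actual API or the H200 profiling results.

```python
# Illustrative sketch of the replica math the unit tests exercise; the helper
# name and the per-worker capacity below are assumptions for this example,
# not values from the profiling results shipped with this commit.
import math


def required_replicas(load_per_s: float, safe_throughput_per_worker: float) -> int:
    """Round the load / capacity ratio up, with at least one worker."""
    return max(1, math.ceil(load_per_s / safe_throughput_per_worker))


def test_prefill_scales_up_under_load():
    # e.g. if one prefill worker safely sustains ~20 req/s at this ISL,
    # 25 req/s should require a second prefill worker (1P1D -> 2P1D).
    assert required_replicas(8.0, 20.0) == 1
    assert required_replicas(25.0, 20.0) == 2
```
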
### Quick Start

#### Run Unit Tests Only
Test the replica calculation logic without requiring Kubernetes:

```bash
python -m pytest test_replica_calculation.py -v
```

#### Run Full End-to-End Test
Test complete scaling behavior including Kubernetes deployment and load generation:

```bash
./run_scaling_test.sh
```

With custom namespace:
```bash
./run_scaling_test.sh --namespace production
```

To save results to `tests/planner/e2e_scaling_results` instead of `/tmp`:
```bash
./run_scaling_test.sh --save-results
```

**E2E Test Deployment Management:**
- If no deployment exists: creates, tests, and cleans up the deployment
- If a deployment exists: uses the existing deployment and preserves it
- Perfect for development workflows where you want to keep deployments running between tests

**Test Scenario**

The main test scenario validates prefill scaling on H200 from a 1P1D to a 2P1D configuration (a rough load-driver sketch follows this list):

- **Phase 1**: 8 req/s for 90s (baseline - maintains 1P1D)
- **Phase 2**: 15 req/s for 120s (moderate load - maintains 1P1D)
- **Phase 3**: 25 req/s for 180s (scaling trigger - scales to 2P1D)
- **ISL/OSL**: 4000/150 tokens (optimized for the prefill bottleneck)
- **Transition delay**: 30s between phases
- **Total test duration**: ~7 minutes + scaling observation
- **Smart cleanup**: only removes the deployment if the test created it (preserves existing deployments)

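For a concrete picture of the graduated load above, here is a minimal Python sketch of a load driver against the frontend. The URL, payload shape, and prompt sizing are assumptions for illustration; the actual test drives load with genai-perf via `run_scaling_test.sh`.

```python
# Minimal sketch of the graduated load phases; the endpoint URL, payload, and
# prompt sizing are assumptions for illustration -- the real test generates
# load with genai-perf rather than this script.
import json
import time
import urllib.request

FRONTEND_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical
PHASES = [(8, 90), (15, 120), (25, 180)]  # (requests per second, duration in s)
TRANSITION_DELAY_S = 30


def send_request() -> None:
    payload = {
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "messages": [{"role": "user", "content": "tell me a story. " * 250}],  # long prompt to stress prefill
        "max_tokens": 150,
    }
    req = urllib.request.Request(
        FRONTEND_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=120).read()


for rate, duration in PHASES:
    end = time.time() + duration
    while time.time() < end:
        start = time.time()
        send_request()  # a real driver would issue these requests concurrently
        time.sleep(max(0.0, 1.0 / rate - (time.time() - start)))
    time.sleep(TRANSITION_DELAY_S)
```
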
### Prerequisites for E2E Tests

- Kubernetes cluster with GPU nodes
- kubectl configured and accessible
- genai-perf available in PATH
- Python dependencies installed

For detailed configuration, troubleshooting, and architecture information, see [README_scaling_tests.md](README_scaling_tests.md).
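
As an illustration of the kind of pod-count check the E2E flow performs after the scaling trigger, here is a minimal sketch using kubectl; the namespace and the name-matching substring are assumptions, not the script's actual selectors.

```python
# Minimal sketch of counting prefill worker pods with kubectl; the namespace
# and the "prefillworker" substring are assumptions for illustration, not the
# selectors used by run_scaling_test.sh.
import subprocess

NAMESPACE = "default"  # hypothetical namespace

out = subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE, "-o", "name"],
    capture_output=True, text=True, check=True,
).stdout

prefill_pods = [line for line in out.splitlines() if "prefillworker" in line.lower()]
print(f"prefill worker pods: {len(prefill_pods)}")
```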

tests/planner/conftest.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""
Local conftest.py for planner tests to disable automatic test logging.
This overrides the autouse logger fixture from the parent conftest.py.
"""

import pytest


@pytest.fixture(autouse=True)
def logger(request):
    """Dummy logger fixture that does nothing - overrides the parent one."""
    yield
```

tests/planner/disagg_planner.yaml

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg-planner
  annotations:
    nvidia.com/enable-grove: "false" # temporarily disable grove because current k8s connector does not work with grove
spec:
  envs:
    - name: DYNAMO_SERVICE_CONFIG
      value: '{"Prometheus":{"global":{"scrape_interval":"5s"},"scrape_configs":[{"job_name":"prometheus","static_configs":[{"targets":["localhost:9090"]}]},{"job_name":"frontend","static_configs":[{"targets":["vllm-disagg-planner-frontend:8000"]}]}]}}'
    - name: DYNAMO_NAMESPACE
      value: "vllm-disagg-planner"
    - name: PROMETHEUS_PORT
      value: "8000"
  services:
    Frontend:
      dynamoNamespace: vllm-disagg-planner
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          args:
            - "python3 -m dynamo.frontend --http-port 8000"
    Planner:
      dynamoNamespace: vllm-disagg-planner
      envFromSecret: hf-token-secret
      componentType: planner
      replicas: 1
      livenessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          workingDir: /workspace/components/planner/src/dynamo/planner
          ports:
            - name: metrics
              containerPort: 9085
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m planner_sla
              --environment=kubernetes
              --backend=vllm
              --adjustment-interval=60
              --profile-results-dir=/workspace/tests/planner/profiling_results/H200_TP1P_TP1D
              --prometheus-port=9085
              --ttft=0.1
              --itl=0.01
              --load-predictor=constant
    Prometheus: # NOTE: this is set on Prometheus to ensure a service is created for the Prometheus component. This is a workaround and should be managed differently.
      dynamoNamespace: vllm-disagg-planner
      componentType: frontend
      replicas: 1
      envs:
        - name: PYTHONPATH
          value: "/workspace/components/planner/src"
      livenessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        initialDelaySeconds: 30
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - "python3 -m dynamo.planner.prometheus"
    VllmDecodeWorker:
      dynamoNamespace: vllm-disagg-planner
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 30
            failureThreshold: 60
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - "python3 -m dynamo.vllm --model nvidia/Llama-3.1-8B-Instruct-FP8 --migration-limit=3 --max-model-len=8192"
    VllmPrefillWorker:
      dynamoNamespace: vllm-disagg-planner
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 30
            failureThreshold: 60
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model nvidia/Llama-3.1-8B-Instruct-FP8 --is-prefill-worker --migration-limit=3 --max-model-len=8192
```
