
Commit ccf304a

feat: add comprehensive SLA planner scaling tests
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
1 parent e3619ce commit ccf304a

10 files changed: +2119 -1 lines changed

docs/architecture/sla_planner.md

Lines changed: 1 addition & 1 deletion
@@ -115,4 +115,4 @@ kubectl apply -f disagg_planner.yaml -n {$NAMESPACE}
```diff
 > [!NOTE]
-> The SLA planner requires a frontend that reports metrics at `/metrics` HTTP endpoint with number of requests, ISL, OSL, TTFT, ITL in the correct format. The dynamo frontend provides these metrics automatically.
+> The SLA planner requires a frontend that reports metrics at `/metrics` HTTP endpoint with number of requests, ISL, OSL, TTFT, ITL in the correct format. The dynamo frontend provides these metrics automatically.
```
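
A quick way to sanity-check that a frontend exposes such metrics is to poll the endpoint directly. Below is a minimal sketch; the URL and the keyword substrings it searches for are illustrative assumptions, not metric names taken from this commit.

```python
# Minimal sketch: confirm the frontend's /metrics endpoint is reachable and
# spot-check for planner-related series. FRONTEND_URL and the keywords below
# are illustrative assumptions, not names from this commit.
import urllib.request

FRONTEND_URL = "http://localhost:8000"  # hypothetical port-forwarded frontend

with urllib.request.urlopen(f"{FRONTEND_URL}/metrics", timeout=5) as resp:
    text = resp.read().decode("utf-8")

# Look for hints that request counts and latency-related series are exported.
for keyword in ("request", "ttft", "itl", "input_sequence", "output_sequence"):
    matching = [line for line in text.splitlines() if keyword in line.lower()]
    print(f"{keyword}: {len(matching)} matching metric lines")
```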

tests/planner/.gitignore

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
```
# E2E test results - don't commit test artifacts to git
e2e_scaling_results/

# Temporary files
*.tmp
*.log

# Python cache
__pycache__/
*.pyc
*.pyo
```

tests/planner/README.md

Lines changed: 62 additions & 0 deletions
@@ -86,6 +86,7 @@ python benchmarks/sin_load_generator/sin_synth.py \

The dataset starts at 12 requests/s, increases to 36 requests/s at t=300s, decreases back to 12 requests/s at t=600s, and repeats.
The total duration is 30 minutes or 1800 seconds.

## Planner Dry Run

Before testing the SLA planner on real deployments, we provide a dry run feature to test the autoscaling behavior on a given dataset. Specifically, in dry run mode,
@@ -129,3 +130,64 @@ The second plot shows the actual ISL/OSL and the predicted ISL/OSL. The first tw
The third plot shows the actual prefill throughput, the number of prefill workers that the planner scales to, and the safe throughput limit for that number of prefill workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the TTFT SLA. Note that in a real deployment, due to other factors such as queueing, load balancing, KV cache transfer latency, and ISL variance, it is not guaranteed that the actual deployment can adhere to the TTFT SLA.

The fourth plot, similar to the third plot, shows the actual decode throughput, the number of decode workers that the planner scales to, and the safe throughput limit for that number of decode workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the ITL SLA. Note that in a real deployment, due to other factors such as load balancing and OSL variance, it is not guaranteed that the actual deployment can adhere to the ITL SLA.

## Scaling Tests

This directory contains comprehensive tests for validating the SLA planner's scaling behavior, covering both the replica calculation logic and end-to-end scaling. The scaling test uses a graduated load approach rather than dataset files, as this proved more reliable for metric generation and scaling triggers.

### Test Types

1. **Unit Tests** (`test_replica_calculation.py`) - Test the mathematical formulas for calculating prefill and decode replicas in isolation (see the sketch after this list)
2. **End-to-End Tests** (`run_scaling_test.sh`) - Test the complete workflow, including Kubernetes deployment, load generation, and pod scaling validation

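As a rough illustration of the math these unit tests target (required replicas are the observed load divided by the profiled safe per-worker throughput, rounded up), here is a minimal sketch. The helper name and the throughput/load numbers are assumptions for this example, not the repo's actual API or the H200 profiling results.

```python
# Illustrative sketch of the replica math the unit tests exercise; the helper
# name and the per-worker capacity below are assumptions for this example,
# not values from the profiling results shipped with this commit.
import math


def required_replicas(load_per_s: float, safe_throughput_per_worker: float) -> int:
    """Round the load / capacity ratio up, with at least one worker."""
    return max(1, math.ceil(load_per_s / safe_throughput_per_worker))


def test_prefill_scales_up_under_load():
    # e.g. if one prefill worker safely sustains ~20 req/s at this ISL,
    # 25 req/s should require a second prefill worker (1P1D -> 2P1D).
    assert required_replicas(8.0, 20.0) == 1
    assert required_replicas(25.0, 20.0) == 2
```
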
### Quick Start

#### Run Unit Tests Only
Test the replica calculation logic without requiring Kubernetes:

```bash
python -m pytest test_replica_calculation.py -v
```

#### Run Full End-to-End Test
Test complete scaling behavior including Kubernetes deployment and load generation:

```bash
./run_scaling_test.sh
```

With custom namespace:
```bash
./run_scaling_test.sh --namespace production
```

To save results to `tests/planner/e2e_scaling_results` instead of `/tmp`:
```bash
./run_scaling_test.sh --save-results
```

**E2E Test Deployment Management:**
- If no deployment exists: creates, tests, and cleans up the deployment
- If a deployment exists: uses the existing deployment and preserves it
- Perfect for development workflows where you want to keep deployments running between tests

**Test Scenario**

The main test scenario validates prefill scaling on H200 from a 1P1D to a 2P1D configuration (a rough load-driver sketch follows this list):

- **Phase 1**: 8 req/s for 90s (baseline - maintains 1P1D)
- **Phase 2**: 15 req/s for 120s (moderate load - maintains 1P1D)
- **Phase 3**: 25 req/s for 180s (scaling trigger - scales to 2P1D)
- **ISL/OSL**: 4000/150 tokens (optimized for the prefill bottleneck)
- **Transition delay**: 30s between phases
- **Total test duration**: ~7 minutes + scaling observation
- **Smart cleanup**: only removes the deployment if the test created it (preserves existing deployments)

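For a concrete picture of the graduated load above, here is a minimal Python sketch of a load driver against the frontend. The URL, payload shape, and prompt sizing are assumptions for illustration; the actual test drives load with genai-perf via `run_scaling_test.sh`.

```python
# Minimal sketch of the graduated load phases; the endpoint URL, payload, and
# prompt sizing are assumptions for illustration -- the real test generates
# load with genai-perf rather than this script.
import json
import time
import urllib.request

FRONTEND_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical
PHASES = [(8, 90), (15, 120), (25, 180)]  # (requests per second, duration in s)
TRANSITION_DELAY_S = 30


def send_request() -> None:
    payload = {
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "messages": [{"role": "user", "content": "tell me a story. " * 250}],  # long prompt to stress prefill
        "max_tokens": 150,
    }
    req = urllib.request.Request(
        FRONTEND_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=120).read()


for rate, duration in PHASES:
    end = time.time() + duration
    while time.time() < end:
        start = time.time()
        send_request()  # a real driver would issue these requests concurrently
        time.sleep(max(0.0, 1.0 / rate - (time.time() - start)))
    time.sleep(TRANSITION_DELAY_S)
```
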
### Prerequisites for E2E Tests

- Kubernetes cluster with GPU nodes
- kubectl configured and accessible
- genai-perf available in PATH
- Python dependencies installed

For detailed configuration, troubleshooting, and architecture information, see [README_scaling_tests.md](README_scaling_tests.md).
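
As an illustration of the kind of pod-count check the E2E flow performs after the scaling trigger, here is a minimal sketch using kubectl; the namespace and the name-matching substring are assumptions, not the script's actual selectors.

```python
# Minimal sketch of counting prefill worker pods with kubectl; the namespace
# and the "prefillworker" substring are assumptions for illustration, not the
# selectors used by run_scaling_test.sh.
import subprocess

NAMESPACE = "default"  # hypothetical namespace

out = subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE, "-o", "name"],
    capture_output=True, text=True, check=True,
).stdout

prefill_pods = [line for line in out.splitlines() if "prefillworker" in line.lower()]
print(f"prefill worker pods: {len(prefill_pods)}")
```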

tests/planner/conftest.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""
Local conftest.py for planner tests to disable automatic test logging.
This overrides the autouse logger fixture from the parent conftest.py.
"""

import pytest


@pytest.fixture(autouse=True)
def logger(request):
    """Dummy logger fixture that does nothing - overrides the parent one."""
    yield
```

tests/planner/disagg_planner.yaml

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg-planner
  annotations:
    nvidia.com/enable-grove: "false" # temporarily disable grove because current k8s connector does not work with grove
spec:
  envs:
    - name: DYNAMO_SERVICE_CONFIG
      value: '{"Prometheus":{"global":{"scrape_interval":"5s"},"scrape_configs":[{"job_name":"prometheus","static_configs":[{"targets":["localhost:9090"]}]},{"job_name":"frontend","static_configs":[{"targets":["vllm-disagg-planner-frontend:8000"]}]}]}}'
    - name: DYNAMO_NAMESPACE
      value: "vllm-disagg-planner"
    - name: PROMETHEUS_PORT
      value: "8000"
  services:
    Frontend:
      dynamoNamespace: vllm-disagg-planner
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          args:
            - "python3 -m dynamo.frontend --http-port 8000"
    Planner:
      dynamoNamespace: vllm-disagg-planner
      envFromSecret: hf-token-secret
      componentType: planner
      replicas: 1
      livenessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        initialDelaySeconds: 60
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          workingDir: /workspace/components/planner/src/dynamo/planner
          ports:
            - name: metrics
              containerPort: 9085
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m planner_sla
              --environment=kubernetes
              --backend=vllm
              --adjustment-interval=60
              --profile-results-dir=/workspace/tests/planner/profiling_results/H200_TP1P_TP1D
              --prometheus-port=9085
              --ttft=0.1
              --itl=0.01
              --load-predictor=constant
    Prometheus: # NOTE: this is set on Prometheus to ensure a service is created for the Prometheus component. This is a workaround and should be managed differently.
      dynamoNamespace: vllm-disagg-planner
      componentType: frontend
      replicas: 1
      envs:
        - name: PYTHONPATH
          value: "/workspace/components/planner/src"
      livenessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      readinessProbe:
        exec:
          command:
            - /bin/sh
            - -c
            - "exit 0"
        initialDelaySeconds: 30
        periodSeconds: 60
        timeoutSeconds: 30
        failureThreshold: 10
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - "python3 -m dynamo.planner.prometheus"
    VllmDecodeWorker:
      dynamoNamespace: vllm-disagg-planner
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 30
            failureThreshold: 60
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - "python3 -m dynamo.vllm --model nvidia/Llama-3.1-8B-Instruct-FP8 --migration-limit=3 --max-model-len=8192"
    VllmPrefillWorker:
      dynamoNamespace: vllm-disagg-planner
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 30
            failureThreshold: 60
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model nvidia/Llama-3.1-8B-Instruct-FP8 --is-prefill-worker --migration-limit=3 --max-model-len=8192
```
