test: add tests for replica calculation and planner scaling #2525
Conversation
Force-pushed from 7d3ebf9 to ccf304a
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
can we put the files related to this test in a subfolder?
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Walkthrough

Adds planner test infrastructure and docs: a Kubernetes DynamoGraphDeployment manifest for a disaggregated vLLM planner, an end-to-end scaling test (script and Python), a unit test suite for replica calculation, a load generator utility, local test fixtures, a .gitignore, and a minor docs formatting fix.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as User
    participant Sh as run_scaling_test.sh
    participant K8s as Kubernetes API
    participant Svc as Frontend Service
    participant E2E as test_scaling_e2e.py
    participant LG as LoadGenerator
    Dev->>Sh: Invoke script (namespace, YAML, save-results)
    Sh->>K8s: Check/apply DynamoGraphDeployment
    Sh->>K8s: Wait for Ready (deployment, pods)
    Sh->>Svc: Port-forward :8000 -> :8000
    Sh->>Svc: /health poll until OK
    Sh->>E2E: Run E2E test with namespace/base URL
    par Monitor
        E2E->>K8s: Start pod-count monitor (prefill/decode)
    and Load
        E2E->>LG: Run phased load
        LG->>Svc: Send requests (genai-perf)
    end
    E2E->>E2E: Analyze counts (1P1D→2P1D), validate
    E2E-->>Sh: Exit code and results
    Sh->>K8s: Optional cleanup (delete deployment)
    Sh-->>Dev: Summary and return code
    note over E2E,K8s: Scaling observed via pod counts timeline
```
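The diagram's "Analyze counts (1P1D→2P1D), validate" step can be illustrated with a small sketch. This is not the actual logic in `test_scaling_e2e.py`; it is a hypothetical pure function over a pod-count timeline, assuming each sample is a dict with `prefill`/`decode` counts as the monitor described above might record:

```python
from typing import Dict, List, Tuple


def detect_scale_up(timeline: List[Dict[str, int]]) -> Tuple[bool, int]:
    """Scan a pod-count timeline and report whether prefill scaled 1P -> 2P.

    Returns (scaled, index of the first sample with >= 2 prefill pods while
    at least one decode pod is running), or (False, -1) if never observed.
    """
    for i, sample in enumerate(timeline):
        if sample.get("prefill", 0) >= 2 and sample.get("decode", 0) >= 1:
            return True, i
    return False, -1


# Hypothetical timeline mirroring the three load phases:
samples = [
    {"prefill": 1, "decode": 1},  # baseline 1P1D
    {"prefill": 1, "decode": 1},  # moderate load, still 1P1D
    {"prefill": 2, "decode": 1},  # planner scaled to 2P1D
]
scaled, at = detect_scale_up(samples)
```

Keeping the analysis a pure function over recorded samples makes it unit-testable without a cluster.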
```mermaid
sequenceDiagram
    autonumber
    participant Client as genai-perf
    participant FE as Frontend
    participant Planner as SLA Planner
    participant Pref as VllmPrefillWorker
    participant Dec as VllmDecodeWorker
    participant Prom as Prometheus
    Client->>FE: Generate requests (RPS phases)
    FE->>Prom: Expose /metrics (ISL, OSL, TTFT, ITL, reqs)
    Planner->>Prom: Scrape metrics
    Planner->>Planner: Compute desired replicas
    Planner->>K8s: Set replicas Pref/Dec
    K8s-->>Pref: Scale prefill pods
    K8s-->>Dec: Scale decode pods
    Client-->>FE: Observe capacity change over phases
    note over Planner: Adjustment interval (e.g., 60s)
```
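The "Compute desired replicas" step above is implemented in `planner_core.py` (not reproduced here). As a hedged sketch of the general idea, a ceil-based sizing rule divides offered load by per-replica capacity; the function name and the assumption that capacity is a single requests-per-second number are mine, not the planner's actual interface:

```python
import math


def desired_replicas(
    request_rate: float, per_replica_throughput: float, min_replicas: int = 1
) -> int:
    """Ceil-divide offered load by per-replica capacity, floored at a minimum.

    per_replica_throughput is the requests/s one worker sustains within SLA
    (a hypothetical stand-in for the interpolated value the real planner
    derives from profiling data).
    """
    if per_replica_throughput <= 0:
        raise ValueError("per_replica_throughput must be positive")
    return max(min_replicas, math.ceil(request_rate / per_replica_throughput))
```

Under this rule, a phase-3 load of 25 req/s against a worker sustaining 15 req/s yields 2 replicas, matching the 1P1D→2P1D transition the test expects.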
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 20
🧹 Nitpick comments (14)
tests/planner/.gitignore (1)
1-11: Broaden ignores to cover pytest/artifacts and local results.

Prevents accidental commits from local runs.

```diff
 # E2E test results - don't commit test artifacts to git
 e2e_scaling_results/

 # Temporary files
 *.tmp
 *.log

 # Python cache
 __pycache__/
 *.pyc
 *.pyo
+
+# Pytest artifacts
+.pytest_cache/
+.coverage
+.coverage.*
+
+# macOS cruft
+.DS_Store
+
+# Local load-generator result dirs
+scaling_test_*/
```

tests/planner/README.md (2)
169-173: Use a heading instead of emphasis (MD036).

```diff
-**E2E Test Deployment Management:**
+### E2E Test Deployment Management
```
183-184: Correct total duration estimate.

90 + 120 + 180 + 2×30 = 450s ≈ 7.5 minutes.

```diff
-- **Total test duration**: ~7 minutes + scaling observation
+- **Total test duration**: ~7.5 minutes + scaling observation
```

tests/planner/utils/load_generator.py (4)
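The corrected arithmetic can be checked directly: three phase durations plus two 30-second transition delays between them.

```python
# Phase durations from the README (seconds), with a 30s delay between phases.
phase_durations = [90, 120, 180]
transition_delay = 30

total_seconds = sum(phase_durations) + 2 * transition_delay
total_minutes = total_seconds / 60  # 450s -> 7.5 minutes
```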
21-25: Avoid configuring logging at import time.

Move basicConfig to the CLI entrypoint so it does not override repo/test logging.

```diff
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
```

Add in main() before using logger:

```python
# configure logging only for CLI
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
```
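Put together, the CLI-only configuration pattern looks like the following minimal sketch (the `main` body here is a placeholder, not the load generator's actual entrypoint):

```python
import logging

# Module level: acquiring a logger has no global side effects, so importing
# this module (e.g. from tests) never reconfigures the host's logging.
logger = logging.getLogger(__name__)


def main() -> None:
    # Configure the root logger only when run as a CLI.
    logging.basicConfig(
        level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
    )
    logger.info("load generator starting")


if __name__ == "__main__":
    main()
```

Note that `logging.basicConfig` is a no-op if the root logger already has handlers, which is another reason to keep it out of import paths.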
44-58: Docstring lists nonexistent parameters.

Remove stray args and clarify return type.

```diff
 def _calculate_genai_perf_params(
     self,
     req_per_sec: float,
 ) -> Dict[str, Any]:
     """
-    Calculate genai-perf parameters to approximate desired request rate.
-
-    Args:
-        req_per_sec: Desired requests per second
-        duration_sec: Test duration in seconds
-        estimated_request_duration: Estimated average request duration in seconds
-
-    Returns:
-        Dictionary with concurrency and request_rate parameters
-    """
+    Calculate genai-perf parameters to approximate the desired request rate.
+
+    Args:
+        req_per_sec: Desired requests per second.
+
+    Returns:
+        Dict with keys: {"concurrency": int, "request_rate": float}.
+    """
```
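For illustration, a standalone version of such a helper might size concurrency from the target rate via Little's law (concurrency ≈ arrival rate × average request duration). This is a sketch under that assumption; the hypothetical `est_request_duration_s` default is mine, and the real `_calculate_genai_perf_params` in `load_generator.py` may compute its parameters differently:

```python
import math
from typing import Any, Dict


def calculate_genai_perf_params(
    req_per_sec: float, est_request_duration_s: float = 2.0
) -> Dict[str, Any]:
    """Approximate genai-perf settings for a target request rate.

    Little's law: in-flight requests ~= arrival_rate * avg_request_duration,
    so concurrency must be at least that to sustain the requested rate.
    """
    concurrency = max(1, math.ceil(req_per_sec * est_request_duration_s))
    return {"concurrency": concurrency, "request_rate": float(req_per_sec)}
```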
246-268: Docstring and phase description mismatch with e2e expectations.

The docstring says 5/10/18 rps; the code uses 8/15/25. Aligning on 8/15/25 is fine; just update the text.

```diff
-    Uses a conservative graduated approach:
-    - Phase 1: 5 req/s (baseline, should work)
-    - Phase 2: 10 req/s (moderate load)
-    - Phase 3: 18 req/s (should trigger prefill scaling to 2P1D)
+    Uses a conservative graduated approach:
+    - Phase 1: 8 req/s (baseline)
+    - Phase 2: 15 req/s (moderate load)
+    - Phase 3: 25 req/s (should trigger prefill scaling to 2P1D)
```
266-312: Interface mismatch with e2e runner.

tests/planner/test_scaling_e2e.py expects two phases (12/24 rps) and keys "phase1"/"phase2", while this generator runs three phases (8/15/25) and returns "phase_results". Either adapt the e2e test to consume "phase_results" or expose a configurable phase list here with defaults matching the e2e expectations.
Would you like me to add a --phases JSON CLI arg and plumb it through so both 12/24 and 8/15/25 scenarios are supported?
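Such a flag could look like the following sketch, assuming phases are passed as a JSON list of `[req_per_sec, duration_s]` pairs; the flag name, helper names, and defaults are illustrative, not the project's actual CLI:

```python
import argparse
import json
from typing import List, Tuple


def parse_phases(spec: str) -> List[Tuple[float, int]]:
    """Parse a JSON phase list like '[[8, 90], [15, 120], [25, 180]]'
    into (req_per_sec, duration_s) tuples."""
    return [(float(rps), int(dur)) for rps, dur in json.loads(spec)]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="load generator (sketch)")
    parser.add_argument(
        "--phases",
        type=parse_phases,
        # argparse does not re-apply `type` to non-string defaults,
        # so pre-parse the default here.
        default=parse_phases("[[8, 90], [15, 120], [25, 180]]"),
        help="JSON list of [req_per_sec, duration_s] pairs",
    )
    return parser


# The e2e runner's 12/24 scenario would then be expressible as:
args = build_parser().parse_args(["--phases", "[[12, 90], [24, 180]]"])
```

This keeps both the 8/15/25 default and the 12/24 e2e scenario usable from one code path.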
tests/planner/conftest.py (1)
12-15: Consider documenting the test isolation pattern.

While disabling the parent logger fixture is a valid approach for planner tests, consider explaining why logging is disabled and which parent fixture is being overridden, to help future maintainers understand this test isolation pattern.

```diff
 @pytest.fixture(autouse=True)
 def logger(request):
-    """Dummy logger fixture that does nothing - overrides the parent one."""
+    """
+    Dummy logger fixture that does nothing - overrides the parent autouse logger fixture.
+
+    This prevents automatic test logging from the parent conftest.py, allowing planner
+    tests to run with cleaner output and without interference from the parent fixture's
+    logging configuration.
+    """
     yield
```

tests/planner/test_replica_calculation.py (3)
139-141: Avoid importing asyncio inside test methods.

Importing asyncio within test methods is unconventional and reduces readability. Move the import to the top of the file:

```diff
+import asyncio
 import argparse
 import math
 import os

 # ... in the test method:
-        import asyncio
         asyncio.run(planner.make_adjustments())
```
150-151: Remove debug print statements in tests.

Debug print statements should be replaced with proper assertions or logging.

```diff
-        print(f"Expected prefill replicas: {expected_prefill_replicas}")
-        print(f"Calculated prefill replicas: {calculated_prefill_replicas}")
+        # Assert with a descriptive message
+        assert (
+            max(expected_prefill_replicas, planner.args.min_endpoint)
+            == calculated_prefill_replicas
+        ), f"Expected {max(expected_prefill_replicas, planner.args.min_endpoint)}, got {calculated_prefill_replicas}"
```
223-229: Test parametrization could be more descriptive.

The parametrized test cases would benefit from using `pytest.param` with IDs for better test output.

```diff
 @pytest.mark.parametrize(
     "num_req,decode_thpt,expected_p,expected_d",
     [
-        (10, 10000, 1, 1),  # low_load_10_req_per_second
-        (500, 1000, 1, 2),  # high_load_500_req_per_second (lower decode throughput)
+        pytest.param(10, 10000, 1, 1, id="low_load_10_req_per_second"),
+        pytest.param(500, 1000, 1, 2, id="high_load_500_req_per_second"),
     ],
 )
```

tests/planner/run_scaling_test.sh (1)
182-182: Remove unused loop variable.

The variable `i` in the loop is never used.

```diff
-    for i in {1..30}; do
+    for _ in {1..30}; do
```

tests/planner/disagg_planner.yaml (2)
13-13: Large inline configuration may be hard to maintain.

The DYNAMO_SERVICE_CONFIG environment variable contains a large inline JSON configuration that is difficult to read and maintain. Consider using a ConfigMap for better maintainability:

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dynamo-service-config
data:
  config.json: |
    {
      "Prometheus": {
        "global": { "scrape_interval": "5s" },
        "scrape_configs": [
          {
            "job_name": "prometheus",
            "static_configs": [{ "targets": ["localhost:9090"] }]
          },
          {
            "job_name": "frontend",
            "static_configs": [{ "targets": ["vllm-disagg-planner-frontend:8000"] }]
          }
        ]
      }
    }
```

Then reference it in the deployment.
26-27: Frontend container command uses deprecated format.

Using `args` with a single shell-style command string is not the recommended approach.

```diff
 extraPodSpec:
   mainContainer:
     image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-301.6
-    args:
-      - "python3 -m dynamo.frontend --http-port 8000"
+    command:
+      - python3
+    args:
+      - -m
+      - dynamo.frontend
+      - --http-port
+      - "8000"
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)

- docs/architecture/sla_planner.md (1 hunks)
- tests/planner/.gitignore (1 hunks)
- tests/planner/README.md (2 hunks)
- tests/planner/conftest.py (1 hunks)
- tests/planner/disagg_planner.yaml (1 hunks)
- tests/planner/run_scaling_test.sh (1 hunks)
- tests/planner/test_replica_calculation.py (1 hunks)
- tests/planner/test_scaling_e2e.py (1 hunks)
- tests/planner/utils/__init__.py (1 hunks)
- tests/planner/utils/load_generator.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)

tests/planner/run_scaling_test.sh (2)
- tests/planner/test_scaling_e2e.py (1): main (445-507)
- tests/planner/utils/load_generator.py (1): main (335-398)

tests/planner/utils/load_generator.py (2)
- tests/planner/test_scaling_e2e.py (2): run_scaling_test (267-342), main (445-507)
- tests/planner/run_scaling_test.sh (1): main (235-302)

tests/planner/test_scaling_e2e.py (1)
- tests/planner/utils/load_generator.py (3): LoadGenerator (27-332), run_scaling_test (246-332), main (335-398)

tests/planner/test_replica_calculation.py (3)
- components/planner/src/dynamo/planner/utils/planner_core.py (4): Metrics (31-52), Planner (55-525), get_workers_info (129-178), make_adjustments (331-388)
- components/planner/src/dynamo/planner/utils/perf_interpolation.py (3): find_best_throughput_per_gpu (147-168), interpolate_ttft (48-50), interpolate_itl (137-139)
- components/planner/src/dynamo/planner/kubernetes_connector.py (1): set_component_replicas (116-139)
🪛 Shellcheck (0.10.0)

tests/planner/run_scaling_test.sh
- [warning] 101-101: Declare and assign separately to avoid masking return values. (SC2155)
- [warning] 166-166: Quote this to prevent word splitting. (SC2046)
- [warning] 182-182: i appears unused. Verify use (or export if used externally). (SC2034)
🪛 LanguageTool

docs/architecture/sla_planner.md
- [grammar] ~118 (QB_NEW_EN): "...rts metrics at /metrics HTTP endpoint with number of requests, ISL, OSL, TTFT, ITL..."
- [grammar] ~118 (QB_NEW_EN): "...nd provides these metrics automatically."

tests/planner/README.md
- [grammar] ~145 (QB_NEW_EN): "...## Quick Start #### Run Unit Tests Only Test the replica calculation logic witho..."
- [grammar] ~152 (QB_NEW_EN): "...py -v ``` #### Run Full End-to-End Test Test complete scaling behavior including..."
- [grammar] ~169 (QB_NEW_EN): "...s ``` E2E Test Deployment Management: - If no deployment exists: creates, tests,..."
- [grammar] ~170 (QB_NEW_EN): "...ment exists: creates, tests, and cleans up deployment - If deployment exists: uses..."
- [grammar] ~178 (QB_NEW_EN): "...eq/s for 90s (baseline - maintains 1P1D) - Phase 2: 15 req/s for 120s (moderate l..."
- [grammar] ~179 (QB_NEW_EN): "...or 120s (moderate load - maintains 1P1D) - Phase 3: 25 req/s for 180s (scaling tr..."
- [grammar] ~180 (QB_NEW_EN): "...180s (scaling trigger - scales to 2P1D) - ISL/OSL: 4000/150 tokens (optimized fo..."
- [grammar] ~181 (QB_NEW_EN): "...okens (optimized for prefill bottleneck) - Transition delay: 30s between phases -..."
- [grammar] ~182 (QB_NEW_EN): "...Transition delay: 30s between phases - Total test duration: ~7 minutes + scal..."
- [grammar] ~183 (QB_NEW_EN): "...tion**: ~7 minutes + scaling observation - Smart cleanup: Only removes deployment..."
🪛 markdownlint-cli2 (0.17.2)

tests/planner/README.md
- 174-174: Emphasis used instead of a heading (MD036, no-emphasis-as-heading)
🔇 Additional comments (2)
tests/planner/utils/__init__.py (1)
1-2: LGTM: package marker and SPDX headers.

tests/planner/utils/load_generator.py (1)

117-119: Maintain host:port only for `--url`. genai-perf's `--url` flag expects a host:port (no `http://`); the existing `.replace("http://", "")` aligns with official examples (e.g. `--url localhost:8001`). (docs.nvidia.com) Likely an incorrect or invalid review comment.
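A more robust way to get the host:port form than a bare string replace is to parse the URL. This helper is a hypothetical sketch, not code from the PR; it assumes a default port of 8000 to match the frontend's port-forward:

```python
from urllib.parse import urlparse


def to_host_port(base_url: str, default_port: int = 8000) -> str:
    """Normalize a base URL to the host:port form genai-perf's --url expects."""
    # urlparse only fills netloc when a scheme separator is present,
    # so prefix bare host:port strings with "//".
    parsed = urlparse(base_url if "//" in base_url else f"//{base_url}")
    host = parsed.hostname or "localhost"
    port = parsed.port or default_port
    return f"{host}:{port}"
```

Unlike `.replace("http://", "")`, this also handles `https://` and URLs that omit the port.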
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
dep-301-test-dynamo-planner-scaling-up
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
dep-301-test-dynamo-planner-scaling-up
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: ayushag <ayushag@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: Michael Shin <michaelshin@users.noreply.github.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
Overview:
This MR implements comprehensive testing infrastructure for SLA planner scaling behavior, enabling automated validation of prefill and decode worker scaling under varying load conditions. The test suite includes both unit tests for replica calculation logic and end-to-end Kubernetes deployment tests that validate scaling from 1P1D to 2P1D configurations.
Details:
What it looks like:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Documentation
Tests
Chores