
Conversation

hhzhang16 (Contributor) commented Jul 26, 2025

Overview:

This PR provides a complete Kubernetes deployment solution for the SLA-based planner system, enabling automatic worker scaling based on real-time metrics collected via Prometheus. It refactors the original SLA planner approach into a K8s-native architecture that supports streaming request telemetry and dynamic resource adjustments.

Details:

  • Removed the local connector dependency, focusing the architecture on Kubernetes-native deployments
  • Added Prometheus integration for metrics collection with service discovery
  • Updated the planner core logic to use the DYNAMO_NAMESPACE environment variable and K8s deployment info instead of hardcoded endpoints (see the sketch below)
  • Added scaling logic to handle no-traffic cases and avoid adjustments when no requests have been made
  • Added a disaggregated planner deployment configuration file (disagg_planner.yaml) that deploys the vLLM disagg example with Prometheus and the planner
  • Added a K8s deployment guide for the SLA planner
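
For orientation, here is a minimal sketch of the environment-driven configuration these bullets describe. The function name is hypothetical, the defaults match those discussed in the review of `defaults.py` below, and the endpoint template is an assumption about the in-cluster Prometheus service name:

```python
import os

# Illustrative sketch only: mirrors the behavior described in this PR, not
# the exact code in components/planner/src/dynamo/planner/defaults.py.
def resolve_planner_env() -> tuple[str, str]:
    # Defaults taken from this PR's review thread.
    namespace = os.environ.get("DYNAMO_NAMESPACE", "vllm-disagg-planner")
    port = os.environ.get("DYNAMO_PORT", "8000")
    # Assumed in-cluster service DNS name for the Prometheus component.
    prometheus_endpoint = f"http://{namespace}-prometheus:{port}"
    return namespace, prometheus_endpoint
```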

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Introduced a new deployment manifest and documentation for SLA-based vLLM disaggregated planner with automatic scaling on Kubernetes.
    • Added new components for SLA Planner and Prometheus in the deployment, with updated resource allocations and configuration options.
  • Improvements

    • Enhanced configuration flexibility by supporting dynamic namespaces and Prometheus endpoints via environment variables.
    • Improved hostname/IP handling for multi-node deployments and made backend naming consistent.
    • Refined scaling logic and metric validation for more robust automatic scaling behavior.
  • Bug Fixes

    • Improved error handling in Prometheus metric queries to handle missing data gracefully.
  • Documentation

    • Added and updated deployment guides, architecture overviews, and usage instructions for the SLA Planner and profiling.
    • Updated quick start and configuration instructions, emphasizing required metrics and deployment steps.
  • Refactor

    • Refactored planner and Prometheus components to use asynchronous worker functions for better lifecycle management.
    • Removed support for local (bare metal) deployment and deprecated components.
    • Consolidated and simplified backend and component naming conventions.
  • Tests

    • Expanded and reorganized test coverage for Kubernetes deployment readiness and connector logic, improving reliability and clarity.

hhzhang16 and others added 30 commits July 11, 2025 09:28
… of github.com:ai-dynamo/dynamo into hannahz/dep-216-create-deploy-crds-for-vllm_v1-example
coderabbitai bot (Contributor) commented Jul 26, 2025

Walkthrough

This update removes all support for local deployment and the LocalConnector from the planner, refactors the SLA planner and Prometheus components to async worker functions, unifies backend naming from "vllm_v1" to "vllm", updates deployment manifests and documentation, and improves configuration, scaling, and testing for Kubernetes-based SLA planner deployments.

Changes

| File(s) | Change Summary |
| --- | --- |
| `benchmarks/profiler/README.md`, `components/planner/README.md` | README files replaced with single-line references to external documentation. |
| `benchmarks/profiler/profile_sla.py`, `benchmarks/profiler/utils/config.py` | Backend name changed from "vllm_v1" to "vllm" in CLI arguments and config; help strings and config keys updated accordingly. |
| `benchmarks/profiler/utils/dynamo_deployment.py` | Improved type annotations for attributes and imports; added type ignore for aiofiles. |
| `components/backends/vllm/README.md`, `docs/architecture/pre_deployment_profiling.md`, `docs/architecture/sla_planner.md` | Documentation updated: new deployment instructions, clarifications, and references to external guides; added manual config update step for profiling. |
| `components/backends/vllm/deploy/disagg_planner.yaml` | Major update: added Planner and Prometheus services, increased resource limits/replicas, updated images, and added environment variable configuration. |
| `components/backends/vllm/src/dynamo/vllm/args.py` | Namespace and host IP handling refactored for dynamic env/config; new helper for robust IP resolution; function signatures updated. |
| `components/planner/src/dynamo/planner/__init__.py` | Removed LocalConnector from exports and imports. |
| `components/planner/src/dynamo/planner/defaults.py` | Default environment set to Kubernetes; backend unified to "vllm"; Prometheus endpoint dynamically derived; backend component name classes consolidated. |
| `components/planner/src/dynamo/planner/kube.py` | KubernetesAPI refactored: improved config loading, namespace handling, readiness checks, and deployment retrieval logic; new async readiness method; wait time increased. |
| `components/planner/src/dynamo/planner/kubernetes_connector.py` | Constructor updated for separate Dynamo/K8s namespaces; added async set_component_replicas method; CLI argument parsing updated; logging added. |
| `components/planner/src/dynamo/planner/local_connector.py` | Entire file deleted; LocalConnector class and all local deployment logic removed. |
| `components/planner/src/dynamo/planner/planner_sla.py` | Refactored from class-based service to async worker function; CLI parsing and endpoint serving updated; old class removed. |
| `components/planner/src/dynamo/planner/prometheus.py` | Refactored from class-based service to async worker function; subprocess management and cancellation now async; main entrypoint added. |
| `components/planner/src/dynamo/planner/utils/planner_core.py` | LocalConnector support removed; always uses default namespace; scaling logic simplified; added metric NaN and correction factor guards. |
| `components/planner/src/dynamo/planner/utils/prometheus.py` | PrometheusAPIClient methods now return 0 if no data is available instead of erroring; improved robustness. |
| `components/planner/test/kube.py` | Tests refactored for new KubernetesAPI logic; improved coverage; new fixtures and readiness tests; internal method mocking. |
| `components/planner/test/kubernetes_connector.py` | Namespace argument in test updated to match new attribute; aligns with refactored connector. |
| `docs/guides/dynamo_deploy/sla_planner_deployment.md` | New deployment guide for SLA-based vLLM planner on Kubernetes added, with architecture, step-by-step instructions, metrics, and troubleshooting. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Prometheus
    participant Planner
    participant Kubernetes API
    participant Workers

    User->>Frontend: Send streaming request
    Frontend->>Prometheus: Expose /metrics endpoint
    Prometheus->>Planner: Provide metrics (TTFT, ITL, ISL, OSL)
    Planner->>Kubernetes API: Adjust worker replicas (Prefill/Decode)
    Planner->>Workers: Scale up/down via K8s
    User-->>Frontend: Receive streamed response
```
```mermaid
sequenceDiagram
    participant CLI
    participant Profiler
    participant Config
    participant User

    CLI->>Profiler: Run profile_sla.py --backend vllm
    Profiler->>Config: Use "vllm" backend config
    Profiler->>Profiler: Run profiling, find best TP sizes
    Profiler->>User: Instruct to update config with best TP sizes
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90 minutes

Possibly related PRs

  • feat: deploy SLA profiler to k8s #2030: Updates benchmarks/profiler/profile_sla.py for Kubernetes deployment, changing backend naming and CLI arguments—directly overlaps with backend and profiling updates in this PR.
  • feat: SLA-based Planner #1420: Introduces the initial SLA-based planner feature and related classes; this PR further refactors and evolves the SLA planner, affecting many of the same files.
  • feat: simplify k8s deployment #1708: Removes the DynamoComponent resource and image builder, simplifying Kubernetes deployment and CRD usage, which is also a focus of this PR's deployment and planner refactoring.

Poem

🐇
Goodbye to local, Kubernetes takes the stage,
Planner and Prometheus now dance in async wage.
"vllm" unifies the backend's name,
Scaling with metrics, no more the same.
Docs refreshed, deployments anew—
The cloud is our meadow, and scaling we do!
🌱✨


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

🔭 Outside diff range comments (1)
benchmarks/profiler/README.md (1)

157-167: Remove duplicate section.

Lines 157-167 appear to be an exact duplicate of lines 146-156, both explaining how to change the DYN_LOG level with identical YAML examples.

Apply this diff to remove the duplicate section:

````diff
-To change `DYN_LOG` level, edit the yaml file by adding
-
-```yaml
-...
-spec:
-  envs:
-    - name: DYN_LOG
-      value: "debug" # or other log levels
-  ...
-```
````
🧹 Nitpick comments (9)
benchmarks/profiler/README.md (1)

1-1: Consider the file structure and naming consistency.

The file starts with a simple reference to external documentation but then contains extensive inline documentation. This creates confusion about whether this is a pointer file or comprehensive documentation.

Consider either:

  1. Making this a true pointer file with just the reference, or
  2. Updating the opening line to reflect that this file contains the actual documentation
docs/architecture/sla_planner.md (1)

11-11: Clear deprecation notice and migration guidance.

The updated warning effectively communicates the deprecation of bare metal deployment and directs users to Kubernetes deployment.

Optional: Consider hyphenating "bare metal" as "bare-metal" when used as a compound adjective modifying "deployment".

docs/guides/dynamo_deploy/sla_planner_deployment.md (1)

1-117: Excellent comprehensive deployment guide.

This new deployment guide provides thorough, well-structured documentation covering all aspects of SLA planner deployment from prerequisites through troubleshooting. The step-by-step approach, architecture overview, and practical examples make this highly valuable for users.

Minor improvement needed: The fenced code block at lines 46-52 should specify a language for better rendering:

````diff
-```
+```text
vllm-disagg-planner-frontend-*            1/1 Running
vllm-disagg-planner-prometheus-*          1/1 Running
vllm-disagg-planner-planner-*             1/1 Running
vllm-disagg-planner-backend-*             1/1 Running
vllm-disagg-planner-prefill-*             1/1 Running
````
components/planner/src/dynamo/planner/prometheus.py (1)

43-46: Consider cleaning up temporary configuration file.

The temporary file created with delete=False is never explicitly deleted. Consider adding cleanup in the finally block or using a context manager.

```diff
-    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".yml", delete=False)
-    yaml.dump(config, temp_file)
-    temp_file.close()
-    config_path = temp_file.name
+    import os
+    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".yml", delete=False)
+    config_path = temp_file.name
+    try:
+        yaml.dump(config, temp_file)
+        temp_file.close()
```

And add cleanup after the process loop:

```diff
     except asyncio.CancelledError:
         logger.info("Shutting down Prometheus...")
         process.terminate()
         process.wait()
+        os.unlink(config_path)
         raise
+    finally:
+        if os.path.exists(config_path):
+            os.unlink(config_path)
```
components/planner/src/dynamo/planner/planner_sla.py (2)

28-30: Address the TODO: Remove hardcoded startup delay

The 30-second delay is a workaround that should be replaced with proper readiness checks. Consider implementing a retry mechanism that waits for dependent components to be ready instead of using a fixed delay.

Would you like me to generate a solution that implements proper component readiness checks or open an issue to track this technical debt?
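
For reference, one shape such a readiness-based wait could take (a sketch, assuming a hypothetical `is_ready` coroutine; in this PR it could be backed by `KubernetesAPI.is_deployment_ready`):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def wait_until_ready(is_ready, max_attempts: int = 30, delay_seconds: float = 2.0) -> None:
    # Poll a readiness probe instead of sleeping a fixed 30 seconds.
    for attempt in range(1, max_attempts + 1):
        if await is_ready():
            return
        logger.info("Not ready (attempt %d/%d); retrying in %.1fs", attempt, max_attempts, delay_seconds)
        await asyncio.sleep(delay_seconds)
    raise TimeoutError("dependent components did not become ready in time")
```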


46-52: Document the purpose of the mock endpoint

The mock endpoint appears to be a placeholder. Consider adding a more descriptive comment explaining why this endpoint exists and whether it's intended to be replaced with actual functionality.

```diff
     async def generate(request: RequestType):
-        """Dummy endpoint to satisfy that each component has an endpoint"""
+        """Mock endpoint required by the Dynamo runtime framework.
+
+        Each component must expose at least one endpoint. This placeholder
+        endpoint satisfies that requirement while the actual planning logic
+        is handled by the start_sla_planner function.
+        """
         yield "mock endpoint"
```
components/planner/src/dynamo/planner/defaults.py (1)

57-59: Validate environment variable values

The port and namespace values from environment variables are used without validation. Invalid values could cause runtime issues.

Add basic validation:

```diff
 class SLAPlannerDefaults(BasePlannerDefaults):
     port = os.environ.get("DYNAMO_PORT", "8000")
+    if not port.isdigit() or not (1 <= int(port) <= 65535):
+        raise ValueError(f"Invalid DYNAMO_PORT value: {port}. Must be a valid port number (1-65535).")
+
     namespace = os.environ.get("DYNAMO_NAMESPACE", "vllm-disagg-planner")
+    if not namespace or not namespace.replace("-", "").replace("_", "").isalnum():
+        raise ValueError(f"Invalid DYNAMO_NAMESPACE value: {namespace}. Must be a valid Kubernetes namespace name.")
+
     prometheus_endpoint = _get_default_prometheus_endpoint(port, namespace)
```
components/planner/src/dynamo/planner/kube.py (1)

142-156: Consider returning False instead of raising exception for missing deployment

The is_deployment_ready method raises an exception when the deployment is not found, which is inconsistent with its boolean return type. This could cause unexpected errors for callers expecting a simple True/False response.

```diff
 async def is_deployment_ready(self, graph_deployment_name: str) -> bool:
     """Check if a graph deployment is ready"""

     graph_deployment = self._get_graph_deployment_from_name(graph_deployment_name)

     if not graph_deployment:
-        raise ValueError(f"Graph deployment {graph_deployment_name} not found")
+        # Deployment not found means it's not ready
+        return False

     conditions = graph_deployment.get("status", {}).get("conditions", [])
     ready_condition = next(
         (c for c in conditions if c.get("type") == "Ready"), None
     )

     return ready_condition is not None and ready_condition.get("status") == "True"
```

Alternatively, if you want to distinguish between "not found" and "not ready", consider creating a separate method like deployment_exists().

components/planner/test/kube.py (1)

148-150: Consider extracting test timeout constants

The test timeout values (max_attempts=2, delay_seconds=0.1) are repeated across multiple tests. Consider extracting these as constants for better maintainability.

```diff
+# Test constants for faster execution
+TEST_MAX_ATTEMPTS = 2
+TEST_DELAY_SECONDS = 0.1
+
 @pytest.mark.asyncio
 async def test_wait_for_graph_deployment_ready_success(k8s_api, mock_custom_api):
     """Test wait_for_graph_deployment_ready when deployment becomes ready"""
     # ... existing code ...
         await k8s_api.wait_for_graph_deployment_ready(
-            "test-deployment", max_attempts=2, delay_seconds=0.1
+            "test-deployment", max_attempts=TEST_MAX_ATTEMPTS, delay_seconds=TEST_DELAY_SECONDS
         )
```

Also applies to: 176-177, 189-191

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 222245e and 539ff3e.

📒 Files selected for processing (22)
  • benchmarks/profiler/README.md (1 hunks)
  • benchmarks/profiler/profile_sla.py (1 hunks)
  • benchmarks/profiler/utils/config.py (7 hunks)
  • benchmarks/profiler/utils/dynamo_deployment.py (2 hunks)
  • components/backends/vllm/README.md (3 hunks)
  • components/backends/vllm/deploy/disagg_planner.yaml (4 hunks)
  • components/backends/vllm/src/dynamo/vllm/args.py (5 hunks)
  • components/planner/README.md (1 hunks)
  • components/planner/src/dynamo/planner/__init__.py (0 hunks)
  • components/planner/src/dynamo/planner/defaults.py (3 hunks)
  • components/planner/src/dynamo/planner/kube.py (4 hunks)
  • components/planner/src/dynamo/planner/kubernetes_connector.py (4 hunks)
  • components/planner/src/dynamo/planner/local_connector.py (0 hunks)
  • components/planner/src/dynamo/planner/planner_sla.py (2 hunks)
  • components/planner/src/dynamo/planner/prometheus.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/planner_core.py (5 hunks)
  • components/planner/src/dynamo/planner/utils/prometheus.py (2 hunks)
  • components/planner/test/kube.py (4 hunks)
  • components/planner/test/kubernetes_connector.py (1 hunks)
  • docs/architecture/pre_deployment_profiling.md (1 hunks)
  • docs/architecture/sla_planner.md (2 hunks)
  • docs/guides/dynamo_deploy/sla_planner_deployment.md (1 hunks)
💤 Files with no reviewable changes (2)
  • components/planner/src/dynamo/planner/__init__.py
  • components/planner/src/dynamo/planner/local_connector.py
🧰 Additional context used
🧠 Learnings (9)
components/planner/README.md (1)

Learnt from: biswapanda
PR: #1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.

components/backends/vllm/README.md (1)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

docs/guides/dynamo_deploy/sla_planner_deployment.md (2)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Learnt from: nnshah1
PR: #2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

benchmarks/profiler/utils/config.py (2)

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/configs/agg_tp_1_dp_8.yaml:31-38
Timestamp: 2025-07-01T15:33:53.262Z
Learning: In fault tolerance test configurations, the resources section under ServiceArgs specifies resources per individual worker, not total resources for all workers. So workers: 8 with gpu: '1' means 8 workers × 1 GPU each = 8 GPUs total.

Learnt from: krishung5
PR: #1388
File: examples/multimodal/utils/model.py:47-53
Timestamp: 2025-06-09T17:52:06.761Z
Learning: Different Vision-Language Models (VLMs) have different hidden dimensions, so using a single hardcoded fallback value of 4096 in get_vision_embeddings_size() is problematic and can cause tensor shape mismatches. The current code uses getattr(config, "hidden_size", 4096) where 4096 is a fallback default based on LLaVA models, but other VLMs like Qwen2-VL may have different dimensions.

components/planner/src/dynamo/planner/utils/planner_core.py (1)

Learnt from: biswapanda
PR: #1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.

components/planner/src/dynamo/planner/planner_sla.py (1)

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.

components/planner/src/dynamo/planner/prometheus.py (2)

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.

Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.

components/backends/vllm/src/dynamo/vllm/args.py (2)

Learnt from: biswapanda
PR: #1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.

Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.

components/backends/vllm/deploy/disagg_planner.yaml (3)

Learnt from: julienmancuso
PR: #1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The DYN_DEPLOYMENT_CONFIG environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Learnt from: nnshah1
PR: #2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/configs/agg_tp_1_dp_8.yaml:31-38
Timestamp: 2025-07-01T15:33:53.262Z
Learning: In fault tolerance test configurations, the resources section under ServiceArgs specifies resources per individual worker, not total resources for all workers. So workers: 8 with gpu: '1' means 8 workers × 1 GPU each = 8 GPUs total.

🪛 markdownlint-cli2 (0.17.2)
docs/guides/dynamo_deploy/sla_planner_deployment.md

46-46: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 LanguageTool
docs/architecture/sla_planner.md

[uncategorized] ~11-~11: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ts disaggregated setup. > [!WARNING] > Bare metal deployment with local connector is depr...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (31)
components/planner/README.md (1)

18-18: LGTM! Documentation consolidation aligns with PR objectives.

The simplification from detailed inline documentation to an external reference is consistent with the PR's focus on Kubernetes-native deployment and documentation consolidation.

components/planner/test/kubernetes_connector.py (1)

66-66: Namespace attribute consistency verified in KubernetesConnector.

All occurrences of the old namespace attribute have been replaced with dynamo_namespace throughout the codebase, including the CLI parser, connector implementation, helper methods, and tests. No references to kubernetes_connector.namespace remain.

docs/architecture/pre_deployment_profiling.md (1)

32-32: Excellent addition! Clear instruction for applying profiling results.

The added instruction to update the config file with the optimal TP sizes obtained from profiling is essential for users to properly configure their deployments. This bridges the gap between profiling results and actual configuration.

components/backends/vllm/README.md (2)

115-115: Excellent addition of SLA Planner documentation references.

The addition of disagg_planner.yaml and links to SLA Planner documentation provides users with clear guidance on the new deployment option and comprehensive documentation.


128-129: Good addition of pre-deployment profiling prerequisite.

Adding the pre-deployment profiling prerequisite with clear documentation reference helps users understand the complete workflow for SLA Planner deployment.

benchmarks/profiler/profile_sla.py (1)

592-594: LGTM! Backend naming unification looks good.

The changes correctly update the backend naming from "vllm_v1" to "vllm", which aligns with the broader effort to unify backend naming conventions across the codebase.

benchmarks/profiler/utils/config.py (3)

83-83: Consistent metadata naming update.

The metadata name change from "vllm-v1-agg" to "vllm-agg" is consistent with the backend naming unification.


92-180: LGTM! Worker component naming updates are consistent.

All references to WORKER_COMPONENT_NAMES["vllm_v1"] have been correctly updated to WORKER_COMPONENT_NAMES["vllm"], maintaining functionality while aligning with the unified backend naming convention.


235-235: CONFIG_MODIFIERS key update aligns with naming changes.

The key change from "vllm_v1" to "vllm" in the CONFIG_MODIFIERS dictionary is consistent with the backend naming unification.

docs/architecture/sla_planner.md (2)

109-115: Excellent deployment guidance with clear quick start.

The addition of the detailed deployment guide reference and quick start command provides users with clear, actionable instructions for deploying the SLA planner.


117-118: Important frontend metrics requirement clearly communicated.

The note about frontend metrics requirements is well-placed and informative, helping users understand the system dependencies.

components/planner/src/dynamo/planner/utils/prometheus.py (5)

32-38: Excellent defensive programming for empty query results.

The addition of empty result checking prevents potential errors when no requests have been made yet, returning 0 as a sensible default for inter-token latency metrics.


45-51: Consistent empty result handling for TTFT metrics.

The same defensive pattern is correctly applied to time-to-first-token metrics, maintaining consistency across the API.


58-64: Robust handling for request duration metrics.

The empty result check for request duration follows the established pattern, improving the overall reliability of the Prometheus client.


85-91: Appropriate default handling for input sequence metrics.

The defensive programming pattern is consistently applied to input sequence token metrics, ensuring graceful handling of scenarios with no available data.


98-104: Complete defensive pattern for output sequence metrics.

The final method follows the same robust pattern, completing the consistent empty result handling across all Prometheus query methods.
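
For reference, all five methods follow a guard of roughly this shape (a sketch assuming the standard Prometheus instant-query result format; the helper name is illustrative):

```python
def _first_value_or_zero(query_result: list) -> float:
    # Prometheus instant queries return a list of series; an empty list means
    # no matching samples yet (e.g., no requests made), so default to 0.
    if not query_result:
        return 0.0
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "<str>"]}.
    return float(query_result[0]["value"][1])
```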

benchmarks/profiler/utils/dynamo_deployment.py (1)

20-20: Type annotation improvements look good.

The changes improve type precision by using explicit typing imports and adding appropriate type ignore comments for untyped modules.

Also applies to: 22-22, 65-65, 67-67

components/backends/vllm/src/dynamo/vllm/args.py (3)

82-88: Dynamic namespace configuration looks good.

The use of DYNAMO_NAMESPACE environment variable with a default fallback provides good flexibility for Kubernetes deployments.


135-142: IP resolution with fallback is well implemented.

Good error handling for hostname resolution failures. Using IP addresses instead of hostnames in ETCD keys will be more reliable in Kubernetes environments where DNS might not be immediately available.


254-282: Robust IP address resolution implementation.

The get_host_ip() function has excellent error handling with multiple fallback mechanisms:

  1. Attempts hostname resolution
  2. Tests if the resolved IP can be bound
  3. Falls back to localhost with clear warnings

This will prevent issues in various deployment scenarios.
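
Sketched out, the described fallback chain looks roughly like this (illustrative only; the actual helper lives in `components/backends/vllm/src/dynamo/vllm/args.py`):

```python
import logging
import socket

logger = logging.getLogger(__name__)

def get_host_ip() -> str:
    try:
        ip = socket.gethostbyname(socket.gethostname())  # 1. resolve hostname
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((ip, 0))  # 2. confirm the resolved address is bindable locally
        return ip
    except OSError as exc:
        # 3. last-resort fallback with a clear warning
        logger.warning("Host IP resolution failed (%s); falling back to localhost", exc)
        return "127.0.0.1"
```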

components/planner/src/dynamo/planner/kubernetes_connector.py (2)

28-30: Good separation of namespace concerns.

Separating Dynamo namespace from Kubernetes namespace provides better flexibility for multi-namespace deployments.


92-99: Good addition of readiness check before scaling.

Checking deployment readiness before applying scaling changes prevents operations on unstable deployments. The warning log is helpful for debugging.

components/planner/src/dynamo/planner/prometheus.py (2)

30-38: Clean async worker implementation.

Good refactoring from class-based to async worker pattern using the @dynamo_worker decorator. This aligns well with the overall architecture changes.


64-74: Good process lifecycle management.

The subprocess handling with proper cancellation and termination is well implemented. The continuous monitoring of process health is a good practice.
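
The pattern is roughly the following (a sketch; the binary name and flag are illustrative, not the PR's exact invocation):

```python
import asyncio

async def run_prometheus(config_path: str) -> None:
    # Launch Prometheus as a child process under asyncio's supervision.
    process = await asyncio.create_subprocess_exec(
        "prometheus", f"--config.file={config_path}"
    )
    try:
        await process.wait()  # monitor the child until it exits
    except asyncio.CancelledError:
        process.terminate()   # ask Prometheus to shut down
        await process.wait()  # reap the child before propagating cancellation
        raise
```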

components/planner/src/dynamo/planner/utils/planner_core.py (4)

55-55: Good centralization of configuration.

Using SLAPlannerDefaults for namespace and Prometheus endpoint configuration aligns with the environment-driven approach.

Also applies to: 63-65


171-180: Excellent handling of no-traffic scenarios.

The NaN checks prevent invalid scaling decisions when there's no traffic. This is a critical improvement for production stability.


239-246: Good defensive programming for correction factor.

The division by zero protection with fallback to default ITL value prevents crashes and ensures the planner continues operating even with incomplete metrics.
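
Together, the two guards amount to something like this sketch (names and the fallback value are illustrative):

```python
import math

DEFAULT_ITL = 0.05  # hypothetical fallback inter-token latency, in seconds

def guarded_metrics(ttft: float, itl: float) -> tuple[float, float] | None:
    # NaN means no traffic was observed in the window: skip this interval.
    if math.isnan(ttft) or math.isnan(itl):
        return None
    # A zero ITL would make the correction factor divide by zero, so
    # substitute the default instead of crashing.
    if itl == 0:
        itl = DEFAULT_ITL
    return ttft, itl
```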


294-298: Efficient batch scaling implementation.

Using set_component_replicas for batch updates is more efficient than individual component operations and reduces the chance of partial updates.

components/planner/src/dynamo/planner/kube.py (1)

157-163: Verify the extended timeout is appropriate

The max_attempts has been increased from 60 to 180, resulting in a 30-minute default timeout. This seems excessive for most deployment scenarios.

Please verify that 30 minutes is the intended timeout. Consider:

  • Making this configurable via environment variable
  • Using exponential backoff instead of fixed intervals
  • Documenting why such a long timeout is necessary

Would you like me to generate an implementation with configurable timeout and exponential backoff?
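
For illustration, a configurable timeout with exponential backoff could look like this (a sketch; `PLANNER_READY_TIMEOUT` is an assumed variable name, not one this PR defines, and `check` stands in for the readiness probe):

```python
import asyncio
import os

async def wait_ready_with_backoff(check) -> None:
    # Timeout configurable via environment; default matches the 30-minute
    # window implied by 180 attempts at fixed intervals.
    timeout = float(os.environ.get("PLANNER_READY_TIMEOUT", "1800"))
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    delay = 2.0
    while loop.time() < deadline:
        if await check():
            return
        await asyncio.sleep(delay)
        delay = min(delay * 2, 60.0)  # exponential backoff, capped at 60s
    raise TimeoutError("graph deployment did not become ready within the timeout")
```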

components/planner/test/kube.py (2)

42-58: Well-structured test additions for namespace support

The tests properly cover the new namespace initialization functionality with both custom and default namespace scenarios.


77-129: Comprehensive test coverage for is_deployment_ready

Excellent test coverage for the new is_deployment_ready method, including ready, not ready, and not found scenarios. The tests properly mock the internal method and validate all edge cases.

@tedzhouhk tedzhouhk merged commit d23d48b into main Jul 28, 2025
12 checks passed
@tedzhouhk tedzhouhk deleted the hannahz/dep-253-deploy-sla-planner-to-k8s branch July 28, 2025 20:09