
Conversation

hhzhang16 (Contributor) commented Jul 26, 2025

Overview:

This PR provides a complete Kubernetes deployment solution for the SLA-based planner system, enabling automatic worker scaling based on real-time metrics collected via Prometheus. It refactors the original SLA planner approach into a K8s-native architecture that supports streaming request telemetry and dynamic resource adjustments.

Details:

  • Removed the local connector dependency, focusing the architecture on Kubernetes-native deployments
  • Added Prometheus integration for metrics collection with service discovery
  • Updated the planner core logic to use the DYNAMO_NAMESPACE environment variable and K8s deployment info instead of hardcoded endpoints (see the sketch below)
  • Added scaling logic to handle no-traffic cases and avoid adjustments when no requests have been made
  • Added a disaggregated planner deployment configuration file (disagg_planner.yaml) that deploys the vLLM disagg example with Prometheus and the planner
  • Added a K8s deployment guide for the SLA planner
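
For orientation, here is a minimal sketch of the environment-driven configuration these bullets describe. The function name is hypothetical, the defaults match those discussed in the review of `defaults.py` below, and the endpoint template is an assumption about the in-cluster Prometheus service name:

```python
import os

# Illustrative sketch only: mirrors the behavior described in this PR, not
# the exact code in components/planner/src/dynamo/planner/defaults.py.
def resolve_planner_env() -> tuple[str, str]:
    # Defaults taken from this PR's review thread.
    namespace = os.environ.get("DYNAMO_NAMESPACE", "vllm-disagg-planner")
    port = os.environ.get("DYNAMO_PORT", "8000")
    # Assumed in-cluster service DNS name for the Prometheus component.
    prometheus_endpoint = f"http://{namespace}-prometheus:{port}"
    return namespace, prometheus_endpoint
```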

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Introduced a new deployment manifest and documentation for SLA-based vLLM disaggregated planner with automatic scaling on Kubernetes.
    • Added new components for SLA Planner and Prometheus in the deployment, with updated resource allocations and configuration options.
  • Improvements

    • Enhanced configuration flexibility by supporting dynamic namespaces and Prometheus endpoints via environment variables.
    • Improved hostname/IP handling for multi-node deployments and made backend naming consistent.
    • Refined scaling logic and metric validation for more robust automatic scaling behavior.
  • Bug Fixes

    • Improved error handling in Prometheus metric queries to handle missing data gracefully.
  • Documentation

    • Added and updated deployment guides, architecture overviews, and usage instructions for the SLA Planner and profiling.
    • Updated quick start and configuration instructions, emphasizing required metrics and deployment steps.
  • Refactor

    • Refactored planner and Prometheus components to use asynchronous worker functions for better lifecycle management.
    • Removed support for local (bare metal) deployment and deprecated components.
    • Consolidated and simplified backend and component naming conventions.
  • Tests

    • Expanded and reorganized test coverage for Kubernetes deployment readiness and connector logic, improving reliability and clarity.

hhzhang16 and others added 30 commits July 11, 2025 09:28
… of github.com:ai-dynamo/dynamo into hannahz/dep-216-create-deploy-crds-for-vllm_v1-example
coderabbitai bot (Contributor) commented Jul 26, 2025

Walkthrough

This update removes all support for local deployment and the LocalConnector from the planner, refactors the SLA planner and Prometheus components to async worker functions, unifies backend naming from "vllm_v1" to "vllm", updates deployment manifests and documentation, and improves configuration, scaling, and testing for Kubernetes-based SLA planner deployments.

Changes

| File(s) | Change Summary |
| --- | --- |
| `benchmarks/profiler/README.md`, `components/planner/README.md` | README files replaced with single-line references to external documentation. |
| `benchmarks/profiler/profile_sla.py`, `benchmarks/profiler/utils/config.py` | Backend name changed from "vllm_v1" to "vllm" in CLI arguments and config; help strings and config keys updated accordingly. |
| `benchmarks/profiler/utils/dynamo_deployment.py` | Improved type annotations for attributes and imports; added type ignore for aiofiles. |
| `components/backends/vllm/README.md`, `docs/architecture/pre_deployment_profiling.md`, `docs/architecture/sla_planner.md` | Documentation updated: new deployment instructions, clarifications, and references to external guides; added manual config update step for profiling. |
| `components/backends/vllm/deploy/disagg_planner.yaml` | Major update: added Planner and Prometheus services, increased resource limits/replicas, updated images, and added environment variable configuration. |
| `components/backends/vllm/src/dynamo/vllm/args.py` | Namespace and host IP handling refactored for dynamic env/config; new helper for robust IP resolution; function signatures updated. |
| `components/planner/src/dynamo/planner/__init__.py` | Removed LocalConnector from exports and imports. |
| `components/planner/src/dynamo/planner/defaults.py` | Default environment set to Kubernetes; backend unified to "vllm"; Prometheus endpoint dynamically derived; backend component name classes consolidated. |
| `components/planner/src/dynamo/planner/kube.py` | KubernetesAPI refactored: improved config loading, namespace handling, readiness checks, and deployment retrieval logic; new async readiness method; wait time increased. |
| `components/planner/src/dynamo/planner/kubernetes_connector.py` | Constructor updated for separate Dynamo/K8s namespaces; added async set_component_replicas method; CLI argument parsing updated; logging added. |
| `components/planner/src/dynamo/planner/local_connector.py` | Entire file deleted; LocalConnector class and all local deployment logic removed. |
| `components/planner/src/dynamo/planner/planner_sla.py` | Refactored from class-based service to async worker function; CLI parsing and endpoint serving updated; old class removed. |
| `components/planner/src/dynamo/planner/prometheus.py` | Refactored from class-based service to async worker function; subprocess management and cancellation now async; main entrypoint added. |
| `components/planner/src/dynamo/planner/utils/planner_core.py` | LocalConnector support removed; always uses default namespace; scaling logic simplified; added metric NaN and correction factor guards. |
| `components/planner/src/dynamo/planner/utils/prometheus.py` | PrometheusAPIClient methods now return 0 if no data is available instead of erroring; improved robustness. |
| `components/planner/test/kube.py` | Tests refactored for new KubernetesAPI logic; improved coverage; new fixtures and readiness tests; internal method mocking. |
| `components/planner/test/kubernetes_connector.py` | Namespace argument in test updated to match new attribute; aligns with refactored connector. |
| `docs/guides/dynamo_deploy/sla_planner_deployment.md` | New deployment guide for SLA-based vLLM planner on Kubernetes added, with architecture, step-by-step instructions, metrics, and troubleshooting. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Prometheus
    participant Planner
    participant Kubernetes API
    participant Workers

    User->>Frontend: Send streaming request
    Frontend->>Prometheus: Expose /metrics endpoint
    Prometheus->>Planner: Provide metrics (TTFT, ITL, ISL, OSL)
    Planner->>Kubernetes API: Adjust worker replicas (Prefill/Decode)
    Planner->>Workers: Scale up/down via K8s
    User-->>Frontend: Receive streamed response
```
```mermaid
sequenceDiagram
    participant CLI
    participant Profiler
    participant Config
    participant User

    CLI->>Profiler: Run profile_sla.py --backend vllm
    Profiler->>Config: Use "vllm" backend config
    Profiler->>Profiler: Run profiling, find best TP sizes
    Profiler->>User: Instruct to update config with best TP sizes
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90 minutes

Possibly related PRs

  • feat: deploy SLA profiler to k8s #2030: Updates benchmarks/profiler/profile_sla.py for Kubernetes deployment, changing backend naming and CLI arguments—directly overlaps with backend and profiling updates in this PR.
  • feat: SLA-based Planner #1420: Introduces the initial SLA-based planner feature and related classes; this PR further refactors and evolves the SLA planner, affecting many of the same files.
  • feat: simplify k8s deployment #1708: Removes the DynamoComponent resource and image builder, simplifying Kubernetes deployment and CRD usage, which is also a focus of this PR's deployment and planner refactoring.

Poem

🐇
Goodbye to local, Kubernetes takes the stage,
Planner and Prometheus now dance in async wage.
"vllm" unifies the backend's name,
Scaling with metrics, no more the same.
Docs refreshed, deployments anew—
The cloud is our meadow, and scaling we do!
🌱✨


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

🔭 Outside diff range comments (1)
benchmarks/profiler/README.md (1)

157-167: Remove duplicate section.

Lines 157-167 appear to be an exact duplicate of lines 146-156, both explaining how to change the DYN_LOG level with identical YAML examples.

Apply this diff to remove the duplicate section:

````diff
-To change `DYN_LOG` level, edit the yaml file by adding
-
-```yaml
-...
-spec:
-  envs:
-    - name: DYN_LOG
-      value: "debug" # or other log levels
-  ...
-```
````
🧹 Nitpick comments (9)
benchmarks/profiler/README.md (1)

1-1: Consider the file structure and naming consistency.

The file starts with a simple reference to external documentation but then contains extensive inline documentation. This creates confusion about whether this is a pointer file or comprehensive documentation.

Consider either:

  1. Making this a true pointer file with just the reference, or
  2. Updating the opening line to reflect that this file contains the actual documentation
docs/architecture/sla_planner.md (1)

11-11: Clear deprecation notice and migration guidance.

The updated warning effectively communicates the deprecation of bare metal deployment and directs users to Kubernetes deployment.

Optional: Consider hyphenating "bare metal" as "bare-metal" when used as a compound adjective modifying "deployment".

docs/guides/dynamo_deploy/sla_planner_deployment.md (1)

1-117: Excellent comprehensive deployment guide.

This new deployment guide provides thorough, well-structured documentation covering all aspects of SLA planner deployment from prerequisites through troubleshooting. The step-by-step approach, architecture overview, and practical examples make this highly valuable for users.

Minor improvement needed: The fenced code block at lines 46-52 should specify a language for better rendering:

````diff
-```
+```text
vllm-disagg-planner-frontend-*            1/1 Running
vllm-disagg-planner-prometheus-*          1/1 Running
vllm-disagg-planner-planner-*             1/1 Running
vllm-disagg-planner-backend-*             1/1 Running
vllm-disagg-planner-prefill-*             1/1 Running
````
components/planner/src/dynamo/planner/prometheus.py (1)

43-46: Consider cleaning up temporary configuration file.

The temporary file created with delete=False is never explicitly deleted. Consider adding cleanup in the finally block or using a context manager.

```diff
-    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".yml", delete=False)
-    yaml.dump(config, temp_file)
-    temp_file.close()
-    config_path = temp_file.name
+    import os
+    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".yml", delete=False)
+    config_path = temp_file.name
+    try:
+        yaml.dump(config, temp_file)
+        temp_file.close()
```

And add cleanup after the process loop:

```diff
     except asyncio.CancelledError:
         logger.info("Shutting down Prometheus...")
         process.terminate()
         process.wait()
+        os.unlink(config_path)
         raise
+    finally:
+        if os.path.exists(config_path):
+            os.unlink(config_path)
```
components/planner/src/dynamo/planner/planner_sla.py (2)

28-30: Address the TODO: Remove hardcoded startup delay

The 30-second delay is a workaround that should be replaced with proper readiness checks. Consider implementing a retry mechanism that waits for dependent components to be ready instead of using a fixed delay.

Would you like me to generate a solution that implements proper component readiness checks or open an issue to track this technical debt?
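
For reference, one shape such a readiness-based wait could take (a sketch, assuming a hypothetical `is_ready` coroutine; in this PR it could be backed by `KubernetesAPI.is_deployment_ready`):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def wait_until_ready(is_ready, max_attempts: int = 30, delay_seconds: float = 2.0) -> None:
    # Poll a readiness probe instead of sleeping a fixed 30 seconds.
    for attempt in range(1, max_attempts + 1):
        if await is_ready():
            return
        logger.info("Not ready (attempt %d/%d); retrying in %.1fs", attempt, max_attempts, delay_seconds)
        await asyncio.sleep(delay_seconds)
    raise TimeoutError("dependent components did not become ready in time")
```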


46-52: Document the purpose of the mock endpoint

The mock endpoint appears to be a placeholder. Consider adding a more descriptive comment explaining why this endpoint exists and whether it's intended to be replaced with actual functionality.

```diff
     async def generate(request: RequestType):
-        """Dummy endpoint to satisfy that each component has an endpoint"""
+        """Mock endpoint required by the Dynamo runtime framework.
+
+        Each component must expose at least one endpoint. This placeholder
+        endpoint satisfies that requirement while the actual planning logic
+        is handled by the start_sla_planner function.
+        """
         yield "mock endpoint"
```
components/planner/src/dynamo/planner/defaults.py (1)

57-59: Validate environment variable values

The port and namespace values from environment variables are used without validation. Invalid values could cause runtime issues.

Add basic validation:

```diff
 class SLAPlannerDefaults(BasePlannerDefaults):
     port = os.environ.get("DYNAMO_PORT", "8000")
+    if not port.isdigit() or not (1 <= int(port) <= 65535):
+        raise ValueError(f"Invalid DYNAMO_PORT value: {port}. Must be a valid port number (1-65535).")
+
     namespace = os.environ.get("DYNAMO_NAMESPACE", "vllm-disagg-planner")
+    if not namespace or not namespace.replace("-", "").replace("_", "").isalnum():
+        raise ValueError(f"Invalid DYNAMO_NAMESPACE value: {namespace}. Must be a valid Kubernetes namespace name.")
+
     prometheus_endpoint = _get_default_prometheus_endpoint(port, namespace)
```
components/planner/src/dynamo/planner/kube.py (1)

142-156: Consider returning False instead of raising exception for missing deployment

The is_deployment_ready method raises an exception when the deployment is not found, which is inconsistent with its boolean return type. This could cause unexpected errors for callers expecting a simple True/False response.

```diff
 async def is_deployment_ready(self, graph_deployment_name: str) -> bool:
     """Check if a graph deployment is ready"""

     graph_deployment = self._get_graph_deployment_from_name(graph_deployment_name)

     if not graph_deployment:
-        raise ValueError(f"Graph deployment {graph_deployment_name} not found")
+        # Deployment not found means it's not ready
+        return False

     conditions = graph_deployment.get("status", {}).get("conditions", [])
     ready_condition = next(
         (c for c in conditions if c.get("type") == "Ready"), None
     )

     return ready_condition is not None and ready_condition.get("status") == "True"
```

Alternatively, if you want to distinguish between "not found" and "not ready", consider creating a separate method like deployment_exists().

components/planner/test/kube.py (1)

148-150: Consider extracting test timeout constants

The test timeout values (max_attempts=2, delay_seconds=0.1) are repeated across multiple tests. Consider extracting these as constants for better maintainability.

```diff
+# Test constants for faster execution
+TEST_MAX_ATTEMPTS = 2
+TEST_DELAY_SECONDS = 0.1
+
 @pytest.mark.asyncio
 async def test_wait_for_graph_deployment_ready_success(k8s_api, mock_custom_api):
     """Test wait_for_graph_deployment_ready when deployment becomes ready"""
     # ... existing code ...
         await k8s_api.wait_for_graph_deployment_ready(
-            "test-deployment", max_attempts=2, delay_seconds=0.1
+            "test-deployment", max_attempts=TEST_MAX_ATTEMPTS, delay_seconds=TEST_DELAY_SECONDS
         )
```

Also applies to: 176-177, 189-191

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 222245e and 539ff3e.

📒 Files selected for processing (22)
  • benchmarks/profiler/README.md (1 hunks)
  • benchmarks/profiler/profile_sla.py (1 hunks)
  • benchmarks/profiler/utils/config.py (7 hunks)
  • benchmarks/profiler/utils/dynamo_deployment.py (2 hunks)
  • components/backends/vllm/README.md (3 hunks)
  • components/backends/vllm/deploy/disagg_planner.yaml (4 hunks)
  • components/backends/vllm/src/dynamo/vllm/args.py (5 hunks)
  • components/planner/README.md (1 hunks)
  • components/planner/src/dynamo/planner/__init__.py (0 hunks)
  • components/planner/src/dynamo/planner/defaults.py (3 hunks)
  • components/planner/src/dynamo/planner/kube.py (4 hunks)
  • components/planner/src/dynamo/planner/kubernetes_connector.py (4 hunks)
  • components/planner/src/dynamo/planner/local_connector.py (0 hunks)
  • components/planner/src/dynamo/planner/planner_sla.py (2 hunks)
  • components/planner/src/dynamo/planner/prometheus.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/planner_core.py (5 hunks)
  • components/planner/src/dynamo/planner/utils/prometheus.py (2 hunks)
  • components/planner/test/kube.py (4 hunks)
  • components/planner/test/kubernetes_connector.py (1 hunks)
  • docs/architecture/pre_deployment_profiling.md (1 hunks)
  • docs/architecture/sla_planner.md (2 hunks)
  • docs/guides/dynamo_deploy/sla_planner_deployment.md (1 hunks)
💤 Files with no reviewable changes (2)
  • components/planner/src/dynamo/planner/__init__.py
  • components/planner/src/dynamo/planner/local_connector.py
🧰 Additional context used
🧠 Learnings (9)
components/planner/README.md (1)

Learnt from: biswapanda
PR: #1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.

components/backends/vllm/README.md (1)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

docs/guides/dynamo_deploy/sla_planner_deployment.md (2)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Learnt from: nnshah1
PR: #2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

benchmarks/profiler/utils/config.py (2)

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/configs/agg_tp_1_dp_8.yaml:31-38
Timestamp: 2025-07-01T15:33:53.262Z
Learning: In fault tolerance test configurations, the resources section under ServiceArgs specifies resources per individual worker, not total resources for all workers. So workers: 8 with gpu: '1' means 8 workers × 1 GPU each = 8 GPUs total.

Learnt from: krishung5
PR: #1388
File: examples/multimodal/utils/model.py:47-53
Timestamp: 2025-06-09T17:52:06.761Z
Learning: Different Vision-Language Models (VLMs) have different hidden dimensions, so using a single hardcoded fallback value of 4096 in get_vision_embeddings_size() is problematic and can cause tensor shape mismatches. The current code uses getattr(config, "hidden_size", 4096) where 4096 is a fallback default based on LLaVA models, but other VLMs like Qwen2-VL may have different dimensions.

components/planner/src/dynamo/planner/utils/planner_core.py (1)

Learnt from: biswapanda
PR: #1266
File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go:85-85
Timestamp: 2025-05-29T16:29:45.152Z
Learning: In the Dynamo codebase, ComponentTypePlanner constants with different cases ("Planner" vs "planner") are intentional and serve different purposes: component type in config vs component label. These should not be made consistent as they handle different contexts.

components/planner/src/dynamo/planner/planner_sla.py (1)

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.

components/planner/src/dynamo/planner/prometheus.py (2)

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.

Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.

components/backends/vllm/src/dynamo/vllm/args.py (2)

Learnt from: biswapanda
PR: #1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.

Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.

components/backends/vllm/deploy/disagg_planner.yaml (3)

Learnt from: julienmancuso
PR: #1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The DYN_DEPLOYMENT_CONFIG environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Learnt from: nnshah1
PR: #2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/configs/agg_tp_1_dp_8.yaml:31-38
Timestamp: 2025-07-01T15:33:53.262Z
Learning: In fault tolerance test configurations, the resources section under ServiceArgs specifies resources per individual worker, not total resources for all workers. So workers: 8 with gpu: '1' means 8 workers × 1 GPU each = 8 GPUs total.

🪛 markdownlint-cli2 (0.17.2)
docs/guides/dynamo_deploy/sla_planner_deployment.md

46-46: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 LanguageTool
docs/architecture/sla_planner.md

[uncategorized] ~11-~11: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ts disaggregated setup. > [!WARNING] > Bare metal deployment with local connector is depr...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (31)
components/planner/README.md (1)

18-18: LGTM! Documentation consolidation aligns with PR objectives.

The simplification from detailed inline documentation to an external reference is consistent with the PR's focus on Kubernetes-native deployment and documentation consolidation.

components/planner/test/kubernetes_connector.py (1)

66-66: Namespace attribute consistency verified in KubernetesConnector.

All occurrences of the old namespace attribute have been replaced with dynamo_namespace throughout the codebase, including the CLI parser, connector implementation, helper methods, and tests. No references to kubernetes_connector.namespace remain.

docs/architecture/pre_deployment_profiling.md (1)

32-32: Excellent addition! Clear instruction for applying profiling results.

The added instruction to update the config file with the optimal TP sizes obtained from profiling is essential for users to properly configure their deployments. This bridges the gap between profiling results and actual configuration.

components/backends/vllm/README.md (2)

115-115: Excellent addition of SLA Planner documentation references.

The addition of disagg_planner.yaml and links to SLA Planner documentation provides users with clear guidance on the new deployment option and comprehensive documentation.


128-129: Good addition of pre-deployment profiling prerequisite.

Adding the pre-deployment profiling prerequisite with clear documentation reference helps users understand the complete workflow for SLA Planner deployment.

benchmarks/profiler/profile_sla.py (1)

592-594: LGTM! Backend naming unification looks good.

The changes correctly update the backend naming from "vllm_v1" to "vllm", which aligns with the broader effort to unify backend naming conventions across the codebase.

benchmarks/profiler/utils/config.py (3)

83-83: Consistent metadata naming update.

The metadata name change from "vllm-v1-agg" to "vllm-agg" is consistent with the backend naming unification.


92-180: LGTM! Worker component naming updates are consistent.

All references to WORKER_COMPONENT_NAMES["vllm_v1"] have been correctly updated to WORKER_COMPONENT_NAMES["vllm"], maintaining functionality while aligning with the unified backend naming convention.


235-235: CONFIG_MODIFIERS key update aligns with naming changes.

The key change from "vllm_v1" to "vllm" in the CONFIG_MODIFIERS dictionary is consistent with the backend naming unification.

docs/architecture/sla_planner.md (2)

109-115: Excellent deployment guidance with clear quick start.

The addition of the detailed deployment guide reference and quick start command provides users with clear, actionable instructions for deploying the SLA planner.


117-118: Important frontend metrics requirement clearly communicated.

The note about frontend metrics requirements is well-placed and informative, helping users understand the system dependencies.

components/planner/src/dynamo/planner/utils/prometheus.py (5)

32-38: Excellent defensive programming for empty query results.

The addition of empty result checking prevents potential errors when no requests have been made yet, returning 0 as a sensible default for inter-token latency metrics.


45-51: Consistent empty result handling for TTFT metrics.

The same defensive pattern is correctly applied to time-to-first-token metrics, maintaining consistency across the API.


58-64: Robust handling for request duration metrics.

The empty result check for request duration follows the established pattern, improving the overall reliability of the Prometheus client.


85-91: Appropriate default handling for input sequence metrics.

The defensive programming pattern is consistently applied to input sequence token metrics, ensuring graceful handling of scenarios with no available data.


98-104: Complete defensive pattern for output sequence metrics.

The final method follows the same robust pattern, completing the consistent empty result handling across all Prometheus query methods.
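
For reference, all five methods follow a guard of roughly this shape (a sketch assuming the standard Prometheus instant-query result format; the helper name is illustrative):

```python
def _first_value_or_zero(query_result: list) -> float:
    # Prometheus instant queries return a list of series; an empty list means
    # no matching samples yet (e.g., no requests made), so default to 0.
    if not query_result:
        return 0.0
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "<str>"]}.
    return float(query_result[0]["value"][1])
```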

benchmarks/profiler/utils/dynamo_deployment.py (1)

20-20: Type annotation improvements look good.

The changes improve type precision by using explicit typing imports and adding appropriate type ignore comments for untyped modules.

Also applies to: 22-22, 65-65, 67-67

components/backends/vllm/src/dynamo/vllm/args.py (3)

82-88: Dynamic namespace configuration looks good.

The use of DYNAMO_NAMESPACE environment variable with a default fallback provides good flexibility for Kubernetes deployments.


135-142: IP resolution with fallback is well implemented.

Good error handling for hostname resolution failures. Using IP addresses instead of hostnames in ETCD keys will be more reliable in Kubernetes environments where DNS might not be immediately available.


254-282: Robust IP address resolution implementation.

The get_host_ip() function has excellent error handling with multiple fallback mechanisms:

  1. Attempts hostname resolution
  2. Tests if the resolved IP can be bound
  3. Falls back to localhost with clear warnings

This will prevent issues in various deployment scenarios.
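
Sketched out, the described fallback chain looks roughly like this (illustrative only; the actual helper lives in `components/backends/vllm/src/dynamo/vllm/args.py`):

```python
import logging
import socket

logger = logging.getLogger(__name__)

def get_host_ip() -> str:
    try:
        ip = socket.gethostbyname(socket.gethostname())  # 1. resolve hostname
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((ip, 0))  # 2. confirm the resolved address is bindable locally
        return ip
    except OSError as exc:
        # 3. last-resort fallback with a clear warning
        logger.warning("Host IP resolution failed (%s); falling back to localhost", exc)
        return "127.0.0.1"
```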

components/planner/src/dynamo/planner/kubernetes_connector.py (2)

28-30: Good separation of namespace concerns.

Separating Dynamo namespace from Kubernetes namespace provides better flexibility for multi-namespace deployments.


92-99: Good addition of readiness check before scaling.

Checking deployment readiness before applying scaling changes prevents operations on unstable deployments. The warning log is helpful for debugging.

components/planner/src/dynamo/planner/prometheus.py (2)

30-38: Clean async worker implementation.

Good refactoring from class-based to async worker pattern using the @dynamo_worker decorator. This aligns well with the overall architecture changes.


64-74: Good process lifecycle management.

The subprocess handling with proper cancellation and termination is well implemented. The continuous monitoring of process health is a good practice.
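
The pattern is roughly the following (a sketch; the binary name and flag are illustrative, not the PR's exact invocation):

```python
import asyncio

async def run_prometheus(config_path: str) -> None:
    # Launch Prometheus as a child process under asyncio's supervision.
    process = await asyncio.create_subprocess_exec(
        "prometheus", f"--config.file={config_path}"
    )
    try:
        await process.wait()  # monitor the child until it exits
    except asyncio.CancelledError:
        process.terminate()   # ask Prometheus to shut down
        await process.wait()  # reap the child before propagating cancellation
        raise
```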

components/planner/src/dynamo/planner/utils/planner_core.py (4)

55-55: Good centralization of configuration.

Using SLAPlannerDefaults for namespace and Prometheus endpoint configuration aligns with the environment-driven approach.

Also applies to: 63-65


171-180: Excellent handling of no-traffic scenarios.

The NaN checks prevent invalid scaling decisions when there's no traffic. This is a critical improvement for production stability.


239-246: Good defensive programming for correction factor.

The division by zero protection with fallback to default ITL value prevents crashes and ensures the planner continues operating even with incomplete metrics.
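
Together, the two guards amount to something like this sketch (names and the fallback value are illustrative):

```python
import math

DEFAULT_ITL = 0.05  # hypothetical fallback inter-token latency, in seconds

def guarded_metrics(ttft: float, itl: float) -> tuple[float, float] | None:
    # NaN means no traffic was observed in the window: skip this interval.
    if math.isnan(ttft) or math.isnan(itl):
        return None
    # A zero ITL would make the correction factor divide by zero, so
    # substitute the default instead of crashing.
    if itl == 0:
        itl = DEFAULT_ITL
    return ttft, itl
```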


294-298: Efficient batch scaling implementation.

Using set_component_replicas for batch updates is more efficient than individual component operations and reduces the chance of partial updates.

components/planner/src/dynamo/planner/kube.py (1)

157-163: Verify the extended timeout is appropriate

The max_attempts has been increased from 60 to 180, resulting in a 30-minute default timeout. This seems excessive for most deployment scenarios.

Please verify that 30 minutes is the intended timeout. Consider:

  • Making this configurable via environment variable
  • Using exponential backoff instead of fixed intervals
  • Documenting why such a long timeout is necessary

Would you like me to generate an implementation with configurable timeout and exponential backoff?
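
For illustration, a configurable timeout with exponential backoff could look like this (a sketch; `PLANNER_READY_TIMEOUT` is an assumed variable name, not one this PR defines, and `check` stands in for the readiness probe):

```python
import asyncio
import os

async def wait_ready_with_backoff(check) -> None:
    # Timeout configurable via environment; default matches the 30-minute
    # window implied by 180 attempts at fixed intervals.
    timeout = float(os.environ.get("PLANNER_READY_TIMEOUT", "1800"))
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    delay = 2.0
    while loop.time() < deadline:
        if await check():
            return
        await asyncio.sleep(delay)
        delay = min(delay * 2, 60.0)  # exponential backoff, capped at 60s
    raise TimeoutError("graph deployment did not become ready within the timeout")
```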

components/planner/test/kube.py (2)

42-58: Well-structured test additions for namespace support

The tests properly cover the new namespace initialization functionality with both custom and default namespace scenarios.


77-129: Comprehensive test coverage for is_deployment_ready

Excellent test coverage for the new is_deployment_ready method, including ready, not ready, and not found scenarios. The tests properly mock the internal method and validate all edge cases.

@tedzhouhk tedzhouhk merged commit d23d48b into main Jul 28, 2025
12 checks passed
@tedzhouhk tedzhouhk deleted the hannahz/dep-253-deploy-sla-planner-to-k8s branch July 28, 2025 20:09