chore: bug fixes in pre-deployment sweeping and vllm_v1 planner; expose num_d/p to k8s metrics #2454
Conversation
Walkthrough

The PR updates profiler import paths and messages, adjusts a Kubernetes utility path, modifies vLLM planner deployment to include Prometheus metrics and new images, adds a Prometheus port default and CLI flag to the planner, integrates Prometheus gauges and async observation in the planner core, and updates related documentation and PodMonitor samples.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Planner
    participant PrefillWorkers
    participant DecodeWorkers
    participant Prometheus as Prometheus Scraper
    User->>Planner: Start with --prometheus-port (0 disables)
    alt port != 0
        Planner->>Planner: start_http_server(port)
    end
    loop periodic
        Planner->>PrefillWorkers: get_workers_info()
        PrefillWorkers-->>Planner: prefill endpoints
        Planner->>DecodeWorkers: get_workers_info()
        DecodeWorkers-->>Planner: decode endpoints
        Planner->>Planner: update Gauges (num_p_workers, num_d_workers)
    end
    Prometheus-->>Planner: scrape /metrics
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 7
🔭 Outside diff range comments (1)
components/planner/src/dynamo/planner/utils/planner_core.py (1)
181-205: Avoid blocking the event loop when querying Prometheus; fetch concurrently

If PrometheusAPIClient uses blocking I/O, these calls will stall the loop. Fetch concurrently via asyncio.to_thread and gather for better responsiveness.
Apply this diff:
```diff
-        self.last_metrics.ttft = self.prometheus_api_client.get_avg_time_to_first_token(
-            f"{self.args.adjustment_interval}s"
-        )
-        self.last_metrics.itl = self.prometheus_api_client.get_avg_inter_token_latency(
-            f"{self.args.adjustment_interval}s"
-        )
-        self.last_metrics.num_req = self.prometheus_api_client.get_avg_request_count(
-            f"{self.args.adjustment_interval}s"
-        )
-        self.last_metrics.request_duration = (
-            self.prometheus_api_client.get_avg_request_duration(
-                f"{self.args.adjustment_interval}s"
-            )
-        )
-        self.last_metrics.isl = (
-            self.prometheus_api_client.get_avg_input_sequence_tokens(
-                f"{self.args.adjustment_interval}s"
-            )
-        )
-        self.last_metrics.osl = (
-            self.prometheus_api_client.get_avg_output_sequence_tokens(
-                f"{self.args.adjustment_interval}s"
-            )
-        )
+        window = f"{self.args.adjustment_interval}s"
+        ttft_f = asyncio.to_thread(self.prometheus_api_client.get_avg_time_to_first_token, window)
+        itl_f = asyncio.to_thread(self.prometheus_api_client.get_avg_inter_token_latency, window)
+        num_req_f = asyncio.to_thread(self.prometheus_api_client.get_avg_request_count, window)
+        req_dur_f = asyncio.to_thread(self.prometheus_api_client.get_avg_request_duration, window)
+        isl_f = asyncio.to_thread(self.prometheus_api_client.get_avg_input_sequence_tokens, window)
+        osl_f = asyncio.to_thread(self.prometheus_api_client.get_avg_output_sequence_tokens, window)
+        (
+            self.last_metrics.ttft,
+            self.last_metrics.itl,
+            self.last_metrics.num_req,
+            self.last_metrics.request_duration,
+            self.last_metrics.isl,
+            self.last_metrics.osl,
+        ) = await asyncio.gather(ttft_f, itl_f, num_req_f, req_dur_f, isl_f, osl_f, return_exceptions=False)
```
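For readers unfamiliar with the pattern, here is a small self-contained sketch of running blocking calls via asyncio.to_thread and awaiting them with gather; the BlockingClient class and metric names are made up for illustration and are not part of the PR:

```python
import asyncio
import time


class BlockingClient:
    """Stand-in for a client whose calls block on network I/O."""

    def get_metric(self, name: str, window: str) -> float:
        time.sleep(0.1)  # simulate a blocking HTTP query
        return 42.0


async def fetch_metrics(client: BlockingClient, window: str) -> dict[str, float]:
    # Run each blocking call in a worker thread and await them together,
    # so the event loop stays free while the queries are in flight.
    names = ["ttft", "itl", "num_req"]
    values = await asyncio.gather(
        *(asyncio.to_thread(client.get_metric, n, window) for n in names)
    )
    return dict(zip(names, values))


if __name__ == "__main__":
    print(asyncio.run(fetch_metrics(BlockingClient(), "60s")))
```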
🧹 Nitpick comments (3)
benchmarks/profiler/inject_disagg_config.py (1)
165-165: Avoid double slash in printed DGD_CONFIG_FILE path

When target_path starts with “/”, this prints “/workspace//profiling_results/...”. Not functionally wrong, but noisy. Remove the extra slash.

```diff
-    print(f"🔧 Set DGD_CONFIG_FILE=/workspace/{args.target_path} in your profiler job")
+    print(f"🔧 Set DGD_CONFIG_FILE=/workspace{args.target_path} in your profiler job")
```
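An alternative to dropping the slash is to normalize the joined path; a hypothetical helper sketch (workspace_path is not a function in this repo):

```python
import posixpath


def workspace_path(target_path: str) -> str:
    # Joining after stripping any leading slash avoids "/workspace//..." output.
    return posixpath.join("/workspace", target_path.lstrip("/"))


assert workspace_path("/profiling_results/cfg.yaml") == "/workspace/profiling_results/cfg.yaml"
assert workspace_path("profiling_results/cfg.yaml") == "/workspace/profiling_results/cfg.yaml"
```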
benchmarks/profiler/utils/kubernetes.py (1)

81-83: Path resolution change looks correct; consider resolving symlinks for robustness

Moving up one directory to reach benchmarks/profiler/deploy is correct. Minor nit: use resolve() to avoid surprises if the file is symlinked.
Apply this diff:
```diff
-    script_dir = Path(__file__).parent.parent
+    script_dir = Path(__file__).resolve().parent.parent
```

Optionally, allow overriding via an env var for flexibility:
```diff
-    pod_yaml_path = script_dir / "deploy" / "pvc-access-pod.yaml"
+    pod_yaml_path = Path(
+        os.environ.get("PVC_ACCESS_POD_YAML", str(script_dir / "deploy" / "pvc-access-pod.yaml"))
+    )
```

components/planner/src/dynamo/planner/utils/planner_core.py (1)
106-121: Prometheus server bootstrap: good guard; consider namespacing metrics

Starting the HTTP server only when port != 0 is correct. Consider namespacing metrics to avoid collisions in multi-process environments and to follow metric naming best practices.
Apply this diff to add namespace/subsystem and consolidate metric naming:
```diff
-        # Initialize Prometheus metrics
-        self.num_p_workers_gauge = Gauge("num_p_workers", "Number of prefill workers")
-        self.num_d_workers_gauge = Gauge("num_d_workers", "Number of decode workers")
+        # Initialize Prometheus metrics
+        # Use a single gauge with a role label for better cardinality control and querying
+        self.num_workers_gauge = Gauge(
+            "dynamo_planner_workers",
+            "Number of engine workers by role",
+            labelnames=("role",),
+        )
```

And update the setters (see observe_metrics diff below).
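Since the observe_metrics diff referenced above is not reproduced here, this is a standalone sketch of how the setters could use the labeled gauge; the metric name follows the suggestion and the helper function and counts are illustrative:

```python
from prometheus_client import Gauge

# Single gauge with a role label, per the suggestion above.
num_workers_gauge = Gauge(
    "dynamo_planner_workers",
    "Number of engine workers by role",
    labelnames=("role",),
)


def record_worker_counts(num_prefill: int, num_decode: int) -> None:
    # One time series per role; dashboards can filter or sum over the label.
    num_workers_gauge.labels(role="prefill").set(num_prefill)
    num_workers_gauge.labels(role="decode").set(num_decode)


record_worker_counts(2, 4)
```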
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)
- benchmarks/profiler/inject_disagg_config.py (1 hunks)
- benchmarks/profiler/profile_endpoint.py (1 hunks)
- benchmarks/profiler/profile_sla.py (1 hunks)
- benchmarks/profiler/utils/kubernetes.py (1 hunks)
- components/backends/vllm/deploy/disagg_planner.yaml (7 hunks)
- components/planner/src/dynamo/planner/defaults.py (1 hunks)
- components/planner/src/dynamo/planner/planner_sla.py (1 hunks)
- components/planner/src/dynamo/planner/utils/planner_core.py (5 hunks)
- docs/architecture/pre_deployment_profiling.md (1 hunks)
- docs/guides/deploy/k8s_metrics.md (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-25T22:34:11.384Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
Applied to files:
components/backends/vllm/deploy/disagg_planner.yaml
🧬 Code Graph Analysis (4)
benchmarks/profiler/profile_sla.py (1)
benchmarks/profiler/utils/profile_decode.py (1)
profile_decode(21-85)
components/planner/src/dynamo/planner/planner_sla.py (1)
components/planner/src/dynamo/planner/defaults.py (1)
SLAPlannerDefaults(64-74)
benchmarks/profiler/profile_endpoint.py (1)
benchmarks/profiler/utils/profile_decode.py (1)
profile_decode(21-85)
components/planner/src/dynamo/planner/utils/planner_core.py (1)
components/planner/src/dynamo/planner/defaults.py (1)
SLAPlannerDefaults(64-74)
🪛 markdownlint-cli2 (0.17.2)
docs/guides/deploy/k8s_metrics.md
139-139: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🔇 Additional comments (10)
benchmarks/profiler/profile_sla.py (1)
38-38: Import path change verified — no remaining old imports

Confirmed the new import path is correct, the module defines profile_decode, and all call sites use the updated import.
- benchmarks/profiler/utils/profile_decode.py — defines def profile_decode(...) (around line 21)
- benchmarks/profiler/profile_sla.py — import at line 38: from utils.profile_decode import profile_decode; call at ~line 476
- benchmarks/profiler/profile_endpoint.py — import at line 8: from utils.profile_decode import profile_decode; call at ~line 89
components/backends/vllm/deploy/disagg_planner.yaml (2)
8-9: Grove disabled annotation acknowledged

Setting nvidia.com/enable-grove: "false" is explicit and clear for this deployment.
50-50: Image tag bumps: confirm provenance and digest pinning policy

All components now use nvcr.io/.../vllm-runtime:hzhou-0814-02. If your org policy prefers immutability, consider digest pinning to avoid tag drift. Otherwise, these bumps are fine.
To help track runtime compatibility, document the image change in the release notes and validate the image is accessible in your cluster registry.
Also applies to: 94-94, 143-143, 193-193, 243-243
components/planner/src/dynamo/planner/planner_sla.py (1)
138-143: No action required — prometheus_port already defaults to 0

components/planner/src/dynamo/planner/defaults.py defines SLAPlannerDefaults.prometheus_port = 0 (around line 38), so the argparse default is safe and no change is needed.

docs/guides/deploy/k8s_metrics.md (3)
96-96: Namespace templating via $NAMESPACE is good

Switching to $NAMESPACE in PodMonitor resources + envsubst in the apply step improves reuse across clusters/namespaces.
Also applies to: 108-108, 118-118, 130-130
187-187: Including -n monitoring in port-forward commands is correct

This avoids relying on default namespace selection and reduces operator error.
Also applies to: 198-198
171-171: Grafana ConfigMap path updated — file present and labeled for auto-discovery

Verified: deploy/metrics/k8s/grafana-dynamo-dashboard-configmap.yaml exists and contains grafana_dashboard: "1" (lines 9–11). No changes required.

components/planner/src/dynamo/planner/utils/planner_core.py (3)
362-362: Awaiting observe_metrics in the loop is correct

Switching observe_metrics to async and awaiting it in the loop keeps sequencing deterministic before make_adjustments runs.
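A minimal sketch of that ordering, with stub bodies standing in for the planner's real observe_metrics and make_adjustments (interval and internals are illustrative, not the planner's actual code):

```python
import asyncio


class PlannerSketch:
    async def observe_metrics(self) -> None:
        # In the real planner this gathers Prometheus metrics and worker counts.
        await asyncio.sleep(0)

    async def make_adjustments(self) -> None:
        # Scaling decisions run only after this cycle's metrics are in.
        await asyncio.sleep(0)

    async def run(self, interval_s: float, cycles: int) -> None:
        for _ in range(cycles):
            await self.observe_metrics()   # awaited first, so ordering is deterministic
            await self.make_adjustments()
            await asyncio.sleep(interval_s)


asyncio.run(PlannerSketch().run(interval_s=0.01, cycles=3))
```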
474-479: Same potential defaults issue here: prometheus_port may be undefined

The main block also uses SLAPlannerDefaults.prometheus_port. Ensure it exists or default to 0.
Apply this diff if needed:
```diff
-    parser.add_argument(
-        "--prometheus-port",
-        type=int,
-        default=SLAPlannerDefaults.prometheus_port,
-        help="Prometheus port for metrics server (0 to disable)",
-    )
+    parser.add_argument(
+        "--prometheus-port",
+        type=int,
+        default=0,
+        help="Prometheus port for metrics server (0 to disable)",
+    )
```

You can verify the attribute with the same script provided for planner_sla.py.
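For context, the flag wires into metric exposure roughly as follows; this is a sketch of the guard described in the walkthrough, not the planner's exact code:

```python
import argparse

from prometheus_client import start_http_server

parser = argparse.ArgumentParser()
parser.add_argument(
    "--prometheus-port",
    type=int,
    default=0,
    help="Prometheus port for metrics server (0 to disable)",
)
args = parser.parse_args()

if args.prometheus_port != 0:
    # Exposes /metrics on the given port; skipped entirely when the port is 0.
    start_http_server(args.prometheus_port)
```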
24-25: Verify runtime image or dependency manifests include "prometheus-client"

The file components/planner/src/dynamo/planner/utils/planner_core.py now imports prometheus_client (Gauge, start_http_server). If the runtime image or dependency manifests don't include the pip package prometheus-client, the import will fail at module import time (before the try/except around start_http_server).
- Location to check:
- components/planner/src/dynamo/planner/utils/planner_core.py — lines ~24–25:
  `from prometheus_client import Gauge, start_http_server`
- What I ran: attempted to find Dockerfile and requirements.* but both were not present in the workspace, so I could not confirm whether the image includes the package.
- Please verify (run in repo root):
- rg -n --hidden -S 'prometheus_client|prometheus-client' || true
- rg -n --hidden -S 'Dockerfile|requirements|pyproject.toml|setup.cfg|Pipfile|environment.yml' || true
- check any image build steps in .github/workflows or infra folders for pip install steps
If the package is missing, add prometheus-client to your dependency manifest or ensure the Dockerfile installs it (or catch ImportError around the import if a non-fatal absence is acceptable).
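If a non-fatal absence were acceptable, the guarded import mentioned above could look like the sketch below; whether to degrade silently this way is a project decision, not something this PR does:

```python
try:
    from prometheus_client import Gauge, start_http_server
    PROMETHEUS_AVAILABLE = True
except ImportError:
    # prometheus-client not installed: disable metrics instead of failing at import time.
    Gauge = None  # type: ignore[assignment]
    start_http_server = None  # type: ignore[assignment]
    PROMETHEUS_AVAILABLE = False
```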
…se num_d/p to k8s metrics (#2454) Signed-off-by: Hannah Zhang <hannahz@nvidia.com>