fix: small planner manifest/doc fixes #3129

tmonty12 · 2025-09-19T00:06:10Z

Overview:

While going through the planner pre-deployment profiling and launching the vLLM disagg planner DGD (components/backends/vllm/deploy/disagg_planner.yaml), I uncovered some small issues and doc inconsistencies. This PR addresses them.

Details:

benchmarks/profiler/deploy/profile_sla_job.yaml - the --config argument is now set from the env var DGD_CONFIG_FILE, since its value depends on the --dest arg passed earlier to inject_manifest.py.
components/backends/vllm/deploy/disagg_planner.yaml - fixes the PVC mount so the planner can consume it. Also sets PROMETHEUS_PORT, because the operator automatically creates a Service for the process running on port 8000, while the default Prometheus port is 9090; without this env var, planner metric scraping hangs. Still thinking about a longer-term solution here...
deploy/utils/setup_benchmarking_resources.sh - no longer creates the PVC access pod, since inject_manifest.py deletes and recreates it anyway (saves ~30s).
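
To illustrate the PROMETHEUS_PORT point, here is a minimal sketch of the fallback described above. It assumes the consumer falls back to 9090 when the variable is unset; the function names and the host argument are hypothetical and are not taken from the planner code.

```shell
# Hypothetical helper: pick the port that would be scraped.
# Assumption (from the description above): 9090 is the default when
# PROMETHEUS_PORT is unset, while the operator-created Service listens on 8000.
resolve_prometheus_port() {
  printf '%s\n' "${PROMETHEUS_PORT:-9090}"
}

# Hypothetical endpoint builder; the host name is a placeholder argument.
prometheus_endpoint() {
  printf 'http://%s:%s\n' "$1" "$(resolve_prometheus_port)"
}
```

With the variable unset, the sketch resolves to 9090, which is the mismatch that setting PROMETHEUS_PORT to 8000 in the manifest is meant to avoid.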

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

New Features
- Config file path is now set via a runtime variable for profiling jobs.
- Planner now exposes a Prometheus port via environment configuration.
- Standardized data mount point for planner services.
Chores
- Deployment setup script skips applying a manifest now managed by another tool, reducing redundancy.
Documentation
- Pre-deployment profiling guide simplified and restructured.
- Consolidated steps to set container image and config path before running profiling.
- Updated examples and adjusted run/wait instructions.

copy-pr-bot · 2025-09-19T00:06:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-09-19T00:15:43Z

Walkthrough

Updates profiler job to use a runtime config path variable, adjusts vLLM planner deployment env and PVC mount, adds a conditional skip for a specific manifest in the benchmarking setup script, and restructures pre-deployment profiling docs to a simplified, single-path configuration workflow.

Changes

Cohort / File(s)	Summary of changes
Profiler job config argument `benchmarks/profiler/deploy/profile_sla_job.yaml`	Replaced fixed --config path (/data/configs/disagg.yaml) with environment-driven ${DGD_CONFIG_FILE}.
vLLM planner deployment tweaks `components/backends/vllm/deploy/disagg_planner.yaml`	Added PROMETHEUS_PORT="8000" env; changed Planner PVC mountPoint from /data/profiling_results to /data.
Benchmark setup control flow `deploy/utils/setup_benchmarking_resources.sh`	In manifest loop, added conditional to skip applying pvc-access-pod.yaml; all other manifests processed as before.
Docs: pre-deployment profiling workflow `docs/benchmarks/pre_deployment_profiling.md`	Simplified flow: removed injection step details, consolidated image and config path setup, updated examples and step numbering.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant SB as setup_benchmarking_resources.sh
  participant K as kubectl

  U->>SB: Run setup script
  loop For each manifest
    SB->>SB: If basename == "pvc-access-pod.yaml"?
    alt Skip specific manifest
      SB-->>U: Log "Skipping pvc-access-pod.yaml"
    else Apply manifest
      SB->>K: kubectl apply -f <manifest> (via envsubst if available)
      K-->>SB: Apply result
    end
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

feat: update benchmarking and deploy utils #2933 — Also touches profiling deployment and PVC access workflow; aligns with mount and manifest handling changes.
chore: bug fixes in pre-deployment sweeping and vllm_v1 planner; expose num_d/p to k8s metrics #2454 — Related Prometheus port configuration for the vLLM planner deployment.
chore: add instructions to modify SLA to profile_sla doc; update component name #2167 — Adjusts profiler job CLI flags; related to this PR’s --config argument change.

Poem

A rabbit taps deploy with cheerful might,
Skips the PVC pod—just right!
The planner hums on port eight-thousand, true,
Configs flow by env—fresh and new.
Docs hop ahead, one simple track—
Profiling carrots in a tidy stack. 🥕

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The provided title "fix: small planner manifest/doc fixes" is concise, follows conventional commit style, and accurately reflects the PR's primary intent of small fixes to the planner manifests and related documentation, so it suitably summarizes the main change for reviewers.
Description Check	✅ Passed	The PR description includes the required Overview and Details sections and explains the file-level changes and rationale clearly, but the "Where should the reviewer start?" section is left empty and the Related Issues field uses a placeholder (#xxx) instead of a real issue reference; otherwise the description is sufficiently complete for review.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (6)

deploy/utils/setup_benchmarking_resources.sh (1)
73-78: Fix trailing whitespace to unblock pre-commit and CI.

The pre-commit hook flagged trailing whitespace in this block.

Apply:
-    # Skip pvc-access-pod.yaml as it's managed by inject_manifest.py
-    if [[ "$(basename "$mf")" == "pvc-access-pod.yaml" ]]; then
-      log "Skipping $mf (managed by inject_manifest.py)"
-      continue
-    fi
-    
+    # Skip pvc-access-pod.yaml as it's managed by inject_manifest.py
+    if [[ "$(basename "$mf")" == "pvc-access-pod.yaml" ]]; then
+      log "Skipping $mf (managed by inject_manifest.py)"
+      continue
+    fi
Optional hardening (not required):
-for mf in "$(dirname "$0")/manifests"/*.yaml; do
+shopt -s nullglob
+for mf in "$(dirname "$0")/manifests"/*.yaml; do
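
As a rough illustration of the skip plus the optional hardening, the loop can be sketched as a function that only lists what would be applied. This is a portable variant that uses an existence check instead of `shopt -s nullglob`; the function name `manifests_to_apply` is hypothetical, and the sketch does not call kubectl or envsubst.

```shell
# List the manifests that the setup loop would apply, one per line.
# $1 is the manifests directory (the real script derives it from "$0").
manifests_to_apply() {
  for mf in "$1"/*.yaml; do
    # Portable stand-in for `shopt -s nullglob`: when nothing matches, the
    # pattern stays literal, so skip entries that do not exist.
    [ -e "$mf" ] || continue
    # Skip the pod that inject_manifest.py deletes and recreates itself.
    if [ "$(basename "$mf")" = "pvc-access-pod.yaml" ]; then
      continue
    fi
    printf '%s\n' "$mf"
  done
}
```

Separating the selection from the apply step keeps the skip rule easy to check without a cluster.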
docs/benchmarks/pre_deployment_profiling.md (5)
141-152: Use headings instead of bold labels (markdownlint MD036).

Convert the bold “Step 3” line to a heading.
-**Step 3: Define the container image and config path**
+### Step 3: Define the container image and config path
153-157: Use headings instead of bold labels (markdownlint MD036).
-**Step 4: Run profiling (required)**
+### Step 4: Run profiling (required)
159-164: Use headings instead of bold labels (markdownlint MD036).
-**Step 5: Wait for profiling to complete**
+### Step 5: Wait for profiling to complete
145-146: Avoid a known-broken example image tag in docs.

The note says 0.4.1 is broken; using it in the example will derail users. Replace it with a known-good tag, or phrase the example as “use a working tag for your cluster.”

If you want, I can update the example once you confirm the current recommended tag.

150-151: Call out required non-empty var before envsubst.

Add an explicit guard so users don’t apply a manifest with an empty --config.
 export DGD_CONFIG_FILE=/data/configs/disagg.yaml # should be the same path you set for --dest in Step 1
+test -n "$DGD_CONFIG_FILE" || { echo "DGD_CONFIG_FILE must be set"; exit 1; }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70a8aa3 and 6f58daa.

📒 Files selected for processing (4)

benchmarks/profiler/deploy/profile_sla_job.yaml (1 hunks)
components/backends/vllm/deploy/disagg_planner.yaml (2 hunks)
deploy/utils/setup_benchmarking_resources.sh (1 hunks)
docs/benchmarks/pre_deployment_profiling.md (1 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.17.2)

docs/benchmarks/pre_deployment_profiling.md

141-141: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

153-153: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

159-159: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3129/merge) by tmonty12.

deploy/utils/setup_benchmarking_resources.sh

[error] 75-75: Trailing whitespace check failed. The pre-commit hook 'trailing-whitespace' modified this file. Please commit the changes and re-run pre-commit.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build and Test - dynamo

🔇 Additional comments (2)

benchmarks/profiler/deploy/profile_sla_job.yaml (1)
32-32: Good switch to env-driven config; add a guard to avoid empty substitution.

Using ${DGD_CONFIG_FILE} is correct. Please ensure Step 4 fails fast if the var is unset so envsubst doesn’t inject an empty value.

Add before applying in docs:
: "${DGD_CONFIG_FILE:?Set DGD_CONFIG_FILE (PVC path) before running envsubst}"
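
A minimal sketch of how this guard behaves, assuming a hypothetical wrapper named `render_config_arg` that prints the --config argument the job would receive. It does not run envsubst or kubectl; it only shows the fail-fast behavior for an unset or empty variable.

```shell
# Hypothetical wrapper: emit the --config argument only when DGD_CONFIG_FILE
# is set and non-empty. `${VAR:?msg}` makes a non-interactive shell exit with
# msg on stderr, so the body runs in a subshell to keep the caller alive.
render_config_arg() (
  : "${DGD_CONFIG_FILE:?Set DGD_CONFIG_FILE (PVC path) before running envsubst}"
  printf '%s %s\n' --config "$DGD_CONFIG_FILE"
)
```

The `:?` form rejects both an unset and an empty value, which is the case that would otherwise substitute an empty --config into the manifest.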
components/backends/vllm/deploy/disagg_planner.yaml (1)

53-53: Broader PVC mount looks fine; confirm paths used elsewhere.

Mounting /data (vs /data/profiling_results) matches planner args pointing to /data/profiling_results and the injector writing under /data. Looks consistent.

Double-check any scripts assuming the previous subpath mount.

components/backends/vllm/deploy/disagg_planner.yaml

Signed-off-by: tmontfort <tmontfort@nvidia.com>

tmonty12 requested review from Aphoh, PeaBrane, alec-flowers, atchernych, biswapanda, hhzhang16, hutm, ishandhanani, jasonqinzhou, julienmancuso, michaelshin, mohammedabdulwahhab, nnshah1 and tedzhouhk as code owners September 19, 2025 00:06

pull-request-size bot added the size/M label Sep 19, 2025

github-actions bot added the fix label Sep 19, 2025

coderabbitai bot reviewed Sep 19, 2025

components/backends/vllm/deploy/disagg_planner.yaml (resolved)

tmonty12 and others added 4 commits September 18, 2025 17:16

avoid applying pvc access manifest

44a4525

Signed-off-by: tmontfort <tmontfort@nvidia.com>

small fixes for pre deployment profiling

d98ebe0

Signed-off-by: tmontfort <tmontfort@nvidia.com>

add prometheus port and fix pvc mount

001444b

Signed-off-by: tmontfort <tmontfort@nvidia.com>

fix lint err

4f8973f

Signed-off-by: tmontfort <tmontfort@nvidia.com>

tmonty12 force-pushed the tmonty12/small-planner-fixes branch from 6f58daa to 4f8973f Compare September 19, 2025 00:17

hhzhang16 reviewed Sep 19, 2025

components/backends/vllm/deploy/disagg_planner.yaml (resolved)

fix other planner yamls

94f6b9b

Signed-off-by: tmontfort <tmontfort@nvidia.com>

copy-pr-bot bot temporarily deployed to GITLAB September 19, 2025 16:50 Inactive

copy-pr-bot bot temporarily deployed to GITLAB September 19, 2025 16:51 Inactive

hhzhang16 approved these changes Sep 19, 2025

Merge branch 'main' into tmonty12/small-planner-fixes

cae97d9

tmonty12 requested review from a team as code owners September 19, 2025 18:56

copy-pr-bot bot temporarily deployed to GITLAB September 19, 2025 18:56 Inactive

copy-pr-bot bot temporarily deployed to GITLAB September 19, 2025 18:59 Inactive

tmonty12 merged commit 7d2fc13 into main Sep 19, 2025
13 of 17 checks passed

tmonty12 deleted the tmonty12/small-planner-fixes branch September 19, 2025 19:46

3 participants