tmonty12 (Contributor) commented Sep 19, 2025

Overview:

While going through the planner pre-deployment profiling and launching the vLLM disagg planner DGD (components/backends/vllm/deploy/disagg_planner.yaml), I uncovered some small issues and doc inconsistencies. This PR addresses them.

Details:

  • benchmarks/profiler/deploy/profile_sla_job.yaml - the --config argument is now set from the DGD_CONFIG_FILE env var, since its value depends on the --dest argument passed to inject_manifest.py in the previous step.
  • components/backends/vllm/deploy/disagg_planner.yaml - fixes the PVC mount so the planner can consume it. PROMETHEUS_PORT also needs to be specified, because the operator automatically creates a Service for the process running on port 8000; the default Prometheus port is 9090, so planner metric scraping hangs without this env var set. Still thinking about a longer-term solution here...
  • deploy/utils/setup_benchmarking_resources.sh - no longer creates the PVC access pod, since inject_manifest.py deletes and recreates it anyway (saves ~30s).
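The dependency between DGD_CONFIG_FILE and the --dest path can be sketched roughly as follows. This is a minimal illustration, not the repository's job manifest: the one-line manifest fragment and the `render` helper are hypothetical, and the sed fallback exists only so the sketch runs where envsubst (from GNU gettext) is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical manifest fragment; the real profile_sla_job.yaml has more fields.
manifest='args: ["--config", "${DGD_CONFIG_FILE}"]'

# Must be the same path that was passed to inject_manifest.py --dest.
export DGD_CONFIG_FILE=/data/configs/disagg.yaml

# Hypothetical helper: substitute the placeholder on stdin.
render() {
  if command -v envsubst >/dev/null 2>&1; then
    # Restrict substitution to the single variable we expect.
    envsubst '${DGD_CONFIG_FILE}'
  else
    # Fallback for this sketch only: replace the literal placeholder.
    sed "s|\${DGD_CONFIG_FILE}|${DGD_CONFIG_FILE}|g"
  fi
}

rendered=$(printf '%s\n' "$manifest" | render)
echo "$rendered"
```

If the exported value differs from the --dest path, the job would be pointed at a config file that was never written to the PVC.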

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Config file path is now set via a runtime variable for profiling jobs.
    • Planner now exposes a Prometheus port via environment configuration.
    • Standardized data mount point for planner services.
  • Chores

    • Deployment setup script skips applying a manifest now managed by another tool, reducing redundancy.
  • Documentation

    • Pre-deployment profiling guide simplified and restructured.
    • Consolidated steps to set container image and config path before running profiling.
    • Updated examples and adjusted run/wait instructions.

copy-pr-bot bot commented Sep 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Sep 19, 2025

Walkthrough

Updates profiler job to use a runtime config path variable, adjusts vLLM planner deployment env and PVC mount, adds a conditional skip for a specific manifest in the benchmarking setup script, and restructures pre-deployment profiling docs to a simplified, single-path configuration workflow.

Changes

| Cohort / File(s) | Summary of changes |
|---|---|
| Profiler job config argument (benchmarks/profiler/deploy/profile_sla_job.yaml) | Replaced fixed --config path (/data/configs/disagg.yaml) with environment-driven ${DGD_CONFIG_FILE}. |
| vLLM planner deployment tweaks (components/backends/vllm/deploy/disagg_planner.yaml) | Added PROMETHEUS_PORT="8000" env; changed Planner PVC mountPoint from /data/profiling_results to /data. |
| Benchmark setup control flow (deploy/utils/setup_benchmarking_resources.sh) | In manifest loop, added conditional to skip applying pvc-access-pod.yaml; all other manifests processed as before. |
| Docs: pre-deployment profiling workflow (docs/benchmarks/pre_deployment_profiling.md) | Simplified flow: removed injection step details, consolidated image and config path setup, updated examples and step numbering. |
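The effect of the PROMETHEUS_PORT entry can be pictured as an environment lookup with a fallback. The sketch below only illustrates that pattern, using the 9090 default mentioned in the PR description; `resolve_prometheus_port` is a hypothetical helper, not the planner's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: use PROMETHEUS_PORT when set, otherwise fall back to 9090.
resolve_prometheus_port() {
  echo "${PROMETHEUS_PORT:-9090}"
}

unset PROMETHEUS_PORT
echo "without env var: $(resolve_prometheus_port)"

PROMETHEUS_PORT=8000
echo "with env var:    $(resolve_prometheus_port)"
```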

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant SB as setup_benchmarking_resources.sh
  participant K as kubectl

  U->>SB: Run setup script
  loop For each manifest
    SB->>SB: If basename == "pvc-access-pod.yaml"?
    alt Skip specific manifest
      SB-->>U: Log "Skipping pvc-access-pod.yaml"
    else Apply manifest
      SB->>K: kubectl apply -f <manifest> (via envsubst if available)
      K-->>SB: Apply result
    end
  end
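The loop in the diagram can be approximated with a dry run that prints the kubectl command instead of executing it. This is a sketch under assumptions: the temporary directory stands in for the script's manifests folder, and every file name except pvc-access-pod.yaml is made up for the example.

```shell
#!/usr/bin/env bash
# Temporary stand-in for the manifests directory (hypothetical file names).
manifest_dir=$(mktemp -d)
touch "$manifest_dir/pvc.yaml" "$manifest_dir/pvc-access-pod.yaml" "$manifest_dir/rbac.yaml"

applied=()
for mf in "$manifest_dir"/*.yaml; do
  name=$(basename "$mf")
  # Skip the access pod manifest: inject_manifest.py manages it.
  if [[ "$name" == "pvc-access-pod.yaml" ]]; then
    echo "Skipping $name (managed by inject_manifest.py)"
    continue
  fi
  # Dry run: report the command rather than calling kubectl.
  echo "Would run: kubectl apply -f $name"
  applied+=("$name")
done

rm -rf "$manifest_dir"
echo "Manifests that would be applied: ${#applied[@]}"
```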

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit taps deploy with cheerful might,
Skips the PVC pod—just right!
The planner hums on port eight-thousand, true,
Configs flow by env—fresh and new.
Docs hop ahead, one simple track—
Profiling carrots in a tidy stack. 🥕

Pre-merge checks

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The provided title "fix: small planner manifest/doc fixes" is concise, follows conventional commit style, and accurately reflects the PR's primary intent of small fixes to the planner manifests and related documentation, so it suitably summarizes the main change for reviewers. |
| Description Check | ✅ Passed | The PR description includes the required Overview and Details sections and explains the file-level changes and rationale clearly, but the "Where should the reviewer start?" section is left empty and the Related Issues field uses a placeholder (#xxx) instead of a real issue reference; otherwise the description is sufficiently complete for review. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
deploy/utils/setup_benchmarking_resources.sh (1)

73-78: Fix trailing whitespace to unblock pre-commit and CI.

The pre-commit hook flagged trailing whitespace in this block.

Apply:

-    # Skip pvc-access-pod.yaml as it's managed by inject_manifest.py
-    if [[ "$(basename "$mf")" == "pvc-access-pod.yaml" ]]; then
-      log "Skipping $mf (managed by inject_manifest.py)"
-      continue
-    fi
-    
+    # Skip pvc-access-pod.yaml as it's managed by inject_manifest.py
+    if [[ "$(basename "$mf")" == "pvc-access-pod.yaml" ]]; then
+      log "Skipping $mf (managed by inject_manifest.py)"
+      continue
+    fi

Optional hardening (not required):

-for mf in "$(dirname "$0")/manifests"/*.yaml; do
+shopt -s nullglob
+for mf in "$(dirname "$0")/manifests"/*.yaml; do
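
What the nullglob suggestion changes can be seen by looping over a directory with no matching files. The sketch below uses a temporary empty directory and a hypothetical `count_matches` helper; it is an illustration of the bash option, not part of the setup script.

```shell
#!/usr/bin/env bash
empty_dir=$(mktemp -d)

# Hypothetical helper: count how many times the loop body runs.
count_matches() {
  local n=0 mf
  for mf in "$empty_dir"/*.yaml; do
    n=$((n + 1))
  done
  echo "$n"
}

shopt -u nullglob
without=$(count_matches)   # unmatched pattern is kept as a literal word

shopt -s nullglob
with=$(count_matches)      # unmatched pattern expands to nothing

shopt -u nullglob
rmdir "$empty_dir"
echo "iterations without nullglob: $without, with nullglob: $with"
```

Without the option, the loop body runs once with the literal pattern as the file name, so the script would try to apply a file that does not exist.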
docs/benchmarks/pre_deployment_profiling.md (5)

141-152: Use headings instead of bold labels (markdownlint MD036).

Convert the bold “Step 3” line to a heading.

-**Step 3: Define the container image and config path**
+### Step 3: Define the container image and config path

153-157: Use headings instead of bold labels (markdownlint MD036).

-**Step 4: Run profiling (required)**
+### Step 4: Run profiling (required)

159-164: Use headings instead of bold labels (markdownlint MD036).

-**Step 5: Wait for profiling to complete**
+### Step 5: Wait for profiling to complete

145-146: Avoid a known-broken example image tag in docs.

The note says 0.4.1 is broken; using it in the example will derail users. Replace with a known-good tag or phrase as “use a working tag for your cluster.”

If you want, I can update the example once you confirm the current recommended tag.


150-151: Call out required non-empty var before envsubst.

Add an explicit guard so users don’t apply a manifest with an empty --config.

 export DGD_CONFIG_FILE=/data/configs/disagg.yaml # should be the same path you set for --dest in Step 1
+test -n "$DGD_CONFIG_FILE" || { echo "DGD_CONFIG_FILE must be set"; exit 1; }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70a8aa3 and 6f58daa.

📒 Files selected for processing (4)
  • benchmarks/profiler/deploy/profile_sla_job.yaml (1 hunks)
  • components/backends/vllm/deploy/disagg_planner.yaml (2 hunks)
  • deploy/utils/setup_benchmarking_resources.sh (1 hunks)
  • docs/benchmarks/pre_deployment_profiling.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/benchmarks/pre_deployment_profiling.md

141-141: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


153-153: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


159-159: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3129/merge) by tmonty12.
deploy/utils/setup_benchmarking_resources.sh

[error] 75-75: Trailing whitespace check failed. The pre-commit hook 'trailing-whitespace' modified this file. Please commit the changes and re-run pre-commit.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
benchmarks/profiler/deploy/profile_sla_job.yaml (1)

32-32: Good switch to env-driven config; add a guard to avoid empty substitution.

Using ${DGD_CONFIG_FILE} is correct. Please ensure Step 4 fails fast if the var is unset so envsubst doesn’t inject an empty value.

Add before applying in docs:

: "${DGD_CONFIG_FILE:?Set DGD_CONFIG_FILE (PVC path) before running envsubst}"
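
The fail-fast behavior of that expansion can be checked in isolation. In this sketch each check runs in a subshell so the failed expansion does not terminate the calling script; `check_config_var` is a hypothetical wrapper, and the exact non-zero status may vary by shell.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: succeeds only if DGD_CONFIG_FILE is set and non-empty.
check_config_var() {
  ( : "${DGD_CONFIG_FILE:?Set DGD_CONFIG_FILE (PVC path) before running envsubst}" ) 2>/dev/null
}

unset DGD_CONFIG_FILE
if check_config_var; then unset_status=0; else unset_status=$?; fi

DGD_CONFIG_FILE=/data/configs/disagg.yaml
if check_config_var; then set_status=0; else set_status=$?; fi

echo "unset: exit $unset_status, set: exit $set_status"
```

Because the `:?` form includes the colon, an empty value is rejected in the same way as an unset one.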
components/backends/vllm/deploy/disagg_planner.yaml (1)

53-53: Broader PVC mount looks fine; confirm paths used elsewhere.

Mounting /data (vs /data/profiling_results) matches planner args pointing to /data/profiling_results and the injector writing under /data. Looks consistent.

Double-check any scripts assuming the previous subpath mount.
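
One way to reason about the mount change is to compute where a PVC-relative path shows up inside the container. The sketch below rests on two assumptions that the PR does not state explicitly: the PVC root is mounted at the mountPoint with no subPath, and profiling output lives in a `profiling_results` directory at the PVC root. `container_path` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# container_path MOUNT_POINT PVC_RELATIVE_PATH
# Prints where a PVC-relative path appears in the container, assuming the
# PVC root is mounted at MOUNT_POINT with no subPath (an assumption).
container_path() {
  printf '%s/%s\n' "${1%/}" "${2#/}"
}

old=$(container_path /data/profiling_results profiling_results)
new=$(container_path /data profiling_results)

echo "old mountPoint: $old"
echo "new mountPoint: $new"
```

Under those assumptions, only the /data mount places the results at /data/profiling_results, the path the review comment says the planner args point to.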

tmonty12 and others added 4 commits September 18, 2025 17:16
Signed-off-by: tmontfort <tmontfort@nvidia.com>
Signed-off-by: tmontfort <tmontfort@nvidia.com>
Signed-off-by: tmontfort <tmontfort@nvidia.com>
Signed-off-by: tmontfort <tmontfort@nvidia.com>
@tmonty12 tmonty12 force-pushed the tmonty12/small-planner-fixes branch from 6f58daa to 4f8973f Compare September 19, 2025 00:17
Signed-off-by: tmontfort <tmontfort@nvidia.com>
@tmonty12 tmonty12 requested review from a team as code owners September 19, 2025 18:56
@tmonty12 tmonty12 merged commit 7d2fc13 into main Sep 19, 2025
13 of 17 checks passed
@tmonty12 tmonty12 deleted the tmonty12/small-planner-fixes branch September 19, 2025 19:46
3 participants