Skip to content

Conversation

@hhzhang16
Copy link
Contributor

@hhzhang16 hhzhang16 commented Aug 11, 2025

Overview:

Use scripts to provide better result storage access and to improve user profiling UX.

Details:

  • Added pvc-access-pod.yaml for temporary PVC access. PVC = Persistent Volume Claim, where profiling results (and potentially profiling configs) can be stored. Needs to be accessed through pods.
  • Added download_pvc_results.py script for retrieving profiling results
  • Added inject_disagg_config.py for configuration management
  • Expanded pre-deployment profiling docs with updated workflows and better guidance

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added CLI to download profiling results from a Kubernetes PVC to a local directory.
    • Added CLI to inject a disaggregation config into a profiling PVC.
    • Introduced a standardized PVC access pod for interacting with profiling data and configs; auto-deletes after 5 minutes.
  • Documentation

    • Updated pre-deployment profiling guide with two container image options (use pre-built image with PVC-injected config or build a custom image).
    • Revised steps for accessing and downloading results via the new access pod.
    • Added cleanup guidance and troubleshooting for image pull authentication.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 11, 2025

Walkthrough

Adds a Kubernetes PVC access pod manifest, two new Python CLI tools to inject a disagg config into a PVC and to download profiling results from it, and updates documentation to reflect the new access pod workflow, commands, paths, and troubleshooting notes.

Changes

Cohort / File(s) Summary of Changes
PVC Access Pod Manifest
benchmarks/profiler/deploy/pvc-access-pod.yaml
Introduces a Pod that mounts profiling-pvc at /profiling_results, runs ubuntu:22.04, installs tools, and sleeps; includes resource limits; activeDeadlineSeconds: 300; restartPolicy: Never.
PVC Results Download Tool
benchmarks/profiler/download_pvc_results.py
New CLI to validate kubectl access, ensure/access pvc-access-pod, list files in /profiling_results, selectively copy artifacts to a local dir, fetch summary files, generate a README, and provide cleanup guidance.
Disagg Config Injection Tool
benchmarks/profiler/inject_disagg_config.py
New CLI to validate kubectl access, ensure/access pvc-access-pod, copy a disagg config into the PVC at a target path, verify placement, and optionally clean up the pod.
Docs: Pre-deployment Profiling
docs/architecture/pre_deployment_profiling.md
Revises setup to support config injection via PVC or custom image; updates commands to use pvc-access-pod and /profiling_results; adds cleanup, lifecycle notes, and image pull troubleshooting.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant DL as download_pvc_results.py
  participant K as kubectl
  participant API as Kubernetes API
  participant P as pvc-access-pod
  participant PVC as profiling-pvc

  U->>DL: Run CLI (namespace, output-dir, [--no-config])
  DL->>K: kubectl auth/namespace check
  K->>API: Validate access
  DL->>K: kubectl get/apply pvc-access-pod
  K->>API: Ensure pod Running
  DL->>K: kubectl exec ls /profiling_results
  K->>P: List files
  P->>PVC: Read directory
  DL->>K: kubectl cp P:/profiling_results/... -> local
  K->>P: Stream files
  P->>PVC: Read files
  DL-->>U: Summary + README generated
Loading
sequenceDiagram
  autonumber
  participant U as User
  participant INJ as inject_disagg_config.py
  participant K as kubectl
  participant API as Kubernetes API
  participant P as pvc-access-pod
  participant PVC as profiling-pvc

  U->>INJ: Run CLI (namespace, [--disagg-config], [--target-path], [--cleanup])
  INJ->>K: kubectl auth/namespace check
  K->>API: Validate access
  INJ->>K: kubectl get/apply pvc-access-pod
  K->>API: Ensure pod Running
  INJ->>K: kubectl cp local disagg.yaml -> P:/profiling_results/...
  K->>P: Transfer file
  P->>PVC: Write file
  INJ->>K: kubectl exec ls target path
  K->>P: Verify file exists
  INJ-->>U: Report target path and optional cleanup
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A nibble of bytes, a hop to the pod,
I stash configs snug where volumes nod.
Then whisk back charts from PVC lands—
npz treasures in my little paws’ hands.
Five-minute burrow; then I’m gone—
profiling dreams, at rabbit dawn. 🐇📦✨


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (11)
benchmarks/profiler/deploy/pvc-access-pod.yaml (2)

31-31: Add newline at end of file

-  restartPolicy: Never
+  restartPolicy: Never
+

14-16: Consider using a pre-built image with tools installed

Installing packages at runtime (apt-get update && apt-get install) adds startup delay and network dependency. Consider using a pre-built image with curl and jq already installed.

Example alternatives:

  • Use curlimages/curl image which has curl pre-installed
  • Build a custom image FROM ubuntu:22.04 with RUN apt-get commands
  • Use alpine with pre-installed tools for smaller footprint
benchmarks/profiler/inject_disagg_config.py (2)

237-237: Add newline at end of file

 if __name__ == "__main__":
     main()
+

62-70: Improve error handling in deploy_access_pod

The current implementation uses a bare except subprocess.CalledProcessError with pass, which could mask other issues. The function should be more explicit about what it's catching.

     # Check if pod already exists
-    try:
-        result = run_command(
-            ["kubectl", "get", "pod", pod_name, "-n", namespace], capture_output=True
-        )
-        print(f"✓ Access pod '{pod_name}' already exists")
-        return
-    except subprocess.CalledProcessError:
-        # Pod doesn't exist, deploy it
-        pass
+    result = subprocess.run(
+        ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "jsonpath={.status.phase}"],
+        capture_output=True, text=True, check=False
+    )
+    
+    if result.returncode == 0:
+        phase = result.stdout.strip()
+        if phase == "Running":
+            print(f"✓ Access pod '{pod_name}' already exists and is running")
+            return
+        elif phase:
+            print(f"Access pod exists but is in {phase} state, will redeploy...")
+            run_command(["kubectl", "delete", "pod", pod_name, "-n", namespace, "--wait"], capture_output=True)
benchmarks/profiler/download_pvc_results.py (3)

413-413: Add newline at end of file

 if __name__ == "__main__":
     main()
+

28-32: Use consistent type hints throughout the file

The file uses List[str] from typing module but Python 3.9+ supports native list[str]. Since inject_disagg_config.py uses list[str], maintain consistency.

-from typing import List
+# Remove typing import if not needed for other types

 def run_command(
-    cmd: List[str], capture_output: bool = True
+    cmd: list[str], capture_output: bool = True
 ) -> subprocess.CompletedProcess:

Apply similar changes to other function signatures using List[str].


291-294: Improve exception handling specificity

Using bare except Exception can mask unexpected errors. Be more specific about what exceptions you're catching.

-        except Exception as e:
-            # File doesn't exist or failed to download, skip silently
-            print(f"  ⚠️  Skipped {file_path.split('/')[-1]}: {e}")
-            pass
+        except subprocess.CalledProcessError as e:
+            # File doesn't exist or failed to download
+            print(f"  ⚠️  Skipped {file_path.split('/')[-1]}: {e}")
+        except Exception as e:
+            # Unexpected error, log but continue
+            print(f"  ⚠️  Unexpected error for {file_path.split('/')[-1]}: {e}")
docs/architecture/pre_deployment_profiling.md (4)

79-79: Use proper heading syntax instead of emphasis

Multiple lines use bold emphasis (**text**) where heading syntax would be more appropriate for document structure.

-**Step 1: Configure container image**
+### Step 1: Configure container image

-**Option A: Use pre-built image with custom config injection (recommended)**
+#### Option A: Use pre-built image with custom config injection (recommended)

-**Option B: Build custom image (only if you need code changes)**
+#### Option B: Build custom image (only if you need code changes)

-**Option 1: Automated Download (Recommended)**
+##### Option 1: Automated Download (Recommended)

-**Option 2: Manual Download**
+##### Option 2: Manual Download

Also applies to: 83-83, 118-118, 226-226, 248-248


110-112: Use consistent list style markers

The document mixes asterisk (*) and dash (-) for unordered lists. Maintain consistency by using asterisks throughout.

 This approach allows you to:
-- Customize DGD configurations without rebuilding container images
-- Test different model configurations easily
-- Version control your DGD configs alongside your code
+* Customize DGD configurations without rebuilding container images
+* Test different model configurations easily
+* Version control your DGD configs alongside your code

 The script will:
-- Deploy a temporary access pod (auto-deletes after 5 minutes)
-- Scan for relevant files (*.png, *.npz, *.yaml)
-- Download all files maintaining directory structure
-- Generate a README.md with file descriptions
-- Clean up automatically
+* Deploy a temporary access pod (auto-deletes after 5 minutes)
+* Scan for relevant files (*.png, *.npz, *.yaml)
+* Download all files maintaining directory structure
+* Generate a README.md with file descriptions
+* Clean up automatically

Also applies to: 242-246


243-243: Remove spaces inside emphasis markers

-- Scan for relevant files (*.png, *.npz, *.yaml)
+- Scan for relevant files (*.png, *.npz, *.yaml)

114-116: Remove blank line inside blockquote

Markdown blockquotes should not contain blank lines between consecutive quoted lines.

 > **Important**: For profiling, disagg configs should be run with Grove disabled by adding the annotation `nvidia.com/enable-grove: "false"` to avoid alpha Grove status issues.
-
 > **Note**: The default location in the PVC is `/profiling_results/disagg.yaml`. If you don't inject a config, the profiler will fall back to the built-in config at `/workspace/components/backends/vllm/deploy/disagg.yaml`.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d51580a and b629059.

📒 Files selected for processing (4)
  • benchmarks/profiler/deploy/pvc-access-pod.yaml (1 hunks)
  • benchmarks/profiler/download_pvc_results.py (1 hunks)
  • benchmarks/profiler/inject_disagg_config.py (1 hunks)
  • docs/architecture/pre_deployment_profiling.md (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
benchmarks/profiler/inject_disagg_config.py (1)
benchmarks/profiler/download_pvc_results.py (4)
  • check_kubectl_access (50-54)
  • deploy_access_pod (57-135)
  • cleanup_access_pod (297-300)
  • main (354-408)
🪛 Checkov (3.2.334)
benchmarks/profiler/deploy/pvc-access-pod.yaml

[MEDIUM] 4-31: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[MEDIUM] 4-31: Minimize the admission of root containers

(CKV_K8S_23)

🪛 YAMLlint (1.37.1)
benchmarks/profiler/deploy/pvc-access-pod.yaml

[error] 31-31: no new line character at the end of file

(new-line-at-end-of-file)

🪛 GitHub Actions: Copyright Checks
benchmarks/profiler/download_pvc_results.py

[error] 1-1: Copyright header missing/invalid in file. Step: pwsh /workspace/.github/workflows/copyright-check.ps1.

benchmarks/profiler/inject_disagg_config.py

[error] 1-1: Copyright header missing/invalid in file. Step: pwsh /workspace/.github/workflows/copyright-check.ps1.

🪛 markdownlint-cli2 (0.17.2)
docs/architecture/pre_deployment_profiling.md

79-79: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


83-83: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


110-110: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


111-111: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


112-112: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


115-115: Blank line inside blockquote

(MD028, no-blanks-blockquote)


118-118: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


226-226: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


242-242: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


243-243: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


243-243: Spaces inside emphasis markers

(MD037, no-space-in-emphasis)


244-244: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


245-245: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


246-246: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


248-248: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo

@grahamking
Copy link
Contributor

Could you define "PVC" in the description?

@hhzhang16
Copy link
Contributor Author

Could you define "PVC" in the description?

Done, let me know if more detail is needed @grahamking

@hhzhang16 hhzhang16 merged commit bea92ea into main Aug 11, 2025
11 of 12 checks passed
@hhzhang16 hhzhang16 deleted the hannahz/dep-304-profiling-pvc-updates branch August 11, 2025 22:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants