Skip to content

Conversation

@hhzhang16
Copy link
Contributor

@hhzhang16 hhzhang16 commented Sep 8, 2025

Overview:

Following QA feedback, updated benchmarking.

Details:

  • Default model is now Qwen/Qwen3-0.6B, matching the manifests
  • Run deploy/utils scripts as Python modules
  • Interact with everything in the pvc under the /data mount
  • Update docs to clarify GPU resource usage

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Enforced /data/ destination paths for manifest injection with clearer, user-friendly errors; unified PVC mounts to /data (/workspace/data) across jobs and access pod.
  • Refactor

    • More robust access-pod lifecycle management and softer failure handling (errors surface without hard exits), improving reliability of deploy/download flows.
  • Chores

    • Default benchmark model switched to Qwen/Qwen3-0.6B.
  • Documentation

    • Updated examples to use python -m entry points.
    • Standardized all paths to /data/* and clarified path requirements.
    • Revised benchmarking docs to reflect the new default model.

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 8, 2025

Walkthrough

This PR updates default benchmark model strings, unifies PVC mount paths under /data, refactors deploy utilities to add ensure_clean_access_pod and adjust run_command error handling, updates Kubernetes manifests and job specs to use /data, and revises documentation and help texts to reflect new defaults and path requirements.

Changes

Cohort / File(s) Summary of edits
Benchmark defaults
benchmarks/benchmark.sh, benchmarks/utils/benchmark.py, docs/benchmarks/benchmarking.md
Changed default model to Qwen/Qwen3-0.6B; updated help text and docs to reflect the new default and note alignment with manifests/endpoints.
PVC path unification (/data)
benchmarks/profiler/deploy/profile_sla_job.yaml, deploy/utils/manifests/pvc-access-pod.yaml, docs/benchmarks/pre_deployment_profiling.md, deploy/utils/README.md
Switched config/output paths and mounts to /data/...; removed separate configs PVC/mount in Job; access pod mountPath changed to /data; docs updated to require /data/ prefix and show new path examples.
Deploy utilities refactor
deploy/utils/kubernetes.py, deploy/utils/inject_manifest.py, deploy/utils/download_pvc_results.py
Added ensure_clean_access_pod(namespace) -> str; updated callers to use it instead of deploy_access_pod; run_command gains exit_on_error flag and revised error handling; inject_manifest enforces args.dest starts with /data/ and returns target_path.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Inject as inject_manifest.py
  participant K8sUtil as kubernetes.py
  participant K8s as Kubernetes API

  User->>Inject: python -m deploy.utils.inject_manifest --dest /data/...
  Inject->>Inject: Validate dest starts with "/data/"
  Inject->>K8sUtil: ensure_clean_access_pod(namespace)
  K8sUtil->>K8s: Check existing access pod
  alt Pod exists
    K8sUtil->>K8s: Delete existing access pod
  end
  K8sUtil->>K8s: Deploy access pod (YAML)
  K8sUtil->>K8s: Wait for Ready (run_command exit_on_error=False)
  K8s-->>K8sUtil: Ready or error (exception propagated)
  K8sUtil-->>Inject: pod_name
  Inject->>K8s: kubectl cp manifest -> PVC (/data/...)
  Inject-->>User: Completed (returns target_path)
Loading
sequenceDiagram
  autonumber
  participant DL as download_pvc_results.py
  participant K8sUtil as kubernetes.py
  participant K8s as Kubernetes API

  DL->>K8sUtil: ensure_clean_access_pod(namespace)
  K8sUtil->>K8s: Recreate/ensure access pod
  K8s-->>K8sUtil: Pod ready
  DL->>K8s: kubectl cp /data/... from pod to local
  DL-->>DL: Finish without hard exit (exceptions propagate if errors)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

I hop through pods where data lies,
From /configs past to /data skies.
A fresh clean pod, a tidy scene,
Commands don’t crash—they raise, they glean.
Benchmarks hum a Qwen-tuned tune,
Results in /data? See you soon! 🐇✨

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  - Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.
  - Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
benchmarks/utils/benchmark.py (1)

57-66: Align Python defaults with docs and benchmark.sh (ISL/OSL mismatch).

Docs and benchmark.sh use ISL=2000, OSL=256; Python defaults here are 200/200. This will skew results when running the module directly.

Apply:

-    parser.add_argument("--isl", type=int, default=200, help="Input sequence length")
+    parser.add_argument("--isl", type=int, default=2000, help="Input sequence length")
...
-    parser.add_argument("--osl", type=int, default=200, help="Output sequence length")
+    parser.add_argument("--osl", type=int, default=256, help="Output sequence length")

Also applies to: 64-66

deploy/utils/inject_manifest.py (1)

48-51: Use the actual pod name returned by ensure_clean_access_pod

You rely on PVC_ACCESS_POD_NAME in copy_manifest, but ensure_clean_access_pod returns the deployed pod’s name. Capture and pass it to avoid future divergence.

-def copy_manifest(namespace: str, manifest_path: Path, target_path: str) -> None:
+def copy_manifest(namespace: str, manifest_path: Path, target_path: str, pod_name: str) -> None:
     """Copy a manifest file into the PVC via the access pod."""
-    pod_name = PVC_ACCESS_POD_NAME
@@
-    # Deploy access pod
-    ensure_clean_access_pod(args.namespace)
+    # Deploy access pod
+    pod_name = ensure_clean_access_pod(args.namespace)
     try:
         # Copy manifest
-        copy_manifest(args.namespace, args.src, args.dest)
+        copy_manifest(args.namespace, args.src, args.dest, pod_name)
         print("\n✅ Manifest injection completed!")

Also applies to: 153-159

🧹 Nitpick comments (17)
benchmarks/utils/benchmark.py (2)

38-41: Avoid importing re inside hot path.

Minor: move to module scope to avoid repeated imports.

-import re
+import re

(move to top with other imports)


105-106: Remove unused variables or use them.

endpoints, manifests = ... are assigned but unused.

-        endpoints, manifests = categorize_inputs(parsed_inputs)
+        _endpoints, _manifests = categorize_inputs(parsed_inputs)

Or consume them as needed.

docs/benchmarks/benchmarking.md (1)

151-177: Direct Python example uses unsupported flag.

The module takes repeated --input <label>=<manifest_or_endpoint>, not --endpoint. Update examples for parity with benchmark.sh.

-python3 -u -m benchmarks.utils.benchmark \
-   --endpoint "http://your-endpoint:8000" \
+python3 -u -m benchmarks.utils.benchmark \
+   --input prod="http://your-endpoint:8000" \
docs/benchmarks/pre_deployment_profiling.md (2)

114-122: Unify invocation style to module form.

Elsewhere you switched to python -m .... Mirror it here.

-   python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml
+   python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml
...
-   python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
+   python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
...
-   python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/profiling_results/my-disagg.yaml
+   python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/profiling_results/my-disagg.yaml

182-185: Download commands updated to /data — good.

Minor: ensure earlier note sets DGD_CONFIG_FILE under /workspace/data/... to be consistent with the Job.

deploy/utils/kubernetes.py (6)

24-44: Library functions shouldn’t call sys.exit; bubble errors instead.

Let callers decide exit behavior; keep a single strategy.

-def run_command(
-    cmd: List[str], capture_output: bool = True, exit_on_error: bool = True
-) -> subprocess.CompletedProcess:
+def run_command(
+    cmd: List[str], capture_output: bool = True, exit_on_error: bool = True
+) -> subprocess.CompletedProcess:
@@
-        if exit_on_error:
-            sys.exit(1)
-        else:
-            raise
+        if exit_on_error:
+            raise RuntimeError(f"Command failed: {' '.join(cmd)}") from e
+        raise

And update internal callers that relied on process exit to catch exceptions or re-raise at the top-level CLI.


53-101: Tighten exception handling and reuse helpers.

Avoid bare except Exception; use CalledProcessError or check return codes only. Consider reusing run_command for consistency.

-    try:
-        result = subprocess.run(
+    try:
+        result = subprocess.run(
@@
-    except Exception:
-        pass  # Pod doesn't exist, which is fine
+    except subprocess.SubprocessError as e:
+        print(f"Warning: failed to query existing access pod: {e}")

Also consider increasing readiness wait or making it configurable (see next comment).


126-127: Log instead of silent pass.

Swallowing all exceptions hides actionable errors (e.g., RBAC).

-    except Exception:
-        pass  # Pod doesn't exist or isn't running
+    except subprocess.SubprocessError as e:
+        print(f"Info: access pod not running yet: {e}")

130-135: Avoid sys.exit in library; raise an exception.

-    if not pod_yaml_path.exists():
-        print(f"ERROR: Pod YAML not found at {pod_yaml_path}")
-        sys.exit(1)
+    if not pod_yaml_path.exists():
+        raise FileNotFoundError(f"Pod YAML not found at {pod_yaml_path}")

140-153: Make wait timeout configurable and increase default.

Pulls can exceed 60s; propose env var with sane default (e.g., 180s).

-    print("Waiting for pod to be ready...")
+    print("Waiting for pod to be ready...")
+    timeout="${PVC_ACCESS_POD_TIMEOUT_SECS:-180}s"
     run_command(
         [
             "kubectl",
             "wait",
             f"pod/{PVC_ACCESS_POD_NAME}",
             "-n",
             namespace,
             "--for=condition=Ready",
-            "--timeout=60s",
+            f"--timeout={timeout}",
         ],
         capture_output=False,
         exit_on_error=False,
     )

24-24: Address static analysis warnings (S603, S607, S110, BLE001).

  • Keep using a fixed executable (“kubectl”) but constrain inputs to trusted constants/args.
  • Replace bare except Exception with narrower exceptions and log.

Also applies to: 58-58, 89-90, 175-176

deploy/utils/download_pvc_results.py (2)

92-151: Speed up downloads when not filtering configs

When --no-config is false, copying files one-by-one is slow. Prefer a single kubectl cp of the whole directory; fall back to per-file only when filtering is required.

 def download_files(
     namespace: str, pod_name: str, files: List[str], output_dir: Path, base_folder: str
 ) -> None:
     """Download relevant files from PVC to local directory."""
-    if not files:
+    if not files:
         print("No files to download")
         return
@@
-    downloaded = 0
-    failed = 0
+    downloaded = 0
+    failed = 0
+
+    # Fast path: bulk copy when not excluding configs and target is a directory root
+    if files and all(fp.startswith(base_folder.rstrip("/") + "/") for fp in files):
+        try:
+            run_command(
+                [
+                    "kubectl",
+                    "cp",
+                    f"{namespace}/{pod_name}:{base_folder.rstrip('/')}/.",
+                    str(output_dir),
+                ],
+                capture_output=False,
+            )
+            print(f"✓ Bulk copied directory {base_folder} → {output_dir}")
+            return
+        except subprocess.CalledProcessError:
+            print("Falling back to per-file copy...")

125-127: Be robust on Python < 3.9 environments

Path.is_relative_to is 3.9+. If you need 3.8 compatibility, use a safe commonpath check.

-            if not local_file.resolve().is_relative_to(output_dir.resolve()):
+            try:
+                _ = local_file.resolve().relative_to(output_dir.resolve())
+            except Exception:
                 print(f"  WARNING: Skipping file outside output directory: {file_path}")
                 failed += 1
                 continue
deploy/utils/inject_manifest.py (4)

24-26: Docstring updated for /data — also switch Usage/Examples to module invocation

Good call-out on /data. For consistency with the PR’s “run as module” guidance, update the Usage/Examples.

-Usage:
-    python3 inject_manifest.py --namespace <namespace> --src <local_manifest.yaml> --dest <absolute_path_in_pvc>
+Usage:
+    python3 -m deploy.utils.inject_manifest --namespace <namespace> --src <local_manifest.yaml> --dest <absolute_path_in_pvc>
@@
-Examples:
-    python3 inject_manifest.py --namespace <ns> --src ./disagg.yaml --dest /data/configs/disagg.yaml
-    python3 inject_manifest.py --namespace <ns> --src ./my-data.yaml    --dest /data/custom/path/data.yaml
+Examples:
+    python3 -m deploy.utils.inject_manifest --namespace <ns> --src ./disagg.yaml   --dest /data/configs/disagg.yaml
+    python3 -m deploy.utils.inject_manifest --namespace <ns> --src ./my-data.yaml  --dest /data/custom/path/data.yaml

Also applies to: 27-33


84-85: Return value from copy_manifest is unused

Either remove the return or use it in main(). Keeping dead returns adds noise.

-    print(f"File details: {result.stdout.strip()}")
-    return target_path
+    print(f"File details: {result.stdout.strip()}")

112-140: Nice validation UX; consider centralizing this check

This block is solid. To avoid drift across tools, consider a helper in deploy.utils.kubernetes (e.g., validate_pvc_path(path: str, require_data_prefix=True) -> None) and reuse it here and in download_pvc_results.py.

I can draft the helper and wire it up across scripts if you want.


52-55: Validate src is a regular file (not just exists)

Tighten the check to avoid copying directories or special files by mistake.

-    if not manifest_path.exists():
+    if not (manifest_path.exists() and manifest_path.is_file()):
         print(f"ERROR: Manifest file not found: {manifest_path}")
         sys.exit(1)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1477f6e and de28605.

📒 Files selected for processing (10)
  • benchmarks/benchmark.sh (2 hunks)
  • benchmarks/profiler/deploy/profile_sla_job.yaml (2 hunks)
  • benchmarks/utils/benchmark.py (1 hunks)
  • deploy/utils/README.md (1 hunks)
  • deploy/utils/download_pvc_results.py (3 hunks)
  • deploy/utils/inject_manifest.py (5 hunks)
  • deploy/utils/kubernetes.py (5 hunks)
  • deploy/utils/manifests/pvc-access-pod.yaml (1 hunks)
  • docs/benchmarks/benchmarking.md (2 hunks)
  • docs/benchmarks/pre_deployment_profiling.md (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/utils/inject_manifest.py (1)
deploy/utils/kubernetes.py (1)
  • ensure_clean_access_pod (53-99)
deploy/utils/download_pvc_results.py (1)
deploy/utils/kubernetes.py (1)
  • ensure_clean_access_pod (53-99)
🪛 Ruff (0.12.2)
deploy/utils/kubernetes.py

58-58: subprocess call: check for execution of untrusted input

(S603)


59-68: Starting a process with a partial executable path

(S607)


89-90: try-except-pass detected, consider logging the exception

(S110)


89-89: Do not catch blind exception: Exception

(BLE001)


175-175: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (14)
benchmarks/utils/benchmark.py (1)

67-69: Default model switch to Qwen/Qwen3-0.6B — LGTM

Matches PR intent and other files.

deploy/utils/manifests/pvc-access-pod.yaml (1)

29-30: Mount path standardized to /data — good.

Consistent with new PVC convention and related tooling/docs.

docs/benchmarks/benchmarking.md (1)

36-36: Doc default model updated — looks good.

deploy/utils/README.md (2)

91-96: /data path requirement callout — good addition.


102-107: Verified entry points and imports
Both deploy/utils/inject_manifest.py and download_pvc_results.py exist and import ensure_clean_access_pod from deploy.utils.kubernetes as expected.

benchmarks/benchmark.sh (2)

14-15: Default model updated — good.


49-51: Help text aligns with model default; keep ISL/OSL in sync with Python.

Avoid divergent defaults between bash and Python entry points.

Use the same script shared in docs comment to verify ISL/OSL across files.

docs/benchmarks/pre_deployment_profiling.md (2)

93-96: Path prefix fix to /data — good.


196-196: File-structure root standardized to /workspace/data — good.

benchmarks/profiler/deploy/profile_sla_job.yaml (2)

32-35: Config/output moved under /workspace/data — LGTM.

Aligns with single PVC mount; simplifies volumes.


53-54: Single PVC mountpoint at /workspace/data — LGTM.

Ensures both config and results share the same claim.

deploy/utils/download_pvc_results.py (1)

36-41: Import rename to ensure_clean_access_pod looks good

API alignment with deploy.utils.kubernetes is correct in both import paths.

Also applies to: 46-51

deploy/utils/inject_manifest.py (2)

39-45: Import switch to ensure_clean_access_pod — LGTM

Matches the new API and keeps this module consistent with others.


107-108: Dest help text — LGTM

Clear and aligns with the /data requirement.

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
create: false
name: dynamo-pvc # Must be pre-created before deployment and SLA profiler must have been run
mountPoint: /workspace/profiling_results
mountPoint: /workspace/data/profiling_results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still have /workspace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, workspace is where the code is, I've just moved the PVC to mount on just /data rather than /workspace/data

--backend=vllm
--adjustment-interval=60
--profile-results-dir=/workspace/profiling_results
--profile-results-dir=/workspace/data/profiling_results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, do we still have /workspace?

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
…qa-updates

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
@saturley-hall saturley-hall merged commit 09c7b73 into main Sep 9, 2025
13 of 16 checks passed
@saturley-hall saturley-hall deleted the benchmarking-qa-updates branch September 9, 2025 02:53
saturley-hall pushed a commit that referenced this pull request Sep 9, 2025
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
saturley-hall added a commit that referenced this pull request Sep 9, 2025
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
indrajit96 pushed a commit that referenced this pull request Sep 9, 2025
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
indrajit96 pushed a commit that referenced this pull request Sep 9, 2025
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
tedzhouhk pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants