-
Notifications
You must be signed in to change notification settings - Fork 688
feat: update benchmarking and deploy utils #2933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
WalkthroughThis PR updates default benchmark model strings, unifies PVC mount paths under /data, refactors deploy utilities to add ensure_clean_access_pod and adjust run_command error handling, updates Kubernetes manifests and job specs to use /data, and revises documentation and help texts to reflect new defaults and path requirements. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User
participant Inject as inject_manifest.py
participant K8sUtil as kubernetes.py
participant K8s as Kubernetes API
User->>Inject: python -m deploy.utils.inject_manifest --dest /data/...
Inject->>Inject: Validate dest starts with "/data/"
Inject->>K8sUtil: ensure_clean_access_pod(namespace)
K8sUtil->>K8s: Check existing access pod
alt Pod exists
K8sUtil->>K8s: Delete existing access pod
end
K8sUtil->>K8s: Deploy access pod (YAML)
K8sUtil->>K8s: Wait for Ready (run_command exit_on_error=False)
K8s-->>K8sUtil: Ready or error (exception propagated)
K8sUtil-->>Inject: pod_name
Inject->>K8s: kubectl cp manifest -> PVC (/data/...)
Inject-->>User: Completed (returns target_path)
sequenceDiagram
autonumber
participant DL as download_pvc_results.py
participant K8sUtil as kubernetes.py
participant K8s as Kubernetes API
DL->>K8sUtil: ensure_clean_access_pod(namespace)
K8sUtil->>K8s: Recreate/ensure access pod
K8s-->>K8sUtil: Pod ready
DL->>K8s: kubectl cp /data/... from pod to local
DL-->>DL: Finish without hard exit (exceptions propagate if errors)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
benchmarks/utils/benchmark.py (1)
57-66: Align Python defaults with docs and benchmark.sh (ISL/OSL mismatch).Docs and benchmark.sh use ISL=2000, OSL=256; Python defaults here are 200/200. This will skew results when running the module directly.
Apply:
- parser.add_argument("--isl", type=int, default=200, help="Input sequence length") + parser.add_argument("--isl", type=int, default=2000, help="Input sequence length") ... - parser.add_argument("--osl", type=int, default=200, help="Output sequence length") + parser.add_argument("--osl", type=int, default=256, help="Output sequence length")Also applies to: 64-66
deploy/utils/inject_manifest.py (1)
48-51: Use the actual pod name returned by ensure_clean_access_podYou rely on PVC_ACCESS_POD_NAME in copy_manifest, but ensure_clean_access_pod returns the deployed pod’s name. Capture and pass it to avoid future divergence.
-def copy_manifest(namespace: str, manifest_path: Path, target_path: str) -> None: +def copy_manifest(namespace: str, manifest_path: Path, target_path: str, pod_name: str) -> None: """Copy a manifest file into the PVC via the access pod.""" - pod_name = PVC_ACCESS_POD_NAME @@ - # Deploy access pod - ensure_clean_access_pod(args.namespace) + # Deploy access pod + pod_name = ensure_clean_access_pod(args.namespace) try: # Copy manifest - copy_manifest(args.namespace, args.src, args.dest) + copy_manifest(args.namespace, args.src, args.dest, pod_name) print("\n✅ Manifest injection completed!")Also applies to: 153-159
🧹 Nitpick comments (17)
benchmarks/utils/benchmark.py (2)
38-41: Avoid importingreinside hot path.Minor: move to module scope to avoid repeated imports.
-import re +import re(move to top with other imports)
105-106: Remove unused variables or use them.
endpoints, manifests = ...are assigned but unused.- endpoints, manifests = categorize_inputs(parsed_inputs) + _endpoints, _manifests = categorize_inputs(parsed_inputs)Or consume them as needed.
docs/benchmarks/benchmarking.md (1)
151-177: Direct Python example uses unsupported flag.The module takes repeated
--input <label>=<manifest_or_endpoint>, not--endpoint. Update examples for parity with benchmark.sh.-python3 -u -m benchmarks.utils.benchmark \ - --endpoint "http://your-endpoint:8000" \ +python3 -u -m benchmarks.utils.benchmark \ + --input prod="http://your-endpoint:8000" \docs/benchmarks/pre_deployment_profiling.md (2)
114-122: Unify invocation style to module form.Elsewhere you switched to
python -m .... Mirror it here.- python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml + python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml ... - python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml + python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml ... - python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/profiling_results/my-disagg.yaml + python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/profiling_results/my-disagg.yaml
182-185: Download commands updated to /data — good.Minor: ensure earlier note sets DGD_CONFIG_FILE under
/workspace/data/...to be consistent with the Job.deploy/utils/kubernetes.py (6)
24-44: Library functions shouldn’t call sys.exit; bubble errors instead.Let callers decide exit behavior; keep a single strategy.
-def run_command( - cmd: List[str], capture_output: bool = True, exit_on_error: bool = True -) -> subprocess.CompletedProcess: +def run_command( + cmd: List[str], capture_output: bool = True, exit_on_error: bool = True +) -> subprocess.CompletedProcess: @@ - if exit_on_error: - sys.exit(1) - else: - raise + if exit_on_error: + raise RuntimeError(f"Command failed: {' '.join(cmd)}") from e + raiseAnd update internal callers that relied on process exit to catch exceptions or re-raise at the top-level CLI.
53-101: Tighten exception handling and reuse helpers.Avoid bare
except Exception; useCalledProcessErroror check return codes only. Consider reusingrun_commandfor consistency.- try: - result = subprocess.run( + try: + result = subprocess.run( @@ - except Exception: - pass # Pod doesn't exist, which is fine + except subprocess.SubprocessError as e: + print(f"Warning: failed to query existing access pod: {e}")Also consider increasing readiness wait or making it configurable (see next comment).
126-127: Log instead of silent pass.Swallowing all exceptions hides actionable errors (e.g., RBAC).
- except Exception: - pass # Pod doesn't exist or isn't running + except subprocess.SubprocessError as e: + print(f"Info: access pod not running yet: {e}")
130-135: Avoid sys.exit in library; raise an exception.- if not pod_yaml_path.exists(): - print(f"ERROR: Pod YAML not found at {pod_yaml_path}") - sys.exit(1) + if not pod_yaml_path.exists(): + raise FileNotFoundError(f"Pod YAML not found at {pod_yaml_path}")
140-153: Make wait timeout configurable and increase default.Pulls can exceed 60s; propose env var with sane default (e.g., 180s).
- print("Waiting for pod to be ready...") + print("Waiting for pod to be ready...") + timeout="${PVC_ACCESS_POD_TIMEOUT_SECS:-180}s" run_command( [ "kubectl", "wait", f"pod/{PVC_ACCESS_POD_NAME}", "-n", namespace, "--for=condition=Ready", - "--timeout=60s", + f"--timeout={timeout}", ], capture_output=False, exit_on_error=False, )
24-24: Address static analysis warnings (S603, S607, S110, BLE001).
- Keep using a fixed executable (“kubectl”) but constrain inputs to trusted constants/args.
- Replace bare
except Exceptionwith narrower exceptions and log.Also applies to: 58-58, 89-90, 175-176
deploy/utils/download_pvc_results.py (2)
92-151: Speed up downloads when not filtering configsWhen --no-config is false, copying files one-by-one is slow. Prefer a single kubectl cp of the whole directory; fall back to per-file only when filtering is required.
def download_files( namespace: str, pod_name: str, files: List[str], output_dir: Path, base_folder: str ) -> None: """Download relevant files from PVC to local directory.""" - if not files: + if not files: print("No files to download") return @@ - downloaded = 0 - failed = 0 + downloaded = 0 + failed = 0 + + # Fast path: bulk copy when not excluding configs and target is a directory root + if files and all(fp.startswith(base_folder.rstrip("/") + "/") for fp in files): + try: + run_command( + [ + "kubectl", + "cp", + f"{namespace}/{pod_name}:{base_folder.rstrip('/')}/.", + str(output_dir), + ], + capture_output=False, + ) + print(f"✓ Bulk copied directory {base_folder} → {output_dir}") + return + except subprocess.CalledProcessError: + print("Falling back to per-file copy...")
125-127: Be robust on Python < 3.9 environmentsPath.is_relative_to is 3.9+. If you need 3.8 compatibility, use a safe commonpath check.
- if not local_file.resolve().is_relative_to(output_dir.resolve()): + try: + _ = local_file.resolve().relative_to(output_dir.resolve()) + except Exception: print(f" WARNING: Skipping file outside output directory: {file_path}") failed += 1 continuedeploy/utils/inject_manifest.py (4)
24-26: Docstring updated for /data — also switch Usage/Examples to module invocationGood call-out on /data. For consistency with the PR’s “run as module” guidance, update the Usage/Examples.
-Usage: - python3 inject_manifest.py --namespace <namespace> --src <local_manifest.yaml> --dest <absolute_path_in_pvc> +Usage: + python3 -m deploy.utils.inject_manifest --namespace <namespace> --src <local_manifest.yaml> --dest <absolute_path_in_pvc> @@ -Examples: - python3 inject_manifest.py --namespace <ns> --src ./disagg.yaml --dest /data/configs/disagg.yaml - python3 inject_manifest.py --namespace <ns> --src ./my-data.yaml --dest /data/custom/path/data.yaml +Examples: + python3 -m deploy.utils.inject_manifest --namespace <ns> --src ./disagg.yaml --dest /data/configs/disagg.yaml + python3 -m deploy.utils.inject_manifest --namespace <ns> --src ./my-data.yaml --dest /data/custom/path/data.yamlAlso applies to: 27-33
84-85: Return value from copy_manifest is unusedEither remove the return or use it in main(). Keeping dead returns adds noise.
- print(f"File details: {result.stdout.strip()}") - return target_path + print(f"File details: {result.stdout.strip()}")
112-140: Nice validation UX; consider centralizing this checkThis block is solid. To avoid drift across tools, consider a helper in deploy.utils.kubernetes (e.g., validate_pvc_path(path: str, require_data_prefix=True) -> None) and reuse it here and in download_pvc_results.py.
I can draft the helper and wire it up across scripts if you want.
52-55: Validate src is a regular file (not just exists)Tighten the check to avoid copying directories or special files by mistake.
- if not manifest_path.exists(): + if not (manifest_path.exists() and manifest_path.is_file()): print(f"ERROR: Manifest file not found: {manifest_path}") sys.exit(1)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
benchmarks/benchmark.sh(2 hunks)benchmarks/profiler/deploy/profile_sla_job.yaml(2 hunks)benchmarks/utils/benchmark.py(1 hunks)deploy/utils/README.md(1 hunks)deploy/utils/download_pvc_results.py(3 hunks)deploy/utils/inject_manifest.py(5 hunks)deploy/utils/kubernetes.py(5 hunks)deploy/utils/manifests/pvc-access-pod.yaml(1 hunks)docs/benchmarks/benchmarking.md(2 hunks)docs/benchmarks/pre_deployment_profiling.md(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/utils/inject_manifest.py (1)
deploy/utils/kubernetes.py (1)
ensure_clean_access_pod(53-99)
deploy/utils/download_pvc_results.py (1)
deploy/utils/kubernetes.py (1)
ensure_clean_access_pod(53-99)
🪛 Ruff (0.12.2)
deploy/utils/kubernetes.py
58-58: subprocess call: check for execution of untrusted input
(S603)
59-68: Starting a process with a partial executable path
(S607)
89-90: try-except-pass detected, consider logging the exception
(S110)
89-89: Do not catch blind exception: Exception
(BLE001)
175-175: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (14)
benchmarks/utils/benchmark.py (1)
67-69: Default model switch to Qwen/Qwen3-0.6B — LGTMMatches PR intent and other files.
deploy/utils/manifests/pvc-access-pod.yaml (1)
29-30: Mount path standardized to /data — good.Consistent with new PVC convention and related tooling/docs.
docs/benchmarks/benchmarking.md (1)
36-36: Doc default model updated — looks good.deploy/utils/README.md (2)
91-96: /data path requirement callout — good addition.
102-107: Verified entry points and imports
Both deploy/utils/inject_manifest.py and download_pvc_results.py exist and import ensure_clean_access_pod from deploy.utils.kubernetes as expected.benchmarks/benchmark.sh (2)
14-15: Default model updated — good.
49-51: Help text aligns with model default; keep ISL/OSL in sync with Python.Avoid divergent defaults between bash and Python entry points.
Use the same script shared in docs comment to verify ISL/OSL across files.
docs/benchmarks/pre_deployment_profiling.md (2)
93-96: Path prefix fix to /data — good.
196-196: File-structure root standardized to /workspace/data — good.benchmarks/profiler/deploy/profile_sla_job.yaml (2)
32-35: Config/output moved under /workspace/data — LGTM.Aligns with single PVC mount; simplifies volumes.
53-54: Single PVC mountpoint at /workspace/data — LGTM.Ensures both config and results share the same claim.
deploy/utils/download_pvc_results.py (1)
36-41: Import rename to ensure_clean_access_pod looks goodAPI alignment with deploy.utils.kubernetes is correct in both import paths.
Also applies to: 46-51
deploy/utils/inject_manifest.py (2)
39-45: Import switch to ensure_clean_access_pod — LGTMMatches the new API and keeps this module consistent with others.
107-108: Dest help text — LGTMClear and aligns with the /data requirement.
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
| create: false | ||
| name: dynamo-pvc # Must be pre-created before deployment and SLA profiler must have been run | ||
| mountPoint: /workspace/profiling_results | ||
| mountPoint: /workspace/data/profiling_results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still have /workspace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, workspace is where the code is, I've just moved the PVC to mount on just /data rather than /workspace/data
| --backend=vllm | ||
| --adjustment-interval=60 | ||
| --profile-results-dir=/workspace/profiling_results | ||
| --profile-results-dir=/workspace/data/profiling_results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, do we still have /workspace?
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
…qa-updates Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Overview:
Following QA feedback, updated benchmarking.
Details:
/datamountWhere should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Refactor
Chores
Documentation