Commit de28605

feat: have all pvc-related stuff under /data
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
1 parent 84aeddb commit de28605

7 files changed, +144 -56 lines changed


benchmarks/profiler/deploy/profile_sla_job.yaml

Lines changed: 3 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -29,9 +29,9 @@ spec:
2929
command: ["python", "-m", "benchmarks.profiler.profile_sla"]
3030
args:
3131
- --config
32-
- /workspace/configs/disagg.yaml
32+
- /workspace/data/configs/disagg.yaml
3333
- --output-dir
34-
- /workspace/profiling_results
34+
- /workspace/data/profiling_results
3535
- --namespace
3636
- ${NAMESPACE}
3737
- --backend
@@ -50,15 +50,10 @@ spec:
5050
- "20"
5151
volumeMounts:
5252
- name: output-volume
53-
mountPath: /workspace/profiling_results
54-
- name: configs
55-
mountPath: /workspace/configs
53+
mountPath: /workspace/data
5654
restartPolicy: Never
5755
volumes:
5856
- name: output-volume
5957
persistentVolumeClaim:
6058
claimName: dynamo-pvc
61-
- name: configs
62-
persistentVolumeClaim:
63-
claimName: dynamo-pvc
6459
backoffLimit: 0
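
After this change the job mounts the claim at `/workspace/data`, while the access pod (see `pvc-access-pod.yaml` in this commit) mounts its volume at `/data`. Assuming both are backed by the same claim (`dynamo-pvc`), a file injected at `/data/configs/disagg.yaml` is what the job reads as `/workspace/data/configs/disagg.yaml`. A minimal sketch of that mapping follows; `to_job_path` and the two constants are hypothetical names used for illustration and are not part of the commit.

```python
from pathlib import PurePosixPath

# Assumption: both mounts point at the root of the same PVC.
ACCESS_POD_MOUNT = PurePosixPath("/data")       # mountPath in pvc-access-pod.yaml
JOB_MOUNT = PurePosixPath("/workspace/data")    # mountPath in profile_sla_job.yaml


def to_job_path(access_pod_path: str) -> str:
    """Translate a path as seen in the access pod into the path the job sees.

    Raises ValueError if the path is not under the access pod mount.
    """
    relative = PurePosixPath(access_pod_path).relative_to(ACCESS_POD_MOUNT)
    return str(JOB_MOUNT / relative)


print(to_job_path("/data/configs/disagg.yaml"))
# prints: /workspace/data/configs/disagg.yaml
```

The same translation applies to the `--output-dir` argument: results written by the job to `/workspace/data/profiling_results` would be downloaded with `--folder /data/profiling_results`.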

deploy/utils/README.md

Lines changed: 17 additions & 4 deletions
@@ -88,23 +88,36 @@ These scripts interact with the Persistent Volume Claim (PVC) that stores config
8888

8989
```bash
9090
# The profiling job reads your DGD config from the PVC
91-
python3 deploy/utils/inject_manifest.py \
91+
# IMPORTANT: All paths must start with /data/ for security reasons
92+
python3 -m deploy.utils.inject_manifest \
9293
--namespace $NAMESPACE \
9394
--src ./my-disagg.yaml \
94-
--dest /configs/disagg.yaml
95+
--dest /data/configs/disagg.yaml
9596
```
9697

9798
**Download benchmark/profiling results:**
9899

99100
```bash
100101
# After benchmarking or profiling completes, download results
101-
python3 deploy/utils/download_pvc_results.py \
102+
python3 -m deploy.utils.download_pvc_results \
102103
--namespace $NAMESPACE \
103104
--output-dir ./pvc_files \
104-
--folder /results \
105+
--folder /data/results \
105106
--no-config # optional: skip *.yaml/*.yml in the download
106107
```
107108

109+
#### Path Requirements
110+
111+
**Important**: The PVC is mounted at `/data` in the access pod for security reasons. All destination paths must start with `/data/`.
112+
113+
**Common path patterns:**
114+
- `/data/configs/` - Configuration files (DGD manifests)
115+
- `/data/results/` - Benchmark results
116+
- `/data/profiling_results/` - Profiling data
117+
- `/data/benchmarking/` - Benchmarking artifacts
118+
119+
**User-friendly error messages**: If you forget the `/data/` prefix, the script will show a helpful error message with the correct path and example commands.
120+
108121
#### Next Steps
109122

110123
For complete benchmarking workflows:

deploy/utils/download_pvc_results.py

Lines changed: 3 additions & 3 deletions
@@ -36,7 +36,7 @@
3636
from deploy.utils.kubernetes import (
3737
check_kubectl_access,
3838
cleanup_access_pod,
39-
deploy_access_pod,
39+
ensure_clean_access_pod,
4040
run_command,
4141
)
4242
except ModuleNotFoundError:
@@ -46,7 +46,7 @@
4646
from deploy.utils.kubernetes import (
4747
check_kubectl_access,
4848
cleanup_access_pod,
49-
deploy_access_pod,
49+
ensure_clean_access_pod,
5050
run_command,
5151
)
5252

@@ -194,7 +194,7 @@ def main():
194194
check_kubectl_access(args.namespace)
195195

196196
# Deploy access pod
197-
pod_name = deploy_access_pod(args.namespace)
197+
pod_name = ensure_clean_access_pod(args.namespace)
198198
try:
199199
# List and download files
200200
files = list_pvc_contents(args.namespace, pod_name, args.folder, args.no_config)

deploy/utils/inject_manifest.py

Lines changed: 37 additions & 10 deletions
@@ -21,12 +21,15 @@
2121
Copies any Kubernetes manifest file into the PVC for later use by jobs.
2222
Both the source manifest path and destination path in the PVC are required.
2323
24+
IMPORTANT: The PVC is mounted at /data in the access pod for security reasons.
25+
All destination paths must start with '/data/'.
26+
2427
Usage:
2528
python3 inject_manifest.py --namespace <namespace> --src <local_manifest.yaml> --dest <absolute_path_in_pvc>
2629
2730
Examples:
28-
python3 inject_manifest.py --namespace <ns> --src ./my-disagg.yaml --dest /configs/disagg.yaml
29-
python3 inject_manifest.py --namespace <ns> --src ./my-agg.yaml --dest /configs/agg.yaml
31+
python3 inject_manifest.py --namespace <ns> --src ./disagg.yaml --dest /data/configs/disagg.yaml
32+
python3 inject_manifest.py --namespace <ns> --src ./my-data.yaml --dest /data/custom/path/data.yaml
3033
"""
3134

3235
import argparse
@@ -37,7 +40,7 @@
3740
PVC_ACCESS_POD_NAME,
3841
check_kubectl_access,
3942
cleanup_access_pod,
40-
deploy_access_pod,
43+
ensure_clean_access_pod,
4144
run_command,
4245
)
4346

@@ -78,6 +81,7 @@ def copy_manifest(namespace: str, manifest_path: Path, target_path: str) -> None
7881

7982
print("✓ Manifest successfully copied to PVC")
8083
print(f"File details: {result.stdout.strip()}")
84+
return target_path
8185

8286

8387
def main():
@@ -100,16 +104,39 @@ def main():
100104
parser.add_argument(
101105
"--dest",
102106
required=True,
103-
help="Absolute target path in PVC (e.g., /profiling_results/agg.yaml)",
107+
help="Absolute target path in PVC (must start with /data/, e.g., /data/configs/agg.yaml)",
104108
)
105109

106110
args = parser.parse_args()
107111

108-
# Validate target_path to prevent directory traversal
109-
if not args.dest.startswith("/"):
110-
print(
111-
"ERROR: Target path must be an absolute path inside the PVC (start with '/')."
112-
)
112+
# Validate target_path to prevent directory traversal and ensure it's within PVC
113+
if not args.dest.startswith("/data/"):
114+
print("=" * 60)
115+
print("❌ ERROR: Invalid target path")
116+
print("=" * 60)
117+
print("The PVC is mounted at /data in the access pod.")
118+
print("All paths must start with '/data/' for security reasons.")
119+
print("")
120+
print("💡 QUICK FIX:")
121+
if args.dest.startswith("/"):
122+
# Suggest the fix
123+
suggested_path = f"/data{args.dest}"
124+
print(f" Change: {args.dest}")
125+
print(f" To: {suggested_path}")
126+
print("")
127+
print("📝 Example commands:")
128+
print(" python3 -m deploy.utils.inject_manifest \\")
129+
print(f" --namespace {args.namespace} \\")
130+
print(f" --src {args.src} \\")
131+
print(f" --dest {suggested_path}")
132+
else:
133+
print(f" Use: /data/{args.dest.lstrip('/')}")
134+
print("")
135+
print("🔍 Common patterns:")
136+
print(" /configs/file.yaml → /data/configs/file.yaml")
137+
print(" /results/data.yaml → /data/results/data.yaml")
138+
print(" /profiling_results/... → /data/profiling_results/...")
139+
print("=" * 60)
113140
sys.exit(1)
114141

115142
if ".." in args.dest:
@@ -123,7 +150,7 @@ def main():
123150
check_kubectl_access(args.namespace)
124151

125152
# Deploy access pod
126-
deploy_access_pod(args.namespace)
153+
ensure_clean_access_pod(args.namespace)
127154
try:
128155
# Copy manifest
129156
copy_manifest(args.namespace, args.src, args.dest)
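
The destination check added to `main()` above can be read as a small pure function: reject anything that does not start with `/data/`, suggest the prefixed path, then reject `..`. The sketch below mirrors that order under stated assumptions; `check_dest` and its return strings are hypothetical and are not the messages printed by the script.

```python
from typing import Optional


def check_dest(dest: str) -> Optional[str]:
    """Return None if dest is acceptable, otherwise a short hint for the user.

    Hypothetical helper: it follows the order of the checks in the diff
    (prefix first, then '..'), but the hint wording is illustrative only.
    """
    if not dest.startswith("/data/"):
        # Absolute paths get '/data' prepended; relative ones get '/data/'.
        suggestion = f"/data{dest}" if dest.startswith("/") else f"/data/{dest}"
        return f"use {suggestion}"
    if ".." in dest:
        return "path must not contain '..'"
    return None


for candidate in ("/data/configs/agg.yaml", "/configs/agg.yaml", "/data/../etc/passwd"):
    print(candidate, "->", check_dest(candidate))
```

Keeping the check free of `print` and `sys.exit` would make it straightforward to unit test, at the cost of moving the formatted error banner into the caller.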

deploy/utils/kubernetes.py

Lines changed: 74 additions & 23 deletions
@@ -22,7 +22,7 @@
2222

2323

2424
def run_command(
25-
cmd: List[str], capture_output: bool = True
25+
cmd: List[str], capture_output: bool = True, exit_on_error: bool = True
2626
) -> subprocess.CompletedProcess:
2727
"""Run a command and handle errors."""
2828
try:
@@ -37,7 +37,10 @@ def run_command(
3737
print(f"STDOUT: {e.stdout}")
3838
if e.stderr:
3939
print(f"STDERR: {e.stderr}")
40-
sys.exit(1)
40+
if exit_on_error:
41+
sys.exit(1)
42+
else:
43+
raise
4144

4245

4346
def check_kubectl_access(namespace: str) -> None:
@@ -47,6 +50,55 @@ def check_kubectl_access(namespace: str) -> None:
4750
print("✓ kubectl access confirmed")
4851

4952

53+
def ensure_clean_access_pod(namespace: str) -> str:
54+
"""Ensure a clean PVC access pod deployment by deleting any existing pod first."""
55+
56+
# Check if pod exists and delete it if it does
57+
try:
58+
result = subprocess.run(
59+
[
60+
"kubectl",
61+
"get",
62+
"pod",
63+
PVC_ACCESS_POD_NAME,
64+
"-n",
65+
namespace,
66+
"-o",
67+
"jsonpath={.metadata.name}",
68+
],
69+
capture_output=True,
70+
text=True,
71+
check=False,
72+
)
73+
if result.returncode == 0 and result.stdout.strip() == PVC_ACCESS_POD_NAME:
74+
print(f"Found existing access pod '{PVC_ACCESS_POD_NAME}', deleting it...")
75+
run_command(
76+
[
77+
"kubectl",
78+
"delete",
79+
"pod",
80+
PVC_ACCESS_POD_NAME,
81+
"-n",
82+
namespace,
83+
"--ignore-not-found",
84+
],
85+
capture_output=False,
86+
exit_on_error=False,
87+
)
88+
print("✓ Existing access pod deleted")
89+
except Exception:
90+
pass # Pod doesn't exist, which is fine
91+
92+
try:
93+
return deploy_access_pod(namespace)
94+
except Exception as e:
95+
print(f"Deployment failed: {e}")
96+
print(
97+
"Pod left running for debugging. Use 'kubectl delete pod pvc-access-pod -n <namespace>' to clean up manually."
98+
)
99+
raise
100+
101+
50102
def deploy_access_pod(namespace: str) -> str:
51103
"""Deploy the PVC access pod and return pod name."""
52104

@@ -67,25 +119,19 @@ def deploy_access_pod(namespace: str) -> str:
67119
text=True,
68120
check=False,
69121
)
70-
71122
if result.returncode == 0 and result.stdout.strip() == "Running":
72123
print(f"✓ Access pod '{PVC_ACCESS_POD_NAME}' already running")
73124
return PVC_ACCESS_POD_NAME
74125
except Exception:
75-
# Pod doesn't exist or isn't running
76-
pass
126+
pass # Pod doesn't exist or isn't running
77127

78128
print(f"Deploying access pod '{PVC_ACCESS_POD_NAME}' in namespace '{namespace}'...")
79129

80-
# Get the directory where this script is located
81-
script_dir = Path(__file__).parent
82-
pod_yaml_path = script_dir / "manifests" / "pvc-access-pod.yaml"
83-
130+
pod_yaml_path = Path(__file__).parent / "manifests" / "pvc-access-pod.yaml"
84131
if not pod_yaml_path.exists():
85132
print(f"ERROR: Pod YAML not found at {pod_yaml_path}")
86133
sys.exit(1)
87134

88-
# Deploy the pod
89135
run_command(
90136
["kubectl", "apply", "-f", str(pod_yaml_path), "-n", namespace],
91137
capture_output=False,
@@ -103,23 +149,28 @@ def deploy_access_pod(namespace: str) -> str:
103149
"--timeout=60s",
104150
],
105151
capture_output=False,
152+
exit_on_error=False,
106153
)
107154
print("✓ Access pod is ready")
108155
return PVC_ACCESS_POD_NAME
109156

110157

111158
def cleanup_access_pod(namespace: str) -> None:
112159
print("Cleaning up access pod...")
113-
run_command(
114-
[
115-
"kubectl",
116-
"delete",
117-
"pod",
118-
PVC_ACCESS_POD_NAME,
119-
"-n",
120-
namespace,
121-
"--ignore-not-found",
122-
],
123-
capture_output=False,
124-
)
125-
print("✓ Access pod deleted")
160+
try:
161+
run_command(
162+
[
163+
"kubectl",
164+
"delete",
165+
"pod",
166+
PVC_ACCESS_POD_NAME,
167+
"-n",
168+
namespace,
169+
"--ignore-not-found",
170+
],
171+
capture_output=False,
172+
exit_on_error=False,
173+
)
174+
print("✓ Access pod deleted")
175+
except Exception as e:
176+
print(f"Warning: Failed to clean up access pod: {e}")
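
The new `exit_on_error` flag lets callers such as `cleanup_access_pod` treat a failed command as a warning instead of terminating the process. The sketch below illustrates the pattern with a simplified `run_command`; the body of the `try` block is an assumption (the diff does not show it), and `best_effort` is a hypothetical wrapper that stands in for the cleanup logic.

```python
import subprocess
import sys
from typing import List


def run_command(
    cmd: List[str], capture_output: bool = True, exit_on_error: bool = True
) -> subprocess.CompletedProcess:
    """Run a command; on failure either exit the process or re-raise."""
    try:
        # Assumption: the real function also runs with check=True.
        return subprocess.run(cmd, capture_output=capture_output, text=True, check=True)
    except subprocess.CalledProcessError:
        print(f"ERROR: command failed: {' '.join(cmd)}")
        if exit_on_error:
            sys.exit(1)
        raise  # let the caller decide how to handle the failure


def best_effort(cmd: List[str]) -> bool:
    """Hypothetical caller: downgrade a failure to a warning."""
    try:
        run_command(cmd, capture_output=False, exit_on_error=False)
        return True
    except subprocess.CalledProcessError as e:
        print(f"Warning: command failed with exit code {e.returncode}")
        return False


# A command that always fails, built from the current interpreter.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
print(best_effort(failing))
```

With the default `exit_on_error=True`, the same failing command would raise `SystemExit` instead, which matches the behavior before this commit.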

deploy/utils/manifests/pvc-access-pod.yaml

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ spec:
2626
- ALL
2727
volumeMounts:
2828
- name: profiling-storage
29-
mountPath: /profiling_results
29+
mountPath: /data
3030
resources:
3131
requests:
3232
memory: "128Mi"

docs/benchmarks/pre_deployment_profiling.md

Lines changed: 9 additions & 7 deletions
@@ -90,7 +90,7 @@ Use the injector utility to place your DGD manifest into the PVC. The profiling
9090
python3 deploy/utils/inject_manifest.py \
9191
--namespace $NAMESPACE \
9292
--src components/backends/vllm/deploy/disagg.yaml \
93-
--dest /configs/disagg.yaml
93+
--dest /data/configs/disagg.yaml
9494

9595
# Set the docker image for the profiling job; any docker image that contains your script.
9696
export DOCKER_IMAGE=nvcr.io/nvidia/dynamo:latest-vllm
@@ -112,15 +112,17 @@ Use the default pre-built image and inject custom configurations via PVC:
112112
2. **Inject your custom disagg configuration:**
113113
```bash
114114
# Use default disagg.yaml config
115-
python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /configs/disagg.yaml
115+
python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml
116116

117117
# Or use a custom disagg config file
118-
python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /configs/disagg.yaml
118+
python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
119119

120120
# Or specify a custom target path in the PVC
121-
python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /profiling_results/my-disagg.yaml
121+
python3 deploy/utils/inject_manifest.py --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/profiling_results/my-disagg.yaml
122122
```
123123

124+
> **Note**: All paths must start with `/data/` for security reasons. If you forget this prefix, the script will show a helpful error message with the correct path.
125+
124126
3. **Set the config path for the profiling job:**
125127
```bash
126128
export DGD_CONFIG_FILE=/workspace/profiling_results/disagg.yaml # or your custom path
@@ -176,10 +178,10 @@ To download the results:
176178

177179
```bash
178180
# Download to directory
179-
python3 deploy/utils/download_pvc_results.py --namespace $NAMESPACE --output-dir ./results --folder /profiling_results
181+
python3 deploy/utils/download_pvc_results.py --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results
180182

181183
# Download without any of the auto-created config.yaml files used in profiling
182-
python3 deploy/utils/download_pvc_results.py --namespace $NAMESPACE --output-dir ./results --folder /profiling_results --no-config
184+
python3 deploy/utils/download_pvc_results.py --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results --no-config
183185
```
184186

185187
The script will:
@@ -191,7 +193,7 @@ The script will:
191193

192194
The profiling results directory contains the following structure:
193195
```
194-
/workspace/profiling_results/
196+
/workspace/data/profiling_results/
195197
├── prefill_performance.png # Main prefill performance plot
196198
├── decode_performance.png # Main decode performance plot
197199
├── prefill_tp1/ # Individual TP profiling directories
