Commit fe718fd

hhzhang16, tedzhouhk, and mohammedabdulwahhab authored
feat: deploy SLA profiler to k8s (#2030)
Co-authored-by: hongkuan <hongkuanz@nvidia.com>
Co-authored-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>
1 parent ba3ac23 commit fe718fd

24 files changed, +1627 -660 lines changed
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: profile-sla-binding
+  namespace: ${NAMESPACE}
+subjects:
+  - kind: ServiceAccount
+    name: profile-sla-sa
+    namespace: ${NAMESPACE}
+roleRef:
+  kind: Role
+  name: profile-sla-role
+  apiGroup: rbac.authorization.k8s.io
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: profile-sla
+  namespace: ${NAMESPACE}
+spec:
+  template:
+    spec:
+      serviceAccountName: profile-sla-sa
+      containers:
+        - name: profile-sla
+          image: ${DOCKER_IMAGE}
+          resources:
+            requests:
+              cpu: "1"
+              memory: "2Gi"
+            limits:
+              cpu: "2"
+              memory: "4Gi"
+          env:
+            - name: HUGGING_FACE_HUB_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: HF_TOKEN
+            - name: NATS_SERVER
+              value: nats://${NAMESPACE}-nats:4222
+            - name: ETCD_ENDPOINTS
+              value: ${NAMESPACE}-etcd:2379
+          command: ["python", "/workspace/benchmarks/profiler/profile_sla.py"]
+          args:
+            - --config
+            - ${DGD_CONFIG_FILE}
+            - --output-dir
+            - /workspace/profiling_results
+            - --namespace
+            - ${NAMESPACE}
+          volumeMounts:
+            - name: output-volume
+              mountPath: /workspace/profiling_results
+      restartPolicy: Never
+      volumes:
+        - name: output-volume
+          persistentVolumeClaim:
+            claimName: profiling-pvc
+  backoffLimit: 0
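
Note on the `${NAMESPACE}`, `${DOCKER_IMAGE}`, and `${DGD_CONFIG_FILE}` placeholders in the Job above: the commit does not show how they are filled in, so they presumably need to be substituted before the manifest is applied. The following is a minimal sketch of one way to do that with Python's standard `string.Template`, not necessarily the approach the authors use. The fragment is an abbreviated copy of the Job, and the values passed in are hypothetical.

```python
from string import Template
from typing import Mapping

# Abbreviated copy of the Job manifest: only lines that carry placeholders.
JOB_FRAGMENT = """\
metadata:
  name: profile-sla
  namespace: ${NAMESPACE}
spec:
  template:
    spec:
      containers:
        - name: profile-sla
          image: ${DOCKER_IMAGE}
          env:
            - name: NATS_SERVER
              value: nats://${NAMESPACE}-nats:4222
          args:
            - --config
            - ${DGD_CONFIG_FILE}
"""


def render_manifest(text: str, values: Mapping[str, str]) -> str:
    """Replace every ${VAR} placeholder in text.

    Template.substitute raises KeyError when a placeholder has no value,
    which avoids applying a manifest that still contains a literal ${...}.
    """
    return Template(text).substitute(values)


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    print(render_manifest(JOB_FRAGMENT, {
        "NAMESPACE": "dynamo-demo",
        "DOCKER_IMAGE": "registry.example.com/dynamo:latest",
        "DGD_CONFIG_FILE": "/workspace/configs/example.yaml",
    }))
```

Using `substitute` rather than `safe_substitute` is deliberate here: a forgotten variable fails loudly instead of producing a Job in a namespace literally named `${NAMESPACE}`.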
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: profile-sla-role
+  namespace: ${NAMESPACE}
+rules:
+  # DynamoGraphDeployment custom resources - needed for create/get/delete operations
+  - apiGroups: ["nvidia.com"]
+    resources: ["dynamographdeployments"]
+    verbs: ["get", "create", "delete"]
+  # Pods - needed for listing pods by label selector and getting logs
+  - apiGroups: [""]
+    resources: ["pods"]
+    verbs: ["list"]
+  - apiGroups: [""]
+    resources: ["pods/log"]
+    verbs: ["get"]
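
Note on the Role above: to make it easier to see which requests these rules do and do not permit, here is a small sketch that models rule matching in plain Python. It is a deliberate simplification of Kubernetes RBAC, assuming exact string matching plus the `*` wildcard and ignoring `resourceNames`, subresource wildcards, and non-resource URLs. The helper name `is_allowed` is hypothetical and is not part of any Kubernetes client.

```python
# Rules transcribed from the Role manifest as plain Python data.
ROLE_RULES = [
    {
        "apiGroups": ["nvidia.com"],
        "resources": ["dynamographdeployments"],
        "verbs": ["get", "create", "delete"],
    },
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["list"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
]


def _matches(allowed, value):
    """True if value is listed explicitly or the list contains '*'."""
    return "*" in allowed or value in allowed


def is_allowed(rules, api_group, resource, verb):
    """Return True if any single rule grants verb on resource in api_group.

    RBAC rules are purely additive, so one matching rule is enough.
    """
    return any(
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["resources"], resource)
        and _matches(rule["verbs"], verb)
        for rule in rules
    )


if __name__ == "__main__":
    checks = [
        ("nvidia.com", "dynamographdeployments", "create"),
        ("nvidia.com", "dynamographdeployments", "list"),
        ("", "pods", "list"),
        ("", "pods", "get"),
        ("", "pods/log", "get"),
    ]
    for group, resource, verb in checks:
        verdict = "allowed" if is_allowed(ROLE_RULES, group, resource, verb) else "denied"
        print(f"{verb} {resource} ({group or 'core'}): {verdict}")
```

Under this model the profiler can list pods and read their logs but cannot `get` an individual pod, and it cannot `list`, `update`, or `patch` DynamoGraphDeployment resources. Code that needs any of those verbs would require a wider Role.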
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: profile-sla-sa
+  namespace: ${NAMESPACE}
+imagePullSecrets:
+  - name: nvcr-imagepullsecret
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: profiling-pvc
+  namespace: ${NAMESPACE}
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 50Gi
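
Note on how the five manifests fit together: they reference each other by name only. The Job names the ServiceAccount and the PVC, and the RoleBinding names the ServiceAccount and the Role. A typo in any of these names would only surface at deploy time. The sketch below checks those cross-references on hand-written, abbreviated Python dicts; the function `find_reference_errors` is hypothetical, and in practice the dicts would be loaded from the rendered YAML files.

```python
# Abbreviated manifests as plain dicts; only the cross-referenced fields.
JOB = {"spec": {"template": {"spec": {
    "serviceAccountName": "profile-sla-sa",
    "volumes": [{"name": "output-volume",
                 "persistentVolumeClaim": {"claimName": "profiling-pvc"}}],
}}}}
SERVICE_ACCOUNT = {"metadata": {"name": "profile-sla-sa"}}
ROLE = {"metadata": {"name": "profile-sla-role"}}
ROLE_BINDING = {
    "subjects": [{"kind": "ServiceAccount", "name": "profile-sla-sa"}],
    "roleRef": {"kind": "Role", "name": "profile-sla-role"},
}
PVC = {"metadata": {"name": "profiling-pvc"}}


def find_reference_errors(job, service_account, role, role_binding, pvc):
    """Return human-readable messages for names that do not line up."""
    errors = []
    pod_spec = job["spec"]["template"]["spec"]
    sa_name = service_account["metadata"]["name"]

    # The Job must run as the ServiceAccount that is defined.
    if pod_spec.get("serviceAccountName") != sa_name:
        errors.append("Job serviceAccountName does not match the ServiceAccount")

    # The RoleBinding must grant the Role to that same ServiceAccount.
    bound = {s["name"] for s in role_binding["subjects"]
             if s["kind"] == "ServiceAccount"}
    if sa_name not in bound:
        errors.append("RoleBinding has no subject for the ServiceAccount")
    if role_binding["roleRef"]["name"] != role["metadata"]["name"]:
        errors.append("RoleBinding roleRef does not match the Role")

    # Every PVC-backed volume in the Job must point at the defined claim.
    claims = {v["persistentVolumeClaim"]["claimName"]
              for v in pod_spec.get("volumes", [])
              if "persistentVolumeClaim" in v}
    for missing in sorted(claims - {pvc["metadata"]["name"]}):
        errors.append(f"Job references undefined PVC {missing!r}")
    return errors


if __name__ == "__main__":
    problems = find_reference_errors(JOB, SERVICE_ACCOUNT, ROLE, ROLE_BINDING, PVC)
    print(problems or "all cross-references are consistent")
```

The check does not cover `hf-token-secret` or `nvcr-imagepullsecret`, which the Job and ServiceAccount reference but which are not defined in the manifests shown here.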
