
Commit 45c8e78

Akash Jaiswal authored and mahdikhashan committed
feat: support for managing gpu enabled self runner infra (kubeflow#2762)
* feat: support for creating and managing gpu cluster
* fix: makefile bug
* add: ci action to ask maintainers to add label when changes are detected
* chore: fixed issues and cleanup
* fix: run check on change in pr
* feat: added separate workflow for gpu runner
* fix: deepspeed typo
* hotfix: add gpu label on PR without merging
* chore: merged into single action
* fix: run runner as soon as label is added
* fix: use gpu runner when label exists
* fix: revert changes and fix script permission
* fix: create gpu supported cluster
* fix: nvidia issue
* fix: gpu cluster and torchtune model
* fix: notebook path and delete cluster
* tmp fix: notebook to use k8s client
* fix: use akash sdk and fix notebook size
* fix: notebook error
* fix: delete cluster before creating one and notebook
* fix: kube config
* fix: makefile add comment
* fix: nvidia runtime
* hotfix: disable e2e go
* fix: delete cluster
* fix: delete cluster
* hotfix: temporarily use my personal token
* chore: refactored code
* hotfix: take hf token from env of self runner vm
* fix: to run notebook directly
* refactor: torchtune job
* fix: ci action
* fix: pre commit hook
* chore: rename ci action
* rem: delete cluster command from makefile
* chore: removed some steps, fixed wait timing and notebook logs according to kubeflow/sdk#83
* update: upgrade k8s to 1.34.0

---------

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
1 parent 86f05ab commit 45c8e78

File tree

4 files changed (+273, -27 lines)
New workflow file: GPU E2E Test

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
name: GPU E2E Test

on:
  pull_request:
    types: [opened, reopened, synchronize, labeled]

jobs:
  gpu-e2e-test:
    name: GPU E2E Test
    runs-on: oracle-vm-16cpu-a10gpu-240gb

    env:
      GOPATH: ${{ github.workspace }}/go
    defaults:
      run:
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["1.34.0"]

    steps:
      - name: Check GPU label
        id: check-label
        run: |
          if [[ "${{ join(github.event.pull_request.labels.*.name, ',') }}" != *"ok-to-test-gpu-runner"* ]]; then
            echo "✅ Skipping GPU E2E tests (label not present)."
            echo "skip=true" >> $GITHUB_OUTPUT
            exit 0
          else
            echo "Label found. Running GPU tests."
            echo "skip=false" >> $GITHUB_OUTPUT
          fi

      - name: Check out code
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/checkout@v4
        with:
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

      - name: Setup Go
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/setup-go@v5
        with:
          go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/go.mod

      - name: Setup Python
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/setup-python@v5
        with:
          python-version: 3.11

      - name: Install dependencies
        if: steps.check-label.outputs.skip == 'false'
        run: |
          pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5
          pip install git+https://github.com/kubeflow/sdk.git@main

      - name: Setup cluster with GPU support using nvidia/kind
        if: steps.check-label.outputs.skip == 'false'
        run: |
          make test-e2e-setup-gpu-cluster K8S_VERSION=${{ matrix.kubernetes-version }}

      - name: Run e2e test on GPU cluster
        if: steps.check-label.outputs.skip == 'false'
        run: |
          mkdir -p artifacts/notebooks
          make test-e2e-notebook NOTEBOOK_INPUT=./examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_alpaca-trainjob-yaml.ipynb TIMEOUT=900

      - name: Upload Artifacts to GitHub
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ${{ matrix.kubernetes-version }}
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/artifacts/*
          retention-days: 1

  delete-kind-cluster:
    name: Delete kind Cluster
    runs-on: oracle-vm-16cpu-a10gpu-240gb
    needs: [gpu-e2e-test]
    if: always()
    steps:
      - name: Delete any existing kind cluster
        run: |
          sudo kind delete cluster --name kind-gpu && echo "kind cluster has been deleted" || echo "kind cluster doesn't exist"
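
The `labeled` trigger plus the label check above means the job fires on every PR event but skips the expensive steps unless a maintainer has applied the `ok-to-test-gpu-runner` label. A minimal sketch of how a maintainer might kick off a run with the GitHub CLI (the PR number is a placeholder):

# Hypothetical example: label PR 1234 so the resulting `labeled` event
# starts the GPU E2E steps on the self-hosted runner.
gh pr edit 1234 --add-label "ok-to-test-gpu-runner"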

Makefile

Lines changed: 4 additions & 0 deletions
@@ -178,6 +178,10 @@ test-python-integration: ## Run Python integration test.
 test-e2e-setup-cluster: kind ## Setup Kind cluster for e2e test.
 	KIND=$(KIND) K8S_VERSION=$(K8S_VERSION) ./hack/e2e-setup-cluster.sh

+.PHONY: test-e2e-setup-gpu-cluster
+test-e2e-setup-gpu-cluster: kind ## Setup Kind cluster for GPU e2e test.
+	KIND=$(KIND) K8S_VERSION=$(K8S_VERSION) ./hack/e2e-setup-gpu-cluster.sh
+
 .PHONY: test-e2e
 test-e2e: ginkgo ## Run Go e2e test.
 	$(GINKGO) -v ./test/e2e/...
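
The new target mirrors the existing `test-e2e-setup-cluster` and only swaps in the GPU setup script. A minimal local invocation, assuming the same Kubernetes version the workflow matrix pins:

# Stand up the GPU-enabled Kind cluster the way the CI job does;
# the `kind` prerequisite target provides the ./bin/kind binary.
make test-e2e-setup-gpu-cluster K8S_VERSION=1.34.0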

examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb

Lines changed: 58 additions & 27 deletions
@@ -38,7 +38,9 @@
 "id": "288ec515",
 "metadata": {},
 "outputs": [],
-"source": "!pip install git+https://github.com/kubeflow/sdk.git@main"
+"source": [
+ "!pip install git+https://github.com/kubeflow/sdk.git@main"
+]
 },
 {
 "cell_type": "markdown",
@@ -73,6 +75,8 @@
 "source": [
 "# List all available Kubeflow Training Runtimes.\n",
 "from kubeflow.trainer import *\n",
+"from kubeflow_trainer_api import models\n",
+"import os\n",
 "\n",
 "client = TrainerClient()\n",
 "for runtime in client.list_runtimes():\n",
@@ -154,19 +158,23 @@
 ],
 "source": [
 "# Create a PersistentVolumeClaim for the TorchTune Llama 3.2 1B model.\n",
-"client.core_api.create_namespaced_persistent_volume_claim(\n",
-" namespace=\"default\",\n",
-" body=client.V1PersistentVolumeClaim(\n",
-" api_version=\"v1\",\n",
-" kind=\"PersistentVolumeClaim\",\n",
-" metadata=client.V1ObjectMeta(name=\"torchtune-llama3.2-1b\"),\n",
-" spec=client.V1PersistentVolumeClaimSpec(\n",
-" access_modes=[\"ReadWriteOnce\"],\n",
-" resources=client.V1ResourceRequirements(\n",
-" requests={\"storage\": \"20Gi\"}\n",
-" ),\n",
-" ),\n",
-" ),\n",
+"client.backend.core_api.create_namespaced_persistent_volume_claim(\n",
+" namespace=\"default\",\n",
+" body=models.IoK8sApiCoreV1PersistentVolumeClaim(\n",
+" apiVersion=\"v1\",\n",
+" kind=\"PersistentVolumeClaim\",\n",
+" metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(\n",
+" name=\"torchtune-llama3.2-1b\"\n",
+" ),\n",
+" spec=models.IoK8sApiCoreV1PersistentVolumeClaimSpec(\n",
+" accessModes=[\"ReadWriteOnce\"],\n",
+" resources=models.IoK8sApiCoreV1VolumeResourceRequirements(\n",
+" requests={\n",
+" \"storage\": models.IoK8sApimachineryPkgApiResourceQuantity(\"200Gi\")\n",
+" }\n",
+" ),\n",
+" ),\n",
+" ).to_dict(),\n",
 ")"
 ]
 },
@@ -188,31 +196,51 @@
 "outputs": [],
 "source": [
 "job_name = client.train(\n",
-" runtime=Runtime(\n",
-" name=\"torchtune-llama3.2-1b\"\n",
-" ),\n",
+" runtime=client.get_runtime(name=\"torchtune-llama3.2-1b\"),\n",
 " initializer=Initializer(\n",
 " dataset=HuggingFaceDatasetInitializer(\n",
 " storage_uri=\"hf://tatsu-lab/alpaca/data\"\n",
 " ),\n",
 " model=HuggingFaceModelInitializer(\n",
 " storage_uri=\"hf://meta-llama/Llama-3.2-1B-Instruct\",\n",
-" access_token=\"<YOUR_HF_TOKEN>\" # Replace with your Hugging Face token,\n",
+" access_token=os.environ[\"HF_TOKEN\"] # Replace with your Hugging Face token,\n",
 " )\n",
 " ),\n",
 " trainer=BuiltinTrainer(\n",
 " config=TorchTuneConfig(\n",
 " dataset_preprocess_config=TorchTuneInstructDataset(\n",
-" source=DataFormat.PARQUET,\n",
+" source=DataFormat.PARQUET, split=\"train[:1000]\"\n",
 " ),\n",
 " resources_per_node={\n",
+" \"memory\": \"200G\",\n",
 " \"gpu\": 1,\n",
-" }\n",
+" },\n",
+" \n",
 " )\n",
 " )\n",
 ")"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "ee5fbe8e",
+"metadata": {},
+"source": [
+"## Wait for running status"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "53eaa65a",
+"metadata": {},
+"outputs": [],
+"source": [
+"\n",
+"# Wait for the running status.\n",
+"client.wait_for_job_status(name=job_name, status={\"Running\"})\n"
+]
+},
 {
 "cell_type": "markdown",
 "id": "75a82b76",
@@ -247,8 +275,8 @@
 "source": [
 "from kubeflow.trainer.constants import constants\n",
 "\n",
-"log_dict = client.get_job_logs(job_name, follow=False, step=constants.DATASET_INITIALIZER)\n",
-"print(log_dict[constants.DATASET_INITIALIZER])"
+"for line in client.get_job_logs(job_name, follow=True, step=constants.DATASET_INITIALIZER):\n",
+" print(line)"
 ]
 },
 {
@@ -279,16 +307,16 @@
 }
 ],
 "source": [
-"log_dict = client.get_job_logs(job_name, follow=False, step=constants.MODEL_INITIALIZER)\n",
-"print(log_dict[constants.MODEL_INITIALIZER])"
+"for line in client.get_job_logs(job_name, follow=True, step=constants.MODEL_INITIALIZER):\n",
+" print(line)"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "b67775ea",
 "metadata": {},
 "source": [
-"### Trainer Node"
+"### Trainer Node "
 ]
 },
 {
@@ -392,8 +420,11 @@
 }
 ],
 "source": [
-"log_dict = client.get_job_logs(job_name, follow=False)\n",
-"print(log_dict[f\"{constants.NODE}-0\"])"
+"for c in client.get_job(name=job_name).steps:\n",
+" print(f\"Step: {c.name}, Status: {c.status}, Devices: {c.device} x {c.device_count}\\n\")\n",
+"\n",
+"for line in client.get_job_logs(job_name, follow=True):\n",
+" print(line)"
 ]
 },
 {
hack/e2e-setup-gpu-cluster.sh

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
#!/usr/bin/env bash

# Copyright 2025 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script sets up a Kind cluster with GPU support for Kubeflow Trainer e2e tests.

set -o errexit
set -o nounset
set -o pipefail
set -x

# Configure variables.
KIND=${KIND:-./bin/kind}
K8S_VERSION=${K8S_VERSION:-1.32.0}
GPU_OPERATOR_VERSION="v25.3.2"
KIND_NODE_VERSION=kindest/node:v${K8S_VERSION}
GPU_CLUSTER_NAME="kind-gpu"
NAMESPACE="kubeflow-system"
TIMEOUT="5m"

# Kubeflow Trainer images.
# TODO (andreyvelich): Support initializers images.
CONTROLLER_MANAGER_CI_IMAGE_NAME="ghcr.io/kubeflow/trainer/trainer-controller-manager"
CONTROLLER_MANAGER_CI_IMAGE_TAG="test"
CONTROLLER_MANAGER_CI_IMAGE="${CONTROLLER_MANAGER_CI_IMAGE_NAME}:${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
echo "Build Kubeflow Trainer images"
sudo docker build . -f cmd/trainer-controller-manager/Dockerfile -t ${CONTROLLER_MANAGER_CI_IMAGE}

# Set up Docker to use the NVIDIA runtime.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place
sudo systemctl restart docker

# Create a Kind cluster with GPU support.
nvkind cluster create --name ${GPU_CLUSTER_NAME} --image "${KIND_NODE_VERSION}"
nvkind cluster print-gpus

# Install the GPU operator to make sure we can run GPU workloads.
echo "Install NVIDIA GPU Operator"
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version="${GPU_OPERATOR_VERSION}"

# Validation steps for the GPU operator installation.
kubectl get ns gpu-operator
kubectl get ns gpu-operator --show-labels | grep pod-security.kubernetes.io/enforce=privileged
helm list -n gpu-operator
kubectl get pods -n gpu-operator -o name | while read pod; do
  kubectl wait --for=condition=Ready --timeout=300s "$pod" -n gpu-operator || echo "$pod failed to become Ready"
done
kubectl get pods -n gpu-operator
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu

# Load Kubeflow Trainer images.
echo "Load Kubeflow Trainer images"
kind load docker-image "${CONTROLLER_MANAGER_CI_IMAGE}" --name "${GPU_CLUSTER_NAME}"

# Deploy the Kubeflow Trainer control plane.
echo "Deploy Kubeflow Trainer control plane"
E2E_MANIFESTS_DIR="artifacts/e2e/manifests"
mkdir -p "${E2E_MANIFESTS_DIR}"
cat <<EOF >"${E2E_MANIFESTS_DIR}/kustomization.yaml"
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../manifests/overlays/manager
images:
- name: "${CONTROLLER_MANAGER_CI_IMAGE_NAME}"
  newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
EOF

kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}"

# We should wait until the Deployment is in Ready status.
echo "Wait for Kubeflow Trainer to be ready"
(kubectl wait deploy/kubeflow-trainer-controller-manager --for=condition=available -n ${NAMESPACE} --timeout ${TIMEOUT} &&
  kubectl wait pods --for=condition=ready -n ${NAMESPACE} --timeout ${TIMEOUT} --all) ||
  (
    echo "Failed to wait until Kubeflow Trainer is ready" &&
    kubectl get pods -n ${NAMESPACE} &&
    kubectl describe pods -n ${NAMESPACE} &&
    exit 1
  )

print_cluster_info() {
  kubectl version
  kubectl cluster-info
  kubectl get nodes
  kubectl get pods -n ${NAMESPACE}
  kubectl describe pod -n ${NAMESPACE}
}

# TODO (andreyvelich): Currently, we print manager logs due to flaky test.
echo "Deploy Kubeflow Trainer runtimes"
kubectl apply --server-side -k manifests/overlays/runtimes || (
  kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=trainer &&
  print_cluster_info &&
  exit 1
)

# TODO (andreyvelich): Discuss how we want to pre-load runtime images to the Kind cluster.
TORCH_RUNTIME_IMAGE=pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
docker pull ${TORCH_RUNTIME_IMAGE}
kind load docker-image ${TORCH_RUNTIME_IMAGE} --name ${GPU_CLUSTER_NAME}

print_cluster_info
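
The script assumes a GPU host with docker, helm, kubectl, `nvkind`, and the NVIDIA container toolkit (`nvidia-ctk`) already installed; CI reaches it through the Makefile target above. A sketch of running it directly, with cleanup mirroring the workflow's `delete-kind-cluster` job:

# Run from the repository root; KIND and K8S_VERSION override the
# script defaults (./bin/kind and 1.32.0 respectively).
KIND=./bin/kind K8S_VERSION=1.34.0 ./hack/e2e-setup-gpu-cluster.sh

# Tear down afterwards, as the workflow's cleanup job does.
sudo kind delete cluster --name kind-gpu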
