
CARRY: Add RHOAI manifests #3

Merged

merged 1 commit into opendatahub-io:dev on Apr 2, 2024

Conversation

@z103cb commented Mar 28, 2024

What this PR does / why we need it:
Added the manifests to allow for deployment into RHOAI (Red Hat OpenShift AI).

Which issue(s) this PR fixes
Closes: https://issues.redhat.com/browse/RHOAIENG-4787
Checklist:

  • Docs included if any changes are user facing
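
For context, the new manifests are wired together through manifests/rhoai/kustomization.yaml. The actual file contents are not reproduced in this conversation; the following is a minimal illustrative sketch of how such a kustomization might tie the listed resources and params.env together (the configMapGenerator name is a placeholder, not taken from the PR):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- kubeflow-training-roles.yaml
- binding_admin_roles.yaml
- monitor.yaml
configMapGenerator:
- name: training-operator-parameters   # placeholder name, not from the PR
  envs:
  - params.env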

@z103cb z103cb changed the base branch from master to dev March 28, 2024 14:08
Review threads (later resolved) were opened on the following files:

manifests/rhoai/binding_admin_roles.yaml
manifests/rhoai/kustomization.yaml
manifests/rhoai/params.env
manifests/rhoai/kubeflow-training-roles.yaml
manifests/rhoai/monitor.yaml
@jbusche commented Mar 28, 2024

OK - I built an OC 4.14.17 cluster with ODH 2.9 and a simple DSC, then tried installing KFTO, though I'm not quite sure exactly what I'm testing...

git clone https://github.com/z103cb/training-operator.git -b rhoai-manifests
cd training-operator/manifests/rhoai
kustomize build | oc apply -f -

Note: my IBM cluster always has trouble with pull rate limits from docker.io, so I've pushed the image to quay.io/jbusche and then configured it with this command:

oc set image deployment kfto-training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n opendatahub

I also had to override the init container image like this to get around the rate limits:

oc patch deployment kfto-training-operator -n opendatahub --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager",  "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'
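
For reference, whether the patched command landed can be verified with a standard jsonpath query (deployment name and namespace as in the patch above):

oc get deployment kfto-training-operator -n opendatahub -o jsonpath='{.spec.template.spec.containers[0].command}'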

Now I've deployed the simple.yaml to my own namespace demo-dsp and it worked:

oc get pods,pytorchjobs -n demo-dsp
NAME                          READY   STATUS      RESTARTS   AGE
pod/demo-wb-0                 2/2     Running     0          16m
pod/pytorch-simple-master-0   0/1     Completed   0          4m6s
pod/pytorch-simple-worker-0   0/1     Completed   0          6m16s

NAME                                     STATE       AGE
pytorchjob.kubeflow.org/pytorch-simple   Succeeded   6m16s
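
For reference, the simple.yaml mentioned above is the PyTorchJob example shipped with the training operator; its exact contents aren't reproduced in this thread. A minimal illustrative PyTorchJob of the same shape (image and namespace are placeholders) looks roughly like:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: demo-dsp
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch                               # container must be named "pytorch"
            image: quay.io/example/pytorch-mnist:latest # placeholder image
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: quay.io/example/pytorch-mnist:latest # placeholder image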

@z103cb z103cb marked this pull request as ready for review March 29, 2024 12:42
@z103cb (author) commented Mar 29, 2024

@jbusche if you can retest this now, it would be greatly appreciated!

@jbusche commented Mar 29, 2024

LGTM @z103cb, I deployed on a new cluster and it worked. I noticed the deployment name prefix has changed to kubeflow, so I needed to adjust my image tweaks like this:

oc set image deployment kubeflow-training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n opendatahub

oc patch deployment kubeflow-training-operator -n opendatahub --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager",  "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'

and then my KFTO operator came up:

oc get pods -n opendatahub | grep kubeflow
kubeflow-training-operator-65db576c9d-qmbt2        1/1     Running   0          14m

The deployment of a pytorchjob in a non-default namespace looked good:

oc get pytorchjobs,pods

NAME                                     STATE       AGE
pytorchjob.kubeflow.org/pytorch-simple   Succeeded   9m53s

NAME                          READY   STATUS      RESTARTS   AGE
pod/demo-wb-0                 2/2     Running     0          49m
pod/pytorch-simple-master-0   0/1     Completed   0          9m53s
pod/pytorch-simple-worker-0   0/1     Completed   0          9m53s
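
To dig further than the status output above, the completed master pod's training logs can be pulled with a standard logs command (namespace taken from the earlier run; adjust as needed):

oc logs pytorch-simple-master-0 -n demo-dsp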

Comment on lines 41 to 58
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - get
  - list
  - watch
@astefanutti commented Apr 2, 2024
I'd remove those rules for PVCs and events.

@astefanutti
/lgtm

@astefanutti astefanutti merged commit fa7b886 into opendatahub-io:dev Apr 2, 2024
astefanutti pushed commits that referenced this pull request Apr 5, 2024
KPostOffice pushed a commit that referenced this pull request May 23, 2024
KPostOffice referenced this pull request in red-hat-data-services/training-operator May 23, 2024