
CARRY: Add RHOAI manifests #3

Merged

merged 1 commit into opendatahub-io:dev on Apr 2, 2024

Conversation

@z103cb commented Mar 28, 2024

What this PR does / why we need it:
Added the manifests to allow for deployment into RHOAI (Red Hat OpenShift AI).

Which issue(s) this PR fixes
Closes: https://issues.redhat.com/browse/RHOAIENG-4787
Checklist:

  • Docs included if any changes are user facing
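
For context, the new manifests are wired together through manifests/rhoai/kustomization.yaml. The actual file contents are not reproduced in this conversation; the following is a minimal illustrative sketch of how such a kustomization might tie the listed resources and params.env together (the configMapGenerator name is a placeholder, not taken from the PR):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- kubeflow-training-roles.yaml
- binding_admin_roles.yaml
- monitor.yaml
configMapGenerator:
- name: training-operator-parameters   # placeholder name, not from the PR
  envs:
  - params.env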

@z103cb z103cb changed the base branch from master to dev March 28, 2024 14:08
Review threads (later resolved) were opened on the following files:

manifests/rhoai/binding_admin_roles.yaml
manifests/rhoai/kustomization.yaml
manifests/rhoai/params.env
manifests/rhoai/kubeflow-training-roles.yaml
manifests/rhoai/monitor.yaml
@jbusche commented Mar 28, 2024

OK - I built an OC 4.14.17 cluster with ODH 2.9 and a simple DSC, then tried installing KFTO, though I'm not quite sure exactly what I'm testing...

git clone https://github.com/z103cb/training-operator.git -b rhoai-manifests
cd training-operator/manifests/rhoai
kustomize build | oc apply -f -

Note: my IBM cluster always has trouble with pull rate limits from docker.io, so I've pushed the image to quay.io/jbusche and then configured it with this command:

oc set image deployment kfto-training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n opendatahub

I also had to override the init container image like this to get around the rate limits:

oc patch deployment kfto-training-operator -n opendatahub --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager",  "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'
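
For reference, whether the patched command landed can be verified with a standard jsonpath query (deployment name and namespace as in the patch above):

oc get deployment kfto-training-operator -n opendatahub -o jsonpath='{.spec.template.spec.containers[0].command}'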

Now I've deployed the simple.yaml to my own namespace demo-dsp and it worked:

oc get pods,pytorchjobs -n demo-dsp
NAME                          READY   STATUS      RESTARTS   AGE
pod/demo-wb-0                 2/2     Running     0          16m
pod/pytorch-simple-master-0   0/1     Completed   0          4m6s
pod/pytorch-simple-worker-0   0/1     Completed   0          6m16s

NAME                                     STATE       AGE
pytorchjob.kubeflow.org/pytorch-simple   Succeeded   6m16s
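
For reference, the simple.yaml mentioned above is the PyTorchJob example shipped with the training operator; its exact contents aren't reproduced in this thread. A minimal illustrative PyTorchJob of the same shape (image and namespace are placeholders) looks roughly like:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: demo-dsp
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch                               # container must be named "pytorch"
            image: quay.io/example/pytorch-mnist:latest # placeholder image
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: quay.io/example/pytorch-mnist:latest # placeholder image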

@z103cb z103cb marked this pull request as ready for review March 29, 2024 12:42
@z103cb (author) commented Mar 29, 2024

@jbusche if you can retest this now, it would be greatly appreciated!

@jbusche commented Mar 29, 2024

LGTM @z103cb, I deployed on a new cluster and it worked. I noticed the deployment name prefix has changed to kubeflow, so I needed to adjust my image tweaks like this:

oc set image deployment kubeflow-training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n opendatahub

oc patch deployment kubeflow-training-operator -n opendatahub --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager",  "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'

and then my KFTO operator came up:

oc get pods -n opendatahub | grep kubeflow
kubeflow-training-operator-65db576c9d-qmbt2        1/1     Running   0          14m

The deployment of a pytorchjob in a non-default namespace looked good:

oc get pytorchjobs,pods

NAME                                     STATE       AGE
pytorchjob.kubeflow.org/pytorch-simple   Succeeded   9m53s

NAME                          READY   STATUS      RESTARTS   AGE
pod/demo-wb-0                 2/2     Running     0          49m
pod/pytorch-simple-master-0   0/1     Completed   0          9m53s
pod/pytorch-simple-worker-0   0/1     Completed   0          9m53s
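
To dig further than the status output above, the completed master pod's training logs can be pulled with a standard logs command (namespace taken from the earlier run; adjust as needed):

oc logs pytorch-simple-master-0 -n demo-dsp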

Comment on lines 41 to 58
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - get
  - list
  - watch
@astefanutti commented Apr 2, 2024
I'd remove those rules for PVCs and events.

@astefanutti
/lgtm

@astefanutti astefanutti merged commit fa7b886 into opendatahub-io:dev Apr 2, 2024
astefanutti pushed commits that referenced this pull request Apr 5, 2024
KPostOffice pushed a commit that referenced this pull request May 23, 2024
KPostOffice referenced this pull request in red-hat-data-services/training-operator May 23, 2024