Refactor: use kubernetes client instead of kuberay apiserver #640
Conversation
#642 will fix the failing client test

```python
    },
    timeout=30,
cluster = """
apiVersion: ray.io/v1alpha1
```
We can use Jinja templates, as they come for free with Django :)

We can create a file with this k8s template:

```yaml
# templates/ray_cluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: {{ name }}
  namespace: {{ namespace }}
# ...
    volumeMounts:
      - mountPath: {{ persistent_storage }}
        name: persistent_storage
# ...
    resources:
      limits:
        cpu: {{ worker_cpu }}
        memory: {{ worker_memory }}
```

and then use it like

```python
template = get_template("api/ray_cluster.yaml")  # django.template.loader.get_template
cluster_data = yaml.safe_load(
    template.render(
        {
            "name": name,
            "namespace": namespace,
            # ...
        }
    )
)
response = raycluster_client.create(body=cluster_data, namespace=namespace)
```

It is only a suggestion, and it is up to you whether you want to do it here or not :)
and as I said before, I'm really liking this approach :) much more flexible than hitting kuberay limitations
@IceKhan13 Jinja templates look good. I'm also thinking about putting the template in a ConfigMap, so we don't need to touch the container image to change the configuration, and we can probably use the Helm template to render some parts. I'll do these improvements in a follow-up PR. Thanks!
> putting the template in a ConfigMap

+100
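A rough sketch of what the ConfigMap idea could look like (all names and paths here are illustrative, not taken from the repo): the Jinja template ships in a ConfigMap and is mounted into the gateway pod, so the RayCluster YAML can be changed without rebuilding the image.

```yaml
# Illustrative only: a ConfigMap carrying the RayCluster Jinja template
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-cluster-template
data:
  ray_cluster.yaml: |
    apiVersion: ray.io/v1alpha1
    kind: RayCluster
    metadata:
      name: {{ name }}
      namespace: {{ namespace }}
    # ...rest of the template...
---
# Fragment of the gateway Deployment spec: volumeMounts goes under the
# container, volumes under the pod spec (indentation abbreviated here)
    volumeMounts:
      - name: ray-cluster-template
        mountPath: /etc/templates
  volumes:
    - name: ray-cluster-template
      configMap:
        name: ray-cluster-template
```

The gateway would then load `/etc/templates/ray_cluster.yaml` at runtime instead of a file baked into the image.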
```python
json={
    "name": template_name,
    "namespace": namespace,
    "cpu": settings.RAY_CLUSTER_TEMPLATE_CPU,
```
can we include those settings here and we are good to merge :) Thank you, Aki!
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
FWIW, running the running-program notebook on kind, I got a 500 when trying to run the program (`job = serverless.run(program)`)... could very well be something with my setup, but noting it just in case
```diff
@@ -8,6 +8,10 @@
 import uuid
 from typing import Any, Optional

+import yaml
+from kubernetes import client, config
+from openshift.dynamic import DynamicClient
```
out of curiosity, is there a reason to prefer the openshift dynamic client to the vanilla kubernetes one?
No, I just found the docs for the openshift one first. I'll look into the differences later.
The openshift one just adds an "apply" function. We don't use it, so I'll change to the kubernetes one in the follow-up PR. Thanks for pointing it out.
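For reference, a minimal sketch of what the switch to the vanilla client might look like, using `kubernetes.dynamic.DynamicClient`. The body builder below is illustrative only; the real manifest carries the full head/worker group spec (rendered from the template discussed above).

```python
def build_raycluster_body(name: str, namespace: str) -> dict:
    # Minimal RayCluster custom-resource body; real spec fields omitted.
    return {
        "apiVersion": "ray.io/v1alpha1",
        "kind": "RayCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {},  # headGroupSpec / workerGroupSpecs would go here
    }


def create_raycluster(name: str, namespace: str):
    # Requires the `kubernetes` package and cluster access; imports are kept
    # local so the pure builder above stays importable without them.
    from kubernetes import config, dynamic
    from kubernetes.client import api_client

    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    dyn = dynamic.DynamicClient(api_client.ApiClient())
    raycluster_api = dyn.resources.get(
        api_version="ray.io/v1alpha1", kind="RayCluster"
    )
    return raycluster_api.create(
        body=build_raycluster_body(name, namespace), namespace=namespace
    )
```

Deletion is symmetric: `raycluster_api.delete(name=name, namespace=namespace)`, with no KubeRay API server in the path.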
```diff
@@ -232,7 +231,7 @@ kuberay-operator:
 # Kuberay API Server
 # ===================

-kuberayApiServerEnable: true
+kuberayApiServerEnable: false
```
is there anything still using the api server? if not, may as well rip off the bandaid and cut all this stuff out now 😄
No, nothing is using the api server. It can be taken off along with the raycluster.
:kill-it-with-fire:
I took it off.
We may also need to bump versions in infrastructure/helm/quantumserverless/Chart.lock

I'll run
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Aki, I will test it today (or tomorrow at the latest); I'm buried in meetings today

I'm still getting a 500 when trying to run a program after the last changes... it doesn't look like a Ray cluster is getting spun up.

@akihikokuroda @psschwei if we are getting 500 errors with the current code, and tests are not failing, can we do something to detect it from tests?

Maybe the error is related to the one @caleb-johnson is seeing after adding a new test for the Docker environment: https://github.com/Qiskit-Extensions/quantum-serverless/actions/runs/5195660406/jobs/9368597527?pr=648. I don't know whether that test also covers @psschwei's error, but we could add a test so our CI catches it.

@psschwei I can't reproduce the issue here. What is in the log of the gateway-scheduler pod? Thanks!
Gateway scheduler pod logs are a repeating cycle of

edit: need to `head` rather than `tail`; there's an error at the beginning:

wait... I'm not building the containers from source... 😡 (aside: we really need to set up a process for building from source and deploying to kubernetes...)
```python
    )

create_compute_template_if_not_exists()
cpu = settings.RAY_CLUSTER_TEMPLATE_CPU
```
thought for later: we should consider pulling these settings into a configmap as well, so operators can change the values without having to rebuild the images
I can put it in the values.yaml file
that would require redeploying the helm chart for any updates though...

either way, let's hold off on that change until we merge this PR
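For illustration, the values.yaml variant could look like this (the key names below are made up here, mirroring the `RAY_CLUSTER_TEMPLATE_CPU`/`RAY_CLUSTER_TEMPLATE_MEM` style settings, not the chart's actual schema):

```yaml
# Illustrative keys only -- not the chart's current schema
gateway:
  rayClusterTemplate:
    cpu: 2
    memory: "4Gi"
```

The chart would render these into the gateway's environment, which is exactly why a ConfigMap was suggested instead: operators could then change the values without a `helm upgrade`.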
Checked! Looks like everything works as expected!
In follow-up PRs we can remove the Helm kuberay and cluster charts and leave only the Ray operator. This will simplify the deployment configs.
Great work, Aki!
Thanks!

@IceKhan13 @akihikokuroda what are you using for Kubernetes? I haven't been able to run a program on anything that doesn't do a bunch of networking magic behind the scenes (i.e. Docker / Rancher Desktop)...

I'm using Rancher Desktop

I'm using Rancher Desktop. I don't do anything except port-forwarding the Jupyter notebook service.

yes, I also do port forwarding

Any changes to values.yaml?

I only deploy gateway + scheduler + keycloak

So no Jupyter... how are you testing programs?

@psschwei Is the RayCluster CRD instance created?

The gateway-scheduler should create the instance. Do you see any exceptions in the gateway-scheduler pod?

I have a local Jupyter with local programs, connecting to the gateway straight away

yes, it's there

yes, I'm still seeing this one:

Local notebooks fail too (though I'm guessing the scheduler error above has something to do with it).

hmm, the gateway pod is also having a problem with the attached volume

yeah, it seems the error I'm hitting is a permissions issue: the PVC is getting mounted as root, which prevents the gateway from writing to it

oh, yes, we have a call tomorrow for exactly this :)

I'm working on #652 and seeing gateway failures in CI but not when running the same commands locally. Not sure if it's related to the discussion here.
Summary
This PR refactors the gateway's Ray cluster creation and deletion.
Details and comments
It replaces the kuberay apiserver usage with the kubernetes dynamic client.