
Refactor: use kubernetes client instead of kuberay apiserver #640

Merged
11 commits merged into Qiskit:main on Jun 7, 2023

Conversation

akihikokuroda
Collaborator

Summary

This PR refactors the gateway Ray cluster creation and deletion.

Details and comments

This replaces the use of the kuberay apiserver with the kubernetes dynamic client.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@psschwei
Collaborator

psschwei commented Jun 5, 2023

#642 will fix the failing client test

},
timeout=30,
cluster = """
apiVersion: ray.io/v1alpha1
Member

We can use Jinja templates as they come for free with django :)

We can create a file with this k8s template

# templates/ray_cluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: {{ name }}
  namespace: {{ namespace }}
...
              volumeMounts:
              - mountPath: {{ persistent_storage }}
                name: persistent_storage
...
              resources:
                limits:
                  cpu: {{ worker_cpu }}
                  memory: {{ worker_memory }}

and then use it like

import yaml
from django.template.loader import get_template

template = get_template("api/ray_cluster.yaml")
cluster_data = yaml.safe_load(
    template.render({
        "name": name,
        "namespace": namespace,
        ...
    })
)
response = raycluster_client.create(body=cluster_data, namespace=namespace)

It is only a suggestion, but it is up to you whether you want to do it here or not :)

Member

and as I said before, I'm really liking this approach :) much more flexible than hitting kuberay limitations

Collaborator Author

@IceKhan13 Jinja templates look good. I'm also thinking about putting the template in a ConfigMap, so we don't need to touch the container image to change the configuration, and we can probably use the helm template to render some parts. I'll do these improvements in a follow-up PR. Thanks!

Collaborator

@psschwei psschwei Jun 6, 2023

putting the template in a ConfigMap

+100
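
As an illustration of the ConfigMap idea, here is a minimal sketch that reads a RayCluster template mounted into the gateway pod and renders it with Jinja2. The mount path, the example values, and the raycluster_client variable are assumptions for illustration, not code from this PR:

import yaml
from jinja2 import Template

# Hypothetical mount point of a ConfigMap that holds the RayCluster template.
TEMPLATE_PATH = "/etc/gateway/templates/ray_cluster.yaml"

name, namespace = "example-cluster", "default"  # example values only

with open(TEMPLATE_PATH, encoding="utf-8") as template_file:
    template = Template(template_file.read())

cluster_data = yaml.safe_load(template.render(name=name, namespace=namespace))
# The create call would then look the same as elsewhere in this PR:
# response = raycluster_client.create(body=cluster_data, namespace=namespace)

Because the template lives in a ConfigMap, operators could edit it (or manage it through Helm) without rebuilding the gateway image.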

json={
"name": template_name,
"namespace": namespace,
"cpu": settings.RAY_CLUSTER_TEMPLATE_CPU,
Member

can we include those settings here, and then we are good to merge :) Thank you, Aki!

akihikokuroda and others added 3 commits June 5, 2023 20:12
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Collaborator

@psschwei psschwei left a comment

FWIW, running the "running program" notebook on kind, I got a 500 when trying to run the program (job = serverless.run(program)) ... could very well be something with my setup, but noting it just in case

@@ -8,6 +8,10 @@
import uuid
from typing import Any, Optional

import yaml
from kubernetes import client, config
from openshift.dynamic import DynamicClient
Collaborator

out of curiosity, is there a reason to prefer the openshift dynamic client to the vanilla kubernetes one?

Collaborator Author

No, I found the docs using the openshift one first. I'll look into the differences later.

Collaborator Author

The openshift one just adds an "apply" function. We don't use it, so I'll change to the kubernetes one in a follow-up PR. Thanks for pointing it out.
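
For reference, a minimal sketch of the same pattern with the vanilla kubernetes dynamic client (kubernetes.dynamic.DynamicClient); the config-loading fallback and the cluster_data variable are illustrative, not lifted from this PR:

from kubernetes import config
from kubernetes.client import api_client
from kubernetes.dynamic import DynamicClient

# Use the in-cluster service account when running inside a pod,
# otherwise fall back to the local kubeconfig.
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

k8s_client = DynamicClient(api_client.ApiClient())
raycluster_client = k8s_client.resources.get(
    api_version="ray.io/v1alpha1", kind="RayCluster"
)

# cluster_data would be the dict produced by yaml.safe_load on the rendered template:
# response = raycluster_client.create(body=cluster_data, namespace=namespace)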

@@ -232,7 +231,7 @@ kuberay-operator:
# Kuberay API Server
# ===================

kuberayApiServerEnable: true
kuberayApiServerEnable: false
Collaborator

is there anything still using the api server? if not, may as well rip off the bandaid and cut all this stuff out now 😄

Collaborator Author

No, nothing is using the api server. It can be taken off along with the raycluster.

Collaborator

:kill-it-with-fire:

Collaborator Author

I took it off.

@psschwei
Collaborator

psschwei commented Jun 6, 2023

We may also need to bump versions in infrastructure/helm/quantumserverless/Chart.lock

@akihikokuroda
Collaborator Author

akihikokuroda commented Jun 6, 2023

I'll run helm dependency update and push Chart.lock. I also took "ray cluster" and "ray apiserver" out of Chart.yaml before that.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@IceKhan13
Member

Aki, I will test it today (or tomorrow at the latest); I'm buried in meetings today.

@psschwei
Collaborator

psschwei commented Jun 6, 2023

FWIW, running the "running program" notebook on kind, I got a 500 when trying to run the program (job = serverless.run(program)) ... could very well be something with my setup, but noting it just in case

I'm still getting a 500 when trying to run a program after the last changes... doesn't look like a Ray cluster is getting spun up.

@pacomf
Member

pacomf commented Jun 7, 2023

@akihikokuroda @psschwei if we are getting 500 errors with the current code... and tests are not failing... can we do something to detect it from tests?

@pacomf
Member

pacomf commented Jun 7, 2023

@akihikokuroda @psschwei if we are getting 500 errors with the current code... and tests are not failing... can we do something to detect it from tests?

maybe the error is related to the @caleb-johnson error that is appearing after adding a new test for the docker environment: https://github.com/Qiskit-Extensions/quantum-serverless/actions/runs/5195660406/jobs/9368597527?pr=648. I don't know if that test also catches the @psschwei error, but it could give us a test in our CI that surfaces the error.

@akihikokuroda
Collaborator Author

@psschwei I can't see the issue here. What is in the logs of the gateway-scheduler pod? Thanks!

@psschwei
Collaborator

psschwei commented Jun 7, 2023

Gateway scheduler pod logs are a repeating cycle of

Updated 0 jobs.
Using selector: EpollSelector
Deallocated 0 compute resources.
Using selector: EpollSelector
4 free cluster slots.
0 are scheduled for execution.

edit: need to head rather than tail, there's an error at the beginning:

Using selector: EpollSelector
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UndefinedTable: relation "api_job" does not exist
LINE 1: ...d", "api_job"."ray_job_id", "api_job"."logs" FROM "api_job" ...
                                                             ^


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/app/manage.py", line 22, in <module>
    main()
  File "/usr/src/app/manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/usr/src/app/api/management/commands/update_jobs_statuses.py", line 17, in handle
    for job in Job.objects.filter(status__in=Job.RUNNING_STATES):
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 398, in __iter__
    self._fetch_all()
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1881, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 91, in __iter__
    results = compiler.execute_sql(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1562, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: relation "api_job" does not exist
LINE 1: ...d", "api_job"."ray_job_id", "api_job"."logs" FROM "api_job" ...

@psschwei
Collaborator

psschwei commented Jun 7, 2023

wait... I'm not building the containers from source... 😡

(aside: we really need to set up a process for building from source and deploying to kubernetes...)

)

create_compute_template_if_not_exists()
cpu = settings.RAY_CLUSTER_TEMPLATE_CPU
Collaborator

thought for later: we should consider pulling these settings into a configmap as well, so operators can change the values without having to rebuild the images

Collaborator Author

I can put it in the values.yaml file

Collaborator

that would require redeploying the helm chart for any updates though...

either way, let's hold off on that change until we merge this PR
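
One lightweight way to do that later would be to have the Django settings read these values from environment variables, which a ConfigMap (or values.yaml) can then populate without an image rebuild. A sketch, where the env-var indirection and the default value are assumptions for illustration only:

import os

# RAY_CLUSTER_TEMPLATE_CPU appears in this PR; sourcing it from the
# environment and the default of "2" are illustrative only.
RAY_CLUSTER_TEMPLATE_CPU = int(os.environ.get("RAY_CLUSTER_TEMPLATE_CPU", "2"))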

Member

@IceKhan13 IceKhan13 left a comment

Checked! Looks like everything works as expected!

In follow-up PRs we can remove the helm kuberay and cluster charts and leave only the ray operator. This will simplify the deployment configs.

Great work, Aki!

@akihikokuroda akihikokuroda merged commit 8d93196 into Qiskit:main Jun 7, 2023
@akihikokuroda
Collaborator Author

Thanks!

@psschwei
Collaborator

psschwei commented Jun 7, 2023

@IceKhan13 @akihikokuroda what are you using for Kubernetes? I haven't been able to run a program on anything that doesn't do a bunch of networking magic behind the scenes (i.e. Docker / Rancher Desktop)...

@IceKhan13
Member

I'm using Rancher Desktop

@akihikokuroda
Collaborator Author

I'm using Rancher Desktop. I don't do anything except port-forwarding the Jupyter notebook service.

@IceKhan13
Member

yes, I also do port forwarding

@psschwei
Collaborator

psschwei commented Jun 7, 2023

Any changes to values.yaml?

@IceKhan13
Member

I only deploy gateway + scheduler + keycloak

@psschwei
Collaborator

psschwei commented Jun 7, 2023

So no jupyter ... how are you testing programs?
(I'm running notebooks in jupyter... I wonder if that's why I'm seeing issues)

@akihikokuroda
Collaborator Author

@psschwei Is the raycluster CRD instance created?

@akihikokuroda
Collaborator Author

The gateway-scheduler should create the instance. Do you see any exceptions in the gateway-scheduler pod?

@IceKhan13
Member

So no jupyter ... how are you testing programs?

I have local jupyter with local programs and connecting to gateway straight away

@psschwei
Collaborator

psschwei commented Jun 7, 2023

Is the raycluster CRD instance created?

yes, it's there

Do you see any exceptions in the gateway-scheduler pod?

yes, I'm still seeing this one:

Defaulted container "gateway-scheduler" out of: gateway-scheduler, waitpostresql (init)
Using selector: EpollSelector
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UndefinedTable: relation "api_job" does not exist
LINE 1: ...d", "api_job"."ray_job_id", "api_job"."logs" FROM "api_job" ...
                                                             ^

I have local jupyter with local programs

Local notebooks fail too (though I'm guessing the scheduler error above has something to do with it).

@psschwei
Collaborator

psschwei commented Jun 7, 2023

hmm, the gateway pod is also having a problem with the attached volume

@psschwei
Collaborator

psschwei commented Jun 7, 2023

yeah, seems the error I'm hitting is the permissions on /usr/src/app/media/user in the gateway pod when trying to run a job...

PVC is getting mounted as root, which prevents the gateway from writing to it
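
A common way to deal with a PVC that mounts as root is to set a pod-level fsGroup so the mounted volume becomes group-writable by the container user. A sketch that patches the deployment with the kubernetes client; the deployment name, namespace, and group id are assumptions, not values from this repo:

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Hypothetical names and group id, shown only to illustrate the fsGroup fix.
apps.patch_namespaced_deployment(
    name="gateway",
    namespace="quantum-serverless",
    body={"spec": {"template": {"spec": {"securityContext": {"fsGroup": 1000}}}}},
)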

@IceKhan13
Member

oh, yes we have a call tomorrow for exactly this :)

@caleb-johnson
Collaborator

caleb-johnson commented Jun 7, 2023

I have local jupyter with local programs and connecting to gateway straight away

I'm working on #652 and seeing gateway failures in CI but not locally when running the same commands. Not sure if it's related to the discussion here.
