
Refactor: use kubernetes client instead of kuberay apiserver #640

Merged
11 commits merged into Qiskit:main on Jun 7, 2023

Conversation

akihikokuroda
Collaborator

Summary

This PR refactors the gateway Ray cluster creation and deletion.

Details and comments

This replaces the use of the kuberay apiserver with the kubernetes dynamic client.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@psschwei
Collaborator

psschwei commented Jun 5, 2023

#642 will fix the failing client test

},
timeout=30,
cluster = """
apiVersion: ray.io/v1alpha1
Member

We can use Jinja templates as they come for free with django :)

We can create a file with this k8s template

# templates/ray_cluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: {{ name }}
  namespace: {{ namespace }}
...
              volumeMounts:
              - mountPath: {{ persistent_storage }}
                name: persistent_storage
...
              resources:
                limits:
                  cpu: {{ worker_cpu }}
                  memory: {{ worker_memory }}

and then use it like

import yaml
from django.template.loader import get_template

template = get_template("api/ray_cluster.yaml")
cluster_data = yaml.safe_load(
    template.render({
        "name": name,
        "namespace": namespace,
        ...
    })
)
response = raycluster_client.create(body=cluster_data, namespace=namespace)

It is only a suggestion, but it is up to you whether you want to do it here or not :)

Member

and as I said before, I'm really liking this approach :) much more flexible than hitting kuberay limitations

Collaborator Author

@IceKhan13 Jinja templates look good. I'm also thinking about putting the template in a ConfigMap, so we don't need to touch the container image to change the configuration, and we can probably use the helm template to render some parts. I'll do these improvements in a follow-up PR. Thanks!

Collaborator

@psschwei psschwei Jun 6, 2023

putting the template in a ConfigMap

+100
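
As an illustration of the ConfigMap idea, here is a minimal sketch that reads a RayCluster template mounted into the gateway pod and renders it with Jinja2. The mount path, the example values, and the raycluster_client variable are assumptions for illustration, not code from this PR:

import yaml
from jinja2 import Template

# Hypothetical mount point of a ConfigMap that holds the RayCluster template.
TEMPLATE_PATH = "/etc/gateway/templates/ray_cluster.yaml"

name, namespace = "example-cluster", "default"  # example values only

with open(TEMPLATE_PATH, encoding="utf-8") as template_file:
    template = Template(template_file.read())

cluster_data = yaml.safe_load(template.render(name=name, namespace=namespace))
# The create call would then look the same as elsewhere in this PR:
# response = raycluster_client.create(body=cluster_data, namespace=namespace)

Because the template lives in a ConfigMap, operators could edit it (or manage it through Helm) without rebuilding the gateway image.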

json={
"name": template_name,
"namespace": namespace,
"cpu": settings.RAY_CLUSTER_TEMPLATE_CPU,
Member

can we include those settings here, and then we are good to merge :) Thank you, Aki!

akihikokuroda and others added 3 commits June 5, 2023 20:12
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Collaborator

@psschwei psschwei left a comment

FWIW, running the "running program" notebook on kind, I got a 500 when trying to run the program (job = serverless.run(program)) ... could very well be something with my setup, but noting it just in case

@@ -8,6 +8,10 @@
import uuid
from typing import Any, Optional

import yaml
from kubernetes import client, config
from openshift.dynamic import DynamicClient
Collaborator

out of curiosity, is there a reason to prefer the openshift dynamic client to the vanilla kubernetes one?

Collaborator Author

No, I found the docs using the openshift one first. I'll look into the differences later.

Collaborator Author

The openshift one just adds an "apply" function. We don't use it, so I'll change to the kubernetes one in a follow-up PR. Thanks for pointing it out.
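
For reference, a minimal sketch of the same pattern with the vanilla kubernetes dynamic client (kubernetes.dynamic.DynamicClient); the config-loading fallback and the cluster_data variable are illustrative, not lifted from this PR:

from kubernetes import config
from kubernetes.client import api_client
from kubernetes.dynamic import DynamicClient

# Use the in-cluster service account when running inside a pod,
# otherwise fall back to the local kubeconfig.
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

k8s_client = DynamicClient(api_client.ApiClient())
raycluster_client = k8s_client.resources.get(
    api_version="ray.io/v1alpha1", kind="RayCluster"
)

# cluster_data would be the dict produced by yaml.safe_load on the rendered template:
# response = raycluster_client.create(body=cluster_data, namespace=namespace)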

@@ -232,7 +231,7 @@ kuberay-operator:
# Kuberay API Server
# ===================

kuberayApiServerEnable: true
kuberayApiServerEnable: false
Collaborator

is there anything still using the api server? if not, may as well rip off the bandaid and cut all this stuff out now 😄

Collaborator Author

No, nothing is using the api server. It can be taken off along with the raycluster.

Collaborator

:kill-it-with-fire:

Collaborator Author

I took it off.

@psschwei
Collaborator

psschwei commented Jun 6, 2023

We may also need to bump versions in infrastructure/helm/quantumserverless/Chart.lock

@akihikokuroda
Collaborator Author

akihikokuroda commented Jun 6, 2023

I'll run helm dependency update and push Chart.lock. I also took "ray cluster" and "ray apiserver" out of Chart.yaml before that.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@IceKhan13
Member

Aki, I will test it today (or tomorrow at the latest); I'm buried in meetings today.

@psschwei
Collaborator

psschwei commented Jun 6, 2023

FWIW, running the "running program" notebook on kind, I got a 500 when trying to run the program (job = serverless.run(program)) ... could very well be something with my setup, but noting it just in case

I'm still getting a 500 when trying to run a program after the last changes... doesn't look like a Ray cluster is getting spun up.

@pacomf
Member

pacomf commented Jun 7, 2023

@akihikokuroda @psschwei if we are getting 500 errors with the current code... and tests are not failing... can we do something to detect it from tests?

@pacomf
Member

pacomf commented Jun 7, 2023

@akihikokuroda @psschwei if we are getting 500 errors with the current code... and tests are not failing... can we do something to detect it from tests?

maybe the error is related to the @caleb-johnson error that is appearing after adding a new test for the docker environment: https://github.com/Qiskit-Extensions/quantum-serverless/actions/runs/5195660406/jobs/9368597527?pr=648. I don't know if that test also catches the @psschwei error, but it could give us a test in our CI that surfaces the error.

@akihikokuroda
Collaborator Author

@psschwei I can't see the issue here. What is in the logs of the gateway-scheduler pod? Thanks!

@psschwei
Collaborator

psschwei commented Jun 7, 2023

Gateway scheduler pod logs are a repeating cycle of

Updated 0 jobs.
Using selector: EpollSelector
Deallocated 0 compute resources.
Using selector: EpollSelector
4 free cluster slots.
0 are scheduled for execution.

edit: need to head rather than tail, there's an error at the beginning:

Using selector: EpollSelector
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UndefinedTable: relation "api_job" does not exist
LINE 1: ...d", "api_job"."ray_job_id", "api_job"."logs" FROM "api_job" ...
                                                             ^


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/app/manage.py", line 22, in <module>
    main()
  File "/usr/src/app/manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/usr/src/app/api/management/commands/update_jobs_statuses.py", line 17, in handle
    for job in Job.objects.filter(status__in=Job.RUNNING_STATES):
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 398, in __iter__
    self._fetch_all()
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1881, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 91, in __iter__
    results = compiler.execute_sql(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1562, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: relation "api_job" does not exist
LINE 1: ...d", "api_job"."ray_job_id", "api_job"."logs" FROM "api_job" ...

@psschwei
Collaborator

psschwei commented Jun 7, 2023

wait... I'm not building the containers from source... 😡

(aside: we really need to set up a process for building from source and deploying to kubernetes...)

)

create_compute_template_if_not_exists()
cpu = settings.RAY_CLUSTER_TEMPLATE_CPU
Collaborator

thought for later: we should consider pulling these settings into a configmap as well, so operators can change the values without having to rebuild the images

Collaborator Author

I can put it in the values.yaml file

Collaborator

that would require redeploying the helm chart for any updates though...

either way, let's hold off on that change until we merge this PR
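
One lightweight way to do that later would be to have the Django settings read these values from environment variables, which a ConfigMap (or values.yaml) can then populate without an image rebuild. A sketch, where the env-var indirection and the default value are assumptions for illustration only:

import os

# RAY_CLUSTER_TEMPLATE_CPU appears in this PR; sourcing it from the
# environment and the default of "2" are illustrative only.
RAY_CLUSTER_TEMPLATE_CPU = int(os.environ.get("RAY_CLUSTER_TEMPLATE_CPU", "2"))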

Member

@IceKhan13 IceKhan13 left a comment

Checked! Looks like everything works as expected!

In follow-up PRs we can remove the helm kuberay and cluster charts and leave only the ray operator. This will simplify the deployment configs.

Great work, Aki!

@akihikokuroda akihikokuroda merged commit 8d93196 into Qiskit:main Jun 7, 2023
@akihikokuroda
Collaborator Author

Thanks!

@psschwei
Collaborator

psschwei commented Jun 7, 2023

@IceKhan13 @akihikokuroda what are you using for Kubernetes? I haven't been able to run a program on anything that doesn't do a bunch of networking magic behind the scenes (i.e. Docker / Rancher Desktop)...

@IceKhan13
Member

I'm using Rancher Desktop

@akihikokuroda
Collaborator Author

I'm using Rancher Desktop. I don't do anything except port-forwarding the Jupyter notebook service.

@IceKhan13
Member

yes, I also do port forwarding

@psschwei
Collaborator

psschwei commented Jun 7, 2023

Any changes to values.yaml?

@IceKhan13
Member

I only deploy gateway + scheduler + keycloak

@psschwei
Collaborator

psschwei commented Jun 7, 2023

So no jupyter ... how are you testing programs?
(I'm running notebooks in jupyter... I wonder if that's why I'm seeing issues)

@akihikokuroda
Collaborator Author

@psschwei Is the raycluster CRD instance created?

@akihikokuroda
Collaborator Author

The gateway-scheduler should create the instance. Do you see any exceptions in the gateway-scheduler pod?

@IceKhan13
Member

So no jupyter ... how are you testing programs?

I have local jupyter with local programs and connecting to gateway straight away

@psschwei
Collaborator

psschwei commented Jun 7, 2023

Is the raycluster CRD instance created?

yes, it's there

Do you see any exceptions in the gateway-scheduler pod?

yes, I'm still seeing this one:

Defaulted container "gateway-scheduler" out of: gateway-scheduler, waitpostresql (init)
Using selector: EpollSelector
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UndefinedTable: relation "api_job" does not exist
LINE 1: ...d", "api_job"."ray_job_id", "api_job"."logs" FROM "api_job" ...
                                                             ^

I have local jupyter with local programs

Local notebooks fail too (though I'm guessing the scheduler error above has something to do with it).

@psschwei
Collaborator

psschwei commented Jun 7, 2023

hmm, the gateway pod is also having a problem with the attached volume

@psschwei
Collaborator

psschwei commented Jun 7, 2023

yeah, seems the error I'm hitting is the permissions on /usr/src/app/media/user in the gateway pod when trying to run a job...

PVC is getting mounted as root, which prevents the gateway from writing to it
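
A common way to deal with a PVC that mounts as root is to set a pod-level fsGroup so the mounted volume becomes group-writable by the container user. A sketch that patches the deployment with the kubernetes client; the deployment name, namespace, and group id are assumptions, not values from this repo:

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Hypothetical names and group id, shown only to illustrate the fsGroup fix.
apps.patch_namespaced_deployment(
    name="gateway",
    namespace="quantum-serverless",
    body={"spec": {"template": {"spec": {"securityContext": {"fsGroup": 1000}}}}},
)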

@IceKhan13
Member

oh, yes we have a call tomorrow for exactly this :)

@caleb-johnson
Collaborator

caleb-johnson commented Jun 7, 2023

I have local jupyter with local programs and connecting to gateway straight away

I'm working on #652 and seeing gateway failures in CI but not locally when running the same commands. Not sure if it's related to the discussion here.
