Add support for running functions on GPUs #1515
Conversation
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
One question about it: is this using the distribute_task annotation, or will the entire function run on the GPU? I'm asking because, imagine a function is using 4 GPUs (parallelizing tasks), but during the QPU call (the Qiskit primitives invocation) they need to wait for the QPU hardware results... So my question is: in that case, will the GPU or GPUs be blocked? Or can they control when things run on the GPU, so that GPUs are not blocked during the QPU time (where the call could run on CPUs, blocking CPUs rather than GPUs)?
There are two parts here: as of now, to run on the GPU, functions will need to use the distribute_task annotation. But either way we go, each job will block a GPU until it completes. Nothing we can really do about that -- as serverless is currently written, we don't have the ability to release resources (CPU or GPU) in the middle of a job.
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
The distribute_task annotation is OK. Ohh, so even when the method with the distribute_task annotation has finished, the resource is not released to be available again for another function?
Is it possible to assign a classical CPU node for the main function and only assign the GPU nodes for each distribute_task-annotated call, so the GPU resource is unblocked when the annotated method finishes? (Like Ray allows :) )
No, not without a complete rewrite of serverless.
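For context, the per-task GPU assignment the question above refers to looks roughly like this in Ray (a minimal sketch, not part of this PR; Ray reserves the GPU only while the remote task runs and releases it when the task returns):

import ray

ray.init()

# The GPU is held only for the duration of this task and released on return.
@ray.remote(num_gpus=1)
def gpu_step(x):
    return x * 2

result = ray.get(gpu_step.remote(21))
print(result)  # 42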
I'm going to try to give a review today, @psschwei, thank you! 👍
In general it looks really good. I just left a couple of comments related to the implementation, @psschwei 👍
PROGRAM_TIMEOUT = int(os.environ.get("PROGRAM_TIMEOUT", "14"))

GATEWAY_ALLOWLIST_CONFIG = str(
    os.environ.get("GATEWAY_ALLOWLIST_CONFIG", "api/v1/allowlist.json")
)

GATEWAY_GPU_JOBS_CONFIG = str(
Similar to my previous comment, should we add this environment variable to the helm configuration?
I'd wait until we're ready to make it a configmap... for now, easier to manage the file directly in the internal repo
OK, so you are planning to do the tests in the internal repo. Just double-checking, since I would not like to expose this info in the public repo.
As a POC it looks good to me to start testing it in staging. There are still some things to close for a final version, as I pointed out, but if they are closed before the deadline, perfect.
I can approve it, @psschwei, as soon as you can confirm that the configuration for gpu-jobs.json can be done in the private repository and is not going to happen in the public one 👍
@IceKhan13, if you can take a look in case I'm forgetting something, I would appreciate the double check.
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Yes, that config will be done in the internal repo
@IceKhan13 will be OOO for a few days. Tests are passing, and I don't want to delay this feature any longer, so that you can start testing it in staging, @psschwei. I'm going to approve it, and we can add improvements later 👍
Summary
Provides support for running a select group of functions on GPU nodes. Functions will need to use distribute_task to run workloads on the GPU (e.g., @distribute_task(target={"gpu": 1})).
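A minimal usage sketch follows. Only the decorator and its target={"gpu": 1} argument come from this PR; the imports, function body, and use of get are illustrative, assuming the qiskit_serverless client exposes distribute_task and get as in earlier releases:

from qiskit_serverless import distribute_task, get

# Hypothetical GPU-bound workload; the decorator target is the part
# introduced in this PR, the body is just a placeholder.
@distribute_task(target={"gpu": 1})
def gpu_workload(data):
    return sum(data)

# Calling the decorated function returns a reference; get() resolves it.
result = get(gpu_workload([1, 2, 3]))
print(result)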
Details and comments
To set the list of providers allowed to run functions on GPUs, edit the gateway/api/v1/gpu-jobs.json file. A sample config, allowing the mockprovider provider to run on GPUs, would look like the sketch below. The method for getting the list of GPU providers is the same as we previously used for allowlisting dependencies.
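A hedged sketch of what that config might contain, assuming it maps each allowed provider to a list of its program titles, mirroring the dependency-allowlist approach (the exact schema is an assumption, not shown in this excerpt):

{
    "mockprovider": ["my-first-pattern"]
}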
In a future iteration, this could also be limited to specific functions by checking that the program title matches an element in the list (in this example, that would be my-first-pattern).

Jobs and ComputeResources using GPUs are tracked via a new field in each model.
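As an illustration only, the new tracking field might look something like this in the gateway's Django models (the field name and type are assumptions, not taken from the PR diff):

from django.db import models

class Job(models.Model):
    # ... existing fields ...
    # Hypothetical flag recording whether this job runs on a GPU node.
    gpu = models.BooleanField(default=False)

class ComputeResource(models.Model):
    # ... existing fields ...
    # Hypothetical flag recording whether this resource uses GPU nodes.
    gpu = models.BooleanField(default=False)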
There is also a separate queue for provisioning GPU clusters. The current default is only 1 GPU job at a time, but that can be configured by setting a value for LIMITS_GPU_CLUSTERS on the scheduler deployment.

CPU-only jobs are only allowed to run on CPU nodes, and GPU jobs can only run on GPU nodes. This is enforced by configuring a nodeSelector on the rayclustertemplate, the values of which can be configured in the Helm chart.
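A hedged sketch of what the corresponding Helm values might look like; only LIMITS_GPU_CLUSTERS and the use of a nodeSelector are stated above, so the key names and node label below are illustrative assumptions:

# Scheduler deployment: cap concurrent GPU jobs (default is 1 per this PR).
scheduler:
  env:
    - name: LIMITS_GPU_CLUSTERS
      value: "1"

# Ray cluster template: pin GPU jobs to GPU nodes via a nodeSelector.
# The label key/value here are hypothetical examples.
nodeSelector:
  has-gpu: "true"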