
Add support for running functions on GPUs #1515

Merged: 7 commits into Qiskit:main on Oct 22, 2024
Conversation

@psschwei (Collaborator) commented Oct 10, 2024

Summary

Provides support for running a select group of functions on GPU nodes. Functions will need to use distribute_task to run workloads on the GPU (e.g., @distribute_task(target={"gpu": 1})).
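For illustration, here is a minimal sketch of a pattern that requests a GPU this way (the import follows typical qiskit_serverless usage; the function name and workload are made up for the example):

from qiskit_serverless import distribute_task, get

@distribute_task(target={"gpu": 1})
def gpu_workload(data):
    # Placeholder for GPU-heavy work; Ray schedules only this task
    # onto a node with an available GPU.
    return sum(x * x for x in data)

# Inside the pattern's entrypoint:
result = get(gpu_workload([1, 2, 3]))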

Details and comments

To set the list of providers allowed to run functions on GPUs, edit the gateway/api/v1/gpu-jobs.json file. A sample config allowing the mockprovider provider to run on GPUs would look like this:

{
    "gpu-functions": {
        "mockprovider": ["my-first-pattern"]
    }
}

The method for getting the list of GPU providers is the same one we previously used for allowlisting dependencies.

In a future iteration, this could also be limited to specific functions by checking that the program title matches an element in the list (in this example, that would be my-first-pattern).
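As a rough illustration only (not the gateway's actual code; names here are hypothetical), the check against gpu-jobs.json could look like this, with the title check being the future iteration mentioned above:

import json

def may_use_gpu(config_path, provider, title=None):
    with open(config_path, encoding="utf-8") as f:
        gpu_functions = json.load(f).get("gpu-functions", {})
    if provider not in gpu_functions:
        return False  # provider is not allowlisted for GPUs
    if title is None:
        return True  # current behavior: provider-level allowlisting
    # future iteration: also require the program title to be listed
    return title in gpu_functions[provider]

# may_use_gpu("gateway/api/v1/gpu-jobs.json", "mockprovider", "my-first-pattern") -> True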

Jobs and ComputeResources using GPUs are tracked via a new field in each model.

There is also a separate queue for provisioning GPU clusters. The current default is only 1 GPU job at a time, but that can be configured by setting a value for LIMITS_GPU_CLUSTERS on the scheduler deployment.
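For example (a sketch only, assuming LIMITS_GPU_CLUSTERS is exposed as an environment variable on the scheduler deployment; the helper function is hypothetical):

import os

LIMITS_GPU_CLUSTERS = int(os.environ.get("LIMITS_GPU_CLUSTERS", "1"))

def can_provision_gpu_cluster(running_gpu_jobs):
    # Additional GPU jobs wait in the GPU queue once the limit is reached.
    return running_gpu_jobs < LIMITS_GPU_CLUSTERS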

CPU-only jobs are only allowed to run on CPU nodes, and GPU jobs can only run on GPU nodes. This is enforced by configuring a nodeSelector on the rayclustertemplate, the values of which can be configured in the Helm chart.

@Tansito requested a review from a team on October 10, 2024 19:58
@pacomf (Member) commented Oct 10, 2024

One question about this: does it use the distribute_task annotation, or will the entire function run on GPU?

I am asking because, imagine a function that uses 4 GPUs (parallelizing tasks) but, during the QPU call (the Qiskit primitives invocation), needs to wait for the QPU hardware results... So my question is: in that case, will the GPU or GPUs be blocked? Or can the function indicate when things run on GPU, so that GPUs are not blocked during the QPU time (where that call could run on CPUs, blocking CPUs rather than GPUs)?

@psschwei (Collaborator, Author) commented:

> Does it use the distribute_task annotation, or will the entire function run on GPU?

There are two parts here:

As of now, to run on the GPU, functions will need to use distribute_task to let Ray know they'll need GPU resources. If we didn't want to use distribute_task, we could alternatively set entrypoint_num_gpus when the job is submitted.

But either way we go, each job will block a GPU until it completes. Nothing we can really do about that -- as serverless is currently written, we don't have the ability to release resources (CPU or GPU) in the middle of a job.
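For context, the entrypoint_num_gpus alternative would look roughly like this with Ray's job submission API (the cluster address and entrypoint below are placeholders, not values from this PR):

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://raycluster-head:8265")  # placeholder address
job_id = client.submit_job(
    entrypoint="python my_pattern.py",  # placeholder entrypoint
    entrypoint_num_gpus=1,  # Ray reserves one GPU for the whole job
)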

@pacomf (Member) commented Oct 11, 2024

The distribute_task annotation is OK.

Ohh, so even if the method with the distribute_task annotation finishes, the resource is not released to be available again for another function?

@pacomf (Member) commented Oct 11, 2024

Is it possible to assign a classical CPU node for the main function and only assign GPU nodes for each distribute_task-annotated call, so the GPU resource is unblocked when the annotated method finishes? (like Ray allows :) )

@psschwei (Collaborator, Author) commented Oct 11, 2024

> Is it possible to assign a classical CPU node for the main function and only assign GPU nodes for each distribute_task-annotated call, so the GPU resource is unblocked when the annotated method finishes? (like Ray allows :) )

No, not without a complete rewrite of serverless.

@Tansito (Member) commented Oct 14, 2024

I'm going to try to give a review today, @psschwei, thank you! 👍

@Tansito (Member) left a review:


In general it looks really good. I just left a couple of comments related to the implementation, @psschwei 👍

gateway/main/settings.py (review thread, resolved)
gateway/api/ray.py (review thread, outdated, resolved)
gateway/api/v1/gpu-jobs.json (review thread, resolved)
PROGRAM_TIMEOUT = int(os.environ.get("PROGRAM_TIMEOUT", "14"))

GATEWAY_ALLOWLIST_CONFIG = str(
    os.environ.get("GATEWAY_ALLOWLIST_CONFIG", "api/v1/allowlist.json")
)

GATEWAY_GPU_JOBS_CONFIG = str(
    # default assumed to mirror GATEWAY_ALLOWLIST_CONFIG and the
    # gateway/api/v1/gpu-jobs.json path described in this PR
    os.environ.get("GATEWAY_GPU_JOBS_CONFIG", "api/v1/gpu-jobs.json")
)
Member:

Similar to my previous comment, should we add this environment variable to the helm configuration?

@psschwei (Collaborator, Author):

I'd wait until we're ready to make it a configmap... for now, easier to manage the file directly in the internal repo

Member:

OK, so you are planning to do the tests in the internal repo. Just to double check: I would not like to expose this info in the public repo.

@Tansito requested reviews from IceKhan13 and Tansito on October 15, 2024 21:55
@Tansito (Member) left a review:

As a POC it looks good to me to start testing it in staging. There remain some things to close for a final version, as I was pointing out, but if they are closed before the deadline, perfect.

I can approve it, @psschwei, as soon as you confirm that the configuration for gpu-jobs.json can be done in the private repository and is not going to happen in the public one 👍

@IceKhan13, if you can take a look in case I'm forgetting something, I would appreciate the double check.

@psschwei (Collaborator, Author) commented:

> confirm that the configuration for gpu-jobs.json can be done in the private repository

Yes, that config will be done in the internal repo.

@Tansito self-requested a review on October 22, 2024 12:25
@Tansito (Member) left a review:

@IceKhan13 will be OOO for some days, tests are passing, and I would not like to delay this feature any longer, so that you can start testing it in staging, @psschwei. So I'm going to approve it and we can add improvements later 👍

@psschwei merged commit 4f29082 into Qiskit:main on Oct 22, 2024
10 checks passed
@psschwei deleted the enable-gpus branch on October 22, 2024 13:06