Add support for running functions on GPUs #1515
Conversation
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
One question about it: is this using the distribute_task annotation, or will the entire function run on the GPU? I'm asking because, imagine a function is using 4 GPUs (parallelizing tasks), but during the QPU call (the Qiskit primitives invocation) they need to wait for the QPU hardware results... So my question is: in that case, will the GPU or GPUs be blocked? Or can they control when things run on the GPU, so that GPUs are not blocked during the QPU time (where the call could run on CPUs, blocking CPUs rather than GPUs)?
There are two parts here: as of now, to run on the GPU, functions will need to use the distribute_task annotation. But either way we go, each job will block a GPU until it completes. Nothing we can really do about that -- as serverless is currently written, we don't have the ability to release resources (CPU or GPU) in the middle of a job.
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
The distribute_task annotation is OK. Ohh, so even when the method with the distribute_task annotation has finished, the resource is not released to be available again for another function?
Is it possible to assign a classical CPU node for the main function and only assign the GPU nodes for each distribute_task-annotated call, so the GPU resource is unblocked when the annotated method finishes? (Like Ray allows :) )
No, not without a complete rewrite of serverless.
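For context, the per-task GPU assignment the question above refers to looks roughly like this in Ray (a minimal sketch, not part of this PR; Ray reserves the GPU only while the remote task runs and releases it when the task returns):

import ray

ray.init()

# The GPU is held only for the duration of this task and released on return.
@ray.remote(num_gpus=1)
def gpu_step(x):
    return x * 2

result = ray.get(gpu_step.remote(21))
print(result)  # 42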
I'm going to try to give a review today, @psschwei, thank you! 👍
In general it looks really good. I just left a couple of comments related to the implementation, @psschwei 👍
PROGRAM_TIMEOUT = int(os.environ.get("PROGRAM_TIMEOUT", "14"))

GATEWAY_ALLOWLIST_CONFIG = str(
    os.environ.get("GATEWAY_ALLOWLIST_CONFIG", "api/v1/allowlist.json")
)

GATEWAY_GPU_JOBS_CONFIG = str(
Similar to my previous comment, should we add this environment variable to the helm configuration?
I'd wait until we're ready to make it a configmap... for now, easier to manage the file directly in the internal repo
OK, so you are planning to do the tests in the internal repo. Just double-checking, since I would not like to expose this info in the public repo.
As a POC it looks good to me to start testing it in staging. There are still some things to close for a final version, as I pointed out, but if they are closed before the deadline, perfect.
I can approve it, @psschwei, as soon as you can confirm that the configuration for gpu-jobs.json can be done in the private repository and is not going to happen in the public one 👍
@IceKhan13, if you can take a look in case I'm forgetting something, I would appreciate the double check.
Signed-off-by: Paul S. Schweigert <paul@paulschweigert.com>
Yes, that config will be done in the internal repo
@IceKhan13 will be OOO for a few days. Tests are passing, and I don't want to delay this feature any longer, so that you can start testing it in staging, @psschwei. I'm going to approve it, and we can add improvements later 👍
Summary
Provides support for running a select group of functions on GPU nodes. Functions will need to use distribute_task to run workloads on the GPU (e.g., @distribute_task(target={"gpu": 1})).
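A minimal usage sketch follows. Only the decorator and its target={"gpu": 1} argument come from this PR; the imports, function body, and use of get are illustrative, assuming the qiskit_serverless client exposes distribute_task and get as in earlier releases:

from qiskit_serverless import distribute_task, get

# Hypothetical GPU-bound workload; the decorator target is the part
# introduced in this PR, the body is just a placeholder.
@distribute_task(target={"gpu": 1})
def gpu_workload(data):
    return sum(data)

# Calling the decorated function returns a reference; get() resolves it.
result = get(gpu_workload([1, 2, 3]))
print(result)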
Details and comments
To set the list of providers allowed to run functions on GPUs, edit the gateway/api/v1/gpu-jobs.json file. A sample config, allowing the mockprovider provider to run on GPUs, would look like the sketch below. The method for getting the list of GPU providers is the same as we previously used for allowlisting dependencies.
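A hedged sketch of what that config might contain, assuming it maps each allowed provider to a list of its program titles, mirroring the dependency-allowlist approach (the exact schema is an assumption, not shown in this excerpt):

{
    "mockprovider": ["my-first-pattern"]
}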
In a future iteration, this could also be limited to specific functions by checking that the program title matches an element in the list (in this example, that would be my-first-pattern).

Jobs and ComputeResources using GPUs are tracked via a new field in each model.
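As an illustration only, the new tracking field might look something like this in the gateway's Django models (the field name and type are assumptions, not taken from the PR diff):

from django.db import models

class Job(models.Model):
    # ... existing fields ...
    # Hypothetical flag recording whether this job runs on a GPU node.
    gpu = models.BooleanField(default=False)

class ComputeResource(models.Model):
    # ... existing fields ...
    # Hypothetical flag recording whether this resource uses GPU nodes.
    gpu = models.BooleanField(default=False)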
There is also a separate queue for provisioning GPU clusters. The current default is only 1 GPU job at a time, but that can be configured by setting a value for LIMITS_GPU_CLUSTERS on the scheduler deployment.

CPU-only jobs are only allowed to run on CPU nodes, and GPU jobs can only run on GPU nodes. This is enforced by configuring a nodeSelector on the rayclustertemplate, the values of which can be configured in the Helm chart.
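A hedged sketch of what the corresponding Helm values might look like; only LIMITS_GPU_CLUSTERS and the use of a nodeSelector are stated above, so the key names and node label below are illustrative assumptions:

# Scheduler deployment: cap concurrent GPU jobs (default is 1 per this PR).
scheduler:
  env:
    - name: LIMITS_GPU_CLUSTERS
      value: "1"

# Ray cluster template: pin GPU jobs to GPU nodes via a nodeSelector.
# The label key/value here are hypothetical examples.
nodeSelector:
  has-gpu: "true"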