GH-48582: [CI][GPU][C++][Python] Add new CUDA jobs using the new self-hosted runners #48583
Conversation
@pitrou @kou this PR was originally created to test the …
From an organization standpoint, would you prefer to have the C++ jobs added to the …?
The Python CUDA 13.0.2 errors are not related to this PR per se (this is only adding the new runners), but there seems to be an issue initializing CUDA: …
Perhaps the driver version on the machine is older than the 13.0.2 used in the container. Do you have a way to check the driver version installed on the machine that's hosting the docker image? (e.g. what is its …)
From the …, I think that the machine will need at least driver version 580 for a CUDA 13 container. Are you able to change the underlying machine?
Alternatively, it may be possible to make this configuration work by adding the relevant …
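As a hypothetical illustration (not part of this PR), a diagnostic step like the following could surface the host driver version in the job log; the workflow name, runner labels, and trigger are placeholders:

```yaml
# Hypothetical diagnostic workflow, not taken from this PR; the runner labels,
# names, and trigger below are placeholders. It prints the driver version that
# the host exposes, so a mismatch with the container's CUDA toolkit is visible
# in the log.
name: report-nvidia-driver (sketch)
on: workflow_dispatch
jobs:
  report-driver:
    runs-on: [self-hosted, linux, x64]  # placeholder labels for the GPU runner
    steps:
      - name: Report NVIDIA driver version
        run: nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

If the reported version is below 580, that would be consistent with the CUDA 13 failures described above.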
We are using the default images built here (yes, they point out they have CUDA 12): …
Thanks @gmarkall for your help; unfortunately, it seems to fail with a different error if … and …
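One way to narrow this down (purely illustrative; the image tag and runner labels below are assumptions and are not the Arrow CI images referenced above) is a smoke test that runs `nvidia-smi` inside a public CUDA container:

```yaml
# Illustrative GPU smoke test; the nvidia/cuda tag and runner labels are
# assumptions, not the Arrow CI images discussed in this thread. If this step
# passes but the real job fails, the problem is more likely in the job's own
# image or configuration than in the host driver / container runtime.
name: gpu-smoke-test (sketch)
on: workflow_dispatch
jobs:
  smoke-test:
    runs-on: [self-hosted, linux, x64]  # placeholder labels for the GPU runner
    steps:
      - name: Run nvidia-smi inside a CUDA 12 container
        run: docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```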
I prefer …
Commit: …w self-hosted runners (this reverts commit f5766b7).
raulcd left a comment
kou left a comment
+1
Commit: …t reusing the CPP workflow
I've simplified the job by adding `cuda` and `ubuntu` to the matrix. Thanks @kou. I'll merge this into 23.0.0 so that we have CUDA validation in case we do a patch release in the future.
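As a rough sketch of what such a matrix could look like (this is not the actual `cuda_extra.yml`; the Ubuntu/CUDA pairs come from the job list in the PR description, and the runner labels and build step are placeholders):

```yaml
# Minimal sketch of a cuda/ubuntu matrix, not the actual cuda_extra.yml.
# The Ubuntu/CUDA pairs mirror the jobs listed in the PR description; the
# runner labels and the final step are placeholders.
name: cuda-matrix (sketch)
on: workflow_dispatch
jobs:
  docker-cuda:
    runs-on: [self-hosted, linux, x64]  # placeholder labels for the GPU runner
    strategy:
      fail-fast: false
      matrix:
        include:
          - ubuntu: "22.04"
            cuda: "11.7.1"
          - ubuntu: "24.04"
            cuda: "12.9.0"
    steps:
      - uses: actions/checkout@v4
      - name: Build and test (placeholder)
        run: echo "Would build and test on Ubuntu ${{ matrix.ubuntu }} with CUDA ${{ matrix.cuda }}"
```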
GH-48582: [CI][GPU][C++][Python] Add new CUDA jobs using the new self-hosted runners (#48583) was merged. GitHub Issue: #48582. Authored-by: Raúl Cumplido <raulcumplido@gmail.com>. Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>. (The PR description is reproduced in full below.)
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 985b16e. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details.
Rationale for this change
The CUDA jobs stopped working when Voltron Data infrastructure went down. We have set up with ASF Infra a [runs-on](https://runs-on.com/runners/gpu/) solution to run CUDA runners.
What changes are included in this PR?
Add the new workflow `cuda_extra.yml` with CI jobs that use the runs-on CUDA runners. Due to the underlying instances having CUDA 12.9, the jobs to be run are:
- AMD64 Ubuntu 22 CUDA 11.7.1
- AMD64 Ubuntu 24 CUDA 12.9.0
- AMD64 Ubuntu 22 CUDA 11.7.1 Python
- AMD64 Ubuntu 24 CUDA 12.9.0 Python
A follow-up issue has been created to add jobs for CUDA 13, see: #48783
A new label `CI: Extra: CUDA` has also been created.
Are these changes tested?
Yes, via CI.
Are there any user-facing changes?
No
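The description above mentions the new `CI: Extra: CUDA` label. As a hypothetical sketch only (the actual trigger used by `cuda_extra.yml` is not shown in this thread and may differ), a label like that could gate a workflow as follows:

```yaml
# Hypothetical label-gated trigger; the actual cuda_extra.yml trigger may differ.
name: cuda-extra (sketch)
on:
  pull_request:
    types: [opened, synchronize, labeled]
jobs:
  cuda:
    # Only run when the PR carries the "CI: Extra: CUDA" label.
    if: contains(github.event.pull_request.labels.*.name, 'CI: Extra: CUDA')
    runs-on: [self-hosted, linux, x64]  # placeholder labels for the GPU runner
    steps:
      - run: echo "CUDA extra jobs would run here"
```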