Conversation

@raulcd
Member

@raulcd raulcd commented Dec 18, 2025

Rationale for this change

The CUDA jobs stopped working when the Voltron Data infrastructure went down. Together with ASF Infra, we have set up a runs-on (https://runs-on.com/runners/gpu/) solution to run CUDA runners.

What changes are included in this PR?

Add a new cuda_extra.yml workflow with CI jobs that use the runs-on CUDA runners.

Because the underlying instances provide CUDA 12.9, the jobs to be run are the following (a rough sketch of such a workflow matrix is included at the end of this section):

  • AMD64 Ubuntu 22 CUDA 11.7.1
  • AMD64 Ubuntu 24 CUDA 12.9.0
  • AMD64 Ubuntu 22 CUDA 11.7.1 Python
  • AMD64 Ubuntu 24 CUDA 12.9.0 Python

A follow-up issue has been created to add jobs for CUDA 13; see #48783.

A new label CI: Extra: CUDA has also been created.
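
For illustration, below is a minimal sketch of what such a workflow matrix could look like. It is not the actual contents of cuda_extra.yml: the trigger, runner labels, and docker-compose service name are assumptions made for this example.

```yaml
# Hypothetical sketch of a CUDA job matrix; labels and service names are
# placeholders, not the real values used in cuda_extra.yml.
name: CUDA Extra

on:
  workflow_dispatch:

jobs:
  cuda-cpp:
    name: AMD64 Ubuntu ${{ matrix.ubuntu }} CUDA ${{ matrix.cuda }}
    # Placeholder labels standing in for the runs-on provided GPU runners.
    runs-on: ["self-hosted", "cuda"]
    strategy:
      fail-fast: false
      matrix:
        include:
          # Capped at CUDA 12.9 because that is what the underlying
          # instances currently provide.
          - ubuntu: "22.04"
            cuda: "11.7.1"
          - ubuntu: "24.04"
            cuda: "12.9.0"
    steps:
      - name: Checkout Arrow
        uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: Build and test CUDA C++
        # Simplified stand-in for the docker-compose based CI step;
        # "ubuntu-cuda-cpp" is an assumed service name.
        run: |
          UBUNTU=${{ matrix.ubuntu }} CUDA=${{ matrix.cuda }} \
            docker compose run --rm ubuntu-cuda-cpp
```

The Python jobs in the list above would presumably follow the same pattern with a Python-oriented docker-compose service.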

Are these changes tested?

Yes, via CI.

Are there any user-facing changes?

No

@github-actions github-actions bot added the awaiting committer review label Dec 18, 2025
@github-actions

⚠️ GitHub issue #48582 has been automatically assigned to the PR creator.

@github-actions github-actions bot added awaiting changes and removed awaiting committer review labels Dec 22, 2025
@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Jan 7, 2026
@raulcd
Member Author

raulcd commented Jan 7, 2026

@pitrou @kou this PR was originally created to test the runs-on solution using our AWS infra to add the CUDA runners.
We used to have 4 CI jobs:

  • CUDA 13.0.2 C++
  • CUDA 13.0.2 Python
  • CUDA 11.7.1 C++
  • CUDA 11.7.1 Python

From an organizational standpoint, would you prefer to have the C++ jobs added to CI: Extra: C++ and a new CI: Extra: Python created, or should we add a new CI: Extra: CUDA with those 4 jobs?

@raulcd
Member Author

raulcd commented Jan 7, 2026

The Python CUDA 13.0.2 errors are not related to this PR per se (it only adds the new runners), but there seems to be an issue initializing CUDA:
OSError: Cuda error 999 in function 'cuInit': [CUDA_ERROR_UNKNOWN] unknown error
@gmarkall @pitrou should I create a new issue to fix this separately once we have added the new runners, or do you have any pointers for a fix?

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

do you have any pointers for a fix?

Perhaps the driver on the machine is older than what the CUDA 13.0.2 toolkit used in the container requires. Do you have a way to check the driver version installed on the machine that's hosting the docker image (e.g. what is its nvidia-smi output)?
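
For example, an early workflow step along these lines, executed on the GPU host itself (outside any container), would show it; this is only a sketch, not something already in the workflow:

```yaml
# Hypothetical diagnostic step, assumed to run directly on the runner host
# before any docker compose invocation.
- name: Show host NVIDIA driver and CUDA versions
  run: nvidia-smi
```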

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

From the nvidia-smi output:

| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |

I think that the machine will need at least driver version 580 for a CUDA 13 container. Are you able to change the underlying machine?

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

Alternatively, it may be possible to make this configuration work by adding the relevant cuda-compat package to the container, which, based on https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html#id1, I think may be cuda-compat-13-0.

@raulcd
Member Author

raulcd commented Jan 7, 2026

We are using the default images built here (yes, they point out that they ship CUDA 12):
https://github.com/runs-on/runner-images-for-aws?tab=readme-ov-file#gpu
There's the possibility of creating our own AWS images with Packer, but I am unsure we want to follow that route due to the maintenance overhead:
https://runs-on.com/runners/gpu/#using-nvidia-deep-learning-amis
I have a meeting with Cyril from runs-on tomorrow; I might ask whether CUDA 13 is on their roadmap.
Maybe we should go with cuda-compat here; I am trying a naive approach for testing purposes at the moment, roughly like the sketch below.
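
Roughly something like this as an extra step, under the assumption that the job runs directly inside an Ubuntu-based CUDA container with NVIDIA's apt repository configured (the proper change probably belongs in the docker image instead):

```yaml
# Naive, testing-only sketch: install the CUDA 13 forward-compatibility
# package and make its libcuda visible to the loader. The package name comes
# from the forward-compatibility docs linked above; the path is an assumption.
- name: Install CUDA forward-compat package
  run: |
    apt-get update
    apt-get install -y cuda-compat-13-0
    # The compat libraries typically land under /usr/local/cuda/compat and
    # need to be preferred over the older driver-provided libcuda.
    echo "LD_LIBRARY_PATH=/usr/local/cuda/compat:${LD_LIBRARY_PATH}" >> "$GITHUB_ENV"
```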

@raulcd
Member Author

raulcd commented Jan 7, 2026

Thanks @gmarkall for your help. Unfortunately, it seems to fail with a different error when cuda-compat-13-0 is installed:

 cuda.core._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_IMAGE: This indicates that the device kernel image is invalid. This can also indicate an invalid CUDA module.

and nvidia-smi now shows 13.0:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+

@kou
Member

kou commented Jan 8, 2026

I prefer CI: Extra: CUDA because CUDA isn't relevant to most changes. If we used CI: Extra: C++/Python, we would need to use CUDA runners for many non-CUDA-related changes.

@github-actions github-actions bot added the awaiting changes label Jan 8, 2026
@github-actions github-actions bot added awaiting change review, awaiting changes and removed awaiting changes, awaiting change review labels Jan 8, 2026
@raulcd raulcd requested a review from pitrou January 9, 2026 09:15
Member Author

@raulcd raulcd left a comment


I plan to merge this later today so that we have CUDA runners. @pitrou @kou, in case you want to take a look before then.

Member

@kou kou left a comment


+1

@github-actions github-actions bot added awaiting review, awaiting merge and removed awaiting changes labels Jan 12, 2026
@github-actions github-actions bot added awaiting review, awaiting changes and removed awaiting review, awaiting merge labels Jan 12, 2026
@raulcd
Member Author

raulcd commented Jan 12, 2026

I've simplified the job by adding cuda and ubuntu to the matrix. Thanks @kou! I'll merge this into 23.0.0 so that we have CUDA validation in case we do a patch release in the future.

@raulcd raulcd merged commit 985b16e into apache:main Jan 12, 2026
23 of 31 checks passed
@raulcd raulcd removed the awaiting changes label Jan 12, 2026
raulcd added a commit that referenced this pull request Jan 12, 2026
…-hosted runners (#48583)

* GitHub Issue: #48582

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
@raulcd raulcd deleted the GH-48582 branch January 12, 2026 11:13
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 985b16e.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.
