
Conversation

@jameslamb (Member) commented on Aug 26, 2025:

Contributes to rapidsai/build-planning#208

  • uses CUDA 13.0.0 to build and test

@jameslamb (Member, Author) commented:

Took a couple of re-runs to get through network errors, but most build jobs are passing.

All of the arm64 CUDA 12.0.1 jobs are failing though, like this:

error    libmamba Could not solve for environment specs
    The following packages are incompatible
    ├─ cuda-version =12.0 * is requested and can be installed;
    └─ rapids =25.10 * is not installable because there are no viable options
       ├─ rapids [25.10.00a6|25.10.00a7] would require
       │  └─ cucim =25.10 * but there are no viable options
       │     ├─ cucim 25.10.00a19 would require
       │     │  └─ libcucim =25.10.0a19 * but there are no viable options
       │     │     ├─ libcucim 25.10.00a19 would require
       │     │     │  └─ libcufile >=1.14.1.1,<2.0a0 * but there are no viable options
       │     │     │     ├─ libcufile 1.14.1.1 would require
       │     │     │     │  └─ cuda-version >=12.9,<12.10.0a0 *, which conflicts with any installable versions previously reported;
       │     │     │     └─ libcufile [1.15.0.42|1.15.1.6] would require
       │     │     │        └─ cuda-version >=13.0,<13.1.0a0 *, which conflicts with any installable versions previously reported;
       │     │     └─ libcucim 25.10.00a19 would require
       │     │        └─ cuda-version >=13,<14.0a0 *, which conflicts with any installable versions previously reported;
       │     └─ cucim 25.10.00a19 would require
       │        └─ cuda-version >=13,<14.0a0 *, which conflicts with any installable versions previously reported;
       └─ rapids 25.10.00a7 would require
          └─ cuda-version >=13,<14.0a0 *, which conflicts with any installable versions previously reported.
critical libmamba Could not solve for environment specs

(build link)
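
For anyone who wants to poke at this locally, here's a rough reproduction of that solve (a hedged sketch; the channel list and --dry-run approach are my assumptions, not copied from the CI scripts):

# Fake the __cuda virtual package so the solve can run on any machine,
# then request the same rapids + cuda-version combination as the failing job.
CONDA_OVERRIDE_CUDA=12.0 mamba create --dry-run -n rapids-probe \
  -c rapidsai-nightly -c conda-forge \
  'rapids=25.10' 'cuda-version=12.0'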

libcucim does not directly constrain its libcufile-dev version:

https://github.com/rapidsai/cucim/blob/7d990eb17d23550f6c919bcb40042e7298328671/conda/recipes/libcucim/meta.yaml#L58

But it does allow libcufile-dev's run exports on arm64:

https://github.com/rapidsai/cucim/blob/7d990eb17d23550f6c919bcb40042e7298328671/conda/recipes/libcucim/meta.yaml#L24

It looks like libcufile-dev has a major-version run export on itself + a {major}.{minor} run dependency on cuda-version. That's probably the issue here:

  - name: libcufile-dev
    build:
      run_exports:
        - {{ pin_subpackage("libcufile", max_pin="x") }}
    # ... omitted ...
    requirements:
      build:
        # ... omitted ...
        - arm-variant * {{ arm_variant_type }}  # [aarch64]
      host:
        - cuda-version {{ cuda_version }}
      run:
        - {{ pin_compatible("cuda-version", max_pin="x.x") }}
        - {{ pin_subpackage("libcufile", exact=True) }}

(conda-forge/libcufile-feedstock - recipe/meta.yaml)
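
To see just that interaction in isolation, the run-exported libcufile constraint from the solver output can be paired with cuda-version=12.0 directly (a hedged sketch; only the libcufile spec comes from the log above, the rest is my guess at a minimal reproduction):

# Every published libcufile build pins cuda-version to the minor series it was
# built against (>=12.9 or >=13.0 per the solver output), so this dry-run solve
# should fail the same way.
mamba create --dry-run -n libcufile-probe -c conda-forge \
  'libcufile>=1.14.1.1,<2.0a0' 'cuda-version=12.0'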

It looks like @jakirkham anticipated this (https://github.com/rapidsai/cucim/pull/905/files#r2292656134) and started the work to avoid it (rapidsai/cucim#930). I think we'll need that change before cucim can be included in these arm64 CUDA 12.0.1 images.

rapids-bot bot pushed a commit to rapidsai/integration that referenced this pull request Sep 5, 2025
`libcucim` is not installable at the moment on arm64 systems with CUDA < 12.2:

* https://github.com/rapidsai/cucim/pull/905/files#r2292609197
* rapidsai/cucim#930

This is blocking builds of RAPIDS docker images:

* rapidsai/docker#782 (comment)

This proposes **temporarily** excluding `cucim` from the dependencies of arm64 CUDA 12 `rapids` packages, to unblock `docker` CI (and therefore publication of the first CUDA 13 nightlies of those images) until that `cucim` issue is resolved.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)

URL: #796
@jameslamb jameslamb changed the title WIP: Build and test with CUDA 13.0.0 Build and test with CUDA 13.0.0 Sep 5, 2025
@jameslamb jameslamb marked this pull request as ready for review September 5, 2025 16:46
@jameslamb jameslamb requested a review from a team as a code owner September 5, 2025 16:46
NOTEBOOK_REPOS=(cudf cuml cugraph)
else
NOTEBOOK_REPOS=(cudf cugraph)
fi
@jameslamb (Member, Author) commented:

I should have predicted this... there are some cuml notebooks that expect to be able to train an xgboost model using GPUs.

Depending on the CPU-only xgboost build (via rapidsai/integration#795) leads to this:

XGBoostError: [17:20:04] /home/conda/feedstock_root/build_artifacts/xgboost-split_1754002079811/work/src/c_api/../common/common.h:181: XGBoost version not compiled with GPU support.
Stack trace:
  [bt] (0) /opt/conda/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7522dbe3857e]
  [bt] (1) /opt/conda/lib/libxgboost.so(xgboost::common::AssertGPUSupport()+0x3b) [0x7522dbe3881b]
  [bt] (2) /opt/conda/lib/libxgboost.so(XGDMatrixCreateFromCudaArrayInterface+0xf) [0x7522dbda038f]
  [bt] (3) /opt/conda/lib/python3.13/lib-dynload/../../libffi.so.8(+0x6d8a) [0x75242f774d8a]
  [bt] (4) /opt/conda/lib/python3.13/lib-dynload/../../libffi.so.8(+0x61cd) [0x75242f7741cd]
  [bt] (5) /opt/conda/lib/python3.13/lib-dynload/../../libffi.so.8(ffi_call+0xcd) [0x75242f77491d]
  [bt] (6) /opt/conda/lib/python3.13/lib-dynload/_ctypes.cpython-313-x86_64-linux-gnu.so(+0x15f90) [0x75242f78ff90]
  [bt] (7) /opt/conda/lib/python3.13/lib-dynload/_ctypes.cpython-313-x86_64-linux-gnu.so(+0x13da6) [0x75242f78dda6]
  [bt] (8) /opt/conda/bin/python(_PyObject_MakeTpCall+0x27c) [0x6331f8b71ddc]

(build link)
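
As a quick sanity check before running the notebooks, something like this can confirm whether the installed xgboost package was compiled with CUDA support (a hedged sketch; xgboost.build_info() reports compile-time flags, and the USE_CUDA key is my assumption):

# Prints True for GPU-enabled builds, False for the CPU-only package.
python -c "import xgboost; print(xgboost.build_info().get('USE_CUDA'))"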

This proposes just skipping cuml notebook testing here temporarily, to unblock publishing the first nightly container images with CUDA 13 packages.

If reviewers agree, I'll add an issue in this repo tracking the work of putting that testing back.
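
For context, the gating in the hunk above ends up looking roughly like this (a hypothetical reconstruction; the actual condition and variable name aren't visible in the truncated diff):

# Hypothetical sketch: only include the cuml notebooks on CUDA 12 images,
# where GPU-enabled xgboost packages are available. CUDA_VER is an assumed name.
if [[ "${CUDA_VER%%.*}" -lt 13 ]]; then
  NOTEBOOK_REPOS=(cudf cuml cugraph)
else
  NOTEBOOK_REPOS=(cudf cugraph)
fi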

A contributor replied:
As long as we have the issue up I'm fine with this temporary patch.

@jameslamb (Member, Author) replied:
Great, thank you. Put up an issue here: #784

@vyasr (Contributor) left a review:

Couple notes but LGTM.

- { CUDA_VER: '12.9', ARCH: 'amd64', PYTHON_VER: '3.11', GPU: 'l4', DRIVER: 'latest' }
- { CUDA_VER: '12.9', ARCH: 'arm64', PYTHON_VER: '3.13', GPU: 'a100', DRIVER: 'latest' }
- { CUDA_VER: '13.0', ARCH: 'amd64', PYTHON_VER: '3.11', GPU: 'l4', DRIVER: 'latest' }
- { CUDA_VER: '13.0', ARCH: 'arm64', PYTHON_VER: '3.12', GPU: 'a100', DRIVER: 'latest' }
A contributor commented:

Following on from our shared-workflows discussion, should we run at least one of these jobs on an H100? This is a low-traffic repo, so it shouldn't add too much load, and it seems like it would be a good test.

@jameslamb (Member, Author) replied:

Great point, I agree!

I just pushed f003cb3, switching one of these PR jobs to H100s.

@jameslamb jameslamb removed the request for review from KyleFromNVIDIA September 5, 2025 20:53
@jameslamb jameslamb mentioned this pull request Sep 5, 2025
@jameslamb (Member, Author) commented:

/merge

@rapids-bot rapids-bot bot merged commit f9ed55b into rapidsai:branch-25.10 Sep 5, 2025
86 checks passed
@jameslamb jameslamb deleted the cuda-13.0.0 branch September 5, 2025 21:23
rapids-bot bot pushed a commit that referenced this pull request Sep 9, 2025
Closes #784 

In #782, we skipped cuML notebook testing on CUDA 13 because there weren't yet CUDA 13 `xgboost` conda packages. Those exist now:

* rapidsai/xgboost-feedstock#100
* rapidsai/integration#800

This reverts a workaround from #782, so all notebooks will be tested on CUDA 12 and CUDA 13. It also ensures that the CUDA 13 images include GPU-accelerated builds of `xgboost`.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #785