Build and test with CUDA 13.0.0 #782
Took a couple of re-runs to get through network errors, but most build jobs are passing. All of the arm64 CUDA 12.0.1 jobs are failing though, like this:
It looks like this is coming from `libcufile-dev`:

```yaml
- name: libcufile-dev
  build:
    run_exports:
      - {{ pin_subpackage("libcufile", max_pin="x") }}
  # ... omitted ...
  requirements:
    build:
      # ... omitted ...
      - arm-variant * {{ arm_variant_type }}  # [aarch64]
    host:
      - cuda-version {{ cuda_version }}
    run:
      - {{ pin_compatible("cuda-version", max_pin="x.x") }}
      - {{ pin_subpackage("libcufile", exact=True) }}
```

(conda-forge/libcufile-feedstock - recipe/meta.yaml)

It looks like @jakirkham anticipated this (https://github.com/rapidsai/cucim/pull/905/files#r2292656134) and started the work to avoid it (rapidsai/cucim#930). I think we'll need that to include
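To see why those `max_pin` patterns matter here, the following is a simplified sketch (an assumption for illustration, not conda-build's actual implementation, which also appends pre-release markers such as `0a0` to the upper bound) of how a `max_pin` pattern turns the version resolved at build time into a run-time constraint:

```python
# Simplified sketch of conda-build's max_pin rendering (illustrative only;
# real conda-build appends pre-release markers, e.g. "<13.0a0").
def render_pin(version: str, max_pin: str) -> str:
    parts = version.split(".")
    keep = max_pin.count("x")            # how many leading components to pin
    upper = parts[:keep]
    upper[-1] = str(int(upper[-1]) + 1)  # bump the last pinned component
    return f">={version},<{'.'.join(upper)}"

# With cuda-version 12.0 in host, pin_compatible(..., max_pin="x.x")
# renders roughly to ">=12.0,<12.1" — a narrow range that packages
# built against CUDA 12.0.x then carry into their run requirements.
print(render_pin("12.0", "x.x"))   # >=12.0,<12.1
print(render_pin("12.0.1", "x"))   # >=12.0.1,<13
```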
`libcucim` is not installable at the moment on arm64 systems with CUDA < 12.2:

* https://github.com/rapidsai/cucim/pull/905/files#r2292609197
* rapidsai/cucim#930

Which is blocking builds of RAPIDS docker images:

* rapidsai/docker#782 (comment)

This proposes **temporarily** excluding `cucim` from the dependencies of arm64 CUDA 12 `rapids` packages, to unblock `docker` CI (and therefore publication of the first CUDA 13 nightlies of those images) until that `cucim` issue is resolved.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Gil Forsyth (https://github.com/gforsyth)

URL: #796
```shell
NOTEBOOK_REPOS=(cudf cuml cugraph)
else
NOTEBOOK_REPOS=(cudf cugraph)
fi
```
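For context, the hunk above is the tail of a CUDA-version conditional. A hedged sketch of the full shape (the `RAPIDS_CUDA_VERSION` variable name is an assumption for illustration):

```shell
# Sketch of the conditional: include cuml notebooks only on CUDA 12,
# where GPU-enabled xgboost conda packages are available.
RAPIDS_CUDA_VERSION="${RAPIDS_CUDA_VERSION:-13.0}"
if [[ "${RAPIDS_CUDA_VERSION%%.*}" == "12" ]]; then
    NOTEBOOK_REPOS=(cudf cuml cugraph)
else
    # temporarily skip cuml on CUDA 13 until GPU xgboost packages land
    NOTEBOOK_REPOS=(cudf cugraph)
fi
echo "${NOTEBOOK_REPOS[@]}"
```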
I should have predicted this... there are some cuml notebooks that expect to be able to train an xgboost model using GPUs.
Depending on the CPU-only version (via rapidsai/integration#795) leads to this:
```
XGBoostError: [17:20:04] /home/conda/feedstock_root/build_artifacts/xgboost-split_1754002079811/work/src/c_api/../common/common.h:181: XGBoost version not compiled with GPU support.
Stack trace:
  [bt] (0) /opt/conda/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7522dbe3857e]
  [bt] (1) /opt/conda/lib/libxgboost.so(xgboost::common::AssertGPUSupport()+0x3b) [0x7522dbe3881b]
  [bt] (2) /opt/conda/lib/libxgboost.so(XGDMatrixCreateFromCudaArrayInterface+0xf) [0x7522dbda038f]
  [bt] (3) /opt/conda/lib/python3.13/lib-dynload/../../libffi.so.8(+0x6d8a) [0x75242f774d8a]
  [bt] (4) /opt/conda/lib/python3.13/lib-dynload/../../libffi.so.8(+0x61cd) [0x75242f7741cd]
  [bt] (5) /opt/conda/lib/python3.13/lib-dynload/../../libffi.so.8(ffi_call+0xcd) [0x75242f77491d]
  [bt] (6) /opt/conda/lib/python3.13/lib-dynload/_ctypes.cpython-313-x86_64-linux-gnu.so(+0x15f90) [0x75242f78ff90]
  [bt] (7) /opt/conda/lib/python3.13/lib-dynload/_ctypes.cpython-313-x86_64-linux-gnu.so(+0x13da6) [0x75242f78dda6]
  [bt] (8) /opt/conda/bin/python(_PyObject_MakeTpCall+0x27c) [0x6331f8b71ddc]
```
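One way a notebook could avoid crashing like this is to feature-detect the build before training. A hedged sketch (assuming a recent `xgboost` where `build_info()` is public API; this is not something the PR itself does):

```python
# Sketch: check at runtime whether the installed xgboost build was compiled
# with CUDA support, so GPU training can be skipped gracefully instead of
# failing with "XGBoost version not compiled with GPU support".
def xgboost_has_gpu_support() -> bool:
    try:
        import xgboost
    except ImportError:
        return False  # xgboost not installed at all
    # build_info() reports compile-time flags, including USE_CUDA
    return bool(xgboost.build_info().get("USE_CUDA", False))

print(xgboost_has_gpu_support())
```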
This proposes just skipping cuml notebook testing here temporarily, to unblock publishing the first nightly container images with CUDA 13 packages.
If reviewers agree, I'll add an issue in this repo tracking the work of putting that testing back.
As long as we have the issue up I'm fine with this temporary patch.
Great, thank you. Put up an issue here: #784
vyasr left a comment
Couple notes but LGTM.
```yaml
- { CUDA_VER: '12.9', ARCH: 'amd64', PYTHON_VER: '3.11', GPU: 'l4', DRIVER: 'latest' }
- { CUDA_VER: '12.9', ARCH: 'arm64', PYTHON_VER: '3.13', GPU: 'a100', DRIVER: 'latest' }
- { CUDA_VER: '13.0', ARCH: 'amd64', PYTHON_VER: '3.11', GPU: 'l4', DRIVER: 'latest' }
- { CUDA_VER: '13.0', ARCH: 'arm64', PYTHON_VER: '3.12', GPU: 'a100', DRIVER: 'latest' }
```
Following on from our shared-workflows discussion, should we run at least one of these jobs on an H100? This is a low-traffic repo, so it shouldn't add too much load, and it seems like it would be a good test.
Great point, I agree!
I just pushed f003cb3 switching one of these PR jobs to H100s
/merge
Closes #784

In #782, we skipped cuML notebook testing on CUDA 13 because there weren't yet CUDA 13 `xgboost` conda packages. Those exist now:

* rapidsai/xgboost-feedstock#100
* rapidsai/integration#800

This reverts a workaround from #782, so all notebooks will be tested on CUDA 12 and CUDA 13. It also ensures that the CUDA 13 images include GPU-accelerated builds of `xgboost`.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #785
Contributes to rapidsai/build-planning#208