-
Notifications
You must be signed in to change notification settings - Fork 74
Relax libcufile dependency for old CTK 12 on ARM
#930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Relax libcufile dependency for old CTK 12 on ARM
#930
Conversation
a29ea1b to
15e235d
Compare
The CUDA Toolkit (CTK) added cuFile for ARM in CTK 12.2. As a result always requiring `libcufile` with `libcucim` can make it uninstallable on ARM with CTK 12.0 & 12.1. To fix this, build two variants of the `libcucim` package. One with `libcufile` and one without. Constrain `cuda-version` accordingly so that `libcufile` is installed whenever possible. Only when `libcufile` cannot be installed, fallback to the `libcucim` package where the `libcufile` dependency is left off. The `x86_64` CTK has had cuFile since the CTK 11.4. As RAPIDS now requires CUDA 12.0+, there is no issue always requiring `libcufile` on `x86_64`. This issue will not be present in CUDA 13+. So always require `libcufile` in that case.
15e235d to
b0e17af
Compare
The CUDA Toolkit will either have `cufile.h` or not. If it has it, we can build support for cuFile. If it doesn't, we can skip it entirely.
| #else | ||
| #include "cufile_stub_types.h" | ||
| #endif | ||
| #include <cufile.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to #include "cufile_stub_types.h" causes the following error on CI:
In file included from /opt/conda/conda-bld/work/gds/src/cufile_stub.cpp:17:
/opt/conda/conda-bld/work/gds/include/cufile_stub.h:23:14: fatal error: cufile_stub_types.h: No such file or directory
23 | #include "cufile_stub_types.h"
| ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.Given we either have a new enough CUDA Toolkit and it has cuFile or it doesn't, we can simply check for cuFile and build or not
Unsure what we gain by having this in-between mode. Though would be happy to learn if I'm missing something
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
`libcucim` is not installable at the moment on arm64 systems with CUDA < 12.2: * https://github.com/rapidsai/cucim/pull/905/files#r2292609197 * rapidsai/cucim#930 Which is blocking builds of RAPIDS docker images: * rapidsai/docker#782 (comment) This proposes **temporarily** excluding `cucim` from the dependencies of arm64 CUDA 12 `rapids` packages, to unblock `docker` CI (and therefore publication of the first CUDA 13 nightlies of those images) until that `cucim` issue is resolved. Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Gil Forsyth (https://github.com/gforsyth) URL: #796
|
/ok to test |
| cat > extra_variants.yaml <<EOF | ||
| has_cufile: | ||
| - True | ||
| EOF | ||
| if [[ "$(arch)" == "aarch64" ]] && [[ "${RAPIDS_CUDA_VERSION%%.*}" == "12" ]] | ||
| then | ||
| cat >> extra_variants.yaml <<EOF | ||
| - False | ||
| EOF | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the current logic will have both in the aarch64+CUDA12 case?
has_cufile:
- True
- False
| cat > extra_variants.yaml <<EOF | |
| has_cufile: | |
| - True | |
| EOF | |
| if [[ "$(arch)" == "aarch64" ]] && [[ "${RAPIDS_CUDA_VERSION%%.*}" == "12" ]] | |
| then | |
| cat >> extra_variants.yaml <<EOF | |
| - False | |
| EOF | |
| fi | |
| if [[ "$(arch)" == "aarch64" ]] && [[ "${RAPIDS_CUDA_VERSION%%.*}" == "12" ]]; then | |
| cat > extra_variants.yaml <<EOF | |
| has_cufile: | |
| - False | |
| EOF | |
| else | |
| cat > extra_variants.yaml <<EOF | |
| has_cufile: | |
| - True | |
| EOF | |
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me commit this and see if it helps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is intended. We want to build once for each of these values, which is what the original logic does
Also for CUDA 13+ or x86_64 builds we want this to only be True
The CI failure is due to something else unrelated to this syntax. Please see this comment: #930 (comment)
Have gone ahead and restored the original logic. Though happy to chat more if you have additional questions 🙂
|
/ok to test |
|
The error in the CI log points to cuFile being used in the build and not being found In file included from /opt/conda/conda-bld/work/gds/src/cufile_stub.cpp:17:
/opt/conda/conda-bld/work/gds/include/cufile_stub.h:19:10: fatal error: cufile.h: No such file or directory
Z 19 | #include <cufile.h>
4Z | ^~~~~~~~~~
76Z compilation terminated.
111ZWill either need to...
Personally would lean towards 1 as that should be less effort. Plus the effort in 2 becomes irrelevant once CUDA 12.0 & 12.1 are dropped from support |
jakirkham
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added option 1 below as a suggestion
Option 2 likely involves more changes possibly to parts we haven't yet touched
|
/ok to test |
|
Pushed the option 1 approach. Let's see how that goes |
|
/ok to test |
This reverts commit 4dbfd5a.
|
/ok to test |
|
/ok to test |
|
/ok to test |
To ensure the different cuCIM packages with and without cuFile are both used, they need different package strings. As the new variant, `has_cufile`, is included in the package hash, make sure to include the package hash in the package string. That way each package is named differently and they don't clobber each other. Thus ensuring both packages are uploaded and not whichever one was built last.
|
/ok to test |
|
At this point this PR has the needed pieces to address the issue High-level cuCIM is designed to always build with cuFile and optionally use it at runtime (only if it can find and use the cuFile library) As RAPIDS takes the approach of building with the latest CUDA minor version for any major version (like 12.9 for the 12 series), cuFile is available at build time for all architectures. So building cuCIM with cuFile on CUDA 12.9 works unconditionally At runtime, cuCIM has two modes of using cuFile:
In RAPIDS we only ever build cuCIM with 2 ( Lines 64 to 68 in 16c07c6
With this approach, cuCIM will only use cuFile if it can cucim/cpp/src/filesystem/cufile_driver.cpp Lines 313 to 316 in 16c07c6
However if cuCIM cannot cucim/cpp/src/filesystem/cufile_driver.cpp Lines 332 to 339 in 16c07c6
So all we ever need do for old CUDA 12 ARM builds is drop the cuFile dependency at runtime, which is what this PR does Opening this up for review |
jakirkham
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few notes for context below
| build: | ||
| number: {{ GIT_DESCRIBE_NUMBER }} | ||
| string: cuda{{ cuda_major }}_{{ date_string }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }} | ||
| string: cuda{{ cuda_major }}_{{ date_string }}_gh{{ GIT_DESCRIBE_HASH }}_gn{{ GIT_DESCRIBE_NUMBER }}}_ph{{ PKG_HASH }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are building for two different variants of has_cufile, the package hash will change. However the rest of these values stay the same for each variant build. As a result the package name would be the same unless we add the package hash, which we do here
| - {{ compiler('cuda') }} | ||
| - cuda-cudart-dev | ||
| - libcufile-dev # [linux64] | ||
| - libcufile-dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to ignore the run_exports_from libcufile-dev in all cases where it is used as we already constrain the version via cuda-version
This avoids overly tight version constraints that otherwise cause solver errors when testing CUDA 12.9 packages on CUDA 12.0
| {% if aarch64 and cuda_major == "12" %} | ||
| # Relax `libcufile` dependency for old CUDA 12 ARM installs | ||
| - {{ pin_compatible('cuda-version', lower_bound='12.0', upper_bound='12.2') }} # [not has_cufile] | ||
| - {{ pin_compatible('cuda-version', lower_bound='12.2', upper_bound='13.0') }} # [has_cufile] | ||
| - libcufile # [has_cufile] | ||
| {% else %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ensures ARM packages that support CUDA 12.2+ still depend io cuFile whereas packages that require CUDA 12.0 or 12.1 packages on ARM don't require cuFile (as it is not given in this case)
jameslamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jakirkham ! The changes and comments all make sense to me, I'd be happy to see this go in.
Once it does, we can do rapidsai/integration#797 (adding the cucim dependency back to the rapids metapackage unconditionally)
bdice
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks fine to me.
|
Thanks all! 🙏 |
|
/merge |
|
Saw a flaky networking issue in one CI job: Mentioned offline. Went ahead and restarted this job |
42cf58f
into
rapidsai:branch-25.10
Previously cuCIM was dropped from CUDA 12 ARM as the packages unconditionally required cuFile, which caused issues when testing with CUDA 12.0 & 12.1. To address this, the cuCIM package was changed to only require cuFile in CUDA 12.2+ ARM packages while creating a separate package for ARM CUDA 12.0 & 12.1, which left out the cuFile dependency. This work was completed in PR: rapidsai/cucim#930 With this change now in place, it should be possible to readd cuCIM to the RAPIDS metapackage generally, which is what this PR does <hr> Reverts #796 Fixes #797 Authors: - https://github.com/jakirkham Approvers: - James Lamb (https://github.com/jameslamb) URL: #802
Following up on this thread: #905 (comment)
The CUDA Toolkit (CTK) added cuFile for ARM in CTK 12.2. As a result always requiring
libcufilewithlibcucimcan make it uninstallable on ARM with CTK 12.0 & 12.1.To fix this, build two variants of the
libcucimpackage. One withlibcufileand one without. Constraincuda-versionaccordingly so thatlibcufileis installed whenever possible. Only whenlibcufilecannot be installed, fallback to thelibcucimpackage where thelibcufiledependency is left off.The
x86_64CTK has had cuFile since the CTK 11.4. As RAPIDS now requires CUDA 12.0+, there is no issue always requiringlibcufileonx86_64.This issue will not be present in CUDA 13+. So always require
libcufilein that case.