Conversation

@jakirkham
Member

Following up on this thread: #905 (comment)

The CUDA Toolkit (CTK) added cuFile for ARM in CTK 12.2. As a result, always requiring libcufile with libcucim can make libcucim uninstallable on ARM with CTK 12.0 & 12.1.

To fix this, build two variants of the libcucim package: one with libcufile and one without. Constrain cuda-version accordingly so that libcufile is installed whenever possible. Only when libcufile cannot be installed does the solver fall back to the libcucim package that leaves off the libcufile dependency.

The x86_64 CTK has had cuFile since CTK 11.4. As RAPIDS now requires CUDA 12.0+, there is no issue always requiring libcufile on x86_64.

This issue is not present in CUDA 13+, so libcufile is always required in that case.

@jakirkham jakirkham requested review from a team as code owners August 22, 2025 04:25
@jakirkham jakirkham requested a review from jameslamb August 22, 2025 04:25
@jakirkham jakirkham added bug Something isn't working non-breaking Introduces a non-breaking change labels Aug 22, 2025
@jakirkham jakirkham force-pushed the fix_dep_cufile_arm branch 3 times, most recently from a29ea1b to 15e235d Compare August 22, 2025 05:18
The CUDA Toolkit will either have `cufile.h` or not. If it has it, we
can build support for cuFile. If it doesn't, we can skip it entirely.
#else
#include "cufile_stub_types.h"
#endif
#include <cufile.h>
Member Author

Trying to #include "cufile_stub_types.h" causes the following error on CI:

In file included from /opt/conda/conda-bld/work/gds/src/cufile_stub.cpp:17:
/opt/conda/conda-bld/work/gds/include/cufile_stub.h:23:14: fatal error: cufile_stub_types.h: No such file or directory
   23 |     #include "cufile_stub_types.h"
      |              ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.

Given that the CUDA Toolkit either is new enough to have cuFile or it isn't, we can simply check for cuFile and build with or without it

Unsure what we gain by having this in-between mode. Though would be happy to learn if I'm missing something

@jakirkham jakirkham requested a review from a team as a code owner August 22, 2025 05:59
@jakirkham jakirkham marked this pull request as draft August 22, 2025 06:24
@copy-pr-bot

copy-pr-bot bot commented Aug 22, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


rapids-bot bot pushed a commit to rapidsai/integration that referenced this pull request Sep 5, 2025
`libcucim` is not installable at the moment on arm64 systems with CUDA < 12.2:

* https://github.com/rapidsai/cucim/pull/905/files#r2292609197
* rapidsai/cucim#930

Which is blocking builds of RAPIDS docker images:

* rapidsai/docker#782 (comment)

This proposes **temporarily** excluding `cucim` from the dependencies of arm64 CUDA 12 `rapids` packages, to unblock `docker` CI (and therefore publication of the first CUDA 13 nightlies of those images) until that `cucim` issue is resolved.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)

URL: #796
@jameslamb
Member

/ok to test

Comment on lines +30 to +39
cat > extra_variants.yaml <<EOF
has_cufile:
- True
EOF
if [[ "$(arch)" == "aarch64" ]] && [[ "${RAPIDS_CUDA_VERSION%%.*}" == "12" ]]
then
cat >> extra_variants.yaml <<EOF
- False
EOF
fi
Contributor

Looks like the current logic will have both in the aarch64+CUDA12 case?

has_cufile:
  - True
  - False
Suggested change
cat > extra_variants.yaml <<EOF
has_cufile:
- True
EOF
if [[ "$(arch)" == "aarch64" ]] && [[ "${RAPIDS_CUDA_VERSION%%.*}" == "12" ]]
then
cat >> extra_variants.yaml <<EOF
- False
EOF
fi
if [[ "$(arch)" == "aarch64" ]] && [[ "${RAPIDS_CUDA_VERSION%%.*}" == "12" ]]; then
cat > extra_variants.yaml <<EOF
has_cufile:
- False
EOF
else
cat > extra_variants.yaml <<EOF
has_cufile:
- True
EOF
fi

Contributor

Let me commit this and see if it helps

Member Author

@jakirkham jakirkham Sep 18, 2025

Yes this is intended. We want to build once for each of these values, which is what the original logic does

Also for CUDA 13+ or x86_64 builds we want this to only be True

The CI failure is due to something else unrelated to this syntax. Please see this comment: #930 (comment)

Have gone ahead and restored the original logic. Though happy to chat more if you have additional questions 🙂
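The effect of the restored logic can be simulated for each case. In this sketch the `ARCH` and `CUDA_MAJOR` parameters stand in for `$(arch)` and `${RAPIDS_CUDA_VERSION%%.*}` from the real CI script, and the output paths are arbitrary.

```shell
#!/bin/bash
# Sketch of the restored variant-file logic. ARCH and CUDA_MAJOR stand in
# for "$(arch)" and "${RAPIDS_CUDA_VERSION%%.*}" used in the real script.
write_variants() {
    local ARCH="$1" CUDA_MAJOR="$2" out="$3"
    cat > "${out}" <<EOF
has_cufile:
- True
EOF
    # Only aarch64 + CUDA 12 also builds the no-cuFile variant
    if [[ "${ARCH}" == "aarch64" ]] && [[ "${CUDA_MAJOR}" == "12" ]]; then
        cat >> "${out}" <<EOF
- False
EOF
    fi
}

write_variants aarch64 12 /tmp/variants_arm_cuda12.yaml  # True and False
write_variants x86_64  12 /tmp/variants_x86_cuda12.yaml  # True only
write_variants aarch64 13 /tmp/variants_arm_cuda13.yaml  # True only
```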

@grlee77
Contributor

grlee77 commented Sep 9, 2025

/ok to test

@jakirkham
Member Author

jakirkham commented Sep 17, 2025

The error in the CI log points to cuFile being used in the build and not being found

In file included from /opt/conda/conda-bld/work/gds/src/cufile_stub.cpp:17:
/opt/conda/conda-bld/work/gds/include/cufile_stub.h:19:10: fatal error: cufile.h: No such file or directory
   19 | #include <cufile.h>
      |          ^~~~~~~~~~
compilation terminated.

Will either need to...

  1. Always build with libcufile-dev and simply selectively drop the runtime dependency on libcufile
  2. Rework the build logic to handle cuFile being absent

Personally would lean towards 1 as that should be less effort. Plus the effort in 2 becomes irrelevant once CUDA 12.0 & 12.1 are dropped from support

Member Author

@jakirkham jakirkham left a comment

Have added option 1 below as a suggestion

Option 2 likely involves more changes, possibly to parts we haven't yet touched

@jakirkham
Member Author

/ok to test

@jakirkham
Member Author

Pushed the option 1 approach. Let's see how that goes

@jakirkham
Member Author

/ok to test

@jakirkham
Member Author

/ok to test

@jakirkham
Member Author

/ok to test

@jakirkham
Member Author

/ok to test

To ensure the different cuCIM packages with and without cuFile are both
available, they need different package strings. As the new variant,
`has_cufile`, is included in the package hash, make sure to include the
package hash in the package string. That way each package is named
differently and they don't clobber each other, ensuring both packages
are uploaded rather than only whichever one was built last.
@jakirkham
Member Author

/ok to test

@jakirkham
Member Author

At this point this PR has the needed pieces to address the issue

At a high level, cuCIM is designed to always build with cuFile and optionally use it at runtime (only if it can find and load the cuFile library)

As RAPIDS takes the approach of building with the latest CUDA minor version for any major version (like 12.9 for the 12 series), cuFile is available at build time for all architectures. So building cuCIM with cuFile on CUDA 12.9 works unconditionally

At runtime, cuCIM has two modes of using cuFile:

  1. Statically linked cuFile
  2. dlopen to load the library

In RAPIDS we only ever build cuCIM with 2 (dlopen)

handle_ = cucim::dynlib::load_library(
    {
        "libcufile.so.0",
    },
    RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);

With this approach, cuCIM will only use cuFile if it can dlopen it

CUfileError_t status = cuFileDriverOpen();
if (status.err == CU_FILE_SUCCESS)
{
    is_available_ = true;
However if cuCIM cannot dlopen cuFile, it falls back to normal POSIX IO

else
{
    is_available_ = false;
    max_device_cache_size_ = DEFAULT_MAX_CACHE_SIZE;
    max_host_cache_size_ = DEFAULT_MAX_CACHE_SIZE;
    // fmt::print(stderr, "[Warning] CuFileDriver cannot be open. Falling back to use POSIX file IO APIs.\n");
}

So all we ever need to do for old CUDA 12 ARM builds is drop the cuFile dependency at runtime, which is what this PR does

Opening this up for review

@jakirkham jakirkham marked this pull request as ready for review September 18, 2025 07:33
Member Author

@jakirkham jakirkham left a comment

Added a few notes for context below

build:
number: {{ GIT_DESCRIBE_NUMBER }}
string: cuda{{ cuda_major }}_{{ date_string }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }}
string: cuda{{ cuda_major }}_{{ date_string }}_gh{{ GIT_DESCRIBE_HASH }}_gn{{ GIT_DESCRIBE_NUMBER }}_ph{{ PKG_HASH }}
Member Author

Since we are building for two different variants of has_cufile, the package hash will change. However, the rest of these values stay the same for each variant build. As a result, the package name would be the same unless we add the package hash, which we do here

- {{ compiler('cuda') }}
- cuda-cudart-dev
- libcufile-dev # [linux64]
- libcufile-dev
Member Author

We want to ignore the run_exports from libcufile-dev in all cases where it is used, as we already constrain the version via cuda-version

This avoids overly tight version constraints that would otherwise cause solver errors when testing CUDA 12.9 packages on CUDA 12.0

Comment on lines +63 to +68
{% if aarch64 and cuda_major == "12" %}
# Relax `libcufile` dependency for old CUDA 12 ARM installs
- {{ pin_compatible('cuda-version', lower_bound='12.0', upper_bound='12.2') }} # [not has_cufile]
- {{ pin_compatible('cuda-version', lower_bound='12.2', upper_bound='13.0') }} # [has_cufile]
- libcufile # [has_cufile]
{% else %}
Member Author

This ensures ARM packages that support CUDA 12.2+ still depend on cuFile, whereas ARM packages that only support CUDA 12.0 or 12.1 don't require cuFile (as it is not available in that case)

Member

@jameslamb jameslamb left a comment

Thank you @jakirkham ! The changes and comments all make sense to me, I'd be happy to see this go in.

Once it does, we can do rapidsai/integration#797 (adding the cucim dependency back to the rapids metapackage unconditionally)

Contributor

@bdice bdice left a comment

Generally looks fine to me.

@jakirkham
Member Author

Thanks all! 🙏

@jakirkham
Member Author

/merge

@jakirkham
Member Author

Saw a flaky networking issue in one CI job:

Get:22 http://ports.ubuntu.com/ubuntu-ports jammy-security/restricted arm64 Packages [5000 kB]
Fetched 38.1 MB in 32s (1181 kB/s)
Reading package lists...
E: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/dists/jammy-security/main/binary-arm64/Packages.gz  File has unexpected size (3038653 != 3054887). Mirror sync in progress? [IP: 91.189.91.104 80]
   Hashes of expected file:
    - Filesize:3054887 [weak]
    - SHA256:04e8cacff94dccf44b0d14079f4f896262844739a764aa6333f42a0678c1195a
    - SHA1:447449b9caef6590e3ad69ce93872cebed2c302a [weak]
    - MD5Sum:451978ccf71438e21b7cbaf35ca1390d [weak]
   Release file created at: Thu, 18 Sep 2025 13:17:15 +0000
E: Some index files failed to download. They have been ignored, or old ones used instead.

Mentioned offline. Went ahead and restarted this job

@rapids-bot rapids-bot bot merged commit 42cf58f into rapidsai:branch-25.10 Sep 18, 2025
114 of 115 checks passed
@jakirkham jakirkham deleted the fix_dep_cufile_arm branch September 18, 2025 20:10
rapids-bot bot pushed a commit to rapidsai/integration that referenced this pull request Sep 18, 2025
Previously cuCIM was dropped from CUDA 12 ARM as the packages unconditionally required cuFile, which caused issues when testing with CUDA 12.0 & 12.1. To address this, the cuCIM package was changed to only require cuFile in CUDA 12.2+ ARM packages while creating a separate package for ARM CUDA 12.0 & 12.1, which left out the cuFile dependency. This work was completed in PR: rapidsai/cucim#930

With this change now in place, it should be possible to re-add cuCIM to the RAPIDS metapackage generally, which is what this PR does

<hr>

Reverts #796
Fixes #797

Authors:
  - https://github.com/jakirkham

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #802