Conversation

@jameslamb (Member) commented Aug 22, 2025

Contributes to rapidsai/build-planning#208

  • uses CUDA 13.0.0 to build and test
  • adds CUDA 13 devcontainers
  • moves some dependency pins:
    • cuda-python: >=12.9.2 (CUDA 12), >=13.0.1 (CUDA 13)
    • cupy: >=13.6.0

Contributes to rapidsai/build-planning#68

  • updates to CUDA 13 dependencies in fallback entries in dependencies.yaml matrices (i.e., the ones that get written to pyproject.toml in source control)

Notes for Reviewers

This switches GitHub Actions workflows to the cuda13.0 branch from here: rapidsai/shared-workflows#413

A future round of PRs will revert that back to branch-25.10, once all of RAPIDS supports CUDA 13.

What about Go, Java, and Rust?

This PR expands building / testing for those bindings to cover CUDA 13, but more changes probably need to be made to support distributing packages for those.

Proposing to defer that to follow-up PRs.

copy-pr-bot bot commented Aug 22, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@jameslamb (Member Author)

/ok to test

@jameslamb (Member Author)

The first runs all failed with some form of "need RAFT packages" error, so we haven't learned anything else from this yet.

build link: https://github.com/rapidsai/cuvs/actions/runs/17145663792/job/48641578548

Will update this once rapidsai/raft#2787 is merged and we have RAFT packages.

@jameslamb (Member Author)

/ok to test

@jameslamb (Member Author)

Happy to say that the wheels are looking good, so libcuvs.so CUDA 13 support is probably close!

But some benchmarks are failing to compile, like this:

 │ │ [680/773] Building CXX object bench/ann/CMakeFiles/HNSWLIB_ANN_BENCH.dir/src/hnswlib/hnswlib_benchmark.cpp.o
 │ │ FAILED: [code=1] bench/ann/CMakeFiles/HNSWLIB_ANN_BENCH.dir/src/hnswlib/hnswlib_benchmark.cpp.o 
 │ │ sccache $BUILD_PREFIX/bin/aarch64-conda-linux-gnu-c++ -DANN_BENCH_BUILD_MAIN> -DBENCHMARK_STATIC_DEFINE -DCUVS_ANN_BENCH_USE_HNSWLIB=CUVS_ANN_BENCH_USE_HNSWLIB -I$SRC_DIR/cpp/include -I$SRC_DIR/cpp/build/_deps/benchmark-src/include -I$SRC_DIR/cpp/build/_deps/hnswlib-src -isystem $BUILD_PREFIX/targets/sbsa-linux/include/cccl -isystem /opt/conda/include -isystem $BUILD_PREFIX/include -fvisibility-inlines-hidden -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem $PREFIX/include   -I$PREFIX/targets/sbsa-linux/include -I$PREFIX/targets/sbsa-linux/include/cccl -I$BUILD_PREFIX/targets/sbsa-linux/include -I$BUILD_PREFIX/targets/sbsa-linux/include/cccl -O3 -DNDEBUG -std=gnu++17 -fPIE -pthread -fopenmp "-ffile-prefix-map=$PREFIX/=''" -MD -MT bench/ann/CMakeFiles/HNSWLIB_ANN_BENCH.dir/src/hnswlib/hnswlib_benchmark.cpp.o -MF bench/ann/CMakeFiles/HNSWLIB_ANN_BENCH.dir/src/hnswlib/hnswlib_benchmark.cpp.o.d -o bench/ann/CMakeFiles/HNSWLIB_ANN_BENCH.dir/src/hnswlib/hnswlib_benchmark.cpp.o -c $SRC_DIR/cpp/bench/ann/src/hnswlib/hnswlib_benchmark.cpp
 │ │ In file included from $SRC_DIR/cpp/bench/ann/src/hnswlib/hnswlib_wrapper.h:20,
 │ │                  from $SRC_DIR/cpp/bench/ann/src/hnswlib/hnswlib_benchmark.cpp:18:
 │ │ $SRC_DIR/cpp/bench/ann/src/hnswlib/../common/util.hpp: In function 'auto cuvs::bench::cuda_info()':
 │ │ $SRC_DIR/cpp/bench/ann/src/hnswlib/../common/util.hpp:451:64: error: 'struct cudaDeviceProp' has no member named 'clockRate'
 │ │   451 |   props.emplace_back("gpu_sm_freq", std::to_string(device_prop.clockRate * 1e3));
 │ │       |                                                                ^~~~~~~~~
 │ │ $SRC_DIR/cpp/bench/ann/src/hnswlib/../common/util.hpp:452:65: error: 'struct cudaDeviceProp' has no member named 'memoryClockRate'
 │ │   452 |   props.emplace_back("gpu_mem_freq", std::to_string(device_prop.memoryClockRate * 1e3));
 │ │       |                                                                 ^~~~~~~~~~~~~~~

(conda-cpp-build link)

@jameslamb (Member Author)

Updated to the latest branch-25.10 and pushed this patch from @robertmaynard: d122856

Let's see how that goes. Should get faster feedback now that we've mostly filled up the sccache caches for CUDA 13 builds here 😁

@jameslamb (Member Author)

/ok to test

int clockRate = 0;
int memoryClockRate = 0;
err_code = cudaDeviceGetAttribute(&clockRate, cudaDevAttrClockRate, dev);
err_code = cudaDeviceGetAttribute(&memoryClockRate, cudaDevAttrMemoryClockRate, dev);
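// Context for the two calls above: CUDA 13 removes cudaDeviceProp::clockRate and
// cudaDeviceProp::memoryClockRate (see the hnswlib benchmark error earlier in this thread),
// so the GPU clock rates are queried via cudaDeviceGetAttribute instead.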
Member Author

The Java builds are failing like this:

[INFO] -------------------------------------------------------------
Error:  /__w/cuvs/cuvs/java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/common/Util.java:[192,23] cannot find symbol
  symbol:   method cudaGetDeviceProperties_v2(java.lang.foreign.MemorySegment,int)
  location: class com.nvidia.cuvs.internal.common.Util
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  13.340 s
[INFO] Finished at: 2025-08-22T21:20:16Z
[INFO] ------------------------------------------------------------------------
Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.13.0:compile (compile-java-22) on project cuvs-java: Compilation failure
Error:  /__w/cuvs/cuvs/java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/common/Util.java:[192,23] cannot find symbol
Error:    symbol:   method cudaGetDeviceProperties_v2(java.lang.foreign.MemorySegment,int)
Error:    location: class com.nvidia.cuvs.internal.common.Util

(build link)

@mythrocks, do the Java bindings need to be updated? If so, could you do that and push the changes to this PR?

I don't have Java set up in my development environment.

@mythrocks (Contributor) commented Aug 23, 2025

Hmm. This is exactly the sort of error that I'm running into in my work on the portable fat-jar PR, except I'm building on 12.9.

I think this is an artifact of some strange jextract behaviour.

Thanks for the confirmation; so I'm not imagining it.

I'll dig into this more.

Contributor

As a first step, I would try to see whether cudaGetDeviceProperties_v2(java.lang.foreign.MemorySegment,int) is defined and what signature it has.
If this passes on 12.9 but fails consistently on 13.0, I suspect this is a header issue.
We generate function bindings for Java using jextract, but that may or may not be the issue. I suspect the CUDA headers #define cudaGetDeviceProperties as "something", and that "something" has a different signature in 12 vs. 13. Just a guess; let me see if I can validate it.

Contributor

(Java, like the other non-C languages, can bind to C functions but of course not to macros; that's why we need to call cudaGetDeviceProperties_v2 directly. Again, this assumes my guesswork is correct and cudaGetDeviceProperties is indeed a macro.)

Contributor

Yep, I see #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 in my cuda_runtime_api.h. Now let me see what it looks like in 13.0.

Member Author

Looks like this was broken by #1267

Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.13.0:compile (compile-java-22) on project cuvs-java: Compilation failure
Error:  /__w/cuvs/cuvs/java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/GPUInfoProviderImpl.java:[55,25] cannot find symbol
Error:    symbol:   method cudaGetDeviceProperties_v2(java.lang.foreign.MemorySegment,int)
Error:    location: class com.nvidia.cuvs.internal.GPUInfoProviderImpl.AvailableGpuInitializer
Error:  
Error:  -> [Help 1]

(conda-java-tests link)

I'll try to push a fix.

Member Author

I tried something here: 104508b

Would appreciate your input on it, @ldematte @mythrocks. If you take a look and it seems wrong, please feel free to push to my branch here. I'd really like to get this PR in as soon as possible, so we can move on to other projects in RAPIDS that depend on cuvs.

Member Author

OK, one more commit (8530e22), because I missed that it was a static method (sorry, I haven't written anything for the JVM in a while 😅).

And it looks like that works!

I see this as a minor extension of logic you all were already ok with putting in temporarily, so I'm planning to merge this once CI passes. Thanks again for all the help!

Contributor

I'm not sure this was borked by #1267, exactly. I thought your change here would've dealt with it.

That said, your fix seems right. +1 to checking this in on passing CI.

@jameslamb (Member Author) commented Aug 29, 2025

Thanks!

I think it was. We found in this thread that cudaGetDeviceProperties_v2() could not be used unconditionally, and #1267 introduced code that uses it unconditionally:

returnValue = cudaGetDeviceProperties_v2(deviceProp, i);
checkCudaError(returnValue, "cudaGetDeviceProperties_v2");

@jameslamb (Member Author)

/ok to test

@jameslamb (Member Author)

/ok to test

@jameslamb jameslamb requested a review from divyegala August 27, 2025 05:44
@msarahan (Contributor) left a comment

Approving for code-owners.

@jameslamb (Member Author)

After merging in recent changes, I now see build failures for conda-cpp-build jobs on CUDA 13:

 │ │ $SRC_DIR/cpp/include/cuvs/neighbors/common.hpp(465): error #20011-D: calling a __host__ function("cuvs::neighbors::filtering::base_filter::~base_filter()") from a __host__ __device__ function("cuvs::neighbors::filtering::base_filter::~base_filter [subobject]") is not allowed
 │ │ Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
 │ │ 1 error detected in the compilation of "$SRC_DIR/cpp/src/neighbors/cagra_search_uint8.cu".
 │ │ [4/766] Building CUDA object CMakeFiles/cuvs-cagra-search.dir/src/neighbors/cagra_search_int8.cu.o
 │ │ FAILED: [code=1] CMakeFiles/cuvs-cagra-search.dir/src/neighbors/cagra_search_int8.cu.o 

(build link)

I strongly suspect that's a result of #1211. @jinsolp could you please take a look? There are CUDA 13 devcontainers on this branch that you might find helpful for investigating.

@mythrocks (Contributor) left a comment

Approving for Java-owners. @jameslamb: Good work, mate.

@divyegala (Member) left a comment

Just a single comment, pre-approving. Thanks!

@robertmaynard (Contributor)

> After merging in recent changes, I now see build failures for conda-cpp-build jobs on CUDA 13: [...] I strongly suspect that's a result of #1211.

Yes, absolutely caused by #1211, since it removed the changes I made in #1219 to allow for CUDA 13 compilation.

@jameslamb (Member Author)

Thanks for looking and confirming, @robertmaynard! If you know how to fix this, could you push a patch to my branch here?

@mythrocks (Contributor)

CI seems to have borked on an intermittent test failure. I'm going to re-kick it.

@jameslamb (Member Author)

Thanks very much @mythrocks for re-running CI and to you and everyone else for your help on this!!!!

I'm going to merge this; @ me any time if it seems like it caused issues.

@jameslamb (Member Author)

/merge

@rapids-bot rapids-bot bot merged commit 55985fe into rapidsai:branch-25.10 Aug 29, 2025
156 of 164 checks passed
@jameslamb jameslamb deleted the cuda-13.0.0 branch August 29, 2025 13:37
rapids-bot bot pushed a commit to rapidsai/cuvs-lucene that referenced this pull request Sep 2, 2025
Contributes to rapidsai/build-planning#208

* uses CUDA 13.0.0 to build and test (using the same patterns from the `cuvs-java` tests, in rapidsai/cuvs#1273)

## Notes for Reviewers

This switches GitHub Actions workflows to the `cuda13.0` branch from here: rapidsai/shared-workflows#413

A future round of PRs will revert that back to `branch-25.10`, once all of RAPIDS supports CUDA 13.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - Ben Frederickson (https://github.com/benfred)
  - rhdong (https://github.com/rhdong)

URL: #20
rapids-bot bot pushed a commit that referenced this pull request Sep 11, 2025
…based on symbol presence (#1323)

In #1273 we addressed a signature change in CUDA 13 by binding different symbols based on an environment variable, `RAPIDS_CUDA_MAJOR`. That works, but it forces users of cuvs-java with CUDA 12 to define this environment variable.

This PR improves on it by making the symbol lookup dynamic, looking for the CUDA 13 symbol name, and falling back to the CUDA 12 exported name if we fail to locate the first.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: #1323
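
To make the mechanism concrete, here is a minimal java.lang.foreign sketch of such a lookup-with-fallback. It is not the actual cuvs-java code: the class name, the libcudart.so library lookup, and the error handling are illustrative, and it assumes, per the discussion above, that CUDA 13 exports the unsuffixed cudaGetDeviceProperties while CUDA 12 exports cudaGetDeviceProperties_v2.

import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical sketch: bind whichever symbol the installed CUDA runtime actually exports.
final class DevicePropertiesHandle {
  private static final Linker LINKER = Linker.nativeLinker();
  private static final SymbolLookup CUDART =
      SymbolLookup.libraryLookup("libcudart.so", Arena.global());

  // Try the CUDA 13 name first, then fall back to the name exported by CUDA 12.
  private static final MemorySegment SYMBOL = CUDART.find("cudaGetDeviceProperties")
      .or(() -> CUDART.find("cudaGetDeviceProperties_v2"))
      .orElseThrow(() -> new UnsatisfiedLinkError("no cudaGetDeviceProperties symbol found"));

  // cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int device)
  private static final MethodHandle GET_DEVICE_PROPERTIES = LINKER.downcallHandle(
      SYMBOL, FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT));

  static int getDeviceProperties(MemorySegment props, int device) throws Throwable {
    return (int) GET_DEVICE_PROPERTIES.invokeExact(props, device);
  }
}

The benefit over the RAPIDS_CUDA_MAJOR environment-variable approach is that no user configuration is needed; whichever symbol the installed runtime exports gets bound at startup.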

Labels: improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
