
[BUG] build failed on cuda 12 runtime #1136

Closed
pxLi opened this issue May 15, 2023 · 3 comments · Fixed by #1143
Labels: bug (Something isn't working), build

pxLi (Collaborator) commented on May 15, 2023

Describe the bug
The JNI build fails in recent nightly runs on the CUDA 12 runtime (building on CUDA 11 works fine).

[2023-05-13T03:32:07.425Z] [INFO]      [exec] -- The C compiler identification is GNU 11.2.1
[2023-05-13T03:32:07.425Z] [INFO]      [exec] -- The CXX compiler identification is GNU 11.2.1
[2023-05-13T03:32:08.364Z] [INFO]      [exec] -- The CUDA compiler identification is NVIDIA 12.0.140
...
[2023-05-13T03:32:10.136Z] [INFO]      [exec] -- Found CUDAToolkit: /usr/local/cuda/include (found version "12.0.140") 
...

[2023-05-13T04:17:38.652Z] [INFO]      [exec] /usr/local/cuda/include/thrust/iterator/transform_iterator.h:189:7:   required from 'class thrust::transform_iterator<__nv_dl_wrapper_t<__nv_dl_tag<std::pair<std::unique_ptr<cudf::column>, int> (*)(thrust::constant_iterator<long unsigned int, thrust::use_default, thrust::use_default>, thrust::constant_iterator<long unsigned int, thrust::use_default, thrust::use_default>, rmm::cuda_stream_view, rmm::mr::device_memory_resource*), cudf::detail::make_offsets_child_column<thrust::constant_iterator<long unsigned int, thrust::use_default, thrust::use_default> >, 1>, thrust::constant_iterator<long unsigned int, thrust::use_default, thrust::use_default>, int>, thrust::counting_iterator<int>, thrust::use_default, thrust::use_default>'
[2023-05-13T04:17:38.652Z] [INFO]      [exec] /tmp/tmpxft_0000492c_00000000-6_zorder.compute_90.cudafe1.stub.c:49:1088:   required from here
[2023-05-13T04:17:38.652Z] [INFO]      [exec] /usr/local/cuda/include/cuda/std/detail/libcxx/include/type_traits:4405:16: error: static assertion failed: Attempt to use an extended __device__ lambda in a context that requires querying its return type in host code. Use a named function object, a __host__ __device__ lambda, or cuda::proclaim_return_type instead.
[2023-05-13T04:17:38.652Z] [INFO]      [exec]  4405 |   static_assert(!__nv_is_extended_device_lambda_closure_type(_Fp),
[2023-05-13T04:17:38.652Z] [INFO]      [exec]       |               ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2023-05-13T04:17:38.652Z] [INFO]      [exec] /usr/local/cuda/include/cuda/std/detail/libcxx/include/type_traits:4405:16: note: '!(bool)__nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<__nv_dl_tag<std::pair<std::unique_ptr<cudf::column>, int> (*)(thrust::constant_iterator<long unsigned int, thrust::use_default, thrust::use_default>, thrust::constant_iterator<long unsigned int, thrust::use_default, thrust::use_default>, rmm::cuda_stream_view, rmm::mr::device_memory_resource*), cudf::detail::make_offsets_child_column<thrust::constant_iterator<long unsigned int, thrust::use_default, thrust::use_default> >, 1>, thrust::constant_iterator<long unsigned int, thrust::use_default, thrust::use_default>, int> >::value' evaluates to false
[2023-05-13T04:17:38.653Z] [INFO]      [exec] gmake[2]: *** [CMakeFiles/spark_rapids_jni.dir/build.make:244: CMakeFiles/spark_rapids_jni.dir/src/zorder.cu.o] Error 1
[2023-05-13T04:17:56.728Z] [INFO]      [exec] [ 62%] Linking CUDA shared library libcufaultinj.so
[2023-05-13T04:17:56.990Z] [INFO]      [exec] [ 62%] Built target cufaultinj
[2023-05-13T04:18:11.862Z] [INFO]      [exec] gmake[1]: *** [CMakeFiles/Makefile2:170: CMakeFiles/spark_rapids_jni.dir/all] Error 2
[2023-05-13T04:18:11.862Z] [INFO]      [exec] gmake: *** [Makefile:146: all] Error 2

Steps/Code to reproduce bug
Run the Maven build of the JNI in a CUDA 12 runtime.

Expected behavior
The build passes.

pxLi added the bug (Something isn't working), build, and ? - Needs Triage labels on May 15, 2023
bdice (Contributor) commented on May 15, 2023

@ttnghia asked me a question related to this offline, and linked to this issue. I decided to respond here so it's publicly visible.

CUDA 12 ships with a version of thrust (2.0.0 or newer, updated in this PR) that requires the use of the libcudacxx function cuda::proclaim_return_type for device lambdas if the return type of the lambda needs to be known on the host. Some reasons might include thrust needing to know the amount of (shared?) memory to allocate for a kernel. This is needed because of a bug in the CUDA compiler, where the host cannot always correctly deduce the return type of a device lambda. Incorrect type deduction leads to inscrutable errors (I can't remember if the errors happen at compile time or run time) or perhaps undefined behavior. @jrhemstad would be able to better describe the original motivation for raising this, and perhaps fix anything I've said incorrectly.

In RAPIDS / libcudf, we are currently pinned to an older libcudacxx that doesn't have this requirement. We do need to migrate to the newer libcudacxx for official CUDA 12 support -- older versions of libcudacxx like those in RAPIDS right now don't officially support CUDA 12, in my understanding (but they happen to work, it seems). I am working on adding CUDA 12 support in these PRs, and most of the necessary changes revolve around cuda::proclaim_return_type.

bdice (Contributor) commented on May 15, 2023

The short story is: figure out the return type of your device lambdas, and wrap them in cuda::proclaim_return_type<T>([] __device__ (...){ ... });.

e.g. this one:

[col = *output_dv_ptr,
num_columns,
data_type_size,
input = *input_dv] __device__ (cudf::size_type ret_idx) {

should look like:

thrust::for_each_n(
    rmm::exec_policy(stream),
    thrust::make_counting_iterator<cudf::size_type>(0),
    output_size,
    cuda::proclaim_return_type<void>(
      [col = *output_dv_ptr, 
       num_columns,
       data_type_size,
       input = *input_dv] __device__ (cudf::size_type ret_idx) {
       ...
     }));

But to do this, you may need to use a fairly new thrust/libcudacxx version to build with both CUDA 11 and CUDA 12.
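For reference, a minimal self-contained sketch of the same pattern (a hypothetical example, not the project's actual zorder kernel; it assumes a recent libcudacxx in which cuda::proclaim_return_type is provided by the <cuda/functional> header):

```cuda
#include <cuda/functional>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

int main() {
  thrust::device_vector<int> in(4, 21);
  thrust::device_vector<int> out(4);
  // Wrapping the extended __device__ lambda lets host-side code
  // (here, thrust::transform's dispatch machinery) query its return
  // type safely instead of tripping the static_assert above.
  thrust::transform(in.begin(), in.end(), out.begin(),
                    cuda::proclaim_return_type<int>(
                        [] __device__(int x) { return 2 * x; }));
  return 0;
}
```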

bdice (Contributor) commented on May 16, 2023

I discussed with @ttnghia. I propose the following fix:

  1. Fetch CCCL libraries (Thrust/libcudacxx) from rapids-cmake. This means you'd get CCCL versions that are newer than those in CUDA 11.8, but older than those in CUDA 12. You'd basically be hiding the problem for a short period of time, by using a version of Thrust that is slightly older than the one shipped in CUDA 12 (but doesn't check device lambdas for safety). This would align with what's currently happening in RAPIDS until we can upgrade to a newer version of CCCL.
  2. Then, when rapids-cmake updates to CCCL 2.1.0 (sometime in the near-ish future, perhaps mid-cycle for 23.08), you'll see this problem occur again in both CUDA 11 and CUDA 12 builds. At this point, you'll have to actually fix the problem by wrapping device lambdas in cuda::proclaim_return_type. This is the same set of changes I'm making in libcudf right now: rapidsai/cudf#13222 ("Test updates of CCCL (thrust, cub, libcudacxx) to 2.1.0").

To fetch CCCL libraries from source, you'd want to edit the CMakeLists.txt file:
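The actual snippet was not captured in this thread. As a purely hypothetical sketch (module paths and function arguments come from rapids-cmake's CPM modules and may differ by release; check the rapids-cmake documentation for the version in use), the edit would look something like:

```cmake
# Hypothetical sketch only -- exact include paths and arguments depend
# on the rapids-cmake release; consult its documentation.
include(${rapids-cmake-dir}/cpm/thrust.cmake)
include(${rapids-cmake-dir}/cpm/libcudacxx.cmake)

# Fetch Thrust under a project namespace (avoiding the CUDA Toolkit's
# bundled copy) and fetch libcudacxx alongside it.
rapids_cpm_thrust(NAMESPACE spark_rapids_jni)
rapids_cpm_libcudacxx()
```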
