Dispatch mechanism may break when any two libraries that use CUB and/or thrust have been compiled for different sets of GPU architectures #545
My only concern regarding making the kernels `static`:

```cpp
#pragma once

#ifdef STATIC
#define SPECIFIER static
#else
#define SPECIFIER
#endif

template <class T>
SPECIFIER __global__ void kernel(){}
```

I have two TUs that use the same kernel call:

```
nvcc tu_1.cu tu_2.cu main.cu
cuobjdump --dump-sass a.out | rg Function
Function : _Z6kernelIiEvv
Function : _Z6kernelIiEvv
```

```
nvcc -DSTATIC tu_1.cu tu_2.cu main.cu
cuobjdump --dump-sass a.out | rg Function
Function : _Z6kernelIiEvv
Function : _Z6kernelIiEvv
```

but when you provide `-rdc=true`:

```
cuobjdump --dump-sass a.out | rg Function
Function : _Z6kernelIiEvv
```

```
nvcc -DSTATIC -rdc=true tu_1.cu tu_2.cu main.cu
cuobjdump --dump-sass a.out | rg Function
Function : __nv_static_27__91103086_7_tu_1_cu__Z3foov__Z6kernelIiEvv
Function : __nv_static_27__83a59f68_7_tu_2_cu__Z3barv__Z6kernelIiEvv
```

So we'd have a kernel per TU in applications that use CUB. Moreover, I believe that the behaviour of `nvc++` is relevant here as well:

```
nvc++ tu_1.cu tu_2.cu main.cu
cuobjdump --dump-sass a.out | rg Function
Function : _Z6kernelIiEvv
```

```
nvc++ -DSTATIC tu_1.cu tu_2.cu main.cu
cuobjdump --dump-sass a.out | rg Function
Function : _ZN27_INTERNAL_7_tu_1_cu__Z3foov6kernelIiEEvv
Function : _ZN27_INTERNAL_7_tu_2_cu__Z3barv6kernelIiEEvv
```

---
Let me make sure I'm following what's going on here.

Is that right?

---
This piqued my curiosity and I went far down a rabbit hole. TL;DR: There is something extremely odd going on here that I don't understand, and just making the kernel `static` does not appear to be sufficient.

I captured my repro and results so far here: https://github.com/jrhemstad/cuda_arch_odr

The only thing that seems to work robustly is to make the linkage of both the kernel and the enclosing function internal.

---
That's exactly right.

---
Thanks for the reproducer and for summarising the results. This highlights that we want to be careful and thoroughly verify whichever solution we identify as a candidate. In the case of your repro, I believe that Otherwise - and for simplicity, let's assume However, it seems that if there's no ODR-use of the

This is the reason why declaring the kernel

However, after adding an algorithm invocation to

Also, I believe that means that the full call path (e.g., Similarly, we need to be careful about not querying

---
If you add

---
The following describes a problem observed in more "complex" software projects, where different components (or libraries) use CUB and/or thrust without separating CUB and/or thrust through namespace customisation. This issue may be observed when linked libraries include CUB and/or thrust, even if the libraries' dependency on CUB and/or thrust is not apparent to the library user.
## Is this the issue that I'm having?

If you are seeing one of the following:

- `merge_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument`
- `cuda-memcheck` or `compute-sanitizer --tool memcheck` reports out-of-bounds global memory reads or global memory writes (into `temporary_storage`) within a CUB (or thrust) kernel
- `cudaErrorInvalidValue: invalid argument` thrown from a thrust algorithm

## The root cause
### Situation

CUB parameterises its kernels on meta parameters that are tuned for a given GPU architecture (`__CUDA_ARCH__`). Such meta parameters are parameters like `BLOCK_THREADS` (the number of threads per thread block), `ITEMS_PER_THREAD` (the number of items processed by each thread), etc.

On the host side, the dispatch mechanism has to launch the kernel with the correct block size (corresponding to the kernel's `BLOCK_THREADS`) and the correct grid size. These run-time parameters need to match the parameters of the kernel that will actually get launched. To determine them, the dispatch mechanism uses `cudaFuncGetAttributes` on `cub::EmptyKernel` to query the closest architecture for which `EmptyKernel` was compiled, assuming that `EmptyKernel` has been compiled for exactly the same set of architectures as the kernels actually implementing the various algorithms (which usually is the case).
### Problem

CUB's kernels have weak external linkage. All kernels from all translation units being linked end up in the binary's fatbin. If there are multiple candidates for a kernel, the CUDA runtime seems to choose any qualifying candidate "at random".
**compilation**

```
nvcc -c -gencode arch=compute_52,code=compute_52 my_lib.cu
nvcc -c -gencode arch=compute_70,code=compute_70 main.cu
nvcc -o sort_test my_lib.o main.o && compute-sanitizer --tool memcheck ./sort_test
```
**my_lib.cu**

**main.cu**

**output**

Running on a V100
## Potential Solutions

Declare the CUB kernels `static`. Making sure that CUB kernels in translation unit A won't interfere with the kernels in translation unit B would be a viable solution. We currently have all the kernels from both translation units in the linked binary anyway. See the `cuobjdump` output below for the above code example.

**List of issues that may be linked to this root cause:**