-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUDA: Incorrect linkage with -fgpu-rdc on kernels created from lambdas inside anonymous namespaces #54560
Comments
It appears that we somehow failed to make the normally invisible symbol unique. That is indeed a bug.
Reproducer based on thrust is pretty hard to work with. :-( |
Do we actually have to make it unique? It should suffice to make it invisible.
Indeed. I'm still working on getting rid of Thrust here. Hopefully I can post a more minimal example by early next week. |
Yes. In order to launch the kernel from the host side, CUDA runtime needs to know the name of the kernel in the ELF executable on the GPU side. If the symbol is hidden, the symbol will not be found and the kernel launch will fail. This is where the way CUDA works conflicts with how C++ is expected to work. We try to work around it by making such symbols unique, which avoids conflicts during linking. But it also poses an issue for multiple template instantiations that should not have unique names because all instances are identical and the linker is supposed to pick only one of them. It's possible that that's the reason why we didn't make the function name unique and the problem is that NVIDIA's nvlink does not know how to deal with template instantiations (or, perhaps, we didn't generate the right PTX directives). Making all template instances unique would probably work, at the expense of bloating GPU executable size with multiple identical instances of the kernels. We'll see what exactly is going on once we have a smaller reproducer. |
It seems in general cuda-clang works for -fgpu-rdc with identical template kernel instantiation in different TU's. This could be a corner case where the ISA of template kernel instantiated in different TU's have some difference which triggered the nvlink error. If we can dump the .bc and nvptx file of the TU's and compare these two template kernel intantiation it may help. |
Thanks for the detailed explanation. I'm mainly familiar with how these things work in C++, so CUDA continues to puzzle me.
Here's an example without Thrust: template<typename F>
__global__ void generate(int * first, int * last, F f) {
if(threadIdx.x < last - first) {
first[threadIdx.x] = f(threadIdx.x);
}
}
namespace {
struct Stuff {
long c[2];
};
struct Thing {
void calc(int *data, int n, Stuff s) {
auto f = [s] __device__ (int i) -> int {
return 2*i;
};
generate<<<n/256+1, 256>>>(data, data + n, f);
}
};
}
void run(int * data, int n, Stuff s) {
Thing t;
t.calc(data, n, s);
} As seen in https://godbolt.org/z/G1Tfcdn3W, NVCC with The bug is not actually caused by templates. I can remove the template and directly put the kernel into the anonymous namespace and still get the nvlink error, though the PTX looks a bit different. As can be seen in https://godbolt.org/z/efGh6oa47, NVCC in this case prepends |
It's happening because the struct in both cases has the same name ( |
AFAICT clang does the right thing here, but we may still need to 'fix' it. Let's consider https://godbolt.org/z/efGh6oa47. If the code in the anonymous namespace were to live in a header file included from multiple TUs, would we expect the final GPU executable built with -fgpu-rdc to contain just one instance of
I suspect this error may be a secondary failure. E.g. the root cause is that nvlink does not ignore the second instance of the weak symbol and fails further down the road when it needs to construct some per-kernel data which it expected to be unique. What's relevant here is that nvlink apparently fails to deal with weak kernels. I do not know whether it's by design or a bug. Regardless of that, we'll probably need to do what NVCC does and make such kernel symbols unique, too.
You do want to launch that kernel from the host and the host needs a visible GPU-side symbol to refer to, even when that symbol is not needed for linking and would normally not even be visible if it were a C++ compilation. While the kernel can not be invoked from another GPU-side TU, it does effectively get invoked from the host side of the same TU and that's what breaks the C++ visibility model. On one hand, we do expect to have the same visibility within TU on both host and device sides (hence host-side kernel stub needs to see GPU-side kernel symbol), but we also need to keep GPU objects linkable and due to the way |
Ah, I forgot about that scenario. So that means that as long as CUDA has no concept of visibility, it is not possible to correctly represent this C++ as PTX. If we want to match NVCC and support my scenario, we need to make kernel names unique. We can keep the
The weak kernels with the non-unique name are actually different in my example. nvlink however only complains about that because their signatures are also different: .weak .entry _ZN12_GLOBAL__N_11fEPiiNS_5StuffE(
.param .u64 _ZN12_GLOBAL__N_11fEPiiNS_5StuffE_param_0,
.param .u32 _ZN12_GLOBAL__N_11fEPiiNS_5StuffE_param_1,
.param .align 8 .b8 _ZN12_GLOBAL__N_11fEPiiNS_5StuffE_param_2[16]
)
.weak .entry _ZN12_GLOBAL__N_11fEPiiNS_5StuffE(
.param .u64 _ZN12_GLOBAL__N_11fEPiiNS_5StuffE_param_0,
.param .u32 _ZN12_GLOBAL__N_11fEPiiNS_5StuffE_param_1,
.param .align 8 .b8 _ZN12_GLOBAL__N_11fEPiiNS_5StuffE_param_2[32]
) If the signature is the same (e.g. if the
Agreed. Is there anything I can do to help make it happen, or can you take care of it? |
In C++ program, for functions in an anonymous namespace, clang emits the function with internal linkage and a non-unique name. This indicates functions with the same name in the anonymous namespace in different TU's are not treated as one definition. For CUDA/HIP, we have to make kernel symbols visible to runtime, therefore kernels must have non-internal linkage. To avoid conflict with kernels with the same name in other TU's, we have to make the kernel name unique. We also need to let the host compilation know the unique name used in the device compilation. This is a similar situation we faced with making static device variables accessible to host compilation. We need to make their name unique and at the same time the unique name needs to be known at host compilation. We introduced CUID and externalized static device variables to solve the issue. It seems we could do that again with kernels in an anonymous namespace. BTW, we need to externalize device variables in an anonymous namespace too. |
I have a patch for fixing this issue: https://reviews.llvm.org/D123353 |
@llvm/issue-subscribers-clang-codegen |
Sorry for the late reply, I finally got around to testing the patch today. Unfortunately, it does not (fully) solve the issue. @yxsamliu, could you please reopen this ticket? There are now two problems:
namespace {
struct Stuff {
long c[2];
};
__global__ void f(int * data, int n, Stuff s) {
if(threadIdx.x < n) {
data[threadIdx.x] = 2*threadIdx.x;
}
}
void calc(int *data, int n, Stuff s) {
f<<<n/256+1, 256>>>(data, n, s);
}
}
void run(int * data, int n, Stuff s) {
calc(data, n, s);
} |
The second issue is easy fix. For the first issue, it seems a template instantiation with template argument type in anonymous name space will have internal linkage in C++, even though the template itself is not in anonymous name space. This is in contrary with ordinary template instantiation, which results in linkonce_odr linkage (https://godbolt.org/z/q765ehqxv). Therefore template instantiation with argument type in anonymous namespace does not follow ODR even if the type name is the same. Therefore we are justifiable to make them unique in CUDA/HIP. I will fix it. |
Fixed in https://reviews.llvm.org/D124189 |
@yxsamliu, thanks for taking care of this so quickly. I have confirmed that your latest patch resolves the issue with the above pieces of sample code and also with the larger codebase that initially led me to discover the issue. I hope you can merge this bugfix to master soon. What would it take to get it backported to the 14.x branch as well? |
kernels in anonymous name space needs to have unique name to avoid duplicate symbols. Fixes: llvm#54560 Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D123353
This patch is a continuation of https://reviews.llvm.org/D123353. Not only kernels in anonymous namespace, but also template kernels with template arguments in anonymous namespace need to be externalized. To be more generic, this patch checks the linkage of a kernel assuming the kernel does not have __global__ attribute. If the linkage is internal then clang will externalize it. This patch also fixes the postfix for externalized symbol since nvptx does not allow '.' in symbol name. Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D124189 Fixes: #54560
@tstellar what's the procedure for cherrypicking https://reviews.llvm.org/rG4ea1d435099f992cc16127619b0feb64e070630d and https://reviews.llvm.org/rG04fb81674ed7981397ffe70fe6a07b7168f6fe2f to 14.x ? Thanks. |
Failed to cherry-pick: 04fb816 https://github.com/llvm/llvm-project/actions/runs/2210241971 Please manually backport the fix and push it to your github fork. Once this is done, please add a comment like this:
|
kernels in anonymous name space needs to have unique name to avoid duplicate symbols. Fixes: llvm#54560 Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D123353 (cherry picked from commit 4ea1d43)
This patch is a continuation of https://reviews.llvm.org/D123353. Not only kernels in anonymous namespace, but also template kernels with template arguments in anonymous namespace need to be externalized. To be more generic, this patch checks the linkage of a kernel assuming the kernel does not have __global__ attribute. If the linkage is internal then clang will externalize it. This patch also fixes the postfix for externalized symbol since nvptx does not allow '.' in symbol name. Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D124189 Fixes: llvm#54560 (cherry picked from commit 04fb816)
Merged: 29f1039 |
kernels in anonymous name space needs to have unique name to avoid duplicate symbols. Fixes: llvm#54560 Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D123353 [CUDA][HIP] Externalize kernels with internal linkage This patch is a continuation of https://reviews.llvm.org/D123353. Not only kernels in anonymous namespace, but also template kernels with template arguments in anonymous namespace need to be externalized. To be more generic, this patch checks the linkage of a kernel assuming the kernel does not have __global__ attribute. If the linkage is internal then clang will externalize it. This patch also fixes the postfix for externalized symbol since nvptx does not allow '.' in symbol name. Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D124189 Fixes: llvm#54560 Fixes: SWDEV-335985 Change-Id: I97fae99bdc8b2b3eeb57e789aedcfe3bc8610706
kernels in anonymous name space needs to have unique name to avoid duplicate symbols. Fixes: llvm/llvm-project#54560 Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D123353
This patch is a continuation of https://reviews.llvm.org/D123353. Not only kernels in anonymous namespace, but also template kernels with template arguments in anonymous namespace need to be externalized. To be more generic, this patch checks the linkage of a kernel assuming the kernel does not have __global__ attribute. If the linkage is internal then clang will externalize it. This patch also fixes the postfix for externalized symbol since nvptx does not allow '.' in symbol name. Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D124189 Fixes: llvm/llvm-project#54560
kernels in anonymous name space needs to have unique name to avoid duplicate symbols. Fixes: llvm#54560 Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D123353 (cherry picked from commit 4ea1d43)
This patch is a continuation of https://reviews.llvm.org/D123353. Not only kernels in anonymous namespace, but also template kernels with template arguments in anonymous namespace need to be externalized. To be more generic, this patch checks the linkage of a kernel assuming the kernel does not have __global__ attribute. If the linkage is internal then clang will externalize it. This patch also fixes the postfix for externalized symbol since nvptx does not allow '.' in symbol name. Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D124189 Fixes: llvm#54560 (cherry picked from commit 04fb816)
kernels in anonymous name space needs to have unique name to avoid duplicate symbols. Fixes: llvm#54560 Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D123353 (cherry picked from commit 4ea1d43)
This patch is a continuation of https://reviews.llvm.org/D123353. Not only kernels in anonymous namespace, but also template kernels with template arguments in anonymous namespace need to be externalized. To be more generic, this patch checks the linkage of a kernel assuming the kernel does not have __global__ attribute. If the linkage is internal then clang will externalize it. This patch also fixes the postfix for externalized symbol since nvptx does not allow '.' in symbol name. Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D124189 Fixes: llvm#54560 (cherry picked from commit 04fb816)
kernels in anonymous name space needs to have unique name to avoid duplicate symbols. Fixes: llvm#54560 Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D123353 (cherry picked from commit 4ea1d43)
This patch is a continuation of https://reviews.llvm.org/D123353. Not only kernels in anonymous namespace, but also template kernels with template arguments in anonymous namespace need to be externalized. To be more generic, this patch checks the linkage of a kernel assuming the kernel does not have __global__ attribute. If the linkage is internal then clang will externalize it. This patch also fixes the postfix for externalized symbol since nvptx does not allow '.' in symbol name. Reviewed by: Artem Belevich Differential Revision: https://reviews.llvm.org/D124189 Fixes: llvm#54560 (cherry picked from commit 04fb816)
Summary
In the example below, the PTX assembly generated by Clang declares the kernel as
.weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_
. The same example compiled with NVCC generates.weak .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_968400945Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_
. In both cases, external weak linkage is used, which is not necessary since it is coming from inside a anonymous namespace. With NVCC, this is not a problem because the anonymous namespace is mangled to a unique name (60_GLOBAL__N__36_tmpxft_00006b02_00000000_6_b_cpp1_ii_96840094
). With Clang however, the name is not unique (mangled to12_GLOBAL__N_1
). This is a problem when passing the resulting object files tonvlink
, which will reportnvlink fatal error: Internal error: duplicate parameter bank data not same size
ornvlink error: Duplicate weak parameter bank for ...
depending on the CUDA version. To me, it seems like internal linkage (.entry
) instead of weak linkage (.weak .entry
) should be used in this case.Versions
I reproduced the bug with multiple Clang versions between 12.0.0 and 14.0.0. Before abd8cd9 by @yxsamliu, Clang would generate
.visible .entry
instead of.weak .entry
, which isn't any better and actually causes the example below to fail earlier onMultiple definition of '_ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv'()
.Potential workaround
It seems like the Clang option
-funique-internal-linkage-names
should be usable as a workaround that forces the symbol names to be unique, however across all Clang versions this just gives me various internal compiler errors. But that's a different issue.Working example
This example uses Thrust, which makes the symbol names very lengthy, but I am pretty sure the exact same behavior can also be observed by replacing
thrust::transform
with a hand-written kernel. The lambda capture seems to be important as just putting a kernel into an anonymous namespace is not sufficient to trigger the problem.a.cu:
b.cu:
Compile and link these as follows:
The second-to-last command will fail with the nvlink error given in the summary.
For comparison, compile and link with NVCC:
This will succeed.
LLVM IR
Unfortunately, Thrust is currently broken on clang trunk on godbolt.org, so I cannot post a link to the LLVM IR. Running clang locally, I can see that the IR uses
define weak_odr void @_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorIiNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN12_GLOBAL__N_15Thing4calcESA_iNSC_5StuffEEUliE_NS5_21always_true_predicateEEElEESH_lEEvT0_T1_(%...
. My guess is that it should probably use some variation ofprivate
orinternal
instead ofweak_odr
.The text was updated successfully, but these errors were encountered: