[AMDGPU] Creating relocatable object (-r) from rdc objects (-fgpu-rdc) fails with lld error attempted static link of dynamic object in /opt/rocm-6.0.0/lib #77018
@llvm/issue-subscribers-clang-driver Author: Mike Pozulp (pozulp)
Hey @arsenm and @jdoerfert, how do I generate a relocatable object (-r) for the amdgpu target? I am linking a large code containing a few million lines of C++ with an optional library dependency containing about 300,000 lines of C++. The library requires relocatable device code (-fgpu-rdc) because it has many kernels which reference device functions defined in separate translation units. The large code does not. A driver for the library links in 30 minutes. The large code takes 2 minutes to link without the optional library and over 8 hours with the library (the lld process is still running after 8 hours). I don't want to use rdc to link the large code, but I have to because of the optional library: if even a single object needs rdc, then the link needs it too. Perhaps an intermediate step between compiling the library and linking the large code, in which I generate a relocatable object (-r) from the rdc-compiled library, would allow me to link the large code without rdc even when I'm using the optional library.
x86+LTO (good)
Consider using LTO to target x86, which works as expected. During compilation, clang -flto emits LLVM IR, which lld uses to perform link-time optimizations like cross-translation-unit inlining. Here is an example:
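A minimal sketch of such a setup; the file names, the add1 symbol, and the exact flags below are assumptions, not the original reproducer:

```sh
# add1 lives in its own TU, so only LTO can inline it into main
cat > add1.c <<'EOF'
int add1(int x) { return x + 1; }
EOF
cat > main.c <<'EOF'
extern int add1(int x);
int main(void) { return add1(41); }
EOF

# separate compilation: no cross-TU inlining
clang -O2 -c add1.c main.c
clang -O2 add1.o main.o -o sep

# LTO build: the objects contain LLVM IR, so -flto stays on the link line
clang -O2 -flto -c add1.c main.c
clang -O2 -flto -fuse-ld=lld add1.o main.o -o lto

# "early" LTO build: an intermediate -r link performs LTO and emits a
# machine-code relocatable, so the final link needs no -flto
clang -O2 -flto -fuse-ld=lld -r add1.o main.o -o merged.o
clang -O2 merged.o -o elto
```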
Building and then disassembling the executables shows that add1, which is referenced and defined in separate translation units, is inlined for the two LTO builds but not for the separate compilation build, as expected:
The difference between the two LTO builds is that one had -flto on the link line and the other didn't. The one which included an intermediate step between compiling and linking to create a relocatable object did not need -flto on the link line because I gave the linker object code, not LLVM IR.
amdgpu+rdc (bad)
Now consider my use case. I'm building with rocm 6.0.0, the latest rocm clang distribution installed on my system, and I am targeting the amd mi250x. I modified my x86+LTO code to use hip with rdc:
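A hedged reconstruction of the HIP variant (the same files ported to HIP; the flags and the rocm library path are assumptions):

```sh
# add1 becomes a __device__ function called from a kernel in main.c
clang -x hip --offload-arch=gfx90a -fgpu-rdc -O2 -c add1.c main.c

# full rdc link works
clang --hip-link -fgpu-rdc --offload-arch=gfx90a add1.o main.o -o rdc \
  -L/opt/rocm-6.0.0/lib -lamdhip64

# intermediate -r step, then a plain host link
clang --hip-link -fgpu-rdc --offload-arch=gfx90a -r add1.o main.o -o uber.o
clang uber.o -o erdc -L/opt/rocm-6.0.0/lib -lamdhip64
```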
The second-to-last line, which uses -r to make the relocatable object, fails with

ld.lld: error: attempted static link of dynamic object

and references shared libraries in /opt/rocm. Ignore the last 3 lines above, which are due to my attempt to link using the non-existent object file uber.o. |
@llvm/issue-subscribers-backend-amdgpu Author: Mike Pozulp (pozulp) |
The short answer is we don't really support binary linking of different object files right now. The main blocker is reporting something sensible for function resource utilization if we can't see a function body. Without that, any attempt to rely on object files is going down underdeveloped and untested paths. |
What do you mean when you say "can't see a function body"? I thought that because I compiled all of my object files with -fgpu-rdc, which emits LLVM IR into the objects, the linker can see all of the function definitions when I try to generate a relocatable object (-r). Every device function definition is in the LLVM IR in the objects. |
Oh, I see you're linking a single .o, not multiple .os together |
Yes, I'm linking a single .o in my reproducer. |
This seems to work with OpenMP just fine, so it's the driver that doesn't do it right:
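A sketch of the equivalent OpenMP commands, assuming a file add.c with an offload region (flags assumed):

```sh
clang add.c -fopenmp --offload-arch=gfx90a -c -o add.o
clang add.o -fopenmp --offload-arch=gfx90a -r -o out.o   # succeeds
```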
Looking at the driver steps: if I redo them manually, ignore the mcin/llvm-mc stuff and the output of that, and remove libamdhip64.so from that link command, I get an out.o. Tag: @jhuber6 |
@yxsamliu probably knows the most about expected HIP behavior. Somewhat curious if using |
Thanks Johannes!
@jhuber6, how do you build upstream hip? I built llvm to bisect an lld hang in #58639, but I have no experience building HIP or ROCm and don't know where to start. The OS on my system is called TOSS 4, which is based on RHEL 8. The system has both mi250x and mi300a nodes. I'm most interested in the mi300a. |
I'm not the best person to ask about building ROCm. Maybe @yxsamliu or @saiislam would know something. The work I do with HIP is limited to basic tests using the basic support in community LLVM. My only experience building ROCm was using the AUR packages provided for Arch Linux before they were merged into the system package manager. Using HIP generally requires a lot of the HIP libraries from ROCm, so it's difficult to do without a ROCm build or installation somewhere.
I was curious if
If you're talking about building HIP from LLVM, it should just require the |
The following works for me for ROCm 6.0:

```sh
PATH=/opt/rocm/llvm/bin:$PATH clang -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o # add.o contains llvm IR
```

hipcc links with some libraries, which may not work with -r |
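Presumably the -r step then invokes clang directly rather than hipcc, so no ROCm shared libraries enter the relocatable link; a guess at its shape:

```sh
# assumption: clang, not hipcc, performs the -r link, avoiding the
# "attempted static link of dynamic object" error from /opt/rocm libs
PATH=/opt/rocm/llvm/bin:$PATH clang -fgpu-rdc --hip-link --offload-arch=gfx90a \
  -r $dir/add.o -o uber.o
```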
So, I actually remember specifically handling this case with the new driver's binary format. Because
|
Hey Yaxun (Sam), that worked for me too! Not just for building my little |
So is there anything to do with this issue, or can it be closed? Should there be a driver usability improvement? |
We could let the clang driver assume -no-hip-rt when -r is specified. |
Hey @arsenm, thanks for asking. Linking the large code caused 3 problems that you or @yxsamliu might be able to solve. Last week, a colleague proposed an acronym to describe the feature that I'm trying to achieve. I mention this acronym because I use it below in my description of the 3 problems. The acronym is ERDC. The "E" stands for "early", which means that I am using an intermediate step between compiling and linking to generate a relocatable (-r) so that I do not need -fgpu-rdc in my LDFLAGS. I wrote a new example that demonstrates the 3 problems. I link a tiny driver that calls two tiny libraries. Here is a summary of the problems
As before, I use x86+lto as the "good" case and amdgpu+rdc as the "bad" case. Here is a summary of the difference that I observed between the two cases:
Here's a summary of my workarounds:
And here are my feelings about the workarounds:
x86+lto (good)
Building the lto executable works, but building the elto or partial_lto executable fails because of Problem 1):
My workaround replaces the archive with its contents when I generate the relocatable (-r),
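roughly like the following, assuming an archive libalpha.a built from alpha1.o and alpha2.o as in the 5-file example (names and flags are assumptions):

```sh
# Problem 1: the archive cannot be handed to the -r link directly,
# so extract its members and pass the objects instead
llvm-ar x libalpha.a                       # yields alpha1.o alpha2.o
clang -flto -fuse-ld=lld -r alpha1.o alpha2.o -o alpha.o
```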
Disassembly confirms success. Specifically,
amdgpu+rdc (bad)
I'm building with rocm 6.0.0 and targeting the amd mi250x, so I modified my x86+lto code to use hip with rdc:
Building the rdc executable works, but building the erdc or partial_erdc executable fails because of Problem 1):
My workaround replaces the archive with its contents when I generate the relocatable (-r), as I did in the x86+lto case, but now I encounter Problem 2):
My workaround replaces the relocatables with a single relocatable
but this is bad. I need a better workaround. Finally, building partial_erdc fails because of Problem 3)
My workaround uses objcopy to remove the CLANG_OFFLOAD_BUNDLE sections. This gets me past "Invalid encoding" but my partial_erdc build still fails with the __hip_fatbin duplicate symbol error from Problem 2)
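Something along these lines, where the section-name pattern is an assumption (llvm-objdump -h shows the real names):

```sh
# GNU objcopy accepts glob patterns for section names
objcopy --remove-section='__CLANG_OFFLOAD_BUNDLE__*' partial.o partial_stripped.o
```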
In summary, I need a workaround for Problem 2) __hip_fatbin duplicate symbol. What is __hip_fatbin? What is the .hip_fatbin section? (llvm-objdump -h shows that there is a section called .hip_fatbin) Is there a flag I can use when I generate the relocatable (-r) to leave __hip_fatbin undefined? Or is there a flag I can use while linking to tell the linker to ignore duplicate __hip_fatbin symbols? |
I'm still very curious if using |
Hey @jhuber6, I got
Run make to verify that this works.
Add
Running
|
Thanks, I've copied your reproducers. I think I'll take some time to look into this on that side. |
The new driver uses a linker-defined array to traverse the list of entries. The dummy symbol there is used to force the linker to define the section even when the user doesn't have any entries (e.g. kernels). The problem is that the symbols conflict. I can either make this symbol
I tried your basic
Thanks for the detailed report, it's really helpful. Also, the warning is because the link step doesn't need |
I made #79231, but it has me wondering what the exact semantics of this would be. The patch is a good idea regardless, though. Right now, it seems doing a
I think the main issue here is that given
I may need to shim in some logic to address this, depending on what we actually want to happen.
Option 1: Compiling with
Making the former work would require stripping the |
Okay @pozulp, I remembered the other reason why it wasn't working. The
You also do
This means that neither of the libraries will extract, because they're checked before
Here's the modified Makefile I made.
This will do what I expect to be "correct" default behavior once landed. However, from your initial report it seems that you're more interested in being able to "cut off" RDC libraries so they do not inflate link time when linked via a static library. I think this should be a separate flag, because it's somewhat different from
What I'll need to add is a separate flag to instruct the linker wrapper to perform the device-side linking and wrapping for the module and then delete the associated section so it won't be linked again. This should be doable once #79231 lands. Is that the behavior you'd expect @pozulp? I.e.

```
clang -x hip foo.c bar.c -c --offload-arch=gfx90a -fgpu-rdc
clang -r --offload-link foo.c bar.c -o merged.o
llvm-ar rcs libfoo.a merged.o // libfoo.a contains no GPU device code
clang -x hip main.c libfoo.a --offload-arch=gfx90a -fgpu-rdc // main.c will not link with GPU code from libfoo.a
```
|
Hey @jhuber6, yes! Sounds great. If you see @yxsamliu online this week please ask how to fix Problem 2) __hip_fatbin duplicate symbol. My reproducer is #77018 (comment). I'm hoping Sam's magic has not run out yet. Sam is the one who fixed the
|
Okay, I hacked something together with the new driver framework as a proof of concept. First, this requires reverting 0f8b529 locally to get the old behavior back. I'm using OpenMP just because it uses the new driver natively, always builds in RDC-mode, and works better with community LLVM.

```c
// foo.c
int bar(void) { return 2; }
#pragma omp declare target to(bar) device_type(nohost)
int foo(void) {
int x = 0;
#pragma omp target map(from : x)
{ x = bar(); }
return x;
}
```

```c
// main.c
extern int foo(void);
int bar(void) { return 1; }
#pragma omp declare target to(bar) device_type(nohost)
int main() {
int x = foo();
int y = 0;
#pragma omp target map(from : y)
{ y = bar(); }
return x + y;
}
```

Here's a hacky script I wrote to fully link it:

```bash
#!/bin/bash
set -v
set -e
# Get the `foo.o` object with embedded GPU code.
clang foo.c -fopenmp --offload-arch=native -fopenmp-offload-mandatory -c
# Rename the `.llvm.offloading` section. This is where the device code lives. We only need
# to do this because the linked GPU binary uses the same section name and they'll get clobbered
# when doing a relocatable link. This has a custom ELF type so the name is irrelevant for everything else
llvm-objcopy --rename-section .llvm.offloading=.llvm.offloading.rel foo.o
# This relocatable link will fully link the embedded GPU code in `foo.o` and then create a blob to
# register it with the OpenMP runtime, this blob will be merged into `merged.o`.
clang foo.o -lomptarget.devicertl -r -fopenmp --offload-arch=native -o merged.o
# The registration blob primarily uses runtime sections to iterate the kernels and globals.
# The linker provides `__[start|stop]_<secname>` symbols to traverse it. These will conflict
# with anything else we link so we need to rename it to something unique for this module.
# Also delete the old embedded code so nothing else will link with it.
llvm-objcopy \
--remove-section .llvm.offloading.rel \
--rename-section omp_offloading_entries=omp_offloading_entries_1 \
--redefine-sym __start_omp_offloading_entries=__start_omp_offloading_entries_1 \
--redefine-sym __stop_omp_offloading_entries=__stop_omp_offloading_entries_1 \
merged.o
# Handle the rest as normal.
llvm-ar rcs libfoo.a merged.o
clang main.c libfoo.a -fopenmp --offload-arch=native -fopenmp-offload-mandatory
./a.out || echo $?
```

It works in theory; implementing it would require a few hacks, however. HIP uses the exact same handling under |
It is possible to support -r of part of the objects and then link them all together. However, this needs some changes to the clang driver, and I doubt how useful this feature is. Why not use -shared to link each partition of the objects as a shared library, then link the main program with all the shared libraries? The current clang driver supports that for HIP. |
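A sketch of this -shared approach for one self-contained group, reusing names from the 5-file example (flags are assumptions):

```sh
# device-link the alpha group into one shared library
hipcc -fgpu-rdc --offload-arch=gfx90a -fPIC -x hip -c alpha1.c alpha2.c
hipcc -fgpu-rdc --offload-arch=gfx90a -shared alpha1.o alpha2.o -o libalpha.so
# the main link consumes libalpha.so like any host shared library and
# no longer needs -fgpu-rdc for alpha's device code
clang main.o -L. -lalpha -L/opt/rocm/lib -lamdhip64 -o main
```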
So, with #80066 applied I was able to do the following with two generic HIP files using my local installation of ROCm 5.7.1.

```cpp
// main.hip
#include <hip/hip_runtime.h>
__global__ void kernel2() {
printf("%s\n", __PRETTY_FUNCTION__);
}
extern void foo();
int main() {
foo();
hipLaunchKernelGGL(kernel2, dim3(1), dim3(1), 0, 0);
auto x = hipDeviceSynchronize();
}
```

```cpp
// foo.hip
#include <hip/hip_runtime.h>
__global__ void kernel1() {
printf("%s\n", __PRETTY_FUNCTION__);
}
void foo() {
hipLaunchKernelGGL(kernel1, dim3(1), dim3(1), 0, 0);
auto x = hipDeviceSynchronize();
}
```

I compiled both of them using the new driver:

```sh
$ clang -x hip foo.hip --offload-arch=native -c --offload-new-driver -fgpu-rdc
$ clang foo.o --offload-link -r -o merged.o
$ llvm-ar rcs libfoo.a merged.o
$ clang -x hip main.hip --offload-arch=native --offload-new-driver -fgpu-rdc -L. -lfoo
$ ./a.out
void kernel1()
void kernel2()
```

Which seems to be what you're asking for. If you do |
Thanks Joseph!
Hey @yxsamliu, can you show me how to do this? You could use the 5-file example (main.c alpha1.c alpha2.c beta1.c beta2.c) that I shared in my comment last week #77018 (comment). |
|
Hey @yxsamliu, thanks for clarifying. I asked a few LLNL colleagues from a few different teams about dynamic linking and all of them said that it will not work for us, but they are interested in a solution using static linking. You said that
This is great news! But it sounds like you need more information before you attempt it. I can talk to you and Brendon Cahoon said that he can too. Cahoon, myself, and others from LLNL, AMD, and HPE want to run LLNL applications on the MI300s in the El Capitan machine that LLNL will deploy this year, and I think that this new build strategy could help. |
The long-term goal is to move HIP compilation to the new offloading driver, which would make #80066 work in your case as expected. However, I don't know how long it would take for these changes to filter down into a ROCm release. I should probably take the time to work with other members of the HIP team to see what the current blockers are. As far as I'm aware, for HIP registration we create a constructor for each TU that registers the relevant globals using an external handle that the link step then resolves once the actual image has been created. You'd probably need some post-link step to rename that handle, as it's |
For this approach to work well, the object files should be partitioned into small groups in which the device code is self-contained, i.e., it does not call any device functions or use any device variables in other groups. Does your HIP application have this trait? Thanks. |
Hey @yxsamliu, yes. Consider a graph in which the nodes are TUs. If two nodes have an edge between them, it means that there is at least one reference to a device function or device variable defined in the other. The graph for my HIP application is disconnected, meaning that there are at least two nodes which are not connected by a path. I made a visual to help explain: I drew the graph for my tiny 5-file program that I shared in my comment last week #77018 (comment). It is a disconnected graph containing 3 maximal connected subgraphs. I also made a table of TU combinations labeled with green checkmarks if they are valid early rdc combinations and red xmarks if they are not. See below. Finally, for anyone who is wondering if early rdc is right for them, there are at least two cases that would not benefit from early rdc:
Hey @jhuber6, do you mean that --offload-new-driver will be the default some day? |
Yes, that is the goal. I need to take some time to see what's actually missing for HIP to use it by default. |
`-fgpu-rdc` mode allows device functions to call device functions in different TUs. However, currently all device objects have to be linked together, since only one fat binary is supported. This is time consuming for the AMDGPU backend since it only supports LTO. There are use cases where objects can be divided into groups in which device functions are self-contained but host functions are not. It is desirable to link/optimize/codegen the device code and generate a fatbin for each group, while partially linking the host code with `ld -r` or generating a static library by using the `--emit-static-lib` option of clang. This avoids linking all device code together and therefore decreases the linking time for `-fgpu-rdc`. Previously, clang emitted an external symbol `__hip_fatbin` for all objects for `-fgpu-rdc`. With this patch, clang emits a unique external symbol `__hip_fatbin_{cuid}` for the fat binary for each object. When a group of objects are linked together to generate a fatbin, the symbols are merged by alias and point to the same fat binary. Each group has its own fat binary, and one executable or shared library can have multiple fat binaries. Device linking is done for undefined fat binary symbols only, to avoid repeated linking. `__hip_gpubin_handle` is also uniquified and merged to avoid repeated registering. The symbol `__hip_cuid_{cuid}` is introduced to facilitate debugging and tooling. Fixes: #77018
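Under this scheme a grouped build might look roughly as follows; the driver invocations are assumptions, not taken from the patch:

```sh
# compile a self-contained group with rdc
clang -x hip -fgpu-rdc --offload-arch=gfx90a -c alpha1.c alpha2.c
# partially link the group: it gets its own fatbin, referenced through
# the per-group __hip_fatbin_{cuid} alias
clang -fgpu-rdc --hip-link --no-hip-rt -r alpha1.o alpha2.o -o alpha.o
# the final link device-links only still-undefined fatbin symbols
clang --hip-link main.o alpha.o -L/opt/rocm/lib -lamdhip64 -o main
```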
@llvm/issue-subscribers-clang-codegen Author: Mike Pozulp (pozulp) |
@pozulp Have you tried clang with -r? Does it work for you? Thanks. |