Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] Creating relocatable object (-r) from rdc objects (-fgpu-rdc) fails with lld error attempted static link of dynamic object in /opt/rocm-6.0.0/lib #77018

Closed
pozulp opened this issue Jan 4, 2024 · 35 comments · Fixed by #81700
Assignees
Labels
clang:codegen clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl'

Comments

@pozulp
Copy link
Member

pozulp commented Jan 4, 2024

Hey @arsenm and @jdoerfert, how do I generate a relocatable object (-r) for the amdgpu target? I am linking a large code containing a few millions of lines of C++ with an optional library dependency containing about 300,000 lines of C++. The library requires relocatable device code (-fgpu-rdc) because it has many kernels which reference device functions defined in separate translation units. The large code does not. A driver for the library links in 30 minutes. The large code takes 2 minutes to link without the optional library and over 8 hours with the library (the lld process is still running after 8 hours). I don't want to use rdc to link the large code, but I have to because of the optional library: if even a single object needs rdc, then the link needs it too. Perhaps an intermediate step between compiling the library and linking the large code, in which I generate a relocatable object (-r) from the rdc-compiled library, would allow me to link the large code without rdc even when I'm using the optional library.

x86+LTO (good)

Consider using LTO to target x86, which works as expected. During compilation, clang -flto emits LLVM IR, which lld uses to perform link time optimizations like cross translation unit inlining. Here is an example:

$ cat main.c
int add1(int y);
int main(int argc, char **argv) { return add1(argc); }
$ cat add.c
int add1(int y) { return y + 1; }
$ cat build.sh
dirs="normal_build lto_build relocatable_build"
for dir in $(echo $dirs); do rm -rf $dir; mkdir $dir; done

# build separate compilation executable
dir=normal_build
clang -O2 -c add.c -o $dir/add.o                          # add.o  contains object code
clang -O2 -c main.c -o $dir/main.o                        # main.o contains object code
clang -O2 $dir/add.o $dir/main.o -o $dir/foo              # linker sees object code

# build lto executable
dir=lto_build
clang -flto -O2 -c add.c -o $dir/add.o                    # add.o  contains llvm IR
clang -flto -O2 -c main.c -o $dir/main.o                  # main.o contains llvm IR
clang -flto -O2 $dir/add.o $dir/main.o -o $dir/foo        # linker sees llvm IR

# build lto executable but with an intermediate step between compiling and
# linking which creates relocatable uber.o
dir=relocatable_build 
clang    -flto -O2 -c add.c -o $dir/add.o                 # add.o  contains llvm IR
clang    -flto -O2 -c main.c -o $dir/main.o               # main.o contains llvm IR
clang -r -flto -O2 $dir/add.o $dir/main.o -o $dir/uber.o  # uber.o contains object code
clang          -O2 $dir/uber.o -o $dir/foo                # linker sees object code

Building and then disassembling the executables shows that add1, which is referenced and defined in separate translation units, is inlined for the two LTO builds but not for the separate compilation build, as expected:

$ sh < build.sh 
$ llvm-objdump --disassemble-symbols=main */foo 
lto_build/foo:  file format elf64-x86-64

Disassembly of section .text:

0000000000400540 <main>:
  400540: 8d 47 01                      leal    0x1(%rdi), %eax
  400543: c3                            retq
  400544: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40054e: 66 90                         nop
  
normal_build/foo:   file format elf64-x86-64

Disassembly of section .text: 

0000000000400550 <main>:
  400550: e9 eb ff ff ff                jmp 0x400540 <add1>
  400555: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40055f: 90                            nop 

relocatable_build/foo:  file format elf64-x86-64

Disassembly of section .text:

0000000000400570 <main>:
  400570: 8d 47 01                      leal    0x1(%rdi), %eax
  400573: c3                            retq
  400574: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40057e: 66 90                         nop

The difference in the two LTO builds is that one had -flto on the link line and the other didn't. The one which included an intermediate step between compiling and linking to create a relocatable object did not need -flto on the link line because I gave the linker object code, not LLVM IR.

amdgpu+rdc (bad)

Now consider my use case. I'm building with rocm 6.0.0, the latest rocm clang distribution installed on my system, and I am targeting the amd mi250x. I modified my x86+LTO code to use hip with rdc:

$ cat main.c
#include <hip/hip_runtime.h>
__device__ int add1(int y);
__global__ void mykernel(int *y) { *y = add1(*y); } 
int main(int argc, char **argv) {
    mykernel<<<1,1>>>(&argc);
    return argc;
}
$ cat add.c
__device__ int add1(int y) { return y + 1; }
$ cat build.sh
dirs="rdc_build relocatable_build"
for dir in $(echo $dirs); do rm -rf $dir; mkdir $dir; done

dir=rdc_build
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o         # add.o contains llvm IR
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o       # main.o contains llvm IR
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/foo    # linker sees llvm IR

dir=relocatable_build
hipcc    -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o          # add.o contains llvm IR
hipcc    -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o        # main.o contains llvm IR
hipcc -r -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/uber.o  # uber.o contains object code 
hipcc    -O2           --offload-arch=gfx90a -o $dir/uber.o                         # linker sees object code

The second-to-last line, which uses -r to make the relocatable object, fails with ld.lld: error: attempted static link of dynamic object and references shared libraries in /opt/rocm:

$ sh < build.sh
clang: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-unwindlib=libgcc' [-Wunused-command-line-argument]
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamdhip64.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libhsa-runtime64.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamd_comgr.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamdhip64.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ld.lld: error: undefined symbol: main
>>> referenced by /lib/../lib64/crt1.o:(_start)
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Ignore the last 3 lines above, which are due to my attempt to link using the non-existent object file uber.o.

@github-actions github-actions bot added the lld label Jan 4, 2024
@arsenm arsenm added backend:AMDGPU clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Jan 5, 2024
@llvmbot
Copy link
Member

llvmbot commented Jan 5, 2024

@llvm/issue-subscribers-clang-driver

Author: Mike Pozulp (pozulp)

Hey @arsenm and @jdoerfert, how do I generate a relocatable object (-r) for the amdgpu target? I am linking a large code containing a few millions of lines of C++ with an optional library dependency containing about 300,000 lines of C++. The library requires relocatable device code (-fgpu-rdc) because it has many kernels which reference device functions defined in separate translation units. The large code does not. A driver for the library links in 30 minutes. The large code takes 2 minutes to link without the optional library and over 8 hours with the library (the lld process is still running after 8 hours). I don't want to use rdc to link the large code, but I have to because of the optional library: if even a single object needs rdc, then the link needs it too. Perhaps an intermediate step between compiling the library and linking the large code, in which I generate a relocatable object (-r) from the rdc-compiled library, would allow me to link the large code without rdc even when I'm using the optional library.

x86+LTO (good)

Consider using LTO to target x86, which works as expected. During compilation, clang -flto emits LLVM IR, which lld uses to perform link time optimizations like cross translation unit inlining. Here is an example:

$ cat main.c
int add1(int y);
int main(int argc, char **argv) { return add1(argc); }
$ cat add.c
int add1(int y) { return y + 1; }
$ cat build.sh
dirs="normal_build lto_build relocatable_build"
for dir in $(echo $dirs); do rm -rf $dir; mkdir $dir; done

# build separate compilation executable
dir=normal_build
clang -O2 -c add.c -o $dir/add.o                          # add.o  contains object code
clang -O2 -c main.c -o $dir/main.o                        # main.o contains object code
clang -O2 $dir/add.o $dir/main.o -o $dir/foo              # linker sees object code

# build lto executable
dir=lto_build
clang -flto -O2 -c add.c -o $dir/add.o                    # add.o  contains llvm IR
clang -flto -O2 -c main.c -o $dir/main.o                  # main.o contains llvm IR
clang -flto -O2 $dir/add.o $dir/main.o -o $dir/foo        # linker sees llvm IR

# build lto executable but with an intermediate step between compiling and
# linking which creates relocatable uber.o
dir=relocatable_build 
clang    -flto -O2 -c add.c -o $dir/add.o                 # add.o  contains llvm IR
clang    -flto -O2 -c main.c -o $dir/main.o               # main.o contains llvm IR
clang -r -flto -O2 $dir/add.o $dir/main.o -o $dir/uber.o  # uber.o contains object code
clang          -O2 $dir/uber.o -o $dir/foo                # linker sees object code

Building and then disassembling the executables shows that add1, which is referenced and defined in separate translation units, is inlined for the two LTO builds but not for the separate compilation build, as expected:

$ sh &lt; build.sh 
$ llvm-objdump --disassemble-symbols=main */foo 
lto_build/foo:  file format elf64-x86-64

Disassembly of section .text:

0000000000400540 &lt;main&gt;:
  400540: 8d 47 01                      leal    0x1(%rdi), %eax
  400543: c3                            retq
  400544: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40054e: 66 90                         nop
  
normal_build/foo:   file format elf64-x86-64

Disassembly of section .text: 

0000000000400550 &lt;main&gt;:
  400550: e9 eb ff ff ff                jmp 0x400540 &lt;add1&gt;
  400555: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40055f: 90                            nop 

relocatable_build/foo:  file format elf64-x86-64

Disassembly of section .text:

0000000000400570 &lt;main&gt;:
  400570: 8d 47 01                      leal    0x1(%rdi), %eax
  400573: c3                            retq
  400574: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40057e: 66 90                         nop

The difference in the two LTO builds is that one had -flto on the link line and the other didn't. The one which included an intermediate step between compiling and linking to create a relocatable object did not need -flto on the link line because I gave the linker object code, not LLVM IR.

amdgpu+rdc (bad)

Now consider my use case. I'm building with rocm 6.0.0, the latest rocm clang distribution installed on my system, and I am targeting the amd mi250x. I modified my x86+LTO code to use hip with rdc:

$ cat main.c
#include &lt;hip/hip_runtime.h&gt;
__device__ int add1(int y);
__global__ void mykernel(int *y) { *y = add1(*y); } 
int main(int argc, char **argv) {
    mykernel&lt;&lt;&lt;1,1&gt;&gt;&gt;(&amp;argc);
    return argc;
}
$ cat add.c
__device__ int add1(int y) { return y + 1; }
$ cat build.sh
dirs="rdc_build relocatable_build"
for dir in $(echo $dirs); do rm -rf $dir; mkdir $dir; done

dir=rdc_build
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o         # add.o contains llvm IR
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o       # main.o contains llvm IR
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/foo    # linker sees llvm IR

dir=relocatable_build
hipcc    -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o          # add.o contains llvm IR
hipcc    -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o        # main.o contains llvm IR
hipcc -r -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/uber.o  # uber.o contains object code 
hipcc    -O2           --offload-arch=gfx90a -o $dir/uber.o                         # linker sees object code

The second-to-last line, which uses -r to make the relocatable object, fails with ld.lld: error: attempted static link of dynamic object and references shared libraries in /opt/rocm:

$ sh &lt; build.sh
clang: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-unwindlib=libgcc' [-Wunused-command-line-argument]
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamdhip64.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libhsa-runtime64.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamd_comgr.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamdhip64.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ld.lld: error: undefined symbol: main
&gt;&gt;&gt; referenced by /lib/../lib64/crt1.o:(_start)
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Ignore the last 3 lines above, which are due to my attempt to link using the non-existent object file uber.o.

@llvmbot
Copy link
Member

llvmbot commented Jan 5, 2024

@llvm/issue-subscribers-backend-amdgpu

Author: Mike Pozulp (pozulp)

Hey @arsenm and @jdoerfert, how do I generate a relocatable object (-r) for the amdgpu target? I am linking a large code containing a few millions of lines of C++ with an optional library dependency containing about 300,000 lines of C++. The library requires relocatable device code (-fgpu-rdc) because it has many kernels which reference device functions defined in separate translation units. The large code does not. A driver for the library links in 30 minutes. The large code takes 2 minutes to link without the optional library and over 8 hours with the library (the lld process is still running after 8 hours). I don't want to use rdc to link the large code, but I have to because of the optional library: if even a single object needs rdc, then the link needs it too. Perhaps an intermediate step between compiling the library and linking the large code, in which I generate a relocatable object (-r) from the rdc-compiled library, would allow me to link the large code without rdc even when I'm using the optional library.

x86+LTO (good)

Consider using LTO to target x86, which works as expected. During compilation, clang -flto emits LLVM IR, which lld uses to perform link time optimizations like cross translation unit inlining. Here is an example:

$ cat main.c
int add1(int y);
int main(int argc, char **argv) { return add1(argc); }
$ cat add.c
int add1(int y) { return y + 1; }
$ cat build.sh
dirs="normal_build lto_build relocatable_build"
for dir in $(echo $dirs); do rm -rf $dir; mkdir $dir; done

# build separate compilation executable
dir=normal_build
clang -O2 -c add.c -o $dir/add.o                          # add.o  contains object code
clang -O2 -c main.c -o $dir/main.o                        # main.o contains object code
clang -O2 $dir/add.o $dir/main.o -o $dir/foo              # linker sees object code

# build lto executable
dir=lto_build
clang -flto -O2 -c add.c -o $dir/add.o                    # add.o  contains llvm IR
clang -flto -O2 -c main.c -o $dir/main.o                  # main.o contains llvm IR
clang -flto -O2 $dir/add.o $dir/main.o -o $dir/foo        # linker sees llvm IR

# build lto executable but with an intermediate step between compiling and
# linking which creates relocatable uber.o
dir=relocatable_build 
clang    -flto -O2 -c add.c -o $dir/add.o                 # add.o  contains llvm IR
clang    -flto -O2 -c main.c -o $dir/main.o               # main.o contains llvm IR
clang -r -flto -O2 $dir/add.o $dir/main.o -o $dir/uber.o  # uber.o contains object code
clang          -O2 $dir/uber.o -o $dir/foo                # linker sees object code

Building and then disassembling the executables shows that add1, which is referenced and defined in separate translation units, is inlined for the two LTO builds but not for the separate compilation build, as expected:

$ sh &lt; build.sh 
$ llvm-objdump --disassemble-symbols=main */foo 
lto_build/foo:  file format elf64-x86-64

Disassembly of section .text:

0000000000400540 &lt;main&gt;:
  400540: 8d 47 01                      leal    0x1(%rdi), %eax
  400543: c3                            retq
  400544: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40054e: 66 90                         nop
  
normal_build/foo:   file format elf64-x86-64

Disassembly of section .text: 

0000000000400550 &lt;main&gt;:
  400550: e9 eb ff ff ff                jmp 0x400540 &lt;add1&gt;
  400555: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40055f: 90                            nop 

relocatable_build/foo:  file format elf64-x86-64

Disassembly of section .text:

0000000000400570 &lt;main&gt;:
  400570: 8d 47 01                      leal    0x1(%rdi), %eax
  400573: c3                            retq
  400574: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40057e: 66 90                         nop

The difference in the two LTO builds is that one had -flto on the link line and the other didn't. The one which included an intermediate step between compiling and linking to create a relocatable object did not need -flto on the link line because I gave the linker object code, not LLVM IR.

amdgpu+rdc (bad)

Now consider my use case. I'm building with rocm 6.0.0, the latest rocm clang distribution installed on my system, and I am targeting the amd mi250x. I modified my x86+LTO code to use hip with rdc:

$ cat main.c
#include &lt;hip/hip_runtime.h&gt;
__device__ int add1(int y);
__global__ void mykernel(int *y) { *y = add1(*y); } 
int main(int argc, char **argv) {
    mykernel&lt;&lt;&lt;1,1&gt;&gt;&gt;(&amp;argc);
    return argc;
}
$ cat add.c
__device__ int add1(int y) { return y + 1; }
$ cat build.sh
dirs="rdc_build relocatable_build"
for dir in $(echo $dirs); do rm -rf $dir; mkdir $dir; done

dir=rdc_build
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o         # add.o contains llvm IR
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o       # main.o contains llvm IR
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/foo    # linker sees llvm IR

dir=relocatable_build
hipcc    -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o          # add.o contains llvm IR
hipcc    -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o        # main.o contains llvm IR
hipcc -r -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/uber.o  # uber.o contains object code 
hipcc    -O2           --offload-arch=gfx90a -o $dir/uber.o                         # linker sees object code

The second-to-last line, which uses -r to make the relocatable object, fails with ld.lld: error: attempted static link of dynamic object and references shared libraries in /opt/rocm:

$ sh &lt; build.sh
clang: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-unwindlib=libgcc' [-Wunused-command-line-argument]
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamdhip64.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libhsa-runtime64.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamd_comgr.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamdhip64.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ld.lld: error: undefined symbol: main
&gt;&gt;&gt; referenced by /lib/../lib64/crt1.o:(_start)
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Ignore the last 3 lines above, which are due to my attempt to link using the non-existent object file uber.o.

@arsenm
Copy link
Contributor

arsenm commented Jan 5, 2024

how do I generate a relocatable object (-r) for the amdgpu target?

The short answer is we don't really support binary linking of different object files right now. The main blocker is reporting something sensible for function resource utilization if we can't see a function body. Without that, any attempt to rely on object files is going down underdeveloped and untested paths

@pozulp pozulp changed the title [AMDGPU] Creating relocatable object (-r) from rdc objects (-fgpu-rdc) fails with lld error attempted static link of dynamic object in /opt/rocm-5.7.1/lib [AMDGPU] Creating relocatable object (-r) from rdc objects (-fgpu-rdc) fails with lld error attempted static link of dynamic object in /opt/rocm-6.0.0/lib Jan 5, 2024
@pozulp
Copy link
Member Author

pozulp commented Jan 5, 2024

The main blocker is reporting something sensible for function resource utilization if we can't see a function body.

What do you mean when you say "can't see a function body"? I thought that because I compiled all of my object files with -fgpu-rdc, which emits LLVM IR into the objects, that when I try to generate a relocatable object (-r) the linker can see all of the function definitions. Every device function definition is in the LLVM IR in the objects.

@arsenm
Copy link
Contributor

arsenm commented Jan 5, 2024

The main blocker is reporting something sensible for function resource utilization if we can't see a function body.

What do you mean when you say "can't see a function body"? I thought that because I compiled all of my object files with -fgpu-rdc, which emits LLVM IR into the objects, that when I try to generate a relocatable object (-r) the linker can see all of the function definitions. Every device function definition is in the LLVM IR in the objects.

Oh, I see you're linking a single .o, not multiple .os together

@pozulp
Copy link
Member Author

pozulp commented Jan 5, 2024

The main blocker is reporting something sensible for function resource utilization if we can't see a function body.

What do you mean when you say "can't see a function body"? I thought that because I compiled all of my object files with -fgpu-rdc, which emits LLVM IR into the objects, that when I try to generate a relocatable object (-r) the linker can see all of the function definitions. Every device function definition is in the LLVM IR in the objects.

Oh, I see you're linking a single .o, not multiple .os together

Yes, I'm linking a single .o in my reproducer.

@jdoerfert
Copy link
Member

This seems to work with OpenMP just fine, so it's the driver that doesn't do it right:

❯❯❯ cat main.c
#include <stdio.h>
int foo();

int main() {
        int x = 9999;
        #pragma omp target map(from:x)
        x = foo();
        printf("Expected 42, got: %i\n", x);
}

❯❯❯ cat add.c
int foo() { return 42; }
#pragma omp declare target(foo)

❯❯❯ clang -fopenmp -O2 --offload-arch=gfx90a -c main.c -o main.o
❯❯❯ clang -fopenmp -O2 --offload-arch=gfx90a -c add.c -o add.o
❯❯❯ clang -r add.o main.o -o out.o
❯❯❯ clang out.o -o a.out  -fopenmp --offload-arch=gfx90a
❯❯❯ LIBOMPTARGET_INFO=16 ./a.out
"PluginInterface" device 0 info: Launching kernel __omp_offloading_50_ef70b478_main_l6 with 1 blocks and 256 threads in Generic-SPMD mode
AMDGPU device 0 info: #Args: 2 Teams x Thrds:    1x 256 (MaxFlatWorkGroupSize: 256) LDS Usage: 16711931B #SGPRs/VGPRs: 10/2 #SGPR/VGPR Spills: 0/0 Tripcount: 0
Expected 42, got: 42

Looking at the -r hipcc invocation I am very confused, this looks wrong on multiple levels:
For one, they unboundle (via clang-offload-bundler) each object file twice, and then never use half of the redundant results.
Then they run llvm-mc on a temporary result for add.o but not main.o (for some reason), and the command that produces this temporary result (a .mcin file) is not shown with -v.
Finally, they feed two temporary results derived from add.o and one derived from main.o into ld.lld.

If I redo the steps manually, ignore the mcin/llvm-mc stuff and the output of that, and remove libamdhip64.so from that link command, I get a out.o.
Then I can do:
hipcc -fuse-ld=ld -O2 --offload-arch=gfx90a -o hip out.o /opt/rocm-6.0.0/lib/librccl.so
to get an executable that seems to run fine, but the kernel is not launched for some reason.
Might just be me being bad at HIP.

Tag: @jhuber6

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 6, 2024

@yxsamliu probably knows the most about expected HIP behavior. Somewhat curious if using --offload-new-driver -fgpu-rdc is functional for this example, but I haven't done extensive testing w/ HIP. I did a test using my hipcc and it seems to work for basic stuff, but I'm assuming there's some stuff missing in ROCm 6.0 from upstream.

@pozulp
Copy link
Member Author

pozulp commented Jan 8, 2024

Thanks Johannes!

I did a test using my hipcc and it seems to work for basic stuff, but I'm assuming there's some stuff missing in ROCm 6.0 from upstream.

@jhuber6, how do you build upstream hip? I built llvm to bisect an lld hang in #58639 but I have no experience building hip nor rocm and don't know where to start. The OS on my system is called TOSS 4 which is based on RHEL 8. The system has both mi250x and mi300a nodes. I'm most interested in the mi300a.

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 8, 2024

@jhuber6, how do you build upstream hip? I built llvm to bisect an lld hang in #58639 but I have no experience building hip nor rocm and don't know where to start. The OS on my system is called TOSS 4 which is based on RHEL 8. The system has both mi250x and mi300a nodes. I'm most interested in the mi300a.

I'm not the best person to ask for building ROCm. Maybe @yxsamliu or @saiislam would know something. The work I do with HIP is limited to basic tests using the basic support in community LLVM. My only experience building ROCm was using the AUR packages provided for Arch Linux before they were merged into the system package manager. Using HIP generally requires a lot of the HIP libraries from ROCm so it's difficult to do without a ROCm build or installation somewhere. The hipcc I used came from my system installation of ROCm 5.7 from the Arch package manager.

I was curious if --offload-new-driver yielded any different results, did you get to try that?

If you're talking about building HIP from LLVM, it should just require the clang project, but like I said it won't have the necessary libraries. If you're doing OpenMP offloading all you really need is hsa_runtime64.so which can be built yourself mostly painlessly, but HIP is much more featureful.

@yxsamliu
Copy link
Collaborator

yxsamliu commented Jan 8, 2024

the following works for me for ROCm 6.0:

PATH=/opt/rocm/llvm/bin:$PATH

clang -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o # add.o contains llvm IR
clang -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o # main.o contains llvm IR
clang -no-hip-rt --hip-link -r -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/uber.o # uber.o contains object code
clang -O2 --hip-link --offload-arch=gfx90a $dir/uber.o # linker sees object code

hipcc links with some libraries, which may not work with -r

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 8, 2024

So, I actually remember specifically handling this case with the new driver's binary format. Because -r concatenates sections, it's possible for multiple offloading binaries to be mashed next to eachother in the section, while the standard compilation flow would expect to just see one. For this reason, the new driver's format continually parses the section until it reaches the end. Previously something like this would silently ignore the second file

clang foo.c bar.c -fopenmp --offload-arch=native -c
clang -r foo.o bar.o -o out.o
clang out.o --offload-link

@pozulp
Copy link
Member Author

pozulp commented Jan 12, 2024

the following works for me for ROCm 6.0:

Hey Yaxun (Sam), that worked for me too! Not just for building my little
reproducer but also when I build the driver for my 300,000 LoC library. Next I
will try my large code.

@arsenm
Copy link
Contributor

arsenm commented Jan 17, 2024

So is there anything to do with this issue, or can it be closed? Should there be a driver usability improvement?

@yxsamliu
Copy link
Collaborator

we could let clang driver assumes -no-hip-rt when -r is specified.

@yxsamliu yxsamliu self-assigned this Jan 17, 2024
@pozulp
Copy link
Member Author

pozulp commented Jan 23, 2024

So is there anything to do with this issue, or can it be closed? Should there be a driver usability improvement?

Hey @arsenm, thanks for asking. Linking the large code caused 3 problems that you or @yxsamliu might be able to solve. Last week, a colleague proposed an acronym to describe the feature that I'm trying to achieve. I mention this acronym because I use it below in my description of the 3 problems. The acronym is ERDC. The "E" stands for "early", which means that I am using an intermediate step between compiling and linking to generate a relocatable (-r) so that I do not need -fgpu-rdc in my LDFLAGS.

I wrote a new example that demonstrates the 3 problems. I link a tiny driver that calls two tiny libraries. Here is a summary of the problems

  • Problem 1) the relocatable object has no symbols
  • Problem 2) __hip_fatbin duplicate symbol
  • Problem 3) Invalid encoding

As before, I use x86+lto as the "good" case and amdgpu+rdc as the "bad" case. Here is a summary of the difference that I observed between the two cases:

  • Problem 1) occurs on both x86+lto and amdgpu+rdc
  • Problem 2) only occurs on amdgpu+rdc
  • Problem 3) only occurs on amdgpu+rdc

Here's a summary of my workarounds:

  • Problem 1) replace archive with its contents ("-r libalpha.a" -> "-r alpha1.o alpha2.o")
  • Problem 2) replace relocatables with a single relocatable ("-r alpha1.o alpha2.o beta1.o beta2.o")
  • Problem 3) remove the CLANG_OFFLOAD_BUNDLE sections ("objcopy -R __CLANG... foo.o foo_no_cob.o")

And here's my feelings about the workarounds:

  • Problem 1) good
  • Problem 2) bad (I need a better workaround)
  • Problem 3) good

x86+lto (good)

$ cat main.c
void alpha_add(int *x);
void beta_add(int *x);
int main(int argc, char **argv) {
    alpha_add(&argc);
    beta_add(&argc);
    return argc;
}
$ cat alpha1.c
int alpha_add_impl(int *x);
void alpha_add(int *x) { alpha_add_impl(x); }
$ cat alpha2.c
void alpha_add_impl(int *x) { *x = *x + 1; }
$ cat beta1.c
void beta_add_impl(int *x);
void beta_add(int *x) { beta_add_impl(x); }
$ cat beta2.c
void beta_add_impl(int *x) { *x = *x + 1; }
$ cat makefile
CXX=/usr/lib64/ccache/clang

CXXFLAGS=-O2 -flto
RELOCATABLE_FLAGS=-O2 -flto -fuse-ld=lld -r

ELTO_LDFLAGS=-O2 -fuse-ld=lld
LDFLAGS=$(ELTO_LDFLAGS) -flto

all: lto elto partial_elto

# Targets for the normal lto built executable
# -------------------------------------------

lto: libalpha.a libbeta.a main.o
    $(CXX) $(LDFLAGS) -L. -lalpha -lbeta main.o -o $@

libalpha.a: alpha1.o alpha2.o
    ar cr libalpha.a alpha1.o alpha2.o

libbeta.a: beta1.o beta2.o
    ar cr libbeta.a beta1.o beta2.o

.c.o:
    $(CXX) $(CXXFLAGS) -c -o $@ $<

# Targets for the new elto built executable
# -----------------------------------------

elto: libalphaELTO.a libbetaELTO.a main.o
    $(CXX) $(ELTO_LDFLAGS) -L. -lalphaELTO -lbetaELTO main.o -o $@

libalphaELTO.a: libalpha.a
    $(CXX) $(RELOCATABLE_FLAGS) libalpha.a -o alphaELTO.o
    ar cr libalphaELTO.a alphaELTO.o

libbetaELTO.a: libbeta.a
    $(CXX) $(RELOCATABLE_FLAGS) libbeta.a -o betaELTO.o
    ar cr libbetaELTO.a betaELTO.o

# Targets for the new partial_elto built executable
# -------------------------------------------------

partial_elto: libalpha.a libbetaELTO.a main.o
    $(CXX) $(ELTO_LDFLAGS) -L. -lalpha -lbetaELTO main.o -o $@

# Cleaning targets
# ----------------

clean:
    rm -f *.o *.a

cleanest: clean
    rm -f lto elto partial_elto

Building the lto executable works, but building the elto or partial_lto executable fails because of Problem 1):

$ make -s lto
$ make -s elto 2>&1 | head -1
ld.lld: error: undefined symbol: alpha_add

My workaround replaces the archive with its contents when I generate the relocatable (-r),

$ sed -i 's/libalpha.a -o/alpha1.o alpha2.o -o/g' makefile
$ sed -i 's/libbeta.a -o/beta1.o beta2.o -o/g' makefile
$ make -s clean
$ make -s

Disassembly confirms success. Specifically,

  • lto: all definitions inlined (no call instructions)
  • elto: the two impl definitions were inlined (no call to alpha_add_impl, nor beta_add_impl)
  • partial_elto: same as elto but with all libalpha definitions inlined (no call to alpha_add)
$ llvm-objdump --disassemble-symbols=main,alpha_add,beta_add lto elto partial_elto 2>/dev/null |grep -e lto -e callq
lto:    file format elf64-x86-64
elto:   file format elf64-x86-64
  2017e1: e8 2a ff ff ff                callq   0x201710 <alpha_add>
  2017e9: e8 42 ff ff ff                callq   0x201730 <beta_add>
partial_elto:   file format elf64-x86-64
  20177c: e8 4f ff ff ff                callq   0x2016d0 <beta_add>

amdgpu+rdc (bad)

I'm building with rocm 6.0.0 and targeting the amd mi250x, so I modified my x86+lto code to use hip with rdc:

$ cat main.c
void alpha_add(int *x);
void beta_add(int *x);
int main(int argc, char **argv) {
    alpha_add(&argc);
    beta_add(&argc);
    return argc;
}
$ cat alpha1.c
#include <hip/hip_runtime.h>
__device__ int alpha_add_impl(int x);
__global__ void alpha_kernel(int *x) { *x = alpha_add_impl(*x); }
void alpha_add(int *x) {
    alpha_kernel<<<1,1>>>(x);
}
$ cat alpha2.c
__device__ int alpha_add_impl(int x) { return x + 1; }
$ cat beta1.c
#include <hip/hip_runtime.h>
__device__ int beta_add_impl(int x);
__global__ void beta_kernel(int *x) { *x = beta_add_impl(*x); }
void beta_add(int *x) {
    beta_kernel<<<1,1>>>(x);
}
$ cat beta2.c
__device__ int beta_add_impl(int x) { return x + 1; }
$ cat makefile
ROCM_PATH=/opt/rocm-6.0.0

CXX_FLAGS=-O2 -fgpu-rdc --offload-arch=gfx90a -x hip
RELOCATABLE_FLAGS=-r --hip-link -no-hip-rt -O2 -fgpu-rdc --offload-arch=gfx90a

ERDC_LDFLAGS=-O2 --offload-arch=gfx90a -Wl,-rpath=$(ROCM_PATH)/lib
RDC_LDFLAGS=$(ERDC_LDFLAGS) -fgpu-rdc

all: rdc erdc partial_erdc

# Targets for the normal rdc built executable
# -------------------------------------------

rdc: libalpha.a libbeta.a main.o
    $(ROCM_PATH)/bin/hipcc $(RDC_LDFLAGS) -L. -lalpha -lbeta main.o -o $@

libalpha.a: alpha1.o alpha2.o
    ar cr libalpha.a alpha1.o alpha2.o

libbeta.a: beta1.o beta2.o
    ar cr libbeta.a beta1.o beta2.o

.c.o:
    $(ROCM_PATH)/bin/hipcc $(CXX_FLAGS) -c -o $@ $<

# Targets for the new erdc built executable
# -----------------------------------------

erdc: libalphaERDC.a libbetaERDC.a main.o
    $(ROCM_PATH)/bin/hipcc $(ERDC_LDFLAGS) -L. -lalphaERDC -lbetaERDC main.o -o $@

libalphaERDC.a: libalpha.a
    $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) libalpha.a -o alphaERDC.o
    ar cr libalphaERDC.a alphaERDC.o

libbetaERDC.a: libbeta.a
    $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) libbeta.a -o betaERDC.o
    ar cr libbetaERDC.a betaERDC.o

# Targets for the new partial_erdc built executable
# -------------------------------------------------

partial_erdc: libalpha.a libbetaERDC.a main.o
    $(ROCM_PATH)/bin/hipcc $(RDC_LDFLAGS) -L. -lalpha -lbetaERDC main.o -o $@

# Cleaning targets
# ----------------

clean:
    rm -f *.o *.a

cleanest: clean
    rm -f rdc erdc partial_erdc

Building the rdc executable works, but building the erdc or partial_erdc executable fails because of Problem 1):

$ make -s rdc
$ make -s erdc 2>&1 | head -1
ld.lld: error: undefined symbol: alpha_add

My workaround replaces the archive with its contents when I generate the relocatable (-r), as I did in the x86+lto case, but now I encounter Problem 2):

$ sed -i 's/libalpha.a -o/alpha1.o alpha2.o -o/g' makefile
$ sed -i 's/libbeta.a -o/beta1.o beta2.o -o/g' makefile
$ make -s clean
$ make -s erdc 2>&1 | head -5
ld.lld: error: duplicate symbol: __hip_fatbin
>>> defined at alpha2.c
>>>            alphaERDC.o:(.hip_fatbin+0x0) in archive ./libalphaERDC.a
>>> defined at beta2.c
>>>            betaERDC.o:(.hip_fatbin+0x0) in archive ./libbetaERDC.a

My workaround replaces the relocatables with a single relocatable

$ cat workaround_problem2_patch.txt
--- a
+++ b
@@ -26,12 +26,12 @@
 # Targets for the new erdc built executable
 # -----------------------------------------

-erdc: libalphaERDC.a libbetaERDC.a main.o
-   $(ROCM_PATH)/bin/hipcc $(ERDC_LDFLAGS) -L. -lalphaERDC -lbetaERDC main.o -o $@
+erdc: libalphabetaERDC.a  main.o
+   $(ROCM_PATH)/bin/hipcc $(ERDC_LDFLAGS) -L. -lalphabetaERDC main.o -o $@

-libalphaERDC.a: libalpha.a
-   $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) alpha1.o alpha2.o -o alphaERDC.o
-   ar cr libalphaERDC.a alphaERDC.o
+libalphabetaERDC.a: libalpha.a libbeta.a
+   $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) alpha1.o alpha2.o beta1.o beta2.o -o alphabetaERDC.o
+   ar cr libalphabetaERDC.a alphabetaERDC.o

 libbetaERDC.a: libbeta.a
    $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) beta1.o beta2.o -o betaERDC.o
$ patch makefile workaround_problem2_patch.txt
$ make -s erdc

but this is bad. I need a better workaround. Finally, building partial_erdc fails because of Problem 3)

$ make -s partial_erdc 2>&1 | head -1
/opt/rocm-6.0.0/llvm/bin/clang-offload-bundler: error: 'betaERDC-hip-amdgcn-amd-amdhsa--gfx90abetaERDC.bc': Invalid encoding

My workaround uses objcopy to remove the CLANG_OFFLOAD_BUNDLE sections. This gets me past "Invalid encoding" but my partial_erdc build still fails with the __hip_fatbin duplicate symbol error from Problem 2)

$ cat workaround_problem3_patch.txt
--- a
+++ b
@@ -34,7 +34,10 @@
    ar cr libalphabetaERDC.a alphabetaERDC.o

 libbetaERDC.a: libbeta.a
-   $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) beta1.o beta2.o -o betaERDC.o
+   $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) beta1.o beta2.o -o betaERDC_preremoval.o
+   objcopy -R __CLANG_OFFLOAD_BUNDLE__hip-amdgcn-amd-amdhsa--gfx90a \
+           -R __CLANG_OFFLOAD_BUNDLE__host-x86_64-unknown-linux-gnu- \
+           betaERDC_preremoval.o betaERDC.o
    ar cr libbetaERDC.a betaERDC.o

 # Targets for the new partial_erdc built executable
$ patch makefile workaround_problem3_patch.txt
$ make -s clean
$ make -s partial_erdc 2>&1 | head -4
ld.lld: error: duplicate symbol: __hip_fatbin
>>> defined at beta2.c
>>>            betaERDC.o:(.hip_fatbin+0x0) in archive ./libbetaERDC.a
>>> defined at /var/tmp/pozulp1/main-0af0d9.o:(.hip_fatbin+0x0)

In summary, I need a workaround for Problem 2) __hip_fatbin duplicate symbol. What is __hip_fatbin? What is the .hip_fatbin section? (llvm-objdump -h shows that there is a section called .hip_fatbin) Is there a flag I can use when I generate the relocatable (-r) to leave __hip_fatbin undefined? Or is there a flag I can use while linking to tell the linker to ignore duplicate __hip_fatbin symbols?

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 23, 2024

In summary, I need a workaround for Problem 2) __hip_fatbin duplicate symbol. What is __hip_fatbin? What is the .hip_fatbin section? (llvm-objdump -h shows that there is a section called .hip_fatbin) Is there a flag I can use when I generate the relocatable (-r) to leave __hip_fatbin undefined? Or is there a flag I can use while linking to tell the linker to ignore duplicate __hip_fatbin symbols?

__hip_fatbin is the location of the actual AMDGPU executable that the runtime will load. It's obviously quite important that it exists. I'm actually surprised that you get a duplicate symbol, since it realistically should be internal. But I haven't done any work on the HIP driver.

I'm still very curious if using --offload-new-driver works in any of these examples. I tested RDC linking when writing the new driver, so hopefully it works. At some point I want to move HIP over to the new framework by default, but that's still WIP. Could you throw it in some of your builds and see if it breaks in any obvious ways? If there's issues I can try to fix them.

@pozulp
Copy link
Member Author

pozulp commented Jan 23, 2024

I'm still very curious if using --offload-new-driver works in any of these examples.

Hey @jhuber6, I got duplicate symbol: __dummy.hip_offloading.entry, did I use the flag incorrectly? Here's a repro:

$ cat main.c
#include <hip/hip_runtime.h>
__device__ int add1(int y);
__global__ void mykernel(int *y) { *y = add1(*y); }
int main(int argc, char **argv) {
    mykernel<<<1,1>>>(&argc);
    return argc;
}
$ cat add.c
__device__ int add1(int y) { return y + 1; }
$ cat makefile
ROCM_PATH=/opt/rocm-6.0.0

CXXFLAGS=-O2 -fgpu-rdc --offload-arch=gfx90a -x hip
RELOCATABLE_FLAGS=-r --hip-link -no-hip-rt -O2 -fgpu-rdc --offload-arch=gfx90a

ERDC_LDFLAGS=-O2 --offload-arch=gfx90a -Wl,-rpath=$(ROCM_PATH)/lib
RDC_LDFLAGS=$(ERDC_LDFLAGS) -fgpu-rdc

all: rdc erdc

rdc: add.o main.o
    $(ROCM_PATH)/bin/hipcc $(RDC_LDFLAGS) -o $@ $^

.c.o:
    $(ROCM_PATH)/bin/hipcc $(CXXFLAGS) -c -o $@ $<

uber.o: add.o main.o
    $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) -o $@ $^

erdc: uber.o
    $(ROCM_PATH)/bin/hipcc $(ERDC_LDFLAGS) -o $@ $<

clean:
    rm -f *.o

cleanest: clean
    rm -f rdc erdc

Run make to verify that this works.

$ make -s

Add --offload-new-driver to the flags.

sed -i 's/^CXXFLAGS=\(.*\)/CXXFLAGS=\1 --offload-new-driver/g' makefile
sed -i 's/^RELOCATABLE_FLAGS=\(.*\)/RELOCATABLE_FLAGS=\1 --offload-new-driver/g' makefile
sed -i 's/^ERDC_LDFLAGS=\(.*\)/ERDC_LDFLAGS=\1 --offload-new-driver/g' makefile
sed -i 's/^RDC_LDFLAGS=\(.*\)/RDC_LDFLAGS=\1 --offload-new-driver/g' makefile

Running make rdc gives a warning and make erdc also gives the error duplicate symbol: __dummy.hip_offloading.entry:

$ make cleanest
$ make -s rdc
clang: warning: argument unused during compilation: '--offload-arch=gfx90a' [-Wunused-command-line-argument]
$ make -s erdc
clang: warning: argument unused during compilation: '--offload-arch=gfx90a' [-Wunused-command-line-argument]
ld.lld: error: duplicate symbol: __dummy.hip_offloading.entry
>>> defined at offload.wrapper.module
>>>            /var/tmp/pozulp1/erdc.image.wrapper-590c34.o:(hip_offloading_entries+0x0)
>>> defined at main.c
>>>            uber.o:(.omp_offloading.entry._Z8mykernelPi)

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 23, 2024

I'm still very curious if using --offload-new-driver works in any of these examples.

Hey @jhuber6, I got duplicate symbol: __dummy.hip_offloading.entry, did I use the flag incorrectly? Here's a repro:

$ cat main.c
#include <hip/hip_runtime.h>
__device__ int add1(int y);
__global__ void mykernel(int *y) { *y = add1(*y); }
int main(int argc, char **argv) {
    mykernel<<<1,1>>>(&argc);
    return argc;
}
$ cat add.c
__device__ int add1(int y) { return y + 1; }
$ cat makefile
ROCM_PATH=/opt/rocm-6.0.0

CXXFLAGS=-O2 -fgpu-rdc --offload-arch=gfx90a -x hip
RELOCATABLE_FLAGS=-r --hip-link -no-hip-rt -O2 -fgpu-rdc --offload-arch=gfx90a

ERDC_LDFLAGS=-O2 --offload-arch=gfx90a -Wl,-rpath=$(ROCM_PATH)/lib
RDC_LDFLAGS=$(ERDC_LDFLAGS) -fgpu-rdc

all: rdc erdc

rdc: add.o main.o
    $(ROCM_PATH)/bin/hipcc $(RDC_LDFLAGS) -o $@ $^

.c.o:
    $(ROCM_PATH)/bin/hipcc $(CXXFLAGS) -c -o $@ $<

uber.o: add.o main.o
    $(ROCM_PATH)/llvm/bin/clang $(RELOCATABLE_FLAGS) -o $@ $^

erdc: uber.o
    $(ROCM_PATH)/bin/hipcc $(ERDC_LDFLAGS) -o $@ $<

clean:
    rm -f *.o

cleanest: clean
    rm -f rdc erdc

Run make to verify that this works.

$ make -s

Add --offload-new-driver to the flags.

sed -i 's/^CXXFLAGS=\(.*\)/CXXFLAGS=\1 --offload-new-driver/g' makefile
sed -i 's/^RELOCATABLE_FLAGS=\(.*\)/RELOCATABLE_FLAGS=\1 --offload-new-driver/g' makefile
sed -i 's/^ERDC_LDFLAGS=\(.*\)/ERDC_LDFLAGS=\1 --offload-new-driver/g' makefile
sed -i 's/^RDC_LDFLAGS=\(.*\)/RDC_LDFLAGS=\1 --offload-new-driver/g' makefile

Running make rdc gives a warning and make erdc also gives the error duplicate symbol: __dummy.hip_offloading.entry:

$ make cleanest
$ make -s rdc
clang: warning: argument unused during compilation: '--offload-arch=gfx90a' [-Wunused-command-line-argument]
$ make -s erdc
clang: warning: argument unused during compilation: '--offload-arch=gfx90a' [-Wunused-command-line-argument]
ld.lld: error: duplicate symbol: __dummy.hip_offloading.entry
>>> defined at offload.wrapper.module
>>>            /var/tmp/pozulp1/erdc.image.wrapper-590c34.o:(hip_offloading_entries+0x0)
>>> defined at main.c
>>>            uber.o:(.omp_offloading.entry._Z8mykernelPi)

Thanks, I've copied your reproducers. I think I'll take some time to look into this on that side.

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 23, 2024

The new driver uses a linker-defined array to traverse the list of entries. The dummy symbol there is used to force the linker to define the section even when the user doesn't have any entries (e.g. kernels). The problem is that the symbols conflict. I can either make this symbol weak so it will merge with other definitions, or make it internal but used so it doesn't participate in linking but still makes the section. I'm partial to the latter, the overhead of checking an extra dummy entry is completely negligible and it has the benefit of hiding this implementation detail for the user.

I tried your basic RDC test and for some reason it's not extracting the symbols from the static library correctly. It works if I pass -Wl,--whole-archive, and this is tested for OpenMP which uses the same platform, so I'll need to look into why this isn't working.

Thanks for the detailed report, it's really helpful.

Also, the warning is because the link step doesn't need --offload-arch for the link step because that information is embedded into the binaries themselves.

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 24, 2024

I made #79231, but it has me wondering what the exact semantics of this would be. The patch is a good idea regardless though.

Right now, it seems doing a -r linking job with the new driver will cause it to create the "wrapping" code that registers the executable and kernels and stuff. This would then result in multiple images, it's almost like doing -fno-gpu-rdc for a subset of files at that point. This may actually be desirable behavior to sort of "cut off" GPU programs if you don't want libraries to provide linkable GPU code anymore. However, those semantics might be different from what's expected here, where we may want to simply merge this without touching the embedded GPU code.

I think the main issue here is that given

clang a.c b.c --offload-arch=gfx90a -c
clang a.o b.o -r -o out.o

The out.o file will contain the embedded device code in .llvm.offloading since we rely on the SHF_EXCLUDE flag to handle that for us when creating the executable. That means that out.o will both contain the compiled GPU executable from a.c and b.c while also still containing it's IR. So when someone else links against out.o it will inherit the executable as well as the IR and create a new executable.

I may need to shim in some logic to address this, depending on what we actually want to happen.

Option 1: Compiling with -r compiles and embeds GPU code, effectively allowing you to do -fno-gpu-rdc semantics but for a group of files.
Option 2: Compiling with -r skips the GPU-side handling and simply merges the .llvm.offloading section untouched. This would push the linking and registration into the final executable creation.

Making the former work would require stripping the .llvm.offloading section from the linker output. The latter would simply require parsing -r and skipping all the GPU specific handling in the clang-linker-wrapper. Both seem like useful effects so we may want to consider choosing the reasonable one to be the default with -r and then making another flag for the other one.

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 24, 2024

Okay @pozulp, I remembered the other reason why it wasn't working. The clang-linker-wrapper handles doing all the device-side magic, including extracting stuff from static libraries. Because I'm lazy, I went with the easy method where we scan input files in order (in GNU ld.bfd fashion). Static archives only extract if their symbols are needed, so if they're specified before the thing that uses the library it won't be extracted. So, in your Makefile you do

clang $(RDC_LDFLAGS) -L. lalpha -lbeta main.o -o $@

You also do

clang $(RELOCATABLE_FLAGS) libalpha.a -o alphaERDC.o

This means that neither of the libraries will extract, because they're checked before main.o is input or just not used at all. With #79314 and #79231 applied and some modifications to your Makefile it seems to work with the new driver now.

Here's the modified Makefile I made.

ROCM_PATH=/opt/rocm

CXX_FLAGS=-O2 -fgpu-rdc --offload-arch=gfx1030 -x hip --offload-new-driver
RELOCATABLE_FLAGS=-r --offload-new-driver -no-hip-rt -O2 -fgpu-rdc --offload-arch=gfx1030

ERDC_LDFLAGS=-O2 --offload-arch=gfx1030 -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib --offload-new-driver -lamdhip64
RDC_LDFLAGS=$(ERDC_LDFLAGS) -fgpu-rdc --offload-new-driver

all: rdc erdc

# Targets for the normal rdc built executable
# -------------------------------------------

rdc: libalpha.a libbeta.a main.o
	clang $(RDC_LDFLAGS) -L. main.o -lalpha -lbeta -o $@

libalpha.a: alpha1.o alpha2.o
	ar cr libalpha.a alpha1.o alpha2.o

libbeta.a: beta1.o beta2.o
	ar cr libbeta.a beta1.o beta2.o

.c.o:
	clang $(CXX_FLAGS) -c -o $@ $<

# Targets for the new erdc built executable
# -----------------------------------------

erdc: libalphaERDC.a libbetaERDC.a main.o
	clang $(ERDC_LDFLAGS) -L. main.o -lalphaERDC -lbetaERDC -o $@

libalphaERDC.a: alpha1.o alpha2.o
	clang $(RELOCATABLE_FLAGS) alpha1.o alpha2.o -o alphaERDC.o
	ar cr libalphaERDC.a alphaERDC.o

libbetaERDC.a: beta1.o beta2.o
	clang $(RELOCATABLE_FLAGS) beta1.o beta2.o -o betaERDC.o
	ar cr libbetaERDC.a betaERDC.o

# Cleaning targets
# ----------------

clean:
	rm -f *.o *.a

cleanest: clean
	rm -f rdc erdc

This will do what I expect to be "correct" default behavior once landed.

However, from your initial report it seems that you're more interested in being able to "cut off" RDC libraries so they do not inflate link time when linked via a static library. I think this should be a separate flag, because it's somewhat different from RDC linking in general, which simply merges multiple object files into one.

What I'll need to add is a separate flag to instruct the linker wrapper to perform the device-side linking and wrapping for the module and then delete the associated section so it won't be linked again. This should be doable once #79231 lands. Is that the behavior you'd expect @pozulp? I.e.

clang -x hip foo.c bar.c -c --offload-arch=gfx90a -fgpu-rdc
clang -r --offload-link foo.c bar.c -o merged.o
llvm-ar rcs libfoo.a merged.o // libfoo.a contains no GPU device code
clang -x hip main.c libfoo.a --offload-arch=gfx90a -fgpu-rdc // main.c will not link with GPU code from libfoo.a

@pozulp
Copy link
Member Author

pozulp commented Jan 29, 2024

What I'll need to add is a separate flag to instruct the linker wrapper to perform the device-side linking and wrapping for the module and then delete the associated section so it won't be linked again. This should be doable once #79231 lands. Is that the behavior you'd expect @pozulp?

Hey @jhuber6, yes! Sounds great. If you see @yxsamliu online this week please ask how to fix Problem 2) __hip_fatbin duplicate symbol. My reproducer is #77018 (comment). I'm hoping Sam's magic has not run out yet. Sam is the one who fixed the attempted static link of dynamic object in /opt/rocm-6.0.0/lib by running /opt/rocm/llvm/bin/clang -r instead of hipcc -r and I'm really impressed by how well it works. On Thursday, a colleague and I used Sam's fix to write a script that reduced the link time of an important configuration of our large code (containing a few million lines of C++) from 90 seconds to 2 seconds. It has some problems and it needs much more testing before it's ready for production, but here's what we have so far. We named it erdc.sh:

#!/bin/bash
#
# 2024-01-25 Created by Mike Pozulp with help from Tom Stitt,
# Lawrence Livermore National Laboratory, Livermore, CA 94550 USA
#
# This script accepts multiple archives containing rdc object files and outputs
# one archive containing one object file. The new object file has object code
# instead of llvm ir bitcode. This "early" rdc operation avoids having to link
# with -fgpu-rdc. Currently, you *must* remove -fgpu-rdc from your link line,
# and thus all libraries which require rdc must be included in the erdc step.
# Otherwise your link will fail with
#
#   duplicate symbol: __hip_fatbin
#
# For more details, see https://github.com/llvm/llvm-project/issues/77018.
#
# EXAMPLES:
#
# ROCM_PATH=/opt/rocm-5.7.1 ./erdc.sh libalpha.a libbeta.a
# ROCM_PATH=/opt/rocm-5.7.1 ARCH_FLAGS='--offload-arch=gfx90a --offload-arch=gfx940' ./erdc.sh libalpha.a libbeta.a libgamma.a

MY_ARCH_FLAGS=${ARCH_FLAGS:-'--offload-arch=gfx90a'}

# TODO: Uncomment below, then figure out why libERDC.a does not get created.
#MY_ADDITIONAL_FLAGS=${ADDITIONAL_FLAGS:-'-O2'}

# Expand archives into object files otherwise the relocatable (-r) will not
# have any symbols and linking will fail with error: undefined reference.
explode=$(mktemp -d)
pushd $explode
cp $@ .
for f in *.a; do
    ar x $f
done
rm *.a 
time $ROCM_PATH/llvm/bin/clang++ \
    -r -no-hip-rt -fgpu-rdc --hip-link \
    $MY_ARCH_FLAGS \
    $MY_ADDITIONAL_FLAGS \
    --no-gpu-bundle-output \
    -o uber.o *.o 

# Remove CLANG_OFFLOAD_BUNDLE sections, otherwise a partial erdc will fail
# during -fgpu-rdc linking with error: Invalid encoding.
#
# TODO: use objdump -h to get names of CLANG_OFFLOAD_BUNDLE sections, like
# llvm-objdump -h uber.o | grep CLANG_OFFLOAD_BUNDLE | ...
objcopy \
-R __CLANG_OFFLOAD_BUNDLE__hip-amdgcn-amd-amdhsa--gfx90a \
-R __CLANG_OFFLOAD_BUNDLE__hip-amdgcn-amd-amdhsa--gfx940 \
-R __CLANG_OFFLOAD_BUNDLE__host-x86_64-unknown-linux-gnu- \
uber.o uber_no_offload_sections.o

popd
ar cr libERDC.a $explode/uber_no_offload_sections.o

# Cleanup the temporary directory
rm -rf $explode

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 30, 2024

Okay, I hacked something together with the new driver framework as a proof of concept. First, this requires reverting 0f8b529 locally to get the old behavior back. I'm using OpenMP just because it uses the new driver natively, always builds in RDC-mode, and works better with community LLVM.

// foo.c
int bar(void) { return 2; }
#pragma omp declare target to(bar) device_type(nohost)

int foo(void) {
  int x = 0;
#pragma omp target map(from : x)
  { x = bar(); }
  return x;
}
// main.c
extern int foo(void);

int bar(void) { return 1; }
#pragma omp declare target to(bar) device_type(nohost)

int main() {
  int x = foo();
  int y = 0;
#pragma omp target map(from : y)
  { y = bar(); }
  return x + y;
}

Here's a hacky script I wrote to fully link foo.o using a relocatable link. Hopefully the comments add some context. The documentation at https://clang.llvm.org/docs/OffloadingDesign.html sheds some light on the underlying implementation details here.

#!/bin/bash

set -v
set -e

# Get the `foo.o` object with embedded GPU code.
clang foo.c -fopenmp --offload-arch=native -fopenmp-offload-mandatory -c
# Rename the `.llvm.offloading` section. This is where the device code lives. We only need
# to do this because the linked GPU binary uses the same section name and they'll get clobbered
# when doing a relocatable link. This has a custom ELF type so the name is irrelevant for everything else
llvm-objcopy --rename-section .llvm.offloading=.llvm.offloading.rel foo.o
# This relocatable link will fully link the embedded GPU code in `foo.o` and then create a blob to
# register it with the OpenMP runtime, this blob will be merged into `merged.o`.
clang foo.o -lomptarget.devicertl -r -fopenmp --offload-arch=native -o merged.o
# The registration blob primarily uses runtime sections to iterate the kernels and globals.
# The linker provides `__[start|stop]_<secname>` symbols to traverse it. These will conflict
# with anything else we link so we need to rename it to something unique for this module.
# Also delete the old embedded code so nothing else will link with it.
llvm-objcopy \
  --remove-section .llvm.offloading.rel \
  --rename-section omp_offloading_entries=omp_offloading_entries_1 \
  --redefine-sym __start_omp_offloading_entries=__start_omp_offloading_entries_1 \
  --redefine-sym __stop_omp_offloading_entries=__stop_omp_offloading_entries_1 \
  merged.o
# Handle the rest as normal.
llvm-ar rcs libfoo.a merged.o
clang main.c libfoo.a -fopenmp --offload-arch=native -fopenmp-offload-mandatory
./a.out || echo $?

It works in theory, implementing it would require a few hacks however. HIP uses the exact same handling under --offload-new-driver. Hope this is interesting / useful in some way, if so I can put in the effort to get this working through some clang flag.

@yxsamliu
Copy link
Collaborator

It is possible to support -r of part of the objects then link them all together. However, this needs some change to clang driver and I doubt how useful this feature is. Why not using -shared to link each partition of the objects as a shared library, then link the main program with all the shared libraries? The current clang driver support that for HIP.

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 31, 2024

So, with #80066 applied I was able to do the following with two generic HIP files using my local installation of ROCm 5.7.1.

// main.hip
#include <hip/hip_runtime.h>                                     
 
__global__ void kernel2() {
  printf("%s\n", __PRETTY_FUNCTION__);
}

extern void foo();

int main() {
  foo();
  hipLaunchKernelGGL(kernel2, dim3(1), dim3(1), 0, 0);
  auto x = hipDeviceSynchronize();
}
// foo.hip
#include <hip/hip_runtime.h>

__global__ void kernel1() {
  printf("%s\n", __PRETTY_FUNCTION__);
}

void foo() {
  hipLaunchKernelGGL(kernel1, dim3(1), dim3(1), 0, 0);
  auto x = hipDeviceSynchronize();
}

I compiled both of them using the new driver and the -r support in the patch.

$ clang -x hip foo.hip --offload-arch=native -c --offload-new-driver -fgpu-rdc
$ clang foo.o --offload-link -r -o merged.o
$ llvm-ar rcs libfoo.a merged.o 
$ clang -x hip main.hip --offload-arch=native --offload-new-driver -fgpu-rdc -L. -lfoo
$ ./a.out
void kernel1()                                              
void kernel2()

Which seems to be what you're asking for. If you do llvm-objdump -s -h .hip_fatbin you can verify that there are indeed two separate images created rather than the one linked image you would get with standard RDC compilation. Let me know if there's anything else that would help.

@pozulp
Copy link
Member Author

pozulp commented Feb 2, 2024

Which seems to be what you're asking for. If you do llvm-objdump -s -h .hip_fatbin you can verify that there are indeed two separate images created rather than the one linked image you would get with standard RDC compilation. Let me know if there's anything else that would help.

Thanks Joseph!

It is possible to support -r of part of the objects then link them all together. However, this needs some change to clang driver and I doubt how useful this feature is. Why not using -shared to link each partition of the objects as a shared library, then link the main program with all the shared libraries? The current clang driver support that for HIP.

Hey @yxsamliu, can you show me how to do this? You could use the 5-file example (main.c alpha1.c alpha2.c beta1.c beta2.c) that I shared in my comment last week #77018 (comment).

@yxsamliu
Copy link
Collaborator

yxsamliu commented Feb 3, 2024

Hey @yxsamliu, can you show me how to do this? You could use the 5-file example (main.c alpha1.c alpha2.c beta1.c beta2.c) that I shared in my comment last week #77018 (comment).

$ PATH=/opt/rocm/llvm/bin:$PATH make VERBOSE=1
clang++ -O2 -fgpu-rdc --offload-arch=gfx1030 -x hip -fpic   -c -o alpha1.o alpha1.c
clang++ -O2 -fgpu-rdc --offload-arch=gfx1030 -x hip -fpic   -c -o alpha2.o alpha2.c
clang -shared --hip-link -O2 -fgpu-rdc --offload-arch=gfx1030 alpha1.o alpha2.o -o libalphaERDC.so
clang++ -O2 -fgpu-rdc --offload-arch=gfx1030 -x hip -fpic   -c -o beta1.o beta1.c
clang++ -O2 -fgpu-rdc --offload-arch=gfx1030 -x hip -fpic   -c -o beta2.o beta2.c
clang -shared --hip-link -O2 -fgpu-rdc --offload-arch=gfx1030 beta1.o beta2.o -o libbetaERDC.so
clang++ -O2 -fgpu-rdc --offload-arch=gfx1030 -x hip -fpic   -c -o main.o main.c
clang -O2 --hip-link --offload-arch=gfx1030 -L/opt/rocm/lib -Wl,-rpath=/opt/rocm/lib -lamdhip64 -L. main.o -lalphaERDC -lbetaERDC -o erdc

$ cat Makefile
ROCM_PATH=/opt/rocm

CC = clang++
CFLAGS=-O2 -fgpu-rdc --offload-arch=gfx1030 -x hip -fpic
RELOCATABLE_FLAGS=-shared --hip-link -O2 -fgpu-rdc --offload-arch=gfx1030

ERDC_LDFLAGS=-O2 --hip-link --offload-arch=gfx1030 -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib -lamdhip64
RDC_LDFLAGS=$(ERDC_LDFLAGS) -fgpu-rdc

all: erdc


erdc: libalphaERDC.so libbetaERDC.so main.o
	clang $(ERDC_LDFLAGS) -L. main.o -lalphaERDC -lbetaERDC -o $@

libalphaERDC.so: alpha1.o alpha2.o
	clang $(RELOCATABLE_FLAGS) alpha1.o alpha2.o -o libalphaERDC.so

libbetaERDC.so: beta1.o beta2.o
	clang $(RELOCATABLE_FLAGS) beta1.o beta2.o -o libbetaERDC.so

clean:
	rm -f *.o *.so

cleanest: clean
	rm -f rdc erdc

@pozulp
Copy link
Member Author

pozulp commented Feb 5, 2024

Hey @yxsamliu, thanks for clarifying. I asked a few LLNL colleagues from a few different teams about dynamic linking and all of them said that it will not work for us, but they are interested in a solution using static linking. You said that

It is possible to support -r of part of the objects then link them all together. However, this needs some change to clang driver and I doubt how useful this feature is.

This is great news! But it sounds like you need more information before you attempt it. I can talk to you and Brendon Cahoon said that he can too. Cahoon, myself, and others from LLNL, AMD, and HPE want to run LLNL applications on the MI300s in the El Capitan machine that LLNL will deploy this year, and I think that this new build strategy could help.

@jhuber6
Copy link
Contributor

jhuber6 commented Feb 5, 2024

This is great news! But it sounds like you need more information before you attempt it. I can talk to you and Brendon Cahoon said that he can too. Cahoon, myself, and others from LLNL, AMD, and HPE want to run LLNL applications on the MI300s in the El Capitan machine that LLNL will deploy this year, and I think that this new build strategy could help.

The long-term goal is to move HIP compilation to the new offloading driver, which would make #80066 work in your case as expected. However, I don't know how long that would take for these changes to get filtered down into a ROCm release. I should probably take the time to work with other members of the HIP team to see what the current blockers are.

As far as I'm aware for HIP registration, we create a constructor for each TU that registers the relevant globals using an external handle that the link step then resolves once the actual image has been created. You'd probably need some post-link step to rename that handle, as it's linkonce and any further linked modules will override it incorrectly.

@yxsamliu
Copy link
Collaborator

yxsamliu commented Feb 6, 2024

Hey @yxsamliu, thanks for clarifying. I asked a few LLNL colleagues from a few different teams about dynamic linking and all of them said that it will not work for us, but they are interested in a solution using static linking. You said that

It is possible to support -r of part of the objects then link them all together. However, this needs some change to clang driver and I doubt how useful this feature is.

This is great news! But it sounds like you need more information before you attempt it. I can talk to you and Brendon Cahoon said that he can too. Cahoon, myself, and others from LLNL, AMD, and HPE want to run LLNL applications on the MI300s in the El Capitan machine that LLNL will deploy this year, and I think that this new build strategy could help.

For this approach to work well, the object files should be partitioned into small groups and the device code in each group are self-contained, i.e., they do not call any device functions or use any device variables in other groups. Does your HIP application have this trait? Thanks.

@pozulp
Copy link
Member Author

pozulp commented Feb 7, 2024

For this approach to work well, the object files should be partitioned into small groups and the device code in each group are self-contained, i.e., they do not call any device functions or use any device variables in other groups. Does your HIP application have this trait? Thanks.

Hey @yxsamliu, yes. Consider a graph in which the nodes are TUs. If two nodes have an edge between them, it means that there is at least one reference to a device function or device variable defined in the other. The graph for my HIP application is disconnected, meaning that there are at least two nodes which are not connected by a path. I made a visual to help explain: I drew the graph for my tiny 5-file program that I shared in my comment last week #77018 (comment). It is a disconnected graph containing 3 maximal connected subgraphs. I also made a table of TU combinations labeled with green checkmarks if they are valid early rdc combinations and red xmarks if they are not. See below.

graph

Finally, for anyone who is wondering if early rdc is right for them, there are at least two cases that would not benefit from early rdc:

  1. Codes with connected graphs, meaning that there is a path connecting every pair of nodes. These codes need -fgpu-rdc and would not benefit from early rdc because the smallest group of TUs on which early rdc could be performed contains every TU, which is no better than link-time codegen.

  2. Codes with empty graphs, meaning that the graph has zero edges. These codes do not need -fgpu-rdc.

The long-term goal is to move HIP compilation to the new offloading driver, which would make #80066 work in your case as expected.

Hey @jhuber6, do you mean that --offload-new-driver will be the default some day?

@jhuber6
Copy link
Contributor

jhuber6 commented Feb 7, 2024

Hey @jhuber6, do you mean that --offload-new-driver will be the default some day?

Yes, that is the goal. I need to take some time to see what's actually missing for HIP to use it by default.

yxsamliu added a commit to yxsamliu/llvm-project that referenced this issue Feb 14, 2024
`-fgpu-rdc` mode allows device functions call device functions
in different TU. However, currently all device objects
have to be linked together since only one fat binary
is supported. This is time consuming for AMDGPU backend
since it only supports LTO.

There are use cases that objects can be divided into groups
in which device functions are self-contained but host functions
are not. It is desirable to link/optimize/codegen the device
code and generate a fatbin for each group, whereas partially
link the host code with `ld -r` or generate a static library
by using the `-emit-static-lib` option of clang. This avoids
linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin`
for all objects for `-fgpu-rdc`. With this patch, clang
emits an unique external symbol `__hip_fatbin_{cuid}`
for the fat binary for each object. When a group of objects
are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary.
Each group has its own fat binary. One executable or
shared library can have multiple fat binaries. Device
linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also
uniquefied and merged to avoid repeated registering.
Symbol `__hip_cuid_{cuid}` is introduced to facilitate
debugging and tooling.

Fixes: llvm#77018
yxsamliu added a commit to yxsamliu/llvm-project that referenced this issue Feb 14, 2024
`-fgpu-rdc` mode allows device functions call device functions
in different TU. However, currently all device objects
have to be linked together since only one fat binary
is supported. This is time consuming for AMDGPU backend
since it only supports LTO.

There are use cases that objects can be divided into groups
in which device functions are self-contained but host functions
are not. It is desirable to link/optimize/codegen the device
code and generate a fatbin for each group, whereas partially
link the host code with `ld -r` or generate a static library
by using the `-emit-static-lib` option of clang. This avoids
linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin`
for all objects for `-fgpu-rdc`. With this patch, clang
emits an unique external symbol `__hip_fatbin_{cuid}`
for the fat binary for each object. When a group of objects
are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary.
Each group has its own fat binary. One executable or
shared library can have multiple fat binaries. Device
linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also
uniquefied and merged to avoid repeated registering.
Symbol `__hip_cuid_{cuid}` is introduced to facilitate
debugging and tooling.

Fixes: llvm#77018
Change-Id: Ia16ac3ddb05b66e6149288aacb0ba4a80120ad8c
yxsamliu added a commit to yxsamliu/llvm-project that referenced this issue Feb 22, 2024
`-fgpu-rdc` mode allows device functions call device functions
in different TU. However, currently all device objects
have to be linked together since only one fat binary
is supported. This is time consuming for AMDGPU backend
since it only supports LTO.

There are use cases that objects can be divided into groups
in which device functions are self-contained but host functions
are not. It is desirable to link/optimize/codegen the device
code and generate a fatbin for each group, whereas partially
link the host code with `ld -r` or generate a static library
by using the `-emit-static-lib` option of clang. This avoids
linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin`
for all objects for `-fgpu-rdc`. With this patch, clang
emits an unique external symbol `__hip_fatbin_{cuid}`
for the fat binary for each object. When a group of objects
are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary.
Each group has its own fat binary. One executable or
shared library can have multiple fat binaries. Device
linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also
uniquefied and merged to avoid repeated registering.
Symbol `__hip_cuid_{cuid}` is introduced to facilitate
debugging and tooling.

Fixes: llvm#77018
yxsamliu added a commit to yxsamliu/llvm-project that referenced this issue Feb 22, 2024
`-fgpu-rdc` mode allows device functions call device functions
in different TU. However, currently all device objects
have to be linked together since only one fat binary
is supported. This is time consuming for AMDGPU backend
since it only supports LTO.

There are use cases that objects can be divided into groups
in which device functions are self-contained but host functions
are not. It is desirable to link/optimize/codegen the device
code and generate a fatbin for each group, whereas partially
link the host code with `ld -r` or generate a static library
by using the `-emit-static-lib` option of clang. This avoids
linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin`
for all objects for `-fgpu-rdc`. With this patch, clang
emits an unique external symbol `__hip_fatbin_{cuid}`
for the fat binary for each object. When a group of objects
are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary.
Each group has its own fat binary. One executable or
shared library can have multiple fat binaries. Device
linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also
uniquefied and merged to avoid repeated registering.
Symbol `__hip_cuid_{cuid}` is introduced to facilitate
debugging and tooling.

Fixes: llvm#77018
yxsamliu added a commit that referenced this issue Feb 22, 2024
`-fgpu-rdc` mode allows device functions call device functions in
different TU. However, currently all device objects have to be linked
together since only one fat binary is supported. This is time consuming
for AMDGPU backend since it only supports LTO.

There are use cases that objects can be divided into groups in which
device functions are self-contained but host functions are not. It is
desirable to link/optimize/codegen the device code and generate a fatbin
for each group, whereas partially link the host code with `ld -r` or
generate a static library by using the `--emit-static-lib` option of
clang. This avoids linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin` for all
objects for `-fgpu-rdc`. With this patch, clang emits an unique external
symbol `__hip_fatbin_{cuid}` for the fat binary for each object. When a
group of objects are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary. Each group has its
own fat binary. One executable or shared library can have multiple fat
binaries. Device linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also uniquefied and
merged to avoid repeated registering. Symbol `__hip_cuid_{cuid}` is
introduced to facilitate debugging and tooling.

Fixes: #77018
@llvmbot
Copy link
Member

llvmbot commented Feb 22, 2024

@llvm/issue-subscribers-clang-codegen

Author: Mike Pozulp (pozulp)

Hey @arsenm and @jdoerfert, how do I generate a relocatable object (-r) for the amdgpu target? I am linking a large code containing a few millions of lines of C++ with an optional library dependency containing about 300,000 lines of C++. The library requires relocatable device code (-fgpu-rdc) because it has many kernels which reference device functions defined in separate translation units. The large code does not. A driver for the library links in 30 minutes. The large code takes 2 minutes to link without the optional library and over 8 hours with the library (the lld process is still running after 8 hours). I don't want to use rdc to link the large code, but I have to because of the optional library: if even a single object needs rdc, then the link needs it too. Perhaps an intermediate step between compiling the library and linking the large code, in which I generate a relocatable object (-r) from the rdc-compiled library, would allow me to link the large code without rdc even when I'm using the optional library.

x86+LTO (good)

Consider using LTO to target x86, which works as expected. During compilation, clang -flto emits LLVM IR, which lld uses to perform link time optimizations like cross translation unit inlining. Here is an example:

$ cat main.c
int add1(int y);
int main(int argc, char **argv) { return add1(argc); }
$ cat add.c
int add1(int y) { return y + 1; }
$ cat build.sh
dirs="normal_build lto_build relocatable_build"
for dir in $(echo $dirs); do rm -rf $dir; mkdir $dir; done

# build separate compilation executable
dir=normal_build
clang -O2 -c add.c -o $dir/add.o                          # add.o  contains object code
clang -O2 -c main.c -o $dir/main.o                        # main.o contains object code
clang -O2 $dir/add.o $dir/main.o -o $dir/foo              # linker sees object code

# build lto executable
dir=lto_build
clang -flto -O2 -c add.c -o $dir/add.o                    # add.o  contains llvm IR
clang -flto -O2 -c main.c -o $dir/main.o                  # main.o contains llvm IR
clang -flto -O2 $dir/add.o $dir/main.o -o $dir/foo        # linker sees llvm IR

# build lto executable but with an intermediate step between compiling and
# linking which creates relocatable uber.o
dir=relocatable_build 
clang    -flto -O2 -c add.c -o $dir/add.o                 # add.o  contains llvm IR
clang    -flto -O2 -c main.c -o $dir/main.o               # main.o contains llvm IR
clang -r -flto -O2 $dir/add.o $dir/main.o -o $dir/uber.o  # uber.o contains object code
clang          -O2 $dir/uber.o -o $dir/foo                # linker sees object code

Building and then disassembling the executables shows that add1, which is referenced and defined in separate translation units, is inlined for the two LTO builds but not for the separate compilation build, as expected:

$ sh &lt; build.sh 
$ llvm-objdump --disassemble-symbols=main */foo 
lto_build/foo:  file format elf64-x86-64

Disassembly of section .text:

0000000000400540 &lt;main&gt;:
  400540: 8d 47 01                      leal    0x1(%rdi), %eax
  400543: c3                            retq
  400544: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40054e: 66 90                         nop
  
normal_build/foo:   file format elf64-x86-64

Disassembly of section .text: 

0000000000400550 &lt;main&gt;:
  400550: e9 eb ff ff ff                jmp 0x400540 &lt;add1&gt;
  400555: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40055f: 90                            nop 

relocatable_build/foo:  file format elf64-x86-64

Disassembly of section .text:

0000000000400570 &lt;main&gt;:
  400570: 8d 47 01                      leal    0x1(%rdi), %eax
  400573: c3                            retq
  400574: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
  40057e: 66 90                         nop

The difference in the two LTO builds is that one had -flto on the link line and the other didn't. The one which included an intermediate step between compiling and linking to create a relocatable object did not need -flto on the link line because I gave the linker object code, not LLVM IR.

amdgpu+rdc (bad)

Now consider my use case. I'm building with rocm 6.0.0, the latest rocm clang distribution installed on my system, and I am targeting the amd mi250x. I modified my x86+LTO code to use hip with rdc:

$ cat main.c
#include &lt;hip/hip_runtime.h&gt;
__device__ int add1(int y);
__global__ void mykernel(int *y) { *y = add1(*y); } 
int main(int argc, char **argv) {
    mykernel&lt;&lt;&lt;1,1&gt;&gt;&gt;(&amp;argc);
    return argc;
}
$ cat add.c
__device__ int add1(int y) { return y + 1; }
$ cat build.sh
dirs="rdc_build relocatable_build"
for dir in $(echo $dirs); do rm -rf $dir; mkdir $dir; done

dir=rdc_build
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o         # add.o contains llvm IR
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o       # main.o contains llvm IR
hipcc -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/foo    # linker sees llvm IR

dir=relocatable_build
hipcc    -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c add.c -o $dir/add.o          # add.o contains llvm IR
hipcc    -O2 -fgpu-rdc --offload-arch=gfx90a -x hip -c main.c -o $dir/main.o        # main.o contains llvm IR
hipcc -r -O2 -fgpu-rdc --offload-arch=gfx90a $dir/add.o $dir/main.o -o $dir/uber.o  # uber.o contains object code 
hipcc    -O2           --offload-arch=gfx90a -o $dir/uber.o                         # linker sees object code

The second-to-last line, which uses -r to make the relocatable object, fails with ld.lld: error: attempted static link of dynamic object and references shared libraries in /opt/rocm:

$ sh &lt; build.sh
clang: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-unwindlib=libgcc' [-Wunused-command-line-argument]
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamdhip64.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libhsa-runtime64.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamd_comgr.so
ld.lld: error: attempted static link of dynamic object /opt/rocm-6.0.0/lib/libamdhip64.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ld.lld: error: undefined symbol: main
&gt;&gt;&gt; referenced by /lib/../lib64/crt1.o:(_start)
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Ignore the last 3 lines above, which are due to my attempt to link using the non-existent object file uber.o.

ChengChen002 pushed a commit to intel/llvm that referenced this issue Mar 8, 2024
`-fgpu-rdc` mode allows device functions call device functions in
different TU. However, currently all device objects have to be linked
together since only one fat binary is supported. This is time consuming
for AMDGPU backend since it only supports LTO.

There are use cases that objects can be divided into groups in which
device functions are self-contained but host functions are not. It is
desirable to link/optimize/codegen the device code and generate a fatbin
for each group, whereas partially link the host code with `ld -r` or
generate a static library by using the `--emit-static-lib` option of
clang. This avoids linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin` for all
objects for `-fgpu-rdc`. With this patch, clang emits an unique external
symbol `__hip_fatbin_{cuid}` for the fat binary for each object. When a
group of objects are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary. Each group has its
own fat binary. One executable or shared library can have multiple fat
binaries. Device linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also uniquefied and
merged to avoid repeated registering. Symbol `__hip_cuid_{cuid}` is
introduced to facilitate debugging and tooling.

Fixes: llvm/llvm-project#77018
@yxsamliu
Copy link
Collaborator

yxsamliu commented Apr 1, 2024

@pozulp Have you tried clang with -r ? Does it work for you? Thanks.

searlmc1 pushed a commit to ROCm/llvm-project that referenced this issue Apr 5, 2024
`-fgpu-rdc` mode allows device functions call device functions in
different TU. However, currently all device objects have to be linked
together since only one fat binary is supported. This is time consuming
for AMDGPU backend since it only supports LTO.

There are use cases that objects can be divided into groups in which
device functions are self-contained but host functions are not. It is
desirable to link/optimize/codegen the device code and generate a fatbin
for each group, whereas partially link the host code with `ld -r` or
generate a static library by using the `--emit-static-lib` option of
clang. This avoids linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin` for all
objects for `-fgpu-rdc`. With this patch, clang emits an unique external
symbol `__hip_fatbin_{cuid}` for the fat binary for each object. When a
group of objects are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary. Each group has its
own fat binary. One executable or shared library can have multiple fat
binaries. Device linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also uniquefied and
merged to avoid repeated registering. Symbol `__hip_cuid_{cuid}` is
introduced to facilitate debugging and tooling.

Fixes: llvm#77018
Change-Id: I0ebf263b742b554939e5b758e5ec761e00763738
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:codegen clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl'
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants