[libc] Fix linking of AMDGPU device runtime control constants for math #65676

jhuber6 · 2023-09-07T21:13:04Z

Summary:
Currently, libc temporarily provides math by linking against existing
vendor implementations. To use the AMDGPU DeviceRTL we need to define a
handful of control constants that alter behaviour for architecture
specific things. Previously these were marked extern const because
they must be present when we link-in the vendor bitcode library.
However, this causes linker errors if more than one math function was
used.

This patch fixes the issue by marking these functions as used and inline
on top of being external. This means that they are linkable, but it
gives us linkonce_odr semantics. The downside is that these globals
won't be optimized out, but it allows us to perform constant propagation
on them unlike using weak.

arsenm · 2023-09-08T07:42:41Z

Is there anything we can do better than used? I don't want these to make it into the final binary

jhuber6 · 2023-09-08T11:59:20Z

Is there anything we can do better than used? I don't want these to make it into the final binary

So, the other option would be to compile this and then -mlink-builtin-bitcode it. However the CMake isn't really built for this so it would be a lot of custom hackery.

Another solution would be to emit an internal alias to these variables, that would allow it to be stripped by LTO.

jhuber6 · 2023-09-08T12:18:52Z

I updated it to use an internal alias. This causes clang to emit the symbol long enough and with the correct linkage to link against it. The resulting symbols can then be trivially stripped once we optimize the program, see https://godbolt.org/z/66K8T3orW.

arsenm · 2023-09-08T12:20:39Z

Cmake cannot and should not be trying to link these. -mlink-builtin-bitcode is the only correct way to link device libs

jhuber6 · 2023-09-08T12:22:28Z

Cmake cannot and should not be trying to link these. -mlink-builtin-bitcode is the only correct way to link device libs

Sorry, I wasn't specific enough. I was talking about compiling this header as a bitcode file, then using -mlink-builtin-bitcode on the device libs. That would be a nightmare of CMake since we'd need to somehow only pass the flag to the correct architecture.

JonChesterfield · 2023-09-08T15:49:18Z

It's sad that the device libs need awkward linking but also very long standing. Either link them awkwardly, don't use them, or link them incorrectly and hope the failure modes don't show up in bug reports. I favour don't use them for libc.

jhuber6 · 2023-09-08T15:50:56Z

It's sad that the device libs need awkward linking but also very long standing. Either link them awkwardly, don't use them, or link them incorrectly and hope the failure modes don't show up in bug reports. I favour don't use them for libc.

We need to use them right now since we're using ocml.bc to generate the math libraries. The end goal is to remove this completely however, this is merely a stopgap.

arsenm · 2023-09-09T08:44:55Z

So technically ocml functions are now subtarget independent. It's ockl that's still problematic. They still could have the base wave size and other ABI incompatibilities. The ISA versions are only used for detecting if FMA is fast or not, which barely matters anymore. We could find a way to avoid that

jhuber6 · 2023-09-09T11:46:58Z

So technically ocml functions are now subtarget independent. It's ockl that's still problematic. They still could have the base wave size and other ABI incompatibilities. The ISA versions are only used for detecting if FMA is fast or not, which barely matters anymore. We could find a way to avoid that

Yeah, none of the ocml functions use the wavesize. We'd still need to set the math optimization flags though.

arsenm · 2023-09-09T11:55:58Z

The wavesize incompatibility is deeper than depends on the feature. It's a full ABI incompatibility and really should have 2 separate builds.

For the other math options, I have all the patches to eliminate them stuck in review

jhuber6 · 2023-09-09T11:58:32Z

The wavesize incompatibility is deeper than depends on the feature. It's a full ABI incompatibility and really should have 2 separate builds.

For the other math options, I have all the patches to eliminate them stuck in review

So, this approach is mostly just a quick solution to port the ocml libraries to this multi-architecture build I'm doing in libc. I think long-term I'd really like to move the builds to be common, since there's generally only a few builtins we could probably just make errors in the backend if they're unsupported once the user provides an architecture. The wavesize attribute is definitely the hardest one, but at least for my purposes we may be able to allocate 64 sized buffers always and use the intrinsic to set the upper limit.

jhuber6 · 2023-09-14T13:31:54Z

Ping

libc/src/math/gpu/vendor/amdgpu/platform.h

Summary: Currently, `libc` temporarily provides math by linking against existing vendor implementations. To use the AMDGPU DeviceRTL we need to define a handful of control constants that alter behaviour for architecture specific things. Previously these were marked `extern const` because they must be present when we link-in the vendor bitcode library. However, this causes linker errors if more than one math function was used. This patch fixes the issue by marking these functions as used and inline on top of being external. This means that they are linkable, but it gives us `linkonce_odr` semantics. The downside is that these globals won't be optimized out, but it allows us to perform constant propagation on them unlike using `weak`.

llvmbot · 2023-09-14T14:54:40Z

@llvm/pr-subscribers-backend-amdgpu

Changes

Summary: Currently, `libc` temporarily provides math by linking against existing vendor implementations. To use the AMDGPU DeviceRTL we need to define a handful of control constants that alter behaviour for architecture specific things. Previously these were marked `extern const` because they must be present when we link-in the vendor bitcode library. However, this causes linker errors if more than one math function was used.

This patch fixes the issue by marking these functions as used and inline
on top of being external. This means that they are linkable, but it
gives us linkonce_odr semantics. The downside is that these globals
won't be optimized out, but it allows us to perform constant propagation
on them unlike using weak.

--
Full diff: https://github.com/llvm/llvm-project/pull/65676.diff

1 Files Affected:

(modified) libc/src/math/gpu/vendor/amdgpu/platform.h (+53-41)

diff --git a/libc/src/math/gpu/vendor/amdgpu/platform.h b/libc/src/math/gpu/vendor/amdgpu/platform.h
index 6ec47c24a93a2d7..8e2c23743127f48 100644
--- a/libc/src/math/gpu/vendor/amdgpu/platform.h
+++ b/libc/src/math/gpu/vendor/amdgpu/platform.h
@@ -9,6 +9,8 @@
 #ifndef LLVM_LIBC_SRC_MATH_GPU_AMDGPU_PLATFORM_H
 #define LLVM_LIBC_SRC_MATH_GPU_AMDGPU_PLATFORM_H
 
+#include &quot;src/__support/macros/attributes.h&quot;
+
 #include &lt;stdint.h&gt;
 
 namespace __llvm_libc {
@@ -19,96 +21,106 @@ namespace __llvm_libc {
 extern &quot;C&quot; {
 
 // Disable unsafe math optimizations in the implementation.
-extern const uint8_t __oclc_unsafe_math_opt = 0;
+extern const LIBC_INLINE uint8_t __oclc_unsafe_math_opt = 0;
 
 // Disable denormalization at zero optimizations in the implementation.
-extern const uint8_t __oclc_daz_opt = 0;
+extern const LIBC_INLINE uint8_t __oclc_daz_opt = 0;
 
 // Disable rounding optimizations for 32-bit square roots.
-extern const uint8_t __oclc_correctly_rounded_sqrt32 = 1;
+extern const LIBC_INLINE uint8_t __oclc_correctly_rounded_sqrt32 = 1;
 
 // Disable finite math optimizations.
-extern const uint8_t __oclc_finite_only_opt = 0;
+extern const LIBC_INLINE uint8_t __oclc_finite_only_opt = 0;
 
 #if defined(__gfx700__)
-extern const uint32_t __oclc_ISA_version = 7000;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 7000;
 #elif defined(__gfx701__)
-extern const uint32_t __oclc_ISA_version = 7001;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 7001;
 #elif defined(__gfx702__)
-extern const uint32_t __oclc_ISA_version = 7002;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 7002;
 #elif defined(__gfx703__)
-extern const uint32_t __oclc_ISA_version = 7003;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 7003;
 #elif defined(__gfx704__)
-extern const uint32_t __oclc_ISA_version = 7004;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 7004;
 #elif defined(__gfx705__)
-extern const uint32_t __oclc_ISA_version = 7005;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 7005;
 #elif defined(__gfx801__)
-extern const uint32_t __oclc_ISA_version = 8001;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 8001;
 #elif defined(__gfx802__)
-extern const uint32_t __oclc_ISA_version = 8002;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 8002;
 #elif defined(__gfx803__)
-extern const uint32_t __oclc_ISA_version = 8003;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 8003;
 #elif defined(__gfx805__)
-extern const uint32_t __oclc_ISA_version = 8005;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 8005;
 #elif defined(__gfx810__)
-extern const uint32_t __oclc_ISA_version = 8100;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 8100;
 #elif defined(__gfx900__)
-extern const uint32_t __oclc_ISA_version = 9000;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9000;
 #elif defined(__gfx902__)
-extern const uint32_t __oclc_ISA_version = 9002;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9002;
 #elif defined(__gfx904__)
-extern const uint32_t __oclc_ISA_version = 9004;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9004;
 #elif defined(__gfx906__)
-extern const uint32_t __oclc_ISA_version = 9006;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9006;
 #elif defined(__gfx908__)
-extern const uint32_t __oclc_ISA_version = 9008;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9008;
 #elif defined(__gfx909__)
-extern const uint32_t __oclc_ISA_version = 9009;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9009;
 #elif defined(__gfx90a__)
-extern const uint32_t __oclc_ISA_version = 9010;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9010;
 #elif defined(__gfx90c__)
-extern const uint32_t __oclc_ISA_version = 9012;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9012;
 #elif defined(__gfx940__)
-extern const uint32_t __oclc_ISA_version = 9400;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 9400;
 #elif defined(__gfx1010__)
-extern const uint32_t __oclc_ISA_version = 10100;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10100;
 #elif defined(__gfx1011__)
-extern const uint32_t __oclc_ISA_version = 10101;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10101;
 #elif defined(__gfx1012__)
-extern const uint32_t __oclc_ISA_version = 10102;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10102;
 #elif defined(__gfx1013__)
-extern const uint32_t __oclc_ISA_version = 10103;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10103;
 #elif defined(__gfx1030__)
-extern const uint32_t __oclc_ISA_version = 10300;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10300;
 #elif defined(__gfx1031__)
-extern const uint32_t __oclc_ISA_version = 10301;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10301;
 #elif defined(__gfx1032__)
-extern const uint32_t __oclc_ISA_version = 10302;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10302;
 #elif defined(__gfx1033__)
-extern const uint32_t __oclc_ISA_version = 10303;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10303;
 #elif defined(__gfx1034__)
-extern const uint32_t __oclc_ISA_version = 10304;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10304;
 #elif defined(__gfx1035__)
-extern const uint32_t __oclc_ISA_version = 10305;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10305;
 #elif defined(__gfx1036__)
-extern const uint32_t __oclc_ISA_version = 10306;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 10306;
 #elif defined(__gfx1100__)
-extern const uint32_t __oclc_ISA_version = 11000;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 11000;
 #elif defined(__gfx1101__)
-extern const uint32_t __oclc_ISA_version = 11001;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 11001;
 #elif defined(__gfx1102__)
-extern const uint32_t __oclc_ISA_version = 11002;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 11002;
 #elif defined(__gfx1103__)
-extern const uint32_t __oclc_ISA_version = 11003;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 11003;
 #elif defined(__gfx1150__)
-extern const uint32_t __oclc_ISA_version = 11500;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 11500;
 #elif defined(__gfx1151__)
-extern const uint32_t __oclc_ISA_version = 11501;
+extern const LIBC_INLINE uint32_t __oclc_ISA_version = 11501;
 #else
 #error &quot;Unknown AMDGPU architecture&quot;
 #endif
 }
 
+// These aliases cause clang to emit the control constants with ODR linkage.
+// This allows us to link against the symbols without preventing them from being
+// optimized out or causing symbol collisions.
+[[gnu::alias(&quot;__oclc_unsafe_math_opt&quot;)]] const uint8_t __oclc_unsafe_math_opt__;
+[[gnu::alias(&quot;__oclc_daz_opt&quot;)]] const uint8_t __oclc_daz_opt__;
+[[gnu::alias(&quot;__oclc_correctly_rounded_sqrt32&quot;)]] const uint8_t
+    __oclc_correctly_rounded_sqrt32__;
+[[gnu::alias(&quot;__oclc_finite_only_opt&quot;)]] const uint8_t __oclc_finite_only_opt__;
+[[gnu::alias(&quot;__oclc_ISA_version&quot;)]] const uint32_t __oclc_ISA_version__;
+
 } // namespace __llvm_libc
 
 #endif // LLVM_LIBC_SRC_MATH_GPU_AMDGPU_PLATFORM_H

libc/src/math/gpu/vendor/amdgpu/platform.h

jhuber6 · 2023-09-25T14:34:13Z

@arsenm are there any issues with this? It's fixing something that's broken upstream right now without changing any semantics so I'd like to get it in if possible.

jhuber6 · 2023-10-02T16:22:51Z

Ping. This is a bug fix that prevents this from actually being used and doesn't change any existing behavior.

jhuber6 requested a review from a team as a code owner September 7, 2023 21:13

github-actions bot added the libc label Sep 7, 2023

jhuber6 requested review from arsenm and JonChesterfield September 7, 2023 21:13

jhuber6 force-pushed the FixOCLCGlobals branch from 048c9c9 to 9e81d06 Compare September 8, 2023 12:15

sivachandra reviewed Sep 14, 2023

View reviewed changes

libc/src/math/gpu/vendor/amdgpu/platform.h Outdated Show resolved Hide resolved

jhuber6 force-pushed the FixOCLCGlobals branch from 9e81d06 to d11fae1 Compare September 14, 2023 14:53

llvmbot added the backend:AMDGPU label Sep 14, 2023

sivachandra reviewed Sep 14, 2023

View reviewed changes

libc/src/math/gpu/vendor/amdgpu/platform.h Outdated Show resolved Hide resolved

LIBC_INLINE -> LIBC_INLINE_VAR

7d03d3f

Merge branch 'main' into FixOCLCGlobals

420cf31

arsenm approved these changes Oct 6, 2023

View reviewed changes

jhuber6 merged commit fa23a23 into llvm:main Oct 7, 2023

stepthomas mentioned this pull request Oct 10, 2023

AMDGPU stepthomas atomic csub no rtn forms ver2 stepthomas/llvm-project#1

Closed

[libc] Fix linking of AMDGPU device runtime control constants for math #65676

[libc] Fix linking of AMDGPU device runtime control constants for math #65676

Uh oh!

Conversation

jhuber6 commented Sep 7, 2023

Uh oh!

arsenm commented Sep 8, 2023

Uh oh!

jhuber6 commented Sep 8, 2023

Uh oh!

jhuber6 commented Sep 8, 2023

Uh oh!

arsenm commented Sep 8, 2023

Uh oh!

jhuber6 commented Sep 8, 2023

Uh oh!

JonChesterfield commented Sep 8, 2023

Uh oh!

jhuber6 commented Sep 8, 2023

Uh oh!

arsenm commented Sep 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhuber6 commented Sep 9, 2023

Uh oh!

arsenm commented Sep 9, 2023

Uh oh!

jhuber6 commented Sep 9, 2023

Uh oh!

jhuber6 commented Sep 14, 2023

Uh oh!

Uh oh!

llvmbot commented Sep 14, 2023

Uh oh!

Uh oh!

jhuber6 commented Sep 25, 2023

Uh oh!

jhuber6 commented Oct 2, 2023

Uh oh!

Uh oh!

arsenm commented Sep 9, 2023 •

edited

Loading