Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] With Clang>17, -amdgpu-early-inline-all=true consumes 8x more memory #86332

Open
AngryLoki opened this issue Mar 22, 2024 · 18 comments
Open

Comments

@AngryLoki
Copy link
Contributor

There is some kind of regression in -amdgpu-early-inline-all=true option, which is set for every HIP application in hipcc.

While this option makes no significant performance/memory impact in Clang 17, attempt to migrate to Clang 18.1.0 or nightly Clang 19 build consumes 8x more memory, which makes Clang unusable for HIP (i. e. when multiple compile units consume 10GB each in parallel, there is just not enough RAM eventually, even when compiling for single target GPU arch).

Environment:

/usr/lib/llvm/17/bin/clang-17 --version | grep version
clang version 17.0.6

/usr/lib/llvm/18/bin/clang-18 --version | grep version
clang version 18.1.0

/usr/lib/llvm/19/bin/clang-19 --version | grep version
clang version 19.0.0git6d3cec01

Common flags (verbose output of composable-kernel-6.0.2):

export FLAGS="-cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name device_batchnorm_forward_f32_instance.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility=hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_wavefrontsize64_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_isa_version_1030.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_abi_version_400.bc -target-cpu gfx1030 -debugger-tuning=gdb -fdebug-compilation-dir=/var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build -resource-dir /usr/lib/clang/17 -dependency-file library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o.d -MT library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o -sys-header-deps -internal-isystem /usr/lib/clang/17/include/cuda_wrappers -idirafter /usr/local/include -include __clang_hip_runtime_wrapper.h -include /usr/include/gentoo/fortify.h -include /usr/include/gentoo/maybe-stddefs.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build/include -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=c++17 -fdeprecated-macro -fno-autolink -ferror-limit 19 -fmessage-length=173 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -amdgpu-function-calls=false -cuid=aa0b75146f478e4b -fcuda-allow-variadic-functions -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/device_batchnorm_forward_f32_instance-gfx1030-437c24.o -x hip /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp"

Without -amdgpu-early-inline-all=true everything is fine:

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS
Memory: 818272 KB, Time: 0:20.62

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS
Memory: 830300 KB, Time: 0:18.28

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS
Memory: 861772 KB, Time: 0:22.69

With -amdgpu-early-inline-all=true Clang 18 and 19 are hungry and slow:

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 818240 KB, Time: 0:20.80

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6402824 KB, Time: 1:02.50

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6343976 KB, Time: 1:12.43

I don't provide preprocessed version of device_batchnorm_forward_f32_instance.cpp, because for some reason I can't rebuild it after preprocessing (complaints about constexprs). However if you need it or some other dumps, please ask and I will attach.

@github-actions github-actions bot added the clang Clang issues not falling into any other category label Mar 22, 2024
@EugeneZelenko EugeneZelenko added backend:AMDGPU and removed clang Clang issues not falling into any other category labels Mar 22, 2024
@llvmbot
Copy link
Member

llvmbot commented Mar 22, 2024

@llvm/issue-subscribers-backend-amdgpu

Author: None (AngryLoki)

There is some kind of regression in `-amdgpu-early-inline-all=true` option, which is set for every HIP application in hipcc.

While this option makes no significant performance/memory impact in Clang 17, attempt to migrate to Clang 18.1.0 or nightly Clang 19 build consumes 8x more memory, which makes Clang unusable for HIP (i. e. when multiple compile units consume 10GB each in parallel, there is just not enough RAM eventually, even when compiling for single target GPU arch).

Environment:

/usr/lib/llvm/17/bin/clang-17 --version | grep version
clang version 17.0.6

/usr/lib/llvm/18/bin/clang-18 --version | grep version
clang version 18.1.0

/usr/lib/llvm/19/bin/clang-19 --version | grep version
clang version 19.0.0git6d3cec01

Common flags (verbose output of composable-kernel-6.0.2):

export FLAGS="-cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name device_batchnorm_forward_f32_instance.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility=hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_wavefrontsize64_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_isa_version_1030.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_abi_version_400.bc -target-cpu gfx1030 -debugger-tuning=gdb -fdebug-compilation-dir=/var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build -resource-dir /usr/lib/clang/17 -dependency-file library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o.d -MT library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o -sys-header-deps -internal-isystem /usr/lib/clang/17/include/cuda_wrappers -idirafter /usr/local/include -include __clang_hip_runtime_wrapper.h -include /usr/include/gentoo/fortify.h -include /usr/include/gentoo/maybe-stddefs.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build/include -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=c++17 -fdeprecated-macro -fno-autolink -ferror-limit 19 -fmessage-length=173 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -amdgpu-function-calls=false -cuid=aa0b75146f478e4b -fcuda-allow-variadic-functions -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/device_batchnorm_forward_f32_instance-gfx1030-437c24.o -x hip /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp"

Without -amdgpu-early-inline-all=true everything is fine:

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS
Memory: 818272 KB, Time: 0:20.62

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS
Memory: 830300 KB, Time: 0:18.28

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS
Memory: 861772 KB, Time: 0:22.69

With -amdgpu-early-inline-all=true Clang 18 and 19 are hungry and slow:

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 818240 KB, Time: 0:20.80

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6402824 KB, Time: 1:02.50

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6343976 KB, Time: 1:12.43

I don't provide preprocessed version of device_batchnorm_forward_f32_instance.cpp, because for some reason I can't rebuild it after preprocessing (complaints about constexprs). However if you need it or some other dumps, please ask and I will attach.

@Artem-B
Copy link
Member

Artem-B commented Mar 22, 2024

How large is the IR we end up trying to compile? Indiscriminately inlining everything may result in the code size explosion. The source file looks template-heavy, and it's possible that we may be inlining way too much because the user requested it. It's quite possible that it's not a regression, but, rather that we have actually fixed the behavior of -amdgpu-early-inline-all=true

@yxsamliu Sam, do you know what's the story with -amdgpu-early-inline-all=true ? It looks like something that can easily backfire on large/complicated enough code.

@bcahoon
Copy link
Contributor

bcahoon commented Mar 23, 2024

Looks like this is related to #59126, though that issue is about the target independent always-inline pass. That issue contains a couple of test cases too.

@AngryLoki
Copy link
Contributor Author

AngryLoki commented Mar 23, 2024

Yes, very likely to be related, but it looks like #59126 does not fully reflect all the changes. It says that in Nov 22, 2022 after changes that were pushed before LLVM-14 release, users experienced time explosion with Alwaysinliner.

However for my case everything is ok before LLVM-18 release.
Maybe before LLVM-18 -mllvm -amdgpu-early-inline-all=true was nonfunctional/placebo in clang, because

/usr/lib/llvm/17/bin/clang $FLAGS -S -emit-llvm -o /dev/stdout | md5sum
dfac0099986317d8731012f8d6e7a11c  - # 15M .ll file

/usr/lib/llvm/17/bin/clang $FLAGS  -mllvm -amdgpu-early-inline-all=true -S -emit-llvm -o /dev/stdout | md5sum
dfac0099986317d8731012f8d6e7a11c  - # 15M .ll file

/usr/lib/llvm/18/bin/clang $FLAGS -S -emit-llvm -o /dev/stdout | md5sum
5f8fb8b9c7b1a25f2669de75587845a3  - # 13M .ll file

/usr/lib/llvm/18/bin/clang $FLAGS -mllvm -amdgpu-early-inline-all=true -S -emit-llvm -o /dev/stdout | md5sum
a60a9a166226cf36898c8c470ef4be0f  - # 12M .ll file

@bcahoon
Copy link
Contributor

bcahoon commented Mar 24, 2024

The initial commit mention in #59126 was reverted and then it re-landed on Oct 29, 2023 1a2e77c. Commenting out the code that adds AlwaysInlinerPass

MPM.addPass(AlwaysInlinerPass(/*InsertLifetimeIntrinsics=*/true));
does reduce the compile-time and memory usage. But, yes, there is an interaction between the amdgpu-early-inilne-all flag and the addition of AlwaysInlinerPass so early in the optimization pipeline. It's interesting that when those passes run later in the pipeline, there isn't an issue.

@yxsamliu
Copy link
Collaborator

@arsenm @scchan Any insights? Thanks

@arsenm
Copy link
Contributor

arsenm commented Mar 26, 2024

We should just delete the flag, and fully delete AMDGPUAlwaysInlinePass. These are vestiges from before function calls were supported. Forcibly inlining everything is going to make every function bigger and slower to compile. I don't know what to do other than general large function compile time improvements.

@JonChesterfield
Copy link
Collaborator

Deleting the always inline pass sounds sensible to me. If that's a horrendous regression for someone maybe we can add a clang flag that tags everything with attribute(always_inline) instead - that should be similar in effect to the custom pass, plausibly useful on some other targets, still allow us to delete that pass.

@arsenm
Copy link
Contributor

arsenm commented May 8, 2024

Looks like this is related to #59126, though that issue is about the target independent always-inline pass. That issue contains a couple of test cases too.

But that's the same thing - all this pass is tag every function with alwaysinline and the regular AwaysInline pass does the actual work

@dfukalov
Copy link
Collaborator

Hi @AngryLoki would you please that PR #96958 fixes the issue?

@AngryLoki
Copy link
Contributor Author

Hi, this PR is released in 19.1.0, so I checked it:

# Without -amdgpu-early-inline-all=true
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS
Memory: 827740 KB, Time: 0:18.21

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS
Memory: 830096 KB, Time: 0:18.53

# With -amdgpu-early-inline-all=true
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6411340 KB, Time: 1:05.20

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 3623372 KB, Time: 1:03.93

clang-19.1 now consumes 2x less memory, however is it still 4x more then clang-17. Also it is still as slow as clang-18. Is it possible to improve it?

@llvmbot
Copy link
Member

llvmbot commented Sep 30, 2024

@llvm/issue-subscribers-backend-amdgpu

Author: None (AngryLoki)

There is some kind of regression in `-amdgpu-early-inline-all=true` option, which is set for every HIP application in hipcc.

While this option makes no significant performance/memory impact in Clang 17, attempt to migrate to Clang 18.1.0 or nightly Clang 19 build consumes 8x more memory, which makes Clang unusable for HIP (i. e. when multiple compile units consume 10GB each in parallel, there is just not enough RAM eventually, even when compiling for single target GPU arch).

Environment:

/usr/lib/llvm/17/bin/clang-17 --version | grep version
clang version 17.0.6

/usr/lib/llvm/18/bin/clang-18 --version | grep version
clang version 18.1.0

/usr/lib/llvm/19/bin/clang-19 --version | grep version
clang version 19.0.0git6d3cec01

Common flags (verbose output of composable-kernel-6.0.2):

export FLAGS="-cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name device_batchnorm_forward_f32_instance.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility=hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_wavefrontsize64_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_isa_version_1030.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_abi_version_400.bc -target-cpu gfx1030 -debugger-tuning=gdb -fdebug-compilation-dir=/var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build -resource-dir /usr/lib/clang/17 -dependency-file library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o.d -MT library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o -sys-header-deps -internal-isystem /usr/lib/clang/17/include/cuda_wrappers -idirafter /usr/local/include -include __clang_hip_runtime_wrapper.h -include /usr/include/gentoo/fortify.h -include /usr/include/gentoo/maybe-stddefs.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build/include -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=c++17 -fdeprecated-macro -fno-autolink -ferror-limit 19 -fmessage-length=173 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -amdgpu-function-calls=false -cuid=aa0b75146f478e4b -fcuda-allow-variadic-functions -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/device_batchnorm_forward_f32_instance-gfx1030-437c24.o -x hip /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp"

Without -amdgpu-early-inline-all=true everything is fine:

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS
Memory: 818272 KB, Time: 0:20.62

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS
Memory: 830300 KB, Time: 0:18.28

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS
Memory: 861772 KB, Time: 0:22.69

With -amdgpu-early-inline-all=true Clang 18 and 19 are hungry and slow:

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 818240 KB, Time: 0:20.80

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6402824 KB, Time: 1:02.50

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6343976 KB, Time: 1:12.43

I don't provide preprocessed version of device_batchnorm_forward_f32_instance.cpp, because for some reason I can't rebuild it after preprocessing (complaints about constexprs). However if you need it or some other dumps, please ask and I will attach.

@LunNova
Copy link

LunNova commented Dec 17, 2024

Is the amdgpu-early-inline-all flag necessary any more?

There's -fgpu-inline-threshold=... which isn't vendor specific and allows tuning instead of all or nothing.

@arsenm
Copy link
Contributor

arsenm commented Dec 18, 2024

Is the amdgpu-early-inline-all flag necessary any more?

No. It should have been deleted years ago, but hipcc has been using it and it's been sticky to get out.

There's -fgpu-inline-threshold=... which isn't vendor specific and allows tuning instead of all or nothing.

Changing the inline threshold is a much weaker option and doesn't serve the same function

@LunNova
Copy link

LunNova commented Dec 19, 2024

Looks like it's being used directly by some projects like pytorch:

https://github.com/pytorch/pytorch/blob/deb1da15cccd9726da7c0dd1d21a12113e78a56a/torch/_inductor/codegen/rocm/compile_command.py#L84-L87

        "-mllvm", "-amdgpu-early-inline-all=true", "-mllvm", "-amdgpu-function-calls=false",

Should they be changing to something else?

@arsenm
Copy link
Contributor

arsenm commented Dec 19, 2024

    "-mllvm", "-amdgpu-early-inline-all=true", "-mllvm", "-amdgpu-function-calls=false",

Should they be changing to something else?

This should be removed without replacement

@AngryLoki
Copy link
Contributor Author

These parameters were recently removed from ROCm/HIP and ROCm/clr (ref). The biggest user - hipcc - is still using it (reported above). Also some projects like composable_kernel (as mentioned in the first post here) or pytorch enabling it in CMake - they seemingly just tried to replicate hipcc, so if hipcc agrees to remove it, I'll request the removal in other places too. Thanks!

@arsenm
Copy link
Contributor

arsenm commented Dec 19, 2024

The only reason to use them is a kludge for performance. If something is performing worse as a result of removing the flags, that's a new optimization issue to be debugged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests