Description
The -amdgpu-early-inline-all=true option, which hipcc sets for every HIP application, appears to have regressed.
In Clang 17 this option has no significant impact on compile time or memory. With Clang 18.1.0 or a nightly Clang 19 build, the same compilation consumes about 8x more memory. This makes Clang unusable for HIP: for example, when multiple compile units consume 10GB each in parallel, the machine eventually runs out of RAM, even when compiling for a single target GPU arch.
Environment:
/usr/lib/llvm/17/bin/clang-17 --version | grep version
clang version 17.0.6
/usr/lib/llvm/18/bin/clang-18 --version | grep version
clang version 18.1.0
/usr/lib/llvm/19/bin/clang-19 --version | grep version
clang version 19.0.0git6d3cec01
Common flags (verbose output of composable-kernel-6.0.2):
export FLAGS="-cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name device_batchnorm_forward_f32_instance.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility=hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_wavefrontsize64_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_isa_version_1030.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_abi_version_400.bc -target-cpu gfx1030 -debugger-tuning=gdb -fdebug-compilation-dir=/var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build -resource-dir /usr/lib/clang/17 -dependency-file library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o.d -MT library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o -sys-header-deps -internal-isystem /usr/lib/clang/17/include/cuda_wrappers -idirafter /usr/local/include -include __clang_hip_runtime_wrapper.h -include /usr/include/gentoo/fortify.h -include /usr/include/gentoo/maybe-stddefs.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D 
CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build/include -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=c++17 -fdeprecated-macro -fno-autolink -ferror-limit 19 -fmessage-length=173 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -amdgpu-function-calls=false -cuid=aa0b75146f478e4b -fcuda-allow-variadic-functions -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/device_batchnorm_forward_f32_instance-gfx1030-437c24.o -x hip /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp"
Without -amdgpu-early-inline-all=true everything is fine:
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS
Memory: 818272 KB, Time: 0:20.62
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS
Memory: 830300 KB, Time: 0:18.28
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS
Memory: 861772 KB, Time: 0:22.69
With -amdgpu-early-inline-all=true, Clang 18 and 19 are hungry and slow:
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 818240 KB, Time: 0:20.80
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6402824 KB, Time: 1:02.50
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6343976 KB, Time: 1:12.43
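To put a number on the regression, the peak-memory ratios can be computed from the measurements above. This is only a small sketch: the `ratio` helper is a name made up for illustration, and the inputs are the KB values copied from the `/usr/bin/time` output in this report.

```shell
# ratio NEW OLD: print NEW/OLD rounded to two decimals.
# Hypothetical helper, not part of hipcc or clang.
ratio() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.2f\n", new / old }'
}

# Peak memory (KB) with vs. without -amdgpu-early-inline-all=true,
# using the numbers reported above for each compiler.
echo "clang-17: $(ratio 818240 818272)x"
echo "clang-18: $(ratio 6402824 830300)x"
echo "clang-19: $(ratio 6343976 861772)x"
```

For this particular file the ratio stays at about 1x for Clang 17 and rises to between 7x and 8x for Clang 18 and 19.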
I have not attached a preprocessed version of device_batchnorm_forward_f32_instance.cpp because, for some reason, it fails to compile after preprocessing (the compiler complains about constexprs). However, if you need it or any other dumps, please ask and I will attach them.