[AMDGPU] With Clang>17, -amdgpu-early-inline-all=true consumes 8x more memory #86332
@llvm/issue-subscribers-backend-amdgpu Author: None (AngryLoki)
There is some kind of regression in the `-amdgpu-early-inline-all=true` option, which hipcc sets for every HIP application.
While this option has no significant performance/memory impact with Clang 17, attempting to migrate to Clang 18.1.0 or a nightly Clang 19 build consumes 8x more memory, which makes Clang unusable for HIP (i.e. when multiple compile units consume 10 GB each in parallel, there is eventually not enough RAM, even when compiling for a single target GPU arch).
Environment:
Common flags (verbose output of composable-kernel-6.0.2):
Without `-amdgpu-early-inline-all=true`, everything is fine.
With `-amdgpu-early-inline-all=true`, Clang 18 and 19 are hungry and slow.
I don't provide a preprocessed version of device_batchnorm_forward_f32_instance.cpp because for some reason I can't rebuild it after preprocessing (it complains about constexprs). However, if you need it or any other dumps, please ask and I will attach them.
How large is the IR we end up trying to compile? Indiscriminately inlining everything may result in a code size explosion. The source file looks template-heavy, and it's possible that we may be inlining way too much because the user requested it. It's quite possible that it's not a regression but, rather, that we have actually fixed the behavior of `-amdgpu-early-inline-all`. @yxsamliu Sam, do you know what's the story with this option?
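A minimal sketch of how to answer the IR-size question on the reporter's setup, assuming `$FLAGS` holds the HIP compile flags from the report (the output file names below are illustrative):

```sh
# Dump the IR with and without the flag, then compare sizes and
# definition counts; a large delta would confirm the size explosion.
clang $FLAGS -S -emit-llvm -o noinline.ll
clang $FLAGS -mllvm -amdgpu-early-inline-all=true -S -emit-llvm -o inline-all.ll
wc -c noinline.ll inline-all.ll              # textual IR size in bytes
grep -c '^define' noinline.ll inline-all.ll  # function definitions per file
```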
Looks like this is related to #59126, though that issue is about the target-independent always-inline pass. That issue contains a couple of test cases too.
Yes, very likely related, but it looks like #59126 does not fully reflect all the changes. It says that on Nov 22, 2022, after changes that were pushed before the LLVM 14 release, users experienced a compile-time explosion with the AlwaysInliner. However, in my case everything is fine before the LLVM 18 release:

```sh
/usr/lib/llvm/17/bin/clang $FLAGS -S -emit-llvm -o /dev/stdout | md5sum
dfac0099986317d8731012f8d6e7a11c -  # 15M .ll file
/usr/lib/llvm/17/bin/clang $FLAGS -mllvm -amdgpu-early-inline-all=true -S -emit-llvm -o /dev/stdout | md5sum
dfac0099986317d8731012f8d6e7a11c -  # 15M .ll file
/usr/lib/llvm/18/bin/clang $FLAGS -S -emit-llvm -o /dev/stdout | md5sum
5f8fb8b9c7b1a25f2669de75587845a3 -  # 13M .ll file
/usr/lib/llvm/18/bin/clang $FLAGS -mllvm -amdgpu-early-inline-all=true -S -emit-llvm -o /dev/stdout | md5sum
a60a9a166226cf36898c8c470ef4be0f -  # 12M .ll file
```

Note that with Clang 17 the flag does not change the emitted IR at all (identical checksums), while with Clang 18 it does.
The initial commit mentioned in #59126 was reverted and then re-landed on Oct 29, 2023 (1a2e77c). Commenting out the code that adds AlwaysInlinerPass avoids the regression.
We should just delete the flag, and fully delete AMDGPUAlwaysInlinePass. These are vestiges from before function calls were supported. Forcibly inlining everything is going to make every function bigger and slower to compile. I don't know what to do here other than general large-function compile-time improvements.
Deleting the always-inline pass sounds sensible to me. If that's a horrendous regression for someone, maybe we can add a clang flag that tags everything with `__attribute__((always_inline))` instead; that should be similar in effect to the custom pass, plausibly useful on some other targets, and would still allow us to delete that pass.
But that's the same thing: all this pass does is tag every function with `alwaysinline`, and the regular AlwaysInliner pass does the actual work.
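A minimal sketch of how to observe that two-step mechanism on dumped IR, assuming `in.ll` comes from one of the `-emit-llvm` commands above (the file names are illustrative):

```sh
# Step 1 is visible as an attribute on function definitions; step 2 is the
# generic AlwaysInliner, which performs the actual inlining.
grep -c 'alwaysinline' in.ll
opt -passes=always-inline in.ll -S -o inlined.ll
```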
Hi @AngryLoki, could you please check whether PR #96958 fixes the issue?
Hi, this PR is released in 19.1.0, so I checked it:

```sh
# Without -amdgpu-early-inline-all=true
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS
Memory: 827740 KB, Time: 0:18.21
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS
Memory: 830096 KB, Time: 0:18.53

# With -amdgpu-early-inline-all=true
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6411340 KB, Time: 1:05.20
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 3623372 KB, Time: 1:03.93
```

clang-19.1 now consumes 2x less memory; however, it is still 4x more than clang-17. Also, it is still as slow as clang-18. Is it possible to improve this further?
Is the flag still needed? There's the inline threshold as an alternative.
No. It should have been deleted years ago, but hipcc has been using it and it's been sticky to get out.
Changing the inline threshold is a much weaker option and doesn't serve the same function.
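For reference, a minimal sketch of that weaker alternative, assuming `$FLAGS` from the report; `-inline-threshold` is a generic LLVM option, and the value here is purely illustrative:

```sh
# Raise the regular inliner's cost threshold instead of force-inlining
# everything; the inliner still decides per call site.
clang $FLAGS -mllvm -inline-threshold=1000 -c device_batchnorm_forward_f32_instance.cpp
```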
Looks like it's being used directly by some projects like pytorch. Should they be changing to something else?
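For context, a hedged sketch of what such a project's compile line might look like; the companion `-amdgpu-function-calls=false` flag is an assumption based on common hipcc usage, not quoted from this thread:

```sh
# Flags a downstream build might inject; the suggestion below is to drop them.
hipcc -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -c kernel.cpp
```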
This should be removed without replacement.
These parameters were recently removed from ROCm/HIP and ROCm/clr (ref). The biggest user, hipcc, is still using it (reported above). Also, some projects like pytorch set it directly.
The only reason to use them is as a kludge for performance. If something performs worse as a result of removing the flags, that's a new optimization issue to be debugged.