New compile time regressions (12 vs 13) on znver3 (LoopMicroOpBufferSize too large?) #50802
Comments
Shouldn't LoopMicroOpBufferSize for znver3 be reduced? While it may be correct from a micro-architectural perspective, the way it is used by LLVM is clearly inappropriate: LoopMicroOpBufferSize should be an upper bound on how much we partially unroll, but what we seem to be doing right now is to just unroll maximally up to that size. That's okay if the buffer size is reasonably small, but not if it's very large. That results in massive code bloat for no benefit. While there are scalability problems in other passes, that seems like the part that is causing the immediate problem and should be addressed. As this value appears to be used only for this unrolling heuristic, I'd suggest dropping it to a more sensible value.
I agree that the cost/benefit model for full loop unrolling is broken. At the very least, I really don't think we should be doing full unrolling +Reames - thoughts on changing profitability heuristics for full unrolling?
I don't think full unrolling is relevant here: LoopMicroOpBufferSize controls the PartialThreshold, which is used for partial and runtime unrolling. Full unrolling is controlled by different thresholds and a more sophisticated cost model that takes into account how much simplification is expected from breaking the loop backedge. Fully unrolling a relatively large loop can still be beneficial if it results in significant simplification. Partial and runtime unrolling have more tenuous profitability heuristics, and creating 4096-instruction loops because that's the micro-architectural loop buffer size is almost certainly not profitable. Runtime unrolling will at least limit to 8 unrolls by default, but partial unrolling will happily unroll to the full threshold.
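To make the arithmetic in this comment concrete, here is a minimal sketch of the behaviour being described. It is a simplified model under stated assumptions, not LLVM's actual implementation: the function names are hypothetical, the threshold is treated as a plain instruction budget, and the real heuristic has more inputs (trip counts, backedge instructions, and so on). The figures 4096, 512, and 8 are the ones quoted in this thread.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified model of the behaviour discussed in this thread.
// This is NOT LLVM's code; names and formula are assumptions for illustration.

// Largest unroll count whose unrolled body still fits under the threshold.
// Models "partial unrolling will happily unroll to the full threshold".
unsigned partialUnrollCount(unsigned loopSize, unsigned partialThreshold) {
    if (loopSize == 0)
        return 1; // degenerate input: nothing to unroll
    return std::max(1u, partialThreshold / loopSize);
}

// Runtime unrolling uses the same budget but is additionally capped
// (the thread states the default cap is 8 unrolls).
unsigned runtimeUnrollCount(unsigned loopSize, unsigned partialThreshold,
                            unsigned runtimeCap) {
    return std::min(partialUnrollCount(loopSize, partialThreshold), runtimeCap);
}

// Sanity checks using the numbers mentioned in the discussion.
void checkThreadNumbers() {
    // A 100-instruction loop with a 4096-entry budget: 40 copies, 4000 ops.
    assert(partialUnrollCount(100, 4096) == 40);
    assert(partialUnrollCount(100, 4096) * 100 == 4000);
    // The same loop with a budget of 512: only 5 copies.
    assert(partialUnrollCount(100, 512) == 5);
    // Runtime unrolling stays at the cap of 8 either way it is exceeded.
    assert(runtimeUnrollCount(100, 4096, 8) == 8);
}
```

Under this model the unroll count scales linearly with the budget, so raising the budget from 512 to 4096 takes a 100-instruction loop from 5 copies to 40, while the runtime-unrolling path stays at 8.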
Yeah, there should be a more reasonable value because, as we see for the reported projects, the regressions are huge (and hardly come with 30% better runtime perf to “justify” them). Reopened, renamed a bit.
Thank you everyone for your thoughts. As a word of caution, I would strongly advise against using some random Unfortunately, my position still hasn't changed, and this is still WONTFIX. Even halving the LoopMicroOpBufferSize (512->256) results in a number of
To be clear, my primary concern here is (for once!) not compile-time, but the fact that this simply generates terrible code. Just look at this: https://c.godbolt.org/z/617TjjKhs The compiler shouldn't be unrolling random loops to hundreds of iterations. While I'd love to see issues like #49928 resolved as a matter of general principle, they are only tangentially related to the problem at hand, which is way too aggressive unrolling. The overzealous unrolling must be fixed independently of any compile-time problems it may also be triggering.
Sorry for the late response here, I was on vacation. I started writing a long comment on unrolling heuristics, realized it was only vaguely relevant here, and decided to save it to my public notes for ease of later use. See https://github.com/preames/public-notes/blob/master/llvm-loop-opt-ideas.rst#unroll-heuristics if interested. Specifically for this bug, I think it's unfortunate that our unrolling heuristic is too tightly tied to an architectural feature, but I don't think we should avoid correcting the feature value simply because it causes some regressions on that architecture. I would suggest we look at some workarounds to preserve the previous behavior while letting the flag be correct. A few ideas:
I think it's unreasonable to ask Roman to rewrite the core partial unrolling heuristic just because he happened to trip across a case where it's broken. A more constructive approach is to work around it for the moment and let someone interested, with the time and resources, tackle unrolling separately.
So yes, if we unroll a 100-instruction loop 40 times, and bloat it to 4000 ops, But then yes, we probably should reevaluate the cases where the code
The thing is, the current threshold (512) is already a compromise/workaround. :)
Can we remove release-13.0.0 from the blocks field?
Thank you all for participating in this discussion!
@llvm/issue-subscribers-backend-x86
Extended Description
Phoronix reported some significant compile-time (CT) regressions for mplayer and ffmpeg:
https://www.phoronix.com/scan.php?page=article&item=clang13-initial-epyc&num=4