Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New compile time regressions (12 vs 13) on znver3 (LoopMicroOpBufferSize too large?) #50802

Closed
davidbolvansky opened this issue Aug 12, 2021 · 13 comments
Assignees
Labels
backend:X86 Scheduler Models Accuracy of X86 scheduler models backend:X86 bugzilla Issues migrated from bugzilla

Comments

@davidbolvansky
Copy link
Collaborator

Bugzilla Link 51460
Version trunk
OS All
CC @alinas,@LebedevRI,@preames,@RKSimon,@MattPD,@nikic,@rotateright,@tstellar

Extended Description

Phoronix reported some significant CT regressions for mplayer and ffmpeg

https://www.phoronix.com/scan.php?page=article&item=clang13-initial-epyc&num=4

@LebedevRI
Copy link
Member

This is a duplicate of #49928
This is expected, and will not be addressed for a release.

*** This bug has been marked as a duplicate of bug #49928 ***

@nikic
Copy link
Contributor

nikic commented Aug 12, 2021

Shouldn't LoopMicroOpBufferSize for znver3 be reduced? While it may be correct from a micro-architectural perspective, the way it is used by LLVM is clearly inappropriate: LoopMicroOpBufferSize should be an upper bound on how much we partially unroll, but what we seem to be doing right now is to just unroll maximally up to that size. That's okay if the buffer size is reasonably small, but not if it's very large. That results in massive code bloat for no benefit. While there are scalability problems in other passes, that seems like the part that is causing the immediate problem and should be addressed. As this value appears to only be used for this unrolling heuristics, I'd suggest dropping it to a more sensible value.

@LebedevRI
Copy link
Member

I agree that cost/benefit model for full loop unrolling is broken.
It seems we want both to aggressively fully unroll if we can,
and not do that at the same time.

At the very least, i really don't think we should be doing full unrolling
before inlining/vectorization, but IMHO we should not be fully unrolling at all,
at least not unless we can tell that there will be other benefits
than lack of said loop.

+Reames - thoughts on changing profitability heuristics for full unrolling?

@nikic
Copy link
Contributor

nikic commented Aug 12, 2021

I don't think full unrolling is relevant here: LoopMicroOpBufferSize controls the PartialThreshold, which is used for partial and runtime unrolling.

Full unrolling is controlled by different thresholds and a more sophisticated cost model that takes into account how much simplification is expected from breaking the loop backedge. Fully unrolling a relatively large loop can still be beneficial if it results in significant simplification.

Partial and runtime unrolling have more tenuous profitability heuristics, and creating 4096 instruction loops because that's the micro-architectural loop buffer size is almost certainly not profitable. Runtime unrolling will at least limit to 8 unrolls by default, but partial unrolling will be happily unroll to the full threshold.

@davidbolvansky
Copy link
Collaborator Author

Shouldn't LoopMicroOpBufferSize for znver3 be reduced?

Yeah, there should be a more reasonable value because as wee see for reported projects, regressions are huge (and hardly with 30% better runtime perf to “justify” them).

Reopened, renamed a bit.

@LebedevRI
Copy link
Member

Thank you everyone for your thoughts.
It is always fascinating to read what people think.

As a word of caution, i would strongly advise from using some random
3'rd party numbers to pass hard judgments.

Unfortunately, my position still hasn't changed, and this is still WONFIX.
Although we can discuss temporarily reducing it iff someone
personally promises to resolve #49928 .

Even halving the LoopMicroOpBufferSize (512->256) results in a number of
statistically-significant performance regressions (+ ~6%).

@nikic
Copy link
Contributor

nikic commented Aug 13, 2021

To be clear, my primary concern here is (for once!) not compile-time, but the fact that this simply generates terrible code. Just look at this: https://c.godbolt.org/z/617TjjKhs The compiler shouldn't be unrolling random loops to hundreds of iterations.

While I'd love to see issues like #49928 resolved as a matter of general principle, they are only tangentially related to the problem at hand, which is way too aggressive unrolling. The overzealous unrolling must be fixed independently of any compile-time problems it may also be triggering.

@preames
Copy link
Collaborator

preames commented Aug 16, 2021

Sorry for late response here, was on vacation.

I started writing a long comment on unrolling heuristics, realized it was only vaguely relevant here, and decided to save it to my public notes for ease of later use. See https://github.com/preames/public-notes/blob/master/llvm-loop-opt-ideas.rst#unroll-heuristics if interested.

Specifically for this bug, I think it's unfortunate that our unrolling heuristic is too tightly tied to an architectural feature, but I don't think we should avoid correcting the feature value simply because it causes some regressions on that architecture. I would suggest we look at some workarounds to preserve the previous behavior while letting the flag be correct. A few ideas:

  • Cap the scheduler with a fixed upper bound to avoid complexity explosion with a link specifically to the bug to fix.
  • Introduce a "legacy feature for unrolling heuristic" TTI hook which used the old value on that platform, and defaults to the feature value on all other platforms.

I think it's unreasonable to ask Roman to rewrite the core partial unrolling heuristic just because he happened to trip across a case where it's broken. A more constructive approach is workaround it for the moment, and let someone interested with time and resources to tackle unrolling separately.

@LebedevRI
Copy link
Member

Sorry for late response here, was on vacation.

I started writing a long comment on unrolling heuristics, realized it was
only vaguely relevant here, and decided to save it to my public notes for
ease of later use. See
https://github.com/preames/public-notes/blob/master/llvm-loop-opt-ideas.
rst#unroll-heuristics if interested.

Specifically for this bug, I think it's unfortunate that our unrolling
heuristic is too tightly tied to an architectural feature, but I don't
think we should avoid correcting the feature value simply because it causes
some regressions on that architecture. I would suggest we look at some
workarounds to preserve the previous behavior while letting the flag be
correct. A few ideas:

  • Cap the scheduler with a fixed upper bound to avoid complexity explosion
    with a link specifically to the bug to fix.
  • Introduce a "legacy feature for unrolling heuristic" TTI hook which used
    the old value on that platform, and defaults to the feature value on all
    other platforms.
    To be honest, i'm not quite convinced there is anything to fix
    wrt the threshold itself - unrolled loops are great, and as long as loop's body
    is less than 4K x86 uops (well, let's pretend that it maps 1:1 to 4K LLVM IR ops),
    then it fit within uop cache and we've also improved cpu dispatch performance.

So yes, if we unroll 100-instruction loop 40 times, and bloat it to 4000 ops,
that's fine, ignoring the latent quadratic edge-cases it exposes
in the rest of the compiler. Of course, if you care about binary size,
perhaps you should use -Os/-Oz :]

But then yes, we probably should reevaluate the cases where the code
is strictly sequential, like the contrived example from nikic,
those are perhaps not worth unrolling, at least as much.

I think it's unreasonable to ask Roman to rewrite the core partial unrolling
heuristic just because he happened to trip across a case where it's broken.

A more constructive approach is workaround it for the moment, and let
someone interested with time and resources to tackle unrolling separately.

The thing is, the current threshold (512) is already a compromise/workaround. :)

@tstellar
Copy link
Collaborator

Can we remove release-13.0.0 from the blocks field?

@LebedevRI
Copy link
Member

Thank you all for participating in this disscussion!

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@RKSimon RKSimon added backend:X86 backend:X86 Scheduler Models Accuracy of X86 scheduler models labels Aug 8, 2022
@llvmbot
Copy link
Member

llvmbot commented Aug 8, 2022

@llvm/issue-subscribers-backend-x86

@RKSimon RKSimon self-assigned this Aug 8, 2022
@RKSimon
Copy link
Collaborator

RKSimon commented May 22, 2024

Closing this - the znver3/4 case was addressed by #91340 / 54e52aa

(The more general threshold limit at #67657 was reverted but should be re-committed at some point)

@RKSimon RKSimon closed this as completed May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 Scheduler Models Accuracy of X86 scheduler models backend:X86 bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

7 participants