You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize (#67657)
The znver3/znver4 scheduler models are outliers, specifying very large
LoopMicroOpBufferSizes at 512, while typical values for other subtargets
are on the order of ~50. Even if this information is
micro-architecturally correct (*), this does not mean that we want to
runtime unroll all loops to a size that completely fills the loop
buffer. Unless this is the single hot loop in the entire application,
the massive code size increase will bust the micro-op and instruction
caches.
Protect against this by clamping to the default PartialThreshold of 150,
which is the same as the default full-unroll threshold and half the
aggressive full-unroll threshold. Allowing more partial unrolling than
full unrolling certainly does not make sense.
(*) I strongly doubt that this is actually correct -- I believe this may
derive from an incorrect reading of Agner Fog's micro-architecture
guide. The number 4096 that was originally used here is the size of the
general micro-op cache, not that of a loop buffer. A separate loop
buffer is not listed for the Zen microarchitecture. Comparing this to
the listing for Skylake, it has a 1536 micro-op buffer, but only a 64
micro-op loopback buffer, with a note that it's rarely fully utilized.
Our scheduling model specifies LoopMicroOpBufferSize of 50 in that case.
0 commit comments