loop cloning and pgo #48850
Comments
The extra calls to Sample of spmi failures if I suppress the recomputations:
Problematic phases for pred list maintenance seem to be "Find loops", "Unroll loops", and "Clone loops". Not as clear yet where we're dropping BBF_JUMP_TARGET updates. We should add checking for this to the post-phase checks. |
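A post-phase check for pred list maintenance could be sketched as follows. This is a minimal illustration on a toy flow graph, not the JIT's actual checker: the `Block` struct and `findStalePredLists` are hypothetical names, and the real `BasicBlock` stores edges differently. The idea is to recompute pred lists from successor edges and flag any block whose incrementally maintained list disagrees.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a basic block; not the JIT's BasicBlock.
struct Block
{
    std::vector<std::size_t> succs; // indices of successor blocks
    std::vector<std::size_t> preds; // pred list as maintained incrementally by phases
};

// Recompute pred lists from the successor edges and return the index of every
// block whose stored pred list differs from the recomputed one.
std::vector<std::size_t> findStalePredLists(const std::vector<Block>& blocks)
{
    std::vector<std::vector<std::size_t>> expected(blocks.size());
    for (std::size_t i = 0; i < blocks.size(); ++i)
    {
        for (std::size_t s : blocks[i].succs)
        {
            assert(s < blocks.size()); // successor must name a real block
            expected[s].push_back(i);
        }
    }

    std::vector<std::size_t> stale;
    for (std::size_t i = 0; i < blocks.size(); ++i)
    {
        // Compare as sorted sequences so ordering differences are ignored
        // while duplicate edges still count.
        std::vector<std::size_t> actual = blocks[i].preds;
        std::sort(actual.begin(), actual.end());
        std::sort(expected[i].begin(), expected[i].end());
        if (actual != expected[i])
        {
            stale.push_back(i);
        }
    }
    return stale;
}
```

Running such a check after each of "Find loops", "Unroll loops", and "Clone loops" would point at the phase that first leaves a pred list stale, without having to recompute the lists as a side effect.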
Shouldn't all blocks of the cloned loop have zero weight by default, even without PGO? |
@AndyAyersMS same for CloneFinally optimization I believe:
[MethodImpl(MethodImplOptions.NoInlining)]
static void DoWork() { }
[MethodImpl(MethodImplOptions.NoInlining)]
static void DoWorkFinally() { }
[MethodImpl(MethodImplOptions.NoInlining)]
static void Test()
{
    try
    {
        DoWork();
    }
    finally
    {
        DoWorkFinally();
    }
}
Tier0->Tier1 + TieredPGO:
G_M24707_IG01: ;; offset=0000H
55 push rbp
4883EC30 sub rsp, 48
488D6C2430 lea rbp, [rsp+30H]
488965F0 mov qword ptr [rbp-10H], rsp
;; bbWeight=1 PerfScore 2.75
G_M24707_IG02: ;; offset=000EH
E8050BFEFF call Program:DoWork()
90 nop
;; bbWeight=1 PerfScore 1.25
G_M24707_IG03: ;; offset=0014H
E8070BFEFF call Program:DoWorkFinally()
90 nop
;; bbWeight=1 PerfScore 1.25
G_M24707_IG04: ;; offset=001AH
488D6500 lea rsp, [rbp]
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 2.00
G_M24707_IG05: ;; offset=0020H
55 push rbp
4883EC30 sub rsp, 48
488B6920 mov rbp, qword ptr [rcx+32]
48896C2420 mov qword ptr [rsp+20H], rbp
488D6D30 lea rbp, [rbp+30H]
;; bbWeight=1 PerfScore 4.75 <-- bbWeight should be zero
G_M24707_IG06: ;; offset=0032H
E8E90AFEFF call Program:DoWorkFinally()
90 nop
;; bbWeight=1 PerfScore 1.25 <-- bbWeight should be zero
G_M24707_IG07: ;; offset=0038H
4883C430 add rsp, 48
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 1.75 <-- bbWeight should be zero (with TieredPGO=0 weights look better) |
@EgorBo not sure what the right division is for cloning. Setting the unoptimized loop's counts to zero may be too drastic; we still want to lightly optimize it in case that's where the program goes at runtime. Good point about finally cloning -- this isn't caught by the consistency checker, as it considers each EH handler more or less as its own separate graph. Also note we can't always divert all the non-EH flow to the clone, so we need to do some math and figure out the right scaling here. Likely something like: scale the finally clone to match the incoming flow, and scale the original finally to preserve overall count balance for non-EH flow. If there is no other non-EH flow, or all other non-EH flow is zero, then perhaps still keep a bit of residual profile around so we can lightly optimize the original finally. |
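The scaling proposed for finally cloning might look roughly like this. It is only a sketch of the arithmetic described in the comment above: `FinallyWeights`, `splitFinallyWeight`, and the 1% residual fraction are all assumptions made for illustration and do not correspond to existing JIT code.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type: weights for the cloned finally and the original one.
struct FinallyWeights
{
    double clone;    // weight given to the cloned (non-EH path) finally
    double original; // weight left on the original finally
};

// finallyWeight:    profile weight of the finally before cloning
// divertedFlow:     weight of the non-EH flow redirected to the clone
// residualFraction: assumed floor kept on the original so it is still lightly optimized
FinallyWeights splitFinallyWeight(double finallyWeight, double divertedFlow, double residualFraction = 0.01)
{
    assert(finallyWeight >= 0.0 && divertedFlow >= 0.0);
    assert(residualFraction >= 0.0 && residualFraction <= 1.0);

    // The clone matches the incoming flow, capped at what the finally actually saw.
    const double clone = std::min(divertedFlow, finallyWeight);

    // The original keeps the rest, which preserves count balance when other flow remains.
    const double remaining = finallyWeight - clone;

    // If nothing (or almost nothing) remains, keep a small residual instead of zero.
    // Note that this deliberately over-counts slightly in that case.
    const double residual = finallyWeight * residualFraction;
    const double original = std::max(remaining, residual);

    return {clone, original};
}
```

For example, a finally with weight 100 and 60 units of diverted flow would yield 60 for the clone and 40 for the original. With all 100 units diverted, the original would retain only the small residual.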
Loop cloning no longer recomputes preds lists: #51757
Loop cloning now scales block weights for the cloned (slow path) loop, at 1%/99% ratio: #51901
Cloning currently doesn't change / scale any edge weights. The only benefit analysis cloning does is to determine if there are loop optimizations that would be done if cloning occurs, namely, removal of array bounds checks. Adding other considerations, such as PGO weight considerations, and comparing cost against benefit, can still be done. |
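To make the 1%/99% ratio concrete, here is a small sketch of how per-block weights could be divided between the fast path and the cloned slow path. The names `ClonedLoopWeights` and `splitLoopWeights` are hypothetical, and the actual change in #51901 may compute this differently; only the ratio is taken from the comment above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container for the two sets of block weights; not a JIT type.
struct ClonedLoopWeights
{
    std::vector<double> fastPath; // optimized loop (bounds checks removed)
    std::vector<double> slowPath; // cloned loop that keeps all checks
};

// Split each original block weight so the slow path receives slowFraction of it
// and the fast path receives the remainder.
ClonedLoopWeights splitLoopWeights(const std::vector<double>& original, double slowFraction = 0.01)
{
    assert(slowFraction >= 0.0 && slowFraction <= 1.0);

    ClonedLoopWeights result;
    result.fastPath.reserve(original.size());
    result.slowPath.reserve(original.size());
    for (double w : original)
    {
        const double slow = w * slowFraction;
        result.slowPath.push_back(slow);
        result.fastPath.push_back(w - slow); // remainder, so the pair sums back to w
    }
    return result;
}
```

Unlike duplicating the weights on both copies, this keeps the total weight of each block unchanged across the two loops. Edge weights would need the same treatment, which is the part noted above as not yet done.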
Loop cloning is not a good citizen when it comes to PGO. There are two aspects to this:
If we start cloning to enable explicit control flow optimization (aka "loop unswitching") then we will have data telling us roughly how to divide up the flow. For example if the loop body contains a loop-invariant test and branch (or a test that can be rendered loop invariant with a suitable upfront check) we will know how often the branches were taken from PGO data and that tells us how to split the profile data.
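As a rough illustration of the unswitching case, the observed branch counts for the loop-invariant test could be turned into scale factors for the two loop copies. This is a sketch under assumptions: `UnswitchScales` and `computeUnswitchScales` are hypothetical names, and the even split used when no counts are available is an arbitrary choice for the example.

```cpp
#include <cassert>

// Hypothetical result type: fraction of the loop's profile assigned to each copy.
struct UnswitchScales
{
    double takenScale;    // copy specialized for the branch being taken
    double notTakenScale; // copy specialized for the branch not being taken
};

// Derive the split from PGO counts of the loop-invariant branch.
UnswitchScales computeUnswitchScales(double takenCount, double notTakenCount)
{
    assert(takenCount >= 0.0 && notTakenCount >= 0.0);

    const double total = takenCount + notTakenCount;
    if (total <= 0.0)
    {
        // No profile data for this branch: fall back to an even split (assumption).
        return {0.5, 0.5};
    }
    return {takenCount / total, notTakenCount / total};
}
```

Each block weight in the original loop would then be multiplied by `takenScale` for one copy and by `notTakenScale` for the other, so a branch taken 75 times out of 100 would put 75% of the profile on the "taken" copy.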
An example of what happens today:
Note how cloning duplicates the block weights for the loop blocks, doesn't set reasonable weights for the new blocks it adds, and messes up the edge weights even in some "remote" parts of the flow graph (suspect this is because fgUpdateChangedFlowGraph recomputes the pred lists, which seems a bit odd).
cc @BruceForstall
category:cq
theme:loop-opt
skill-level:expert
cost:large
impact:medium