loop cloning and pgo #48850

AndyAyersMS · 2021-02-26T21:52:33Z

Loop cloning is not a good citizen when it comes to PGO. There are two aspects to this:

Cloning heuristics should take PGO into account (along with other things). The benefit of cloning is in part related to the relative weight of the loop body, which is knowable with PGO. It is also related to the relative trip count of the loop, which is also knowable with PGO. So a loop that executes a lot and iterates a lot is a better cloning candidate than a loop that either doesn't execute a lot or doesn't iterate a lot. Within the loop the relative frequency of the operations that can be bypassed or further optimized with cloning should also be taken into account (eg if there's a bounds check but only under some rarely true condition, the benefit of cloning is reduced).
Cloning does not make appropriate updates to profile data (see example below). One would imagine that post cloning the profile data should show that we now think the code is more likely to execute the cloned loop; otherwise why would we have cloned in the first place? Exactly how to split up the profile data is perhaps an open question. In cases like ours where the focus is on bounds checks and control flow checks are implicit there is no obvious signal we can measure (short of tracking how often exceptions are thrown) and we may have to simply guess; something like a 0.1/0.9 split seems plausible but we might consider even more biased splits.

If we start cloning to enable explicit control flow optimization (aka "loop unswitching") then we will have data telling us roughly how to divide up the flow. For example if the loop body contains a loop-invariant test and branch (or a test that can be rendered loop invariant with a suitable upfront check) we will know how often the branches were taken from PGO data and that tells us how to split the profile data.

An example of what happens today:

Before	After

Note how cloning duplicates the block weights for the loop blocks, doesn't set reasonable weights for the new blocks it adds, and messes up the edge weights even in some "remote" parts of the flow graph (suspect this is because fgUpdateChangedFlowGraph recomputes the pred lists, which seems a bit odd).

cc @BruceForstall

category:cq
theme:loop-opt
skill-level:expert
cost:large
impact:medium

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2021-02-26T21:52:35Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

AndyAyersMS · 2021-02-26T23:35:33Z

The extra calls to fgComputePreds seem like the could be avoided. Currently they are masking some kind of faulty pred list maintenance in places, and a general failure to properly maintain BBF_JUMP_TARGET. Both of those seem fixable and then we can remove the need to rebuild pred lists (and thereby preserve the edge weights).

Sample of spmi failures if I suppress the recomputations:

ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\emitxarch.cpp (6775) - Assertion failed 'dst->bbFlags & BBF_JMP_TARGET' in 'System.Text.EncodingProvider:GetEncodingFromProvider(int):System.Text.Encoding' during 'Generate code' (IL size 51)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\emitxarch.cpp (6775) - Assertion failed 'dst->bbFlags & BBF_JMP_TARGET' in 'System.RuntimeType:IsSubclassOf(System.Type):bool:this' during 'Generate code' (IL size 94)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\fgdiagnostic.cpp (1807) - Assertion failed 'blockPred->bbNext == block || blockPred->bbJumpDest == block' in 'MemberInfoCache`1[__Canon][System.__Canon]:MergeWithGlobalList(System.__Canon[]):this' during 'Clone loops' (IL size 256)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\emitxarch.cpp (6775) - Assertion failed 'dst->bbFlags & BBF_JMP_TARGET' in 'System.Reflection.RuntimeMethodInfo:get_ContainsGenericParameters():bool:this' during 'Generate code' (IL size 74)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\fgdiagnostic.cpp (1807) - Assertion failed 'blockPred->bbNext == block || blockPred->bbJumpDest == block' in 'System.RuntimeType:MakeGenericType(System.Type[]):System.Type:this' during 'Clone loops' (IL size 315)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\emitxarch.cpp (6775) - Assertion failed 'dst->bbFlags & BBF_JMP_TARGET' in 'System.String:IsNullOrWhiteSpace(System.String):bool' during 'Generate code' (IL size 40)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\fgdiagnostic.cpp (1807) - Assertion failed 'blockPred->bbNext == block || blockPred->bbJumpDest == block' in 'System.Linq.EnumerableSorter`2[__Canon,Int32][System.__Canon,System.Int32]:ComputeKeys(System.__Canon[],int):this' during 'Clone loops' (IL size 73)
ISSUE: <ASSERT> C:\repos\runtime1\src\coreclr\jit\fgdiagnostic.cpp (1811) - Assertion failed 'blockPred->bbNext == block' in 'System.Number:FindSection(System.ReadOnlySpan`1[Char],int):int' during 'Find loops' (IL size 190)

Problematic phases for pred list maintenance seem to be "Find loops", "Unroll loops", and "Clone loops".

Not as clear yet where we're dropping BBF_JUMP_TARGET updates. We should add checking for this to the post-phase checks.

EgorBo · 2021-02-27T17:34:58Z

Shouldn't all blocks of the cloned loop have zero weight by default even without the PGO?

EgorBo · 2021-02-27T17:54:53Z

@AndyAyersMS same for CloneFinally optimization I believe:

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void DoWork() { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void DoWorkFinally() { }


    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Test()
    {
        try
        {
            DoWork();
        }
        finally
        {
            DoWorkFinally();
        }
    }

Tier0->Tier1 + TieredPGO:

G_M24707_IG01:              ;; offset=0000H
       55                   push     rbp
       4883EC30             sub      rsp, 48
       488D6C2430           lea      rbp, [rsp+30H]
       488965F0             mov      qword ptr [rbp-10H], rsp
                                                ;; bbWeight=1    PerfScore 2.75
G_M24707_IG02:              ;; offset=000EH
       E8050BFEFF           call     Program:DoWork()
       90                   nop
                                                ;; bbWeight=1    PerfScore 1.25
G_M24707_IG03:              ;; offset=0014H
       E8070BFEFF           call     Program:DoWorkFinally()
       90                   nop
                                                ;; bbWeight=1    PerfScore 1.25
G_M24707_IG04:              ;; offset=001AH
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret
                                                ;; bbWeight=1    PerfScore 2.00
G_M24707_IG05:              ;; offset=0020H
       55                   push     rbp
       4883EC30             sub      rsp, 48
       488B6920             mov      rbp, qword ptr [rcx+32]
       48896C2420           mov      qword ptr [rsp+20H], rbp
       488D6D30             lea      rbp, [rbp+30H]
                                                ;; bbWeight=1    PerfScore 4.75  <-- bbWeight should be zero
G_M24707_IG06:              ;; offset=0032H
       E8E90AFEFF           call     Program:DoWorkFinally()
       90                   nop
                                                ;; bbWeight=1    PerfScore 1.25  <-- bbWeight should be zero
G_M24707_IG07:              ;; offset=0038H
       4883C430             add      rsp, 48
       5D                   pop      rbp
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.75  <-- bbWeight should be zero

(with TieredPGO=0 weights look better)

AndyAyersMS · 2021-02-27T19:42:49Z

@EgorBo not sure what the right division is for cloning. Setting the unoptimized loops counts to zero may be too drastic; we still want to lightly optimize it in case that's where the program goes at runtime.

Good point about finally cloning -- this isn't caught by the consistency checker as it considers each EH handler more or less as its own separate graph. Also note we can't always divert all the non-eh flow to the clone.... so we need to do some math and figure out the right scaling here. Likely something like: scale the finally clone to match the incoming flow, scale the original finally to preserve overall count balance for non EH. If there is no other non-EH flow or all other non-EH flow is zero, then perhaps still keep a bit of residual profile around so we can lightly optimize the original finally.

BruceForstall · 2021-05-10T19:01:36Z

Loop cloning no longer recomputes preds lists: #51757

Loop cloning now scales block weights for the cloned (slow path) loop, at 1%/99% ratio: #51901

Cloning currently doesn't change / scale any edge weights.

The only benefit analysis cloning does is to determine if there are loop optimizations that would be done if cloning occurs, namely, removal of array bounds checks. Adding other considerations, such as PGO weight considerations, and comparing cost against benefit, can still be done.

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 26, 2021

AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Feb 26, 2021

BruceForstall added this to the 6.0.0 milestone Feb 26, 2021

BruceForstall mentioned this issue Feb 26, 2021

Improve JIT loop optimizations (.NET 6) #43549

Closed

25 tasks

This was referenced Mar 2, 2021

Loop opts should not be recomputing pred lists from scratch #49030

Closed

PGO considerations for finally cloning #48925

Open

Dynamic PGO #43618

Closed

JulieLeeMSFT assigned BruceForstall Mar 11, 2021

BruceForstall modified the milestones: 6.0.0, 7.0.0 Jul 6, 2021

BruceForstall mentioned this issue Jul 6, 2021

Improve JIT loop optimizations (.NET 7) #55235

Closed

5 tasks

BruceForstall mentioned this issue Feb 15, 2022

Improve JIT loop optimizations #65342

Open

20 tasks

BruceForstall modified the milestones: 7.0.0, 8.0.0 May 26, 2022

BruceForstall modified the milestones: 8.0.0, Future Jun 16, 2023

JulieLeeMSFT added this to .NET Core CodeGen Jun 5, 2024

JulieLeeMSFT moved this to Optimizations in .NET Core CodeGen Jun 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

loop cloning and pgo #48850

loop cloning and pgo #48850

AndyAyersMS commented Feb 26, 2021 •

edited by BruceForstall

Loading

dotnet-issue-labeler bot commented Feb 26, 2021

AndyAyersMS commented Feb 26, 2021

EgorBo commented Feb 27, 2021

EgorBo commented Feb 27, 2021 •

edited

Loading

AndyAyersMS commented Feb 27, 2021

BruceForstall commented May 10, 2021

loop cloning and pgo #48850

loop cloning and pgo #48850

Comments

AndyAyersMS commented Feb 26, 2021 • edited by BruceForstall Loading

dotnet-issue-labeler bot commented Feb 26, 2021

AndyAyersMS commented Feb 26, 2021

EgorBo commented Feb 27, 2021

EgorBo commented Feb 27, 2021 • edited Loading

AndyAyersMS commented Feb 27, 2021

BruceForstall commented May 10, 2021

AndyAyersMS commented Feb 26, 2021 •

edited by BruceForstall

Loading

EgorBo commented Feb 27, 2021 •

edited

Loading