PGO: Add new tiers #70941

EgorBo · 2022-06-18T19:27:05Z

This PR implements @jkotas's idea in #70410 (comment) when DOTNET_TieredPGO is enabled (it's off by default and will be so in .NET 7.0)

Use R2R code for process startup

Once the process starts up, switch to instrumented code for hot methods

Once enough PGO data is collected, create an optimized JITed version

Design

flowchart
    prestub(.NET Function) -->|Compilation| hasAO{"Marked with<br/>[AggressiveOpts]?"}
    hasAO-->|Yes|tier1ao["JIT to <b><ins>Tier1</ins></b><br/><br/>(no dynamic profile data)"]
    hasAO-->|No|hasR2R
    hasR2R{"Is prejitted (R2R)?"} -->|No| tier000

    tier000["JIT to <b><ins>Tier0</ins></b><br/><br/>(not optimized, not instrumented,<br/> with patchpoints)"]-->|Running...|ishot555
    ishot555{"Is hot?<br/>(called >30 times)"}
    ishot555-.->|No,<br/>keep running...|ishot555
    ishot555-->|Yes|tier0
   
    hasR2R -->|Yes| R2R
    R2R["Use <b><ins>R2R</ins></b> code<br/><br/>(optimized, not instrumented,<br/>no patchpoints)"] -->|Running...|ishot1
    ishot1{"Is hot?<br/>(called >30 times)"}-.->|No,<br/>keep running...|ishot1
    ishot1--->|"Yes"|tier1inst

    tier0["JIT to <b><ins>Tier0Instrumented</ins></b><br/><br/>(not optimized, instrumented,<br/> with patchpoints)"]-->|Running...|ishot5
    tier1pgo2["JIT to <b><ins>Tier1</ins></b><br/><br/>(optimized with profile data)"]
      
    tier1inst["JIT to <b><ins>Tier1Instrumented</ins></b><br/><br/>(optimized, instrumented, <br/>no patchpoints)"]
    tier1inst-->|Running...|ishot5
    ishot5{"Is hot?<br/>(called >30 times)"}-->|Yes|tier1pgo2
    ishot5-.->|No,<br/>keep running...|ishot5
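
The transitions in the flowchart can be sketched as a small state machine. This is only an illustration: `Tier`, `InitialTier`, `NextTier`, and `kHotCallThreshold` are hypothetical names that mirror the boxes above, not types or functions from the runtime, and the threshold of 30 is taken from the "called >30 times" labels.

```cpp
#include <cassert>

// Hypothetical names that mirror the flowchart boxes; not runtime types.
enum class Tier { R2R, Tier0, Tier0Instrumented, Tier1Instrumented, Tier1 };

// "Is hot? (called >30 times)" in the flowchart.
constexpr int kHotCallThreshold = 30;

// Tier a method starts in when it is first requested.
inline Tier InitialTier(bool hasAggressiveOpts, bool isPrejitted)
{
    if (hasAggressiveOpts)
        return Tier::Tier1; // straight to Tier1, no dynamic profile data
    return isPrejitted ? Tier::R2R : Tier::Tier0;
}

// Tier the method should move to after `callsInCurrentTier` calls.
inline Tier NextTier(Tier current, int callsInCurrentTier)
{
    assert(callsInCurrentTier >= 0);
    if (callsInCurrentTier <= kHotCallThreshold)
        return current; // not hot yet, keep running

    switch (current)
    {
        case Tier::R2R:               return Tier::Tier1Instrumented;
        case Tier::Tier0:             return Tier::Tier0Instrumented;
        case Tier::Tier0Instrumented: return Tier::Tier1;
        case Tier::Tier1Instrumented: return Tier::Tier1;
        case Tier::Tier1:             return Tier::Tier1; // final tier
    }
    return current;
}
```

In this model a prejitted method needs to cross the threshold twice (once in R2R code, once in the instrumented version) before it reaches the final Tier1 code, which is the extra compilation cost discussed later in the thread.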

jkotas · 2022-06-18T19:34:25Z

cc @noahfalk @kouvel

janvorli · 2022-06-20T11:19:27Z

2. "Patch" the initial [callcounting] stub to look at tier0 instead of r2r and reset the call-counting-cell

It sounds better to me, as we never delete the call counting stubs, so this would leave less garbage.

noahfalk · 2022-06-20T20:28:27Z

When I was doing tiered compilation originally one of the benchmarks I found helpful was MusicStore and I had the app measure its own latency for requests 0-500, 501-1000, 1000-1500, and so on. This helped me get an idea how quickly an app was able to converge to the steady state behavior. Completely up to you if a similar analysis would be useful now. One hypothesis I'd have for the worse TE numbers is that the benchmark might be short enough that it is capturing a substantial amount of pre-steady-state behavior.
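
A minimal sketch of that kind of bucketed analysis, assuming the app has already recorded one latency value per request in arrival order. `AverageLatencyPerBucket` is a hypothetical helper, not something from MusicStore or the benchmark infrastructure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Average latency for each consecutive group of `bucketSize` requests
// (e.g. requests 0-499, 500-999, ...). The last bucket may be partial.
inline std::vector<double> AverageLatencyPerBucket(const std::vector<double>& latenciesMs,
                                                   std::size_t bucketSize)
{
    assert(bucketSize > 0);
    std::vector<double> averages;
    for (std::size_t start = 0; start < latenciesMs.size(); start += bucketSize)
    {
        const std::size_t end = std::min(start + bucketSize, latenciesMs.size());
        double sum = 0.0;
        for (std::size_t i = start; i < end; ++i)
            sum += latenciesMs[i];
        averages.push_back(sum / static_cast<double>(end - start));
    }
    return averages;
}
```

Plotting the returned averages against the bucket index shows how many requests it takes for latency to flatten out, which is one way to judge whether a short benchmark is mostly measuring pre-steady-state behavior.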

EgorBo · 2022-06-20T20:39:41Z

Thanks! When I patch the call-counting stub, we might consider resetting the call-counting cell to a smaller number (e.g. 10)
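
To illustrate the "patch and reset" idea, here is a toy model of a call-counting cell that is retargeted in place rather than replaced by a newly allocated stub. `CallCountingCell`, `CodeVersion`, `OnCall`, and `Retarget` are hypothetical names for this sketch and do not reflect the runtime's actual stub or cell layout.

```cpp
#include <cassert>

// Hypothetical model; not the runtime's actual data structures.
enum class CodeVersion { R2R, Instrumented, Optimized };

struct CallCountingCell
{
    CodeVersion target;   // code the stub currently forwards calls to
    int remainingCalls;   // calls left before a promotion is requested

    // Counts one call. Returns true when the threshold has just been reached.
    bool OnCall()
    {
        assert(remainingCalls > 0);
        return --remainingCalls == 0;
    }

    // "Patch" the existing cell: point it at new code and restart the count,
    // possibly with a smaller threshold than the initial one.
    void Retarget(CodeVersion newTarget, int newThreshold)
    {
        assert(newThreshold > 0);
        target = newTarget;
        remainingCalls = newThreshold;
    }
};
```

For example, a cell could start as `{CodeVersion::R2R, 30}` and, once `OnCall()` returns true, be retargeted with `Retarget(CodeVersion::Instrumented, 10)` so that the instrumented code is promoted after fewer calls.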

EgorBo · 2022-06-21T22:07:26Z

Just realized that I can also introduce a new tier for non-r2r cases:

tier0 -> instrumented tier0 -> tier1

This solves a different problem that we have now - instrumentation is quite heavy (both in terms of TP and perf). However, as Andy noted, we need to be careful around OSR.

I have a demo locally. For now I am allocating a new call-counting stub every time because it's simpler.

AndyAyersMS · 2022-06-22T16:46:18Z

We can't currently leave Tier1-OSR and get back to Tier0, at least not mid-method. If the method is called every so often we could switch at a call boundary, but that wouldn't help, say, Main, which is only called once.

What we could do instead is go from Tier0 (uninstrumented) to Tier0-OSR (instrumented) and then to Tier1-OSR (instrumented). This would give some PGO data, but we might not see the early parts of the method execute with instrumentation.

AndyAyersMS · 2022-06-22T18:44:36Z

Sort of? I can post what I think of as the right flow in a bit.

kouvel

Just one comment, otherwise the tiering stuff LGTM, thanks!

src/coreclr/vm/tieredcompilation.cpp

EgorBo · 2022-10-19T00:20:22Z

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

azure-pipelines · 2022-10-19T00:20:48Z

Azure Pipelines successfully started running 2 pipeline(s).

EgorBo · 2022-10-19T15:22:05Z

X axis - time from start, seconds
Y axis - average RPS per 10 seconds.

Figure: RPS during startup for BingSNR (X: time from start in seconds, Y: RPS).

Observations:

Since a lot of stuff is prejitted (they've fixed R2R issues with their builds yesterday) we don't see RPS improvements from DynamicPGO in current Main since it gives up on R2R'd code - blue and green lines have the same RPS, green one (PGO) starts a bit slower.
This PR enables DynamicPGO for R2R'd code so we can see actual RPS improvements (not much but it's likely because of my setup where I had to limit bombardier to send only 2 requests at once) - we pay a big price for that during start, as you can see it takes some time for the grey line to hit the top RPS.
Red line is this PR + some tuning around the queue of methods pending call-counting installation (e.g. put all R2R'd code into a separate "low priority" queue because R2R'd code is fast as is). Also, @AndyAyersMS is right, the threshold for OSR needs to be lowered to something like 100-200 because we don't want to get stuck in a slow loop for too long. The lower threshold also helped the red line.

The measurements are quite stable, I did a lot of runs locally (made a script)

EgorBo · 2022-10-19T17:29:14Z

The problem with the queue of methods pending the call counting installation (not even promotion to tier1, just to start counting calls) visualized (Bing SNR again):

X axis - time in seconds after start.
Y axis - number of methods waiting to get a call-counting stub

It means that a lot of methods were stuck in tier0 and had no chance to get a call-counting stub to start counting - every 100ms a new compilation occurs and it delays call-counting stub installation till we have a window large enough for it.
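
A simplified model of why the queue keeps growing, assuming that installation only begins after a quiet window with no new compilations and that every new compilation restarts that window. `EarliestInstallTimeMs` is a hypothetical helper for this sketch; the real scheduling logic in the runtime is more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given ascending timestamps (ms) of new compilations and the required quiet
// window, returns the earliest time at which call-counting installation can
// begin. Each compilation that arrives before the timer fires restarts it.
inline int EarliestInstallTimeMs(const std::vector<int>& compilationTimesMs, int quietWindowMs)
{
    assert(!compilationTimesMs.empty());
    assert(quietWindowMs > 0);

    int fireAt = compilationTimesMs[0] + quietWindowMs;
    for (std::size_t i = 1; i < compilationTimesMs.size(); ++i)
    {
        if (compilationTimesMs[i] >= fireAt)
            break;                                   // timer fired before this compilation
        fireAt = compilationTimesMs[i] + quietWindowMs; // activity seen, restart the window
    }
    return fireAt;
}
```

With a 100 ms window and compilations at 0, 90, 180, and 270 ms, the model does not allow installation until 370 ms, even though the first method was ready to start counting at 100 ms.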

EgorBo · 2022-10-19T17:32:51Z

Anyway, I'd like to work on improvements for that queue in a separate PR (and will adjust the OSR limit), e.g. to introduce a separate "low priority" queue for methods coming from R2R since it's fast as is
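
One possible shape for such a two-level queue, as a sketch only. `PendingCallCountingQueue`, `Enqueue`, and `TryDequeue` are hypothetical names, and methods are represented by plain strings for illustration; this is not the follow-up PR's design.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical two-level queue of methods waiting for a call-counting stub.
struct PendingCallCountingQueue
{
    std::deque<std::string> normal;      // methods running unoptimized Tier0 code
    std::deque<std::string> lowPriority; // methods running R2R code, fast as is

    void Enqueue(const std::string& method, bool isR2R)
    {
        assert(!method.empty());
        (isR2R ? lowPriority : normal).push_back(method);
    }

    // Picks the next method to get a call-counting stub. R2R'd methods are
    // served only when no Tier0 method is waiting. Returns false if both
    // queues are empty.
    bool TryDequeue(std::string& method)
    {
        std::deque<std::string>& source = !normal.empty() ? normal : lowPriority;
        if (source.empty())
            return false;
        method = source.front();
        source.pop_front();
        return true;
    }
};
```

The design choice here is strict priority: slow Tier0 code always gets its stub first. A real implementation would probably need some guard against starving the low-priority queue indefinitely.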

@AndyAyersMS can you review/approve the jit part? It seems to be passing outerloop PGO pipelines (both runtime and libraries)
@davidwrighton this PR is blocked on your change request, could you take a look if I addressed your concerns (lack of docs)?

AndyAyersMS

Overall looks good. Left notes on a few things to consider.

I take it COMPlus_WritePGOData still works as long as you also enable TieredPGO? It is occasionally quite useful.

src/coreclr/jit/compiler.cpp

AndyAyersMS · 2022-10-21T20:25:55Z

src/coreclr/jit/fgprofile.cpp

-    const bool osrMethod            = opts.IsOSR();
-    const bool useEdgeProfiles = (JitConfig.JitEdgeProfiling() > 0) && !prejit && !tier0WithPatchpoints && !osrMethod;
+    const bool instrOpt             = opts.IsInstrumentedOptimized();
+    const bool useEdgeProfiles = (JitConfig.JitEdgeProfiling() > 0) && !prejit && !tier0WithPatchpoints && !instrOpt;


Do we really need block profiles for full instrumented/optimized methods? Seems like edge profiles might work, unless perhaps the thing you need is the special handling for tail calls.

If so, can you add a clarifying comment here?

Unfortunately we still need it for now, I'll allocate some time to investigate what is needed to enable edge-profiling after this PR

AndyAyersMS · 2022-10-21T20:27:31Z

src/coreclr/jit/importercalls.cpp

            // Only schedule importation if we're not currently importing.
            //
-            if (mustImportEntryBlock && (compCurBB != fgEntryBB))
+            if ((opts.IsInstrumentedOptimized() || opts.IsOSR()) && mustImportEntryBlock && (compCurBB != entryBb))


This is an OSR only issue, as full method compilation will naturally start importing with fgFirstBB.

src/coreclr/inc/dacprivate.h

AndyAyersMS · 2022-10-21T20:37:06Z

src/coreclr/jit/importercalls.cpp

@@ -1288,7 +1288,7 @@ var_types Compiler::impImportCall(OPCODE                  opcode,
        //    have to check for anything that might introduce a recursive tail call.
        // * We only instrument root method blocks in OSR methods,
        //
-        if (opts.IsOSR() && !compIsForInlining())
+        if ((opts.IsInstrumentedOptimized() || opts.IsOSR()) && !compIsForInlining())


As noted below, I think this is an OSR only problem.

So it could be

Suggested change

-        if ((opts.IsInstrumentedOptimized() || opts.IsOSR()) && !compIsForInlining())
+        if ((opts.IsInstrumentedOptimized() && opts.IsOSR()) && !compIsForInlining())

opts.IsInstrumentedOptimized() && opts.IsOSR() does not work; for some reason it hits an assert even in non-PGO OSR mode. I should be able to revert these changes once I fix "edge profiling" for optimized code.

EgorBo · 2022-10-22T23:52:46Z

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

azure-pipelines · 2022-10-22T23:53:08Z

Azure Pipelines successfully started running 2 pipeline(s).

docs/design/features/DynamicPgo-InstrumentedTiers.md

Co-authored-by: Andy Ayers <andya@microsoft.com>

EgorBo · 2022-10-26T10:53:15Z

I'm merging the PR, we should see improvements on the PGO tab at https://aka.ms/aspnet/benchmarks in a couple of days.
Meanwhile I'll work on the queue of methods pending call-counting, because this PR introduces new compilations and hence delays promotion to tier1. I already have a couple of improvements in that area.

Just in case: this PR doesn't affect the current defaults, it only kicks in when DOTNET_TieredPGO=1 is set

PS: I've just kicked off a new SPMI collection

AndyAyersMS · 2022-10-27T02:00:49Z

Keep a close eye on BDN results from the lab -- we may need to adjust things since some benchmarks implicitly rely on the old tiering strategy.

EgorBo · 2022-10-27T17:57:56Z

Fresh out of the oven: public results from https://aka.ms/aspnet/benchmarks (14th page)

🙂

DynamicPGO now has almost the same RPS as FullPGO, slightly less because of the less accurate profile in optimized code (after R2R), as expected.
Both DynamicPGO and FullPGO now start faster because they only instrument hot code, not just everything.

Improvements in the working set are unrelated - #76985. I checked that this PR doesn't regress working set (due to more compilations) or at least it's around noise.
^ UPD: #76985 also improved start-time on Linux, so on the screenshot it's a joint effort from these two PRs for Time to first request. RPS is still this PR-only. On Windows, where #76985 has no effect, we still see improvements around start-time on FullPGO as expected:

dotnet-issue-labeler bot added the area-VM-coreclr label Jun 18, 2022

ghost assigned EgorBo Jun 18, 2022

EgorBo changed the title ~~r2r -> instrumented tier0 -> optimized tier1~~ R2R -> instrumented tier0 -> optimized tier1 Jun 18, 2022