Possible strategy changes needed to enable OSR #2214

Closed
AndyAyersMS opened this issue Jan 15, 2022 · 8 comments

Comments


AndyAyersMS commented Jan 15, 2022

We are working towards enabling OSR (On Stack Replacement) by default for .NET 7 for x64 and arm64. As part of this we will also modify the runtime so that quick jit for loops is enabled.

See for instance dotnet/runtime#63642.

This has performance implications for benchmarks that don't run enough iterations to reach Tier1. These are typically benchmarks that internally loop and so are currently eagerly optimized because quick jit for loops is disabled. A private benchmark run shows several hundred benchmarks impacted by this, with regressions outnumbering improvements by about 2 to 1.

[Upon further analysis the number of truly impacted benchmarks may be smaller, maybe ~100. It is hard to gauge from one-off runs as many benchmarks are noisy. But we can look at perf history in main and see that some of the "regressions" seen from the one-off OSR run are in noisy tests and the values are within the expected noise range.]

One such example is Burgers.Test3. With the current strategy we end up running about 20 invocations total, and the main method is fully optimized from the start. When we turn on QuickJitForLoops and OSR, the main method is initially unoptimized. OSR accelerates its performance, but OSR performance does not reach the same level as Tier1, and we don't run enough invocations to make it to Tier1.

While in this case the OSR version is slower, sometimes the OSR version runs faster. In general, we aspire to have the OSR perf be competitive with Tier1, but swings of +/- 20% are going to be common and cannot easily be addressed.

One way we can mitigate these effects is to always run (or selectively run, for some subset of benchmarks) at least 30 warmup iterations. For example:

default

| Method    | Job        | Ver | Mean      |
|-----------|------------|-----|-----------|
| Burgers_0 | Job-SYCWNE | OSR | 192.68 ms |
| Burgers_0 | Job-ZXAOBL | DEF | 184.51 ms |
| Burgers_1 | Job-SYCWNE | OSR | 224.11 ms |
| Burgers_1 | Job-ZXAOBL | DEF | 155.64 ms |
| Burgers_2 | Job-SYCWNE | OSR | 178.63 ms |
| Burgers_2 | Job-ZXAOBL | DEF | 156.51 ms |
| Burgers_3 | Job-SYCWNE | OSR | 181.05 ms |
| Burgers_3 | Job-ZXAOBL | DEF | 85.63 ms  |

default + --warmupCount 30

| Method    | Job        | Ver | Mean      |
|-----------|------------|-----|-----------|
| Burgers_0 | Job-SOPVCH | OSR | 186.75 ms |
| Burgers_0 | Job-TMAUKP | DEF | 185.61 ms |
| Burgers_1 | Job-SOPVCH | OSR | 155.64 ms |
| Burgers_1 | Job-TMAUKP | DEF | 157.39 ms |
| Burgers_2 | Job-SOPVCH | OSR | 157.37 ms |
| Burgers_2 | Job-TMAUKP | DEF | 160.09 ms |
| Burgers_3 | Job-SOPVCH | OSR | 89.70 ms  |
| Burgers_3 | Job-TMAUKP | DEF | 85.96 ms  |

It is expected that if we can do this (or something equivalent) then OSR will not impact perf measurements.
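
For the "selectively run" flavor of this mitigation, one possibility is a small shared config applied only to the affected benchmarks. A minimal sketch, assuming BenchmarkDotNet's `ManualConfig`, `Job.Default.WithWarmupCount`, and `[Config]` APIs (the config class is made up and the benchmark body is elided):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

// Shared config: pin the warmup count to 30 so looping benchmarks reach Tier1
// before measured iterations begin, regardless of how slow the early
// (Tier0/OSR) iterations look to the pilot stage.
public class MinWarmupConfig : ManualConfig
{
    public MinWarmupConfig()
    {
        AddJob(Job.Default.WithWarmupCount(30));
    }
}

// Applied only to the benchmarks known to be affected by OSR/QuickJitForLoops.
[Config(typeof(MinWarmupConfig))]
public class Burgers
{
    [Benchmark]
    public double Test3()
    {
        // looping workload elided
        return 0;
    }
}
```

(The global equivalent is the `--warmupCount 30` command line shown above.)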

AndyAyersMS commented:

FWIW we already do this in one spot:

[WarmupCount(30)] // make sure it's promoted to Tier 1
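
For context, a minimal sketch of how that attribute sits on a benchmark class (assuming BenchmarkDotNet's `[WarmupCount]` attribute; the class and method below are illustrative, not the actual dotnet/performance benchmark):

```csharp
using BenchmarkDotNet.Attributes;

[WarmupCount(30)] // make sure the workload is promoted to Tier1 before measuring
public class SomeLoopingBenchmark
{
    [Benchmark]
    public long SumLoop()
    {
        long total = 0;
        for (int i = 0; i < 1_000_000; i++)
            total += i;
        return total;
    }
}
```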

AndyAyersMS commented:

I am also seeing some cases where more warmups impact perf, e.g. Benchstone.BenchF.InProd.Test:

| Method | Config | Mean    |
|--------|--------|---------|
| InProd | OSR    | 1.277 s |
| InProd | DEF    | 1.133 s |
| InProd | OSR 30 | 1.092 s |
| InProd | DEF 30 | 1.110 s |


AndyAyersMS commented Feb 17, 2022

FWIW my current thinking is that we'll have to enable OSR/QJFL and let things bake for a week or so to sort out the truly impacted tests, then start to address them as we see fit. There may be some tough calls for benchmarks that do a lot of work per iteration (the Benchmarks Game suite, say) and can't easily be broken down into smaller units of work.

ETA for enabling all this is sometime before Preview 3.


AndyAyersMS commented Feb 26, 2022

Some more notes on this. I have been looking at System.Text.Tests.Perf_StringBuilder.Append_Strings.

With current defaults, OSR perf shows a substantial regression on the repeat = 1000 case:

| Method         | Job        | Toolchain         | repeat | Mean         |
|----------------|------------|-------------------|--------|--------------|
| Append_Strings | Job-TDJEZD | \main\corerun.exe | 1      | 154.3 ns     |
| Append_Strings | Job-DGZWOQ | \osr\corerun.exe  | 1      | 157.9 ns     |
| Append_Strings | Job-TDJEZD | \main\corerun.exe | 1000   | 68,416.7 ns  |
| Append_Strings | Job-DGZWOQ | \osr\corerun.exe  | 1000   | 272,356.5 ns |

Drilling in, there seem to be a few interesting things going on. First, BDN chooses very different iteration strategies for the two cases: because it sees that the initial iterations with OSR are "slow", it plans fewer of them overall.

;; main

OverheadJitting  1: 1 op, 1560000.00 ns, 1.5600 ms/op
WorkloadJitting  1: 1 op, 2171600.00 ns, 2.1716 ms/op

OverheadJitting  2: 16 op, 2081000.00 ns, 130.0625 us/op
WorkloadJitting  2: 16 op, 4974600.00 ns, 310.9125 us/op

WorkloadPilot    1: 16 op, 1345600.00 ns, 84.1000 us/op
WorkloadPilot    2: 2976 op, 212611300.00 ns, 71.4420 us/op
WorkloadPilot    3: 3504 op, 234737500.00 ns, 66.9913 us/op
WorkloadPilot    4: 3744 op, 254916000.00 ns, 68.0865 us/op
WorkloadPilot    5: 3680 op, 238987500.00 ns, 64.9423 us/op
WorkloadPilot    6: 3856 op, 251380500.00 ns, 65.1920 us/op
WorkloadPilot    7: 3840 op, 259264700.00 ns, 67.5168 us/op

;; osr

OverheadJitting  1: 1 op, 677800.00 ns, 677.8000 us/op
WorkloadJitting  1: 1 op, 889400.00 ns, 889.4000 us/op

OverheadJitting  2: 16 op, 1271900.00 ns, 79.4938 us/op
WorkloadJitting  2: 16 op, 6227700.00 ns, 389.2312 us/op

WorkloadPilot    1: 16 op, 1655100.00 ns, 103.4437 us/op
WorkloadPilot    2: 2432 op, 672568200.00 ns, 276.5494 us/op
WorkloadPilot    3: 912 op, 239154500.00 ns, 262.2308 us/op
WorkloadPilot    4: 960 op, 258201200.00 ns, 268.9596 us/op
WorkloadPilot    5: 944 op, 254253800.00 ns, 269.3367 us/op

If I force the two runs to use the same number of invocations per iteration (--unrollFactor 1000), performance equalizes, though the values don't match what we are measuring now:

| Method         | Job        | Toolchain         | repeat | Mean     |
|----------------|------------|-------------------|--------|----------|
| Append_Strings | Job-GIKFDD | \main\corerun.exe | 1000   | 394.5 us |
| Append_Strings | Job-ZAVQWS | \osr\corerun.exe  | 1000   | 407.9 us |
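
In source, the same per-iteration pinning could be expressed from a config rather than the command line. A rough sketch, assuming BenchmarkDotNet's `WithUnrollFactor` job API (the config class name is made up):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class FixedUnrollConfig : ManualConfig
{
    public FixedUnrollConfig()
    {
        // Use an unroll factor of 1000 (mirroring --unrollFactor 1000 above),
        // which also forces the per-iteration invocation count to be a
        // multiple of 1000, so the pilot stage plans the same iteration shape
        // for both the main and OSR toolchains.
        AddJob(Job.Default.WithUnrollFactor(1000));
    }
}
```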

Similarly, with --warmupCount 30 we do enough warmups to get everything to Tier1 and there's substantially less regression:

| Method         | Job        | Toolchain         | repeat | Mean     |
|----------------|------------|-------------------|--------|----------|
| Append_Strings | Job-GSKGPF | \main\corerun.exe | 1000   | 69.78 us |
| Append_Strings | Job-XXUJZP | \osr\corerun.exe  | 1000   | 76.81 us |

When I drill in further, the Tier1 version of Append_Strings is actually created pretty early on even in the OSR case, because the second WorkloadPilot run invokes it enough times to tier up. But we don't seem to invoke the Tier1 method during any of the WorkloadPilot runs, perhaps because we're using a cached delegate and those don't get backpatched?

It also turns out the default OSR policies don't work well here: we need to see at least 1000 iterations in any given call to transition to OSR, and Append_Strings iterates exactly 1000 times. I can hack the OSR policies to force much quicker OSR creation and transition, and this also fixes the perf discrepancy:

COMPlus_TC_OnStackReplacement_InitialCounter=1,COMPlus_OSR_CounterBump=1

| Method         | Job        | Toolchain         | repeat | Mean     |
|----------------|------------|-------------------|--------|----------|
| Append_Strings | Job-PJJGVK | \main\corerun.exe | 1000   | 66.31 us |
| Append_Strings | Job-DVRJPD | \osr\corerun.exe  | 1000   | 73.05 us |
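
For completeness, the same OSR-policy tweak could also be applied from a BDN job rather than process-wide environment variables. A sketch, assuming BenchmarkDotNet's `EnvironmentVariable` / `WithEnvironmentVariables` APIs (the config class name is made up, and this is the policy hack, not a recommended setting):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

public class QuickOsrTransitionConfig : ManualConfig
{
    public QuickOsrTransitionConfig()
    {
        // Force OSR methods to be created and transitioned to almost immediately.
        AddJob(Job.Default.WithEnvironmentVariables(
            new EnvironmentVariable("COMPlus_TC_OnStackReplacement_InitialCounter", "1"),
            new EnvironmentVariable("COMPlus_OSR_CounterBump", "1")));
    }
}
```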

But the general idea for OSR is that it is an insurance policy for methods that iterate a lot but don't get called enough to tier up. Here the method is getting called enough to tier up. So, I'm somewhat reluctant to hack the OSR policy here.

I wonder if we either need to consider backpatching into delegates, or else having the benchmark harness reallocate the delegate every so often to ensure it's getting the latest codegen?

cc @kouvel re backpatching of delegates. (I'm probably wrong about this, see below)


AndyAyersMS commented Feb 27, 2022

Hmm, looks like I am off base. The first Tier1 CPU sample is at 1305 in my trace, just after the method load at 1304. The second workload pilot stretches from 872 to 1553. So we are invoking Tier1 code.

Given that, you'd expect the second workload pilot to show slightly improved results over the first, and the third workload pilot (which runs from 1554 to 1800 or so) to have only fast iterations. But that's not the case:

pilot 1: 5.729 ms / 16 ops   ≈ 358 us/op
pilot 2: 674 ms   / 2352 ops ≈ 286 us/op
pilot 3: 253 ms   / 880 ops  ≈ 287 us/op

I think what is really going on here is that the benchmark allocates a fair amount, and so spending any reasonable amount of time allocating from Tier0 leads to very high survival rates, and this somehow biases the GC so it ends up doing a lot more work.

Here are the Gen 2 GC and Pilot stage events for a default run:

[image: Gen2 GC and pilot stage events, default run]

Note that no Gen2 GCs happen during the pilot intervals; they all occur before or after.

But with OSR enabled (and no other special settings) there are many Gen2 GCs during the pilot intervals (so many that you can't see the end of the second interval here).

[image: Gen2 GC and pilot stage events, OSR-enabled run]

Looking at the total process GC behavior, the default run is:

[image: total process GC behavior, default run]

and with OSR we see the following (note it is running fewer invocations overall, so allocating less):

[image: total process GC behavior, OSR-enabled run]

So now the question is: what is it about OSR that leads to this behavior? One factor is that Tier0 versions of methods generally have untracked GC lifetimes, and so can root a lot of objects. It seems likely that this is happening for Append_Strings.
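
An illustrative (non-BDN) sketch of the untracked-lifetime effect; the method and helpers below are made up purely to show the idea:

```csharp
using System;

static class Tier0LifetimeDemo
{
    // In fully optimized code the JIT tracks 'buffer' and stops reporting it
    // to the GC after its last use, so it can be collected while OtherWork
    // runs. In a Tier0 (unoptimized) compilation, locals are untracked and
    // reported live for the whole method, so the buffer stays rooted until
    // the method returns, raising survival rates if a GC happens meanwhile.
    static void Example()
    {
        byte[] buffer = new byte[64 * 1024 * 1024];
        Consume(buffer);   // last use of 'buffer'
        OtherWork();       // at Tier0, 'buffer' is still rooted here
    }

    static void Consume(byte[] data) => Console.WriteLine(data.Length);

    static void OtherWork()
    {
        for (int i = 0; i < 1000; i++)
            _ = new string('x', 1000); // allocation that may trigger GCs
    }
}
```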

AndyAyersMS commented:

I think I have narrowed one class of issues down to the various autogenerated methods not being optimized: in particular WorkloadActionUnroll, but likely others as well.

When OSR is enabled we see some alarming regressions. Here's one:

[image: perf history chart for System.Text.Tests.Perf_StringBuilder.ToString_SingleSegment(length: 100000)]

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_Windows%2010.0.18362_PGOType=nopgo/System.Text.Tests.Perf_StringBuilder.ToString_SingleSegment(length:%20100000).html

This repros with just --envVars COMPlus_TC_QuickJitForLoops:1, and the issue seems to be that the manually unrolled loop holds on to large amounts of data, drastically changing GC behavior. You can see that at some point things just blow up:

// AfterActualRun
WorkloadResult   1: 4112 op, 253990800.00 ns, 61.7682 us/op
WorkloadResult   2: 4112 op, 258923300.00 ns, 62.9677 us/op
WorkloadResult   3: 4112 op, 259021100.00 ns, 62.9915 us/op
WorkloadResult   4: 4112 op, 257990900.00 ns, 62.7410 us/op
WorkloadResult   5: 4112 op, 255131000.00 ns, 62.0455 us/op
WorkloadResult   6: 4112 op, 252709800.00 ns, 61.4567 us/op
WorkloadResult   7: 4112 op, 252181300.00 ns, 61.3281 us/op
WorkloadResult   8: 4112 op, 251426700.00 ns, 61.1446 us/op
WorkloadResult   9: 4112 op, 249935700.00 ns, 60.7820 us/op
WorkloadResult  10: 4112 op, 249252300.00 ns, 60.6158 us/op
WorkloadResult  11: 4112 op, 1311020700.00 ns, 318.8280 us/op
WorkloadResult  12: 4112 op, 492687900.00 ns, 119.8171 us/op
WorkloadResult  13: 4112 op, 132994300.00 ns, 32.3430 us/op
WorkloadResult  14: 4112 op, 1181440100.00 ns, 287.3152 us/op
WorkloadResult  15: 4112 op, 1349824000.00 ns, 328.2646 us/op
WorkloadResult  16: 4112 op, 1344150900.00 ns, 326.8849 us/op
WorkloadResult  17: 4112 op, 393625800.00 ns, 95.7261 us/op
WorkloadResult  18: 4112 op, 1339774900.00 ns, 325.8207 us/op
WorkloadResult  19: 4112 op, 1375974900.00 ns, 334.6242 us/op
WorkloadResult  20: 4112 op, 1363235900.00 ns, 331.5262 us/op

and profiling shows almost all time is spent in GC or in the kernel allocating more pages.

A simple mitigation would be to mark all these internal delegate-invoking methods as `[MethodImpl(MethodImplOptions.AggressiveOptimization)]`, which would preserve the current behavior where these methods are eagerly optimized.
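
A rough sketch of the idea; this is not BenchmarkDotNet's actual generated code, and the member names below are placeholders:

```csharp
using System;
using System.Runtime.CompilerServices;

public class GeneratedRunnable
{
    private Action workloadDelegate = () => { };

    // Marking the generated invocation loop with AggressiveOptimization keeps
    // it out of Tier0 even when TC_QuickJitForLoops/OSR are enabled,
    // preserving today's behavior where this method is eagerly optimized.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public void WorkloadActionUnroll(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
        {
            // The real generated method is manually unrolled (16x by default).
            workloadDelegate();
        }
    }
}
```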

This is potentially blocking enabling OSR, so it would be nice to resolve.

cc @adamsitnik

AndyAyersMS commented:

I can tell the jit to forcibly optimize those methods, and that confirms that jitting the Action methods at Tier0 is the cause of the apparent regressions.

For reference, I used the JitEnableOsrRange config to force the 4 {Workload,Overhead}Action*Unroll methods to be immediately optimized:

--envVars COMPlus_JitEnableOsrRange:0-44ebd074,44ebd076-6f337b34,6f337b36-8e2f8af3,8e2f8af5-f2e92773,f2e92775-ffffffff

Not entirely sure how to do this in BDN itself. I'll open an issue/PR.

| Method                              | length | Mean     | Error    | StdDev   | Median   | Min      | Max      | Gen 0   | Gen 1   | Gen 2   | Allocated |
|-------------------------------------|--------|----------|----------|----------|----------|----------|----------|---------|---------|---------|-----------|
| ToString_SingleSegment              | 100000 | 165.5 us | 103.3 us | 119.0 us | 107.1 us | 59.52 us | 340.6 us | 43.6275 | 43.6275 | 43.6275 | 195.35 KB |
| ToString_SingleSegment (w/ envVars) | 100000 | 13.87 us | 0.187 us | 0.175 us | 13.86 us | 13.66 us | 14.28 us | 62.4491 | 62.4491 | 62.4491 | 195.36 KB |
