Possible strategy changes needed to enable OSR #2214

Closed
AndyAyersMS opened this issue Jan 15, 2022 · 8 comments

Comments


AndyAyersMS commented Jan 15, 2022

We are working towards enabling OSR (On Stack Replacement) by default for .NET 7 for x64 and arm64. As part of this we will also modify the runtime so that quick jit for loops is enabled.

See for instance dotnet/runtime#63642.

This has performance implications for benchmarks that don't run enough iterations to reach Tier1. These are typically benchmarks that internally loop and so are currently eagerly optimized because quick jit for loops is disabled. A private benchmark run shows several hundred benchmarks impacted by this, with regressions outnumbering improvements by about 2 to 1.

[Upon further analysis the number of truly impacted benchmarks may be smaller, maybe ~100. It is hard to gauge from one-off runs as many benchmarks are noisy. But we can look at perf history in main and see that some of the "regressions" seen from the one-off OSR run are in noisy tests and the values are within the expected noise range.]

One such example is Burgers.Test3. With the current strategy we end up running about 20 invocations total, and the main method is fully optimized from the start. When we turn on QuickJitForLoops and OSR, the main method is initially unoptimized. OSR accelerates its performance, but OSR performance does not reach the same level as Tier1, and we don't run enough invocations to make it to Tier1.

While in this case the OSR version is slower, sometimes the OSR version runs faster. In general, we aspire to have the OSR perf be competitive with Tier1, but swings of +/- 20% are going to be common and cannot easily be addressed.

One way we can mitigate these effects is to always run (or selectively run, for some subset of benchmarks) at least 30 warmup iterations. For example:

default

| Method    | Job        | Ver | Mean      |
|-----------|------------|-----|-----------|
| Burgers_0 | Job-SYCWNE | OSR | 192.68 ms |
| Burgers_0 | Job-ZXAOBL | DEF | 184.51 ms |
| Burgers_1 | Job-SYCWNE | OSR | 224.11 ms |
| Burgers_1 | Job-ZXAOBL | DEF | 155.64 ms |
| Burgers_2 | Job-SYCWNE | OSR | 178.63 ms |
| Burgers_2 | Job-ZXAOBL | DEF | 156.51 ms |
| Burgers_3 | Job-SYCWNE | OSR | 181.05 ms |
| Burgers_3 | Job-ZXAOBL | DEF | 85.63 ms  |

default + --warmupCount 30

| Method    | Job        | Ver | Mean      |
|-----------|------------|-----|-----------|
| Burgers_0 | Job-SOPVCH | OSR | 186.75 ms |
| Burgers_0 | Job-TMAUKP | DEF | 185.61 ms |
| Burgers_1 | Job-SOPVCH | OSR | 155.64 ms |
| Burgers_1 | Job-TMAUKP | DEF | 157.39 ms |
| Burgers_2 | Job-SOPVCH | OSR | 157.37 ms |
| Burgers_2 | Job-TMAUKP | DEF | 160.09 ms |
| Burgers_3 | Job-SOPVCH | OSR | 89.70 ms  |
| Burgers_3 | Job-TMAUKP | DEF | 85.96 ms  |

It is expected that if we can do this (or something equivalent) then OSR will not impact perf measurements.
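
For the "selectively run" flavor of this mitigation, one possibility is a small shared config applied only to the affected benchmarks. A minimal sketch, assuming BenchmarkDotNet's `ManualConfig`, `Job.Default.WithWarmupCount`, and `[Config]` APIs (the config class is made up and the benchmark body is elided):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

// Shared config: pin the warmup count to 30 so looping benchmarks reach Tier1
// before measured iterations begin, regardless of how slow the early
// (Tier0/OSR) iterations look to the pilot stage.
public class MinWarmupConfig : ManualConfig
{
    public MinWarmupConfig()
    {
        AddJob(Job.Default.WithWarmupCount(30));
    }
}

// Applied only to the benchmarks known to be affected by OSR/QuickJitForLoops.
[Config(typeof(MinWarmupConfig))]
public class Burgers
{
    [Benchmark]
    public double Test3()
    {
        // looping workload elided
        return 0;
    }
}
```

(The global equivalent is the `--warmupCount 30` command line shown above.)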

AndyAyersMS commented:

FWIW we already do this in one spot:

[WarmupCount(30)] // make sure it's promoted to Tier 1
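
For context, a minimal sketch of how that attribute sits on a benchmark class (assuming BenchmarkDotNet's `[WarmupCount]` attribute; the class and method below are illustrative, not the actual dotnet/performance benchmark):

```csharp
using BenchmarkDotNet.Attributes;

[WarmupCount(30)] // make sure the workload is promoted to Tier1 before measuring
public class SomeLoopingBenchmark
{
    [Benchmark]
    public long SumLoop()
    {
        long total = 0;
        for (int i = 0; i < 1_000_000; i++)
            total += i;
        return total;
    }
}
```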

AndyAyersMS commented:

I am also seeing some cases where more warmups impact perf, e.g. Benchstone.BenchF.InProd.Test:

| Method | Config | Mean    |
|--------|--------|---------|
| InProd | OSR    | 1.277 s |
| InProd | DEF    | 1.133 s |
| InProd | OSR 30 | 1.092 s |
| InProd | DEF 30 | 1.110 s |


AndyAyersMS commented Feb 17, 2022

FWIW my current thinking is that we'll have to enable OSR/QJFL and let things bake for a week or so to sort out the truly impacted tests, then start to address them as we see fit. There may be some tough calls for benchmarks that do a lot of work per iteration (the Benchmarks Game suite, say) and can't easily be broken down into smaller units of work.

ETA for enabling all this is sometime before Preview 3.


AndyAyersMS commented Feb 26, 2022

Some more notes on this. I have been looking at System.Text.Tests.Perf_StringBuilder.Append_Strings.

With current defaults, OSR perf shows a substantial regression on the repeat = 1000 case:

| Method         | Job        | Toolchain         | repeat | Mean         |
|----------------|------------|-------------------|--------|--------------|
| Append_Strings | Job-TDJEZD | \main\corerun.exe | 1      | 154.3 ns     |
| Append_Strings | Job-DGZWOQ | \osr\corerun.exe  | 1      | 157.9 ns     |
| Append_Strings | Job-TDJEZD | \main\corerun.exe | 1000   | 68,416.7 ns  |
| Append_Strings | Job-DGZWOQ | \osr\corerun.exe  | 1000   | 272,356.5 ns |

Drilling in, there seem to be a few interesting things going on. First, BDN chooses very different iteration strategies for the two cases: because it sees that the initial iterations with OSR are "slow", it plans fewer of them overall.

;; main

OverheadJitting  1: 1 op, 1560000.00 ns, 1.5600 ms/op
WorkloadJitting  1: 1 op, 2171600.00 ns, 2.1716 ms/op

OverheadJitting  2: 16 op, 2081000.00 ns, 130.0625 us/op
WorkloadJitting  2: 16 op, 4974600.00 ns, 310.9125 us/op

WorkloadPilot    1: 16 op, 1345600.00 ns, 84.1000 us/op
WorkloadPilot    2: 2976 op, 212611300.00 ns, 71.4420 us/op
WorkloadPilot    3: 3504 op, 234737500.00 ns, 66.9913 us/op
WorkloadPilot    4: 3744 op, 254916000.00 ns, 68.0865 us/op
WorkloadPilot    5: 3680 op, 238987500.00 ns, 64.9423 us/op
WorkloadPilot    6: 3856 op, 251380500.00 ns, 65.1920 us/op
WorkloadPilot    7: 3840 op, 259264700.00 ns, 67.5168 us/op

;; osr

OverheadJitting  1: 1 op, 677800.00 ns, 677.8000 us/op
WorkloadJitting  1: 1 op, 889400.00 ns, 889.4000 us/op

OverheadJitting  2: 16 op, 1271900.00 ns, 79.4938 us/op
WorkloadJitting  2: 16 op, 6227700.00 ns, 389.2312 us/op

WorkloadPilot    1: 16 op, 1655100.00 ns, 103.4437 us/op
WorkloadPilot    2: 2432 op, 672568200.00 ns, 276.5494 us/op
WorkloadPilot    3: 912 op, 239154500.00 ns, 262.2308 us/op
WorkloadPilot    4: 960 op, 258201200.00 ns, 268.9596 us/op
WorkloadPilot    5: 944 op, 254253800.00 ns, 269.3367 us/op

If I force the two runs to use the same number of invocations per iteration (--unrollFactor 1000), performance equalizes, though the values don't match what we are measuring now:

| Method         | Job        | Toolchain         | repeat | Mean     |
|----------------|------------|-------------------|--------|----------|
| Append_Strings | Job-GIKFDD | \main\corerun.exe | 1000   | 394.5 us |
| Append_Strings | Job-ZAVQWS | \osr\corerun.exe  | 1000   | 407.9 us |
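
In source, the same per-iteration pinning could be expressed from a config rather than the command line. A rough sketch, assuming BenchmarkDotNet's `WithUnrollFactor` job API (the config class name is made up):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class FixedUnrollConfig : ManualConfig
{
    public FixedUnrollConfig()
    {
        // Use an unroll factor of 1000 (mirroring --unrollFactor 1000 above),
        // which also forces the per-iteration invocation count to be a
        // multiple of 1000, so the pilot stage plans the same iteration shape
        // for both the main and OSR toolchains.
        AddJob(Job.Default.WithUnrollFactor(1000));
    }
}
```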

Similarly, with --warmupCount 30 we do enough warmups to get everything to Tier1 and there's substantially less regression:

| Method         | Job        | Toolchain         | repeat | Mean     |
|----------------|------------|-------------------|--------|----------|
| Append_Strings | Job-GSKGPF | \main\corerun.exe | 1000   | 69.78 us |
| Append_Strings | Job-XXUJZP | \osr\corerun.exe  | 1000   | 76.81 us |

When I drill in further, the Tier1 version of Append_Strings is actually created pretty early on even in the OSR case, because the second WorkloadPilot run invokes it enough times to tier up. But we don't seem to invoke the Tier1 method during any of the WorkloadPilot runs, perhaps because we're using a cached delegate and those don't get backpatched?

It also turns out the default OSR policies don't work well here: we need to see at least 1000 iterations in any given call to transition to OSR, and Append_Strings iterates exactly 1000 times. I can hack the OSR policies to force much quicker OSR creation and transition, and this also fixes the perf discrepancy:

COMPlus_TC_OnStackReplacement_InitialCounter=1,COMPlus_OSR_CounterBump=1

| Method         | Job        | Toolchain         | repeat | Mean     |
|----------------|------------|-------------------|--------|----------|
| Append_Strings | Job-PJJGVK | \main\corerun.exe | 1000   | 66.31 us |
| Append_Strings | Job-DVRJPD | \osr\corerun.exe  | 1000   | 73.05 us |
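
For completeness, the same OSR-policy tweak could also be applied from a BDN job rather than process-wide environment variables. A sketch, assuming BenchmarkDotNet's `EnvironmentVariable` / `WithEnvironmentVariables` APIs (the config class name is made up, and this is the policy hack, not a recommended setting):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

public class QuickOsrTransitionConfig : ManualConfig
{
    public QuickOsrTransitionConfig()
    {
        // Force OSR methods to be created and transitioned to almost immediately.
        AddJob(Job.Default.WithEnvironmentVariables(
            new EnvironmentVariable("COMPlus_TC_OnStackReplacement_InitialCounter", "1"),
            new EnvironmentVariable("COMPlus_OSR_CounterBump", "1")));
    }
}
```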

But the general idea for OSR is that it is an insurance policy for methods that iterate a lot but don't get called enough to tier up. Here the method is getting called enough to tier up. So, I'm somewhat reluctant to hack the OSR policy here.

I wonder if we either need to consider backpatching into delegates, or else having the benchmark harness reallocate the delegate every so often to ensure it's getting the latest codegen?

cc @kouvel re backpatching of delegates. (I'm probably wrong about this, see below)


AndyAyersMS commented Feb 27, 2022

Hmm, looks like I am off base. The first Tier1 CPU sample is at 1305 in my trace, just after the method load at 1304. The second workload pilot stretches from 872 to 1553. So we are invoking Tier1 code.

Given that, you'd expect the second workload pilot to show slightly improved results over the first, and the third workload pilot (which runs from 1554 to 1800 or so) to have only fast iterations. But that's not the case:

pilot 1: 5.729 ms / 16 ops   ≈ 358 us/op
pilot 2: 674 ms   / 2352 ops ≈ 286 us/op
pilot 3: 253 ms   / 880 ops  ≈ 287 us/op

I think what is really going on here is that the benchmark allocates a fair amount, and so spending any reasonable amount of time allocating from Tier0 leads to very high survival rates, and this somehow biases the GC so it ends up doing a lot more work.

Here are the Gen 2 GC and Pilot stage events for a default run:

[image: Gen2 GC and pilot stage events, default run]

Note that no Gen2 GCs happen during the pilot intervals; they all occur before or after.

But with OSR enabled (and no other special settings) there are many Gen2 GCs during the pilot intervals (so many that you can't see the end of the second interval here).

[image: Gen2 GC and pilot stage events, OSR-enabled run]

Looking at the total process GC behavior, the default run is:

[image: total process GC behavior, default run]

and with OSR we see the following (note it is running fewer invocations overall, so allocating less):

[image: total process GC behavior, OSR-enabled run]

So now the question is: what is it about OSR that leads to this behavior? One factor is that Tier0 versions of methods generally have untracked GC lifetimes, and so can root a lot of objects. It seems likely that this is happening for Append_Strings.
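
An illustrative (non-BDN) sketch of the untracked-lifetime effect; the method and helpers below are made up purely to show the idea:

```csharp
using System;

static class Tier0LifetimeDemo
{
    // In fully optimized code the JIT tracks 'buffer' and stops reporting it
    // to the GC after its last use, so it can be collected while OtherWork
    // runs. In a Tier0 (unoptimized) compilation, locals are untracked and
    // reported live for the whole method, so the buffer stays rooted until
    // the method returns, raising survival rates if a GC happens meanwhile.
    static void Example()
    {
        byte[] buffer = new byte[64 * 1024 * 1024];
        Consume(buffer);   // last use of 'buffer'
        OtherWork();       // at Tier0, 'buffer' is still rooted here
    }

    static void Consume(byte[] data) => Console.WriteLine(data.Length);

    static void OtherWork()
    {
        for (int i = 0; i < 1000; i++)
            _ = new string('x', 1000); // allocation that may trigger GCs
    }
}
```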

AndyAyersMS commented:

I think I have narrowed one class of issues down to the various autogenerated methods not being optimized: in particular WorkloadActionUnroll, but likely others as well.

When OSR is enabled we see some alarming regressions. Here's one:

[image: perf history chart for System.Text.Tests.Perf_StringBuilder.ToString_SingleSegment(length: 100000)]

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_Windows%2010.0.18362_PGOType=nopgo/System.Text.Tests.Perf_StringBuilder.ToString_SingleSegment(length:%20100000).html

This repros with just --envVars COMPlus_TC_QuickJitForLoops:1, and the issue seems to be that the manually unrolled loop holds on to large amounts of data, drastically changing GC behavior. You can see that at some point things just blow up:

// AfterActualRun
WorkloadResult   1: 4112 op, 253990800.00 ns, 61.7682 us/op
WorkloadResult   2: 4112 op, 258923300.00 ns, 62.9677 us/op
WorkloadResult   3: 4112 op, 259021100.00 ns, 62.9915 us/op
WorkloadResult   4: 4112 op, 257990900.00 ns, 62.7410 us/op
WorkloadResult   5: 4112 op, 255131000.00 ns, 62.0455 us/op
WorkloadResult   6: 4112 op, 252709800.00 ns, 61.4567 us/op
WorkloadResult   7: 4112 op, 252181300.00 ns, 61.3281 us/op
WorkloadResult   8: 4112 op, 251426700.00 ns, 61.1446 us/op
WorkloadResult   9: 4112 op, 249935700.00 ns, 60.7820 us/op
WorkloadResult  10: 4112 op, 249252300.00 ns, 60.6158 us/op
WorkloadResult  11: 4112 op, 1311020700.00 ns, 318.8280 us/op
WorkloadResult  12: 4112 op, 492687900.00 ns, 119.8171 us/op
WorkloadResult  13: 4112 op, 132994300.00 ns, 32.3430 us/op
WorkloadResult  14: 4112 op, 1181440100.00 ns, 287.3152 us/op
WorkloadResult  15: 4112 op, 1349824000.00 ns, 328.2646 us/op
WorkloadResult  16: 4112 op, 1344150900.00 ns, 326.8849 us/op
WorkloadResult  17: 4112 op, 393625800.00 ns, 95.7261 us/op
WorkloadResult  18: 4112 op, 1339774900.00 ns, 325.8207 us/op
WorkloadResult  19: 4112 op, 1375974900.00 ns, 334.6242 us/op
WorkloadResult  20: 4112 op, 1363235900.00 ns, 331.5262 us/op

and profiling shows almost all time is spent in GC or in the kernel allocating more pages.

A simple mitigation would be to mark all these internal delegate-invoking methods as `[MethodImpl(MethodImplOptions.AggressiveOptimization)]`, which would preserve the current behavior where these methods are eagerly optimized.
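
A rough sketch of the idea; this is not BenchmarkDotNet's actual generated code, and the member names below are placeholders:

```csharp
using System;
using System.Runtime.CompilerServices;

public class GeneratedRunnable
{
    private Action workloadDelegate = () => { };

    // Marking the generated invocation loop with AggressiveOptimization keeps
    // it out of Tier0 even when TC_QuickJitForLoops/OSR are enabled,
    // preserving today's behavior where this method is eagerly optimized.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public void WorkloadActionUnroll(long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
        {
            // The real generated method is manually unrolled (16x by default).
            workloadDelegate();
        }
    }
}
```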

This is potentially blocking enabling OSR, so it would be nice to resolve.

cc @adamsitnik

AndyAyersMS commented:

I can tell the jit to forcibly optimize those methods, and that confirms that jitting the Action methods at Tier0 is the cause of the apparent regressions.

For reference, I used the JitEnableOsrRange config to force the 4 {Workload,Overhead}Action*Unroll methods to be immediately optimized:

--envVars COMPlus_JitEnableOsrRange:0-44ebd074,44ebd076-6f337b34,6f337b36-8e2f8af3,8e2f8af5-f2e92773,f2e92775-ffffffff

Not entirely sure how to do this in BDN itself. I'll open an issue/PR.

| Method                              | length | Mean     | Error    | StdDev   | Median   | Min      | Max      | Gen 0   | Gen 1   | Gen 2   | Allocated |
|-------------------------------------|--------|----------|----------|----------|----------|----------|----------|---------|---------|---------|-----------|
| ToString_SingleSegment              | 100000 | 165.5 us | 103.3 us | 119.0 us | 107.1 us | 59.52 us | 340.6 us | 43.6275 | 43.6275 | 43.6275 | 195.35 KB |
| ToString_SingleSegment (w/ envVars) | 100000 | 13.87 us | 0.187 us | 0.175 us | 13.86 us | 13.66 us | 14.28 us | 62.4491 | 62.4491 | 62.4491 | 195.36 KB |
