Possible strategy changes needed to enable OSR #2214
Comments
FWIW we already do this in one spot:
I am also seeing some cases where more warmups impact perf, e.g. Benchstone.BenchF.InProd.Test:
FWIW my current thinking is that we'll have to enable OSR/QJFL and let things bake for a week or so to sort out the truly impacted tests, then start to address them as we see fit. There may be some tough calls for benchmarks that do a lot of work per iteration (benchmarks games, say) that can't easily be broken down into smaller units of work. ETA for enabling all this is sometime before Preview 3.
Some more notes on this. I have been looking at With current defaults, OSR perf shows a substantial regression on the 1000 x case:
Drilling in, there seem to be a few interesting things going on. First, BDN chooses very different iteration strategies for the two cases, because it sees that the initial iterations with OSR are "slow" and so plans fewer of them overall.
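As a rough illustration of why slow early iterations lead to fewer planned invocations, here is a toy model in Python. It is not BenchmarkDotNet's actual pilot algorithm: the function name `plan_invocations`, the 0.5 second iteration target, and the per-invocation timings are all assumptions chosen for illustration.

```python
def plan_invocations(per_invocation_seconds, target_iteration_seconds=0.5):
    """Toy pilot stage: choose how many invocations fit in one iteration.

    Assumption: the harness aims for iterations of roughly
    `target_iteration_seconds` and sizes them from one timing estimate.
    """
    return max(1, round(target_iteration_seconds / per_invocation_seconds))


# Hypothetical timings: the pilot sees fully optimized code in one run and
# slower, not yet optimized code in the other.
fast_plan = plan_invocations(0.001)
slow_plan = plan_invocations(0.004)

# The run whose early invocations look slow is given fewer invocations per
# iteration, so the two runs end up measuring different amounts of work.
print(fast_plan, slow_plan)
```

In this sketch a 4x slower pilot measurement yields a 4x smaller invocation count, which is the kind of mismatch that forcing both runs to the same invocation count removes.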
If I force the two runs to use the same number of invocations per iteration, performance equalizes (
Similarly, with
When I drill in further, the Tier1 version of It also turns out the default OSR policies don't work well here, we need to see at least 1000 iterations in any given call to transition to OSR and
But the general idea for OSR is that it is an insurance policy for methods that iterate a lot but don't get called enough to tier up. Here the method is getting called enough to tier up. So, I'm somewhat reluctant to hack the OSR policy here. I wonder if we either need to consider backpatching into delegates, or else having the benchmark harness reallocate the delegate every so often to ensure it's getting the latest codegen? cc @kouvel re backpatching of delegates. (I'm probably wrong about this, see below)
Hmm, looks like I am off base. Tier1 first CPU sample is at 1305 in my trace, just after method load at 1304. The second workload pilot stretches from 872 to 1553. So we are invoking Tier1 code. Given that you'd expect the second workload pilot to show slightly improved results over the first, and
I think what is really going on here is that the benchmark allocates a fair amount, so spending any reasonable amount of time allocating from Tier0 leads to very high survival rates, and this somehow biases the GC so that it ends up doing a lot more work.
Here are the Gen 2 GC and Pilot stage events for a default run:
Note that no Gen2 GCs happen during the pilot intervals; they are all before or after. But with OSR enabled (and no other special settings) there are many Gen2 GCs during the pilot intervals (so many that you don't see the end of the second interval here).
Looking at the total process GC behavior, default is and with OSR we see (note it is running fewer invocations overall, so allocating less)
So now the question is: what is it about OSR that leads to this behavior? One factor is that Tier0 versions of methods will generally have untracked GC lifetimes, and so can root a lot of objects. It seems likely to be the case that this happens for
I think I have narrowed one class of issues down to the various autogenerated methods not being optimized. In particular When OSR is enabled we see some alarming regressions. Here's one This repros with just
and profiling shows almost all time is spent in GC or in the kernel allocating more pages. A simple mitigation would be to mark all these internal delegate-invoking methods as This is potentially blocking enabling OSR, so it would be nice to resolve. cc @adamsitnik
I can tell the jit to forcibly optimize those methods, and that confirms that jitting the Action methods at Tier0 is the cause of the apparent regressions. For reference, I used enable osr range to force the 4 {Workload,Overhead}Action*Unroll methods to immediately optimize:
Not entirely sure how to do this in BDN itself. I'll open an issue/PR.
Resolved via dotnet/BenchmarkDotNet#1935 and dotnet/BenchmarkDotNet#1949.
We are working towards enabling OSR (On Stack Replacement) by default for .NET 7 for x64 and arm64. As part of this we will also modify the runtime so that quick jit for loops is enabled.
See for instance dotnet/runtime#63642.
This has performance implications for benchmarks that don't run enough iterations to reach Tier1. These are typically benchmarks that internally loop and so are currently eagerly optimized because quick jit for loops is disabled. A private benchmark run shows several hundred benchmarks impacted by this, with regressions outnumbering improvements by about 2 to 1.
[Upon further analysis the number of truly impacted benchmarks may be smaller, maybe ~100. It is hard to gauge from one-off runs as many benchmarks are noisy. But we can look at perf history in main and see that some of the "regressions" seen from the one-off OSR run are in noisy tests and the values are within the expected noise range.]
One such example is Burgers.Test3. With the current strategy we end up running about 20 invocations total, and the main method is fully optimized from the start. When we turn on QuickJitForLoops and OSR, the main method is initially not optimized. OSR accelerates its performance, but OSR performance does not reach the same level as Tier1, and we don't run enough invocations to make it to Tier1.
While in this case the OSR version is slower, sometimes the OSR version runs faster. In general, we aspire to have the OSR perf be competitive with Tier1, but swings of +/- 20% are going to be common and cannot easily be addressed.
One way we can mitigate these effects is to always run (or selectively run, for some subset of benchmarks) at least 30 warmup iterations. For example:
default
default + --warmupCount 30
It is expected that if we can do this (or something equivalent) then OSR will not impact perf measurements.
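The reasoning behind the warmup suggestion can be sketched with a toy model in Python. This is a simplification and not the runtime's actual tiering logic. It assumes one invocation per iteration, a promotion to Tier1 after 30 calls (mirroring the 30 warmups suggested above), an OSR transition once a single call runs at least 1000 loop iterations (the figure mentioned elsewhere in this thread), and no background compilation delay. The function names and the 6 + 14 split of the roughly 20 invocations are hypothetical.

```python
def code_version(call_number, loop_iters, qjfl, osr,
                 tier1_calls=30, osr_iters=1000):
    """Code version that does most of the work in a given call (1-based)."""
    if not qjfl:
        return "optimized"   # methods with loops are optimized right away
    if call_number > tier1_calls:
        return "tier1"       # called often enough to have tiered up
    if osr and loop_iters >= osr_iters:
        return "osr"         # the loop escapes Tier0 within this call
    return "tier0"


def measured_versions(warmups, measured, loop_iters, qjfl, osr):
    """Versions seen by the measured calls that follow `warmups` calls."""
    first = warmups + 1
    return [code_version(n, loop_iters, qjfl, osr)
            for n in range(first, first + measured)]


# Hypothetical benchmark: 5000 loop iterations per call, 14 measured calls.
today = measured_versions(6, 14, 5000, qjfl=False, osr=False)
with_osr = measured_versions(6, 14, 5000, qjfl=True, osr=True)
with_warmup = measured_versions(30, 14, 5000, qjfl=True, osr=True)

print(set(today), set(with_osr), set(with_warmup))
```

Under these assumptions the measured calls run eagerly optimized code today, OSR code once QuickJitForLoops and OSR are enabled, and Tier1 code once 30 warmups precede the measurements.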