Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JIT] Improve inliner: new heuristics, rely on PGO data #52708

Merged
merged 67 commits into from
Jul 1, 2021

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented May 13, 2021

Closes #33338 optimize comparisons for const strings
Closes #7651 Generic Type Check Inlining
Closes #10303 have inlining heuristics look for cases where inlining might enable devirtualization
Closes #47434 Weird inlining of methods optimized by JIT
Closes #41923 Don't inline calls inside rarely used basic-blocks
Closes #33349 Utf8Formatter.TryFormat causes massive code bloat when used with a constant StandardFormat

This PR was split, the first part is #53670

Inlining is one of the most important compiler optimizations - it significantly widens opportunities for other optimizations such as Constant Folding, CSE, Devirtualization, etc. Unfortunately, there is no Silver Bullet to do it right and it's essentially a sort of an NP complete Knapsack problem, especially in JIT environments where we don't see the whole graph of all calls, new types can be loaded/added dynamically, and we're limited in time, e.g. we can't just try to import all the possible callees into JIT's IR nodes, apply all the existing optimizations we have and decide whether it was profitable or not - we only have time to quickly inspect raw IL and note some useful patterns (furthermore, we ask R2R to bake "noinline" into non-profitable methods forever).

It's important to note possible regressions due to bad Inlining decisions:

  • Significant codegen size increase - it can be noticeable for the AOT'd (R2R) code. There are cases where inlining actually reduces size of callers, but mostly it's a trade-off between size and performance.
  • Always regresses JIT's throughput (we analyze/inline more and it's not cheap). Visible effect - it takes longer to warmup/start an app/process first request.
  • CPUs have limited sets of registers, so if we inline a callee with a lot of temps inside - we might cause serious problems for the Register Allocator (that, just like the inliner, also uses heuristics 🙂).
  • Bigger functions/loop bodies may not fit into CPU caches.
  • Jit has a hard-coded limit of locals it can track (lclMAX_TRACKED=512) - a good example is this sharplab.io snippet.

Current approach in RyuJIT

In RyuJIT we have several strategies (or policies) such as DefaultPolicy, ModelPolicy (based on some ML model to estimate impact), SizePolicy (prefers size over perf) and ProfilePolicy (heavily relies on PGO data). But only DefaultPolicy is used in Production. Its algorithm can be explained in 5 steps:

  1. Make some quick pre-checks, e.g.:
    • Callee is less than 16 bytes of IL code or has AggressiveInlining attribute - always inline
    • Callee is larger than 100 bytes of IL code or has NoInlinine - always ignore
    • Callee has EH, stackalloc (and callsite is in a loop), etc - always ignore
    • Check that we don't exceed depth of calls, amount of basic-block and time budget.
  2. Roughly estimate codegen size of the callsite (NOTE: we don't take current caller's size into account directly)
  3. Roughly estimate codegen size of the inline candidate by using a state-machine that accepts raw IL code and applies some weight for all IL opcodes.
  4. Inspect raw IL and try to find some useful observations (heuristics) and compose a so called "Benefit multiplier"
  5. Make a decision using the following formula: bool toInline = callsiteNativeSize * BenefitMultiplier > calleeNativeSize
    An example with logs:
    image
    (BTW, in this PR this method is inlined)

Things this PR tries to improve

  1. Add more observations (it's done in Inliner: new observations (don't impact inlining for now) #53670) on top of raw IL, such as
    • Inline candidate has some abstract class in its signature for one of the arguments and we pass something known at the callsite. If we inline it we'll be able to devirtualize all calls inside the callee even without the GDV (Guarded Devirtualization).
    • Inline candidate is generic when caller is not
    • Inline candidate has foldable unary/binary expressions and even branches when we pass a constant argument(s).
    • Inline candidate has foldable BOX/UNBOX operations (generics emit a lot of them).
    • Inline candidate has an expensive "X / Arg" expression and we pass a constant at the callsite for the Arg - if we inline it - we'll avoid expensive idiv.
    • It now is able to recognize more potentially foldable branches, even patterns like typeof(T1) == typeof(T2) or argStr.Length == 10 when argStr is a string literal at the callsite. (NOTE: only during prejitting or when a method was promoted to tier1 naturally. Won't work for methods with loops or with TieredCompilation=0).
    • Inline candidates pass/returns a large struct by value - we might avoid redundant copies if we inline.
    • there are few more
      A small example that covers some of the new observations inliner makes:
      image

Thanks to @tannergooding's work we can now resolve some call tokens and recognize intrinsics.

  1. Refactor fgFindJumpTarget and PushedStack: We pushed opcodes to that two-slots stack in order to recognize some patterns in the next (binary/unary) opcodes - unfortunately we ignored a lot of opcodes and it led to invalid state in that stack => invalid observations. Such invalid observations were the main source of regressions in my PR because some of them led to a better performance for some reason 😐.

  2. Find more optimal multipliers for the observations (using micro/TE benchmarks, deep performance analysis of traces/regressions, example: [JIT] Improve inliner: new heuristics, rely on PGO data #52708 (comment)).

  3. Rely on PGO data - the most important part. We try harder to inline what was recognized as hot. Unfortunately, there are cases where a profile can be misleading/polluted and mark hot or semi-hot blocks as cold:

    • We don't support context-sensitive instrumentation in Dynamic PGO mode (see example in PGO: Instrument cold blocks with profile validators #53840)
    • The Static PGO that we ship can be irrelevant for some apps.
    • We don't support deoptimizations so we can't re-collect profiles if something changes in the workflow over time.
    • Sometimes it's still makes sense to inline methods in cold blocks to improve type/escape analysis for the whole caller. Example: sharplab.io
  4. Allow JIT to inline bigger functions if really needed: more than 5 basic-blocks and more than 100 bytes of IL.

Metodology

For now, I mostly run various benchmarks (TE, dotnet/performance, PowerShell, Kestrel, SignalR), analyze collected samples (flamegraphs) produced by profilers/dotnet trace/BDN itself in hot spots where I inspect all calls in the chains and if I see a method that looks like it should be inlined - I try to find what kind of observations (and corresponding benefit multipliers) can make it happen, example:
image

There are some thoughts on how to automatize it using ML/Genetic Algorithms.

Benchmarks

I use PerfLab machines for TE/SignalR/YARP benchmarks. Besides that, I have locally:

  • Win - Windows 10 x64, Core i7 8700K (4 cores, 3.7Ghz, Coffee Lake, AVX2)
  • Linux - Ubuntu 20.04 x64, Core i7 4930K (6 cores, 3.4Ghz, Ivy Bridge E, AVX1)
  • Arm64 - macOS 11.5 arm64 (Mac Mini M1, arm64-v8.3(or higher?))

Also, I use different JIT modes:

  • Default - Default arguments.
  • NoPGO - same as Default, but without static PGO collected from various benchmarks/workloads. I use this mode to estimate overall impact from PGO when compared against FullPGO (in .NET 5.0 we didn't have any static profile so it kind of simulates net5.0).
  • DynPGO:
    • DOTNET_TieredPGO=1 - collect profile in tier0.
    • DOTNET_TC_QuickJitForLoops=1 - so we also instrument methods with loops on tier0. By default they bypass tier0 to avoid "cold loops with hot bodies" problem (OSR is not enabled by default yet).
  • FullPGO: same as DynPGO but with:
    • DOTNET_ReadyToRun=0 - ignore all AOT code and re-collect actual profile (with better class probes).

Results

It seems like this PR unlocks PGO and GDV potential in FullPGO mode. Many high-load benchmarks show +10-20% improvements. I tried to avoid startup/time-to-first-request regressions in the Default mode so the RPS numbers are not as great as they could be with a more aggressive inliner. In order to avoid size regressions in R2R, we use the "old" inliner - it leads to zero improvements when we benchmark R2R'd code in DOTNET_TieredCompilation=0 mode (e.g. dotnet/performance microbenchmarks). Also, prejitter bakes "noinline" into thousands of method using the old algorithm and kind of limits possibilities for the new one during prejitting.

The latest run of TE benchmarks on aspnet-citrine-lin:
Linux-lessagr-1
^ some P90-P99 numbers regressed because, I believe, they include the warmup stage, so when I try to run a specific benchmark and ask it to run longer than 30 seconds - P99 numbers are greatly improved)
Also, don't expect large numbers from Platform-* benchmarks, those are heavily optimized by hands 😐.
UPD: the latest numbers: #52708 (comment)

NoPGO vs FullPGO (mainly to estimate impact of PGO and show room for improvements for the Default mode, NOTE: NoPGO numbers should be slightly better than the current results for .NET 5.0 in the Default mode):
nopgo-vs-fullpgo UPD new numbers: #52708 (comment)

JIT ET events show that the improved inliner usually inlines 10-20% more functions, e.g.:
Plaintext-MVC (Default mode):
image

Orchard CMS (Default mode):
image

YARP Proxy (Default mode):
image

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 13, 2021
@EgorBo
Copy link
Member Author

EgorBo commented May 17, 2021

@tannergooding Thanks for the feedback! Yeah it's early stage WIP but still open for any feedback :)

@EgorBo
Copy link
Member Author

EgorBo commented May 27, 2021

PowerShell benchmarks:

EgorPR vs Main (Default parameters)

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "'String'.GetType()")          |      1.14 |          1311.98 |          1149.36 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "[System.IO.Path]::HasExtensio |      1.14 |          1327.41 |          1163.76 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "$fs=New-Object -ComObject scr |      1.04 |         51051.17 |         49205.17 |         |
| Engine.Parsing.UsingStatement                                                    |      1.01 |       9962178.13 |       9827435.94 |         |

EgorPR vs Main (TieredPGO=1)

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "[System.IO.Path]::HasExtensio |      1.12 |          1350.38 |          1208.74 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "'String'.GetType()")          |      1.08 |          1296.26 |          1200.47 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "$fs=New-Object -ComObject scr |      1.03 |         52221.30 |         50559.75 |         |
| Engine.Parsing.UsingStatement                                                    |      1.03 |      10018784.38 |       9765815.63 |         |

EgorPR vs Main (TieredPGO=1 TC_QJFL=1 OSR=1)

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "[System.IO.Path]::HasExtensio |      1.18 |          1343.95 |          1143.72 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "'String'.GetType()")          |      1.16 |          1362.58 |          1172.03 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "$fs=New-Object -ComObject scr |      1.06 |         53039.17 |         49984.76 |         |
| Engine.Parsing.UsingStatement                                                    |      1.01 |       9938895.31 |       9804198.44 |         |

EgorPR vs Main (R2R=0 TieredPGO=1 TC_QJFL=1 OSR=1)

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "'String'.GetType()")          |      1.19 |          1255.76 |          1057.23 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "[System.IO.Path]::HasExtensio |      1.16 |          1257.16 |          1082.81 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "$fs=New-Object -ComObject scr |      1.08 |         50584.39 |         46657.50 |         |
| Engine.Parsing.UsingStatement                                                    |      1.03 |       9933034.38 |       9646275.00 |         |

EgorPR vs Main (R2R=0 TieredPGO=1 TC_QJFL=1 OSR=1)

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "'String'.GetType()")          |      1.19 |          1255.76 |          1057.23 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "[System.IO.Path]::HasExtensio |      1.16 |          1257.16 |          1082.81 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "$fs=New-Object -ComObject scr |      1.08 |         50584.39 |         46657.50 |         |
| Engine.Parsing.UsingStatement                                                    |      1.03 |       9933034.38 |       9646275.00 |         |

EgorPR (TieredPGO=1 TC_QJFL=1 OSR=1) vs Main (Default parameters)

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "[System.IO.Path]::HasExtensio |      1.16 |          1327.41 |          1143.72 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "'String'.GetType()")          |      1.12 |          1311.98 |          1172.03 |         |
| Engine.Parsing.UsingStatement                                                    |      1.02 |       9962178.13 |       9804198.44 |         |

EgorPR (R2R=0 TieredPGO=1 TC_QJFL=1 OSR=1) vs Main (Default parameters)

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "'String'.GetType()")          |      1.24 |          1311.98 |          1057.23 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "[System.IO.Path]::HasExtensio |      1.23 |          1327.41 |          1082.81 |         |
| Engine.Scripting.InvokeMethod(InvokeMethodScript: "$fs=New-Object -ComObject scr |      1.09 |         51051.17 |         46657.50 |         |
| Engine.Parsing.UsingStatement                                                    |      1.03 |       9962178.13 |       9646275.00 |         |

@EgorBo
Copy link
Member Author

EgorBo commented May 28, 2021

JSON deserialization (System.Text.Json) example over a quite complicated file: Perf_Enumerable-report-full.json. Default parameters.

[Benchmark]
public RootObj ParseTextJson() => JsonSerializer.Deserialize<RootObj>(TestFileBytes);
|             Method |        Job |                    Toolchain |       Mean |    Error |   StdDev | Ratio | RatioSD |
|------------------- |----------- |----------------------------- |-----------:|---------:|---------:|------:|--------:|
|      ParseTextJson | Job-VFKUTY | \runtime-main\..\corerun.exe |   900.7 us | 13.95 us | 13.05 us |  1.10 |    0.02 |
|      ParseTextJson | Job-BYFRBX |   \runtime-pr\..\corerun.exe |   817.9 us |  1.18 us |  0.92 us |  1.00 |    0.00 |

@EgorBo EgorBo force-pushed the jit-inliner-improvements branch from 61c7c91 to 98afcea Compare June 10, 2021 10:02
@EgorBo
Copy link
Member Author

EgorBo commented Jun 10, 2021

aspnet Kestrel microbenchmarks: (150 benchmarks) https://gist.github.com/EgorBo/719020f50575c34c146be535d44721ce

--threshold 1% --noise 10ns
summary:
better: 59, geomean: 1.159
worse: 1, geomean: 1.026
total diff: 60

@EgorBo
Copy link
Member Author

EgorBo commented Jun 11, 2021

TechEmpower, aspnet-perf-win machine (PerfLab), randomly-picked benchmarks:

image

Note: Results may be different for the citrine machine (same hw as the official TE), but from a quick test plaintext-mvc has the same +20% in FullPGO there on aspnet-citrin-win). Will check Linux later.

@danmoseley
Copy link
Member

Detailed PR descriptions like this are great for curious observers like me. Kudos!

@sebastienros
Copy link
Member

sebastienros commented Jun 11, 2021

We usually base the acceptance criteria on citrine. I appreciate that you did you investigations on the small machines, but it will be preferable to share the final results from citrine here. Might be even better gains!

Update:
Can this be run on Linux too? Since official TE is Linux only.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 11, 2021

We usually base the acceptance criteria on citrine. I appreciate that you did you investigations on the small machines, but it will be preferable to share the final results from citrine here. Might be even better gains!

Update:
Can this be run on Linux too? Since official TE is Linux only.

Initially I just wanted to tests some configs, didn't plan to publish any results, but the numbers looked too nice 🙂 I'll publish both Windows and Linux citrine once I validate some theories

@EgorBo
Copy link
Member Author

EgorBo commented Jun 12, 2021

TE Benchmarks on aspnet-citrine-lin (Intel, 28 cores, Ubuntu 18.04) - the official TE hardware (except 4x bigger network bandwidth):

image

Each config was ran twice, "Platform *" benchmarks were expected to have smaller diffs since they're heavily optimized by hands and mostly are limitted by OS/bandwidth but still quite some numbers.

Let me know if I should also test some other configurations (e.g. Multiquery, etc).

@EgorBo
Copy link
Member Author

EgorBo commented Jun 26, 2021

Yes, this looks more reasonable that the original size regressions.

With relatively large internal methods fully inlined (like your Array:Sort or MoveNext examples), we should tech crossgen2 to not compile these internal methods separately. Their body will never be used since they are fully inlined.

Good idea!

Unrelatated: I've just built a histogram on amount of locals in callers during inlining (crossgen2 + ExtDefaultPolicy + Release + PGO data):
image

So doesn't look like we inline too much often, but just in case I added a check to decrease the benefit mutliplier if we have to many locals already (say, 1/4 of MaxLocalsToTrack). Currently, MaxLocalsToTrack is 1024, and the biggest amount of locals after inlining is 463.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 26, 2021

The latest run of TE benchmarks on aspnet-perf-linux (citrine was busy):
image

@EgorBo
Copy link
Member Author

EgorBo commented Jun 26, 2021

Citrine-Linux results (the difference is smaller):
image

@EgorBo
Copy link
Member Author

EgorBo commented Jun 28, 2021

@jkotas Any objections to merging this?

@EgorBo
Copy link
Member Author

EgorBo commented Jun 28, 2021

Windows TE aspent-citrine-win results (less stable as expected):
image

@jkotas
Copy link
Member

jkotas commented Jun 28, 2021

No objections. I think that the defaults for R2R can probably can use more tuning, but that can be done separately.

@mangod9 @dotnet/crossgen-contrib The optimization for time (-Ot crossgen2 option - not enabled by default) will have very material impact after this change. We should make sure to turn it on for workloads that use R2R with tiered JIT compilation disabled.

@mangod9
Copy link
Member

mangod9 commented Jun 28, 2021

Thanks, assume these the gains would be most pronounced with R2R composite mode?

@jkotas
Copy link
Member

jkotas commented Jun 28, 2021

assume these the gains would be most pronounced with R2R composite mode?

It depends. I would expect it to impact both composite and non-composite cases.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 29, 2021

It can be merged then. This PR mostly unlocks dynamic PGO's potential that we won't use by default in .NET 6. However, users, who care less about slower start, can set the following variables for "FullPGO + Aggressive Inliner" mode:

DOTNET_TieredPGO=1                        # enable PGO instrumentation in tier0
DOTNET_TC_QuickJitForLoops=1              # don't bypass tier0 for methods with loops
DOTNET_ReadyToRun=0                       # don't use prejitted code and collect a better profile for everything
DOTNET_JitExtDefaultPolicyMaxIL=0x200     # allow inlining for large methods
DOTNET_JitExtDefaultPolicyMaxBB=15        # same here
DOTNET_JitExtDefaultPolicyProfScale=0x40  # try harder inlining hot methods

Here are the results for TE benchmarks on citrine-linux for Main-default (Static PGO is included) vs ThisPR + FullPGO + aggressive inliner ^:
image
(the diff is even bigger if we disable static PGO for the left one)

This PR also improves the default mode, but only by a few %s in general. And, just a reminder: if we use/benchmark something with DOTNET_TieredCompilation=0 we won't see any benefits since we use the old inlining policy for AOT and will never re-jit it for the BCL (don't expect any improvements for dotnet/performance micro-benchmarks if we use TC=0 there).

What is not done, but nice to have:

  1. It doesn't fix AggressiveInlining not respected when Method grow too large #41692, System.Numerics.Vector: Recognize division by a constant and inline #43761 and JIT code optimization inconsistency #51587 since it doesn't touch "EstimateTime" yet.
  2. It should ignore methods with insufficient samples (currently if callsite's profile >= callsite's fgFirstBb->weight -- we consider it as hot) e.g. static constructors, etc
  3. Re-vise "report noinline to vm" logic, sometimes it might bake "noinline" for methods which are good candidates for inlining within specific call-sites (e.g. with const args).
  4. "resolve call token" logic currently doesn't work for methods with loops.

If something goes wrong we can easily switch to the previous (more conservative and PGO-independent) policy.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 30, 2021

The example from #33349 is my favorite one:

private static bool Format(Span<byte> span, int value)
{
    return Utf8Formatter.TryFormat(value, span, out _, new StandardFormat('D', 2));
}

Diff: https://www.diffchecker.com/WFVcKhJm (973 bytes -> 102 bytes)

@EgorBo EgorBo changed the title JIT: Improve inliner (new heuristics, clean up) [JIT] Improve inliner: new heuristics, rely on PGO data Jun 30, 2021
@EgorBo
Copy link
Member Author

EgorBo commented Jun 30, 2021

Failure is unrelated (#54125)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
8 participants