Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: opportunities for improving async state machine codegen #6916

Open
AndyAyersMS opened this issue Oct 31, 2016 · 4 comments
Open

JIT: opportunities for improving async state machine codegen #6916

AndyAyersMS opened this issue Oct 31, 2016 · 4 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Oct 31, 2016

There are some idiomatic patterns in the async state machines created by the C# compiler that cause the jit to generate sub-optimal code.

A motivating example is the method <RequestProcessingAsync>d__2[Context][Microsoft.AspNetCore.Hosting.Internal.HostingApplication+Context]:MoveNext():this from Kestrel (source code).

  • The MoveNext method is a value class method that holds a pointer to the associated Frame class. Frame field references in MoveNext thus must go through one extra level of indirection. The JIT also spills the this pointer in MoveNext, leading to code sequences like the following:
G_M55243_IG09:
       488B5510             mov      rdx, bword ptr [rbp+10H]   // RDX=this
       488B12               mov      rdx, gword ptr [rdx]       // RDX=this.<>4__this (frame)
       8B0A                 mov      ecx, dword ptr [rdx]       // null check (?)
       488B4A58             mov      rcx, gword ptr [rdx+88]    // RCX = this.<>_this.somefield
       488B5510             mov      rdx, bword ptr [rbp+10H]   // RDX = this
       488B12               mov      rdx, gword ptr [rdx]       // RDX = this.<>4_this (frame)
       488B9290010000       mov      rdx, qword ptr [rdx+400]
       488B4920             mov      rcx, gword ptr [rcx+32]

The jit should be able to CSE some of these redundant loads. See #6913 for a small repro case. The C# compiler could help the jit by caching <>4__this in a local, and perhaps by inserting an early null check.

  • The method is (roughly speaking) an outermost switch. Such methods can incur relatively high prolog costs from zero initializations in the prolog, and this method is a good example:
       488DBD58FFFFFF       lea      rdi, [rbp-A8H]
       B922000000           mov      ecx, 34
       33C0                 xor      rax, rax
       F3AB                 rep stosd 

Most of the zeroing comes from structs with GC references. Possible mitigation routes: more aggressive copy (and reverse-copy) prop of structs, tracking struct field lifetimes, or stack packing.

  • Code layout: because this method has EH, cold code cannot be aggressively moved to the end of the method. It could however be moved to the end of the enclosing try regions. For instance, right after the first switch in the method there are a couple of throw blocks:
G_M55243_IG03:
     ....
       FFE1                 jmp      rcx

G_M55243_IG04:
       488BD3               mov      rdx, rbx
       48B92885A3C6FF7F0000 mov      rcx, 0x7FFFC6A38528
       E83E47B35E           call     CORINFO_HELP_ISINSTANCEOFCLASS
       4885C0               test     rax, rax
       7508                 jne      SHORT G_M55243_IG05
       488BCB               mov      rcx, rbx
       E8A12FCE5E           call     CORINFO_HELP_THROW

G_M55243_IG05:
       488BC8               mov      rcx, rax
       E84932505E           call     System.Runtime.ExceptionServices.ExceptionDispatchInfo:Capture(ref):ref
       488BC8               mov      rcx, rax
       3909                 cmp      dword ptr [rcx], ecx
       E88F32505E           call     System.Runtime.ExceptionServices.ExceptionDispatchInfo:Throw():this
  • Inlining: there are some examples here where an ALWAYS_INLINE case pulls in a call to a method with an aggressive inlining attribute. We might want to reconsider such cases and force the profitability heuristics to kick in instead. For example:
[42 IL=0919 TR=001193 06001E56] [below ALWAYS_INLINE size] ConfiguredTaskAwaiter:GetResult():this
    [43 IL=0006 TR=002148 06001E94] [below ALWAYS_INLINE size] System.Runtime.CompilerServices.TaskAwaiter:ValidateEnd(ref)
      [44 IL=0001 TR=002154 06002271] [aggressive inline attribute] System.Threading.Tasks.Task:get_IsWaitNotificationEnabledOrNotRanToCompletion():bool:this
  • Switch codegen: there are some degenerate switch cases that would be better expressed as if/then ladders. It would probably be worth surveying other async methods to see if this is a common occurence. For example:
IL_0008:  switch     ( 
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_0729)

Here there are only 3 targets and there's no interleaving.

More detailed notes and various jit artifacts can be found on this gist.

category:cq
theme:optimization
skill-level:expert
cost:extra-large

@VSadov
Copy link
Member

VSadov commented Nov 2, 2016

The degenerate switch case will commonly happen when "awaits" are in a "try" region. We dispatch the execution according to the current state to the appropriate resume point. However, we cannot dispatch directly to resume points that are inside "try", so we dispatch those to a label right before the "try" and then sub-dispatch inside the try.

As a result you may have contiguous segments of the outer jump table filled with the same label. - That would be the label before the try.

This pattern does not seem likely to happen in code written by hand, but could be frequent enough in synthetic switches.

I guess we could do better when lowering switches.

VSadov referenced this issue in dotnet/roslyn Dec 21, 2016
Related to dotnet/coreclr#7914
Where the offending pattern was observed in the IL generated for the Kestrel web server and was recommended as an easy improvement.

Becasue of cascaded dispatching into "try" regions , it is not uncommon to see async state machine to contain degenerate switches like

IL_0008:  switch     (
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_0729)

Note numerous cases all leading to the same target.
We can trivially emit such switches as just a range-check (that switch would need to do anyways), without any actual "switching".
Since all cases lead to the same label anyways, there is no need to switch once we know the value is in range of the switch.

Fixes:#14878
VSadov referenced this issue in dotnet/roslyn Dec 28, 2016
Related to dotnet/coreclr#7914
Where the offending pattern was observed in the IL generated for the Kestrel web server and was recommended as an easy improvement.

Becasue of cascaded dispatching into "try" regions , it is not uncommon to see async state machine to contain degenerate switches like

IL_0008:  switch     (
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_003b,
                          IL_0729)

Note numerous cases all leading to the same target.
We can trivially emit such switches as just a range-check (that switch would need to do anyways), without any actual "switching".
Since all cases lead to the same label anyways, there is no need to switch once we know the value is in range of the switch.

Fixes:#14878
@benaadams
Copy link
Member

Seeing same from TechEmpower aspnetcore Platform benchmark

public async ValueTask OnReadCompletedAsync()
{
    await Writer.FlushAsync();
}
G_M39073_IG17:
       488BCE               mov      rcx, rsi
       480FBF55E0           movsx    rdx, word  ptr [rbp-20H]
       49BB9010BF3BFA7F0000 mov      r11, 0x7FFA3BBF1090
       48B89010BF3BFA7F0000 mov      rax, 0x7FFA3BBF1090
       3909                 cmp      dword ptr [rcx], ecx
       FF10                 call     qword ptr [rax]IValueTaskSource`1:GetResult(short):struct:this
       8BF8                 mov      edi, eax

G_M39073_IG18:
       40887DB8             mov      byte  ptr [rbp-48H], dil

G_M39073_IG19:
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; rdx load
       C74208FEFFFFFF       mov      dword ptr [rdx+8], -2
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; rdx unchanged?
       3912                 cmp      dword ptr [rdx], edx
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; rdx unchanged?
       488D7210             lea      rsi, bword ptr [rdx+16]
       807E0100             cmp      byte  ptr [rsi+1], 0
       7430                 je       SHORT G_M39073_IG21
       3936                 cmp      dword ptr [rsi], esi
       488D4E08             lea      rcx, bword ptr [rsi+8]
       48BA882626F79D010000 mov      rdx, 0x19DF7262688
       488B12               mov      rdx, gword ptr [rdx]
       48833900             cmp      gword ptr [rcx], 0
       7507                 jne      SHORT G_M39073_IG20
       E8A7A87C5F           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       EB13                 jmp      SHORT G_M39073_IG22

G_M39073_IG20:
       C6458000             mov      byte  ptr [rbp-80H], 0
       480FBE5580           movsx    rdx, byte  ptr [rbp-80H]
       E807B8A14D           call     AsyncTaskMethodBuilder`1:SetExistingTaskResult(struct):this
       EB03                 jmp      SHORT G_M39073_IG22

@AndyAyersMS
Copy link
Member Author

See also dotnet/coreclr#21973, where we might be able to achieve write-through like codegen by CSEing reads of DNER locals.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Dec 5, 2020
@teo-tsirpanis
Copy link
Contributor

I hit a similar situation too, with async methods that return ValueTask and complete synchronously. The state machines add considerable overhead by themselves.

Method Mean Error StdDev
Plain 3.621 ms 0.0005 ms 0.0004 ms
PlainWithTryCatch 5.079 ms 0.0007 ms 0.0007 ms
ValueTask 90.974 ms 0.0542 ms 0.0453 ms
Benchmark code
using BenchmarkDotNet.Attributes;
using System.Runtime.ExceptionServices;

public class ShallowRecursionBenchmark
{
    public const uint N = 20;

    [Benchmark]
    public int Plain()
    {
        return Impl(N);

        static int Impl(uint n)
        {
            if (n == 0)
            {
                return 1;
            }

            return Impl(n - 1) + Impl(n - 1);
        }
    }

    [Benchmark]
    public int PlainWithTryCatch()
    {
        return Impl(N);

        static int Impl(uint n)
        {
            try
            {
                if (n == 0)
                {
                    return 1;
                }

                return Impl(n - 1) + Impl(n - 1);
            }
            catch (Exception e)
            {
                ExceptionDispatchInfo.Throw(e);
                return 0;
            }
        }
    }

    [Benchmark]
    public int ValueTask()
    {
        return Impl(N).GetAwaiter().GetResult();

        static async ValueTask<int> Impl(uint n)
        {
            if (n == 0)
            {
                return 1;
            }

            return await Impl(n - 1) + await Impl(n - 1);
        }
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

6 participants