Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inliner: next steps #59002

Open
EgorBo opened this issue Sep 12, 2021 · 1 comment
Open

Inliner: next steps #59002

EgorBo opened this issue Sep 12, 2021 · 1 comment
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Sep 12, 2021

This issue summarizes problems which Inliner doesn't handle well currently.

Better understanding of what will be removed if we inline a callee

TODO:

Bottom-up inlining for AOT

TODO:

"Agile" time budget

TODO:

stfld is expensive for ref types

Currently inliner doesn't realize stfld for ref types will end up as ASSIGN_REF calls (gc write barriers) and is cheap for value-types. There is a benchmark that shows better results with NoInlining: https://discord.com/channels/143867839282020352/312132327348240384/886606988287569950

class MyTest
{
    private object _a;
    private object _b;
    public MyTest(object a, object b)
    {
        _a = a;
        _b = b;
    }
    static MyTest Create(object a, object b) => new MyTest(a, b);
}

Codegen for Create:

; Method MyTest:Test(System.Object,System.Object):MyTest
G_M58625_IG01:              ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC20             sub      rsp, 32
       488BF1               mov      rsi, rcx
       488BFA               mov      rdi, rdx
						;; bbWeight=1    PerfScore 3.75

G_M58625_IG02:              ;; offset=000DH
       48B9D00288CDFB7F0000 mov      rcx, 0x7FFBCD8802D0
       E8B4C7A15F           call     CORINFO_HELP_NEWSFAST
       488BD8               mov      rbx, rax
       488D4B08             lea      rcx, bword ptr [rbx+8]
       488BD6               mov      rdx, rsi
       E825C3A15F           call     CORINFO_HELP_ASSIGN_REF
       488D4B10             lea      rcx, bword ptr [rbx+16]
       488BD7               mov      rdx, rdi
       E819C3A15F           call     CORINFO_HELP_ASSIGN_REF
       488BC3               mov      rax, rbx
						;; bbWeight=1    PerfScore 5.25

G_M58625_IG03:              ;; offset=003AH
       4883C420             add      rsp, 32
       5B                   pop      rbx
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      
						;; bbWeight=1    PerfScore 2.75
; Total bytes of code: 66

ldsfld over a static readonly field returns constant

static readonly fields over primitive types most likely are converted to constants, e.g.:

private static readonly int s_Field = Environment.ProcessorCount;

void Validate()
{
    if (s_Field < 16)
        throw new Exception("");
}

void Test() => Validate();

Inliner's log in Test:

Inline candidate has 1 binary expressions with constants.  Multiplier increased to 0.5.
Inline candidate callsite is boring.  Multiplier increased to 1.8.

while it is expected from it to recognize a foldable branch here.

Callees with Pinvokes are expensive

(or basically any callee that injects code into callers' prologues/epilogues)

class MyTest
{
    [DllImport("user32.dll")]
    static extern int MessageBeep(uint uType);

    static int MessageBeepWrapper(uint uType) => MessageBeep(uType);

    static void Beep(bool beep)
    {
        if (beep)
            MessageBeepWrapper(0x20);
    }
}

Codegen for Beep:

G_M29584_IG01:              ;; offset=0000H
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC68             sub      rsp, 104
       488DAC24A0000000     lea      rbp, [rsp+A0H]
       8BF1                 mov      esi, ecx
						;; bbWeight=1    PerfScore 9.00

G_M29584_IG02:              ;; offset=001AH
       488D4D88             lea      rcx, [rbp-78H]
       498BD2               mov      rdx, r10
       E8FA4E705F           call     CORINFO_HELP_INIT_PINVOKE_FRAME
       488BF8               mov      rdi, rax
       488BCC               mov      rcx, rsp
       48894DA8             mov      qword ptr [rbp-58H], rcx
       488BCD               mov      rcx, rbp
       48894DB8             mov      qword ptr [rbp-48H], rcx
       4084F6               test     sil, sil
       744B                 je       SHORT G_M29584_IG07
						;; bbWeight=1    PerfScore 5.75

G_M29584_IG03:              ;; offset=003CH
       B920000000           mov      ecx, 32
       48B8C8018ACDFB7F0000 mov      rax, 0x7FFBCD8A01C8
       48894598             mov      qword ptr [rbp-68H], rax
       488D0516000000       lea      rax, G_M29584_IG05
       488945B0             mov      qword ptr [rbp-50H], rax
       488D4588             lea      rax, bword ptr [rbp-78H]
       48894710             mov      qword ptr [rdi+16], rax
       C6470C00             mov      byte  ptr [rdi+12], 0
						;; bbWeight=0.50 PerfScore 3.00

G_M29584_IG04:              ;; offset=0066H
       FF15FCD54600         call     [MyTest:MessageBeep(int):int]
						;; bbWeight=0.50 PerfScore 1.50

G_M29584_IG05:              ;; offset=006CH
       C6470C01             mov      byte  ptr [rdi+12], 1
       833D11C7F55F00       cmp      dword ptr [(reloc 0x7ffc2d38f428)], 0
       7406                 je       SHORT G_M29584_IG06
       FF15310AF45F         call     [CORINFO_HELP_STOP_FOR_GC]
						;; bbWeight=0.50 PerfScore 4.00

G_M29584_IG06:              ;; offset=007FH
       488B4590             mov      rax, bword ptr [rbp-70H]
       48894710             mov      qword ptr [rdi+16], rax
						;; bbWeight=0.50 PerfScore 1.00

G_M29584_IG07:              ;; offset=0087H
       4883C468             add      rsp, 104
       5B                   pop      rbx
       5E                   pop      rsi
       5F                   pop      rdi
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=1    PerfScore 5.25
; Total bytes of code: 152

NOTE if caller already have a pinvoke - then it should make sense to inline callees with pinvokes too.

typeof(T) should be recognized as a constant argument (and propagated by value)

void Validate(Type type)
{
    if (type.IsValueType)
        throw new InvalidOperationException("");
}

void CallTest() => Validate(typeof(string));

Inliner's log for CallTest:

Inline has 1 intrinsics.  Multiplier increased to 1.3.
Inline candidate callsite is boring.  Multiplier increased to 2.6.
calleeNativeSizeEstimate=642
callsiteNativeSizeEstimate=85
benefit multiplier=2.6
threshold=221
Native estimate for function size exceeds threshold for inlining 64.2 > 22.1 (multiplier = 2.6)

So it recognized get_IsValueType intrinsic, but didn't realize we pass a constant there. #55875 is a good example of how to propagate complex trees during inlining so the same should be done with typeof.

call (cnsArg, ...) heuristic is too optimistic

When inliner sees a pattern like if (CallSomething(arg0, ...) it might recognize it as a foldable branch currently if arg0 is a constant arg: it's too optimistic and doesn't check call's signature.

if (!isJitIntrinsic && !handled && FgStack::IsArgument(pushedStack.Top()))
{
// Optimistically assume that "call(arg)" returns something arg-dependent.
// However, we don't know how many args it expects and its return type.
handled = true;
}

Methods with string literals from external assemblies are never inlined

See #53726 (comment)

JIT should not give up completely on large methods

It should be able to analyze a small part of such method to find potential foldable branches in large methods or when it runs out of "time budget", related issues:

JIT should be able to resolve 'call' instructions in methods with loops.

We either should allow it to resolve 'call' in TC=0/methods with loops or enable QJFL. Enabling it for TC=0 will allow users to see how it works with intrinsics on sharplab. Potential impact on TP. So currently we'll never see a foldable branch here:

void Test<T>()
{
    if (typeof(T) == typeof(int))
        return;
    for (int i = 0; i < 100; i++) {}
}

Cache machine code size

If JIT compiles a method with IL size, say, 100 bytes and it ends up with only 20 bytes of machine code we can keep that info for future compilations that this method is not as big as it looks (it can be just a single bit instead of machine code size).

Misc

OS/Platform specifics: Callee-saved registers and CSE, ABI differences
Explicit tail-calls are never inlined
Save some tips for inliner in tier0/crossgen
etc.

category:planning
theme:inlining
skill-level:expert
cost:large
impact:large

@EgorBo EgorBo added the tenet-performance Performance related issue label Sep 12, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Sep 12, 2021
@ghost
Copy link

ghost commented Sep 12, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This issue summarizes current problems which Inliner doesn't handle well currently.

stfld is expensive for ref types

Currently inliner doesn't realize stfld for ref types will end up as ASSIGN_REF calls. There is a benchmark that shows better results with NoInlining: https://discord.com/channels/143867839282020352/312132327348240384/886606988287569950

class MyTest
{
    private object _a;
    private object _b;
    public MyTest(object a, object b)
    {
        _a = a;
        _b = b;
    }
    static MyTest Create(object a, object b) => new MyTest(a, b);
}

Codegen for Create:

; Method MyTest:Test(System.Object,System.Object):MyTest
G_M58625_IG01:              ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC20             sub      rsp, 32
       488BF1               mov      rsi, rcx
       488BFA               mov      rdi, rdx
						;; bbWeight=1    PerfScore 3.75

G_M58625_IG02:              ;; offset=000DH
       48B9D00288CDFB7F0000 mov      rcx, 0x7FFBCD8802D0
       E8B4C7A15F           call     CORINFO_HELP_NEWSFAST
       488BD8               mov      rbx, rax
       488D4B08             lea      rcx, bword ptr [rbx+8]
       488BD6               mov      rdx, rsi
       E825C3A15F           call     CORINFO_HELP_ASSIGN_REF
       488D4B10             lea      rcx, bword ptr [rbx+16]
       488BD7               mov      rdx, rdi
       E819C3A15F           call     CORINFO_HELP_ASSIGN_REF
       488BC3               mov      rax, rbx
						;; bbWeight=1    PerfScore 5.25

G_M58625_IG03:              ;; offset=003AH
       4883C420             add      rsp, 32
       5B                   pop      rbx
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      
						;; bbWeight=1    PerfScore 2.75
; Total bytes of code: 66

Calles with Pinvokes are expensive

(or basically any callees which impact on callers' prologues/epilogues)

class MyTest
{
    [DllImport("user32.dll")]
    static extern int MessageBeep(uint uType);

    static int MessageBeepWrapper(uint uType) => MessageBeep(uType);

    static void Beep(bool beep)
    {
        if (beep)
            MessageBeepWrapper(0x20);
    }
}

Codegen for Beep:

G_M29584_IG01:              ;; offset=0000H
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC68             sub      rsp, 104
       488DAC24A0000000     lea      rbp, [rsp+A0H]
       8BF1                 mov      esi, ecx
						;; bbWeight=1    PerfScore 9.00

G_M29584_IG02:              ;; offset=001AH
       488D4D88             lea      rcx, [rbp-78H]
       498BD2               mov      rdx, r10
       E8FA4E705F           call     CORINFO_HELP_INIT_PINVOKE_FRAME
       488BF8               mov      rdi, rax
       488BCC               mov      rcx, rsp
       48894DA8             mov      qword ptr [rbp-58H], rcx
       488BCD               mov      rcx, rbp
       48894DB8             mov      qword ptr [rbp-48H], rcx
       4084F6               test     sil, sil
       744B                 je       SHORT G_M29584_IG07
						;; bbWeight=1    PerfScore 5.75

G_M29584_IG03:              ;; offset=003CH
       B920000000           mov      ecx, 32
       48B8C8018ACDFB7F0000 mov      rax, 0x7FFBCD8A01C8
       48894598             mov      qword ptr [rbp-68H], rax
       488D0516000000       lea      rax, G_M29584_IG05
       488945B0             mov      qword ptr [rbp-50H], rax
       488D4588             lea      rax, bword ptr [rbp-78H]
       48894710             mov      qword ptr [rdi+16], rax
       C6470C00             mov      byte  ptr [rdi+12], 0
						;; bbWeight=0.50 PerfScore 3.00

G_M29584_IG04:              ;; offset=0066H
       FF15FCD54600         call     [MyTest:MessageBeep(int):int]
						;; bbWeight=0.50 PerfScore 1.50

G_M29584_IG05:              ;; offset=006CH
       C6470C01             mov      byte  ptr [rdi+12], 1
       833D11C7F55F00       cmp      dword ptr [(reloc 0x7ffc2d38f428)], 0
       7406                 je       SHORT G_M29584_IG06
       FF15310AF45F         call     [CORINFO_HELP_STOP_FOR_GC]
						;; bbWeight=0.50 PerfScore 4.00

G_M29584_IG06:              ;; offset=007FH
       488B4590             mov      rax, bword ptr [rbp-70H]
       48894710             mov      qword ptr [rdi+16], rax
						;; bbWeight=0.50 PerfScore 1.00

G_M29584_IG07:              ;; offset=0087H
       4883C468             add      rsp, 104
       5B                   pop      rbx
       5E                   pop      rsi
       5F                   pop      rdi
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=1    PerfScore 5.25
; Total bytes of code: 152

typeof(T) should be recognized as a constant argument (and propagated by value)

void Validate(Type type)
{
    if (type.IsValueType)
        throw new InvalidOperationException("");
}

void CallTest() => Validate(typeof(string));

Inliner's log for CallTest:

Inline has 1 intrinsics.  Multiplier increased to 1.3.
Inline candidate callsite is boring.  Multiplier increased to 2.6.
calleeNativeSizeEstimate=642
callsiteNativeSizeEstimate=85
benefit multiplier=2.6
threshold=221
Native estimate for function size exceeds threshold for inlining 64.2 > 22.1 (multiplier = 2.6)

So it recognized get_IsValueType intrinsic, but didn't realize we pass a constant there. #55875 is a good example of how to propagate complex trees during inlining so the same should be done with typeof.

call (cnsArg, ...) heuristic is too optimistic

When inliner sees a pattern like if (CallSomething(arg0, ...) it might recognize it as a foldable branch currently if arg0 is a constant arg: it's too optimistic and doesn't check call's signature.

if (!isJitIntrinsic && !handled && FgStack::IsArgument(pushedStack.Top()))
{
// Optimistically assume that "call(arg)" returns something arg-dependent.
// However, we don't know how many args it expects and its return type.
handled = true;
}

Methods with string literals from external assemblies are never inlined

See #53726 (comment)

JIT should not give up completely for large methods

It should be able to analyze a small part of such method to find potential foldable branches in large methods or when it runs out of "time budget", related issues:

JIT should be able to resolve 'call' instructions in methods with loops.

We either should allow it to resolve 'call' in TC=0/methods with loops or enable QJFL. Enabling it for TC=0 will allow users to see how it works with intrinsics on sharplab. Potential impact on TP. So currently we'll never see a foldable branch here:

void Test<T>()
{
    if (typeof(T) == typeof(int))
        return;
    for (int i = 0; i < 100; i++) {}
}

... more to come

Author: EgorBo
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

1 participant