Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT not able to inline methods that contain generic virtual calls #59075

Open
acaly opened this issue Sep 14, 2021 · 5 comments
Open

JIT not able to inline methods that contain generic virtual calls #59075

acaly opened this issue Sep 14, 2021 · 5 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@acaly
Copy link

acaly commented Sep 14, 2021

Description

I am trying to make a static reflection system in C#, which is basically to enumerate a struct's fields by ref without requiring any reflection from the runtime. I found that the performance does not fully meet the expectation, probably due to some generic methods not being inlined.

The source code used in benchmarking is here.

Configuration

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.7.21379.14
[Host] : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT [AttachedDebugger]
DefaultJob : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT

Regression?

No.

Data

Method Mean StdDev Median
Reflected 5.430 ns 0.0941 ns 5.419 ns
Direct 1.972 ns 0.0308 ns 1.972 ns

Source code is in the Description section above. The two methods Direct and Reflected are equivalent.

For Direct, much of the time should be in CreateState(), which is unrelated to the issue. Considering a vxorps + vmovsd should be fast, the difference between the two indicates that the method calls not inlined would have huge effect on performance, especially when the reflection is done inside a tight loop, which is the expected usage of this static reflection mechanism.

The disassembly of the two methods are shown below. Note that in Reflected, these two methods are not inlined:

  • StaticReflectionTest.EntityState2D.Reflect[[StaticReflectionTest.TestClass+ResetVelocityConsumer, StaticReflectionTest]]
  • StaticReflectionTest.TestClass.ResetVelocity[[StaticReflectionTest.EntityState2D, StaticReflectionTest]]

Reflected

; StaticReflectionTest.TestClass.Reflected()
       sub       rsp,38
       xor       eax,eax
       mov       [rsp+28],rax
       mov       [rsp+30],rax
       lea       rcx,[rsp+28]
       call      StaticReflectionTest.TestClass.CreateState()
       lea       rcx,[rsp+28]
       call      StaticReflectionTest.TestClass.ResetVelocity[[StaticReflectionTest.EntityState2D, StaticReflectionTest]](StaticReflectionTest.EntityState2D ByRef)
       nop
       add       rsp,38
       ret
; Total bytes of code 42

; StaticReflectionTest.TestClass.ResetVelocity[[StaticReflectionTest.EntityState2D, StaticReflectionTest]](StaticReflectionTest.EntityState2D ByRef)
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+20],rax
       lea       rdx,[rsp+20]
       call      StaticReflectionTest.EntityState2D.Reflect[[StaticReflectionTest.TestClass+ResetVelocityConsumer, StaticReflectionTest]](ResetVelocityConsumer ByRef)
       nop
       add       rsp,28
       ret
; Total bytes of code 27

; StaticReflectionTest.EntityState2D.Reflect[[StaticReflectionTest.TestClass+ResetVelocityConsumer, StaticReflectionTest]](ResetVelocityConsumer ByRef)
       add       rcx,8
       xor       eax,eax
       mov       [rcx],rax
       ret
; Total bytes of code 10

Direct (reference)

; StaticReflectionTest.TestClass.Direct()
       sub       rsp,38
       vzeroupper
       xor       eax,eax
       mov       [rsp+28],rax
       mov       [rsp+30],rax
       lea       rcx,[rsp+28]
       call      StaticReflectionTest.TestClass.CreateState()
       vxorps    xmm0,xmm0,xmm0
       vmovsd    qword ptr [rsp+30],xmm0
       add       rsp,38
       ret
; Total bytes of code 44

Shared method CreateState()

; StaticReflectionTest.TestClass.CreateState()
       vzeroupper
       vmovsd    xmm0,qword ptr [7FFE6EE4CA60]
       vmovss    xmm1,dword ptr [7FFE6EE4CA70]
       vxorps    xmm2,xmm2,xmm2
       vmovss    xmm2,xmm2,xmm1
       vpslldq   xmm2,xmm2,4
       vmovss    xmm2,xmm2,xmm1
       vmovaps   xmm1,xmm2
       vmovsd    qword ptr [rcx],xmm1
       vmovsd    qword ptr [rcx+8],xmm0
       mov       rax,rcx
       ret
; Total bytes of code 53

category:cq
theme:inlining
skill-level:expert
cost:medium
impact:medium

@acaly acaly added the tenet-performance Performance related issue label Sep 14, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Sep 14, 2021
@ghost
Copy link

ghost commented Sep 14, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I am trying to make a static reflection system in C#, which is basically to enumerate a struct's fields by ref without requiring any reflection from the runtime. I found that the performance does not fully meet the expectation, probably due to some generic methods not being inlined.

The source code used in benchmarking is here.

Configuration

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.7.21379.14
[Host] : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT [AttachedDebugger]
DefaultJob : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT

Regression?

No.

Data

Method Mean StdDev Median
Reflected 5.430 ns 0.0941 ns 5.419 ns
Direct 1.972 ns 0.0308 ns 1.972 ns

Source code is in the Description section above. The two methods Direct and Reflected are equivalent.

For Direct, much of the time should be in CreateState(), which is unrelated to the issue. Considering a vxorps + vmovsd should be fast, the difference between the two indicates that the method calls not inlined would have huge effect on performance, especially when the reflection is done inside a tight loop, which is the expected usage of this static reflection mechanism.

The disassembly of the two methods are shown below. The shared method StaticReflectionTest.TestClass.CreateState() is removed from the disassembly output. Note that in Reflected, these two methods are not inlined:

  • StaticReflectionTest.EntityState2D.Reflect[[StaticReflectionTest.TestClass+ResetVelocityConsumer, StaticReflectionTest]]
  • StaticReflectionTest.TestClass.ResetVelocity[[StaticReflectionTest.EntityState2D, StaticReflectionTest]]

Reflected

; StaticReflectionTest.TestClass.Reflected()
       sub       rsp,38
       xor       eax,eax
       mov       [rsp+28],rax
       mov       [rsp+30],rax
       lea       rcx,[rsp+28]
       call      StaticReflectionTest.TestClass.CreateState()
       lea       rcx,[rsp+28]
       call      StaticReflectionTest.TestClass.ResetVelocity[[StaticReflectionTest.EntityState2D, StaticReflectionTest]](StaticReflectionTest.EntityState2D ByRef)
       nop
       add       rsp,38
       ret
; Total bytes of code 42

; StaticReflectionTest.TestClass.ResetVelocity[[StaticReflectionTest.EntityState2D, StaticReflectionTest]](StaticReflectionTest.EntityState2D ByRef)
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+20],rax
       lea       rdx,[rsp+20]
       call      StaticReflectionTest.EntityState2D.Reflect[[StaticReflectionTest.TestClass+ResetVelocityConsumer, StaticReflectionTest]](ResetVelocityConsumer ByRef)
       nop
       add       rsp,28
       ret
; Total bytes of code 27

; StaticReflectionTest.EntityState2D.Reflect[[StaticReflectionTest.TestClass+ResetVelocityConsumer, StaticReflectionTest]](ResetVelocityConsumer ByRef)
       add       rcx,8
       xor       eax,eax
       mov       [rcx],rax
       ret
; Total bytes of code 10

Direct (reference)

; StaticReflectionTest.TestClass.Direct()
       sub       rsp,38
       vzeroupper
       xor       eax,eax
       mov       [rsp+28],rax
       mov       [rsp+30],rax
       lea       rcx,[rsp+28]
       call      StaticReflectionTest.TestClass.CreateState()
       vxorps    xmm0,xmm0,xmm0
       vmovsd    qword ptr [rsp+30],xmm0
       add       rsp,38
       ret
; Total bytes of code 44
Author: acaly
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@acaly
Copy link
Author

acaly commented Sep 14, 2021

Probably related to #59002.

@EgorBo
Copy link
Member

EgorBo commented Sep 14, 2021

INLINER: Marking StaticReflectionTest.TestClass:ResetVelocity(byref) as NOINLINE because of generic virtual
INLINER: during 'fgInline' result 'failed this callee' reason 'generic virtual'

It's a known issue - inliner gives up on virtual calls over generics currently (Reflect()), there was an issue for it somewhere but I can't find it.
It should be able to inline if it can prove after inlining it will be a normal call.

@EgorBo EgorBo added this to the Future milestone Sep 14, 2021
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Sep 14, 2021
@EgorBo
Copy link
Member

EgorBo commented Sep 14, 2021

A similar issue:

using System;

class Program : IDisposable
{
    public static void Main()
    {
        Program r = new ();
        Test(ref r);
    }

    static void Test<T>(ref T foo) 
        where T : IDisposable => foo.Dispose();

    public void Dispose() {}
}

codegen for main:

; Method Program:Main()
G_M27646_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25

G_M27646_IG02:              ;; offset=0004H
       48B9C01538C1FB7F0000 mov      rcx, 0x7FFBC13815C0
       E85D9DA15F           call     CORINFO_HELP_NEWSFAST
       488BC8               mov      rcx, rax
       49BB4000D9C0FB7F0000 mov      r11, 0x7FFBC0D90040
       FF155AD0E7FF         call     [System.IDisposable:Dispose():this]
       90                   nop      
						;; bbWeight=1    PerfScore 5.00

G_M27646_IG03:              ;; offset=0027H
       4883C428             add      rsp, 40
       C3                   ret      
						;; bbWeight=1    PerfScore 1.25
; Total bytes of code: 44

@acaly
Copy link
Author

acaly commented Sep 14, 2021

It should be able to inline if it can prove after inlining it will be a normal call.

Could you elaborate on that? I think both Consume() and Reflect() are generic virtuals (and both are declared in value types, used as constrained callvirt IL). Why the Consume can be fully inlined but not the other one? Also ResetVelocity itself isn't virtual, why it cannot be inlined into the benchmarked method?

EDIT

I changed Reflect into static abstract methods, and it makes ResetVelocity inlined. So I think the problem is, whenever a method calls another generic virtual method, that caller method won't be inlined into the caller's caller. In this case, the call to Consume is blocking Reflect from being inlined, and the call to Reflect is blocking ResetVelocity from being inlined.

@acaly acaly changed the title JIT not able to inline some simple generic methods marked as AggressiveInline JIT not able to inline methods that contain generic virtual calls Sep 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

2 participants