Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RyuJIT: Inlined "is class statically inited" check #47901

Closed
wants to merge 21 commits into from

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Feb 5, 2021

Fixes #1327

For members with static constructors or beforefieldinit we emit a helper call to initialize the classes. This PR adds an additional quick check "is already inited?" so we can get rid of the call overhead (which is supposed to be executed only once).

Example:

static int s_Field = 42; // = 42 introduces a static constructor

static int Test()
{
    return s_Field;
}

Current codegen:

G_M36434_IG01:
       sub      rsp, 40
G_M36434_IG02:
       mov      rcx, 0xD1FFAB1E
       mov      edx, 1
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       mov      eax, dword ptr [reloc classVar[0xd1ffab1e]]
G_M36434_IG03:
       add      rsp, 40
       ret      
; Total bytes of code 35

New codegen:

G_M36434_IG01:
       sub      rsp, 40
G_M36434_IG02:
+      test     byte  ptr [(reloc)], 1           ;; <--- fast check for initialization status
+      je       SHORT G_M36434_IG05
G_M36434_IG03:
       mov      eax, dword ptr [reloc classVar[0xd1ffab1e]]
G_M36434_IG04:
       add      rsp, 40
       ret      
+G_M36434_IG05:                                  ;; <--- cold block (executed just once per lifetime)
+       mov      rcx, 0xD1FFAB1E
+       call     CORINFO_HELP_INITCLASS
+       jmp      SHORT G_M36434_IG03
; Total bytes of code 41 (+6 bytes)

In most cases such init-calls are eliminated in the real world, because a type can be already initialized at the JIT time (e.g. during re-jitting tier0 -> tier1), but it's not the case for methods with loops because those aren't re-jitted atm (only if COMPlus_TC_QuickJitForLoops=1 is set) and we might end up with slow call init forever especially when if it's located inside a loop (in case of static constructors we can't hoist such calls).

/cc @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 5, 2021
@AndyAyersMS
Copy link
Member

I wonder if we should consider using this check as a gating condition for loop cloning... Also may factor into the do-while transformation heuristics, these try to account for the potential "savings" from hoisting the class init call check out of the loop.

So for some subset of loops we would produce a loop that knows classes are inited and does no checks, and another that will conditionally check within the loop body as above.

cc @BruceForstall

@AndyAyersMS
Copy link
Member

Can you look at the case where there are two static accesses in a method, one of which dominates the other (say, separated by some unrelated flow, just to make things interesting), and make sure we're only emitting one check and call?

Eg

  var x = s_f1;
  if (...) { }
  var y = s_f2;  // we should not need a check here (when optimizing)

My concern is that if we make the call to init the class conditional under the check, then the jit may no longer realize that the dominated check and call are redundant.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 9, 2021

Can you look at the case where there are two static accesses in a method, one of which dominates the other (say, separated by some unrelated flow, just to make things interesting), and make sure we're only emitting one check and call?

Eg

  var x = s_f1;
  if (...) { }
  var y = s_f2;  // we should not need a check here (when optimizing)

My concern is that if we make the call to init the class conditional under the check, then the jit may no longer realize that the dominated check and call are redundant.

@AndyAyersMS unfortunately it regresses that case, I wonder if I should do this optimization later after loopHoisting.
It will allow to also handle class-inits for beforefieldinit cases - currently I skip those to keep them "LICM friendly" (note that GTF_CALL_HOISTABLE check). So I'll do it without the QMARK/COLON just like how we insert the GC safepoints for SuppressGCTransition.

This optimization took me a bit longer to implement, mainly because of COMMA and QMARK but at least I now know more about them and how they are expanded.

@AndyAyersMS
Copy link
Member

I wonder if I should do this optimization later after loopHoisting.

Perhaps... you might also be able to work around this by leaving some sort of pseduo-op at the join point below the initial check, something that value numbers the same as the helper call but doesn't get treated as a call otherwise, and emits no code.

If we do that and mark the indirection to fetch the init flag nonfaulting hopefully the jit suppresses the now-dominated call and realizes it has an empty if/then guarded by a side-effect free predicate, and cleans it all up nicely.

Base automatically changed from master to main March 1, 2021 09:07
@EgorBo EgorBo force-pushed the jit-inlined-is-static-inited-check branch from 4e5a6f5 to 0e1585d Compare March 3, 2021 14:43
@EgorBo
Copy link
Member Author

EgorBo commented Mar 3, 2021

Rewrote to be just before the "insert gc safepoints" phase (closer to the rationalization phase)
So I don't have to care about hoisting/jump threading and can only emit it where necessary.

Even if the current approach won't make it I still learned a lot about the existing flowgraph APIs to manipulate with blocks.

@EgorBo EgorBo marked this pull request as draft March 3, 2021 14:47
@EgorBo
Copy link
Member Author

EgorBo commented Mar 3, 2021

Just checked the jump-threading example:

...
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Test(bool cond1, bool cond2)
    {
        int z = 0;
        if (cond1)
        {
            z += CctorTestClass.A;

            if (cond2) 
                z += CctorTestClass.B;
        }
        return z;
    }
}

static class CctorTestClass
{
    public static readonly int A;
    public static readonly int B;

    static CctorTestClass()
    {
        A = 42;
        B = 43;
    }
}

Emits:

G_M50630_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
       8BF2                 mov      esi, edx
						;; bbWeight=1    PerfScore 1.50
G_M50630_IG02:              ;; offset=0007H
       33C0                 xor      eax, eax
       84C9                 test     cl, cl
       741C                 je       SHORT G_M50630_IG05
						;; bbWeight=1    PerfScore 1.50
G_M50630_IG03:              ;; offset=000DH
       0FB6059E1C1000       movzx    rax, byte  ptr [(reloc)]
       A801                 test     al, 1
       7417                 je       SHORT G_M50630_IG06
						;; bbWeight=0.50 PerfScore 1.63
G_M50630_IG04:              ;; offset=0018H
       8B05961C1000         mov      eax, dword ptr [reloc classVar[0x6d6303a0]]
       4084F6               test     sil, sil
       7406                 je       SHORT G_M50630_IG05
       03058F1C1000         add      eax, dword ptr [reloc classVar[0x6d6303b8]]
						;; bbWeight=0.50 PerfScore 1.63
G_M50630_IG05:              ;; offset=0029H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      
						;; bbWeight=1    PerfScore 1.75
G_M50630_IG06:              ;; offset=002FH
       48B960EF606DF97F0000 mov      rcx, 0x7FF96D60EF60
       BA02000000           mov      edx, 2
       E8DDB61A5F           call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       EBD3                 jmp      SHORT G_M50630_IG04
						;; bbWeight=0    PerfScore 0.00

with a single check as expected, the cctor block is cold.
the only thing - it looks like the check could be optimized from

       0FB6059E1C1000       movzx    rax, byte  ptr [(reloc)]
       A801                 test     al, 1
       7417                 je       SHORT G_M50630_IG06

to

       A801                 test     byte  ptr [(reloc)], 1
       7417                 je       SHORT G_M50630_IG06

but it's a different issue

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this later was a good idea.

Left you some notes.

src/coreclr/jit/flowgraph.cpp Show resolved Hide resolved
src/coreclr/jit/flowgraph.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/flowgraph.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/flowgraph.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/flowgraph.cpp Outdated Show resolved Hide resolved
@EgorBo
Copy link
Member Author

EgorBo commented Mar 5, 2021

Jit-diffs: https://gist.github.com/EgorBo/2ba6bcc04c5d68a295f758428499c34a
Jit-diffs with CORINFO_HELP_CLASSINIT_SHARED_DYNAMICCLASS: https://gist.github.com/EgorBo/404d59535b4926266c708d7324e1d76a

Those are quite huge because of PMI nature. but I'd love to see if it affects the benchmarks.

@EgorBo EgorBo force-pushed the jit-inlined-is-static-inited-check branch from 5f87af0 to a3ca77d Compare March 7, 2021 19:24
@EgorBo
Copy link
Member Author

EgorBo commented Mar 7, 2021

The PR is green.
@dotnet/jit-contrib @jkotas PTAL

@AndyAyersMS
Copy link
Member

From the dynamic pmi gist:

Total bytes of delta: 561981 (1.04% of base)

This is a fairly large diff and I wonder if there are things we could do to mitigate to reduce the impact on crossgen

I assume a crossgen diff shows similar code size growth? Can you look at this?

I'm curious what is going on in some of the methods with very high % diffs. Can you show the old/new codegen for something like:

          32 (106.67% of base) : System.Private.Xml.dasm - Datatype_anySimpleType:.ctor():this

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try and test cases where there are multiple class init calls in a single tree?

Not sure if we ever create these but if so it would be good to verify things work as expected.

src/coreclr/jit/flowgraph.cpp Show resolved Hide resolved
src/coreclr/jit/flowgraph.cpp Outdated Show resolved Hide resolved
@jkotas
Copy link
Member

jkotas commented Mar 8, 2021

This is a fairly large diff and I wonder if there are things we could do to mitigate to reduce the impact on crossgen

As implemented, it won't kick in for crossgen. And if it did kick in for crossgen, the size regression would likely be even bigger due to extra indirections and fixups.

Would it make sense to do this optimization for tier-1 JIT only or only inside loops? The static constructors should have run for the most part already, and the rest should not add up to much.

@EgorBo
Copy link
Member Author

EgorBo commented Mar 8, 2021

Would it make sense to do this optimization for tier-1 JIT only or only inside loops?

it's currently tier-1 jit only, will check the diff for "loop-only" (BBF_BACKWARD_JUMP)

@jkotas
Copy link
Member

jkotas commented Mar 8, 2021

it's currently tier-1 jit only,

I meant when we we are recompiling code that run in the current process already at least one. It is different from enabling it when optimizations are on.

@EgorBo
Copy link
Member Author

EgorBo commented Mar 8, 2021

it's currently tier-1 jit only,

I meant when we we are recompiling code that run in the current process already at least one. It is different from enabling it when optimizations are on.

um.. the recompiled code is supposed to be initclass-free, isn't it? all classes will init itself on tier0.
(at least the non-dynamic ones)

@EgorBo
Copy link
Member Author

EgorBo commented Mar 8, 2021

Quick experiment:

Current diff for corelib (without the dynamic cases): 11926 : System.Private.CoreLib.dasm (0.26% of base)

Diff for methods with fgHasLoops (means, the optimization is enabled only if a method has any loop): 1870 : System.Private.CoreLib.dasm (0.04% of base)

From my understanding the non-loop ones are expected to get rid of the static init by their own so we can leave as is.
(not true for COMPlus_TieredCompilation=0)

@AndyAyersMS
Copy link
Member

Would it make sense to do this optimization for tier-1 JIT only or only inside loops?

The interesting case is when we're optimizing a method without having first run a Tier0 or prejitted version. So that would mean non-Tier1 optimized code (in our current world, AggressiveOptimization, non-prejitted methods with loops, and other things that currently bypass tiering like dynamic methods).

But perhaps it's not too much of a stretch to continue to enable this on for all optimized code as we don't expect this to do much for methods where we've already run an earlier version.

To restrict this to calls in loops, you should try using the loop info to detect loop membership instead of relying on BBF_BACKWARDS_BRANCH (and maybe just consider calls in inner loops). This opt would then mitigate the cost of the init checks that we were unable to hoist. As an alternative, you might consider having the loop optimizer mark these calls for you when it fails to do the hoisting (or fails to do loop inversion, which will inhibit hoisting).

@EgorBo
Copy link
Member Author

EgorBo commented Mar 8, 2021

To restrict this to calls in loops,

I understand, fgHasLoops was a pretty conservative check - but it reduced the diff by 10x :-)
The optimization is still useful when the init call is not inside a loop, isn't? E.g. it's located in a method which is called from a different method in a loop (and that first method wasn't inlined e.g. for size reasons).

@EgorBo EgorBo force-pushed the jit-inlined-is-static-inited-check branch from db9e5c5 to 15318fa Compare March 8, 2021 16:53
@AndyAyersMS
Copy link
Member

The optimization is still useful when the init call is not inside a loop,

Sure, but it's a question of benefit/cost ratio. If the call is not inside a loop then the benefit (cycles saved per call to the root method) is less but the cost (extra bytes of code) is the same. Or to put it another way, the loop is a benefit multiplier.

I still think some of those absolute code size increases in your diff summary seem large; if we can find a way to reduce that then perhaps we can consider enabling this more broadly. So we should verify we're getting the codegen we expect there.

@EgorBo
Copy link
Member Author

EgorBo commented Apr 5, 2021

Ping myself

@EgorBo
Copy link
Member Author

EgorBo commented May 10, 2021

Will back to it later, I couldn't find cases where non-hoistable static initializations were inside loops in asm diffs

@EgorBo EgorBo closed this May 10, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Optimize field access in types not marked with beforefieldinit
5 participants