RyuJIT: Inlined "is class statically inited" check #47901

EgorBo · 2021-02-05T10:47:11Z

For members with static constructors or beforefieldinit we emit a helper call to initialize the classes. This PR adds an additional quick check "is already inited?" so we can get rid of the call overhead (which is supposed to be executed only once).

Example:

static int s_Field = 42; // = 42 introduces a static constructor

static int Test()
{
    return s_Field;
}

Current codegen:

G_M36434_IG01:
       sub      rsp, 40
G_M36434_IG02:
       mov      rcx, 0xD1FFAB1E
       mov      edx, 1
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       mov      eax, dword ptr [reloc classVar[0xd1ffab1e]]
G_M36434_IG03:
       add      rsp, 40
       ret      
; Total bytes of code 35

New codegen:

G_M36434_IG01:
       sub      rsp, 40
G_M36434_IG02:
+      test     byte  ptr [(reloc)], 1           ;; <--- fast check for initialization status
+      je       SHORT G_M36434_IG05
G_M36434_IG03:
       mov      eax, dword ptr [reloc classVar[0xd1ffab1e]]
G_M36434_IG04:
       add      rsp, 40
       ret      
+G_M36434_IG05:                                  ;; <--- cold block (executed just once per lifetime)
+       mov      rcx, 0xD1FFAB1E
+       call     CORINFO_HELP_INITCLASS
+       jmp      SHORT G_M36434_IG03
; Total bytes of code 41 (+6 bytes)

In most cases such init-calls are eliminated in the real world, because a type can be already initialized at the JIT time (e.g. during re-jitting tier0 -> tier1), but it's not the case for methods with loops because those aren't re-jitted atm (only if COMPlus_TC_QuickJitForLoops=1 is set) and we might end up with slow call init forever especially when if it's located inside a loop (in case of static constructors we can't hoist such calls).

/cc @dotnet/jit-contrib

src/coreclr/jit/flowgraph.cpp

AndyAyersMS · 2021-02-05T20:33:28Z

I wonder if we should consider using this check as a gating condition for loop cloning... Also may factor into the do-while transformation heuristics, these try to account for the potential "savings" from hoisting the class init call check out of the loop.

So for some subset of loops we would produce a loop that knows classes are inited and does no checks, and another that will conditionally check within the loop body as above.

cc @BruceForstall

AndyAyersMS · 2021-02-09T21:27:24Z

Can you look at the case where there are two static accesses in a method, one of which dominates the other (say, separated by some unrelated flow, just to make things interesting), and make sure we're only emitting one check and call?

Eg

  var x = s_f1;
  if (...) { }
  var y = s_f2;  // we should not need a check here (when optimizing)

My concern is that if we make the call to init the class conditional under the check, then the jit may no longer realize that the dominated check and call are redundant.

EgorBo · 2021-02-09T21:43:04Z

Can you look at the case where there are two static accesses in a method, one of which dominates the other (say, separated by some unrelated flow, just to make things interesting), and make sure we're only emitting one check and call?

Eg
  var x = s_f1;
  if (...) { }
  var y = s_f2;  // we should not need a check here (when optimizing)
My concern is that if we make the call to init the class conditional under the check, then the jit may no longer realize that the dominated check and call are redundant.

@AndyAyersMS unfortunately it regresses that case, I wonder if I should do this optimization later after loopHoisting.
It will allow to also handle class-inits for beforefieldinit cases - currently I skip those to keep them "LICM friendly" (note that GTF_CALL_HOISTABLE check). So I'll do it without the QMARK/COLON just like how we insert the GC safepoints for SuppressGCTransition.

This optimization took me a bit longer to implement, mainly because of COMMA and QMARK but at least I now know more about them and how they are expanded.

AndyAyersMS · 2021-02-09T21:57:27Z

I wonder if I should do this optimization later after loopHoisting.

Perhaps... you might also be able to work around this by leaving some sort of pseduo-op at the join point below the initial check, something that value numbers the same as the helper call but doesn't get treated as a call otherwise, and emits no code.

If we do that and mark the indirection to fetch the init flag nonfaulting hopefully the jit suppresses the now-dominated call and realizes it has an empty if/then guarded by a side-effect free predicate, and cleans it all up nicely.

EgorBo · 2021-03-03T14:47:27Z

Rewrote to be just before the "insert gc safepoints" phase (closer to the rationalization phase)
So I don't have to care about hoisting/jump threading and can only emit it where necessary.

Even if the current approach won't make it I still learned a lot about the existing flowgraph APIs to manipulate with blocks.

EgorBo · 2021-03-03T21:39:35Z

Just checked the jump-threading example:

...
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Test(bool cond1, bool cond2)
    {
        int z = 0;
        if (cond1)
        {
            z += CctorTestClass.A;

            if (cond2) 
                z += CctorTestClass.B;
        }
        return z;
    }
}

static class CctorTestClass
{
    public static readonly int A;
    public static readonly int B;

    static CctorTestClass()
    {
        A = 42;
        B = 43;
    }
}

Emits:

G_M50630_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
       8BF2                 mov      esi, edx
						;; bbWeight=1    PerfScore 1.50
G_M50630_IG02:              ;; offset=0007H
       33C0                 xor      eax, eax
       84C9                 test     cl, cl
       741C                 je       SHORT G_M50630_IG05
						;; bbWeight=1    PerfScore 1.50
G_M50630_IG03:              ;; offset=000DH
       0FB6059E1C1000       movzx    rax, byte  ptr [(reloc)]
       A801                 test     al, 1
       7417                 je       SHORT G_M50630_IG06
						;; bbWeight=0.50 PerfScore 1.63
G_M50630_IG04:              ;; offset=0018H
       8B05961C1000         mov      eax, dword ptr [reloc classVar[0x6d6303a0]]
       4084F6               test     sil, sil
       7406                 je       SHORT G_M50630_IG05
       03058F1C1000         add      eax, dword ptr [reloc classVar[0x6d6303b8]]
						;; bbWeight=0.50 PerfScore 1.63
G_M50630_IG05:              ;; offset=0029H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      
						;; bbWeight=1    PerfScore 1.75
G_M50630_IG06:              ;; offset=002FH
       48B960EF606DF97F0000 mov      rcx, 0x7FF96D60EF60
       BA02000000           mov      edx, 2
       E8DDB61A5F           call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       EBD3                 jmp      SHORT G_M50630_IG04
						;; bbWeight=0    PerfScore 0.00

with a single check as expected, the cctor block is cold.
the only thing - it looks like the check could be optimized from

       0FB6059E1C1000       movzx    rax, byte  ptr [(reloc)]
       A801                 test     al, 1
       7417                 je       SHORT G_M50630_IG06

to

       A801                 test     byte  ptr [(reloc)], 1
       7417                 je       SHORT G_M50630_IG06

but it's a different issue

AndyAyersMS

Moving this later was a good idea.

Left you some notes.

src/coreclr/jit/flowgraph.cpp

EgorBo · 2021-03-05T22:36:14Z

Jit-diffs: https://gist.github.com/EgorBo/2ba6bcc04c5d68a295f758428499c34a
Jit-diffs with CORINFO_HELP_CLASSINIT_SHARED_DYNAMICCLASS: https://gist.github.com/EgorBo/404d59535b4926266c708d7324e1d76a

Those are quite huge because of PMI nature. but I'd love to see if it affects the benchmarks.

…static-inited-check

EgorBo · 2021-03-07T22:04:19Z

The PR is green.
@dotnet/jit-contrib @jkotas PTAL

AndyAyersMS · 2021-03-08T00:14:16Z

From the dynamic pmi gist:

Total bytes of delta: 561981 (1.04% of base)

This is a fairly large diff and I wonder if there are things we could do to mitigate to reduce the impact on crossgen

I assume a crossgen diff shows similar code size growth? Can you look at this?

I'm curious what is going on in some of the methods with very high % diffs. Can you show the old/new codegen for something like:

          32 (106.67% of base) : System.Private.Xml.dasm - Datatype_anySimpleType:.ctor():this

AndyAyersMS

Can you try and test cases where there are multiple class init calls in a single tree?

Not sure if we ever create these but if so it would be good to verify things work as expected.

src/coreclr/jit/flowgraph.cpp

jkotas · 2021-03-08T07:09:14Z

This is a fairly large diff and I wonder if there are things we could do to mitigate to reduce the impact on crossgen

As implemented, it won't kick in for crossgen. And if it did kick in for crossgen, the size regression would likely be even bigger due to extra indirections and fixups.

Would it make sense to do this optimization for tier-1 JIT only or only inside loops? The static constructors should have run for the most part already, and the rest should not add up to much.

EgorBo · 2021-03-08T16:06:08Z

Would it make sense to do this optimization for tier-1 JIT only or only inside loops?

it's currently tier-1 jit only, will check the diff for "loop-only" (BBF_BACKWARD_JUMP)

jkotas · 2021-03-08T16:12:02Z

it's currently tier-1 jit only,

I meant when we we are recompiling code that run in the current process already at least one. It is different from enabling it when optimizations are on.

EgorBo · 2021-03-08T16:20:25Z

it's currently tier-1 jit only,

I meant when we we are recompiling code that run in the current process already at least one. It is different from enabling it when optimizations are on.

um.. the recompiled code is supposed to be initclass-free, isn't it? all classes will init itself on tier0.
(at least the non-dynamic ones)

EgorBo · 2021-03-08T16:42:10Z

Quick experiment:

Current diff for corelib (without the dynamic cases): 11926 : System.Private.CoreLib.dasm (0.26% of base)

Diff for methods with fgHasLoops (means, the optimization is enabled only if a method has any loop): 1870 : System.Private.CoreLib.dasm (0.04% of base)

From my understanding the non-loop ones are expected to get rid of the static init by their own so we can leave as is.
(not true for COMPlus_TieredCompilation=0)

AndyAyersMS · 2021-03-08T16:45:08Z

Would it make sense to do this optimization for tier-1 JIT only or only inside loops?

The interesting case is when we're optimizing a method without having first run a Tier0 or prejitted version. So that would mean non-Tier1 optimized code (in our current world, AggressiveOptimization, non-prejitted methods with loops, and other things that currently bypass tiering like dynamic methods).

But perhaps it's not too much of a stretch to continue to enable this on for all optimized code as we don't expect this to do much for methods where we've already run an earlier version.

To restrict this to calls in loops, you should try using the loop info to detect loop membership instead of relying on BBF_BACKWARDS_BRANCH (and maybe just consider calls in inner loops). This opt would then mitigate the cost of the init checks that we were unable to hoist. As an alternative, you might consider having the loop optimizer mark these calls for you when it fails to do the hoisting (or fails to do loop inversion, which will inhibit hoisting).

EgorBo · 2021-03-08T16:50:06Z

To restrict this to calls in loops,

I understand, fgHasLoops was a pretty conservative check - but it reduced the diff by 10x :-)
The optimization is still useful when the init call is not inside a loop, isn't? E.g. it's located in a method which is called from a different method in a loop (and that first method wasn't inlined e.g. for size reasons).

AndyAyersMS · 2021-03-08T17:25:56Z

The optimization is still useful when the init call is not inside a loop,

Sure, but it's a question of benefit/cost ratio. If the call is not inside a loop then the benefit (cycles saved per call to the root method) is less but the cost (extra bytes of code) is the same. Or to put it another way, the loop is a benefit multiplier.

I still think some of those absolute code size increases in your diff summary seem large; if we can find a way to reduce that then perhaps we can consider enabling this more broadly. So we should verify we're getting the codegen we expect there.

EgorBo · 2021-04-05T16:51:31Z

Ping myself

EgorBo · 2021-05-10T07:20:43Z

Will back to it later, I couldn't find cases where non-hoistable static initializations were inside loops in asm diffs

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 5, 2021