Use xmm for stack prolog zeroing rather than rep stos #32538
Conversation
Example change:

sub rsp, 88
- mov rsi, rcx
- lea rdi, [rsp+20H]
- mov ecx, 12
- xor rax, rax
- rep stosd
- mov rcx, rsi
+ xorps xmm4, xmm4
+ movdqu xmmword ptr [rsp+20H], xmm4
+ movdqu xmmword ptr [rsp+30H], xmm4
+ movdqu xmmword ptr [rsp+40H], xmm4
mov qword ptr [rsp+50H], rdx
sub rsp, 304
- mov rsi, rcx
- lea rdi, [rsp+40H]
- mov ecx, 60
- xor rax, rax
- rep stosd
- mov rcx, rsi
+ xorps xmm4, xmm4
+ mov rax, -240
+ movups xmmword ptr [rsp+rax+130H], xmm4
+ movups xmmword ptr [rsp+rax+140H], xmm4
+ movups xmmword ptr [rsp+rax+150H], xmm4
+ add rax, 48
+ jne SHORT -5 instr
mov rbp, rcx

Though this is still within the range amount of
// As we output multiple instructions for SIMD zeroing, we want to balance code size with throughput,
// so cap max size at 16 * MAX SIMD length
int initMaxSIMDSize = 16 * XMM_REGSIZE_BYTES;
if ((untrLclHi - untrLclLo) <= initMaxSIMDSize)
There may be locals in this region that do not need to be zero-initialized. Does it add up to anything significant? We may be able to do better by skipping zero-initialization for them.
That case is handled by the else branch, when genUseBlockInit is false, where it just zeros out the GC fields (not sure of the heuristics that determine it); it starts just after the shown GitHub diff, at else if (genInitStkLclCnt > 0).
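To make that split concrete, here is a rough conceptual sketch of the two paths (this is not the actual JIT code; the LocalSlot type and the zeroRange callback are invented for illustration):

#include <cstdio>
#include <vector>

struct LocalSlot
{
    int  offset;    // frame offset of the local
    int  size;      // size in bytes
    bool mustInit;  // e.g. a GC ref or an explicitly must-init local
};

// zeroRange stands in for "emit code that zeroes [offset, offset + size)".
void zeroInitFrame(bool useBlockInit, int untrLclLo, int untrLclHi,
                   const std::vector<LocalSlot>& locals,
                   void (*zeroRange)(int offset, int size))
{
    if (useBlockInit)
    {
        // Block init: clear the whole untracked range in one go
        // (rep stos before this PR, xmm stores when the range is small enough after it).
        zeroRange(untrLclLo, untrLclHi - untrLclLo);
    }
    else
    {
        // Otherwise only the locals that actually need zeroing (e.g. GC refs) are cleared.
        for (const LocalSlot& slot : locals)
        {
            if (slot.mustInit)
                zeroRange(slot.offset, slot.size);
        }
    }
}

int main()
{
    auto print = [](int offset, int size) { std::printf("zero [%XH, %XH)\n", offset, offset + size); };
    std::vector<LocalSlot> locals = { {0x20, 8, true}, {0x28, 8, false}, {0x30, 16, true} };
    zeroInitFrame(true,  0x20, 0x40, locals, print);  // block init path
    zeroInitFrame(false, 0x20, 0x40, locals, print);  // per-local path
}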
E.g. it may be worth it to build a bit-mask of what needs to be zero-initialized, and then go over the bitmask to do the zero-init.
Do have an issue open for double zeroing #2325
Could use the same logic the genInitStkLclCnt else block uses to drop any xmm blocks that don't have something to zero? (in a follow up)
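A minimal sketch of how those two ideas could combine (a hypothetical emit helper, not the real JIT code; assume one bit per 8-byte slot of the untracked range): build a mask of the slots that need zeroing and only emit a 16-byte store for chunks that contain at least one set bit.

#include <cstdint>
#include <cstdio>

// mask: bit i set => the 8-byte slot at frameOffset + 8*i must be zero-initialized.
// Emits one xmm-sized store per 16-byte chunk that has anything live in it,
// and skips chunks with nothing to zero.
void emitZeroInitFromMask(uint64_t mask, int frameOffset)
{
    for (int slot = 0; slot < 64; slot += 2) // two 8-byte slots per 16-byte store
    {
        if (((mask >> slot) & 0x3) != 0)
            printf("movups xmmword ptr [rsp+%02XH], xmm4\n", frameOffset + slot * 8);
    }
}

int main()
{
    // Example: slots 0-1 and 6-7 need zeroing; the stores for slots 2-5 are skipped.
    emitZeroInitFromMask(0b11000011, 0x20);
}

Over-zeroing within a chunk (clearing both slots when only one needs it) is harmless; the point is just to skip 16-byte chunks that contain nothing needing zero-init.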
cc @erozenfeld Do we need to go all the way to 256 bytes for methods in Kestrel? It would be good to highlight the methods that need a lot of prolog zeroing so we can see if other parts of the heuristic need reworking, or whether other changes -- either to the sources or to the jit -- might be able to remove the need for this much zeroing. Can you look at code size impact outside of corelib (say via
256 bytes looks to be what Clang/LLVM do by default (or 512 bytes for AVX2): https://godbolt.org/z/geQ9u9 Perhaps this is one of the cases where we could generate the larger code unless the user has explicitly opted for "small code generation"?
When optimizing for size; MSVC (
Is there an easy way to check if code is being produced for R2R? Could have lower limits for that?
Jitted code will have more zeroing than typical C/C++ code so a more conservative cutover may be appropriate.
Yes,
We have important customers that are not using tiered-JIT and care about code quality. I do not think we should be changing the cut-off for R2R.
As @tannergooding mentions, 16 SIMD stores is the Clang switch-over point; so either 16x xmm (256 bytes) or 16x ymm (512 bytes). Intel suggests 128 bytes as the switch-over for rep stos.
The Kestrel method I was looking at (https://gist.github.com/benaadams/0a44b0cb1c0aab57d6525a0eedc0672f) is 10x xmm. Even though it is high, hopefully it is infrequent?
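For reference, a small repro that can be fed to Clang (as in the godbolt link above) to see that switch-over; the exact cut-off depends on compiler version, optimization level and target features, so the 256/512 byte figures are observed defaults rather than guarantees.

// Compile with clang -O2 and inspect the assembly: the smaller buffer is typically
// zeroed with inline xmm/ymm stores, while the larger one is usually lowered to a
// call to memset.
extern void consume(char* p);

void small_frame()
{
    char buf[256] = {};   // at/below the ~16-store cut-off
    consume(buf);
}

void large_frame()
{
    char buf[4096] = {};  // well above the cut-off
    consume(buf);
}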
Rebased to kick-start CI
cc @dotnet/jit-contrib
Of those 168 bytes, only 64 bytes are GC refs (you can see this by looking at the untracked GC lifetimes). So perhaps the "bit vector" approach has some appeal?
If we go that route the logic should live in
Think I discovered what was going wrong in the other PR with alignment; so have included aligning
I was wondering if it could be made a little more generic and used for
Right, it would be nice to share logic with other similar cases, though there can be different constraints on codegen (GC-safe vs non-GC-safe, for instance) that impact the allowable code sequences.
Linux build issue looks like #2234.
Perf failures look like download issues, #32835.
@benaadams can you share out your perf tests? I'd like to see us run them on an AMD box and perhaps get them into the performance repo.
Sure: https://gist.github.com/benaadams/fed05032ff6c27e2d245135fe51e5496
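The gist above has the actual tests; purely as a stand-alone illustration of the comparison (assumes x86-64 with GCC/Clang-style inline asm; not the methodology or numbers from this PR), a rough C++ analogue could look like:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <emmintrin.h>

// Zero `qwords` 8-byte slots with rep stosq (rdi = destination, rcx = count, rax = value).
static void zero_rep_stos(void* dst, size_t qwords)
{
    asm volatile("rep stosq" : "+D"(dst), "+c"(qwords) : "a"(uint64_t{0}) : "memory");
}

// Zero `bytes` bytes with unrolled 16-byte xmm stores.
static void zero_xmm(void* dst, size_t bytes)
{
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(static_cast<char*>(dst) + i), zero);
}

template <typename F>
static double averageNs(F f)
{
    constexpr int iterations = 10'000'000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
    {
        f();
        asm volatile("" ::: "memory"); // keep the stores from being optimized away
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(end - start).count() / iterations;
}

int main()
{
    alignas(16) static char frame[256]; // stand-in for a 256 byte prolog zeroing block
    double repNs = averageNs([] { zero_rep_stos(frame, sizeof(frame) / 8); });
    double xmmNs = averageNs([] { zero_xmm(frame, sizeof(frame)); });
    printf("rep stos: %.2f ns/clear, xmm: %.2f ns/clear\n", repNs, xmmNs);
    return frame[0];
}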
I'll run these on my Ryzen 1800X and Ryzen 3900X tomorrow for additional reference points. Edit: Started building the branch so I can run tests, will expect results shortly.
Thanks, @tannergooding -- looks qualitatively similar to the Intel results. And @benaadams thanks for all your hard work here, this is a really nice improvement. @dotnet/jit-contrib I think this is ready to merge, any final comments?
I'll review this later today.
@erozenfeld didn't see your comment in time -- we can revise if needed.
It might be good to check the perf numbers tomorrow after they have gotten a chance to run 😄
@erozenfeld I was wondering if it changed the cost/benefit for "JIT: Support object stack allocation" any? #11192

e.g.

class Foo
{
    public long f1;
    public long f2;
    public long f3;
    public long f4;

    public Foo(long f1, long f2)
    {
        this.f1 = f1;
        this.f2 = f2;
    }
}

class Test
{
    static long f1;
    static long f2;

    public static int Main()
    {
        Foo foo = new Foo(f1, f2);
        return (int)(foo.f1 + foo.f2);
    }
}

With the object allocation on the heap, the current codegen is:

;* V00 loc0 [V00 ] ( 0, 0 ) ref -> zero-ref class-hnd exact
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T00] ( 5, 10 ) ref -> rax class-hnd exact "NewObj constructor temp"
; V03 tmp2 [V03,T01] ( 2, 4 ) long -> rdx "Inlining Arg"
; V04 tmp3 [V04,T02] ( 2, 4 ) long -> rcx "Inlining Arg"
;
; Lcl frame size = 40
G_M59477_IG01:
sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M59477_IG02:
mov rcx, 0xD1FFAB1E
call CORINFO_HELP_NEWSFAST
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
mov rcx, qword ptr [reloc classVar[0xd1ffab1e]]
mov qword ptr [rax+8], rdx
mov qword ptr [rax+16], rcx
mov edx, dword ptr [rax+8]
add edx, dword ptr [rax+16]
mov eax, edx
;; bbWeight=1 PerfScore 9.50
G_M59477_IG03:
add rsp, 40
ret
;; bbWeight=1 PerfScore 1.25

With object stack allocation enabled, and with this change:

; V00 loc0 [V00,T02] ( 3, 3 ) long -> rax class-hnd exact
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp"
; V03 tmp2 [V03,T00] ( 2, 4 ) long -> rax "Inlining Arg"
; V04 tmp3 [V04,T01] ( 2, 4 ) long -> rdx "Inlining Arg"
; V05 tmp4 [V05 ] ( 4, 4 ) struct (40) [rsp+0x08] do-not-enreg[XSF] must-init addr-exposed "MorphAllocObjNodeIntoStackAlloc temp"
-; Lcl frame size = 48
+; Lcl frame size = 40
G_M59477_IG01:
- push rdi
- sub rsp, 48
- lea rdi, [rsp+08H]
- mov ecx, 10
- xor rax, rax
- rep stosd
+ sub rsp, 40
+ vxorps xmm4, xmm4
+ vmovdqa xmmword ptr [rsp], xmm4
+ vmovdqa xmmword ptr [rsp+10H], xmm4
+ xor rax, rax
+ mov qword ptr [rsp+20H], rax
- ;; bbWeight=1 PerfScore 27.25
+ ;; bbWeight=1 PerfScore 3.83
G_M59477_IG02:
mov rax, 0xD1FFAB1E
- mov qword ptr [rsp+08H], rax
+ mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
- mov qword ptr [rsp+10H], rax
- mov qword ptr [rsp+18H], rdx
- lea rax, [rsp+08H]
+ mov qword ptr [rsp+08H], rax
+ mov qword ptr [rsp+10H], rdx
+ lea rax, [rsp]
mov edx, dword ptr [rax+8]
add edx, dword ptr [rax+16]
mov eax, edx
;; bbWeight=1 PerfScore 10.00
G_M59477_IG03:
- add rsp, 48
+ add rsp, 40
- pop rdi
ret
;; bbWeight=1 PerfScore 1.75
From there, if the lea rax, [rsp] were folded into the address modes of the loads:

G_M59477_IG02:
mov rax, 0xD1FFAB1E
mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
mov qword ptr [rsp+08H], rax
mov qword ptr [rsp+10H], rdx
+ mov edx, dword ptr [rsp+08H]
+ add edx, dword ptr [rsp+10H]
- lea rax, [rsp]
- mov edx, dword ptr [rax+8]
- add edx, dword ptr [rax+16]
mov eax, edx

Since it's operating on the frame pointer, could then eliminate the reg->stack, stack->reg?

G_M59477_IG02:
mov rax, 0xD1FFAB1E
mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
- mov qword ptr [rsp+08H], rax
- mov qword ptr [rsp+10H], rdx
- mov edx, dword ptr [rsp+08H]
- add edx, dword ptr [rsp+10H]
+ add edx, eax
mov eax, edx

Reverse the add:

G_M59477_IG02:
mov rax, 0xD1FFAB1E
mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
- add edx, eax
- mov eax, edx
+ add eax, edx

Then remove the write that is never read:

G_M59477_IG02:
- mov rax, 0xD1FFAB1E
- mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
add eax, edx

From there, determine we don't need any stack anyway? 🤔
I'm glad this will potentially help with CQ of methods with object stack allocation. Currently the main issue is not the quality of the code that's generated but the fact that very few allocations can be safely moved to the stack without an interprocedural escape analysis. We have some ideas on how to approach that problem but this work is not in near-term plans.
System.Text.Json tests might be one to look at
No problem. I reviewed the changes and they look good.
Thanks for the insight Ben! Looking at the data we are seeing big wins (7%-20%) across the suite of JSON tests. I have added an image of a graph from the report below. That kind of drop is what we are seeing across over half of the JSON tests, with the other half within the noise range of the tests. You can find the full reports here:
I looked but there were no new perf reports on our internal share. Must be that jitted code is now too fast to measure.
Courtesy of @sebastienros (source): +5% to Plaintext (extra 200k requests per second)
Use xmm registers to clear prolog blocks rather than rep stosd.

rep stos does have a shallower growth rate so it will eventually overtake; however that is unlikely in the size range used for stack clearance. Also its variability based on small changes in stack size is problematic, as just adding an extra variable to a method (with no other changes) could add or remove 9ns from its running time, whereas the xmm clear in this PR is much smoother and changes more predictably.
Focusing in on the more common <= 256 byte range:

And throughput:
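To make the growth-rate argument concrete, here is a toy cost model; the constants are placeholders picked only to show the shape of the curves, not measurements from this PR or any particular CPU. rep stos pays a fixed start-up cost and then scales cheaply per byte, while the xmm clear has essentially no start-up cost but needs one store per 16 bytes, so rep stos only wins once the block gets fairly large.

#include <cstdio>

// Placeholder constants, purely illustrative -- measure on real hardware.
constexpr double kRepStosStartup = 30.0;  // fixed cost before rep stos starts streaming
constexpr double kRepStosPerByte = 0.05;  // per-byte cost once streaming
constexpr double kXmmPerStore    = 1.0;   // cost per 16-byte store

double repStosCost(int bytes) { return kRepStosStartup + bytes * kRepStosPerByte; }
double xmmCost(int bytes)     { return (bytes / 16.0) * kXmmPerStore; }

int main()
{
    for (int bytes = 32; bytes <= 4096; bytes *= 2)
    {
        printf("%5d bytes: rep stos ~%6.1f, xmm ~%6.1f (arbitrary units)\n",
               bytes, repStosCost(bytes), xmmCost(bytes));
    }
}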
As the prolog doesn't support jmp labels, I have taught the x86/x64 Jit to understand relative jmps by instruction count (as per Arm) for the prolog. E.g. output prolog for a 1024 byte stack clear method:
Switch to movups rather than movdqu, as it is a shorter encoding with the same functionality, to reduce code size regressions.

Contributes to #8890
Resolves #32396
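On the relative-jmp-by-instruction-count point (the jne SHORT -5 instr in the example diff above): conceptually the emitter can turn "branch back N instructions" into a byte displacement by summing the encoded sizes of the instructions being jumped over. The sketch below uses made-up instruction sizes and an assumed convention (displacement measured from the end of the branch back to the start of the Nth previous instruction); the real emitter's bookkeeping will differ.

#include <cstdio>
#include <vector>

struct EmittedInstr
{
    const char* name;
    int         encodedSize; // made-up sizes, for illustration only
};

// Resolve "jump back instrCount instructions" (instrCount > 0) to a negative byte
// displacement, assuming x86-style relative branches measured from the end of the
// branch instruction itself.
int resolveBackwardBranch(const std::vector<EmittedInstr>& code, int branchIndex, int instrCount)
{
    int bytes = code[branchIndex].encodedSize; // step back over the branch itself
    for (int i = branchIndex - instrCount; i < branchIndex; i++)
        bytes += code[i].encodedSize;
    return -bytes;
}

int main()
{
    std::vector<EmittedInstr> prologLoop = {
        {"movups [rsp+rax+130H], xmm4", 9},
        {"movups [rsp+rax+140H], xmm4", 9},
        {"movups [rsp+rax+150H], xmm4", 9},
        {"add rax, 48",                 4},
        {"jne SHORT (loop head)",       2},
    };
    // Branch at index 4 jumping back to the first movups (4 instructions earlier
    // under this sketch's counting convention).
    printf("rel displacement: %d bytes\n", resolveBackwardBranch(prologLoop, 4, 4));
}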