
[Perf][Windows_NT][x86] Investigate ByteMark/BenchLUDecomp regression. #9833

Open

jorive opened this issue Feb 27, 2018 · 13 comments
Labels: area-CodeGen-coreclr, JitUntriaged, optimization, tenet-performance
Milestone: Future

Comments

@jorive (Member) commented Feb 27, 2018

There is a ~4.5% regression from release/2.0.0 to release/2.1.

category:cq
theme:benchmarks
skill-level:intermediate
cost:small

@AndyAyersMS (Member)

Will investigate.

AndyAyersMS self-assigned this Mar 12, 2018
@AndyAyersMS (Member)

For x86, I see 2.0 at about 1060, 2.1 at 1068. So maybe 1% slower.

For x64, I see 2.0 at 714, 2.1 at 714.

So I don't see any regression here.

@jorive can you double-check what you saw?

@AndyAyersMS (Member)

Oops, I was looking at SciMark's LU. Hold on a sec.

@AndyAyersMS (Member)

Ok, for Bytemark LU, x86: 2.0 is around 2000, 2.1 is noisy but somewhere between 2060 and 2110.

History is interesting here -- both master and 2.0 show a large impact from a CSE bug fix (dotnet/coreclr#15323 and dotnet/coreclr#15360, respectively). master has a couple of later, smaller regressions that look harder to pin down.

Here's master:

[image: perf-over-time chart for master]

and here's 2.0:

[image: perf-over-time chart for 2.0]

So it's probably worth looking back at that change too.

Note x64 perf on the same test is comparable in 2.0 and master/2.1 and was not impacted by the CSE fix.

@AndyAyersMS (Member)

Time is spent in ludcmp. This is a large method with 9 loops, including a few nested loops. In both 2.0 and 2.1 the jit decides to clone 8 of the loops, with a huge amount of code duplication. This makes perf analysis a bit more challenging as it is harder to tell from just diffs which versions of the loops execute.

Some experimentation shows cloning hurts perf quite a bit too (locally, in master, goes from 1730 with cloning to 1520 without). So I'll link this to the general item for re-examining cloning: #8558.
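
For context, a minimal hand-written sketch of what cloning produces for a loop like the inner one here (C#; not actual JIT output, and the guard conditions are illustrative only):

    static class CloningSketch
    {
        // Illustration only: cloning emits a guarded "fast" copy of the loop,
        // where the JIT has proven the array accesses in range and can drop
        // bounds checks, plus the original "slow" copy as a fallback.
        internal static double InnerLoop(double[][] a, int i, int j, double sum)
        {
            if (a != null && (uint)i < (uint)a.Length && a[i] != null
                /* ...plus whatever checks cover the a[k] and a[k][j] accesses... */)
            {
                for (int k = 0; k < i; k++)
                    sum -= a[i][k] * a[k][j];   // fast clone: bounds checks elided
            }
            else
            {
                for (int k = 0; k < i; k++)
                    sum -= a[i][k] * a[k][j];   // slow clone: fully checked
            }
            return sum;
        }
    }

Duplicating 8 loops this way accounts for the code-size blowup noted above.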

@AndyAyersMS (Member)

Backing out the changes from dotnet/coreclr#15323 gets me to around 1470 locally in master. Will look at diffs and see whether we were cheating before or lost a good optimization somehow.

@AndyAyersMS (Member)

Still not super-confident I understand why the newer version is slower. Inner loops look pretty comparable with one exception. I think this is the clone version of the first deeply nested loop:

                        for (k = 0; k < i; k++)
                            sum -= a[i][k] * a[k][j];

In the after version the loop body is disconnected, and there is a reload of a spilled value from [ebp-38H]. This may be why the test now runs slower.

;;; BEFORE

G_M19375_IG39:
       8B5D88       mov      ebx, gword ptr [ebp-78H]
       8BFB         mov      edi, ebx
       C4E17B1054D708 vmovsd   xmm2, qword ptr [edi+8*edx+8]
       8B7C9108     mov      edi, gword ptr [ecx+4*edx+8]
       3B7704       cmp      esi, dword ptr [edi+4]
       0F8342080000 jae      G_M19375_IG101
       C4E16B5954F708 vmulsd   xmm2, qword ptr [edi+8*esi+8]
       C4E1735CCA   vsubsd   xmm1, xmm2
       42           inc      edx
       3BD6         cmp      edx, esi
       895D88       mov      gword ptr [ebp-78H], ebx
       7CD3         jl       SHORT G_M19375_IG39

;;; AFTER

G_M19375_IG37:
       8B5D90       mov      ebx, gword ptr [ebp-70H]
       8BCB         mov      ecx, ebx
       C4E17B1054D108 vmovsd   xmm2, qword ptr [ecx+8*edx+8]
       8B4DC8       mov      ecx, gword ptr [ebp-38H]
       8B449108     mov      eax, gword ptr [ecx+4*edx+8]
       3B7004       cmp      esi, dword ptr [eax+4]
       0F8346080000 jae      G_M19375_IG102
       C4E16B5954F008 vmulsd   xmm2, qword ptr [eax+8*esi+8]
       C4E1735CCA   vsubsd   xmm1, xmm2
       42           inc      edx
       3BD6         cmp      edx, esi
       895D90       mov      gword ptr [ebp-70H], ebx
       7C0B         jl       SHORT G_M19375_IG38

       8B5DEC       mov      ebx, dword ptr [ebp-14H]
       897DD0       mov      dword ptr [ebp-30H], edi
       8B45CC       mov      eax, dword ptr [ebp-34H]
       EB7B         jmp      SHORT G_M19375_IG48

G_M19375_IG38:
       894DC8       mov      gword ptr [ebp-38H], ecx
       EBC0         jmp      SHORT G_M19375_IG37

This method is probably a good stress test for spill placement as a whole, as there are large runs of blocks that exist only to spill and reload. So linking in dotnet/coreclr#16857. Likely cloning plays a big role in creating lots of pressure and backedges. For example:

G_M19375_IG40:
       8B5D88       mov      ebx, gword ptr [ebp-78H]
       EB2F         jmp      SHORT G_M19375_IG48

G_M19375_IG41:
       8B5D88       mov      ebx, gword ptr [ebp-78H]
       EB2A         jmp      SHORT G_M19375_IG48

G_M19375_IG42:
       895D88       mov      gword ptr [ebp-78H], ebx
       EB5E         jmp      SHORT G_M19375_IG49

G_M19375_IG43:
       895D88       mov      gword ptr [ebp-78H], ebx
       EB59         jmp      SHORT G_M19375_IG49

G_M19375_IG44:
       8B55F0       mov      edx, dword ptr [ebp-10H]
       E972010000   jmp      G_M19375_IG56

G_M19375_IG45:
       8B5DC0       mov      ebx, gword ptr [ebp-40H]
       E9C0FEFFFF   jmp      G_M19375_IG35

G_M19375_IG46:
       895DC0       mov      gword ptr [ebp-40H], ebx
       E9F5FEFFFF   jmp      G_M19375_IG36

G_M19375_IG47:
       895DC0       mov      gword ptr [ebp-40H], ebx
       E9EDFEFFFF   jmp      G_M19375_IG36

So for the immediate 2.1 vs 2.0 issue I don't see anything fixable in the 2.1 timeframe.

I am still going to try and understand the larger regression that both these branches saw a while back...

@AndyAyersMS (Member) commented Mar 13, 2018

Looking at the impact of dotnet/coreclr#15323: while jitting ludcmp, the jit no longer removes some useless statements during CSE (as expected):

;;; BEFORE
Removing statement [003692] in BB120 as useless:
     ( 13, 16)              [003692] ------------              *  STMT      void  (IL   ???...  ???)
N014 (  0,  0)              [003690] ------------              |  /--*  NOP       void  
N015 ( 13, 16)              [003691] ----G-------              \--*  COMMA     void  
                            [003754] ------------                 \--*  LCL_VAR   ref    V46 cse6          <l:$d64, c:$d63>
New refCnts for V46: refCnt =  7, refCntWtd = 14   

BB120 becomes empty
optValnumCSE removed tree:
N014 (  0,  0)              [003690] ------------              /--*  NOP       void  
N015 ( 13, 16)              [003691] ----G-------              *  COMMA     void  
                            [003754] ------------              \--*  LCL_VAR   ref    V46 cse6  
;;; AFTER

optValnumCSE morphed tree:
N002 (  0,  0)              [003690] ------------              /--*  NOP       void  
N003 (  1,  1)              [003691] ----G-------              *  COMMA     void  
N001 (  1,  1)              [003754] ------------              \--*  LCL_VAR   ref    V46 cse6          <l:$d64, c:$d63>

These statements all seem to sit in blocks that are on critical edges, e.g. BB119 below:

;;; BEFORE

BB10  [0060]  1       BB09                  4     [???..???)-> BB15  ( cond )                     internal 
BB11  [0127]  2       BB10,BB13            16     [024..032)-> BB13  ( cond )                     i target idxlen bwd LoopPH 

;;; AFTER

BB10  [0060]  1       BB09                  4     [???..???)-> BB15  ( cond )                     internal 
BB119 [0127]  1       BB10                  4     [024..???)                                      internal LoopPH 
BB11  [0002]  2       BB13,BB119           16     [024..032)-> BB13  ( cond )                     i Loop Loop0 label target idxlen bwd 

So these blocks and statements hang around a while.
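
(For readers following along: an edge is critical when its source has multiple successors and its destination has multiple predecessors, so code can only be placed on it by splitting it with a new block -- which is where otherwise-empty blocks like BB119 come from. A minimal C# sketch with hypothetical types:)

    using System.Collections.Generic;

    class BasicBlock
    {
        public List<BasicBlock> Successors { get; } = new List<BasicBlock>();
        public List<BasicBlock> Predecessors { get; } = new List<BasicBlock>();
    }

    static class CfgSketch
    {
        // Code placed on a critical edge can't go in either endpoint without
        // affecting other paths, so the compiler must insert a split block.
        public static bool IsCriticalEdge(BasicBlock pred, BasicBlock succ) =>
            pred.Successors.Count > 1 && succ.Predecessors.Count > 1;
    }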

But the dead statements eventually get removed and the flowgraph gets compacted back into the same shape. However, the block numbers differ (BB11 before becomes BB119 after), and in the after case the numbers no longer match the linear order:

;;; BEFORE

BB09  [0059]  1       BB06                  4     [???..???)-> BB15  ( cond )                     internal LIR 
BB10  [0060]  1       BB09                  4     [???..???)-> BB15  ( cond )                     internal LIR 
BB11  [0127]  2       BB10,BB13            16     [024..032)-> BB13  ( cond )                     i target idxlen bwd LoopPH LIR 
BB12  [0003]  1       BB11                  8     [032..03E)                                      i idxlen bwd LIR

;;; AFTER
BB09  [0059]  1       BB06                  4     [???..???)-> BB15  ( cond )                     internal LIR 
BB10  [0060]  1       BB09                  4     [???..???)-> BB15  ( cond )                     internal LIR 
BB119 [0127]  2       BB10,BB13            16     [024..032)-> BB13  ( cond )                     i target idxlen bwd LoopPH LIR 
BB12  [0003]  1       BB119                 8     [032..03E)                                      i idxlen bwd LIR 

This in turn changes the LSRA block sequence:

;;; BEFORE

New BlockSet epoch 12, # of blocks (including unused BB00): 124, bitset array size: 4 (long)
LSRA Block Sequence: BB01(  1   ) BB02(  1   ) BB04(  1   ) BB05(  4   ) BB06(  4   ) BB09(  4   ) BB10(  4   ) BB11( 16   ) BB12(  8   ) BB13( 16   ) 
                     BB14(  4   ) BB15( 16   ) BB16(  8   ) BB17( 16   ) BB18(  4   ) BB19(  4   ) BB20(  1   ) BB21(  4   ) BB22( 16   ) BB23(  8   ) 
                     BB24( 16   ) BB25(  4   ) BB26(  4   ) BB27(  0.50) BB28(  0.50) BB30(  0.50) BB31(  4   ) BB32(  2   ) BB33(  2   ) BB36( 16   ) 
                     BB37(  8   ) BB38(  8   ) BB41(  8   ) BB42(  8   ) BB43( 64   ) BB44(  8   ) BB45( 64   ) BB46( 16   ) BB47(  2   ) BB48( 16   ) 
                     BB49(  8   ) BB50( 64   ) BB51( 16   ) BB52(  4   ) BB53(  4   ) BB56( 16   ) BB57(  8   ) BB58(  8   ) BB61(  8   ) BB62(  8   ) 
                     BB63( 64   ) BB64(  8   ) BB65( 64   ) BB66( 16   ) BB67(  8   ) BB68( 16   ) BB69(  4   ) BB70( 16   ) BB71(  8   ) BB72( 64   ) 
                     BB73( 16   ) BB74(  8   ) BB75( 16   ) BB76(  4   ) BB77(  2   ) BB78(  2   ) BB81(  2   ) BB82(  2   ) BB83( 16   ) BB84(  2   ) 
                     BB85( 16   ) BB86(  2   ) BB87(  4   ) BB88(  2   ) BB89(  4   ) BB90(  2   ) BB91( 16   ) BB92(  4   ) BB93(  0.50) BB94(  4   ) 
                     BB95(  2   ) BB96( 16   ) BB97(  8   ) BB98( 64   ) BB99( 16   ) BB100(  4   ) BB101( 16   ) BB102(  8   ) BB103( 64   ) BB104( 16   ) 
                     BB105(  8   ) BB106( 16   ) BB107(  4   ) BB108(  2   ) BB109( 16   ) BB110(  2   ) BB111(  4   ) BB112(  2   ) BB113(  4   ) BB114(  2   ) 
                     BB115( 16   ) BB116(  4   ) BB117(  0.50) BB118(  0.50) BB123(  0   ) 

;;; AFTER

New BlockSet epoch 12, # of blocks (including unused BB00): 124, bitset array size: 4 (long)
LSRA Block Sequence: BB01(  1   ) BB02(  1   ) BB04(  1   ) BB05(  4   ) BB06(  4   ) BB09(  4   ) BB10(  4   ) BB15( 16   ) BB16(  8   ) BB17( 16   ) 
                     BB18(  4   ) BB19(  4   ) BB20(  1   ) BB21(  4   ) BB22( 16   ) BB23(  8   ) BB24( 16   ) BB25(  4   ) BB26(  4   ) BB27(  0.50) 
                     BB28(  0.50) BB30(  0.50) BB31(  4   ) BB32(  2   ) BB33(  2   ) BB36( 16   ) BB37(  8   ) BB38(  8   ) BB41(  8   ) BB42(  8   ) 
                     BB45( 64   ) BB46( 16   ) BB47(  2   ) BB48( 16   ) BB49(  8   ) BB50( 64   ) BB51( 16   ) BB52(  4   ) BB53(  4   ) BB56( 16   ) 
                     BB57(  8   ) BB58(  8   ) BB61(  8   ) BB62(  8   ) BB65( 64   ) BB66( 16   ) BB67(  8   ) BB68( 16   ) BB69(  4   ) BB70( 16   ) 
                     BB71(  8   ) BB72( 64   ) BB73( 16   ) BB74(  8   ) BB75( 16   ) BB76(  4   ) BB77(  2   ) BB78(  2   ) BB81(  2   ) BB82(  2   ) 
                     BB85( 16   ) BB86(  2   ) BB87(  4   ) BB88(  2   ) BB89(  4   ) BB90(  2   ) BB91( 16   ) BB92(  4   ) BB93(  0.50) BB94(  4   ) 
                     BB95(  2   ) BB96( 16   ) BB97(  8   ) BB98( 64   ) BB99( 16   ) BB100(  4   ) BB101( 16   ) BB102(  8   ) BB103( 64   ) BB104( 16   ) 
                     BB105(  8   ) BB106( 16   ) BB107(  4   ) BB108(  2   ) BB109( 16   ) BB110(  2   ) BB111(  4   ) BB112(  2   ) BB113(  4   ) BB114(  2   ) 
                     BB115( 16   ) BB116(  4   ) BB117(  0.50) BB118(  0.50) BB119( 16   ) BB12(  8   ) BB13( 16   ) BB14(  4   ) BB120( 16   ) BB84(  2   ) 
                     BB121( 64   ) BB64(  8   ) BB122( 64   ) BB44(  8   ) BB123(  0   ) 

Note that in the before case BB11 is visited quite early, while in the after case the equivalent BB119 is visited quite late.

And with this different allocation sequence we get different and arguably worse allocation overall.

So it appears LSRA has some sensitivity to block IDs, and that (at a cursory glance anyway) things work better when block IDs and physical block order going into LSRA correspond one to one. This is probably commonly true, since the late dead-code pass rarely enables flowgraph simplification, but it is not guaranteed -- and it isn't true for this method after the change in dotnet/coreclr#15323.

@CarolEidt any thoughts? Should I try ensuring that blocks are renumbered going into LSRA?

Need to see if this somehow interferes with liveness.
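
To make the ID-sensitivity concrete, here's a toy model (this is not the real LSRA block-sequencing heuristic, just the failure mode): any ordering step that consults block numbers will sequence two otherwise-identical graphs differently once numbering stops matching layout.

    using System;
    using System.Linq;

    class SequenceToy
    {
        static void Main()
        {
            // Physical layout in the "after" case: BB119 sits where BB11 was.
            int[] layout = { 9, 10, 119, 12, 13, 14 };

            // Any work-list or tie-break step keyed on block number now pushes
            // BB119 to the end, even though the flowgraph shape is unchanged.
            int[] byNumber = layout.OrderBy(n => n).ToArray();
            Console.WriteLine(string.Join(" ", byNumber)); // 9 10 12 13 14 119
        }
    }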

@AndyAyersMS (Member)

Added a call to fgRenumberBlocks if liveness leads to a modified flow graph.

This puts ludcmp codegen back to where it was before dotnet/coreclr#15323. Seems like a wash in general. Will see if I can get a private perf run for this.
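
Conceptually the fix just restores, ahead of allocation, the invariant that block-number order matches physical block order. A minimal sketch of that invariant over a hypothetical block list (the real method is the JIT's fgRenumberBlocks; this is not the coreclr source):

    class Block
    {
        public int Num;
        public Block Next;
    }

    static class RenumberSketch
    {
        // Walk blocks in physical order and hand out sequential numbers, so
        // IDs and layout agree again before LSRA builds its block sequence.
        public static void Renumber(Block first)
        {
            int num = 1;
            for (Block b = first; b != null; b = b.Next)
                b.Num = num++;
        }
    }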

jit-diffs results for x86:

Total bytes of diff: 730 (0.00% of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
         361 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.02% of base)
         262 : JIT\Performance\CodeQuality\Bytemark\Bytemark\Bytemark.dasm (0.37% of base)
         139 : Microsoft.CodeAnalysis.CSharp.dasm (0.01% of base)
          69 : JIT\Methodical\MDArray\basics\classarr_cs_do\classarr_cs_do.dasm (1.02% of base)
          69 : JIT\Methodical\MDArray\basics\classarr_cs_ro\classarr_cs_ro.dasm (1.02% of base)
Top file improvements by size (bytes):
        -260 : Microsoft.CSharp.dasm (-0.12% of base)
         -79 : System.Private.CoreLib.dasm (0.00% of base)
         -65 : JIT\Methodical\MDArray\InnerProd\intarr_cs_do\intarr_cs_do.dasm (-1.57% of base)
         -65 : JIT\Methodical\MDArray\InnerProd\intarr_cs_ro\intarr_cs_ro.dasm (-1.57% of base)
         -63 : JIT\Methodical\MDArray\InnerProd\doublearr_cs_do\doublearr_cs_do.dasm (-1.67% of base)
107 total files with size differences (43 improved, 64 regressed), 2336 unchanged.
Top method regressions by size (bytes):
         259 : Microsoft.CodeAnalysis.VisualBasic.dasm - ExpressionEvaluator:PerformCompileTimeBinaryOperation(ushort,byte,ref,ref,ref):ref
          94 : Microsoft.CodeAnalysis.CSharp.dasm - Lexer:ScanIdentifier_SlowPath(byref):bool:this
          85 : Microsoft.CodeAnalysis.VisualBasic.dasm - ConstraintsHelper:RemoveDirectConstraintConflicts(ref,struct,ref,int,ref):struct
          80 : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<TKey,TValue>>.GetEnumerator():ref:this (29 methods)
          80 : System.Private.CoreLib.dasm - Dictionary`2:System.Collections.IEnumerable.GetEnumerator():ref:this (29 methods)
Top method improvements by size (bytes):
        -175 : Microsoft.CSharp.dasm - ExpressionBinder:GetStandardAndLiftedBinopSignatures(ref,ref):bool:this
        -106 : System.Private.CoreLib.dasm - GenericEqualityComparer`1:LastIndexOf(ref,struct,int,int):int:this (22 methods)
        -106 : System.Private.CoreLib.dasm - GenericEqualityComparer`1:IndexOf(ref,struct,int,int):int:this (22 methods)
         -85 : JIT\Methodical\MDArray\InnerProd\intarr_cs_do\intarr_cs_do.dasm - intmm:Main():int
         -85 : JIT\Methodical\MDArray\InnerProd\intarr_cs_ro\intarr_cs_ro.dasm - intmm:Main():int
280 total methods with size differences (132 improved, 148 regressed), 218787 unchanged.

@AndyAyersMS (Member)

@CarolEidt I'm going to assign this over to you for further evaluation -- let me know if you have any questions.

I suspect we'll want to push this out of 2.1 but will let you make that call.

AndyAyersMS assigned CarolEidt and unassigned AndyAyersMS Mar 21, 2018
@RussKeldorph (Contributor)

@CarolEidt 2.1?

@CarolEidt (Contributor)

@AndyAyersMS - I made some adjustments to the ordering that I thought ought to be better. It's a net win according to jit-diff, but shows just one regression -- in ludcmp -- even though jitting it results in a net improvement, with the loop you identified above not being split. There's more spill and more moves, but fewer split edges (21 instead of 43), and many fewer split backedges. Dumping disasm and then running jit-analyze on it (jitted version) shows:

Total bytes of diff: -18 (-0.03% of base)
    diff is an improvement.

Total byte diff includes 368 bytes from reconciling methods
        Base had    1 unique methods,       46 unique bytes
        Diff had    1 unique methods,      414 unique bytes

Top file improvements by size (bytes):
         -18 : Bytemark.dasm (-0.03% of base)

1 total files with size differences (1 improved, 0 regressed), 0 unchanged.

Top method regressions by size (bytes):
         414 : Bytemark.dasm - System.SpanHelpers:IndexOf(byref,ushort,int):int (0/1 methods)
          38 : Bytemark.dasm - EMFloatClass:DoEmFloatIteration(ref,ref,ref,int,int):long
          32 : Bytemark.dasm - NeuralJagged:read_data_file():this
          26 : Bytemark.dasm - Neural:read_data_file():this
          20 : Bytemark.dasm - AssignJagged:second_assignments(ref,ref)

Top method improvements by size (bytes):
        -147 : Bytemark.dasm - System.SpanHelpers:SequenceEqual(byref,byref,int):bool
         -72 : Bytemark.dasm - LUDecomp:ludcmp(ref,int,ref,byref):int
         -61 : Bytemark.dasm - AssignRect:first_assignments(ref,ref):int
         -46 : Bytemark.dasm - EMFloat:SetInternalFPFZero(byref,ubyte) (1/0 methods)
         -43 : Bytemark.dasm - Huffman:Run():double:this

43 total methods with size differences (26 improved, 17 regressed), 147 unchanged.

However, running Release versions on my system shows that it is actually slower (though the slowdown is less than the standard deviation). I attempted to start a private perf run, but it's been "pending - Waiting for next available executor" for some time.

At this point, I think it's best to choose discretion over valor, and defer this to "Future".


msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
msftgits added this to the Future milestone Jan 31, 2020
BruceForstall added the JitUntriaged label Oct 28, 2020
CarolEidt removed their assignment Dec 4, 2020