Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Have lowering set up IR for post-indexed addressing and make strength reduced IV updates amenable to post-indexed addressing #105185

Merged
merged 2 commits into from
Jul 22, 2024

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Jul 20, 2024

This adds a transformation in lowering that tries to set up the IR to be
amenable to post-indexed addressing in the backend. It does so by
looking for RMW additions/subtractions of a local that was also recently
used as the address to an indirection, and making them adjacent.

Additionally, have strength reduction try to insert IV updates after the last
use if that last use is a legal insertion point. This allows the lowering transformation
to kick in.

For a simple loop:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Sum(int[] arr)
{
    int sum = 0;
    foreach (int x in arr)
    {
        sum += x;
    }

    return sum;
}

this results in:

@@ -19,12 +19,11 @@ G_M53154_IG03:        ; bbWeight=0.25, gcrefRegs=0001 {x0}, byrefRegs=0000 {}, b
 						;; size=4 bbWeight=0.25 PerfScore 0.12
 
 G_M53154_IG04:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref, isz
-            ldr     w3, [x0]
+            ldr     w3, [x0], #0x04
             add     w1, w3, w1
-            add     x0, x0, #4
             sub     w2, w2, #1
             cbnz    w2, G_M53154_IG04
-						;; size=20 bbWeight=4 PerfScore 22.00
+						;; size=16 bbWeight=4 PerfScore 20.00
 
 G_M53154_IG05:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
                              ; byrRegs -[x0]
@@ -35,5 +34,5 @@ G_M53154_IG06:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-; Total bytes of code: 60
+; Total bytes of code: 56

The .NET 8 vs .NET 9 codegen diff for this loop becomes:

@@ -7,11 +7,10 @@ G_M53154_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 G_M53154_IG02:        ; bbWeight=1, gcrefRegs=0001 {x0}, byrefRegs=0000 {}, byref, isz
                              ; gcrRegs +[x0]
             mov     w1, wzr
-            mov     w2, wzr
-            ldr     w3, [x0, #0x08]
-            cmp     w3, #0
+            ldr     w2, [x0, #0x08]
+            cmp     w2, #0
             ble     G_M53154_IG05
-						;; size=20 bbWeight=1 PerfScore 5.50
+						;; size=16 bbWeight=1 PerfScore 5.00
 
 G_M53154_IG03:        ; bbWeight=0.25, gcrefRegs=0001 {x0}, byrefRegs=0000 {}, byref
             add     x0, x0, #16
@@ -20,12 +19,11 @@ G_M53154_IG03:        ; bbWeight=0.25, gcrefRegs=0001 {x0}, byrefRegs=0000 {}, b
 						;; size=4 bbWeight=0.25 PerfScore 0.12
 
 G_M53154_IG04:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref, isz
-            ldr     w4, [x0, w2, UXTW #2]
-            add     w1, w4, w1
-            add     w2, w2, #1
-            cmp     w3, w2
-            bgt     G_M53154_IG04
-						;; size=20 bbWeight=4 PerfScore 22.00
+            ldr     w3, [x0], #0x04
+            add     w1, w3, w1
+            sub     w2, w2, #1
+            cbnz    w2, G_M53154_IG04
+						;; size=16 bbWeight=4 PerfScore 20.00
 
 G_M53154_IG05:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
                              ; byrRegs -[x0]
@@ -36,5 +34,5 @@ G_M53154_IG06:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-; Total bytes of code: 64
+; Total bytes of code: 56

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 20, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
Member Author

@jakobbotsch
Copy link
Member Author

cc @dotnet/jit-contrib PTAL @AndyAyersMS

Diffs. Some cool diffs:
image

@AndyAyersMS
Copy link
Member

Interesting diff in windows arm benchmarks pgo

+4 (+0.09%) : 7422.dasm - System.Text.RegularExpressions.RegexPrefixAnalyzer:<FindPrefixes>g__FindPrefixesCore|0_1(System.Text.RegularExpressions.RegexNode,System.Collections.Generic.List`1[System.Text.StringBuilder],ubyte):ubyte (Tier0-FullOpts)
@@ -1600,10 +1600,12 @@ G_M12455_IG128:        ; bbWeight=0.05, gcrefRegs=180000 {x19 x20}, byrefRegs=C0
 						;; size=4 bbWeight=0.05 PerfScore 0.02
 G_M12455_IG129:        ; bbWeight=0.49, gcrefRegs=180000 {x19 x20}, byrefRegs=C00000 {x22 x23}, byref, isz
             ldrh    w1, [x23, x21]
-            stp     wzr, w1, [fp, #0x50]	// [V16 loc13], [V15 loc12]
+            str     w1, [fp, #0x54]	// [V15 loc12]
+            add     x21, x21, #2
+            str     wzr, [fp, #0x50]	// [V16 loc13]

int maxCount = min(m_blockIndirs.Height(), POST_INDEXED_ADDRESSING_MAX_DISTANCE / 2);
for (int i = 0; i < maxCount; i++)
{
SavedIndir& prev = m_blockIndirs.TopRef(i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more efficient to start checking with the last indir instead of the first?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does start with the last indir (since it is using TopRef instead of BottomRef)

src/coreclr/jit/lowerarmarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lowerarmarch.cpp Outdated Show resolved Hide resolved
assert((prevIndir->gtLIRFlags & LIR::Flags::Mark) == 0);
m_scratchSideEffects.Clear();

for (GenTree* cur = prevIndir->gtNext; cur != store; cur = cur->gtNext)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this could be cheaper if you computed two side effect sets and then checked for interference. But it probably doesn't make much difference.

Copy link
Member Author

@jakobbotsch jakobbotsch Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, possibly -- although it would be a bit less precise than what's here since not all nodes that are part of store's dataflow necessarily happen after all the nodes we are moving.

This adds a transformation in lowering that tries to set up the IR to be
amenable to post-indexed addressing in the backend. It does so by
looking for RMW additions/subtractions of a local that was also recently
used as the address to an indirection.
…sing

On arm64 have strength reduction try to insert IV updates after the last
use if that last use is a legal insertion point. This often allows the
backend to use post-indexed addressing to combine the use with the IV
update.
@jakobbotsch
Copy link
Member Author

jakobbotsch commented Jul 21, 2024

Interesting diff in windows arm benchmarks pgo

+4 (+0.09%) : 7422.dasm - System.Text.RegularExpressions.RegexPrefixAnalyzer:<FindPrefixes>g__FindPrefixesCore|0_1(System.Text.RegularExpressions.RegexNode,System.Collections.Generic.List`1[System.Text.StringBuilder],ubyte):ubyte (Tier0-FullOpts)
@@ -1600,10 +1600,12 @@ G_M12455_IG128:        ; bbWeight=0.05, gcrefRegs=180000 {x19 x20}, byrefRegs=C0
 						;; size=4 bbWeight=0.05 PerfScore 0.02
 G_M12455_IG129:        ; bbWeight=0.49, gcrefRegs=180000 {x19 x20}, byrefRegs=C00000 {x22 x23}, byref, isz
             ldrh    w1, [x23, x21]
-            stp     wzr, w1, [fp, #0x50]	// [V16 loc13], [V15 loc12]
+            str     w1, [fp, #0x54]	// [V15 loc12]
+            add     x21, x21, #2
+            str     wzr, [fp, #0x50]	// [V16 loc13]

We end up with this IR after strength reduction:

***** BB135 [0056]
STMT00207 ( 0x2B1[E-] ... 0x2BB )
N007 ( 10,  7) [000666] DA-XGO-----STORE_LCL_VAR int    V15 loc12        d:1 $d1a
N006 ( 10,  7) [000664] ---XGO-N---                         └──▌  COMMA     ushort <l:$b0d, c:$b0c>
N001 (  0,  0) [000657] -----------                            ├──▌  NOP       void  
N005 ( 10,  7) [003041] ---XGO-----                            └──▌  IND       ushort <l:$b0a, c:$b0b>
N004 (  7,  5) [000663] ----GO-N---                               └──▌  ADD       byref  $c12
N002 (  3,  2) [000662] -----------                                  ├──▌  LCL_VAR   byref  V165 tmp129      u:2 $c11
N003 (  3,  2) [003994] -----------                                  └──▌  LCL_VAR   long   V247 rat2        

***** BB135 [0056]
STMT00720 ( ??? ... ??? )
N004 (  9,  8) [003993] DA---------STORE_LCL_VAR long   V247 rat2        
N003 (  5,  5) [003992] -----------                         └──▌  ADD       long  
N001 (  3,  2) [003991] -----------                            ├──▌  LCL_VAR   long   V247 rat2        
N002 (  1,  2) [003989] -----------                            └──▌  CNS_INT   long   2 $34d

***** BB135 [0056]
STMT00208 ( 0x2BD[E-] ... 0x2BE )
N002 (  1,  3) [000668] DA---------STORE_LCL_VAR int    V16 loc13        d:1 $VN.Void
N001 (  1,  2) [000667] -----------                         └──▌  CNS_INT   int    0 $c0

Lowering does not try to make the indirection [003041 and update [003993] adjacent since the indirection isn't directly on V247, so there won't be a chance for post-indexed addressing. And this breaks the stp that previously happened for V15 and V16 that end up being stack allocated in adjacent slots.

Strength reduction doesn't do much (any) sanity checking of whether we actually expect to be able to do post-indexed after moving the IV update. That would require us to check that the use is of a supported pattern. But I figure that complication is unnecessary since the exact place we update the IV at shouldn't matter much here -- it is live throughout the loop anyway. It might even be better for scheduling purposes to update it as soon as possible after that last use.

@jakobbotsch jakobbotsch merged commit 7dd68f4 into dotnet:main Jul 22, 2024
102 of 108 checks passed
@jakobbotsch jakobbotsch deleted the lower-post-indexing branch July 22, 2024 09:06
@github-actions github-actions bot locked and limited conversation to collaborators Aug 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants