-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Reorder indirs on arm64 to make them amenable to ldp optimization #92768
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsHacky prototype for #92756 to see the diffs.
|
Can it get the example in #64815? |
@kunalspathak FD000470 str d16, [x3,#8]
FD400870 ldr d16, [x3,#16] are ok to reorder. It gives up here: runtime/src/coreclr/jit/lower.cpp Lines 8055 to 8067 in 968ca81
It's probably not that difficult to make this particular check smarter (of course we could also make |
@kunalspathak I pushed a new commit that makes the interference check more clever. It means we get the case in that issue now, diff: @@ -1,20 +1,18 @@
-G_M62947_IG03: ;; offset=0x0020
+G_M62947_IG03: ;; offset=0x001C
ldr x3, [x0, w1, UXTW #3]
- ldr d16, [x3, #0x08]
- ldr d17, [x3, #0x20]
- fmul d17, d0, d17
- fadd d16, d16, d17
+ ldp d16, d17, [x3, #0x08]
+ ldp d18, d19, [x3, #0x20]
+ fmul d18, d0, d18
+ fadd d16, d16, d18
str d16, [x3, #0x08]
- ldr d16, [x3, #0x10]
- ldr d17, [x3, #0x28]
- fmul d17, d0, d17
- fadd d16, d16, d17
+ fmul d16, d0, d19
+ fadd d16, d17, d16
str d16, [x3, #0x10]
ldr d16, [x3, #0x18]
ldr d17, [x3, #0x30]
fmul d17, d0, d17
fadd d16, d16, d17
str d16, [x3, #0x18]
add w1, w1, #1
cmp w2, w1
bgt G_M62947_IG03 |
Of course we need the |
Some data on why cases get rejected (1 = rejected due to that reason): Volatile
<= 0 ===> 37438 count ( 99% of total)
1 .. 1 ===> 166 count (100% of total)
Too far
<= 0 ===> 32988 count ( 88% of total)
1 .. 1 ===> 4450 count (100% of total)
Interference
<= 0 ===> 28934 count ( 87% of total)
1 .. 1 ===> 4054 count (100% of total) This is over libraries_tests.run. Could try to adjust the distance we search to see if we can get some more of those "Too far" cases... |
With 16 distance on smoke_tests: Volatile
<= 0 ===> 3829 count ( 99% of total)
1 .. 1 ===> 10 count (100% of total)
Too far
<= 0 ===> 3684 count ( 96% of total)
1 .. 1 ===> 145 count (100% of total)
Interference
<= 0 ===> 3513 count ( 95% of total)
1 .. 1 ===> 171 count (100% of total) With 32: Volatile
<= 0 ===> 3829 count ( 99% of total)
1 .. 1 ===> 10 count (100% of total)
Too far
<= 0 ===> 3796 count ( 99% of total)
1 .. 1 ===> 33 count (100% of total)
Interference
<= 0 ===> 3543 count ( 93% of total)
1 .. 1 ===> 253 count (100% of total) So that allowed another 112 cases to pass onwards to the interference analysis, but there we rejected 82 of them... so probably not worth it to increase it. |
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn |
Azure Pipelines successfully started running 3 pipeline(s). |
src/coreclr/jit/lower.cpp
Outdated
if (!ind->TypeIs(TYP_INT, TYP_LONG, TYP_DOUBLE, TYP_SIMD16)) | ||
{ | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to fix #92766 to add TYP_FLOAT
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are small types supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this needs to have TYP_SIMD8
too? (on a rare occasion there is Vector64-based code supprted by ARM64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this needs to have
TYP_SIMD8
too? (on a rare occasion there is Vector64-based code supprted by ARM64)
True, wondering if TYP_SIMD8
was considered in this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are small types supported?
Nope, there is no small variant of ldp
I believe.
Perhaps this needs to have TYP_SIMD8 too? (on a rare occasion there is Vector64-based code supprted by ARM64)
True, wondering if TYP_SIMD8 was considered in this list?
Yep, this can have TYP_SIMD8
-- I forgot about those, thanks for spotting this.
Nice diffs! |
cc @a74nh |
Looks like there are some interesting regressions where spilling happens, and where the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool optimization. Just a few nits/questions/comments.
src/coreclr/jit/lower.cpp
Outdated
|
||
for (GenTree* cur = prevIndir->gtNext; cur != indir; cur = cur->gtNext) | ||
{ | ||
if (cur->IsCall() || (cur->OperIsStoreBlk() && (cur->AsBlk()->gtBlkOpKind == GenTreeBlk::BlkOpKindHelper))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should move this check in the for-loop
above where you check the distance between the prevIndir
and indir
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable -- moved it.
FWIW, we might be able to remove this heuristic after we merge #92744
src/coreclr/jit/lower.cpp
Outdated
|
||
if ((cur->gtLIRFlags & LIR::Flags::Mark) != 0) | ||
{ | ||
// Part of data flow of 'indir', so we will be moving past this node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Part of data flow of 'indir', so we will be moving past this node. | |
// Part of data flow of 'indir', so we will be moving it upward past this node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggestion doesn't seem quite right to me, but I rephrased it in a different way which hopefully makes more sense.
* Move IsVolatile check into OptimizeForLdp * Allow the optimization for SIMD8 indirs * Make fldSeq parameter of gtPeelOffsets optional * Only check for GT_LCL_VARs; reorder some checks for TP purposes * Avoid adding an entry on successful optimization * Update some comments and JITDUMP * Add convenience constructor
Some of the failures to combine Remaining regressions seem to be mostly about register allocation, but I think it's expected to see some regressions due to suboptimal register allocation when doing this kind of reordering (since the transformation lengthens the lifetime of the moved LIR temp). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Do you have a sense that moving the second indir "earlier" is better than moving the first indir "later"? (or even moving both of them somewhere in between)? I believe think it's generally desirable to do loads as soon as you know you're going to need them so as to hide load latency, so I think moving the second one up is the right approach, but there may be cases where there is less pressure if we moved the first one later, or if the loads are not on any critical path maybe the CPU would be better off spending limited resources on something else. Conventional wisdom is that "scheduling" is no longer really critical with today's highly OOO CPUs so perhaps none of this matters and compacting the code better is the key aspect. |
The decision here was mainly driven by the fact that implementation wise, it is much easier to move the second indirection and its operands backwards, compared to trying to find all transitive users (including through locals) of the first indirection and moving them forwards past the second indirection. |
superpmi-diffs failure is the "out of space" one. |
This PR adds an ARM64 optimization in lowering that tries to reorder indirections to be near each other, making them amenable to
ldp
optimization done in the backend.Example:
Before:
After:
The transformation works by moving the trees that are not part of the data flow of the second indirection past it (after doing the necessary interference checks to prove whether this reordering is ok). For example, for the case above, the jitdump illustrates what happens:
In this case we needed to reorder the indirection with the
[000005] IND
and[000006] EQ
nodes.The optimization also supports reordering stores with loads to unlock opportunities like #64815. To do that there is a (somewhat ad hoc) alias analysis based on
TYP_REF
and locals. For example, for the case in #64815:the jit dump is:
In this case we needed to reorder with a much larger range of trees -- the
[000022]
to[000030]
trees needed to be moved past the indirection. In this case that involves proving that the[000033]
indir does not interfere with the[000025]
store.I want to look into handling
LCL_FLD
and stores (STOREIND/STORE_LCL_FLD
) as a follow-up.Fix #64815