Handle addressing modes for HW intrinsics #22944
Conversation
@@ -1333,6 +1333,25 @@ void CodeGen::genConsumeRegs(GenTree* tree)
// Update the life of the lcl var.
genUpdateLife(tree);
}
#ifdef FEATURE_HW_INTRINSICS
else if (tree->OperIs(GT_HWIntrinsic))
GT_HWIntrinsic is already handled a few lines below.
Thanks, I missed that. I don't believe that we have (or will have) any Arm64 intrinsics that can be contained, so I'll remove the one further down.
Did this get addressed? I don't see the matching removal
It's a bit hard to see, with the way the diffs are shown, but if you look below, this line was deleted: else if (tree->OperIsHWIntrinsic()) and the genConsumeReg below it is actually the one for the OperIsInitVal case.
genConsumeAddress(op1);
// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);
Meh, and I just got rid of the last usage of indirForm in XARCH codegen in a recent PR 😞 . I'll clean this up once the 3.0 release storm passes.
Yes, after looking at your change here: mikedn@af3df85#diff-04ea18d1ee0bf7efd98adb6840a48369 I seriously considered waiting for and/or incorporating those changes, but decided to take the expedient path so that this stands a better chance of getting into 3.0.
src/jit/emitxarch.cpp (Outdated)
// rmOp -- The second operand, which may be a memory node or a node producing a register
// ival -- The immediate operand
//
void emitter::emitIns_R_RM_I(instruction ins, emitAttr attr, regNumber reg1, GenTree* rmOp, int ival)
I wouldn't put this in the emitter. Ideally the emitter interface should not deal with GenTree, only instructions, registers and immediates/displacements.
You could probably just update genHWIntrinsic_R_RM_I():

coreclr/src/jit/hwintrinsiccodegenxarch.cpp, Line 528 in 407cc7d:
void CodeGen::genHWIntrinsic_R_RM_I(GenTreeHWIntrinsic* node, instruction ins, int8_t ival)
Hmm, it actually looks like most of this logic was pulled from the previous impl of genHWIntrinsic_R_RM_I.
I extracted this from genHWIntrinsic_R_RM_I because it is not specific to HW intrinsics. And historically the emitter has dealt with GenTree nodes for addresses. That said, I take the point that it should be in CodeGen instead of in emitter. I'll make that change.
I think most of the genHWIntrinsic_R_... methods are probably more "general-purpose" than HWIntrinsic specific. That is, outside some asserts and the explicit code path for containedOp->OperIsHWIntrinsic(), they are just taking a node and determining which emitter call to make (R, AR, C, A, S, etc).

It might be worth refactoring all of these to be more generally usable?
The various "RM" handling functions and the relevant emitter interface functions are pretty messed up as is. It will take a significant effort to address all that and it certainly doesn't belong in this PR. For now just avoiding adding more stuff to the emitter interface should be enough. It already suffers from duplication, bad function and parameter names etc.
> It will take a significant effort to address all that and it certainly doesn't belong in this PR.
Completely right, I was more asking as a potential future work item.
As it turns out, extracting the HWIntrinsic-dependencies from this was a bit problematic, so although I moved it to CodeGen::inst_RV_TT_IV(), it is still under #ifdef FEATURE_HW_INTRINSICS.
return;
}
if (op1->OperIs(GT_LIST))
{
nit: assert(node->gtGetOp2() == nullptr) when op1 is a list?
Will do.
assert(node->gtGetOp2() == nullptr);
return;
}
if (op1->OperIs(GT_LIST))
nit: OperIsList()?
I don't think we have a convention for this, but the OperIsXXX methods pre-date @mikedn's addition of the OperIs method, which I tend to prefer since it makes it clear that you are checking for one or more specific operators. I prefer the OperIsXXX when it is used to check for a "class" of operators. cc @dotnet/jit-contrib for opinions on usage of these.
I wouldn't bother too much with this. Hopefully we'll just get rid of GT_LIST. I have ~75% of it done, just waiting for intrinsics to finally stabilize so I don't have to keep fixing conflicts.
// None.
//

void CodeGen::genConsumeHWIntrinsicOperands(GenTreeHWIntrinsic* node)
Might be nice to centralize the asserting of numArgs to number of node operands here as well?
Makes sense.
emit->emitIns_AR_R(ins, simdSize, op2Reg, op1Reg, 0);
}
else if ((ival != -1) && varTypeIsFloating(baseType))
if ((ival != -1) && varTypeIsFloating(baseType))
{
assert((ival >= 0) && (ival <= 127));
genHWIntrinsic_R_R_RM_I(node, ins, ival);
}
else if (category == HW_Category_MemoryLoad)
Why isn't this case genConsumeAddress like the other loads?
It's sort of arbitrary. genConsumeRegs will correctly handle an address. genConsumeAddress is useful if you already know you have an address, because it does fewer checks. This is distinct from genConsumeReg, which is only used if a node is known to produce a single register.
op1Reg = op1->gtRegNum;
genConsumeOperands(node);
}
genConsumeHWIntrinsicOperands(node);
Shouldn't there be some additional removals of genConsumeReg from the below (for this and similar edits)?
I didn't transform all the methods in this file to use genConsumeHWIntrinsicOperands - only the ones where it was specifically beneficial to abstract out the possibility that they might be contained memory ops/addresses. That may be a useful future change, but I wouldn't necessarily do that with this PR.
test OSX10.12 x64 Checked Innerloop Build and Test

@dotnet/jit-contrib PTAL
// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);
emit->emitInsLoadInd(ins, simdSize, node->gtRegNum, &load);
The old code was using emitIns_R_AR, which has a Is4ByteSSEInstruction check that increases the instruction size by 2. emitInsLoadInd does not seem to have such a check.
Right - I took a quick look at that when you made that comment earlier, and I wasn't sure it was actually required. I need to take another look.
So far I believe that this 2-byte increment should actually be 1 byte. That said, there is definitely inconsistency in how the sizes are determined, and the code allows the size to be over-estimated, so those errors go undetected. I'm trying to decide whether to make a more general cleanup here or just go with the 1 byte increment for this case.
I thought we had a work item tracking the cleanup of emitInsSize, but I can't seem to find it.

From what I remember, these one-offs were done because it was non-trivial to update the existing emitInsSize checks due to other complications at the time. I think it is +2 to handle the non-VEX encoded case but only needs to be +1 for the VEX encoding.
I believe it's the case that for these addressing-mode instructions, the SSE encodings require one additional byte (and the VEX encodings don't require an additional byte). I've run all the hw intrinsic tests without AVX & AVX2 with the most recent set of changes.
The size analysis is so convoluted that I am mostly relying on empirical results (i.e. testing).
I added a note to https://github.com/dotnet/coreclr/issues/23006 that this should also be addressed.
else
{
regNumber rmOpReg = rmOp->gtRegNum;
getEmitter()->emitIns_SIMD_R_R_I(ins, attr, reg1, rmOpReg, ival);
This isn't really SIMD-specific (and should probably be moved), but it's the method that will do a copy to the target if needed for the RMW form.
@dotnet-bot test Ubuntu x64 Checked CoreFX Test

@dotnet-bot test Ubuntu x64 Checked CoreFX Tests

@dotnet/jit-contrib @tannergooding @fiigii PTAL
GenTree* op1 = node->gtGetOp1();
if (op1 == nullptr)
{
assert((numArgs == 0) && (node->gtGetOp2() == nullptr));
nit: It is generally nice to break && asserts into two separate ones. That makes it much clearer which assert is the problematic one when they are hit.
I generally disagree. Asserts already clutter the code, and though they are useful, they should impact the readability & flow of the code as little as possible, IMO.
And this is another bit of code that will go away when I get rid of GT_LIST...
{
assert((ival >= 0) && (ival <= 127));
genHWIntrinsic_R_RM_I(node, ins, (int8_t)ival);
genConsumeAddress(op1);
I believe you mentioned that genConsumeOperands will also handle the address case, but genConsumeAddress is more efficient.

My question then would be: Is it actually worth doing explicit genConsumeAddress calls here? It looks like it makes the overall logic slightly more complex...
In what way does it make the logic more complex? It's probably a matter of taste, but this case needs special handling anyway, and it seems clearest to explicitly consume an address.
> It's probably a matter of taste

Possibly. I would have thought that the existing genConsumeOperands(node) call would have handled things, in which case the changes could have been minimized to just changing emit->emitIns_R_AR(ins, simdSize, targetReg, op1Reg, 0); to:

// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);
emit->emitInsLoadInd(ins, simdSize, node->gtRegNum, &load);

This would have made the delta smaller and kept the nesting of the if blocks smaller. It also gives a single spot that operand consumption happens so future changes don't need to worry about it.

Likewise, down on L144, just the if block could have been added and the existing genConsumeOperands(node) would have handled both the memory store and non-memory store cases (rather than splitting this into 4 separate calls).
Please avoid genConsumeOperands in SIMD and HWIntrinsic code, it's incompatible with my pending GT_LIST changes.
LGTM, just one suggestion.
// This is the aligned case.
Sse2.StoreScalar(p + alignmentOffset + 2, Sse2.Subtract(v, Sse2.LoadAlignedVector128(p + offset + alignmentOffset + 4)));
// This is the unaligned case.
Sse2.StoreScalar(p + alignmentOffset + 1, Sse2.Subtract(v, Sse2.LoadVector128(p + offset + alignmentOffset + 1)));
The test cases don't look sufficient. This PR also changes several Is4ByteSSEInstruction points, so I suggest adding a Sse41.LoadAlignedVector128NonTemporal test and a Vector256<T> load/store test.

In the future, it would be nice to update the intrinsic test template to cover more addressing modes.
@redknightlois - I plan to get this in for 3.0, though while waiting for the CI issues to resolve I've gotten side-tracked on other things. I should have an updated PR by the end of this week.
Also, eliminate some places where the code size estimates were over-estimating.

Contribute to #19550
Fix #19521
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0)

@dotnet-bot test Windows_NT arm Cross Checked Innerloop Build and Test
// Now do a non-temporal load from the aligned location.
// That should give us {0, 0, 1, 0}.
Vector128<float> v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();
Sse41.IsSupported?
Right - of course! Forgot all about the checks!
{
try
{
Avx.Store(p + 1, Avx.Subtract(v, Avx.LoadVector256(p + offset + 1)));
Avx.IsSupported?

@dotnet-bot test Windows_NT arm64 Cross Checked Innerloop Build and Test

@dotnet/jit-contrib @tannergooding @fiigii PTAL

@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
// Now do a non-temporal load from the aligned location.
// That should give us {0, 0, 1, 0}.
Vector128<float> v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();
We need if (Sse41.IsSupported) for this case.
Excellent, thanks. I split this out so that we do an Sse2 load when Sse41 isn't available.
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2

@dotnet/jit-contrib @tannergooding ping - I addressed @fiigii 's feedback on the test case, and fixed an error in the guard for it. I think this is ready to merge.
LGTM
* Handle addressing modes for HW intrinsics

Also, eliminate some places where the code size estimates were over-estimating.

Contribute to dotnet/coreclr#19550
Fix dotnet/coreclr#19521

Commit migrated from dotnet/coreclr@da6ed11
Contribute to #19550
Fix #19521