-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Implement Shuffle* Sse2 Intel hardware intrinsics #15777
Conversation
@@ -33,7 +33,7 @@ static const wchar_t *coreCLRDll = W("CoreCLR.dll"); | |||
static const wchar_t *coreCLRInstallDirectory = W("%windir%\\system32\\"); | |||
|
|||
// Encapsulates the environment that CoreCLR will run in, including the TPALIST | |||
class HostEnvironment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file doesn't have any modifications outside of whitespace changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is dirty commit - just needed to run CI - everything will get cleaned later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to comment @dotnet-bot test <test-name> please
to rerun any given job, use this
to rerun all jobs
src/jit/emitxarch.cpp
Outdated
ins == INS_punpcklbw || ins == INS_punpckhqdq || ins == INS_punpcklqdq || ins == INS_punpckhwd || | ||
ins == INS_punpcklwd || ins == INS_punpckhdq || ins == INS_packssdw || ins == INS_packsswb || | ||
ins == INS_packuswb || ins == INS_packusdw || ins == INS_vperm2i128); | ||
switch (ins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would still probably be useful to pull this change out to its own PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
src/jit/emitxarch.cpp
Outdated
@@ -1085,6 +1181,7 @@ inline bool hasCodeMI(instruction ins) | |||
inline size_t insCodeMI(instruction ins) | |||
{ | |||
assert((unsigned)ins < _countof(insCodesMI)); | |||
size_t code = insCodesMI[ins]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug help for conditional breakpoint - to be removed
src/jit/emitxarch.cpp
Outdated
@@ -3765,13 +3862,15 @@ void emitter::emitIns_R_R(instruction ins, emitAttr attr, regNumber reg1, regNum | |||
void emitter::emitIns_R_R_I(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int ival) | |||
{ | |||
// SSE2 version requires 5 bytes and SSE4/AVX version 6 bytes | |||
UNATIVE_OFFSET sz = 4; | |||
UNATIVE_OFFSET sz = 4; | |||
bool encodeAsSse = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? This logic should all be controlled elsewhere via UseVEXEncoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other instructions (such as INS_psrldq
) should be going through emitIns_R_I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(for the non-vex case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original logic did not work correctly for SSE2 codegen. In case of istruction with the format ins xmm, imm
they had always being set with id->idInsFmt(IF_RRW_RRW_CNS);
what resulted in code like psllw xmm0, xmm0, 0
what is an invalid SSE2 instruction what resulted in CPU exception.
Above logic ensures that RRW_CNS instructions are flagged with valid format IF_RRW_CNS
what resolves this issue. Table based codegen should ensure this same logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of istruction with the format ins xmm, imm they had always being set with id->idInsFmt(IF_RRW_RRW_CNS);
The instructions should be going through an emitIns_R_I
method for the SSE2 case. They should only go through emitIns_R_R_I
on AVX+
src/jit/emitxarch.cpp
Outdated
|
||
void emitter::emitIns_SIMD_R_R_I(instruction ins, regNumber reg1, regNumber reg2, ssize_t iVal, var_types simdtype) | ||
{ | ||
emitIns_R_R_I(ins, emitTypeSize(simdtype), reg1, reg2, (int)iVal); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe R_R_I
should only happen under UseVEXEncoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on instruction - all shuffle cases, cmppd
, cmpps
, pinsrw
are IF_RRW_RRW_CNS
under SSE2 so they should go through R_R_I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but those instructions should be going through SIMD_R_R_R_I
.
In almost all cases so far, the codegen has been calling the emitIns_SIMD_*
case that matches the VEX encoding. We then have logic for the fallback case so that args can be arranged properly.
Ex: cmppd
is R_R_R_I
on AVX+ machines and R_R_I
elsewhere. So we call emitIns_SIMD_R_R_R_I
.
emitIns_SIMD_R_R_R_I
then has the logic for, on vex, generate the instruction directly (emtIns_R_R_R_I), on non-vex generate a movaps
(emitIns_R_R) followed by the instruction (emitIns_R_R_I).
That being said, it does look like the pshufd
instructions are bit special in that they (like crc32, lzcnt, popcnt, and a few others) have a consistent encoding between VEX and non-VEX machines, in which case we should just be calling emitIns_*
directly, without an intermediary emitIns_SIMD_*
step
src/jit/emitxarch.cpp
Outdated
void emitter::emitIns_SIMD_R_R_R_I( | ||
instruction ins, regNumber reg, regNumber reg1, regNumber reg2, ssize_t ival, var_types simdtype) | ||
{ | ||
if (UseVEXEncoding() && reg1 != reg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've dropped the reg1 != reg
from this line elsewhere, it isn't useful
src/jit/emitxarch.cpp
Outdated
{ | ||
if (reg1 != reg) | ||
{ | ||
emitIns_R_R_I(INS_movaps, emitTypeSize(simdtype), reg, reg1, (int)ival); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this is correct. movaps
is a R_R
instruction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, will fix by dropping whole if statement. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think dropping the if statement is correct.
We still have, a target register, two operands, and an immediate, which isn't encodable on non-VEX machines if reg1 != reg
src/jit/gentree.cpp
Outdated
@@ -304,7 +304,8 @@ void GenTree::InitNodeSize() | |||
GenTree::s_gtNodeSizes[GT_INDEX_ADDR] = TREE_NODE_SZ_LARGE; | |||
GenTree::s_gtNodeSizes[GT_ARR_BOUNDS_CHECK] = TREE_NODE_SZ_LARGE; | |||
#ifdef FEATURE_SIMD | |||
GenTree::s_gtNodeSizes[GT_SIMD_CHK] = TREE_NODE_SZ_LARGE; | |||
GenTree::s_gtNodeSizes[GT_SIMD_CHK] = TREE_NODE_SZ_LARGE; | |||
GenTree::s_gtNodeSizes[GT_SIMD] = TREE_NODE_SZ_LARGE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change needed? We shouldn't be touching the GT_SIMD code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not touched it - clang -format and clang-tidy did it by creating format patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm.... I wouldn't expect clang format or clang-tidy to add new code, only move around existing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's my copy paste error around place flagged by clang-format
src/jit/gentree.cpp
Outdated
* Create a list out of the five values. | ||
*/ | ||
|
||
GenTreeArgList* Compiler::gtNewArgList( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? Parameter counts above 4 looked to be very uncommon, so a helper function that calls the smaller ones and works with arbitrary sizes might be more useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of my experiments to be removed - when you run a loop on 8 or 16 arguments and use in the first step 4 args function in every next step you need 5 args function to pass 4 args and append them to 1 list - but actually function quite different from this one.
src/jit/gentree.cpp
Outdated
GenTreeHWIntrinsic* Compiler::gtNewSimdHWIntrinsicNode(var_types type, | ||
GenTree* op1, | ||
GenTree* op2, | ||
ssize_t iVal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want/need to carry the ival
as part of the node. It will end up being table driven, but before then, you can pull it out of the op when generating code or add a small switch for the hardcoded cases.
src/jit/gtlist.h
Outdated
@@ -158,11 +158,11 @@ GTNODE(GE , GenTreeOp ,0,GTK_BINOP|GTK_RELOP) | |||
GTNODE(GT , GenTreeOp ,0,GTK_BINOP|GTK_RELOP) | |||
#ifndef LEGACY_BACKEND | |||
// These are similar to GT_EQ/GT_NE but they generate "test" instead of "cmp" instructions. | |||
// Currently these are generated during lowering for code like ((x & y) eq|ne 0) only on | |||
// Currently these are generated during lowering for code like ((x & y) eq|ne 0) only on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another file with whitespace only changes.
src/jit/hwintrinsiccodegenxarch.cpp
Outdated
break; | ||
} | ||
|
||
if (ins != INS_cmppd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could drop this check if you return instead of break in the TYP_DOUBLE case.
src/jit/hwintrinsiccodegenxarch.cpp
Outdated
op2Reg = op2->gtRegNum; | ||
|
||
instruction ins; | ||
switch (baseType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can drop the switch statement for the ones that only take a single type.
In the long run (for generic methods) we will want to be checking the types in the importer and returning a PNSE exception for the invalid cases.
src/jit/simd.cpp
Outdated
GenTreePtr tree = se.val; | ||
|
||
return impSIMD(type, tree, ti, expectAddr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, one of experiments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be removed onviously 😄
@tannergooding |
I don't believe so. Pushing a new commit will run all jobs set to automatically run |
@tannergooding Is it possible to pass error message from Jit to exception created in importer? Currently available standard messages are not very helpful. |
I am waiting for #16183 before submitting commits to finish work on this PR. |
@fiigii @tannergooding @CarolEidt Added Extract/Insert/Shuffle* SSE2 HW intrinsics. To speed up work I have decided to update tests to templated format after all of SSE2 intrinsics are implemented. |
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE2_LoadScalarVector128, "LoadScalarVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movd, INS_movd, INS_movq, INS_movq, INS_invalid, INS_movsdsse2}, HW_Category_MemoryLoad, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_LoadVector128, "LoadVector128", SSE2, -1, 16, 1, {INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_invalid, INS_movupd}, HW_Category_MemoryLoad, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_Extract, "Extract", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_pextrw, INS_pextrw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM) | ||
HARDWARE_INTRINSIC(SSE2_Insert, "Insert", SSE2, -1, 16, 3, {INS_invalid, INS_invalid, INS_pinsrw, INS_pinsrw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You give these intrinsics HW_Flag_NoJmpTableIMM
, but did not implement the fallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SSE2_Extract
and SSE2_Insert
actually needs HW_Flag_NoJmpTableIMM
, but you need to implement the "alternatives" in the importer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. You are right - will do it.
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE2_LoadFence, "LoadFence", SSE2, -1, 0, 0, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Special, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_LoadScalarVector128, "LoadScalarVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movd, INS_movd, INS_movq, INS_movq, INS_invalid, INS_movsdsse2}, HW_Category_MemoryLoad, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_LoadVector128, "LoadVector128", SSE2, -1, 16, 1, {INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_invalid, INS_movupd}, HW_Category_MemoryLoad, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_Extract, "Extract", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_pextrw, INS_pextrw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract
needs HW_Flag_BaseTypeFromFirstArg
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I believe SSE2_Extract
and SSE2_Insert
require specially treatment in CodeGen and emitter. Please see #16646
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract
needsHW_Flag_BaseTypeFromFirstArg
Fortunately it is not the case as current code handles this intrinsics correctly based on their signatures - intrinsics returning scalar values are special cased and correctly determine baseType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, my PR is not merged yet. Thanks.
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE2_StoreScalar, "StoreScalar", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movsdsse2}, HW_Category_MemoryStore, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_StoreLow, "StoreLow", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movq, INS_movq, INS_invalid, INS_movlpd}, HW_Category_MemoryStore, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_StoreHigh, "StoreHigh", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movhpd}, HW_Category_MemoryStore, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_Shuffle, "Shuffle", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_pshufd, INS_pshufd, INS_invalid, INS_invalid, INS_invalid, INS_shufpd}, HW_Category_IMM, HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SSE2_Shuffle
has 2-arg and 3-arg overload both, so that should not have "2" as the NumArg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SSE2_Shuffle
cannot have HW_Flag_NoJmpTableIMM
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SSE2_Shuffle has 2-arg and 3-arg overload both, so that should not have "2" as the NumArg
Any number of args will be wrong and it is special handled. As we discussed below I plan to implement it as a flag HW_Flag_VariableNumberOfArgs
and separate function for retrieving numArgs
which is dependent on baseType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dependent on baseType.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SSE2_Shuffle cannot have HW_Flag_NoJmpTableIMM
Will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
If baseType
of Shuffle
is TYP_DOUBLE
we have 3 args but for baseType
TYP_INT
or TYP_UINT
we will have 2 args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a general change for all the "VariableNumberOfArgs" intrinsics, not just for Shuffle
src/jit/lowerxarch.cpp
Outdated
@@ -2377,6 +2377,14 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node) | |||
return; | |||
} | |||
|
|||
// TODO-XArch-CQ: improve table driven number of arguments encoding | |||
// Due to different number of arguments for different base type for the same | |||
// hardware intrinsic function we need to special case NI_SSE2_Shuffle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can update function Compiler::numArgOfHWIntrinsic
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It will be necessary to support other intrinsics with variable number of arguments like SetVector128
/ SetVector256
. Perhaps we could use flag like HW_Flag_VariableNumberOfArgs
to short circuit numArg selection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variable number of arguments like SetVector128 / SetVector256
Not really, these two guys are "NoCodgen" herplers, we can use sig->numArg
in the importer.
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE2_StoreHigh, "StoreHigh", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movhpd}, HW_Category_MemoryStore, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_Shuffle, "Shuffle", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_pshufd, INS_pshufd, INS_invalid, INS_invalid, INS_invalid, INS_shufpd}, HW_Category_IMM, HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM) | ||
HARDWARE_INTRINSIC(SSE2_ShuffleHigh, "ShuffleHigh", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_pshufhw, INS_pshufhw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM) | ||
HARDWARE_INTRINSIC(SSE2_ShuffleLow, "ShuffleLow", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_pshuflw, INS_pshuflw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove HW_Flag_NoJmpTableIMM
for SSE2_ShuffleLow
and SSE2_ShuffleHigh
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
@4creators I suggest that implement |
src/jit/hwintrinsicxarch.cpp
Outdated
GenTree* op2 = node->gtGetOp2(); | ||
if (op1 != nullptr) | ||
{ | ||
if (op2 != nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You can move this check above the if(op1 != nullptr)
check and it will be slightly more efficient:
if (op2 != nullptr)
{
return 2;
}
if (op1 != nullptr)
{
if(op1->OperIsList())
{
}
else
{
return 1;
}
}
src/jit/hwintrinsicxarch.cpp
Outdated
} | ||
else if (op1->OperIsList()) | ||
{ | ||
assert(op1->AsArgList()->Rest()->Rest()->Current() != nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would rather this actually iterate through the nodes to get the actual count, rather than assume 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was one of the discussed points - IMO iteration in particular for large number of args like in SetVector128
or SetVector256
cases will be slow while using table driven field retrieval will be much faster. I am open to any solution on which we can reach consensus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That shouldn't be a concern... Those should be filtered out by the table-driven check (since the entry for those intrinsics won't be -1)
src/jit/instrsxarch.h
Outdated
@@ -196,6 +196,7 @@ INST3( cvttsd2si, "cvttsd2si" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, SSE | |||
|
|||
#ifndef LEGACY_BACKEND | |||
INST3( movntdq, "movntdq" , 0, IUM_WR, 0, 0, PCKDBL(0xE7), BAD_CODE, BAD_CODE) | |||
INST3( movnti, "movnti" , 0, IUM_WR, 0, 0, PCKFLT(0xC3), BAD_CODE, BAD_CODE) // Move doubleword/quadword from r32/r64 to m32/m64 using nontemporal hint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't appear to be used anywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It slipped in from my work on StoreNonTemporal
implementation, I can remove and reintroduce in the next PR. What is your preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introducing in the PR where it is consumed is my preference.
src/jit/instrsxarch.h
Outdated
INST3( pextrw, "pextrw" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, PCKDBL(0xC5)) // Extract 16-bit value into a r32 with zero extended to 32-bits | ||
INST3( pinsrw, "pinsrw" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, PCKDBL(0xC4)) // packed insert word | ||
INST3( pinsrw, "pinsrw" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, PCKDBL(0xC4)) // Move the low word of r32 or from m16 into xmm at the word position specified by imm8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The comment at the end-of-line generally looks to be the Arch Manual "long name", rather than actual descriptive text. e.g. PINSRW—Insert Word
, so the comment should be // Insert Word
or PSHUFLW—Shuffle Packed Low Words
, so the comment should be // Shuffle Packed Low Words
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there are both formats used - in majority of cases this is descriptive text like above on line 375
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personal preference is on the "long name", since attempting to describe the instruction behavior can quickly become too complicated, especially when multiple encodings are supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @tannergooding on this - I suspect that many of the comments that don't conform to this are older. In any case, going forward I think that the Arch Manual "long name" is the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK will fix and keep all comments in feature in "long name" format.
src/jit/hwintrinsicxarch.cpp
Outdated
@@ -234,6 +234,54 @@ int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic) | |||
return hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs; | |||
} | |||
|
|||
//------------------------------------------------------------------------ | |||
// numArgsOfHWIntrinsic: get the number of arguments based on HW GenTree node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description here is incorrect. It is only based on the node if the table entry is -1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, it retrieves numArgs
when table entry is either >= 0 or -1. So essentially it gets the exact number of args for the given intrinsic when it is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to improve description to emphasize this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, it retrieves numArgs when table entry is either >= 0 or -1.
The code is doing exactly this:
int numArgs = hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs;
if (numArgs != -1)
{
return numArgs;
}
So it is first checking the table and returns that value if the entry wasn't -1
. Otherwise, it checks the node operands and returns a value based on those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if someone had a table entry of -2
, this function would return back -2
.
src/jit/emitxarch.cpp
Outdated
@@ -5550,11 +5550,24 @@ static bool isSSEExtract(instruction ins) | |||
} | |||
} | |||
|
|||
static bool isSseShuffle(instruction ins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the name of this function, since it no longer encapsulates all SseShuffle, but instead only the shuffle instructions with a special encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can be called something like noVexNdsThreeOpInstructions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that noVexNdsThreeOpInstructions
is very clear. How about something like isDstSrcImmSseInstruction
to correspond with IsDstDstSrcAVXInstruction
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CarolEidt IsDstDstSrcAVXInstruction
implies the instruction has VEX.NDS
encoding that may duplicate the first-arg in the VEX.vvvv
field.
The problem that we are trying to solve here is some instructions do not have VEX.NDS
encoding, which means they have the same register-form in SSE and VEX encoding. For example,
EXTRACTPS reg/m32, xmm1, imm8
VEXTRACTPS reg/m32, xmm1, imm8
So, I think the names corresponding to IsDstDstSrcAVXInstruction
may be more confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about noVexNdsRegRegImmInstructions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considered again, isDstSrcImmSseInstruction
is also good to me if we have some comments here 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CarolEidt @tannergooding @fiigii
I have fixed this already by merging both isSseExtract
and isSseShuffle
functions. After consulting manual I have realized that DstSrcImm
format is common to all ISAs including SSE, AVX and AVX512 and perhaps better name would be isDstSrcImmSIMDInstruction
Current implementation is as follows:
//------------------------------------------------------------------------
// isDstSrcImmSIMDInstruction: check if instruction has RM R I format
// for all encodings: EVEX, VEX and legacy SSE encoding
//
// Arguments:
// instruction -- processor instruction to check
//
// Return Value:
// true if instruction has RRI format
//
static bool isDstSrcImmSIMDInstruction(instruction ins)
{
switch (ins)
{
case INS_extractps:
case INS_pextrb:
case INS_pextrw:
case INS_pextrd:
case INS_pextrq:
case INS_pshufd:
case INS_pshufhw:
case INS_pshuflw:
return true;
default:
return false;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@4creators, it probably just needs to be IsDstSrcImmAvxInstruction
, as that clarifies that the VEX encoding is the same as the non-vex encoding (where it would otherwise get assumed to be DstDstSrcImm
or DstSrcSrcImm
)
src/jit/hwintrinsicxarch.cpp
Outdated
return numArgs; | ||
} | ||
|
||
assert(numArgs == -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this assert is pointless.
A better assert would be to assert numArgs >= 0
before returning numArgs
above, or to change the check to if (numArgs >= 0) { return numArgs; }
and then to keep this assert (numArgs == -1
).
src/jit/hwintrinsicxarch.cpp
Outdated
{ | ||
assert(op1->AsArgList()->Rest()->Rest()->Current() != nullptr); | ||
assert(op1->AsArgList()->Rest()->Rest()->Rest() == nullptr); | ||
return 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still like to actually enumerate the argList
, as it is both future proof (and will be required to support future ISAs) and is more correct in the case of the assertions being wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also won't matter, perf wise, since the majority of cases will be table driven.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a single two-arg overload of numArgsOfHWIntrinsic
is the way to go, and I agree with @tannergooding that it should actually enumerate all the args.
src/jit/compiler.h
Outdated
@@ -3108,6 +3108,7 @@ class Compiler | |||
static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic); | |||
unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig); | |||
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic); | |||
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think I suggested a single overload in my earlier comments (prior to @fiigii's excellent suggestion to avoid using the base type altogether). You could have node
default to nullptr
and then add a noway_assert
that it is non-null if numArgs == -1
, followed conservatively by a check that it is non-null before doing the rest of the method. If by mistake a caller fails to pass a non-null node
argument when they should, the behavior will be the same "as if" the 1-arg overload had been called. And with the null argument, the C++ compiler should be expected to do the inlining and the elimination of the non-null path.
In any event, we want to be very careful about pre-optimizing code at the expense of clarity and conciseness.
src/jit/emitxarch.cpp
Outdated
@@ -5550,11 +5550,24 @@ static bool isSSEExtract(instruction ins) | |||
} | |||
} | |||
|
|||
static bool isSseShuffle(instruction ins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that noVexNdsThreeOpInstructions
is very clear. How about something like isDstSrcImmSseInstruction
to correspond with IsDstDstSrcAVXInstruction
?
9307a46
to
8c78f07
Compare
Skipping OSX tests as they are expected to fail |
src/jit/compiler.h
Outdated
@@ -3107,7 +3107,7 @@ class Compiler | |||
bool isScalarISA(InstructionSet isa); | |||
static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic); | |||
unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig); | |||
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic); | |||
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You haven't set the default parameter value; however, it seems that all callers are passing in the node.
Either the function header comment needs to be fixed, or the default value should be set here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh my. It is my local merge error - it was set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any scenarios where we will be looking this up without having a corresponding node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are not, I would vote not having any default and actually asserting that node != null
, since we generally want to be ensuring that the lookup returns the correct value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think this function should have nullptr as the default value of node. Every call site does not know the intrinsic is variable args or not. One-arg calling will lead misusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, looking at the call sites, we don't use this method in a place where prior knowledge of the intrinsic would be available. So, I agree - no default is the way to go. Sorry for the distraction.
src/jit/hwintrinsicxarch.cpp
Outdated
int numArgs = hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs; | ||
// TODO_XARCH-BUG: several intrinsics having -1 numArgs do not handle correctly | ||
// real number of arguments - limiting slow path only to Sse2.Shuffle | ||
// https://github.com/dotnet/coreclr/issues/16714 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not comfortable with checking in this workaround. I believe that @fiigii suggested the right fix. I think these asserts:
assert(numArgs != 0);
assert(numArgs != 1);
Should be replaced with assert(numArgs >= 2);
below the if (info->srcCount >= 2)
check.
test Windows_NT x64 Checked jitincompletehwintrinsic test Windows_NT x86 Checked jitincompletehwintrinsic test Ubuntu x64 Checked jitincompletehwintrinsic |
static bool isSSEExtract(instruction ins) | ||
//------------------------------------------------------------------------ | ||
// IsDstSrcImmAvxInstruction: check if instruction has RM R I format | ||
// for all encodings: EVEX, VEX and legacy SSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instruction has RM R I format
This comment is not correct, certain instructions can be "ins R RM I".
BTW, I still think the comment should emphasize "NO VEX.NDS".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fiigii Can I correct it in next PR. I need Ssse2.Shuffle
to implement some remaining SSE2 helpers and I am blocked to some extent on this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, this PR is still waiting for MS maintainers' final review.
I need Ssse2.Shuffle to implement some remaining SSE2 helpers and I am blocked to some extent on this PR.
Actually, you can copy the one-line definition of Shuffle into your current workspace, and rebase after this PR gets merged.
// instruction -- processor instruction to check | ||
// | ||
// Return Value: | ||
// true if instruction has RRI format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return vale comment should be modified as well (VEX.NDS can be RRI).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix. Question as above: could we postpone it to my next PR?
return numArgs; | ||
} | ||
|
||
noway_assert(node != nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add assert(node->gtHWIntrinsicId == intrinsic)
after this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to have it verified 👍
@@ -219,19 +219,64 @@ unsigned Compiler::simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_I | |||
} | |||
|
|||
//------------------------------------------------------------------------ | |||
// numArgsOfHWIntrinsic: get the number of arguments | |||
// numArgsOfHWIntrinsic: get the number of arguments based on table and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe reword to:
gets the number of arguments for the hardware intrinsic.
This attempts to do a table based lookup but will fallback to the number of operands in 'node' if the table entry is -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds better that current one - will change. Next PR would be OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are going to push any more changes, I would fix it this PR. Otherwise I think next PR is fine (provided @CarolEidt agrees).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this LGTM. I would like to see the last of the comments corrected and the assert @fiigii suggested added, but we can do that in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, I would like to see the changes of the default value, assertion, and comments (in this or next PR, waiting for @CarolEidt 's approval).
@@ -3107,7 +3107,7 @@ class Compiler | |||
bool isScalarISA(InstructionSet isa); | |||
static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic); | |||
unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig); | |||
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic); | |||
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node = nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove = nullptr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - I'm ok with merging this and cleaning up the comments and removing the default arg in a subsequent PR
Fixes #16714