-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Implement SSE4.1 Insert/Extract and simplify SSE/SSE2 intrinsic #16646
Conversation
// Select base type using argument type | ||
HW_Flag_BaseTypeFromArg = 0x400, | ||
// Select base type using the first argument type | ||
HW_Flag_BaseTypeFromFirstArg = 0x400, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, all the intrinsics (except MemoryStore) that return a scalar value (e.g., bool
) should explicitly have flag HW_Flag_BaseTypeFromFirstArg
or HW_Flag_BaseTypeFromSecondArg
. cc @4creators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of asserts (which we were a few hours ago on another thread), this seems like something that could be asserted somewhere - e.g. a pass over the table that validates these kinds of properties. There are probably other consistency checks across signature, category and flag that could be checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fiigii, could you log an issue tracking this?
src/jit/compiler.h
Outdated
@@ -3125,6 +3117,7 @@ class Compiler | |||
GenTree* impNonConstFallback(NamedIntrinsic intrinsic, var_types simdType, var_types baseType); | |||
static bool isImmHWIntrinsic(NamedIntrinsic intrinsic, GenTree* lastOp); | |||
GenTree* addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* lastOp, bool mustExpand); | |||
bool signatureTypeSupproted(var_types retType, CORINFO_SIG_INFO* sig, HWIntrinsicFlag flags); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Supproted
->Supported
src/jit/hwintrinsicxarch.cpp
Outdated
// | ||
// Return Value: | ||
// On 32-bit platforms, return true if the intrinsic does not use any 64-bit registers (r64) | ||
bool Compiler::signatureTypeSupproted(var_types retType, CORINFO_SIG_INFO* sig, HWIntrinsicFlag flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These needs to have HWIntrinsic in the name somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
src/jit/namedintrinsiclist.h
Outdated
// 64-bit intrinsics | ||
// Intrinsics that operate over 64-bit general purpose registers are not supported on 32-bit platform | ||
HW_Flag_64BitOnly = 0x4000, | ||
HW_Flag_firstArgMaybe64Bit = 0x8000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Fix casing on First
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE_ConvertToInt32, "ConvertToInt32", SSE, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtss2si, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg) | ||
HARDWARE_INTRINSIC(SSE_ConvertToInt64, "ConvertToInt64", SSE, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtss2si, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg) | ||
HARDWARE_INTRINSIC(SSE_ConvertToSingle, "ConvertToSingle", SSE, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movss, INS_invalid}, HW_Category_Helper, HW_Flag_BaseTypeFromFirstArg) | ||
HARDWARE_INTRINSIC(SSE_ConvertScalarToVector128Single, "ConvertScalarToVector128Single", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtsi2ss, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_SecondArgMaybe64Bit|HW_Flag_BaseTypeFromFirstArg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be marked as CopyUpperBits
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE2_ConvertScalarToVector128Double, "ConvertScalarToVector128Double", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtsi2sd, INS_invalid, INS_cvtsi2sd, INS_invalid, INS_cvtss2sd, INS_invalid}, HW_Category_Special, HW_Flag_NoFlag) | ||
HARDWARE_INTRINSIC(SSE2_ConvertToVector128Int32, "ConvertToVector128Int32", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtps2dq, INS_cvtpd2dq}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromArg) | ||
HARDWARE_INTRINSIC(SSE2_ConvertToDouble, "ConvertToDouble", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movsd}, HW_Category_Helper, HW_Flag_BaseTypeFromFirstArg) | ||
HARDWARE_INTRINSIC(SSE2_ConvertToInt32, "ConvertToInt32", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov_xmm2i, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtsd2si}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_MultiIns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this one (and the following) MultiIns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these intrinsics need special CodeGen (but can be table-driven in the importer). Probably, I should add a new flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, I should add a new flag
I think so. I don't believe we want to use MultiIns
for anything other than intrinsics which generate multiple instructions
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE2_ConvertToUInt32, "ConvertToUInt32", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov_xmm2i, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_MultiIns) | ||
HARDWARE_INTRINSIC(SSE2_ConvertToUInt64, "ConvertToUInt64", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov_xmm2i, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_MultiIns) | ||
HARDWARE_INTRINSIC(SSE2_ConvertToVector128Double, "ConvertToVector128Double", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtdq2pd, INS_invalid, INS_invalid, INS_invalid, INS_cvtps2pd, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg) | ||
HARDWARE_INTRINSIC(SSE2_ConvertScalarToVector128Double, "ConvertScalarToVector128Double", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtsi2sd, INS_invalid, INS_cvtsi2sd, INS_invalid, INS_cvtss2sd, INS_invalid}, HW_Category_Special, HW_Flag_SecondArgMaybe64Bit|HW_Flag_MultiIns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You changed the corresponding SSE one off of Special
, why not this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Will fix.
87b2e82
to
556065f
Compare
556065f
to
81961d1
Compare
Implemented SSE4.1 @CarolEidt @tannergooding PTAL |
("InsertScalarTest.template",new string[] { "Sse2", "Sse2", "Insert", "Vector128", "Int16", "Vector128", "Int16", "1", "(short)2", "16", "(short)0", "(i == 1 ? result[i] != 2 : result[i] != 0)"}), | ||
("InsertScalarTest.template",new string[] { "Sse2", "Sse2", "Insert", "Vector128", "UInt16", "Vector128", "UInt16", "1", "(ushort)2","16", "(ushort)0", "(i == 1 ? result[i] != 2 : result[i] != 0)"}), | ||
("InsertScalarTest.template",new string[] { "Sse2", "Sse2", "Insert", "Vector128", "Int16", "Vector128", "Int16", "129","(short)2", "16", "(short)0", "(i == 1 ? result[i] != 2 : result[i] != 0)"}), | ||
("InsertScalarTest.template",new string[] { "Sse2", "Sse2", "Insert", "Vector128", "UInt16", "Vector128", "UInt16", "129","(ushort)2","16", "(ushort)0", "(i == 1 ? result[i] != 2 : result[i] != 0)"}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made two new templates for insert and extract.
@4creators I implemented SSE2 insert and extract in this PR as well, if you have done on them, I can remove here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fiigii I have implemented them as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@4creators Sounds Good, could you submit your current work or open an issue to keep sync with us?
HARDWARE_INTRINSIC(SSE41_Floor, "Floor", SSE41, 9, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_roundps, INS_roundpd}, HW_Category_SimpleSIMD, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE41_FloorScalar, "FloorScalar", SSE41, 9, 16, -1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_roundss, INS_roundsd}, HW_Category_Special, HW_Flag_CopyUpperBits) | ||
HARDWARE_INTRINSIC(SSE41_Insert, "Insert", SSE41, -1, 16, 3, {INS_pinsrb, INS_pinsrb, INS_invalid, INS_invalid, INS_pinsrd, INS_pinsrd, INS_pinsrq, INS_pinsrq, INS_insertps, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_SecondArgMaybe64Bit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding Shall I give Insert
a HW_Flag_NoRMWSemantics
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should only be on instructions that don't need hasDelayFreeSrc
on op2
.
I think pinsrb/pinsrd/pinsrq
qualify here (since op2 is an integer reg and taget is an xmm reg).
However, insertps
shouldn't have it, since op2 is also an xmm reg and we need to ensure that op2 is different from targetReg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, looks like Insert
should not have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to ensure that pinsrb/d/q are specially handled
@CarolEidt could you please take a look at this PR? |
@fiigii I am just getting on a plane. I should be able to review it in a few hours. Sorry for the delay. |
@CarolEidt No worries, thank you 😄 |
src/jit/emitxarch.cpp
Outdated
#ifndef LEGACY_BACKEND | ||
if (ins == INS_extractps) | ||
{ | ||
mReg = id->idReg1(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we handle this in codegen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm..we can. The special case extractps
uses ModRM:r/m
for the dst
register (store semantics).
I will handle this in codgen, thanks!
src/jit/lsraxarch.cpp
Outdated
case NI_SSE41_Extract: | ||
if (intrinsicTree->gtSIMDBaseType == TYP_FLOAT) | ||
{ | ||
info->internalIntCount += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed to get the value back in an XMM register, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the original extractps
instruction writes the result (a floating-point number) into a GPR, but our API returns a float
. So we need the int-register as the dst
of extractps
.
src/jit/instrsxarch.h
Outdated
INST3( pextrb, "pextrb" , 0, IUM_WR, 0, 0, SSE3A(0x14), BAD_CODE, BAD_CODE) // Extract Byte | ||
INST3( pextrd, "pextrd" , 0, IUM_WR, 0, 0, SSE3A(0x16), BAD_CODE, BAD_CODE) // Extract Dword | ||
INST3( pextrq, "pextrq" , 0, IUM_WR, 0, 0, SSE3A(0x16), BAD_CODE, BAD_CODE) // Extract Qword | ||
INST3( extractps, "extractps" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, SSE3A(0x17)) // Extract float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I believe most of these names have been copied out of the Arch manual... So this should probably be Extract Packed Floating-Point Values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks!
src/jit/hwintrinsicxarch.cpp
Outdated
@@ -677,17 +702,25 @@ GenTree* Compiler::impHWIntrinsic(NamedIntrinsic intrinsic, | |||
// table-driven importer of simple intrinsics | |||
if (impIsTableDrivenHWIntrinsic(category, flags)) | |||
{ | |||
if (!varTypeIsSIMD(retType) || ((flags & HW_Flag_BaseTypeFromArg) != 0)) | |||
if (category == HW_Category_MemoryStore || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: I had to move this block in my AVX PR (#16655) in order to deal with the intrinsics that have both generic and non-generic overloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, but needs minor changes: comments & parentheses.
I mention a suggestion for a consistency check - but that's not something needed for this PR.
// Select base type using argument type | ||
HW_Flag_BaseTypeFromArg = 0x400, | ||
// Select base type using the first argument type | ||
HW_Flag_BaseTypeFromFirstArg = 0x400, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of asserts (which we were a few hours ago on another thread), this seems like something that could be asserted somewhere - e.g. a pass over the table that validates these kinds of properties. There are probably other consistency checks across signature, category and flag that could be checked.
src/jit/namedintrinsiclist.h
Outdated
// the intrinsics need special rules in CodeGen, | ||
// but can be table-driven in the front-end | ||
HW_Flag_SpecialCodeGen = 0x20000, | ||
|
||
// No Read/Modify/Write Semantics | ||
// the intrinsic does not have read/modify/write semantics and doesn't need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pre-existing, but this comment isn't complete.
I would qualify it with "the intrinsic doesn't have read/modify/write semantics in two-operand form." (or non-AVX, or whatever the appropriate characterization is).
@@ -494,6 +493,7 @@ bool Compiler::isFullyImplmentedISAClass(InstructionSet isa) | |||
case InstructionSet_SSE: | |||
case InstructionSet_SSE3: | |||
case InstructionSet_SSSE3: | |||
case InstructionSet_SSE41: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more moved to the happy state! :-)
src/jit/hwintrinsicxarch.cpp
Outdated
// flags - flags of the intrinsics | ||
// | ||
// Return Value: | ||
// On 32-bit platforms, return true if the intrinsic does not use any 64-bit registers (r64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description doesn't reflect the name or the real purpose. I think it should say:
// Return Value:
// Returns true iff the given type signature is supported
//
// Notes:
// - This is only used on 32-bit systems to determine whether the signature uses no 64-bit registers.
// - The `retType` is passed to avoid another call to the type system, as it has already been retrieved.
src/jit/hwintrinsicxarch.cpp
Outdated
@@ -677,17 +702,25 @@ GenTree* Compiler::impHWIntrinsic(NamedIntrinsic intrinsic, | |||
// table-driven importer of simple intrinsics | |||
if (impIsTableDrivenHWIntrinsic(category, flags)) | |||
{ | |||
if (!varTypeIsSIMD(retType) || ((flags & HW_Flag_BaseTypeFromArg) != 0)) | |||
if (category == HW_Category_MemoryStore || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This expression needs more parentheses to be fully-parenthesized.
src/jit/hwintrinsicxarch.cpp
Outdated
assert(category == HW_Category_MemoryStore); | ||
baseType = | ||
getBaseTypeOfSIMDType(info.compCompHnd->getArgClass(sig, info.compCompHnd->getArgNext(sig->args))); | ||
assert(category == HW_Category_MemoryStore || ((flags & HW_Flag_BaseTypeFromSecondArg) != 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also needs more parens.
src/jit/lsraxarch.cpp
Outdated
@@ -2323,7 +2324,7 @@ void LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic* intrinsicTree) | |||
assert(numArgs != 0); | |||
assert(numArgs != 1); | |||
|
|||
if (info->srcCount >= 2) | |||
if (info->srcCount >= 2 && (genActualType(intrinsicTree->TypeGet()) == genActualType(op2->TypeGet()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding I think we can have a stricter check to avoid handling special cases like Insert
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, we cannot have this check.
test Windows_NT x64 Checked jitincompletehwintrinsic test Windows_NT x86 Checked jitincompletehwintrinsic test Ubuntu x64 Checked jitincompletehwintrinsic test OSX10.12 x64 Checked jitincompletehwintrinsic |
CI seems to have some IO problems. All tests passed on my local machines. |
@tannergooding @CarolEidt Addressed all the feedback, does the current code look good to you? |
CC. @dotnet/dnceng, are the Mac machines in some bad state? |
if ((ins == INS_pextrq || ins == INS_pinsrq) && !UseVEXEncoding()) | ||
{ | ||
assert(UseSSE4()); | ||
sz += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate issue: We really need to cleanup the code size estimation code (CC. @CarolEidt).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let me just remind you that mis-estimates can lead to odd CHK/RET codegen diffs (#13399).
I have logged issue #16696 due to failing OSX jobs |
@tannergooding I think there is something wrong in the lab. This is happening across multiple jenkins instances. |
@@ -593,17 +598,6 @@ void CodeGen::genSSEIntrinsic(GenTreeHWIntrinsic* node) | |||
|
|||
switch (intrinsicID) | |||
{ | |||
case NI_SSE_ConvertScalarToVector128Single: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm. This is now considered table driven, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
case NI_SSE41_Extract: | ||
if (baseType == TYP_FLOAT) | ||
{ | ||
info->internalIntCount += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we explicitly set the internal candidates, as we have done elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No necessary, this internal register can be any GPR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me overall.
I would like to spend some time cleaning up this code (as it is a bit messy right now) after we get the currently approved intrinsics in.
Don't you mean to clean up the code-size estimation code? |
I meant the intrinsic code in general. I've gone over it a couple times and it looks like there is some room for cleanup/improvement in a few places. |
test OSX10.12 x64 Checked jitincompletehwintrinsic |
test OSX10.12 x64 Checked Innerloop Build and Test |
Can we merge this PR? |
I will go ahead and merge. I think the odds of breaking something on OSX with this change but not the other OS's is pretty small. |
It seems that we had a bad merge with PR #16669. @fiigii @tannergooding could one of you ensure consistent use of |
@CarolEidt I have changed |
This PR
r64
instructions on 32-bit platforms and simplify SSE/SSE2 intrinsic code(close https://github.com/dotnet/coreclr/issues/16342).Insert
andExtract
and marks SSE4.1 as complete.@CarolEidt @tannergooding PTAL