Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Implement Shuffle* Sse2 Intel hardware intrinsics #15777

Merged
merged 2 commits into from
Mar 3, 2018

Conversation

4creators
Copy link

@4creators 4creators commented Jan 7, 2018

Fixes #16714

@@ -33,7 +33,7 @@ static const wchar_t *coreCLRDll = W("CoreCLR.dll");
static const wchar_t *coreCLRInstallDirectory = W("%windir%\\system32\\");

// Encapsulates the environment that CoreCLR will run in, including the TPALIST
class HostEnvironment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file doesn't have any modifications outside of whitespace changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is dirty commit - just needed to run CI - everything will get cleaned later

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to comment @dotnet-bot test <test-name> please to rerun any given job, use this to rerun all jobs

ins == INS_punpcklbw || ins == INS_punpckhqdq || ins == INS_punpcklqdq || ins == INS_punpckhwd ||
ins == INS_punpcklwd || ins == INS_punpckhdq || ins == INS_packssdw || ins == INS_packsswb ||
ins == INS_packuswb || ins == INS_packusdw || ins == INS_vperm2i128);
switch (ins)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would still probably be useful to pull this change out to its own PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

@@ -1085,6 +1181,7 @@ inline bool hasCodeMI(instruction ins)
inline size_t insCodeMI(instruction ins)
{
assert((unsigned)ins < _countof(insCodesMI));
size_t code = insCodesMI[ins];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug help for conditional breakpoint - to be removed

@@ -3765,13 +3862,15 @@ void emitter::emitIns_R_R(instruction ins, emitAttr attr, regNumber reg1, regNum
void emitter::emitIns_R_R_I(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int ival)
{
// SSE2 version requires 5 bytes and SSE4/AVX version 6 bytes
UNATIVE_OFFSET sz = 4;
UNATIVE_OFFSET sz = 4;
bool encodeAsSse = false;
Copy link
Member

@tannergooding tannergooding Jan 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? This logic should all be controlled elsewhere via UseVEXEncoding

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other instructions (such as INS_psrldq) should be going through emitIns_R_I

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(for the non-vex case)

Copy link
Author

@4creators 4creators Jan 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original logic did not work correctly for SSE2 codegen. In case of istruction with the format ins xmm, imm they had always being set with id->idInsFmt(IF_RRW_RRW_CNS); what resulted in code like psllw xmm0, xmm0, 0 what is an invalid SSE2 instruction what resulted in CPU exception.

Above logic ensures that RRW_CNS instructions are flagged with valid format IF_RRW_CNS what resolves this issue. Table based codegen should ensure this same logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of istruction with the format ins xmm, imm they had always being set with id->idInsFmt(IF_RRW_RRW_CNS);

The instructions should be going through an emitIns_R_I method for the SSE2 case. They should only go through emitIns_R_R_I on AVX+


void emitter::emitIns_SIMD_R_R_I(instruction ins, regNumber reg1, regNumber reg2, ssize_t iVal, var_types simdtype)
{
emitIns_R_R_I(ins, emitTypeSize(simdtype), reg1, reg2, (int)iVal);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe R_R_I should only happen under UseVEXEncoding

Copy link
Author

@4creators 4creators Jan 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on instruction - all shuffle cases, cmppd, cmpps, pinsrw are IF_RRW_RRW_CNS under SSE2 so they should go through R_R_I

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but those instructions should be going through SIMD_R_R_R_I.

In almost all cases so far, the codegen has been calling the emitIns_SIMD_* case that matches the VEX encoding. We then have logic for the fallback case so that args can be arranged properly.

Ex: cmppd is R_R_R_I on AVX+ machines and R_R_I elsewhere. So we call emitIns_SIMD_R_R_R_I.

emitIns_SIMD_R_R_R_I then has the logic for, on vex, generate the instruction directly (emtIns_R_R_R_I), on non-vex generate a movaps (emitIns_R_R) followed by the instruction (emitIns_R_R_I).

That being said, it does look like the pshufd instructions are bit special in that they (like crc32, lzcnt, popcnt, and a few others) have a consistent encoding between VEX and non-VEX machines, in which case we should just be calling emitIns_* directly, without an intermediary emitIns_SIMD_* step

void emitter::emitIns_SIMD_R_R_R_I(
instruction ins, regNumber reg, regNumber reg1, regNumber reg2, ssize_t ival, var_types simdtype)
{
if (UseVEXEncoding() && reg1 != reg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've dropped the reg1 != reg from this line elsewhere, it isn't useful

{
if (reg1 != reg)
{
emitIns_R_R_I(INS_movaps, emitTypeSize(simdtype), reg, reg1, (int)ival);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this is correct. movaps is a R_R instruction

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, will fix by dropping whole if statement. Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think dropping the if statement is correct.

We still have, a target register, two operands, and an immediate, which isn't encodable on non-VEX machines if reg1 != reg

@@ -304,7 +304,8 @@ void GenTree::InitNodeSize()
GenTree::s_gtNodeSizes[GT_INDEX_ADDR] = TREE_NODE_SZ_LARGE;
GenTree::s_gtNodeSizes[GT_ARR_BOUNDS_CHECK] = TREE_NODE_SZ_LARGE;
#ifdef FEATURE_SIMD
GenTree::s_gtNodeSizes[GT_SIMD_CHK] = TREE_NODE_SZ_LARGE;
GenTree::s_gtNodeSizes[GT_SIMD_CHK] = TREE_NODE_SZ_LARGE;
GenTree::s_gtNodeSizes[GT_SIMD] = TREE_NODE_SZ_LARGE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed? We shouldn't be touching the GT_SIMD code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not touched it - clang -format and clang-tidy did it by creating format patch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm.... I wouldn't expect clang format or clang-tidy to add new code, only move around existing code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's my copy paste error around place flagged by clang-format

* Create a list out of the five values.
*/

GenTreeArgList* Compiler::gtNewArgList(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Parameter counts above 4 looked to be very uncommon, so a helper function that calls the smaller ones and works with arbitrary sizes might be more useful

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my experiments to be removed - when you run a loop on 8 or 16 arguments and use in the first step 4 args function in every next step you need 5 args function to pass 4 args and append them to 1 list - but actually function quite different from this one.

GenTreeHWIntrinsic* Compiler::gtNewSimdHWIntrinsicNode(var_types type,
GenTree* op1,
GenTree* op2,
ssize_t iVal,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want/need to carry the ival as part of the node. It will end up being table driven, but before then, you can pull it out of the op when generating code or add a small switch for the hardcoded cases.

src/jit/gtlist.h Outdated
@@ -158,11 +158,11 @@ GTNODE(GE , GenTreeOp ,0,GTK_BINOP|GTK_RELOP)
GTNODE(GT , GenTreeOp ,0,GTK_BINOP|GTK_RELOP)
#ifndef LEGACY_BACKEND
// These are similar to GT_EQ/GT_NE but they generate "test" instead of "cmp" instructions.
// Currently these are generated during lowering for code like ((x & y) eq|ne 0) only on
// Currently these are generated during lowering for code like ((x & y) eq|ne 0) only on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another file with whitespace only changes.

break;
}

if (ins != INS_cmppd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could drop this check if you return instead of break in the TYP_DOUBLE case.

op2Reg = op2->gtRegNum;

instruction ins;
switch (baseType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can drop the switch statement for the ones that only take a single type.

In the long run (for generic methods) we will want to be checking the types in the importer and returning a PNSE exception for the invalid cases.

src/jit/simd.cpp Outdated
GenTreePtr tree = se.val;

return impSIMD(type, tree, ti, expectAddr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactoring?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, one of experiments.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be removed onviously 😄

@4creators
Copy link
Author

@tannergooding
Is it possible to make a commit to PR, skip CI and run only couple of jobs?

@tannergooding
Copy link
Member

I don't believe so. Pushing a new commit will run all jobs set to automatically run

@4creators
Copy link
Author

@tannergooding Is it possible to pass error message from Jit to exception created in importer? Currently available standard messages are not very helpful.

@4creators
Copy link
Author

I am waiting for #16183 before submitting commits to finish work on this PR.

@4creators 4creators changed the title [WIP] Implement Sse2 Intel hardware intrinsics - part 2 Implement Extract/Insert/Shuffle* Sse2 Intel hardware intrinsics Feb 28, 2018
@4creators
Copy link
Author

@fiigii @tannergooding @CarolEidt Added Extract/Insert/Shuffle* SSE2 HW intrinsics. To speed up work I have decided to update tests to templated format after all of SSE2 intrinsics are implemented.

HARDWARE_INTRINSIC(SSE2_LoadScalarVector128, "LoadScalarVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movd, INS_movd, INS_movq, INS_movq, INS_invalid, INS_movsdsse2}, HW_Category_MemoryLoad, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_LoadVector128, "LoadVector128", SSE2, -1, 16, 1, {INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_invalid, INS_movupd}, HW_Category_MemoryLoad, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_Extract, "Extract", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_pextrw, INS_pextrw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM)
HARDWARE_INTRINSIC(SSE2_Insert, "Insert", SSE2, -1, 16, 3, {INS_invalid, INS_invalid, INS_pinsrw, INS_pinsrw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You give these intrinsics HW_Flag_NoJmpTableIMM, but did not implement the fallback.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSE2_Extract and SSE2_Insert actually needs HW_Flag_NoJmpTableIMM, but you need to implement the "alternatives" in the importer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You are right - will do it.

HARDWARE_INTRINSIC(SSE2_LoadFence, "LoadFence", SSE2, -1, 0, 0, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Special, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_LoadScalarVector128, "LoadScalarVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movd, INS_movd, INS_movq, INS_movq, INS_invalid, INS_movsdsse2}, HW_Category_MemoryLoad, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_LoadVector128, "LoadVector128", SSE2, -1, 16, 1, {INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_movdqu, INS_invalid, INS_movupd}, HW_Category_MemoryLoad, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_Extract, "Extract", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_pextrw, INS_pextrw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract needs HW_Flag_BaseTypeFromFirstArg.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I believe SSE2_Extract and SSE2_Insert require specially treatment in CodeGen and emitter. Please see #16646

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract needs HW_Flag_BaseTypeFromFirstArg

Fortunately it is not the case as current code handles this intrinsics correctly based on their signatures - intrinsics returning scalar values are special cased and correctly determine baseType

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, my PR is not merged yet. Thanks.

HARDWARE_INTRINSIC(SSE2_StoreScalar, "StoreScalar", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movsdsse2}, HW_Category_MemoryStore, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_StoreLow, "StoreLow", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movq, INS_movq, INS_invalid, INS_movlpd}, HW_Category_MemoryStore, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_StoreHigh, "StoreHigh", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movhpd}, HW_Category_MemoryStore, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_Shuffle, "Shuffle", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_pshufd, INS_pshufd, INS_invalid, INS_invalid, INS_invalid, INS_shufpd}, HW_Category_IMM, HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSE2_Shuffle has 2-arg and 3-arg overload both, so that should not have "2" as the NumArg.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSE2_Shuffle cannot have HW_Flag_NoJmpTableIMM.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSE2_Shuffle has 2-arg and 3-arg overload both, so that should not have "2" as the NumArg

Any number of args will be wrong and it is special handled. As we discussed below I plan to implement it as a flag HW_Flag_VariableNumberOfArgs and separate function for retrieving numArgs which is dependent on baseType.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dependent on baseType.

Why?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSE2_Shuffle cannot have HW_Flag_NoJmpTableIMM

Will do

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

If baseType of Shuffle is TYP_DOUBLE we have 3 args but for baseType TYP_INT or TYP_UINT we will have 2 args.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a general change for all the "VariableNumberOfArgs" intrinsics, not just for Shuffle

@@ -2377,6 +2377,14 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)
return;
}

// TODO-XArch-CQ: improve table driven number of arguments encoding
// Due to different number of arguments for different base type for the same
// hardware intrinsic function we need to special case NI_SSE2_Shuffle
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can update function Compiler::numArgOfHWIntrinsic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It will be necessary to support other intrinsics with variable number of arguments like SetVector128 / SetVector256. Perhaps we could use flag like HW_Flag_VariableNumberOfArgs to short circuit numArg selection

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variable number of arguments like SetVector128 / SetVector256

Not really, these two guys are "NoCodgen" herplers, we can use sig->numArg in the importer.

HARDWARE_INTRINSIC(SSE2_StoreHigh, "StoreHigh", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movhpd}, HW_Category_MemoryStore, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_Shuffle, "Shuffle", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_pshufd, INS_pshufd, INS_invalid, INS_invalid, INS_invalid, INS_shufpd}, HW_Category_IMM, HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM)
HARDWARE_INTRINSIC(SSE2_ShuffleHigh, "ShuffleHigh", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_pshufhw, INS_pshufhw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM)
HARDWARE_INTRINSIC(SSE2_ShuffleLow, "ShuffleLow", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_pshuflw, INS_pshuflw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove HW_Flag_NoJmpTableIMM for SSE2_ShuffleLow and SSE2_ShuffleHigh as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@fiigii
Copy link

fiigii commented Mar 1, 2018

@4creators I suggest that implement Shuffle* in a PR at first, which already have all the infrastructure (except numArgOfHWIntrinsic update) and test template.

GenTree* op2 = node->gtGetOp2();
if (op1 != nullptr)
{
if (op2 != nullptr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You can move this check above the if(op1 != nullptr) check and it will be slightly more efficient:

if (op2 != nullptr)
{
    return 2;
}

if (op1 != nullptr)
{
    if(op1->OperIsList())
    {
    }
    else
    {
        return 1;
    }
}

}
else if (op1->OperIsList())
{
assert(op1->AsArgList()->Rest()->Rest()->Current() != nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would rather this actually iterate through the nodes to get the actual count, rather than assume 3.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was one of the discussed points - IMO iteration in particular for large number of args like in SetVector128 or SetVector256 cases will be slow while using table driven field retrieval will be much faster. I am open to any solution on which we can reach consensus.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That shouldn't be a concern... Those should be filtered out by the table-driven check (since the entry for those intrinsics won't be -1)

@@ -196,6 +196,7 @@ INST3( cvttsd2si, "cvttsd2si" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, SSE

#ifndef LEGACY_BACKEND
INST3( movntdq, "movntdq" , 0, IUM_WR, 0, 0, PCKDBL(0xE7), BAD_CODE, BAD_CODE)
INST3( movnti, "movnti" , 0, IUM_WR, 0, 0, PCKFLT(0xC3), BAD_CODE, BAD_CODE) // Move doubleword/quadword from r32/r64 to m32/m64 using nontemporal hint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't appear to be used anywhere

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It slipped in from my work on StoreNonTemporal implementation, I can remove and reintroduce in the next PR. What is your preference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing in the PR where it is consumed is my preference.

INST3( pextrw, "pextrw" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, PCKDBL(0xC5)) // Extract 16-bit value into a r32 with zero extended to 32-bits
INST3( pinsrw, "pinsrw" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, PCKDBL(0xC4)) // packed insert word
INST3( pinsrw, "pinsrw" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, PCKDBL(0xC4)) // Move the low word of r32 or from m16 into xmm at the word position specified by imm8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The comment at the end-of-line generally looks to be the Arch Manual "long name", rather than actual descriptive text. e.g. PINSRW—Insert Word, so the comment should be // Insert Word or PSHUFLW—Shuffle Packed Low Words, so the comment should be // Shuffle Packed Low Words

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there are both formats used - in majority of cases this is descriptive text like above on line 375

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference is on the "long name", since attempting to describe the instruction behavior can quickly become too complicated, especially when multiple encodings are supported.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tannergooding on this - I suspect that many of the comments that don't conform to this are older. In any case, going forward I think that the Arch Manual "long name" is the way to go.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK will fix and keep all comments in feature in "long name" format.

@@ -234,6 +234,54 @@ int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic)
return hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs;
}

//------------------------------------------------------------------------
// numArgsOfHWIntrinsic: get the number of arguments based on HW GenTree node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description here is incorrect. It is only based on the node if the table entry is -1

Copy link
Author

@4creators 4creators Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, it retrieves numArgs when table entry is either >= 0 or -1. So essentially it gets the exact number of args for the given intrinsic when it is needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to improve description to emphasize this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, it retrieves numArgs when table entry is either >= 0 or -1.

The code is doing exactly this:

int numArgs = hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs;
if (numArgs != -1)
{
    return numArgs;
}

So it is first checking the table and returns that value if the entry wasn't -1. Otherwise, it checks the node operands and returns a value based on those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if someone had a table entry of -2, this function would return back -2.

@@ -5550,11 +5550,24 @@ static bool isSSEExtract(instruction ins)
}
}

static bool isSseShuffle(instruction ins)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change the name of this function, since it no longer encapsulates all SseShuffle, but instead only the shuffle instructions with a special encoding.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can be called something like noVexNdsThreeOpInstructions

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that noVexNdsThreeOpInstructions is very clear. How about something like isDstSrcImmSseInstruction to correspond with IsDstDstSrcAVXInstruction?

Copy link

@fiigii fiigii Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarolEidt IsDstDstSrcAVXInstruction implies the instruction has VEX.NDS encoding that may duplicate the first-arg in the VEX.vvvv field.
The problem that we are trying to solve here is some instructions do not have VEX.NDS encoding, which means they have the same register-form in SSE and VEX encoding. For example,

EXTRACTPS reg/m32, xmm1, imm8
VEXTRACTPS reg/m32, xmm1, imm8

So, I think the names corresponding to IsDstDstSrcAVXInstruction may be more confusing.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about noVexNdsRegRegImmInstructions?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considered again, isDstSrcImmSseInstruction is also good to me if we have some comments here 😄

Copy link
Author

@4creators 4creators Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarolEidt @tannergooding @fiigii

I have fixed this already by merging both isSseExtract and isSseShuffle functions. After consulting manual I have realized that DstSrcImm format is common to all ISAs including SSE, AVX and AVX512 and perhaps better name would be isDstSrcImmSIMDInstruction

Current implementation is as follows:

//------------------------------------------------------------------------
// isDstSrcImmSIMDInstruction: check if instruction has RM R I format
// for all encodings: EVEX, VEX and legacy SSE encoding
//
// Arguments:
//    instruction -- processor instruction to check
//
// Return Value:
//    true if instruction has RRI format
//
static bool isDstSrcImmSIMDInstruction(instruction ins)
{
    switch (ins)
    {
        case INS_extractps:
        case INS_pextrb:
        case INS_pextrw:
        case INS_pextrd:
        case INS_pextrq:
        case INS_pshufd:
        case INS_pshufhw:
        case INS_pshuflw:
            return true;
        default:
            return false;
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@4creators, it probably just needs to be IsDstSrcImmAvxInstruction, as that clarifies that the VEX encoding is the same as the non-vex encoding (where it would otherwise get assumed to be DstDstSrcImm or DstSrcSrcImm)

return numArgs;
}

assert(numArgs == -1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assert is pointless.

A better assert would be to assert numArgs >= 0 before returning numArgs above, or to change the check to if (numArgs >= 0) { return numArgs; } and then to keep this assert (numArgs == -1).

{
assert(op1->AsArgList()->Rest()->Rest()->Current() != nullptr);
assert(op1->AsArgList()->Rest()->Rest()->Rest() == nullptr);
return 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still like to actually enumerate the argList, as it is both future proof (and will be required to support future ISAs) and is more correct in the case of the assertions being wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also won't matter, perf wise, since the majority of cases will be table driven.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a single two-arg overload of numArgsOfHWIntrinsic is the way to go, and I agree with @tannergooding that it should actually enumerate all the args.

@@ -3108,6 +3108,7 @@ class Compiler
static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic);
unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig);
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic);
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think I suggested a single overload in my earlier comments (prior to @fiigii's excellent suggestion to avoid using the base type altogether). You could have node default to nullptr and then add a noway_assert that it is non-null if numArgs == -1, followed conservatively by a check that it is non-null before doing the rest of the method. If by mistake a caller fails to pass a non-null node argument when they should, the behavior will be the same "as if" the 1-arg overload had been called. And with the null argument, the C++ compiler should be expected to do the inlining and the elimination of the non-null path.
In any event, we want to be very careful about pre-optimizing code at the expense of clarity and conciseness.

@@ -5550,11 +5550,24 @@ static bool isSSEExtract(instruction ins)
}
}

static bool isSseShuffle(instruction ins)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that noVexNdsThreeOpInstructions is very clear. How about something like isDstSrcImmSseInstruction to correspond with IsDstDstSrcAVXInstruction?

@4creators 4creators force-pushed the sse2part2 branch 2 times, most recently from 9307a46 to 8c78f07 Compare March 2, 2018 21:21
@4creators
Copy link
Author

Skipping OSX tests as they are expected to fail

@@ -3107,7 +3107,7 @@ class Compiler
bool isScalarISA(InstructionSet isa);
static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic);
unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig);
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic);
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You haven't set the default parameter value; however, it seems that all callers are passing in the node.
Either the function header comment needs to be fixed, or the default value should be set here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my. It is my local merge error - it was set

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any scenarios where we will be looking this up without having a corresponding node?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are not, I would vote not having any default and actually asserting that node != null, since we generally want to be ensuring that the lookup returns the correct value.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think this function should have nullptr as the default value of node. Every call site does not know the intrinsic is variable args or not. One-arg calling will lead misusing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looking at the call sites, we don't use this method in a place where prior knowledge of the intrinsic would be available. So, I agree - no default is the way to go. Sorry for the distraction.

int numArgs = hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs;
// TODO_XARCH-BUG: several intrinsics having -1 numArgs do not handle correctly
// real number of arguments - limiting slow path only to Sse2.Shuffle
// https://github.com/dotnet/coreclr/issues/16714

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not comfortable with checking in this workaround. I believe that @fiigii suggested the right fix. I think these asserts:

            assert(numArgs != 0);
            assert(numArgs != 1);

Should be replaced with assert(numArgs >= 2); below the if (info->srcCount >= 2) check.

@4creators
Copy link
Author

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

static bool isSSEExtract(instruction ins)
//------------------------------------------------------------------------
// IsDstSrcImmAvxInstruction: check if instruction has RM R I format
// for all encodings: EVEX, VEX and legacy SSE
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instruction has RM R I format

This comment is not correct, certain instructions can be "ins R RM I".
BTW, I still think the comment should emphasize "NO VEX.NDS".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fiigii Can I correct it in next PR. I need Ssse2.Shuffle to implement some remaining SSE2 helpers and I am blocked to some extent on this PR.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, this PR is still waiting for MS maintainers' final review.

I need Ssse2.Shuffle to implement some remaining SSE2 helpers and I am blocked to some extent on this PR.

Actually, you can copy the one-line definition of Shuffle into your current workspace, and rebase after this PR gets merged.

// instruction -- processor instruction to check
//
// Return Value:
// true if instruction has RRI format
Copy link

@fiigii fiigii Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return vale comment should be modified as well (VEX.NDS can be RRI).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. Question as above: could we postpone it to my next PR?

return numArgs;
}

noway_assert(node != nullptr);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add assert(node->gtHWIntrinsicId == intrinsic) after this line.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have it verified 👍

@@ -219,19 +219,64 @@ unsigned Compiler::simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_I
}

//------------------------------------------------------------------------
// numArgsOfHWIntrinsic: get the number of arguments
// numArgsOfHWIntrinsic: get the number of arguments based on table and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe reword to:

gets the number of arguments for the hardware intrinsic.
This attempts to do a table based lookup but will fallback to the number of operands in 'node' if the table entry is -1.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds better that current one - will change. Next PR would be OK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are going to push any more changes, I would fix it this PR. Otherwise I think next PR is fine (provided @CarolEidt agrees).

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this LGTM. I would like to see the last of the comments corrected and the assert @fiigii suggested added, but we can do that in a follow up PR.

Copy link

@fiigii fiigii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, I would like to see the changes of the default value, assertion, and comments (in this or next PR, waiting for @CarolEidt 's approval).

@@ -3107,7 +3107,7 @@ class Compiler
bool isScalarISA(InstructionSet isa);
static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic);
unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig);
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic);
static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node = nullptr);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove = nullptr.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@4creators
Copy link
Author

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test please

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I'm ok with merging this and cleaning up the comments and removing the default arg in a subsequent PR

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants