Implement Shuffle* Sse2 Intel hardware intrinsics #15777

4creators · 2018-01-07T16:32:25Z

Fixes #16714

tannergooding · 2018-01-07T17:46:44Z

src/coreclr/hosts/corerun/corerun.cpp

@@ -33,7 +33,7 @@ static const wchar_t *coreCLRDll = W("CoreCLR.dll");
 static const wchar_t *coreCLRInstallDirectory = W("%windir%\\system32\\");

 // Encapsulates the environment that CoreCLR will run in, including the TPALIST
-class HostEnvironment 


This file doesn't have any modifications outside of whitespace changes.

It is dirty commit - just needed to run CI - everything will get cleaned later

You should be able to comment @dotnet-bot test <test-name> please to rerun any given job, use this to rerun all jobs

tannergooding · 2018-01-07T17:48:13Z

src/jit/emitxarch.cpp

-            ins == INS_punpcklbw || ins == INS_punpckhqdq || ins == INS_punpcklqdq || ins == INS_punpckhwd ||
-            ins == INS_punpcklwd || ins == INS_punpckhdq || ins == INS_packssdw || ins == INS_packsswb ||
-            ins == INS_packuswb || ins == INS_packusdw || ins == INS_vperm2i128);
+    switch (ins)


Would still probably be useful to pull this change out to its own PR.

tannergooding · 2018-01-07T17:48:58Z

src/jit/emitxarch.cpp

@@ -1085,6 +1181,7 @@ inline bool hasCodeMI(instruction ins)
 inline size_t insCodeMI(instruction ins)
 {
    assert((unsigned)ins < _countof(insCodesMI));
+    size_t code = insCodesMI[ins];


why this change?

debug help for conditional breakpoint - to be removed

tannergooding · 2018-01-07T17:50:35Z

src/jit/emitxarch.cpp

@@ -3765,13 +3862,15 @@ void emitter::emitIns_R_R(instruction ins, emitAttr attr, regNumber reg1, regNum
 void emitter::emitIns_R_R_I(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int ival)
 {
    // SSE2 version requires 5 bytes and SSE4/AVX version 6 bytes
-    UNATIVE_OFFSET sz = 4;
+    UNATIVE_OFFSET sz          = 4;
+    bool           encodeAsSse = false;


Why is this needed? This logic should all be controlled elsewhere via UseVEXEncoding

The other instructions (such as INS_psrldq) should be going through emitIns_R_I

(for the non-vex case)

Original logic did not work correctly for SSE2 codegen. In case of istruction with the format ins xmm, imm they had always being set with id->idInsFmt(IF_RRW_RRW_CNS); what resulted in code like psllw xmm0, xmm0, 0 what is an invalid SSE2 instruction what resulted in CPU exception.

Above logic ensures that RRW_CNS instructions are flagged with valid format IF_RRW_CNS what resolves this issue. Table based codegen should ensure this same logic.

In case of istruction with the format ins xmm, imm they had always being set with id->idInsFmt(IF_RRW_RRW_CNS);

The instructions should be going through an emitIns_R_I method for the SSE2 case. They should only go through emitIns_R_R_I on AVX+

tannergooding · 2018-01-07T17:54:08Z

src/jit/emitxarch.cpp

+
+void emitter::emitIns_SIMD_R_R_I(instruction ins, regNumber reg1, regNumber reg2, ssize_t iVal, var_types simdtype)
+{
+    emitIns_R_R_I(ins, emitTypeSize(simdtype), reg1, reg2, (int)iVal);


I believe R_R_I should only happen under UseVEXEncoding

Depends on instruction - all shuffle cases, cmppd, cmpps, pinsrw are IF_RRW_RRW_CNS under SSE2 so they should go through R_R_I

Yes, but those instructions should be going through SIMD_R_R_R_I.

In almost all cases so far, the codegen has been calling the emitIns_SIMD_* case that matches the VEX encoding. We then have logic for the fallback case so that args can be arranged properly.

Ex: cmppd is R_R_R_I on AVX+ machines and R_R_I elsewhere. So we call emitIns_SIMD_R_R_R_I.

emitIns_SIMD_R_R_R_I then has the logic for, on vex, generate the instruction directly (emtIns_R_R_R_I), on non-vex generate a movaps (emitIns_R_R) followed by the instruction (emitIns_R_R_I).

That being said, it does look like the pshufd instructions are bit special in that they (like crc32, lzcnt, popcnt, and a few others) have a consistent encoding between VEX and non-VEX machines, in which case we should just be calling emitIns_* directly, without an intermediary emitIns_SIMD_* step

tannergooding · 2018-01-07T17:55:06Z

src/jit/emitxarch.cpp

+void emitter::emitIns_SIMD_R_R_R_I(
+    instruction ins, regNumber reg, regNumber reg1, regNumber reg2, ssize_t ival, var_types simdtype)
+{
+    if (UseVEXEncoding() && reg1 != reg)


We've dropped the reg1 != reg from this line elsewhere, it isn't useful

tannergooding · 2018-01-07T17:55:27Z

src/jit/emitxarch.cpp

+    {
+        if (reg1 != reg)
+        {
+            emitIns_R_R_I(INS_movaps, emitTypeSize(simdtype), reg, reg1, (int)ival);


Don't think this is correct. movaps is a R_R instruction

You are right, will fix by dropping whole if statement. Thanks

I don't think dropping the if statement is correct.

We still have, a target register, two operands, and an immediate, which isn't encodable on non-VEX machines if reg1 != reg

tannergooding · 2018-01-07T17:57:23Z

src/jit/gentree.cpp

@@ -304,7 +304,8 @@ void GenTree::InitNodeSize()
    GenTree::s_gtNodeSizes[GT_INDEX_ADDR]        = TREE_NODE_SZ_LARGE;
    GenTree::s_gtNodeSizes[GT_ARR_BOUNDS_CHECK] = TREE_NODE_SZ_LARGE;
 #ifdef FEATURE_SIMD
-    GenTree::s_gtNodeSizes[GT_SIMD_CHK] = TREE_NODE_SZ_LARGE;
+    GenTree::s_gtNodeSizes[GT_SIMD_CHK]         = TREE_NODE_SZ_LARGE;
+    GenTree::s_gtNodeSizes[GT_SIMD]             = TREE_NODE_SZ_LARGE;


Why is this change needed? We shouldn't be touching the GT_SIMD code.

I have not touched it - clang -format and clang-tidy did it by creating format patch

Hmmm.... I wouldn't expect clang format or clang-tidy to add new code, only move around existing code.

Yeah, it's my copy paste error around place flagged by clang-format

tannergooding · 2018-01-07T17:58:39Z

src/jit/gentree.cpp

+*  Create a list out of the five values.
+*/
+
+GenTreeArgList* Compiler::gtNewArgList(


Do we need this? Parameter counts above 4 looked to be very uncommon, so a helper function that calls the smaller ones and works with arbitrary sizes might be more useful

One of my experiments to be removed - when you run a loop on 8 or 16 arguments and use in the first step 4 args function in every next step you need 5 args function to pass 4 args and append them to 1 list - but actually function quite different from this one.

tannergooding · 2018-01-07T18:03:44Z

src/jit/gentree.cpp

+GenTreeHWIntrinsic* Compiler::gtNewSimdHWIntrinsicNode(var_types      type,
+                                                       GenTree*       op1,
+                                                       GenTree*       op2,
+                                                       ssize_t        iVal,


I don't think we want/need to carry the ival as part of the node. It will end up being table driven, but before then, you can pull it out of the op when generating code or add a small switch for the hardcoded cases.

tannergooding · 2018-01-07T18:04:26Z

src/jit/gtlist.h

@@ -158,11 +158,11 @@ GTNODE(GE               , GenTreeOp          ,0,GTK_BINOP|GTK_RELOP)
 GTNODE(GT               , GenTreeOp          ,0,GTK_BINOP|GTK_RELOP)
 #ifndef LEGACY_BACKEND
 // These are similar to GT_EQ/GT_NE but they generate "test" instead of "cmp" instructions.
-// Currently these are generated during lowering for code like ((x & y) eq|ne 0) only on 
+// Currently these are generated during lowering for code like ((x & y) eq|ne 0) only on


Another file with whitespace only changes.

tannergooding · 2018-01-07T18:06:06Z

src/jit/hwintrinsiccodegenxarch.cpp

+                    break;
+            }
+
+            if (ins != INS_cmppd)


you could drop this check if you return instead of break in the TYP_DOUBLE case.

tannergooding · 2018-01-07T18:08:28Z

src/jit/hwintrinsiccodegenxarch.cpp

+            op2Reg = op2->gtRegNum;
+
+            instruction ins;
+            switch (baseType)


You can drop the switch statement for the ones that only take a single type.

In the long run (for generic methods) we will want to be checking the types in the importer and returning a PNSE exception for the invalid cases.

tannergooding · 2018-01-07T18:17:28Z

src/jit/simd.cpp

    GenTreePtr tree = se.val;

+    return impSIMD(type, tree, ti, expectAddr);


refactoring?

No, one of experiments.

To be removed onviously 😄

4creators · 2018-01-07T18:31:40Z

@tannergooding
Is it possible to make a commit to PR, skip CI and run only couple of jobs?

tannergooding · 2018-01-07T18:34:00Z

I don't believe so. Pushing a new commit will run all jobs set to automatically run

4creators · 2018-01-07T19:25:14Z

@tannergooding Is it possible to pass error message from Jit to exception created in importer? Currently available standard messages are not very helpful.

4creators · 2018-02-21T16:27:39Z

I am waiting for #16183 before submitting commits to finish work on this PR.

4creators · 2018-02-28T23:04:51Z

@fiigii @tannergooding @CarolEidt Added Extract/Insert/Shuffle* SSE2 HW intrinsics. To speed up work I have decided to update tests to templated format after all of SSE2 intrinsics are implemented.

fiigii · 2018-02-28T23:11:31Z

src/jit/hwintrinsiclistxarch.h

-HARDWARE_INTRINSIC(SSE2_LoadScalarVector128,                         "LoadScalarVector128",                              SSE2,       -1,           16,          1,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_movd,      INS_movd,      INS_movq,      INS_movq,      INS_invalid,   INS_movsdsse2},         HW_Category_MemoryLoad,                        HW_Flag_NoRMWSemantics)
-HARDWARE_INTRINSIC(SSE2_LoadVector128,                               "LoadVector128",                                    SSE2,       -1,           16,          1,            {INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_invalid,   INS_movupd},            HW_Category_MemoryLoad,                        HW_Flag_NoRMWSemantics)
+HARDWARE_INTRINSIC(SSE2_Extract,                                     "Extract",                                          SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_pextrw,    INS_pextrw,    INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid},           HW_Category_IMM,                               HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM)
+HARDWARE_INTRINSIC(SSE2_Insert,                                      "Insert",                                           SSE2,       -1,           16,          3,            {INS_invalid,   INS_invalid,   INS_pinsrw,    INS_pinsrw,    INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid},           HW_Category_IMM,                               HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM)


You give these intrinsics HW_Flag_NoJmpTableIMM, but did not implement the fallback.

SSE2_Extract and SSE2_Insert actually needs HW_Flag_NoJmpTableIMM, but you need to implement the "alternatives" in the importer.

Yes. You are right - will do it.

fiigii · 2018-02-28T23:12:15Z

src/jit/hwintrinsiclistxarch.h

-HARDWARE_INTRINSIC(SSE2_LoadFence,                                   "LoadFence",                                        SSE2,       -1,            0,          0,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid},           HW_Category_Special,                           HW_Flag_NoRMWSemantics)
-HARDWARE_INTRINSIC(SSE2_LoadScalarVector128,                         "LoadScalarVector128",                              SSE2,       -1,           16,          1,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_movd,      INS_movd,      INS_movq,      INS_movq,      INS_invalid,   INS_movsdsse2},         HW_Category_MemoryLoad,                        HW_Flag_NoRMWSemantics)
-HARDWARE_INTRINSIC(SSE2_LoadVector128,                               "LoadVector128",                                    SSE2,       -1,           16,          1,            {INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_movdqu,    INS_invalid,   INS_movupd},            HW_Category_MemoryLoad,                        HW_Flag_NoRMWSemantics)
+HARDWARE_INTRINSIC(SSE2_Extract,                                     "Extract",                                          SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_pextrw,    INS_pextrw,    INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid},           HW_Category_IMM,                               HW_Flag_FullRangeIMM|HW_Flag_NoJmpTableIMM)


Extract needs HW_Flag_BaseTypeFromFirstArg.

And I believe SSE2_Extract and SSE2_Insert require specially treatment in CodeGen and emitter. Please see #16646

Working on it.

Extract needs HW_Flag_BaseTypeFromFirstArg

Fortunately it is not the case as current code handles this intrinsics correctly based on their signatures - intrinsics returning scalar values are special cased and correctly determine baseType

Oh, I see, my PR is not merged yet. Thanks.

fiigii · 2018-02-28T23:16:30Z

src/jit/hwintrinsiclistxarch.h

-HARDWARE_INTRINSIC(SSE2_StoreScalar,                                 "StoreScalar",                                      SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_movsdsse2},         HW_Category_MemoryStore,                       HW_Flag_NoRMWSemantics)
-HARDWARE_INTRINSIC(SSE2_StoreLow,                                    "StoreLow",                                         SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_movq,      INS_movq,      INS_invalid,   INS_movlpd},            HW_Category_MemoryStore,                       HW_Flag_NoRMWSemantics)
-HARDWARE_INTRINSIC(SSE2_StoreHigh,                                   "StoreHigh",                                        SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_movhpd},            HW_Category_MemoryStore,                       HW_Flag_NoRMWSemantics)
+HARDWARE_INTRINSIC(SSE2_Shuffle,                                     "Shuffle",                                          SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_pshufd,    INS_pshufd,    INS_invalid,   INS_invalid,   INS_invalid,   INS_shufpd},            HW_Category_IMM,                               HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM)


SSE2_Shuffle has 2-arg and 3-arg overload both, so that should not have "2" as the NumArg.

SSE2_Shuffle cannot have HW_Flag_NoJmpTableIMM.

SSE2_Shuffle has 2-arg and 3-arg overload both, so that should not have "2" as the NumArg

Any number of args will be wrong and it is special handled. As we discussed below I plan to implement it as a flag HW_Flag_VariableNumberOfArgs and separate function for retrieving numArgs which is dependent on baseType.

dependent on baseType.

Why?

SSE2_Shuffle cannot have HW_Flag_NoJmpTableIMM

Will do

Why?

If baseType of Shuffle is TYP_DOUBLE we have 3 args but for baseType TYP_INT or TYP_UINT we will have 2 args.

This should be a general change for all the "VariableNumberOfArgs" intrinsics, not just for Shuffle

fiigii · 2018-02-28T23:17:07Z

src/jit/lowerxarch.cpp

@@ -2377,6 +2377,14 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)
        return;
    }

+    // TODO-XArch-CQ: improve table driven number of arguments encoding
+    // Due to different number of arguments for different base type for the same
+    // hardware intrinsic function we need to special case NI_SSE2_Shuffle


We can update function Compiler::numArgOfHWIntrinsic.

Yes. It will be necessary to support other intrinsics with variable number of arguments like SetVector128 / SetVector256. Perhaps we could use flag like HW_Flag_VariableNumberOfArgs to short circuit numArg selection

variable number of arguments like SetVector128 / SetVector256

Not really, these two guys are "NoCodgen" herplers, we can use sig->numArg in the importer.

fiigii · 2018-02-28T23:20:52Z

src/jit/hwintrinsiclistxarch.h

-HARDWARE_INTRINSIC(SSE2_StoreHigh,                                   "StoreHigh",                                        SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_movhpd},            HW_Category_MemoryStore,                       HW_Flag_NoRMWSemantics)
+HARDWARE_INTRINSIC(SSE2_Shuffle,                                     "Shuffle",                                          SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_pshufd,    INS_pshufd,    INS_invalid,   INS_invalid,   INS_invalid,   INS_shufpd},            HW_Category_IMM,                               HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM)
+HARDWARE_INTRINSIC(SSE2_ShuffleHigh,                                 "ShuffleHigh",                                      SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_pshufhw,   INS_pshufhw,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid},           HW_Category_IMM,                               HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM)
+HARDWARE_INTRINSIC(SSE2_ShuffleLow,                                  "ShuffleLow",                                       SSE2,       -1,           16,          2,            {INS_invalid,   INS_invalid,   INS_pshuflw,   INS_pshuflw,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid,   INS_invalid},           HW_Category_IMM,                               HW_Flag_NoJmpTableIMM|HW_Flag_FullRangeIMM)


Please remove HW_Flag_NoJmpTableIMM for SSE2_ShuffleLow and SSE2_ShuffleHigh as well.

fiigii · 2018-03-01T00:05:56Z

@4creators I suggest that implement Shuffle* in a PR at first, which already have all the infrastructure (except numArgOfHWIntrinsic update) and test template.

tannergooding · 2018-03-02T15:39:09Z

src/jit/hwintrinsicxarch.cpp

+    GenTree* op2 = node->gtGetOp2();
+    if (op1 != nullptr)
+    {
+        if (op2 != nullptr)


nit: You can move this check above the if(op1 != nullptr) check and it will be slightly more efficient:

if (op2 != nullptr) { return 2; } if (op1 != nullptr) { if(op1->OperIsList()) { } else { return 1; } }

tannergooding · 2018-03-02T15:40:01Z

src/jit/hwintrinsicxarch.cpp

+        }
+        else if (op1->OperIsList())
+        {
+            assert(op1->AsArgList()->Rest()->Rest()->Current() != nullptr);


nit: I would rather this actually iterate through the nodes to get the actual count, rather than assume 3.

That was one of the discussed points - IMO iteration in particular for large number of args like in SetVector128 or SetVector256 cases will be slow while using table driven field retrieval will be much faster. I am open to any solution on which we can reach consensus.

That shouldn't be a concern... Those should be filtered out by the table-driven check (since the entry for those intrinsics won't be -1)

tannergooding · 2018-03-02T15:40:34Z

src/jit/instrsxarch.h

@@ -196,6 +196,7 @@ INST3( cvttsd2si,   "cvttsd2si"   , 0, IUM_WR, 0, 0, BAD_CODE,     BAD_CODE, SSE

 #ifndef LEGACY_BACKEND
 INST3( movntdq,     "movntdq"     , 0, IUM_WR, 0, 0, PCKDBL(0xE7), BAD_CODE, BAD_CODE)
+INST3( movnti,      "movnti"      , 0, IUM_WR, 0, 0, PCKFLT(0xC3), BAD_CODE, BAD_CODE)  // Move doubleword/quadword from r32/r64 to m32/m64 using nontemporal hint


This doesn't appear to be used anywhere

It slipped in from my work on StoreNonTemporal implementation, I can remove and reintroduce in the next PR. What is your preference?

Introducing in the PR where it is consumed is my preference.

tannergooding · 2018-03-02T15:43:14Z

src/jit/instrsxarch.h

 INST3( pextrw,      "pextrw"      , 0, IUM_WR, 0, 0, BAD_CODE,     BAD_CODE,      PCKDBL(0xC5))   // Extract 16-bit value into a r32 with zero extended to 32-bits
-INST3( pinsrw,      "pinsrw"      , 0, IUM_WR, 0, 0, BAD_CODE,     BAD_CODE,      PCKDBL(0xC4))   // packed insert word
+INST3( pinsrw,      "pinsrw"      , 0, IUM_WR, 0, 0, BAD_CODE,     BAD_CODE,      PCKDBL(0xC4))   // Move the low word of r32 or from m16 into xmm at the word position specified by imm8


nit: The comment at the end-of-line generally looks to be the Arch Manual "long name", rather than actual descriptive text. e.g. PINSRW—Insert Word, so the comment should be // Insert Word or PSHUFLW—Shuffle Packed Low Words, so the comment should be // Shuffle Packed Low Words

Actually there are both formats used - in majority of cases this is descriptive text like above on line 375

Personal preference is on the "long name", since attempting to describe the instruction behavior can quickly become too complicated, especially when multiple encodings are supported.

I agree with @tannergooding on this - I suspect that many of the comments that don't conform to this are older. In any case, going forward I think that the Arch Manual "long name" is the way to go.

OK will fix and keep all comments in feature in "long name" format.

tannergooding · 2018-03-02T15:48:46Z

src/jit/hwintrinsicxarch.cpp

@@ -234,6 +234,54 @@ int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic)
    return hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs;
 }

+//------------------------------------------------------------------------
+// numArgsOfHWIntrinsic: get the number of arguments based on HW GenTree node


The description here is incorrect. It is only based on the node if the table entry is -1

Not really, it retrieves numArgs when table entry is either >= 0 or -1. So essentially it gets the exact number of args for the given intrinsic when it is needed.

I will try to improve description to emphasize this behavior.

Not really, it retrieves numArgs when table entry is either >= 0 or -1.

The code is doing exactly this:

int numArgs = hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs; if (numArgs != -1) { return numArgs; }

So it is first checking the table and returns that value if the entry wasn't -1. Otherwise, it checks the node operands and returns a value based on those.

So, if someone had a table entry of -2, this function would return back -2.

tannergooding · 2018-03-02T16:18:06Z

src/jit/emitxarch.cpp

@@ -5550,11 +5550,24 @@ static bool isSSEExtract(instruction ins)
    }
 }

+static bool isSseShuffle(instruction ins)


I would change the name of this function, since it no longer encapsulates all SseShuffle, but instead only the shuffle instructions with a special encoding.

This function can be called something like noVexNdsThreeOpInstructions

I don't think that noVexNdsThreeOpInstructions is very clear. How about something like isDstSrcImmSseInstruction to correspond with IsDstDstSrcAVXInstruction?

@CarolEidt IsDstDstSrcAVXInstruction implies the instruction has VEX.NDS encoding that may duplicate the first-arg in the VEX.vvvv field.
The problem that we are trying to solve here is some instructions do not have VEX.NDS encoding, which means they have the same register-form in SSE and VEX encoding. For example,

EXTRACTPS reg/m32, xmm1, imm8 VEXTRACTPS reg/m32, xmm1, imm8

So, I think the names corresponding to IsDstDstSrcAVXInstruction may be more confusing.

How about noVexNdsRegRegImmInstructions?

Considered again, isDstSrcImmSseInstruction is also good to me if we have some comments here 😄

@CarolEidt @tannergooding @fiigii

I have fixed this already by merging both isSseExtract and isSseShuffle functions. After consulting manual I have realized that DstSrcImm format is common to all ISAs including SSE, AVX and AVX512 and perhaps better name would be isDstSrcImmSIMDInstruction

Current implementation is as follows:

//------------------------------------------------------------------------ // isDstSrcImmSIMDInstruction: check if instruction has RM R I format // for all encodings: EVEX, VEX and legacy SSE encoding // // Arguments: // instruction -- processor instruction to check // // Return Value: // true if instruction has RRI format // static bool isDstSrcImmSIMDInstruction(instruction ins) { switch (ins) { case INS_extractps: case INS_pextrb: case INS_pextrw: case INS_pextrd: case INS_pextrq: case INS_pshufd: case INS_pshufhw: case INS_pshuflw: return true; default: return false; } }

@4creators, it probably just needs to be IsDstSrcImmAvxInstruction, as that clarifies that the VEX encoding is the same as the non-vex encoding (where it would otherwise get assumed to be DstDstSrcImm or DstSrcSrcImm)

tannergooding · 2018-03-02T16:19:58Z

src/jit/hwintrinsicxarch.cpp

+        return numArgs;
+    }
+
+    assert(numArgs == -1);


this assert is pointless.

A better assert would be to assert numArgs >= 0 before returning numArgs above, or to change the check to if (numArgs >= 0) { return numArgs; } and then to keep this assert (numArgs == -1).

tannergooding · 2018-03-02T16:22:02Z

src/jit/hwintrinsicxarch.cpp

+        {
+            assert(op1->AsArgList()->Rest()->Rest()->Current() != nullptr);
+            assert(op1->AsArgList()->Rest()->Rest()->Rest() == nullptr);
+            return 3;


I would still like to actually enumerate the argList, as it is both future proof (and will be required to support future ISAs) and is more correct in the case of the assertions being wrong.

It also won't matter, perf wise, since the majority of cases will be table driven.

CarolEidt

I think a single two-arg overload of numArgsOfHWIntrinsic is the way to go, and I agree with @tannergooding that it should actually enumerate all the args.

CarolEidt · 2018-03-02T17:06:21Z

src/jit/compiler.h

@@ -3108,6 +3108,7 @@ class Compiler
    static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic);
    unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig);
    static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic);
+    static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node);


Yes, I think I suggested a single overload in my earlier comments (prior to @fiigii's excellent suggestion to avoid using the base type altogether). You could have node default to nullptr and then add a noway_assert that it is non-null if numArgs == -1, followed conservatively by a check that it is non-null before doing the rest of the method. If by mistake a caller fails to pass a non-null node argument when they should, the behavior will be the same "as if" the 1-arg overload had been called. And with the null argument, the C++ compiler should be expected to do the inlining and the elimination of the non-null path.
In any event, we want to be very careful about pre-optimizing code at the expense of clarity and conciseness.

CarolEidt · 2018-03-02T17:12:52Z

src/jit/emitxarch.cpp

@@ -5550,11 +5550,24 @@ static bool isSSEExtract(instruction ins)
    }
 }

+static bool isSseShuffle(instruction ins)


I don't think that noVexNdsThreeOpInstructions is very clear. How about something like isDstSrcImmSseInstruction to correspond with IsDstDstSrcAVXInstruction?

4creators · 2018-03-02T22:02:09Z

Skipping OSX tests as they are expected to fail

CarolEidt · 2018-03-02T22:24:08Z

src/jit/compiler.h

@@ -3107,7 +3107,7 @@ class Compiler
    bool isScalarISA(InstructionSet isa);
    static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic);
    unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig);
-    static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic);
+    static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node);


You haven't set the default parameter value; however, it seems that all callers are passing in the node.
Either the function header comment needs to be fixed, or the default value should be set here.

Oh my. It is my local merge error - it was set

Are there any scenarios where we will be looking this up without having a corresponding node?

If there are not, I would vote not having any default and actually asserting that node != null, since we generally want to be ensuring that the lookup returns the correct value.

I don’t think this function should have nullptr as the default value of node. Every call site does not know the intrinsic is variable args or not. One-arg calling will lead misusing.

Yes, looking at the call sites, we don't use this method in a place where prior knowledge of the intrinsic would be available. So, I agree - no default is the way to go. Sorry for the distraction.

CarolEidt · 2018-03-02T22:29:53Z

src/jit/hwintrinsicxarch.cpp

+    int numArgs = hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].numArgs;
+    // TODO_XARCH-BUG: several intrinsics having -1 numArgs do not handle correctly
+    // real number of arguments - limiting slow path only to Sse2.Shuffle
+    // https://github.com/dotnet/coreclr/issues/16714


I am not comfortable with checking in this workaround. I believe that @fiigii suggested the right fix. I think these asserts:

assert(numArgs != 0); assert(numArgs != 1);

Should be replaced with assert(numArgs >= 2); below the if (info->srcCount >= 2) check.

4creators · 2018-03-02T22:59:09Z

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

fiigii · 2018-03-02T23:03:57Z

src/jit/emitxarch.cpp

-static bool isSSEExtract(instruction ins)
+//------------------------------------------------------------------------
+// IsDstSrcImmAvxInstruction: check if instruction has RM R I format
+// for all encodings: EVEX, VEX and legacy SSE


instruction has RM R I format

This comment is not correct, certain instructions can be "ins R RM I".
BTW, I still think the comment should emphasize "NO VEX.NDS".

@fiigii Can I correct it in next PR. I need Ssse2.Shuffle to implement some remaining SSE2 helpers and I am blocked to some extent on this PR.

Not sure, this PR is still waiting for MS maintainers' final review.

I need Ssse2.Shuffle to implement some remaining SSE2 helpers and I am blocked to some extent on this PR.

Actually, you can copy the one-line definition of Shuffle into your current workspace, and rebase after this PR gets merged.

fiigii · 2018-03-02T23:05:14Z

src/jit/emitxarch.cpp

+//    instruction -- processor instruction to check
+//
+// Return Value:
+//    true if instruction has RRI format


The return vale comment should be modified as well (VEX.NDS can be RRI).

Will fix. Question as above: could we postpone it to my next PR?

fiigii · 2018-03-02T23:08:49Z

src/jit/hwintrinsicxarch.cpp

+        return numArgs;
+    }
+
+    noway_assert(node != nullptr);


Add assert(node->gtHWIntrinsicId == intrinsic) after this line.

Good to have it verified 👍

tannergooding · 2018-03-02T23:12:55Z

src/jit/hwintrinsicxarch.cpp

@@ -219,19 +219,64 @@ unsigned Compiler::simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_I
 }

 //------------------------------------------------------------------------
-// numArgsOfHWIntrinsic: get the number of arguments
+// numArgsOfHWIntrinsic: get the number of arguments based on table and


Maybe reword to:

gets the number of arguments for the hardware intrinsic. This attempts to do a table based lookup but will fallback to the number of operands in 'node' if the table entry is -1.

Sounds better that current one - will change. Next PR would be OK?

If you are going to push any more changes, I would fix it this PR. Otherwise I think next PR is fine (provided @CarolEidt agrees).

tannergooding

Overall this LGTM. I would like to see the last of the comments corrected and the assert @fiigii suggested added, but we can do that in a follow up PR.

fiigii

Overall LGTM, I would like to see the changes of the default value, assertion, and comments (in this or next PR, waiting for @CarolEidt 's approval).

fiigii · 2018-03-02T23:19:45Z

src/jit/compiler.h

@@ -3107,7 +3107,7 @@ class Compiler
    bool isScalarISA(InstructionSet isa);
    static int ivalOfHWIntrinsic(NamedIntrinsic intrinsic);
    unsigned simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_INFO* sig);
-    static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic);
+    static int numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node = nullptr);


Please remove = nullptr.

4creators · 2018-03-02T23:34:50Z

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test please

CarolEidt

LGTM - I'm ok with merging this and cleaning up the comments and removing the default arg in a subsequent PR

tannergooding reviewed Jan 7, 2018

View reviewed changes

jkotas added the area-CodeGen label Jan 7, 2018

4creators force-pushed the sse2part2 branch from 6864d70 to c147511 Compare January 7, 2018 21:30

This was referenced Jan 19, 2018

Table-driven Intel hardware intrinsic #15749

Merged

Implement simple Sse2 hardware instrinsics #15585

Merged

4creators mentioned this pull request Feb 8, 2018

Implement Sse2 memory fence instructions #16262

Merged

4creators mentioned this pull request Feb 21, 2018

Update the table-driven framework to support x86 imm-intrinsics #16183

Merged

4creators force-pushed the sse2part2 branch from c147511 to 567c525 Compare February 28, 2018 22:58

4creators changed the title ~~[WIP] Implement Sse2 Intel hardware intrinsics - part 2~~ Implement Extract/Insert/Shuffle* Sse2 Intel hardware intrinsics Feb 28, 2018

fiigii reviewed Feb 28, 2018

View reviewed changes

tannergooding reviewed Mar 2, 2018

View reviewed changes

4creators force-pushed the sse2part2 branch from 17d51fa to d4bb1d3 Compare March 2, 2018 16:14

tannergooding reviewed Mar 2, 2018

View reviewed changes

CarolEidt suggested changes Mar 2, 2018

View reviewed changes

4creators force-pushed the sse2part2 branch 2 times, most recently from 9307a46 to 8c78f07 Compare March 2, 2018 21:21

CarolEidt suggested changes Mar 2, 2018

View reviewed changes

4creators added 2 commits March 2, 2018 23:48

Implement Shuffle* SSE2 hardware intrinsics

978a549

Add Shuffle* Sse2 hardware intrinsics tests

0dfb892

4creators force-pushed the sse2part2 branch from 8c78f07 to 0dfb892 Compare March 2, 2018 22:50

fiigii reviewed Mar 2, 2018

View reviewed changes

tannergooding reviewed Mar 2, 2018

View reviewed changes

tannergooding approved these changes Mar 2, 2018

View reviewed changes

fiigii approved these changes Mar 2, 2018

View reviewed changes

CarolEidt approved these changes Mar 3, 2018

View reviewed changes

CarolEidt merged commit f5a4248 into dotnet:master Mar 3, 2018

4creators deleted the sse2part2 branch March 3, 2018 15:09

4creators mentioned this pull request Mar 3, 2018

Implement LoadHigh, LoadLow, and SetScalarVector128 SSE2 HW intrinsics #16736

Merged

4creators mentioned this pull request Jan 31, 2020

Lack of coordination and overlapping work while implementing Intel hardware intrinsics dotnet/runtime#9656

Closed

		GenTreePtr tree = se.val;

		return impSIMD(type, tree, ti, expectAddr);

Implement Shuffle* Sse2 Intel hardware intrinsics #15777

Implement Shuffle* Sse2 Intel hardware intrinsics #15777

Conversation

4creators commented Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

4creators Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

4creators Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

4creators commented Jan 7, 2018

tannergooding commented Jan 7, 2018

4creators commented Jan 7, 2018

4creators commented Feb 21, 2018

4creators commented Feb 28, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fiigii commented Mar 1, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

4creators Mar 2, 2018 • edited Loading

Choose a reason for hiding this comment

4creators commented Jan 7, 2018 •

edited

Loading

tannergooding Jan 7, 2018 •

edited

Loading

4creators Jan 7, 2018 •

edited

Loading

4creators Jan 7, 2018 •

edited

Loading

4creators Mar 2, 2018 •

edited

Loading