
Table-driven Intel hardware intrinsic #15749

Merged: 2 commits into dotnet:master from the tabledrive branch, Jan 19, 2018

Conversation

@fiigii (Author) commented Jan 5, 2018

  • Establish the table-driven framework.
  • Make the importer and codegen of "simple SIMD" intrinsics table-driven
    • tested with Avx.Reciprocal (1-arg), Avx.Multiply (2-arg), and Avx.BlendVariable (3-arg) intrinsics.

With this PR, implementing a new simple intrinsic only requires adding a declaration to the table (and the instructions).
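
For illustration, a new table declaration might look like the following sketch; the macro shape mirrors the HARDWARE_INTRINSIC entries quoted later in this conversation, and the exact field values here are hypothetical:

// Hypothetical entry: one instruction per base type, in the order
// {BYTE, UBYTE, SHORT, USHORT, INT, UINT, LONG, ULONG, FLOAT, DOUBLE}.
HARDWARE_INTRINSIC(AVX_Multiply, "Multiply", AVX, -1, 32, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mulps, INS_mulpd}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)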

@@ -3046,6 +3048,13 @@ class Compiler
GenTree* impPOPCNTIntrinsic(NamedIntrinsic intrinsic, CORINFO_METHOD_HANDLE method, CORINFO_SIG_INFO* sig);
bool compSupportsHWIntrinsic(InstructionSet isa);
bool isScalarISA(InstructionSet isa);
var_types retTypeOfHWIntrinsic(NamedIntrinsic intrinsic);
Member

nit: Many of these can and should be static.

@@ -110,6 +110,8 @@ IF_DEF(RRW_RRW_CNS, IS_R1_RW|IS_R2_RW, SCNS) // r/w reg , r/w r

IF_DEF(RWR_RRD_RRD, IS_R1_WR|IS_R2_RD|IS_R3_RD, NONE) // write reg , read reg2 , read reg3
IF_DEF(RWR_RRD_RRD_CNS, IS_R1_WR|IS_R2_RD|IS_R3_RD, SCNS) // write reg , read reg2 , read reg3, const

IF_DEF(RWR_RRD_RRD_RRD, IS_R1_WR|IS_R2_RD|IS_R3_RD|IS_R4_RD, NONE) // write reg , read reg2 , read reg3 , read reg4
Member

NOTE: There are several others of these we will end up needing. I added a few in my CORINFO_INTRINSIC_Round PR (#14736), but that has yet to be merged.

InstructionSet isa = compiler->isaOfHWIntrinsic(intrinsicID);
HWIntrinsicCategory category = compiler->categoryOfHWIntrinsic(intrinsicID);

if (category == HW_Category_SimpleSIMD)
Member

nit: We should probably have a switch(category) and genHWIntrinsicSimpleSIMD function, etc

Author

A switch may not work with mask bits right now, but we can consider defining several enums later so we can use a switch.

// Table columns:
// Intrinsic ID, Function name, ISA, return type (UNDEF if multiple types), parameter# (-1 if the number varies),
// ival, SIMD size, instructions (one per base type, in the order {TYP_BYTE, TYP_UBYTE, TYP_SHORT, TYP_USHORT,
// TYP_INT, TYP_UINT, TYP_LONG, TYP_ULONG, TYP_FLOAT, TYP_DOUBLE}), Category
Member

are TYP_BYTE and TYP_UBYTE instructions ever different (other than the case where one is valid and one is invalid)?

Member

(question also applies to the other signed vs unsigned types)

Author

Yes, please see multiply in this PR.

// - generate single instruction
// - the codegen of overloads can be determined by intrinsicID and base type of returned vector
// - not any other kind of intrinsic
HW_Category_SimpleSIMD = 0,
Member

From my own investigations, some of these generalizations don't work well for all instructions and can often make containment analysis harder.

For example, this doesn't cover things like: am I commutative (which indicates we can swap op1 and op2 if op1 is containable, but op2 isn't).

It also doesn't cover differences like MoveHighToLow, which qualifies as a simple intrinsic, but does not allow a memory encoding.

Author

Yes, actually I am considering giving the simple intrinsics a mask bit rather than 0.

Member

Maybe we should have a byte category, byte encoding, and a ushort flags field.

  • category is, as you have it.
  • encoding helps identify things like whether the instruction supports Reg, Reg or Reg, Reg/Mem.
  • flags helps identify things like whether the instruction is commutative.
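
A minimal sketch of that layout (the struct, field, and flag names here are illustrative, not actual coreclr definitions):

#include <cstdint>

// Illustrative split of the per-intrinsic metadata into three fields.
struct HWIntrinsicInfo
{
    uint8_t  category; // what kind of intrinsic: SimpleSIMD, Scalar, IMM, ...
    uint8_t  encoding; // supported operand forms: R_R_RM, R_RM_RMA, R_R_R, ...
    uint16_t flags;    // orthogonal instruction properties
};

// Example flag bits (hypothetical values).
constexpr uint16_t HW_Flag_Commutative       = 0x0001;
constexpr uint16_t HW_Flag_RequiresAlignment = 0x0002;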

Author

They are all used for the table-driven framework; just using mask bits looks sufficient for now.
Also, 8 bits may not be enough for the category field.
Why do we need the "flags"? "Commutative" can be a category bit.

Member

It helps reduce code duplication and the number of checks required when they are distinct fields

For example, when doing containment analysis, the most important information is which operands can be contained. With mask bits, we will need multiple comparison checks against the various categories (e.g. if ((category == SIMDScalar_SecondOpContainable) || (category == Scalar_SecondOpContainable))).

The containment analysis, I think, could look like this (pseudo-code):

switch (encoding)
{
    case R_R_RM:  // target must be reg, first operand must be reg, second op can be reg or mem
        if (!op2->isContainable() && (flags & commutative))
        {
            swapOperands();
        }

        // TODO: SSE/SSE2 instructions on non-vex are technically R_RM_RM when `targetReg != op1Reg` since we need an additional `movaps` or `movups` instruction

        if (op2->isContainable() && (isVexEncodingSupported() || (category != SIMDVector))) // on non-vex, the memory operand for SIMDVector must be aligned, ideally we would detect the case where it is aligned and allow containment
        {
            makeContained(op2);
        }
        break;

    // other encodings
}

Once containment analysis has been done, this ends up being important information for codegen as well.

The codegen can then look something like (also pseudo-code):

switch (encoding)
{
    case R_R_RM:
        genHWIntrinsic_R_R_RM(node);
        break;

    // other encodings
}

genHWIntrinsic_R_R_RM(node)
{
    if (op2->isContained())
    {
        // NOTE: most of the "logic" is fairly consistent between things like `R_R_RM` and `R_R_RM_I`, the only difference is whether we call `emit->emitIns_R_R_*` or `emit->emitIns_R_R_*_I`
        //       we should take advantage of this

        // logic
            emit->emitIns_R_R_C();
        // logic
            emit->emitIns_R_R_S();
        // logic
            emit->emitIns_R_R_A();
    }
    else
    {
        emit->emitIns_R_R_R();
    }
}

Member

Basically, I've found the most important piece of information to be the final encodings supported (this is the primary factor in the control flow for our logic).

We then make slight tweaks to that based on the category (am I SIMD, SIMDScalar, Scalar, etc) or flags (am I commutative, do I require alignment, etc) set.

HW_Category_SimpleSIMD = 0,

// Scalar intrinsics
HW_Category_Scalar = 0x1,
Member

What does "scalar" imply here? We have scalar intrinsics (like SqrtScalar), but I think these are meant to cover the intrinsics like popcnt, lzcnt, and crc32.

Author

Yes.

Member

An explicit comment might help, but I think we should come up with a different name to distinguish it from the other scalar intrinsics.

Author

Good point, will change the names.

if (category == HW_Category_SimpleSIMD)
{

GenTree* op1 = node->gtGetOp1();
Member

When we start adding containment support, doing the operand handling here is going to cause a lot of duplicated handling.

We probably want to have a genHWIntrinsic_<encoding> method (e.g. genHWIntrinsic_R_RM) so we can have the correct handling based on the supported encodings (it also lets us handle cases, like the non-AVX instructions, where they are actually RM_RM, due to the additional movaps instruction that may need to be emitted).

Author

I plan to deal with containment with an additional mask bit in the categories.


unsigned Compiler::simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic)
{
assert(intrinsic != NI_Illegal);
Member

Maybe it would be useful to have a #define assertIntrinsicInRange().
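
A sketch of what such a macro might look like (hypothetical; it simply bundles the asserts repeated in each lookup function):

// Hypothetical helper that folds the repeated range checks into one place.
#define assertIntrinsicInRange(intrinsic)                                      \
    do                                                                         \
    {                                                                          \
        assert((intrinsic) != NI_Illegal);                                     \
        assert((intrinsic) > NI_HW_INTRINSIC_START);                           \
        assert((intrinsic) < NI_HW_INTRINSIC_END);                             \
    } while (0)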

{
assert(intrinsic != NI_Illegal);
assert(intrinsic > NI_HW_INTRINSIC_START && intrinsic < NI_HW_INTRINSIC_END);
assert(type >= TYP_BYTE && type <= TYP_DOUBLE);
Member

this is making an assumption that the types list won't be reordered. We should explicitly comment that in both places

return hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].category;
}

GenTreeArgList* Compiler::buildArgList(int num)
Member

nit: This probably doesn't need to be a member on the compiler (it can probably just be a global static).

It also isn't really needed outside of the few intrinsics that take > 4 operands. The 3 and 4 operand encodings are common enough that having an explicit gtArgList constructor is useful (and probably more efficient)

@fiigii force-pushed the tabledrive branch 2 times, most recently from cfba1f8 to 8639f02 on January 10, 2018 17:27
@fiigii (Author) commented Jan 10, 2018

With this PR, implementing a new simple intrinsic only requires adding a declaration to the table (and the instructions).
@CarolEidt @AndyAyersMS @jkotas @mikedn @tannergooding @4creators PTAL

var_types type, GenTree* op1, GenTree* op2, NamedIntrinsic hwIntrinsicID, var_types baseType, unsigned size)
: GenTreeJitIntrinsic(GT_HWIntrinsic, type, op1, op2, baseType, size), gtHWIntrinsicId(hwIntrinsicID)
int gtNumArgs; // number of the non-imm arguments
int gtIval; // the imm8 argument
Member

Why do we need to carry these bits of info here, rather than in the lookup table?

Author (@fiigii, Jan 10, 2018)

Originally, I planned to unify the code for fixed-imm and unfixed-imm intrinsics by carrying the ival in this field, so that the "op" pointers only point to reg/mem arguments (which suits the table-driven design). However, now we seem to need to handle "non-const" arguments via the "compiler fallback", which breaks this design.

However, I still wonder whether we should check constants and generate the exception/jump-table in the importer rather than hardcoding it in the codegen. If we decide to implement the checking in the importer, the codegen can be simplified to handle IMM intrinsics with constants only.

Member

I guess I'm still not sure I understand.

We basically have the following cases:

  • Instruction has a fixed, non-user provided, constant (such as CompareEqual)
    • We don't need to pass anything, it is looked up from the instruction table at emit (this is the same for direct and indirect calls). Nothing special is needed in the importer.
  • Instruction has a user provided constant
    • In the importer, we check if it's constant or in range and determine whether we can expand directly (is constant and in range) or need to go through a GT_CALL (not constant)
    • In the codegen, if the value is constant, we expand directly; otherwise we emit a compiler fallback (failure when the value is not in range is done by the IL implementation currently and isn't something we need to do codegen for directly).

We should be able to unify both and distinguish them based on whether or not ival in the lookup table is -1 ((int)-1 indicates user-specified, otherwise (byte)ival is the fixed constant).
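
A self-contained sketch of that convention (helper and parameter names are illustrative):

#include <cstdint>

// In the lookup table, ival == -1 marks a user-specified immediate; any other
// value is the fixed constant that the instruction always encodes.
static bool tryGetFixedImm(int tableIval, uint8_t* imm8Out)
{
    if (tableIval == -1)
    {
        return false; // the imm8 must come from the user's (constant) argument
    }
    *imm8Out = (uint8_t)tableIval; // e.g. 0 for CompareEqual -> cmpps ..., 0
    return true;
}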

Author

Thanks for the good summary.

In the codegen, if the value is constant, we expand directly; otherwise we emit a compiler fallback.

Can we do this in the front-end (i.e., importer)?

(failure when the value is not in range is done by the IL implementation currently and isn't something we need to do codegen for directly).

I do not think so. Generating the exception for out-of-range arguments in JIT is better because we may not want two range tables for each intrinsic (one in C#, one in JIT).

Member

Can we do this in the front-end (i.e., importer)?

We can't emit the fallback in the importer, but we can detect if the value is constant and determine whether or not we should expand there. I also don't think we want to construct actual blocks for the fallback in the importer. This risks breaking based on future optimizations/etc.

See a8d8aa3 for an example of this with Sse.Shuffle.

I do not think so. Generating the exception for out-of-range arguments in JIT is better because we may not want two range tables for each intrinsic (one in C#, one in JIT).

My comment was mostly based on the current status quo. We've had lots of discussion about this (https://github.com/dotnet/corefx/issues/26194) but haven't landed concretely yet.

Author

My comment was mostly based on the current status quo. We've had lots of discussion about this (dotnet/corefx#26194) but haven't landed concretely yet.

That discussion does not seem related to the out-of-range exception.
We have to validate the argument's value in the JIT and generate the compiler fallback based on the instruction's valid range. These two phases can share one "range table", so I think throwing the exception from the JIT would avoid much redundant work in C#.

Member

The other exceptions (those that aren't PNSE) were discussed starting here. It would require additional JIT helper functions to throw the correct exceptions with the correct messages.

There wasn't a clear consensus on whether or not we should add new JIT helpers (so we could throw these exceptions from the JIT) or if we should do the existing ThrowHelper pattern in IL (https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Runtime/Intrinsics/X86/Avx.cs#L235).

Author

There wasn't a clear consensus on whether or not we should add new JIT helpers (...)

We already have a JIT helper for this exception; it is called CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION.

Member

Ah, ok. Then I see no reason not to do that one from the JIT (and I think that means we only lack a helper for the "this type is not supported" exception, which might push the discussion towards consensus).

The only problem I have encountered is that JIT helpers do not support custom messages, which would be very helpful for intrinsics. If we could refactor the helpers to support them (providing that is possible), then some other problems with exception throwing would be solved as well.

@tannergooding (Member)

This looks okay to me. However, I'm still not convinced the table is in the right shape to work well with the end-goal.

That is, currently all the codegen is simple, doesn't do any containment analysis, and only covers simple intrinsics. However, once we implement an entire ISA (#15538) and add containment support (#15804), the shape of the table and the data we need to carry change a bit (and just looking at the above two PRs isn't quite enough either, since there are a few additional scenarios covered by the other ISAs).

@fiigii (Author) commented Jan 10, 2018

However, I'm still not convinced the table is in the right shape to work well with the end-goal.

the shape of the table and data we need to carry changes a bit

Yes, we definitely need to extend the table and the category in the future.

That is, currently all the codegen is simple, doesn't do any containment analysis, and only covers simple intrinsics.

I think the current table-driven framework is easy to extend for the importer and codegen. The condition functions genTableDrivableHWIntrinsic/impTableDrivableHWIntrinsic can be modified to deal with more complex intrinsics. I do not dare to say anything about containment analysis because of my limited experience, so I want more suggestions from @CarolEidt @mikedn

@tannergooding (Member)

I think the current table-driven framework is easy to extend for the importer and codegen. The condition functions genTableDrivableHWIntrinsic/impTableDrivableHWIntrinsic can be modified to deal with more complex intrinsics. I do not dare to say anything about containment analysis because of my limited experience, so I want more suggestions from @CarolEidt @mikedn

My thinking (and my investigations) show that:

  • In the importer, we care about the number and types of the args. Any two intrinsics which have the same number of args and argument types can use the same code-path (for the sake of argument type, the "must be constant" arguments are effectively their own type).
  • In the Lowering (where we do containment analysis) we mostly care about the number of args, and which args can be contained as that controls which args we check and mark for containment. Ex:
    • addps takes two args, on non-vex the first arg can always be contained, the second arg can be contained if aligned. On VEX, only the second arg can be contained but alignment doesn't matter
    • movhlps takes two args. The first can be contained on non-vex, and neither can be contained on VEX
    • shufps takes three args, it has the same containment category as addps, but a third operand which can be contained if constant
    • etc
  • In the codegen, we mostly care about the final encodings supported. This makes a difference in what args we check for containment so we know if we should call emitIns_R_R_R or emitIns_R_R_S or emitIns_R_R_C, etc. Ex:
    • addps is R_RM_RMA on non-VEX (target must be reg, op1 can be reg or mem, op2 can be reg or aligned mem) and R_R_RM on VEX (target and op1 must be reg, op2 can be reg or mem)
    • movhlps is R_RM_R on non-VEX (target and op2 must be reg, op1 can be reg or mem) and R_R_R on VEX (all three must be reg)
  • etc

The information I listed as "what we mostly care about" is the information that ends up controlling the majority of our codeflow in order to reduce code duplication.

NOTE: In some cases, the checks can be merged further (ex: R_RM_RMA and R_R_RM) since the difference between them is basically a single check (VEX or not-VEX), but having the info available might still be useful
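
A sketch of the encoding classification drawn from the examples above (the enum name and members are illustrative; the real set would come from a full ISA survey):

#include <cstdint>

enum HWIntrinsicEncoding : uint8_t
{
    ENC_R_R_RM,   // target reg, op1 reg, op2 reg/mem             (addps, VEX)
    ENC_R_RM_RMA, // target reg, op1 reg/mem, op2 reg/aligned mem (addps, non-VEX)
    ENC_R_R_R,    // all three operands must be registers         (movhlps, VEX)
    ENC_R_RM_R,   // target reg, op1 reg/mem, op2 reg             (movhlps, non-VEX)
};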

@tannergooding (Member)

I forgot to mention, in Lowering (for containment), knowing properties of the instruction like "am I commutative" can also be useful.

For something like addps, it lets us know that (for VEX machines) if op2 can't be contained, but op1 can, we can still mark op1 as contained and then swap the operands (since it will compute the same result).

bool compSupportsHWIntrinsic(InstructionSet isa);
bool isScalarISA(InstructionSet isa);
int ivalOfHWIntrinsic(NamedIntrinsic intrinsic);

iVal has a ssize_t type, perhaps we should use the same type as is used in GenTree node to carry const

You don't seem to have a use of this, and it doesn't seem that any of the intrinsics use the ival field thus far. Can you explain how it is to be used?

Author

@CarolEidt Some intrinsics that require a hardcoded constant (not from user-passed arguments) will use the ival field; for example, CompareEqual will generate cmpps xmm0, xmm1, 0.
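
For example, a fixed-ival declaration could look like this sketch (the field values are hypothetical, following the table layout quoted elsewhere in this conversation):

// Hypothetical: ival 0 is baked into the table, so codegen emits cmpps ..., 0.
HARDWARE_INTRINSIC(SSE_CompareEqual, "CompareEqual", SSE, 0, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cmpps, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)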

@@ -10718,6 +10753,7 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
sz = emitSizeOfInsDsc(id);
break;
case IF_RWR_RRD_RRD_CNS:
case IF_RWR_RRD_RRD_RRD:

Is it OK to treat IF_RWR_RRD_RRD_CNS and IF_RWR_RRD_RRD_RRD as the same? The constant is part of the instruction encoding and does not need a register, and my worry is that this may impact the emitter at later stages. I have experienced this with IF_RRW_CNS and IF_RRW_RRW_CNS.

Member

I don't think we should treat these as the same. There are other handling and/or encoding checks that this can potentially impact.

Author

Is it OK to treat IF_RWR_RRD_RRD_CNS and IF_RWR_RRD_RRD_RRD as same?

At least, for JITDump, we cannot.

var_types type, GenTree* op1, GenTree* op2, NamedIntrinsic hwIntrinsicID, var_types baseType, unsigned size)
: GenTreeJitIntrinsic(GT_HWIntrinsic, type, op1, op2, baseType, size), gtHWIntrinsicId(hwIntrinsicID)
int gtNumArgs; // number of the non-imm arguments
int gtIval; // the imm8 argument

We should be able to unify both and distinguish them

There are different possible approaches, but I would favor using the sign bit as an indication of whether it represents a constant (bit not set) or a variable (bit set).

Instruction has a fixed, non-user provided, constant

In this case the value can be decoded from the intrinsic name; we do not need to carry it or decode it in the importer, as this can be done in tablegen.

Same as above for gtIval: ival has ssize_t type in the GenTree node.

// SSE Intrinsics
HARDWARE_INTRINSIC(SSE_IsSupported, "get_IsSupported", SSE)
HARDWARE_INTRINSIC(SSE_Add, "Add", SSE)
HARDWARE_INTRINSIC(SSE_IsSupported, "get_IsSupported", SSE, -1, 0, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Special)

nit: "get_IsSupported" should start at least at column 70 (quotation mark) to make space for larger intrinsic names

Author

Thanks! will change.

@fiigii (Author) commented Jan 10, 2018

@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic

@@ -131,6 +131,7 @@ void genFMAIntrinsic(GenTreeHWIntrinsic* node);
void genLZCNTIntrinsic(GenTreeHWIntrinsic* node);
void genPCLMULQDQIntrinsic(GenTreeHWIntrinsic* node);
void genPOPCNTIntrinsic(GenTreeHWIntrinsic* node);
bool genTableDrivableHWIntrinsic(HWIntrinsicCategory category);

I think this would be better as genIsTableDrivenHWIntrinsic

@@ -11,59 +11,69 @@
// clang-format off

#if FEATURE_HW_INTRINSICS
// Table columns:
// Intrinsic ID, Function name, ISA, ival, SIMD size, instructions, Category

There should be a comment above this table that describes each field.


// IMM intrinsics
// - the generated instruction requires immediate value (i.e. imm8)
// - must throw ArgumentException when the argument is not compile-time constant

I don't believe that this is true any more.

HW_Category_IMM = 0x8,

// Fixed IMM intrinsics
// - the immediate value is from ival field of the table

An example would be useful here.

}

//------------------------------------------------------------------------
// categoryOfHWIntrinsic: get the instruction the intrinsic will generate

Should be "get the category of the given intrinsic

// type -- vector base type of this intrinsic
//
// Return Value:
// the instruction the intrinsic will generate on the base type

this needs updating as well

@@ -265,6 +366,15 @@ bool Compiler::compSupportsHWIntrinsic(InstructionSet isa)
isFullyImplmentedISAClass(isa));
}

static bool is64bitOnlyIntrinsicOn32bit(HWIntrinsicCategory category, var_types type)

This name is also confusing to me. How about a more general name, e.g. isTypeSupportedForIntrinsic?

{
case 1:
op1 = impPopStack().val;
retNode = gtNewSimdHWIntrinsicNode(retType, op1, intrinsic, baseType, simdSize);

There needs to be some validation of the operand types that are popped off the stack. Is that being done elsewhere?

Member

The old code had some validation with impSIMDPopStack, it looks like that was lost in the new code so that all intrinsics could share the same code path.

I think something similar to the old code might be more desirable, as it can continue having validation logic on the parameters and signature (but still lets us join a large number of code paths).

It might be worth having the table carry additional information in Checked/Debug builds (signature information) and to have a ValidateSignature helper function which asserts the signature we have for the intrinsic matches what we expect (for all args and the return value).

switch (isa)
{
case InstructionSet_SSE:
return impSSEIntrinsic(intrinsic, method, sig, mustExpand);
return impSSEIntrinsic(intrinsic, method, sig);

I don't think you can get rid of the mustExpand parameter on these methods, as that's what determines whether it can return nullptr so that it instead generates a call.

@fiigii (Author) commented Jan 11, 2018

@CarolEidt Thank you for the feedback! I will fix them all.

@fiigii changed the title from "[WIP] Table-driven Intel hardware intrinsic" to "Table-driven Intel hardware intrinsic" on Jan 12, 2018
@@ -3096,8 +3100,13 @@ class Compiler
CORINFO_METHOD_HANDLE method,
CORINFO_SIG_INFO* sig,
bool mustExpand);
bool impIsTableDrivenHWIntrinsic(HWIntrinsicCategory category);
Member

nit: A number of these can still be static since they only access data from the table

@@ -171,6 +171,7 @@ bool emitter::IsDstDstSrcAVXInstruction(instruction ins)
case INS_vperm2i128:
case INS_xorpd:
case INS_xorps:
case INS_pmuldq:
Member

nit: It would be nice to keep this list in alphabetical order

Author

Thanks, will fix.

assert(varTypeIsArithmetic(argType));
arg = impPopStack().val;
assert(varTypeIsArithmetic(arg->TypeGet()));
assert(genTypeSize(argType) <= genTypeSize(arg->TypeGet()));
Author

@CarolEidt @tannergooding Is this type validation enough? It seems hard to have a precise and general assert for all the HW intrinsics...

Member

I think we want something like genActualType(arg->gtType) == genActualType(argType)

Member

...at least for non-SIMD types. SIMD Types we probably want to be validating the baseType

Author

Yeah, this one looks better.

Agree for the non-SIMD types. IIRC, we can't always rely on the stack node having its type normalized, which is why impSIMDPopStack() doesn't do that. I would recommend something like:

if (varTypeIsSIMD(argType))
{
    arg = impSIMDPopStack();
}
else
{
    arg = impPopStack().val;
    assert(genActualType(arg->gtType) == genActualType(argType));
}

If, in future, we change the normalizing of types to happen earlier, we can add that check in just impSIMDPopStack().

@fiigii (Author) commented Jan 18, 2018

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic
test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic
test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

@tannergooding (Member)

There are at least a couple of failures which are being addressed here: #15901

}
else
{
assert(isSse41Blendv(ins));
// SSE4.1 blendv* hardcode the mask vector (op3) in XMM0
if (reg3 != REG_XMM0)
Member

Do we have a bug tracking this? We should probably have a TODO-XArch-CQ with a link to it.

Author

As I understand based on the above discussion and practice, this is not a bug; the current code can assign XMM0 to reg3 for simple cases. But in some complex use cases, when reg3 cannot get XMM0, LSRA won't insert the "copy", and codegen is responsible for doing it.
BTW, the current CQ looks good to me; a few more zero-latency movaps xmm, xmm do not impact performance too much. But there may be some register-pressure issues that we need more workloads and tests to investigate.

Yes, it is by convention that when a specific register is required, the code generator is responsible for checking whether that is what was assigned, and otherwise doing the copy. It has been debated whether all these checks are more or less efficient than if the register allocator inserted copies. But for now, this is how it's done. It's usually done in codegen (rather than the emitter), I believe, but for this case this seems like the right place.
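
A minimal sketch of that convention for the blendv case (the register/variable names follow the quoted snippet; the exact emitter calls are assumptions):

assert(isSse41Blendv(ins));
if (reg3 != REG_XMM0)
{
    // LSRA did not assign XMM0 to the mask operand, so copy it there before
    // emitting the instruction, which implicitly reads the mask from XMM0.
    emit->emitIns_R_R(INS_movaps, EA_16BYTE, REG_XMM0, reg3);
}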

int numArgs;
instruction ins[10];
HWIntrinsicCategory category;
HWIntrinsicFlag flag;
Member

nit: Maybe flags instead of flag?

Author

Thanks, will fix.

static bool impIsTableDrivenHWIntrinsic(HWIntrinsicCategory category, HWIntrinsicFlag flag)
{
// TODO - make more categories to the table-driven framework
const bool tableDrivenCategory = category == HW_Category_SimpleSIMD || category == HW_Category_MemoryLoad ||
Member

Why can't this be as simple as does not have flag 'NoCodeGen'?

Author (@fiigii, Jan 18, 2018)

Yes, this can just be !Special, !Scalar, and !NoCodeGen. I will change it and add comments.
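
That simplification might look like this sketch (the category and flag names follow the quoted snippets; the exact spelling is an assumption):

// Table-driven iff the intrinsic is not Special, not Scalar, and has codegen.
static bool impIsTableDrivenHWIntrinsic(HWIntrinsicCategory category, HWIntrinsicFlag flags)
{
    return (category != HW_Category_Special) && (category != HW_Category_Scalar) &&
           ((flags & HW_Flag_NoCodeGen) == 0);
}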

@fiigii (Author) commented Jan 18, 2018

There are at least a couple of failures which are being addressed here: #15901

Tests passed on my local repo, which does not have the DebugView change; waiting for #15901 to merge.

@fiigii (Author) commented Jan 18, 2018

@CarolEidt All SSE intrinsics have been merged in the table-driven framework. PTAL.

@CarolEidt left a comment

Just one suggestion; otherwise looks good.


//HARDWARE_INTRINSIC(SSE_StoreAlignedNonTemporal, "StoreAlignedNonTemporal", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movntps, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)
//HARDWARE_INTRINSIC(SSE_StoreHigh, "StoreHigh", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movhps, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)
//HARDWARE_INTRINSIC(SSE_StoreLow, "StoreLow", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movlps, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)
//HARDWARE_INTRINSIC(SSE_StoreScalar, "StoreScalar", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movss, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)

I'm not a fan of adding code that is not yet enabled. I'd prefer to see it omitted until it is implemented.

Author

Done.

@CarolEidt left a comment

LGTM - thanks!

@@ -31,11 +31,12 @@ static unsafe int Main(string[] args)

if (!floatTable.CheckResult((x, y) => {
var expected = 1 / x[0];
var remain = Avx.IsSupported ? (y[1] == x[1]) && (y[2] == x[2]) && (y[3] == x[3]) : true; // VEX-encoding copies remain elements
Member

Why are these failing now, and not before this change?

Member

Ah, because you have it marked as SimpleSIMD, and the 1 arg case goes through emitIns_R_R, which doesn't handle the move.

Previously this was going through: emit->emitIns_SIMD_R_R_R(ins, targetReg, op1Reg, op1Reg, TYP_SIMD16), which handled the move for non-VEX.

Member

I still say the proper fix is to make ReciprocalScalar, ReciprocalSqrtScalar, and SqrtScalar have the same signature as the other scalar intrinsics (upper, value, rather than just value).

It makes the handling easier, makes the API surface consistent, and gives the users the option to set the upper bits as they need.

Author

They failed with COMPlus_EnableAVX=0.
Originally, you used emitIns_SIMD_R_R_R(ins, targetReg, op1Reg, op1Reg), which always copies op1 (the source) to the target under the SSE encoding; that copy is unnecessary.

Author

and gives the users the option to set the upper bits as they need.

But that only makes sense with VEX-encoding.

Author

It also caused the upper bits to be set appropriately.

So we need the "keeping upper bits" semantics? If yes, we need to describe this behavior in the doc.

Member

But that only makes sense with VEX-encoding.

Why? The behavior of the SSE encoding is:
[image: operation pseudo-code for the legacy SSE encoding, from the Intel manual]

So the upper bits of the target register are unmodified.

The behavior of the Intel native intrinsic is:
[image: the corresponding Intel native intrinsic definition]

The behavior of the managed intrinsic should be the same, regardless of AVX or SSE.

Member

So we need the "keeping upper bits" semantics

Yes, at a minimum, in order to give the user deterministic behavior. Otherwise, the upper bits are completely random on non-AVX machines.

However, out of all the scalar intrinsics, we currently have 3 that don't let the user explicitly pass in upper in the API, and I think rectifying that (to make it consistent) should be the actual fix.

Author

user deterministic behavior.

Sounds good, so we indeed want the "keep the upper bits of the source in the target" semantics, right? I will fix it.

Member

Your original implementation keeps the upper bits of source rather than target.

My point was, that the SSE behavior, when using the legacy encoding, is to keep the upper bits of the target. With the VEX encoding, the behavior is to select the upper bits from the first source.

On VEX machines, we currently have logic to pass in value as both sources.

On non-VEX machines, the previous logic was to mirror that behavior by explicitly copying value to the target, and then calling the instruction.

This gives deterministic behavior, regardless of whether you are on an legacy or VEX-enabled machine.

The new behavior makes VEX and non-VEX machines behave differently and results in non-deterministic behavior on legacy machines.

@fiigii (Author) commented Jan 19, 2018

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic
test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

@fiigii (Author) commented Jan 19, 2018

All tests passed on my local machine; CI failed on the unrelated eventsourcetrace test.

@fiigii (Author) commented Jan 19, 2018

@dotnet-bot test this please

@fiigii (Author) commented Jan 19, 2018

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic
test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

@4creators commented Jan 19, 2018

CI failures for:

  • x64_checked_windows_nt_jitincompletehwintrinsic_prtest
  • x64_checked_windows_nt_jitx86hwintrinsicnosimd_prtest
  • x64_checked_windows_nt_jitx86hwintrinsicnoavx_prtest
  • x86_checked_windows_nt_jitx86hwintrinsicnosimd_prtest

are unrelated and caused by the eventsourcetrace test:

 D:\j\workspace\x64_checked_w---1cba8152\bin\tests\Windows_NT.x64.Checked\TestWrappers\tracing.eventsourcetrace\tracing.eventsourcetrace.XUnitWrapper.cs(107): error : tracing_eventsourcetrace._eventsourcetrace_eventsourcetrace_._eventsourcetrace_eventsourcetrace_cmd [FAIL] [D:\j\workspace\x64_checked_w---1cba8152\tests\runtest.proj]

@tannergooding (Member)

The eventsourcetrace failure is tracked by https://github.com/dotnet/coreclr/issues/15924. It is independent of any HWIntrinsic work.

@fiigii (Author) commented Jan 19, 2018

All the related tests have passed. Can we merge this PR?

@fiigii (Author) commented Jan 19, 2018

@4creators Once this PR gets merged, could you try to migrate your SSE2 "simple SIMD intrinsics" into the table-driven framework? I am trying to refactor Load* and the jump-table fallback of the imm intrinsics, so I want to stay on the same page with you to avoid wasting your time.

@CarolEidt CarolEidt merged commit 516b973 into dotnet:master Jan 19, 2018
@4creators commented Jan 19, 2018

@fiigii I am in the process of migrating my 2 PRs to table-driven codegen. They comprise everything except LoadX and StoreX. These are #15777 and #15585.
