
Table-driven Intel hardware intrinsic #15749

Merged: 2 commits into dotnet:master from the tabledrive branch, Jan 19, 2018

Conversation

@fiigii (Author) commented Jan 5, 2018

  • Establish the table-driven framework.
  • Make the importer and codegen of "simple SIMD" intrinsics table-driven
    • tested with Avx.Reciprocal (1-arg), Avx.Multiply (2-arg), and Avx.BlendVariable (3-arg) intrinsics.

With this PR, implementing a new simple intrinsic only requires adding a declaration to the table (and the instructions).
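
For illustration, a new table declaration might look like the following sketch; the macro shape mirrors the HARDWARE_INTRINSIC entries quoted later in this conversation, and the exact field values here are hypothetical:

// Hypothetical entry: one instruction per base type, in the order
// {BYTE, UBYTE, SHORT, USHORT, INT, UINT, LONG, ULONG, FLOAT, DOUBLE}.
HARDWARE_INTRINSIC(AVX_Multiply, "Multiply", AVX, -1, 32, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mulps, INS_mulpd}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)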

@@ -3046,6 +3048,13 @@ class Compiler
GenTree* impPOPCNTIntrinsic(NamedIntrinsic intrinsic, CORINFO_METHOD_HANDLE method, CORINFO_SIG_INFO* sig);
bool compSupportsHWIntrinsic(InstructionSet isa);
bool isScalarISA(InstructionSet isa);
var_types retTypeOfHWIntrinsic(NamedIntrinsic intrinsic);
Member

nit: Many of these can and should be static.

@@ -110,6 +110,8 @@ IF_DEF(RRW_RRW_CNS, IS_R1_RW|IS_R2_RW, SCNS) // r/w reg , r/w r

IF_DEF(RWR_RRD_RRD, IS_R1_WR|IS_R2_RD|IS_R3_RD, NONE) // write reg , read reg2 , read reg3
IF_DEF(RWR_RRD_RRD_CNS, IS_R1_WR|IS_R2_RD|IS_R3_RD, SCNS) // write reg , read reg2 , read reg3, const

IF_DEF(RWR_RRD_RRD_RRD, IS_R1_WR|IS_R2_RD|IS_R3_RD|IS_R4_RD, NONE) // write reg , read reg2 , read reg3 , read reg4
Member

NOTE: There are several others of these we will end up needing. I added a few in my CORINFO_INTRINSIC_Round PR (#14736), but that has yet to be merged.

InstructionSet isa = compiler->isaOfHWIntrinsic(intrinsicID);
HWIntrinsicCategory category = compiler->categoryOfHWIntrinsic(intrinsicID);

if (category == HW_Category_SimpleSIMD)
Member

nit: We should probably have a switch(category) and genHWIntrinsicSimpleSIMD function, etc

Author

A switch may not work with mask bits right now, but we can consider defining several enums later so we can use a switch.

// Table columns:
// Intrinsic ID, Function name, ISA, return type (UNDEF if multiple types), parameter# (-1 if the number varies),
// ival, SIMD size, instructions (one per base type, in the order {TYP_BYTE, TYP_UBYTE, TYP_SHORT, TYP_USHORT,
// TYP_INT, TYP_UINT, TYP_LONG, TYP_ULONG, TYP_FLOAT, TYP_DOUBLE}), Category
Member

are TYP_BYTE and TYP_UBYTE instructions ever different (other than the case where one is valid and one is invalid)?

Member

(question also applies to the other signed vs unsigned types)

Author

Yes, please see multiply in this PR.

// - generate single instruction
// - the codegen of overloads can be determined by intrinsicID and base type of returned vector
// - not any other kind of intrinsic
HW_Category_SimpleSIMD = 0,
Member

From my own investigations, some of these generalizations don't work well for all instructions and can often make containment analysis harder.

For example, this doesn't cover things like: am I commutative (which indicates we can swap op1 and op2 if op1 is containable, but op2 isn't).

It also doesn't cover differences like MoveHighToLow, which qualifies as a simple intrinsic, but does not allow a memory encoding.

Author

Yes, actually I am considering giving the simple intrinsics a mask bit rather than 0.

Member

Maybe we should have a byte category, byte encoding, and a ushort flags field.

  • category is, as you have it.
  • encoding helps identify things like whether the instruction supports Reg, Reg or Reg, Reg/Mem.
  • flags helps identify things like whether the instruction is commutative.
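
A minimal sketch of that layout (the struct, field, and flag names here are illustrative, not actual coreclr definitions):

#include <cstdint>

// Illustrative split of the per-intrinsic metadata into three fields.
struct HWIntrinsicInfo
{
    uint8_t  category; // what kind of intrinsic: SimpleSIMD, Scalar, IMM, ...
    uint8_t  encoding; // supported operand forms: R_R_RM, R_RM_RMA, R_R_R, ...
    uint16_t flags;    // orthogonal instruction properties
};

// Example flag bits (hypothetical values).
constexpr uint16_t HW_Flag_Commutative       = 0x0001;
constexpr uint16_t HW_Flag_RequiresAlignment = 0x0002;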

Author

They are all used for the table-driven framework; just using mask bits looks sufficient for now.
Also, 8 bits may not be enough for the category field.
Why do we need the "flags"? "Commutative" can be a category bit.

Member

It helps reduce code duplication and the number of checks required when they are distinct fields

For example, when doing containment analysis, the most important information is which operands can be contained. With mask bits, we will need multiple comparison checks against the various categories (e.g. if ((category == SIMDScalar_SecondOpContainable) || (category == Scalar_SecondOpContainable))).

The containment analysis, I think, could look like this (pseudo-code):

switch (encoding)
{
    case R_R_RM:  // target must be reg, first operand must be reg, second op can be reg or mem
        if (!op2->isContainable() && (flags & commutative))
        {
            swapOperands();
        }

        // TODO: SSE/SSE2 instructions on non-vex are technically R_RM_RM when `targetReg != op1Reg` since we need an additional `movaps` or `movups` instruction

        if (op2->isContainable() && (isVexEncodingSupported() || (category != SIMDVector))) // on non-vex, the memory operand for SIMDVector must be aligned, ideally we would detect the case where it is aligned and allow containment
        {
            makeContained(op2);
        }
        break;

    // other encodings
}

Once containment analysis has been done, this ends up being important information for codegen as well.

The codegen can then look something like (also pseudo-code):

switch (encoding)
{
    case R_R_RM:
        genHWIntrinsic_R_R_RM(node);
        break;

    // other encodings
}

genHWIntrinsic_R_R_RM(node)
{
    if (op2->isContained())
    {
        // NOTE: most of the "logic" is fairly consistent between things like `R_R_RM` and `R_R_RM_I`, the only difference is whether we call `emit->emitIns_R_R_*` or `emit->emitIns_R_R_*_I`
        //       we should take advantage of this

        // logic
            emit->emitIns_R_R_C();
        // logic
            emit->emitIns_R_R_S();
        // logic
            emit->emitIns_R_R_A();
    }
    else
    {
        emit->emitIns_R_R_R();
    }
}

Member

Basically, I've found the most important piece of information to be the final encodings supported (this is the primary factor in the control flow for our logic).

We then make slight tweaks to that based on the category (am I SIMD, SIMDScalar, Scalar, etc) or flags (am I commutative, do I require alignment, etc) set.

HW_Category_SimpleSIMD = 0,

// Scalar intrinsics
HW_Category_Scalar = 0x1,
Member

What does "scalar" imply here? We have scalar intrinsics (like SqrtScalar), but I think these are meant to cover the intrinsics like popcnt, lzcnt, and crc32.

Author

Yes.

Member

An explicit comment might help, but I think we should come up with a different name to distinguish it from the other scalar intrinsics.

Author

Good point, will change the names.

if (category == HW_Category_SimpleSIMD)
{

GenTree* op1 = node->gtGetOp1();
Member

When we start adding containment support, doing the operand handling here is going to cause a lot of duplicated handling.

We probably want to have a genHWIntrinsic_<encoding> method (e.g. genHWIntrinsic_R_RM) so we can have the correct handling based on the supported encodings (it also lets us handle cases, like the non-AVX instructions, where they are actually RM_RM, due to the additional movaps instruction that may need to be emitted).

Author

I plan to deal with containment with an additional mask bit in the categories.


unsigned Compiler::simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic)
{
assert(intrinsic != NI_Illegal);
Member

Maybe it would be useful to have a #define assertIntrinsicInRange().
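
A sketch of what such a macro might look like (hypothetical; it simply bundles the asserts repeated in each lookup function):

// Hypothetical helper that folds the repeated range checks into one place.
#define assertIntrinsicInRange(intrinsic)                                      \
    do                                                                         \
    {                                                                          \
        assert((intrinsic) != NI_Illegal);                                     \
        assert((intrinsic) > NI_HW_INTRINSIC_START);                           \
        assert((intrinsic) < NI_HW_INTRINSIC_END);                             \
    } while (0)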

{
assert(intrinsic != NI_Illegal);
assert(intrinsic > NI_HW_INTRINSIC_START && intrinsic < NI_HW_INTRINSIC_END);
assert(type >= TYP_BYTE && type <= TYP_DOUBLE);
Member

this is making an assumption that the types list won't be reordered. We should explicitly comment that in both places

return hwIntrinsicInfoArray[intrinsic - NI_HW_INTRINSIC_START - 1].category;
}

GenTreeArgList* Compiler::buildArgList(int num)
Member

nit: This probably doesn't need to be a member on the compiler (it can probably just be a global static).

It also isn't really needed outside of the few intrinsics that take > 4 operands. The 3 and 4 operand encodings are common enough that having an explicit gtArgList constructor is useful (and probably more efficient)

@fiigii force-pushed the tabledrive branch 2 times, most recently from cfba1f8 to 8639f02 on January 10, 2018 17:27
@fiigii (Author) commented Jan 10, 2018

With this PR, implementing a new simple intrinsic only requires adding a declaration to the table (and the instructions).
@CarolEidt @AndyAyersMS @jkotas @mikedn @tannergooding @4creators PTAL

var_types type, GenTree* op1, GenTree* op2, NamedIntrinsic hwIntrinsicID, var_types baseType, unsigned size)
: GenTreeJitIntrinsic(GT_HWIntrinsic, type, op1, op2, baseType, size), gtHWIntrinsicId(hwIntrinsicID)
int gtNumArgs; // number of the non-imm arguments
int gtIval; // the imm8 argument
Member

Why do we need to carry these bits of info here, rather than in the lookup table?

Author (@fiigii, Jan 10, 2018)

Originally, I planned to unify the code for fixed-imm and unfixed-imm intrinsics by carrying the ival in this field, so that the "op" pointers only point to reg/mem arguments (which suits the table-driven design). However, now we seem to need to handle "non-const" arguments via the "compiler fallback", which breaks this design.

However, I still wonder whether we should check constants and generate the exception/jump-table in the importer rather than hardcoding it in the codegen. If we decide to implement the checking in the importer, the codegen can be simplified to handle IMM intrinsics with constants only.

Member

I guess I'm still not sure I understand.

We basically have the following cases:

  • Instruction has a fixed, non-user provided, constant (such as CompareEqual)
    • We don't need to pass anything, it is looked up from the instruction table at emit (this is the same for direct and indirect calls). Nothing special is needed in the importer.
  • Instruction has a user provided constant
    • In the importer, we check if it's constant or in range and determine whether we can expand directly (is constant and in range) or need to go through a GT_CALL (not constant)
    • In the codegen, if the value is constant, we expand directly; otherwise we emit a compiler fallback (failure when the value is not in range is done by the IL implementation currently and isn't something we need to do codegen for directly).

We should be able to unify both and distinguish them based on whether or not ival in the lookup table is -1 ((int)-1 indicates user-specified, otherwise (byte)ival is the fixed constant).
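
A self-contained sketch of that convention (helper and parameter names are illustrative):

#include <cstdint>

// In the lookup table, ival == -1 marks a user-specified immediate; any other
// value is the fixed constant that the instruction always encodes.
static bool tryGetFixedImm(int tableIval, uint8_t* imm8Out)
{
    if (tableIval == -1)
    {
        return false; // the imm8 must come from the user's (constant) argument
    }
    *imm8Out = (uint8_t)tableIval; // e.g. 0 for CompareEqual -> cmpps ..., 0
    return true;
}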

Author

Thanks for the good summary.

In the codegen, if the value is constant, we expand directly; otherwise we emit a compiler fallback.

Can we do this in the front-end (i.e., importer)?

(failure when the value is not in range is done by the IL implementation currently and isn't something we need to do codegen for directly).

I do not think so. Generating the exception for out-of-range arguments in JIT is better because we may not want two range tables for each intrinsic (one in C#, one in JIT).

Member

Can we do this in the front-end (i.e., importer)?

We can't emit the fallback in the importer, but we can detect if the value is constant and determine whether or not we should expand there. I also don't think we want to construct actual blocks for the fallback in the importer. This risks breaking based on future optimizations/etc.

See a8d8aa3 for an example of this with Sse.Shuffle.

I do not think so. Generating the exception for out-of-range arguments in JIT is better because we may not want two range tables for each intrinsic (one in C#, one in JIT).

My comment was mostly based on the current status quo. We've had lots of discussion about this (https://github.com/dotnet/corefx/issues/26194) but haven't landed concretely yet.

Author

My comment was mostly based on the current status quo. We've had lots of discussion about this (dotnet/corefx#26194) but haven't landed concretely yet.

That discussion does not seem related to the out-of-range exception.
We have to validate the argument's value in the JIT and generate the compiler fallback based on the instruction's valid range. These two phases can share one "range table", so I think throwing the exception from the JIT would avoid much redundant work in C#.

Member

The other exceptions (those that aren't PNSE) were discussed starting here. It would require additional JIT helper functions to throw the correct exceptions with the correct messages.

There wasn't a clear consensus on whether or not we should add new JIT helpers (so we could throw these exceptions from the JIT) or if we should do the existing ThrowHelper pattern in IL (https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Runtime/Intrinsics/X86/Avx.cs#L235).

Author

There wasn't a clear consensus on whether or not we should add new JIT helpers (...)

We already have a JIT helper for this exception; it is called CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION.

Member

Ah, ok. Then I see no reason not to do that one from the JIT (and I think that means we only lack a helper for the "this type is not supported" exception, which might push the discussion towards consensus).

The only problem I have encountered is that JIT helpers do not support custom messages, which would be very helpful for intrinsics. If we could refactor the helpers to support them (providing that is possible), then some other problems with exception throwing would be solved as well.

@tannergooding (Member)

This looks okay to me. However, I'm still not convinced the table is in the right shape to work well with the end-goal.

That is, currently all the codegen is simple, doesn't do any containment analysis, and only covers simple intrinsics. However, once we implement an entire ISA (#15538) and add containment support (#15804), the shape of the table and the data we need to carry change a bit (and just looking at the above two PRs isn't quite enough either, since there are a few additional scenarios covered by the other ISAs).

@fiigii (Author) commented Jan 10, 2018

However, I'm still not convinced the table is in the right shape to work well with the end-goal.

the shape of the table and data we need to carry changes a bit

Yes, we definitely need to extend the table and the category in the future.

That is, currently all the codegen is simple, doesn't do any containment analysis, and only covers simple intrinsics.

I think the current table-driven framework is easy to extend for the importer and codegen. The condition functions genTableDrivableHWIntrinsic/impTableDrivableHWIntrinsic can be modified to deal with more complex intrinsics. I do not dare to say anything about containment analysis because of my limited experience, so I want more suggestions from @CarolEidt @mikedn

@tannergooding (Member)

I think the current table-driven framework is easy to extend for the importer and codegen. The condition functions genTableDrivableHWIntrinsic/impTableDrivableHWIntrinsic can be modified to deal with more complex intrinsics. I do not dare to say anything about containment analysis because of my limited experience, so I want more suggestions from @CarolEidt @mikedn

My thinking (and my investigations) show that:

  • In the importer, we care about the number and types of the args. Any two intrinsics which have the same number of args and argument types can use the same code-path (for the sake of argument type, the "must be constant" arguments are effectively their own type).
  • In the Lowering (where we do containment analysis) we mostly care about the number of args, and which args can be contained as that controls which args we check and mark for containment. Ex:
    • addps takes two args, on non-vex the first arg can always be contained, the second arg can be contained if aligned. On VEX, only the second arg can be contained but alignment doesn't matter
    • movhlps takes two args. The first can be contained on non-vex, and neither can be contained on VEX
    • shufps takes three args, it has the same containment category as addps, but a third operand which can be contained if constant
    • etc
  • In the codegen, we mostly care about the final encodings supported. This makes a difference in what args we check for containment so we know if we should call emitIns_R_R_R or emitIns_R_R_S or emitIns_R_R_C, etc. Ex:
    • addps is R_RM_RMA on non-VEX (target must be reg, op1 can be reg or mem, op2 can be reg or aligned mem) and R_R_RM on VEX (target and op1 must be reg, op2 can be reg or mem)
    • movhlps is R_RM_R on non-VEX (target and op2 must be reg, op1 can be reg or mem) and R_R_R on VEX (all three must be reg)
  • etc

The information I listed as "what we mostly care about" is the information that ends up controlling the majority of our codeflow in order to reduce code duplication.

NOTE: In some cases, the checks can be merged further (ex: R_RM_RMA and R_R_RM) since the difference between them is basically a single check (VEX or not-VEX), but having the info available might still be useful
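
A sketch of the encoding classification drawn from the examples above (the enum name and members are illustrative; the real set would come from a full ISA survey):

#include <cstdint>

enum HWIntrinsicEncoding : uint8_t
{
    ENC_R_R_RM,   // target reg, op1 reg, op2 reg/mem             (addps, VEX)
    ENC_R_RM_RMA, // target reg, op1 reg/mem, op2 reg/aligned mem (addps, non-VEX)
    ENC_R_R_R,    // all three operands must be registers         (movhlps, VEX)
    ENC_R_RM_R,   // target reg, op1 reg/mem, op2 reg             (movhlps, non-VEX)
};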

@tannergooding (Member)

I forgot to mention, in Lowering (for containment), knowing properties of the instruction like "am I commutative" can also be useful.

For something like addps, it lets us know that (for VEX machines) if op2 can't be contained, but op1 can, we can still mark op1 as contained and then swap the operands (since it will compute the same result).

bool compSupportsHWIntrinsic(InstructionSet isa);
bool isScalarISA(InstructionSet isa);
int ivalOfHWIntrinsic(NamedIntrinsic intrinsic);

iVal has a ssize_t type, perhaps we should use the same type as is used in GenTree node to carry const

You don't seem to have a use of this, and it doesn't seem that any of the intrinsics use the ival field thus far. Can you explain how it is to be used?

Author

@CarolEidt Some intrinsics that require a hardcoded constant (not from user-passed arguments) will use the ival field; for example, CompareEqual will generate cmpps xmm0, xmm1, 0.
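
For example, a fixed-ival declaration could look like this sketch (the field values are hypothetical, following the table layout quoted elsewhere in this conversation):

// Hypothetical: ival 0 is baked into the table, so codegen emits cmpps ..., 0.
HARDWARE_INTRINSIC(SSE_CompareEqual, "CompareEqual", SSE, 0, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cmpps, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)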

@@ -10718,6 +10753,7 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
sz = emitSizeOfInsDsc(id);
break;
case IF_RWR_RRD_RRD_CNS:
case IF_RWR_RRD_RRD_RRD:

Is it OK to treat IF_RWR_RRD_RRD_CNS and IF_RWR_RRD_RRD_RRD as the same? The constant is part of the instruction encoding and does not need a register, and my worry is that this may impact the emitter at later stages. I have experienced this with IF_RRW_CNS and IF_RRW_RRW_CNS.

Member

I don't think we should treat these as the same. There are other handling and/or encoding checks that this can potentially impact.

Author

Is it OK to treat IF_RWR_RRD_RRD_CNS and IF_RWR_RRD_RRD_RRD as same?

At least, for JITDump, we cannot.

var_types type, GenTree* op1, GenTree* op2, NamedIntrinsic hwIntrinsicID, var_types baseType, unsigned size)
: GenTreeJitIntrinsic(GT_HWIntrinsic, type, op1, op2, baseType, size), gtHWIntrinsicId(hwIntrinsicID)
int gtNumArgs; // number of the non-imm arguments
int gtIval; // the imm8 argument

We should be able to unify both and distinguish them

There are different possible approaches, but I would favor using the sign bit as an indication of whether it represents a constant (bit not set) or a variable (bit set).

Instruction has a fixed, non-user provided, constant

In this case the value can be decoded from the intrinsic name; we do not need to carry it or decode it in the importer, as this can be done in tablegen.

Same as above for gtIval: ival has ssize_t type in the GenTree node.

// SSE Intrinsics
HARDWARE_INTRINSIC(SSE_IsSupported, "get_IsSupported", SSE)
HARDWARE_INTRINSIC(SSE_Add, "Add", SSE)
HARDWARE_INTRINSIC(SSE_IsSupported, "get_IsSupported", SSE, -1, 0, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Special)

nit: "get_IsSupported" should start at least at column 70 (quotation mark) to make space for larger intrinsic names

Author

Thanks! will change.

@fiigii (Author) commented Jan 10, 2018

@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic

@@ -131,6 +131,7 @@ void genFMAIntrinsic(GenTreeHWIntrinsic* node);
void genLZCNTIntrinsic(GenTreeHWIntrinsic* node);
void genPCLMULQDQIntrinsic(GenTreeHWIntrinsic* node);
void genPOPCNTIntrinsic(GenTreeHWIntrinsic* node);
bool genTableDrivableHWIntrinsic(HWIntrinsicCategory category);

I think this would be better as genIsTableDrivenHWIntrinsic

@@ -11,59 +11,69 @@
// clang-format off

#if FEATURE_HW_INTRINSICS
// Table columns:
// Intrinsic ID, Function name, ISA, ival, SIMD size, instructions, Category

There should be a comment above this table that describes each field.


// IMM intrinsics
// - the generated instruction requires immediate value (i.e. imm8)
// - must throw ArgumentException when the argument is not compile-time constant

I don't believe that this is true any more.

HW_Category_IMM = 0x8,

// Fixed IMM intrinsics
// - the immediate value is from ival field of the table

An example would be useful here.

}

//------------------------------------------------------------------------
// categoryOfHWIntrinsic: get the instruction the intrinsic will generate

Should be "get the category of the given intrinsic

// type -- vector base type of this intrinsic
//
// Return Value:
// the instruction the intrinsic will generate on the base type

this needs updating as well

@@ -265,6 +366,15 @@ bool Compiler::compSupportsHWIntrinsic(InstructionSet isa)
isFullyImplmentedISAClass(isa));
}

static bool is64bitOnlyIntrinsicOn32bit(HWIntrinsicCategory category, var_types type)

This name is also confusing to me. How about a more general name, e.g. isTypeSupportedForIntrinsic?

{
case 1:
op1 = impPopStack().val;
retNode = gtNewSimdHWIntrinsicNode(retType, op1, intrinsic, baseType, simdSize);

There needs to be some validation of the operand types that are popped off the stack. Is that being done elsewhere?

Member

The old code had some validation with impSIMDPopStack, it looks like that was lost in the new code so that all intrinsics could share the same code path.

I think something similar to the old code might be more desirable, as it can continue having validation logic on the parameters and signature (but still lets us join a large number of code paths).

It might be worth having the table carry additional information in Checked/Debug builds (signature information) and to have a ValidateSignature helper function which asserts the signature we have for the intrinsic matches what we expect (for all args and the return value).

switch (isa)
{
case InstructionSet_SSE:
return impSSEIntrinsic(intrinsic, method, sig, mustExpand);
return impSSEIntrinsic(intrinsic, method, sig);

I don't think you can get rid of the mustExpand parameter on these methods, as that's what determines whether it can return nullptr so that it instead generates a call.

@fiigii (Author) commented Jan 11, 2018

@CarolEidt Thank you for the feedback! I will fix them all.

@fiigii changed the title from "[WIP] Table-driven Intel hardware intrinsic" to "Table-driven Intel hardware intrinsic" on Jan 12, 2018
@@ -3096,8 +3100,13 @@ class Compiler
CORINFO_METHOD_HANDLE method,
CORINFO_SIG_INFO* sig,
bool mustExpand);
bool impIsTableDrivenHWIntrinsic(HWIntrinsicCategory category);
Member

nit: A number of these can still be static since they only access data from the table

@@ -171,6 +171,7 @@ bool emitter::IsDstDstSrcAVXInstruction(instruction ins)
case INS_vperm2i128:
case INS_xorpd:
case INS_xorps:
case INS_pmuldq:
Member

nit: It would be nice to keep this list in alphabetical order

Author

Thanks, will fix.

assert(varTypeIsArithmetic(argType));
arg = impPopStack().val;
assert(varTypeIsArithmetic(arg->TypeGet()));
assert(genTypeSize(argType) <= genTypeSize(arg->TypeGet()));
Author

@CarolEidt @tannergooding Is this type validation enough? It seems hard to have a precise and general assert for all the HW intrinsics...

Member

I think we want something like genActualType(arg->gtType) == genActualType(argType)

Member

...at least for non-SIMD types. SIMD Types we probably want to be validating the baseType

Author

Yeah, this one looks better.

Agree for the non-SIMD types. IIRC, we can't always rely on the stack node having its type normalized, which is why impSIMDPopStack() doesn't do that. I would recommend something like:

if (varTypeIsSIMD(argType))
{
    arg = impSIMDPopStack();
}
else
{
    arg = impPopStack().val;
    assert(genActualType(arg->gtType) == genActualType(argType));
}

If, in future, we change the normalizing of types to happen earlier, we can add that check in just impSIMDPopStack().

@fiigii (Author) commented Jan 18, 2018

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic
test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic
test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

@tannergooding (Member)

There are at least a couple of failures which are being addressed here: #15901

}
else
{
assert(isSse41Blendv(ins));
// SSE4.1 blendv* hardcode the mask vector (op3) in XMM0
if (reg3 != REG_XMM0)
Member

Do we have a bug tracking this? We should probably have a TODO-XArch-CQ with a link to it.

Author

As I understand based on the above discussion and practice, this is not a bug; the current code can assign XMM0 to reg3 for simple cases. But in some complex use cases, when reg3 cannot get XMM0, LSRA won't insert the "copy", and codegen is responsible for doing it.
BTW, the current CQ looks good to me; a few more zero-latency movaps xmm, xmm do not impact performance too much. But there may be some register-pressure issues that we need more workloads and tests to investigate.

Yes, it is by convention that when a specific register is required, the code generator is responsible for checking whether that is what was assigned, and otherwise doing the copy. It has been debated whether all these checks are more or less efficient than if the register allocator inserted copies. But for now, this is how it's done. It's usually done in codegen (rather than the emitter), I believe, but for this case this seems like the right place.
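
A minimal sketch of that convention for the blendv case (the register/variable names follow the quoted snippet; the exact emitter calls are assumptions):

assert(isSse41Blendv(ins));
if (reg3 != REG_XMM0)
{
    // LSRA did not assign XMM0 to the mask operand, so copy it there before
    // emitting the instruction, which implicitly reads the mask from XMM0.
    emit->emitIns_R_R(INS_movaps, EA_16BYTE, REG_XMM0, reg3);
}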

int numArgs;
instruction ins[10];
HWIntrinsicCategory category;
HWIntrinsicFlag flag;
Member

nit: Maybe flags instead of flag?

Author

Thanks, will fix.

static bool impIsTableDrivenHWIntrinsic(HWIntrinsicCategory category, HWIntrinsicFlag flag)
{
// TODO - make more categories to the table-driven framework
const bool tableDrivenCategory = category == HW_Category_SimpleSIMD || category == HW_Category_MemoryLoad ||
Member

Why can't this be as simple as does not have flag 'NoCodeGen'?

Author (@fiigii, Jan 18, 2018)

Yes, this can just be !Special, !Scalar, and !NoCodeGen. I will change it and add comments.
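
That simplification might look like this sketch (the category and flag names follow the quoted snippets; the exact spelling is an assumption):

// Table-driven iff the intrinsic is not Special, not Scalar, and has codegen.
static bool impIsTableDrivenHWIntrinsic(HWIntrinsicCategory category, HWIntrinsicFlag flags)
{
    return (category != HW_Category_Special) && (category != HW_Category_Scalar) &&
           ((flags & HW_Flag_NoCodeGen) == 0);
}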

@fiigii (Author) commented Jan 18, 2018

There are at least a couple of failures which are being addressed here: #15901

Tests passed on my local repo, which does not have the DebugView change; waiting for #15901 to merge.

@fiigii (Author) commented Jan 18, 2018

@CarolEidt All SSE intrinsics have been merged in the table-driven framework. PTAL.

@CarolEidt left a comment

Just one suggestion; otherwise looks good.


//HARDWARE_INTRINSIC(SSE_StoreAlignedNonTemporal, "StoreAlignedNonTemporal", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movntps, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)
//HARDWARE_INTRINSIC(SSE_StoreHigh, "StoreHigh", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movhps, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)
//HARDWARE_INTRINSIC(SSE_StoreLow, "StoreLow", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movlps, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)
//HARDWARE_INTRINSIC(SSE_StoreScalar, "StoreScalar", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movss, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag)

I'm not a fan of adding code that is not yet enabled. I'd prefer to see it omitted until it is implemented.

Author

Done.

@CarolEidt left a comment

LGTM - thanks!

@@ -31,11 +31,12 @@ static unsafe int Main(string[] args)

if (!floatTable.CheckResult((x, y) => {
var expected = 1 / x[0];
var remain = Avx.IsSupported ? (y[1] == x[1]) && (y[2] == x[2]) && (y[3] == x[3]) : true; // VEX-encoding copies remain elements
Member

Why are these failing now, and not before this change?

Member

Ah, because you have it marked as SimpleSIMD, and the 1 arg case goes through emitIns_R_R, which doesn't handle the move.

Previously this was going through: emit->emitIns_SIMD_R_R_R(ins, targetReg, op1Reg, op1Reg, TYP_SIMD16), which handled the move for non-VEX.

Member

I still say the proper fix is to make ReciprocalScalar, ReciprocalSqrtScalar, and SqrtScalar have the same signature as the other scalar intrinsics (upper, value, rather than just value).

It makes the handling easier, makes the API surface consistent, and gives the users the option to set the upper bits as they need.

Author

They failed with COMPlus_EnableAVX=0.
Originally, you used emitIns_SIMD_R_R_R(ins, targetReg, op1Reg, op1Reg), which always copies op1 (the source) to the target under the SSE encoding; that copy is unnecessary.

Author

and gives the users the option to set the upper bits as they need.

But that only makes sense with VEX-encoding.

Author

It also caused the upper bits to be set appropriately.

So we need the "keeping upper bits" semantics? If yes, we need to describe this behavior in the doc.

Member

But that only makes sense with VEX-encoding.

Why? The behavior of the SSE encoding is:
[image: operation pseudo-code for the legacy SSE encoding, from the Intel manual]

So the upper bits of the target register are unmodified.

The behavior of the Intel native intrinsic is:
[image: the corresponding Intel native intrinsic definition]

The behavior of the managed intrinsic should be the same, regardless of AVX or SSE.

Member

So we need the "keeping upper bits" semantics

Yes, at a minimum, in order to give the user deterministic behavior. Otherwise, the upper bits are completely random on non-AVX machines.

However, out of all the scalar intrinsics, we currently have 3 that don't let the user explicitly pass in upper in the API, and I think rectifying that (to make it consistent) should be the actual fix.

Author

user deterministic behavior.

Sounds good, so we indeed want the "keep the upper bits of the source in the target" semantics, right? I will fix it.

Member

Your original implementation keeps the upper bits of source rather than target.

My point was, that the SSE behavior, when using the legacy encoding, is to keep the upper bits of the target. With the VEX encoding, the behavior is to select the upper bits from the first source.

On VEX machines, we currently have logic to pass in value as both sources.

On non-VEX machines, the previous logic was to mirror that behavior by explicitly copying value to the target, and then calling the instruction.

This gives deterministic behavior, regardless of whether you are on an legacy or VEX-enabled machine.

The new behavior makes VEX and non-VEX machines behave differently and results in non-deterministic behavior on legacy machines.

@fiigii (Author) commented Jan 19, 2018

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic
test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

@fiigii (Author) commented Jan 19, 2018

All tests passed on my local machine; CI failed on the unrelated eventsourcetrace test.

@fiigii (Author) commented Jan 19, 2018

@dotnet-bot test this please

@fiigii (Author) commented Jan 19, 2018

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic
test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

@4creators commented Jan 19, 2018

CI failures for:

  • x64_checked_windows_nt_jitincompletehwintrinsic_prtest
  • x64_checked_windows_nt_jitx86hwintrinsicnosimd_prtest
  • x64_checked_windows_nt_jitx86hwintrinsicnoavx_prtest
  • x86_checked_windows_nt_jitx86hwintrinsicnosimd_prtest

are unrelated and caused by the eventsourcetrace test:

 D:\j\workspace\x64_checked_w---1cba8152\bin\tests\Windows_NT.x64.Checked\TestWrappers\tracing.eventsourcetrace\tracing.eventsourcetrace.XUnitWrapper.cs(107): error : tracing_eventsourcetrace._eventsourcetrace_eventsourcetrace_._eventsourcetrace_eventsourcetrace_cmd [FAIL] [D:\j\workspace\x64_checked_w---1cba8152\tests\runtest.proj]

@tannergooding (Member)

The eventsourcetrace failure is tracked by https://github.com/dotnet/coreclr/issues/15924. It is independent of any HWIntrinsic work.

@fiigii (Author) commented Jan 19, 2018

All the related tests have passed. Can we merge this PR?

@fiigii (Author) commented Jan 19, 2018

@4creators Once this PR gets merged, could you try to migrate your SSE2 "simple SIMD intrinsics" into the table-driven framework? I am trying to refactor Load* and the jump-table fallback of the imm intrinsics, so I want to stay on the same page with you to avoid wasting your time.

@CarolEidt CarolEidt merged commit 516b973 into dotnet:master Jan 19, 2018
@4creators commented Jan 19, 2018

@fiigii I am in the process of migrating my 2 PRs to table-driven codegen. They comprise everything except LoadX and StoreX. These are #15777 and #15585.
