This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Updating Compiler::impIntrinsic to always expand hardware intrinsics. #15639

Merged
merged 2 commits on Jan 6, 2018

Conversation

tannergooding
Member

@tannergooding tannergooding commented Dec 27, 2017

Issue

Hardware intrinsics are not currently expanded inline when minopts or compDbgCode is enabled.

This means that, rather than the raw instruction being emitted, a call to the hardware intrinsic method is emitted instead (ex: Sse.Shuffle).
The hardware intrinsic method itself is recursive and will be expanded when it is jitted.

Because of this, nodes that were originally GT_CNS_* are now GT_LCL_VAR and methods which require constant parameters fail codegen.

Resolution

Hardware intrinsics should either always be expanded or have a software fallback implemented for the methods which require constant parameters.

This PR does the former, but it has some drawbacks.

  • Reflection, delegates, or other forms of indirect calling will fail for intrinsics which require constant parameters
  • Various external parts may need to be updated to work with hardware intrinsics; this includes:
    • Debuggers
    • Profilers
    • IL Interpreters
    • JIT minopts
    • etc

NOTE: Regarding the second bullet point above, the external parts may need to be updated to work with hardware intrinsics when optimizations are enabled anyway.

The latter (software fallback/not always expanding intrinsics) has several concerns around the usability and performance of hardware intrinsics when optimizations are disabled.

Ex: While some performance degradation is normally expected, hardware intrinsics generally map to a single underlying hardware instruction. Not expanding the intrinsics will result in a call, plus stack spilling, per instruction. This makes the overhead significantly greater than for a normal method, which will generally execute a series of instructions per call. It can also cause the hardware intrinsics to perform worse than a naive serial (non-vectorized) algorithm.

Impacted Intrinsics

This will end up impacting all intrinsics which require a constant parameter for codegen to complete successfully.

A non-exhaustive list includes:
Extract, Insert, ShuffleHigh, ShuffleLow, ShiftLeftLogical, ShiftLeftLogical128BitLane, ShiftRightLogical, ShiftRightLogical128BitLane, ShiftRightArithmetic, Blend, MultiplySumAbsoluteDifferences, CompareImplicitLength, CompareExplicitLength, CompareImplicitLengthIndex, CompareExplicitLengthIndex, Compare, etc...

These intrinsics are spread out across most of the exposed ISAs and several of them have multiple overloads. This means we are looking at a large number of intrinsics that will be impacted.

ARM/ARM64 is adding its own intrinsics as well and will likely also be impacted. I do not currently have a list of which of those intrinsics would be affected.

@tannergooding
Member Author

FYI. @mikedn and @4creators, since you have been active in reviewing the related PRs

@mikedn

mikedn commented Dec 27, 2017

In particular, I saw GT_CNS_INT being transformed to GT_CAST->GT_UBYTE->GT_LCL_VAR which would cause issues for instructions like Sse.Shuffle.

That sounds odd, any actual example? I don't see how a constant node could be transformed into a cast of a lclvar.

@4creators

In particular, I saw GT_CNS_INT being transformed to GT_CAST->GT_UBYTE->GT_LCL_VAR which would cause issues for instructions like Sse.Shuffle

I have been hitting this problem as well. Will try to create a repro.

@jkotas
Member

jkotas commented Dec 27, 2017

This was causing method parameters to undergo later morphing and register assignment which could potentially cause assertions to fail.

This is a bug that should be fixed. I do not think that the right fix for this bug is to expand the hardware intrinsics unconditionally, e.g. even under minopts.

@tannergooding
Member Author

I do not think that the right fix for this bug is to expand the hardware intrinsics unconditionally, e.g. even under minopts.

If there is a bug, I agree this is not a proper fix. However, I do think we should be expanding hardware intrinsics unconditionally, as is done with SIMD intrinsics (see https://github.com/dotnet/coreclr/blob/master/src/jit/importer.cpp#L7173).

I am rebuilding to get the JitDump now and will share shortly.

@fiigii

fiigii commented Dec 27, 2017

Agree with @jkotas. Unconditionally expanded intrinsics cannot work with reflection calls.

@tannergooding
Member Author

It is definitely not a bug.

Here is the JitDump_Main.txt

If you follow the code, which starts at L867:

    [ 0]  92 (0x05c) ldloc.3
    [ 1]  93 (0x05d) ldloc.s 4
    [ 2]  95 (0x05f) ldc.i4.s 27
    [ 3]  97 (0x061) call 0A00000D
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
HW Intrinsic SIMD Candidate Type Vector128`1 with Base Type Single
  Found type Hardware Intrinsic SIMD Vector128<float>
Calling impNormStructVal on:
               [000144] ------------              *  LCL_VAR   simd16 V05 loc4         
HW Intrinsic SIMD Candidate Type Vector128`1 with Base Type Single
  Found type Hardware Intrinsic SIMD Vector128<float>
resulting tree:
               [000149] x-----------              *  OBJ(16)   simd16
               [000148] L-----------              \--*  ADDR      byref 
               [000144] ------------                 \--*  LCL_VAR   simd16 V05 loc4         
Calling impNormStructVal on:
               [000143] ------------              *  LCL_VAR   simd16 V04 loc3         
HW Intrinsic SIMD Candidate Type Vector128`1 with Base Type Single
  Found type Hardware Intrinsic SIMD Vector128<float>
resulting tree:
               [000152] x-----------              *  OBJ(16)   simd16
               [000151] L-----------              \--*  ADDR      byref 
               [000143] ------------                 \--*  LCL_VAR   simd16 V04 loc3         

lvaGrabTemp returning 35 (V35 tmp11) called for impSpillStackEnsure.
HW Intrinsic SIMD Candidate Type Vector128`1 with Base Type Single
  Found type Hardware Intrinsic SIMD Vector128<float>


               [000157] ------------              *  STMT      void  (IL 0x05C...  ???)
               [000146] S-C-G-------              \--*  CALL      void   System.Runtime.Intrinsics.X86.Sse.Shuffle
               [000155] L----------- arg0            +--*  ADDR      byref 
               [000154] ------------                 |  \--*  LCL_VAR   simd16 V35 tmp11        
               [000152] x----------- arg1            +--*  OBJ(16)   simd16
               [000151] L-----------                 |  \--*  ADDR      byref 
               [000143] ------------                 |     \--*  LCL_VAR   simd16 V04 loc3         
               [000149] x----------- arg2            +--*  OBJ(16)   simd16
               [000148] L-----------                 |  \--*  ADDR      byref 
               [000144] ------------                 |     \--*  LCL_VAR   simd16 V05 loc4         
               [000145] ------------ arg3            \--*  CNS_INT   int    27

It eventually gets given a temporary register (R9) so the argument can be passed and a call to the actual method takes place.

Then, when the method is jitted (see JitDump_Shuffle.txt), we no longer have a constant, but a method ARG (which is in local 3).

@tannergooding
Member Author

Unconditionally expanded intrinsics cannot work with reflection calls.

@fiigii, why won't they work? Testing locally, they appear to function the same as when the if (!mustExpand && (opts.compDbgCode || opts.MinOpts())) check is hit today.

That is, a call to the method happens and when that method itself is jitted, the intrinsic is expanded (since it is a recursive call).

If the non-recursive calls (currently only IsSupported) are an issue, we can filter those out to not always be expanded.

In either case, when the intrinsics are not forced expanded, we need to determine how to handle arguments that are going to be required to be constants.

@mikedn

mikedn commented Dec 27, 2017

Huh? This all looks bizarre. What does reflection have to do with this? Who in their right mind would call such intrinsics via reflection?

In either case, when the intrinsics are not forced expanded, we need to determine how to handle arguments that are going to be required to be constants.

That would be hilarious. The only way to do this would be to generate a giant switch having one case for each supported immediate value (luckily these immediate values are only 8 bits, so there are at most 256 cases).
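A scalar C sketch of the dispatch shape being described (only a few of the 256 cases shown; all names are hypothetical, not JIT code):

```c
/* Sketch of the per-immediate dispatch described above: the control byte
   arrives as a runtime value, and each case pins it back to a compile-time
   constant.  In a real fallback, each case body would execute the
   instruction's immediate form; here it just returns the pinned constant. */
static int dispatch_on_control(unsigned char control)
{
    switch (control) {
    case 0:   return 0;   /* would emit: shufps ..., 0   */
    case 1:   return 1;   /* would emit: shufps ..., 1   */
    /* ... cases 2 through 254 elided ... */
    case 255: return 255; /* would emit: shufps ..., 255 */
    default:  return -1;  /* unreachable once all 256 cases are present */
    }
}
```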

@tannergooding
Member Author

That would be hilarious. The only way to do this would be to generate a giant switch having one case for each supported immediate value (luckily these immediate values are only 8 bits, so there are at most 256 cases).

In any case, there is no bug, just a poor initial explanation on my part. The GT_CNS_INT is preserved properly, but when the intrinsic is expanded, it is because we are jitting the intrinsic method itself (rather than expanding it directly inline), so we have a GT_LCL_VAR.

We should probably always expand these intrinsics and if supporting these via reflection is actually needed, we will have to come up with a way to handle arguments that are meant to be constant at that time (maybe we could attach some metadata to the GT_CALL for intrinsics to indicate the third arg was constant).

@mikedn

mikedn commented Dec 27, 2017

we will have to come up with a way to handle arguments that are meant to be constant at that time (maybe we could attach some metadata to the GT_CALL for intrinsics to indicate the third arg was constant).

The moment you make a call, the argument is no longer a constant; there's no way around that.

@fiigii

fiigii commented Dec 27, 2017

We seem to need a feature that converts literal arguments to JIT-time constants (e.g., limited partial evaluation). That feature would make const parameters more useful.

@4creators

The only way to do this would be to generate a giant switch having one case for each supported immediate value

Why would that be required? I don't understand that; could you please explain it in an easy way?

@4creators

4creators commented Dec 27, 2017

That feature would make const parameter more useful

Having language support for const parameters would solve half the problems we have... Is it worth raising this issue again with the C# language team? During code review @jaredpar was positive about it; unfortunately, other members and @mikedn don't agree. What is your opinion?

dotnet/csharplang#886

@tannergooding
Member Author

@4creators, The issue still exists for Reflection based and for intrinsics which are not always expanded.

Since the JIT has no enforcement of const parameters, a user could hand-author IL or use reflection to "bypass" the language feature.

Once the constant is passed to a method, it becomes a local and you can no longer guarantee that it is/was a constant (especially in a place like Reflection, where the actual invocation could be several calls down, after the value has been boxed and placed in an array).

I think we may just have to say Reflection is not supported for some of these functions (and is generally frowned upon for them in any case).

@4creators

4creators commented Dec 27, 2017

I think we may just have to say Reflection is not supported for some of these functions (and is generally frowned upon for them in any case)

IMO leaving reflection access to hardware intrinsics open is unnecessary, and the best solution would be to keep it closed except perhaps for inspecting/reading metadata.

@jkotas
Member

jkotas commented Dec 27, 2017

It was discussed in https://github.com/dotnet/corefx/issues/16835#issuecomment-315628433 . This is not just about direct reflection.

The only way to do this would be to generate a giant switch having one case for each supported immediate value

It does not have to be as dumb as that. We can do better by having a specific implementation for each intrinsic.

I do think we should be expanding hardware intrinsics unconditionally, as is done with SIMD intrinsics

This is an implementation deficiency that we should avoid replicating to more places if possible. MinOpts should be doing as little as possible.

@mikedn

mikedn commented Dec 27, 2017

It was discussed in dotnet/corefx#16835 (comment) . This is not just about direct reflection.

I don't see any convincing argument in that comment.

It does not have to be as dumb as that. We can do better by having a specific implementation for each intrinsic.

That sounds even more complicated than a switch. IMO there has to be a very good use case to do something like this.

MinOpts should be doing as little as possible.

AFAIK MinOpts is intended as an escape hatch in case the JIT is bugged. The proper implementation in that case would be for IsSupported to return false so that code paths that rely on SIMD do not run. I do not think that reporting that intrinsics are supported while at the same time not actually treating these methods as intrinsics qualifies as doing as little as possible. Quite the contrary: it sounds more like doing half of what's possible and creating a Frankenstein.

@tannergooding
Member Author

It does not have to be as dumb as that. We can do better by having a specific implementation for each intrinsic.

How do you propose we do this for intrinsics which take a user-provided immediate? Having a switch statement to handle 256 values seems a bit overkill...

I also don't see a good way to differentiate between the cases Shuffle(x, y, 0) and Shuffle(x, y, z). By the time the intrinsic expansion happens, there isn't really any way to identify this (and even if we track the metadata, I don't think we have a way to reliably do this for Reflection).

MinOpts should be doing as little as possible.

I think we are doing the same work either way. In fact, we might end up doing more work by not forcing these intrinsics to expand, since the JIT thinks we have to do additional register allocations, copying values out of XMM0 (the return register), etc.

I'm also not quite sure how the non-expanded form of an instruction that takes an immediate, like Sse.Shuffle, is supposed to look (assuming we could properly handle constants). The actual generated code, when not inlined, needs to be able to handle all 256 immediates, which means actually emitting a giant jump table....

@tannergooding
Member Author

Basically, I don't see how, when not expanding, we can avoid having Sse.Shuffle compile down to:

Vector128<float> Shuffle(Vector128<float> left, Vector128<float> right, byte control)
{
    Vector128<float> result;

    switch (control)
    {
        case 0:
            vshufps result, left, right, 0
            break;

        // case 1 .. case 254

        case 255:
            vshufps result, left, right, 255
            break;
    }

    return result;
}

@jkotas
Member

jkotas commented Dec 28, 2017

We had two options:

  1. Have built-in non-inlined implementations of the hardware intrinsics. Providing the fallback is our problem. It mostly just works thanks to the recursive force-expansion of the intrinsics themselves. The problematic cases are the intrinsics that take constant immediates. We should be able to deal with them.
  2. Do not have built-in non-inlined implementations. Providing the fallback, or ensuring that the fallback is not needed, is everybody else's problem. The affected parts include:
  • Debuggers: VS expression evaluator
  • Profilers: a number of profiler vendors that do IL instrumentation need to make sure not to disturb constant immediates passed to the intrinsics by accident
  • IL interpreters: We do not have a shipping one in .NET Core today, but we do not want to make it hard/impossible to have one in future
  • JIT minopts

We have been working towards option 1 so far. I believe that it is the easier and overall cheaper option. If we wanted to switch to option 2, we would need a plan for how to deal with the fallout and who is going to be involved in executing it.

Sse.Shuffle

The non-inlined implementation needs to be functionally correct for reasonable cost. It does not have to be top performance. I think that the non-inlined implementation of Sse.Shuffle can be something like:

Vector128<float> Shuffle(Vector128<float> left, Vector128<float> right, byte control)
{
    if (!IsSupported) throw new PlatformNotSupportedException();

    Vector128<float> result;
    ((float*)&result)[0] = ((float*)&left)[control & 3];
    ((float*)&result)[1] = ((float*)&left)[(control >> 2) & 3];
    ((float*)&result)[2] = ((float*)&right)[(control >> 4) & 3];
    ((float*)&result)[3] = ((float*)&right)[control >> 6];
    return result;
}
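Translated to plain scalar C (a runnable sketch of the same lane selection, matching the shufps immediate encoding; not the actual CoreCLR code):

```c
/* Scalar model of the fallback above: result lanes 0-1 select from `left`,
   lanes 2-3 from `right`, each driven by a 2-bit field of the control byte. */
static void shuffle_fallback(const float left[4], const float right[4],
                             unsigned char control, float result[4])
{
    result[0] = left[control & 3];
    result[1] = left[(control >> 2) & 3];
    result[2] = right[(control >> 4) & 3];
    result[3] = right[(control >> 6) & 3];
}
```

For example, with left = {1, 2, 3, 4}, right = {5, 6, 7, 8} and control = 27 (binary 00011011), the result is {4, 3, 6, 5}.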

If we keep using the non-inlined implementation for minopts, the test coverage for this should come from the existing minopts runs.

The proper implementation in that case would be for IsSupported to return false so that code paths that rely on SIMD do not run

I do not think that the IsSupported property can change its value during the lifetime of the process like that. It would be very hard to program and test against; e.g., refactoring a method into two would be a dangerous change because IsSupported could be true in some methods and false in other methods within the same process.

The IsSupported property means that the methods within the enclosing class are functional. It does not make strict guarantees about their performance.

@tannergooding
Member Author

We have been working towards option 1 so far.

From the design review: https://github.com/dotnet/apireviews/blob/master/2017/08-15-Intel%20Intrinsics/README.md

This approach differs from our existing SIMD infrastructure, especially Vector, in that these API are hardware specific. In other words, code using it is no longer portable between CPU architectures. Furthermore, we do not provide a software fallback. Instead, the developer is expected to guard calls using a provided capability API. Failing to do so will cause PlatformNotSupportedException at runtime.

My understanding of this is that the instructions are meant to always be emitted/inlined and that we should not be providing a software fallback or emulating these instructions at all.

I had thought that placing the fallback burden on the user was an explicit decision of the hardware intrinsics, given their nature.

@tannergooding
Member Author

I think that, given the nature of hardware intrinsics, their general implementation strategy in other compilers (albeit native compilers), and their target audience, option 2 is probably the better choice overall.

Many of the impacted parts listed for option 2 also apply with optimizations enabled, so they will need to be updated to handle these areas anyway.

I also believe the target audience of this feature will want to get somewhat accurate information, even in debug mode. If we don't always expand the intrinsics, performance may actually drop in Debug mode compared to a non-hwintrinsic version of the same algorithm.

Many of the algorithms that will use hw-intrinsics rely on executing these instructions in a tight loop, and if the intrinsics (with optimizations disabled) actually compile down to two mov instructions (to load the args into xmm0/xmm1) and then a call (which itself only executes the target instruction and returns), then perf will not be good. (It will likely also result in a bunch of additional work for the register allocator, as we have a tight set of calls that has to keep reusing xmm0-xmm3 and then spilling the registers back to the stack.)

@jkotas
Member

jkotas commented Dec 28, 2017

It is not unusual to see perf with optimizations disabled be multiple times slower, and I have never heard people complain about it. I would want to wait for some hard data showing that it is a problem before doing anything about it.

@mikedn

mikedn commented Dec 28, 2017

I do not think that the IsSupported property can change its value during the lifetime of the process like that. It would be very hard to program and test against; e.g., refactoring a method into two would be a dangerous change because IsSupported could be true in some methods and false in other methods within the same process.

Hrm, yes, that is a serious problem unfortunately.

Profilers: a number of profiler vendors that do IL instrumentation need to make sure not to disturb constant immediates passed to the intrinsics by accident

I'm not sure why a profiler would do something like that, but in any case, it's their problem. One way or another they need to ensure that they don't significantly impact the performance of the code, otherwise they're useless. They may even need to recognize these intrinsics if they want to provide reasonable numbers; attempting to treat them as normal calls may result in a mess.

IL interpreters (and possibly VS expression evaluator, I think it actually uses an IL interpreter)

That's an interesting case. An IL interpreter would treat these intrinsics as normal calls, OK. But then who's taking care of the recursive call inside the intrinsic method? It seems that the JIT will have to be invoked, or that we need to provide Shuffle-like fallbacks for all intrinsics.

It is not unusual to see perf with optimizations disabled to be multiple times slower

I think it's incorrect to equate intrinsics with optimizations. That's not how intrinsics work in C/C++ and I don't see why it would work differently in .NET. Except, of course, "ah, the debugger/JIT can't properly handle this" type of scenarios.

and I have never heard people complain about it

It's not like people have a choice. The JIT's support for debugging with optimizations enabled is extremely poor. So you need to disable optimizations if you want to debug, and then naturally the program runs slower. There's nothing to complain about, except perhaps about the JIT.

@tannergooding
Member Author

I think it's incorrect to equate intrinsics with optimizations. That's not how intrinsics work in C/C++ and I don't see why it would work differently in .NET.

I agree completely.

Normal intrinsics are just functions handled specially by the compiler and that handling can vary.

Hardware intrinsics are functions that are expected to emit a very particular hardware instruction (it is basically a form of inline assembly).

I don't think they should be treated the same (and no other compiler, that I am aware of, does).

@sdmaclea

sdmaclea commented Jan 5, 2018

I think that others on this thread are actually more familiar with the developer side of this, so I'd be interested in hearing more thoughts on this.

As a C++ developer, I certainly would have been more comfortable with a compile time error. Anything less adds complexity.

It has been asserted that HW intrinsic usage will be limited to advanced users who care about every cycle. If this is really true, the generated code will be disassembled and looked at incessantly.

I certainly had a lot of experience optimizing for TI C6x processors by writing C++ code and looking at the generated assembly. The experience was not perfect, but I quickly learned exactly what assembly a given C++ code sequence would produce.

I cannot comment on C#.
I cannot comment on Analyzers.

The places where these immediates must be constants seem to be gaps in the underlying ISA. The ISA designers believed the use cases did not warrant the flexibility. The gaps often cause developer headaches. Having the flexibility to use non-const values will sometimes be useful, especially when:

  • performance does not matter
  • writing initial drafts of optimized code for experimentation
  • writing test code

In general, I would prefer functionally correct code. I would prefer to never be required to run coverage analysis. Allowing non-const values to work correctly without throwing seems best.

@@ -3388,6 +3389,20 @@ GenTree* Compiler::impIntrinsic(GenTree* newobjThis,
{
// The recursive calls to Jit intrinsics are must-expand by convention.
mustExpand = mustExpand || gtIsRecursiveCall(method);
Member


If we're now always expanding calls to HW intrinsics, then isn't this comment and logic out of date?

Member Author


It might be.

I wasn't sure if there are other JIT intrinsics which can be recursive, but for which we do not want to always expand.

Member


We added this bit just for HW intrinsics.

Member Author


Then should we rename the bit to indicate that it is exclusively for HW intrinsics (CORINFO_FLG_HW_INTRINSIC), or that the bit implies "mustExpand = true" (CORINFO_FLG_MUSTEXPAND_INTRINSIC)?

Based on the current logic in the VM (https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L5144) this bit is set for any method marked with [Intrinsic]

Member


Ah, 'bit' was a poor choice of words. We added this bit of logic in the importer just for HW intrinsics.

Not all [Intrinsic] methods are must expand; some of them have perfectly viable IL implementations.

Member Author


Ok, I think I get it 😄

Then my question is: Do we expect all intrinsics for which gtIsRecursiveCall() would return true to always expand (except for indirect invocation) or do we only expect them to always expand for hardware intrinsics?

Even in the case of the former, I'm not sure how we detect the difference between a JitIntrinsic that is recursive and one that has an IL implementation on the first pass.

@CarolEidt

Then my question is: Do we expect all intrinsics for which gtIsRecursiveCall() would return true to always expand (except for indirect invocation) or do we only expect them to always expand for hardware intrinsics?

Interesting question. Would it be correct to rephrase that as "Are there scenarios that would motivate the introduction of a call (e.g. for profiling or other tools), such that mustExpand would be false, but we would still want to expand if gtIsRecursiveCall() is true so that the call would itself invoke the hw intrinsic?"

If so, I think there might be.

@CarolEidt

And:

Even in the case of the former, I'm not sure how we detect the difference between a JitIntrinsic that is recursive and one that has an IL implementation on the first pass.

I would expect that intrinsics with an IL implementation would not invoke themselves recursively.

@sdmaclea

sdmaclea commented Jan 5, 2018

I would expect that intrinsics with an IL implementation would not invoke themselves recursively.

Is there a distinction between recursion and infinite recursion? Factorial could be implemented recursively, but it would need a termination condition. Whether the distinction matters for HW intrinsics is another question.

@tannergooding
Member Author

I would expect that intrinsics with an IL implementation would not invoke themselves recursively.

My point on this one was that, the first time we go into impIntrinsic for any given intrinsic, we are importing a call site in some other method, so gtIsRecursiveCall() will return false. At that point, we have the option to force expansion or to return nullptr (this is the "first pass" for the intrinsic).

When we return nullptr, we insert a GT_CALL to the intrinsic. Then, we will eventually go to JIT the GT_CALL at which point we will either inline the IL or see that it is recursive and force expansion (this is the "second pass").

In the "first pass", I am not sure there is a good way to tell the difference between a method which has an IL implementation and a method for which gtIsRecursiveCall() will return true.
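The two passes can be modeled as a small decision function (a sketch with made-up names; this is not the actual Compiler::impIntrinsic signature):

```c
#include <stdbool.h>

/* Hypothetical model of the two-pass flow described above.
   Pass 1: importing a call site inside some other method; the call is not
   recursive, so the importer emits a GT_CALL (the nullptr path).
   Pass 2: jitting the intrinsic method itself; its body calls itself, the
   call is recursive, and expansion is forced. */
enum import_action { EXPAND_INLINE, EMIT_CALL };

static enum import_action import_hw_intrinsic(bool must_expand,
                                              bool is_recursive_call)
{
    if (must_expand || is_recursive_call)
        return EXPAND_INLINE;
    return EMIT_CALL;
}
```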

Additionally, as @sdmaclea pointed out, there may be some methods for which a partial IL implementation exists (one that does some logic and then calls itself -- we thought about doing this for the compiler fallback on immediates, but opted not to since the higher-level compiler might incorrectly optimize it).

@tannergooding
Member Author

Interesting question. Would it be correct to rephrase that as "Are there scenarios that would motivate the introduction of a call (e.g. for profiling or other tools), such that mustExpand would be false, but we would still want to expand if gtIsRecursiveCall() is true so that the call would itself invoke the hw intrinsic?"

Possibly. I think we already know there are some scenarios where we can't expand (indirect calling), but yes, there may be other scenarios where some tooling may want to explicitly disable forced expansion for these.

@sdmaclea

sdmaclea commented Jan 5, 2018

Also note this pattern. Not sure if it impacts anything.

        /// <summary>
        /// __int32 _mm256_extract_epi32 (__m256i a, const int index)
        /// </summary>
        public static int ExtractInt32<T>(Vector256<T> value, byte index) where T : struct 
        {
            ThrowHelper.ThrowNotSupportedExceptionIfNonNumericType<T>();
            return ExtractInt32<T>(value, index);
        }

@tannergooding
Member Author

Also note this pattern.

For that pattern in particular, we are already doing the ThrowNotSupportedIfNonNumericType validation in the JIT during the importer phase (whether recursive or not), so I'm not sure what benefit there is to having it in the IL.

@AndyAyersMS
Member

If you are implementing an [Intrinsic] method, especially one with non-viable IL, you need to know what you're doing.

Either the HW intrinsic expansions should be conditioned on mustExpand (using the current logic), or they should be must-expand by default, in which case we don't need to check for recursion anymore. I would prefer the former but can live with the latter.

If there are odd special cases they can be handled individually; we eventually must case out for each particular intrinsic.

@tannergooding
Member Author

tannergooding commented Jan 5, 2018

For the former case, I did it the current way because it is more efficient.

Otherwise we would have:

#if FEATURE_HW_INTRINSICS
+#ifdef _TARGET_XARCH_
+            if (ni > NI_HW_INTRINSIC_START && ni < NI_HW_INTRINSIC_END)
+            {
+                mustExpand = true;
+            }
+#endif // _TARGET_XARCH_
+#endif // FEATURE_HW_INTRINSICS

And then again (at the bottom of the file, with the rest of the named intrinsic handling)

#if FEATURE_HW_INTRINSICS
+#ifdef _TARGET_XARCH_
+            if (ni > NI_HW_INTRINSIC_START && ni < NI_HW_INTRINSIC_END)
+            {
+                return impX86HWIntrinsic(ni, method, sig);
+            }
+#endif // _TARGET_XARCH_
+#endif // FEATURE_HW_INTRINSICS

@fiigii

fiigii commented Jan 5, 2018

Question: if we will have fallbacks for IMM intrinsics, why do we need to always expand HW intrinsics in debug and minopts?

@fiigii

fiigii commented Jan 5, 2018

For that pattern in particular, we are already doing the ThrowNotSupportedExceptionIfNonNumericType validation in the JIT during the importer phase (whether recursive or not), so I'm not sure what benefit there is to having it in the IL.

The JIT importer returns nullptr for non-numeric type arguments, which leads HW intrinsics back to the C# implementation to trigger the exception.

@sdmaclea

sdmaclea commented Jan 5, 2018

The JIT importer returns nullptr for non-numeric type arguments, which leads HW intrinsics back to the C# implementation to trigger the exception.

It should probably return the throw, similar to the PlatformNotSupported case.

@tannergooding
Member Author

@fiigii, as for the question on expanding under minopts.

@CarolEidt stated many of the same concerns that others have raised: #15639 (comment)

@AndyAyersMS
Member

One of the proposed fallbacks for non-const args would still expand in the recursive case -- it would just expand to something more complex.

Seems like there are enough combinatorics here that some sort of master guide would be useful:

  • prejit vs jit
  • root (or parent) method is the intrinsic itself, or some other callee
  • IL implementation viable or not
  • intrinsic supported or not
  • opt level (full, minopts, ...)
  • simple intrinsic expansion preconditions met (IMM cases) or not

@tannergooding
Member Author

tannergooding commented Jan 5, 2018

I think the currently proposed logic is the following:

Regular Intrinsics (based on existing logic)

  • Handled by impIntrinsic
  • Always expand StubHelpers_GetStubContext
  • Always expand StubHelpers_GetStubContextAddr
  • Expansion is otherwise evaluated as (!opts.compDbgCode && !opts.MinOpts())

SIMD Intrinsics (based on existing logic)

  • Handled specially (not by impIntrinsic)
  • Currently always (and unconditionally) expanded
  • Not sure what the behavior is here for indirect calls

Hardware Intrinsics (based on decisions made in this thread, with notes about questions in this thread)

  • Handled by impIntrinsic
  • Always expanded, unless a precondition is not met
    • When preconditions are not met, we still expand under gtIsRecursiveCall() (which should only occur for indirect calls)
    • When preconditions are not met, and we are not recursive, we can either throw or expand (this has still been going through debate and I didn't see an official decision yet)

JIT Intrinsics (based on existing logic, with notes about questions in this thread)

  • Handled by impIntrinsic
  • Expanded when gtIsRecursiveCall() or when (!opts.compDbgCode && !opts.MinOpts())

Seems like there are enough combinatorics here that some sort of master guide would be useful

  • prejit vs jit
    • if prejit goes through the importer, they should be identical; otherwise, prejit won't expand
  • root (or parent) method is the intrinsic itself
    • parent method is handled by gtIsRecursiveCall()
    • unsure if root method is handled by gtIsRecursiveCall()
  • IL implementation viable or not
    • not sure what you mean here; when would we have a non-viable IL implementation?
  • intrinsic supported or not
    • should go through the normal expansion rules listed above
  • opt level
    • follows the rules above
  • simple intrinsic preconditions met
    • follows the rules above; the open question is whether we should throw or insert a GT_CALL for direct calls which don't meet the preconditions
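The expansion rules summarized above can be condensed into one decision helper. This is a minimal C++ sketch, assuming hypothetical names (IntrinsicClass, ShouldExpand) rather than the real JIT's impIntrinsic plumbing:

```cpp
#include <cassert>

// Hypothetical sketch of the expansion policy summarized above; the enum and
// function names are illustrative, not the real JIT API. "optimized" stands
// for (!opts.compDbgCode && !opts.MinOpts()); the stub-helper special cases
// for regular intrinsics are omitted.
enum class IntrinsicClass { Regular, Simd, Hardware, Jit };

static bool ShouldExpand(IntrinsicClass cls, bool optimized, bool recursive,
                         bool preconditionsMet)
{
    switch (cls)
    {
        case IntrinsicClass::Simd:
            // currently always (and unconditionally) expanded
            return true;
        case IntrinsicClass::Hardware:
            // always expanded, unless a precondition (e.g. a constant imm
            // operand) is not met; if not met, still expand when recursive
            return preconditionsMet || recursive;
        case IntrinsicClass::Regular:
            return optimized;
        case IntrinsicClass::Jit:
            return recursive || optimized;
    }
    return false; // unreachable: all enum values handled above
}
```

When ShouldExpand returns false, the call is left as a regular GT_CALL, which for hardware intrinsics lands in the recursive IL body the next time it is jitted.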

@jkotas
Member

jkotas commented Jan 5, 2018

SIMD Intrinsics (based on existing logic)
• Not sure what the behavior is here for indirect calls

You may get bogus results on AVX. It is broken in a similar way to how it is broken under the debugger.

Hardware Intrinsics
When preconditions are not met, and we are not recursive

It should do nothing special. The intrinsic should be compiled as a regular call.

JIT Intrinsics
an unsupported intrinsic should be throwing PlatformNotSupportedException

An intrinsic not recognized by the JIT should do nothing special. It should be compiled as a regular call.

@tannergooding
Member Author

An intrinsic not recognized by the JIT should do nothing special. It should be compiled as a regular call.

Fixed this. As a note, this will result in a StackOverflowException in the event the user somehow loads a library with an unsupported intrinsic whose IL implementation isn't set to throw PlatformNotSupportedException().

You may get bogus results on AVX. It is broken in similar way to how it is broken under debugger.

We could probably fix the bogus results on AVX with the hardware intrinsic support (it would require investigation, etc., in the future).

It should do nothing special. The intrinsic should be compiled as a regular call.

There was some back and forth on this above. @CarolEidt, could you confirm that this is what we should do? (This is for when directly calling a method, such as Sse.Shuffle, but not meeting the preconditions, such as the third parameter being constant.)

@CarolEidt

Hardware Intrinsics
When preconditions are not met, and we are not recursive

It should do nothing special. The intrinsic should be compiled as a regular call.

@CarolEidt, could you confirm that this is what we should do? (This is for when directly calling a method, such as Sse.Shuffle, but not meeting the preconditions, such as the third parameter being constant.)

I believe that the current consensus for this is that:

  • If the intrinsic is supported only for the constant case, and the argument passed is not a constant, then it should be treated as a regular call. This will then cause the IL implementation to be JIT'd (if it has not already been). Then we will land in the "we are recursive" case and the JIT will generate the switch statement.
  • If the intrinsic is not supported for some other reason, e.g. for the given generic base type, again it should be treated as a regular call. However, in this case, the IL implementation, in addition to being recursive, should check for a valid base type and throw if it is not.
    • If the JIT is unable to optimize away the intrinsic in the case where the base type doesn't match, then IMO it should assert in Checked/Debug mode, and either generate the throw, or (simpler to implement) throw a BAD_CODE exception to the VM.
  • I'm not sure if there are other preconditions to check for, but if so it would be desirable if they could fit into the same "treat it as a regular call unless it is recursive".
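The "JIT will generate the switch statement" fallback can be pictured with a toy C++ sketch: the template parameter plays the role of the immediate that must be a compile-time constant, and the switch gives every call site a literal immediate. The names here (ExtractConst, ExtractFallback) are illustrative only, not the actual managed implementation.

```cpp
#include <array>
#include <stdexcept>

// Toy stand-in for a constant-only intrinsic: the template parameter Imm
// models the immediate operand, which must be a compile-time constant.
template <int Imm>
static int ExtractConst(const std::array<int, 4>& v)
{
    return v[Imm];
}

// Software fallback for the non-constant case: switch over the possible
// immediates so that every call inside the switch passes a constant,
// mirroring the switch the JIT would generate for the recursive IL body.
static int ExtractFallback(const std::array<int, 4>& v, int index)
{
    switch (index)
    {
        case 0: return ExtractConst<0>(v);
        case 1: return ExtractConst<1>(v);
        case 2: return ExtractConst<2>(v);
        case 3: return ExtractConst<3>(v);
        default: throw std::out_of_range("index");
    }
}
```

For a real imm8 operand the switch would have up to 256 arms, which is why the call-plus-fallback path is so much slower than the single expanded instruction.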

@tannergooding
Member Author

Okay, so the PR is still in good shape based on the discussion so far (AFAICT).

@AndyAyersMS, did you have any other feedback or is it good to merge?

@fiigii

fiigii commented Jan 6, 2018

If the intrinsic is supported only for the constant case, and the argument passed is not a constant, then it should be treated as a regular call. This will then cause the IL implementation to be JIT'd (if it has not already been). Then we will land in the "we are recursive" case and the JIT will generate the switch statement.

To be clear, regarding "the argument passed is not a constant": the non-constant arguments cannot come from users' direct calls. At least for Intel hardware intrinsics, user-passed non-constant arguments in direct calls have to generate the throw.

@tannergooding
Member Author

At least for Intel hardware intrinsics, user-passed non-constant arguments in direct calls have to generate the throw.

@fiigii, I believe the sentiment is that we should not force expansion in this case and let a regular GT_CALL be emitted. The code would still compile, without throwing, and would still execute correctly (the same as if it had been called indirectly).

@CarolEidt

@fiigii, I believe the sentiment is that we should not force expansion in this case and let a regular GT_CALL be emitted. The code would still compile, without throwing, and would still execute correctly (the same as if it had been called indirectly).

Yes - that's the consensus we've reached. First and foremost, the expectation is that developers using these intrinsics "know what they are doing". However, it has also been suggested that analyzers should be provided to identify cases where non-constant values are being passed where an immediate is expected.

@AndyAyersMS
Member

I'm ok with it as is.

@tannergooding
Member Author

I'm ok with it as is.

Ok. Thanks!

I'm going to merge this shortly, provided I don't see any other feedback requesting otherwise.

I believe all remaining discussions will be impactful to future code and not to the changes currently being made by this PR.

@CarolEidt

I'm good with this as well (I did one more quick review, as it's been some time, and many comments, since I looked at it).

@tannergooding tannergooding merged commit 6f79b79 into dotnet:master Jan 6, 2018
@tannergooding tannergooding deleted the expand-hwintrin branch January 17, 2018 01:52