Implement 64-bit-only hardware intrinsic #21264
Conversation
@CarolEidt @tannergooding the main JIT change is in the commit "Implement 64-bit-only intrinsic".
@jkotas @AndyAyersMS Could you please take a look at the JIT/EE interface change and its SuperPMI code (the first two commits)?
src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Sse2.cs
public static long ConvertToInt64(Vector128<double> value) => ConvertToInt64(value);

/// <summary>
/// __int64 _mm_cvtsi128_si64 (__m128i a)
/// MOVQ reg/m64, xmm
/// This intrinsic is only available in 64-bit processes
/// </summary>
[Intrinsic]
public static long ConvertToInt64(Vector128<long> value) => ConvertToInt64(value);
Ah, this is a comment mistake. This intrinsic generates cvtss2si.
Why would this intrinsic generate cvtss2si (which takes a scalar float and returns an integer)? The intrinsic is _mm_cvtsi128_si64, which takes a scalar integer and returns an integer, and generates the movq instruction as per: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvtsi128_si64&expand=1851
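The extraction semantics being described can be checked directly against the C++ intrinsic. A minimal sketch, where `LowLane` is a hypothetical wrapper name:

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Hypothetical wrapper around the C++ intrinsic the doc comment cites:
// _mm_cvtsi128_si64 reads the low 64-bit lane of the vector and, on
// x64 targets, compiles down to a single `movq r64, xmm`.
int64_t LowLane(__m128i value)
{
    return _mm_cvtsi128_si64(value);
}
```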
Oh, my mistake, you are talking about ConvertToInt64(Vector128<long> value), not ConvertToInt64(Vector128<double> value).

If we match C++, this should be a 64-bit-only intrinsic, as C++ generates movq r64, xmm.
I think this is a case where it would be fine, on x86 machines, to just force the store to memory. On x64 machines, we can do whichever is more appropriate (store to memory or to register).
We already need to provide the ToScalar functionality on x86 anyway (via the helper intrinsics), so it would just make that simpler overall.
Yes, exactly. ToScalar and GetElement are implemented via Unsafe on unsupported platforms, so we should optimize Unsafe CQ rather than complicate the whole JIT again for ConvertToInt64.
As I say, I don't have a strong opinion, but I'm confused about the complexity. As @tannergooding points out, we already have the ToScalar helper function. What am I missing that would cause this to be more complex?

And in any case, if it is equivalent to ToScalar then do we need this intrinsic at all?
What am I missing that would cause this to be more complex?

As far as I can tell, ToScalar has a "software" fallback. ConvertToInt64 is supposed to be an intrinsic, so it should not have a fallback.

And in any case, if it is equivalent to ToScalar then do we need this intrinsic at all?

Looks to me that this whole intrinsic affair morphed from exposing the relevant instructions as they are, without "added value", into a mish-mash of actual intrinsics and various helpers with software fallbacks. Whatever. But if this results in unnecessary complexity in the JIT then I'd say that it's a bad affair.
we already have the ToScalar helper function. What am I missing that would cause this to be more complex?

ToScalar is a platform-agnostic helper that does not guarantee the optimal codegen now. @tannergooding is suggesting generating movq m64, xmm for ConvertToInt64 on 32-bit platforms.

Of course, we can implement ToScalar via ConvertToInt64, but not the inverse.
Right. ToScalar is a helper function that, outside of float and double (which can have special semantics due to staying in the same register), will be implemented in software, in terms of calling the existing intrinsics.
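The store-to-memory path suggested above can be sketched in C++ terms. This is a hypothetical illustration (`ToScalarFallback` is not a real API; the managed fallback actually goes through Unsafe):

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Hypothetical sketch of a "software" ToScalar: spill the vector to
// memory and read back the low 8 bytes, which is what a
// `movq m64, xmm` store-to-memory path amounts to on 32-bit targets.
int64_t ToScalarFallback(__m128i value)
{
    alignas(16) int64_t lanes[2];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), value);
    return lanes[0];
}
```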
public static long ConvertToInt64(Vector128<long> value) => ConvertToInt64(value);

/// <summary>
/// __int64 _mm_cvtsi128_si64 (__m128i a)
/// MOVQ reg/m64, xmm
/// This intrinsic is only available in 64-bit processes
/// </summary>
[Intrinsic]
public static ulong ConvertToUInt64(Vector128<ulong> value) => ConvertToUInt64(value);
Same with this and other instructions that use movq
Can you link this PR to the "spec" that inspired this change?
The JIT changes LGTM, though I'd like to see us resolve the issues about the intrinsics that are moving between register files.
Addressed feedback. @jkotas @AndyAyersMS @sandreenko @tannergooding PTAL
@dotnet-bot test this please
key.ftn = (DWORDLONG)ftn;
key.className = (moduleName != nullptr);
key.namespaceName = (namespaceName != nullptr);
key.enclosingClassName = (enclosingClassName != nullptr);
I'm not quite sure why we (probably me) implemented things this way.
I have always thought that, for interface methods with optional out parameters, SPMI recording should assume the most general case and pass non-null internally for all those out params, then record all the values.
And on replay give back only what the caller asked for.
If we do this, then the key would only involve ftn, not whether or not the optional out params were also asked for.
In practice it may not matter as we probably always ask for all the results.
@sandreenko thoughts?
There are many cases where we do it like this (the key contains a bool for each optional param that shows whether it was asked for). I do not see an issue here. It can increase the map size a bit, but it can also protect us from collisions that happen because not all JIT-EE interface functions are clear.
so LGTM.
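The flag-per-out-param key being discussed might look roughly like this. A hypothetical sketch with invented field names; the real SuperPMI key layout lives in the methodcontext code:

```cpp
#include <cstdint>

// Hypothetical sketch: a SuperPMI-style lookup key that records, for each
// optional out parameter, whether the caller asked for it, so queries with
// different out-parameter sets never collide in the recorded map.
struct GetMethodNameKey
{
    uint64_t ftn;                 // method handle
    bool wantsClassName;          // moduleName != nullptr
    bool wantsNamespaceName;      // namespaceName != nullptr
    bool wantsEnclosingClassName; // enclosingClassName != nullptr

    bool operator==(const GetMethodNameKey& other) const
    {
        return ftn == other.ftn && wantsClassName == other.wantsClassName &&
               wantsNamespaceName == other.wantsNamespaceName &&
               wantsEnclosingClassName == other.wantsEnclosingClassName;
    }
};
```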
@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test
IfFailThrow(GetMDImport()->GetNameOfTypeDef(bmtInternal->pType->GetEnclosingTypeToken(), NULL, &nameSpace));
}

if (hr == S_OK && (strcmp(nameSpace, "System.Runtime.Intrinsics.X86") == 0))
Changed to get the enclosing class's namespace and unify the condition.
@AndyAyersMS @tannergooding @jkotas
@dotnet-bot test Windows_NT x64 Checked jitsse2only
@dotnet-bot test Windows_NT x86 Checked jitsse2only
@dotnet-bot test Ubuntu x64 Checked jitsse2only
/// <summary>
/// __int64 _mm_popcnt_u64 (unsigned __int64 a)
/// POPCNT reg64, reg/m64
/// This intrinsic is only available in 64-bit processes
/// </summary>
public static ulong PopCount(ulong value) => PopCount(value);
public static ulong PopCount(ulong value) { throw new PlatformNotSupportedException(); }
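For reference, the semantics of the 64-bit POPCNT the doc comment cites can be mimicked portably. This sketch uses the GCC/Clang builtin rather than `_mm_popcnt_u64`, so it runs without POPCNT compile flags; `PopCount64` is a hypothetical name:

```cpp
#include <cstdint>

// Portable stand-in for `POPCNT reg64, reg/m64`: count the set bits of a
// 64-bit value. Uses the GCC/Clang builtin instead of _mm_popcnt_u64.
uint64_t PopCount64(uint64_t value)
{
    return static_cast<uint64_t>(__builtin_popcountll(value));
}
```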
Fixed a mistake from previous PRs.
CI gets all green. Can the PR get approved?
@@ -1398,6 +1406,16 @@ void CodeGen::genSSEIntrinsic(GenTreeHWIntrinsic* node)
break;
}

case NI_SSE_X64_ConvertScalarToVector128Single:
Can't this just be grouped with NI_SSE2_X64_ConvertScalarToVector128Double? They look to have the same implementation.
I remember that we discussed merging genSSEIntrinsic and genSSE2Intrinsic (we already did for AVX and AVX2). Let me do it in the next PR.
Was the intrinsic previously misnamed? It looks like it used to be NI_SSE2_ConvertScalrToVector128Single: https://github.com/dotnet/coreclr/pull/21264/files#diff-847442c7efdf7a78e78711bea30fd4b2L1570
We have both. I modified the table so that SSE2 ConvertScalarToVector128Single could be table-driven: https://github.com/dotnet/coreclr/pull/21264/files#diff-607364c1c16999989209b5ee9773829aR213

// Sse
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, int value);
// Sse.X64
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, long value);
// Sse2
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, Vector128<double> value);
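On a 64-bit target, the Sse.X64 overload corresponds to the C++ intrinsic `_mm_cvtsi64_ss`. A minimal sketch, where `InsertInt64AsFloat` is a hypothetical wrapper name:

```cpp
#include <xmmintrin.h> // SSE

// Hypothetical wrapper: _mm_cvtsi64_ss (CVTSI2SS xmm, r64; 64-bit only)
// converts the integer to float in the lowest lane and keeps the upper
// three lanes of `upper` unchanged.
__m128 InsertInt64AsFloat(__m128 upper, long long value)
{
    return _mm_cvtsi64_ss(upper, value);
}
```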
src/jit/hwintrinsicxarch.cpp
if (className[0] == 'A')
if (strcmp(className, "X64") == 0)
{
    assert(enclosingClassName != nullptr);
Nit: I find the partially recursive calls like this hard to read. I would change the signature of lookupIsa to just lookupIsa(const char* className), and the callsite to:
if (strcmp(className, "X64") == 0)
{
isa = X64VersionOfIsa(lookupIsa(enclosingClassName));
}
else
{
isa = lookupIsa(className);
}
Thanks, done by splitting the code into two functions.
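The split might look roughly like this. A hypothetical sketch trimmed to a single ISA; the real lookup covers every nested class and intrinsic table entry:

```cpp
#include <cstring>

// Hypothetical sketch of the two-function split suggested above:
// lookupInstructionSet resolves a plain class name, and lookupIsa maps a
// nested "X64" class through its enclosing class before applying the
// 64-bit variant.
enum InstructionSet
{
    InstructionSet_SSE,
    InstructionSet_SSE_X64,
    InstructionSet_ILLEGAL
};

static InstructionSet lookupInstructionSet(const char* className)
{
    return (strcmp(className, "Sse") == 0) ? InstructionSet_SSE : InstructionSet_ILLEGAL;
}

static InstructionSet X64VersionOfIsa(InstructionSet isa)
{
    return (isa == InstructionSet_SSE) ? InstructionSet_SSE_X64 : InstructionSet_ILLEGAL;
}

InstructionSet lookupIsa(const char* className, const char* enclosingClassName)
{
    if (strcmp(className, "X64") == 0)
    {
        return X64VersionOfIsa(lookupInstructionSet(enclosingClassName));
    }
    return lookupInstructionSet(className);
}
```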
@dotnet-bot test this please
CoreFX API surface change is at dotnet/corefx#33805
@@ -141,7 +141,7 @@ namespace JIT.HardwareIntrinsics.X86
_dataTable = new SimdScalarUnaryOpTest__DataTable<{Op1BaseType}>(_data, LargestVectorSize);
}

public bool IsSupported => {Isa}.IsSupported && (Environment.Is64BitProcess || ((typeof({RetBaseType}) != typeof(long)) && (typeof({RetBaseType}) != typeof(ulong))));
Did we update all the templates that had this check?
Ah, a few templates still have this check. Can I update it in the next PR?
Can I update it in the next PR?
And I will move more tests to the template framework.
Yes, that would be great. Thanks!
HWIntrinsic changes (in the JIT and the tests) LGTM
@CarolEidt, was all of your feedback resolved? I'm not quite sure what you meant by:
So I thought I'd double-check before merging.
As I see it, we are trying to achieve a balance between sometimes-competing objectives such as:
I'm sure that there are a number of places where this balance is off, but I think we need to be mindful about this balance, so as not to optimize one aspect to the detriment of others.
What I meant by this was that I'd like to see some consensus specifically about how we approach an intrinsic like this. That said, I'm fine with merging this as-is.
LGTM
Thank you all for reviewing 😄
Is there any way to reference the relevant
@glenn-slayden No, .NET Framework doesn't support HW intrinsics.
Implement 64-bit-only hardware intrinsic Commit migrated from dotnet/coreclr@a089f64
This PR implements 64-bit-only hardware intrinsics that are exposed in X64 nested classes. In the next PR, I will implement the remaining BMI1/2 intrinsics and move more tests to the template framework.
Implements #20146 in the JIT and closes https://github.com/dotnet/coreclr/issues/21042.