This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Adding scalar hardware intrinsics for x86. #15341

Merged · 1 commit · Dec 29, 2017

Conversation

@tannergooding (Member Author)

Still need to update the *.PlatformNotSupported.cs files accordingly.

@tannergooding (Member Author)

I haven't done COMIS or UCOMIS yet, as I am not sure of the naming. I was thinking bool CheckEqualScalar(Vector128<float> left, Vector128<float> right).
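A minimal sketch of that proposed shape, using the name floated above (not a final API; COMISS sets EFLAGS, so the managed surface would return a bool):

/// <summary>
/// int _mm_comieq_ss (__m128 a, __m128 b)
/// </summary>
public static bool CheckEqualScalar(Vector128<float> left, Vector128<float> right) => CheckEqualScalar(left, right);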

@tannergooding (Member Author)

I also noticed a number of places in the existing files where we are not being consistent (either in naming or in following the API naming guidelines).

Ex: We have ReciprocalSquareRoot, Sqrt, and ReciprocalSqrt, depending on where you look.

We are also using Int, Float, Long, etc., when we should be using Int32, Single, Int64, etc.

/// <summary>
/// __m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)
/// </summary>
public static Vector128<float> CompareScalar(Vector128<float> left, Vector128<float> right, FloatComparisonMode mode) => CompareScalar(left, right, mode);
@tannergooding (Member Author)

Was wondering why we have FloatComparisonMode here, but not in the Sse/Sse2 files?

/// <summary>
/// __m128 _mm_cmpunord_ps (__m128 a, __m128 b)
/// </summary>
public static Vector128<float> CompareUnordered(Vector128<float> left, Vector128<float> right) => CompareUnordered(left, right);

@tannergooding (Member Author)

These files all contain a number of lines that are just whitespace... We should probably clean that up separately.

/// <summary>
/// __m128 _mm_sqrt_ss (__m128 a)
/// </summary>
public static Vector128<float> SqrtScalar(Vector128<float> value) => SqrtScalar(value);
@tannergooding (Member Author)

The Sse2 form for double takes two arguments, but we only take one here (this is matching the C/C++ intrinsics). Perhaps we should expose both or just the one that takes two arguments?

/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a, __m128d b)
/// </summary>
public static Vector128<double> SqrtScalar(Vector128<double> a, Vector128<double> b) => SqrtScalar(a, b);
@tannergooding (Member Author)

I was thinking of calling a and b upper and value, respectively, since b is the value we perform the operation on and a is the value we fill the upper bits from. Thoughts?
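Under that proposal, the signature would read as follows (a sketch of the suggested renaming, not the merged shape):

/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a, __m128d b)
/// </summary>
public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value) => SqrtScalar(upper, value);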

@fiigii (Dec 2, 2017)

Do we really need to specifically fill the upper bits of the result in practice?
From a performance perspective, we always recommend using the same register for the source and the upper argument.
In particular, if we decide to support the two-parameter version of the Sqrt intrinsic, then on non-AVX machines the compiler may have to insert unpack or shuffle instructions to implement this semantic, and those are both long-latency instructions.
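For illustration, a usage sketch of the same-register recommendation, assuming the two-parameter form were exposed (note that SetScalar is only added later in this PR):

Vector128<double> v = Sse2.SetScalar(42.0);
Vector128<double> r = Sse2.SqrtScalar(v, v); // same register as source and upper: a plain sqrtsd, no unpack/shuffle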


In summary, I am suggesting that we expose only the one-parameter intrinsic for SQRTSS and SQRTSD.

@tannergooding (Member Author)

Just exposing the single-operand intrinsic version is probably fine. I actually missed that the two-operand form is only on AVX and above; the Intel Intrinsics Guide lists it as SSE2: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=Sqrt&techs=SSE,SSE2

@tannergooding (Member Author)

The Software Developer's Manual lists the information correctly.

/// void _mm_store_ss (float* mem_addr, __m128 a)
/// </summary>
public static unsafe void StoreScalar(float* address, Vector128<float> source) => StoreScalar(address, source);

/// <summary>
/// __m128d _mm_sub_ps (__m128d a, __m128d b)
@tannergooding (Member Author)

Note: the files have several typos, such as this one, where the wrong type is used in the intrinsic comment.

Reviewer:

Good catch. These comment typos do not impact the CoreCLR/CoreFX interface, so we can fix them later.

@@ -352,6 +352,11 @@ public static class Avx2
/// </summary>
public static Vector256<ulong> ConvertToVector256ULong(Vector128<uint> value) => ConvertToVector256ULong(value);
@tannergooding (Member Author)

Note: this is an example of an API that doesn't follow the general .NET API naming conventions. It should probably be ConvertToVector256Int64.

@fiigii commented Dec 2, 2017

I also noticed a number of places in the existing files where we are not being consistent
We have ReciprocalSquareRoot, Sqrt, and ReciprocalSqrt, depending on where you look.
We are also using Int, Float, Long, etc., when we should be using Int32, Single, Int64, etc.

Thank you for pointing this out. If the .NET API convention always prefers Int64 over Long, we definitely should fix it.

@tannergooding (Member Author) commented Dec 2, 2017

Just as an FYI, the exact guideline I am referring to is Avoiding Language Specific Names.

/// __m128 _mm_round_ss (__m128 a, int rounding)
/// _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC
/// </summary>
public static Vector128<float> RoundToNearestIntegerScalar(Vector128<float> value) => RoundToNearestIntegerScalar(value);
Reviewer:

It would be more consistent to use a RoundingMode immediate parameter here (similar to the comparisons, for example). Then it would be 4 functions (Round/RoundScalar * float/double) that map directly to four machine instructions (roundpd/ps/sd/ss) instead of 20. The fully named helper functions could be defined on top of these basic instructions somewhere else.

@tannergooding (Member Author)

@fiigii, thoughts? I was following the existing convention you had set up for the packed forms.

Reviewer:

AFAIR, in discussions on the Intrinsics API the consensus was to create a direct mapping to processor instructions, for a multitude of reasons. Therefore, I would avoid creating any APIs which do not map to, or which omit, processor instructions. That means that if we have a 3-argument scalar instruction on AVX and above alongside a 2-argument SSE equivalent, we should have both; or, here, we should use an immediate parameter for defining the rounding mode.

@fiigii (Dec 3, 2017)

I believe that the current design (which encodes the rounding mode into the intrinsic names) has better static semantics.
For example, a RoundingMode immediate parameter requires 1) a const-parameter language feature to rule out non-literal values, and 2) compile-time error reporting and runtime exceptions for invalid values from Roslyn/CoreCLR: https://github.com/dotnet/corefx/issues/22940#issuecomment-320122766.
Each rounding instruction has just a few pre-defined modes, so I thought this was a good opportunity to provide intrinsics with more stable runtime behavior and a friendlier development experience. Meanwhile, it does not lose any flexibility.

@pentp (Dec 3, 2017)

We could still implement rounding functions with an immediate parameter as private intrinsics and expose them through wrapper functions that just forward to the actual intrinsic if this simplifies the implementation.

// Rounding-control constants matching the C _MM_FROUND_* values (added so the sketch compiles)
private const byte _MM_FROUND_TO_NEAREST_INT = 0x00;
private const byte _MM_FROUND_TO_NEG_INF = 0x01;
private const byte _MM_FROUND_NO_EXC = 0x08;

// __m128 _mm_round_ss (__m128 a, int rounding)
private static Vector128<float> RoundScalar(Vector128<float> value, byte rounding) => RoundScalar(value, rounding);

// _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC
public static Vector128<float> RoundToNearestIntegerScalar(Vector128<float> value) => RoundScalar(value, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

// _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC
public static Vector128<float> RoundToNegativeInfinityScalar(Vector128<float> value) => RoundScalar(value, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);

Reviewer:

if this simplifies the implementation.

No, the current runtime always expands all the APIs as intrinsics.

/// <summary>
/// __m128d _mm_cvtss_sd (__m128d a, __m128 b)
/// </summary>
public static Vector128<double> ConvertToDoubleScalar(Vector128<double> a, Vector128<float> b) => ConvertToDoubleScalar(a, b);
Reviewer:

MOVD/MOVQ instructions are missing from here (and AVX/AVX2). I propose something like this:

// __m128i _mm_cvtsi32_si128 (int a)
public static Vector128<int> CopyInt32(int value);
// int _mm_cvtsi128_si32 ( __m128i a)
public static int CopyInt32(Vector128<int> value);
// __m128i _mm_cvtsi64_si128(__int64)
public static Vector128<long> CopyInt64(long value);
// __int64 _mm_cvtsi128_si64(__m128i)
public static long CopyInt64(Vector128<long> value);

@tannergooding (Member Author)

Fixed. I went with the existing ConvertTo naming convention.
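For illustration, the ConvertTo shape for the MOVD pair might look like this (the exact names here are my sketch of the convention, not necessarily the merged surface):

/// <summary>
/// __m128i _mm_cvtsi32_si128 (int a)
/// </summary>
public static Vector128<int> ConvertToVector128Int32(int value) => ConvertToVector128Int32(value);

/// <summary>
/// int _mm_cvtsi128_si32 (__m128i a)
/// </summary>
public static int ConvertToInt32(Vector128<int> value) => ConvertToInt32(value);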

@4creators
I have not seen any conversion intrinsics which would allow converting from and to Half. The fact that we do not have Half support in the CLR should not stop us from having intrinsics which would act as an interface between the CLR and other runtimes which support Half-sized floating-point types. Additionally, although it is not directly related to this PR, it could be very helpful to have support for 8-bit floating/binary sized numbers as well.

Related issues:

https://github.com/dotnet/corefx/issues/17267

https://github.com/dotnet/coreclr/issues/11948

@tannergooding (Member Author)

@4creators, I believe the FP16C instructions haven't gone for design review yet.

The initial set that has gone for review covers both packed and scalar (as of this PR) instructions for SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, AES, BMI1, BMI2, LZCNT, PCLMULQDQ, and POPCNT.

The intrinsic sets that have not gone to review (based on those listed under the Intel Intrinsics Guide) are MMX, AVX-512, KNC, SVML, ADX, CLFLUSHOPT, CLWB, FP16C, FSGSBASE, FXSR, INVPCID, MONITOR, MPX, PREFETCHWT1, RDPID, RDRAND, RDSEED, RDTSCP, RTM, SHA, TSC, XSAVE, XSAVEC, XSAVEOPT, XSS

A number of those intrinsic sets won't/shouldn't go to review because:

  • Not all of those (such as the SVML intrinsics) represent actual hardware instructions.
  • Some of them (such as MMX) are legacy instructions.
  • Others (like XSAVE) are targeted for OS, not necessarily for applications.

The others likely just need to be proposed and go up for review. For FP16C in particular, we at the very least need a Half data type so that it can be properly represented as an API (Vector128<Half>), so the review will be slightly more involved.

@4creators
I believe the FP16C instructions haven't gone for design review yet.

@tannergooding Yep, you are right. Perhaps it's time to make both Half and missing intrinsics proposal :)

@4creators
@dotnet-bot test Windows_NT x86 Checked Innerloop Build and Test
@dotnet-bot test Windows_NT x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test CentOS7.1 x64 Debug Innerloop Build
@dotnet-bot test CentOS7.1 x64 Checked Innerloop Build and Test
@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test

@tannergooding (Member Author)

Updated the *.PlatformNotSupported.cs files and added the remaining APIs that I was aware were missing.

/// <summary>
/// __int64 _mm_cvtsd_si64 (__m128d a)
/// </summary>
public static long ConvertToInt64(Vector128<double> value) => ConvertToInt64(value);
@tannergooding (Member Author)

@fiigii, For instructions like this, which have an additional encoding on x64, how do we want to expose them?

@fiigii

an additional encoding on x64

Do you mean the 64-bit register in cvtsd2si r64, xmm?
These intrinsics are only available in 64-bit mode, and calling them in 32-bit mode should throw PlatformNotSupportedException.

if (Sse2.IsSupported && Environment.Is64BitProcess)
{
    // cvtsd2si r64, xmm is only encodable in 64-bit mode
    long res = Sse2.ConvertToInt64(vec);
}

@tannergooding (Member Author)

Alright, that sounds good to me.

I mostly just wanted to confirm that we are exposing them in X86.Sse2 and not under some X64-specific sub-class.

@tannergooding changed the title from "[WIP] Adding scalar hardware intrinsics for x86." to "Adding scalar hardware intrinsics for x86." on Dec 11, 2017
@tannergooding (Member Author)

Should be ready for review.

Still todo (but will be in separate PRs):

  • Update CoreFX with the new APIs
  • Plumb through the JIT support for these APIs
    • @fiigii, are there any pending PRs from you that I should wait on before starting on this?

@eerhardt (Member) left a comment:

I'm no expert here, but this looks good to me.

@tannergooding (Member Author)

@fiigii, is there a list of instructions/intrinsics that don't have corresponding managed APIs exposed yet?

I have found, at the very least, _mm_movemask_ps (which is useful for equality comparisons, for example) to be missing. But just looking through the current list as compared to the numbers listed in the Intel Intrinsics Guide (filtering on a per-class basis), it seems like there may be more.

@fiigii commented Dec 13, 2017

I have found, at the very least, _mm_movemask_ps (which is useful for equality comparisons, for example) to be missing

I am pretty sure that this one (_mm_movemask_ps) is my mistake and we should have it. As you can see, we have all other MoveMask in Sse2, Avx, and Avx2. Thank you so much for pointing it out.

is there a list of instructions/intrinsics that don't have corresponding managed APIs exposed yet?

But just looking through the current list as compared to the numbers listed in the Intel Intrinsics Guide (filtering on a per-class basis), it seems like there may be more.

Yes, I have one. I remember that I did not expose the legacy 64-bit SSE SIMD, scalar floating-point, and certain comparison intrinsics. This is a good chance to complete the intrinsic APIs and fix other mistakes. Let me update my list based on your scalar intrinsic work. I will try my best to put them together tomorrow.

@tannergooding Thanks again for your find and reminder.

@tannergooding (Member Author) commented Dec 14, 2017

I went through the exposed ISAs manually and found the following "missing".

MMX Interop Instructions (68)

These are instructions that take or return __m64. I don't think we will ever need these exposed and should just instruct users to port their code to SSE or higher.

Control Instructions (10)

These may be useful for some scenarios but also have a good chance of breaking other code/assumptions. That being said, you could already do these with your own P/Invoke.

I would vote not to expose these via intrinsics and instead expose a general API (after appropriate review) to control this (which the IEEE 754:2008 spec suggests is required; see 4.1 Attribute specification).

unsigned int _MM_GET_EXCEPTION_MASK ()
unsigned int _MM_GET_EXCEPTION_STATE ()
unsigned int _MM_GET_FLUSH_ZERO_MODE ()
unsigned int _MM_GET_ROUNDING_MODE ()
unsigned int _mm_getcsr (void)
void _MM_SET_EXCEPTION_MASK (unsigned int a)
void _MM_SET_EXCEPTION_STATE (unsigned int a)
void _MM_SET_FLUSH_ZERO_MODE (unsigned int a)
void _MM_SET_ROUNDING_MODE (unsigned int a)
void _mm_setcsr (unsigned int a)

Helper Functions (26)

These don't directly map to any specific instruction, but can be helpful in some scenarios.

Not listed in the Intel Intrinsics Guide, but exposed by MSVC, are other helper functions (such as _MM_SHUFFLE), which may also be worth exposing (given that C# doesn't support macros for building immediate values in some cases).
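For reference, a managed equivalent of the MSVC _MM_SHUFFLE macro is a one-liner (the name MmShuffle is mine); it packs four 2-bit element selectors into a single immediate byte:

private static byte MmShuffle(byte z, byte y, byte x, byte w)
    => (byte)((z << 6) | (y << 4) | (x << 2) | w);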

Possibly worth further discussion.

_MM_TRANSPOSE4_PS (__m128 row0, __m128 row1, __m128 row2, __m128 row3)

__m128 _mm_undefined_ps (void)
__m128d _mm_undefined_pd (void)
__m128i _mm_undefined_si128 (void)

__m128 _mm_load1_ps (float const* mem_addr)
__m128 _mm_loadr_ps (float const* mem_addr)
__m128 _mm_set_ss (float a)
__m128 _mm_setr_ps (float e3, float e2, float e1, float e0)
void _mm_store1_ps (float* mem_addr, __m128 a)
void _mm_storer_ps (float* mem_addr, __m128 a)

__m128d _mm_load1_pd (double const* mem_addr)
__m128d _mm_loadr_pd (double const* mem_addr)
__m128d _mm_set_sd (double a)
__m128i _mm_setr_epi8 (char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
__m128i _mm_setr_epi16 (short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
__m128i _mm_setr_epi32 (int e3, int e2, int e1, int e0)
__m128i _mm_setr_epi64 (__m64 e1, __m64 e0)
__m128d _mm_setr_pd (double e1, double e0)
void _mm_store1_pd (double* mem_addr, __m128d a)
void _mm_storer_pd (double* mem_addr, __m128d a)

__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)
__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)
__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)

void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)
void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)
void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)

Advanced Memory (6)

These are instructions for advanced memory operations. Probably useful. Could they negatively impact the GC?

Potentially worth further discussion

void _mm_clflush (void const* p)
void _mm_lfence (void)
void _mm_mfence (void)
void _mm_pause (void)
void _mm_prefetch (char const* p, int i)
void _mm_sfence (void)

Missing (1)

This instruction is actually missing and needs to be added. I will add it in this PR.

int _mm_movemask_ps (__m128 a)
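Given that MoveMask already exists in Sse2, Avx, and Avx2 (as noted earlier in the thread), the missing Sse API would presumably mirror that shape:

/// <summary>
/// int _mm_movemask_ps (__m128 a)
/// </summary>
public static int MoveMask(Vector128<float> value) => MoveMask(value);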

Unknown (16)

Not quite sure what area some of these fall under.

__m128i _mm_sll_epi16 (__m128i a, __m128i count)
__m128i _mm_sll_epi32 (__m128i a, __m128i count)
__m128i _mm_sll_epi64 (__m128i a, __m128i count)
__m128i _mm_slli_si128 (__m128i a, int imm8)
__m128i _mm_sra_epi16 (__m128i a, __m128i count)
__m128i _mm_sra_epi32 (__m128i a, __m128i count)
__m128i _mm_srl_epi16 (__m128i a, __m128i count)
__m128i _mm_srl_epi32 (__m128i a, __m128i count)
__m128i _mm_srl_epi64 (__m128i a, __m128i count)
__m128i _mm_srli_si128 (__m128i a, int imm8)

__m128i _mm_move_epi64 (__m128i a)

__m128d _mm_loadh_pd (__m128d a, double const* mem_addr)
__m128d _mm_loadl_pd (__m128d a, double const* mem_addr)
__m128i _mm_loadl_epi64 (__m128i const* mem_addr)

void _mm_stream_si32 (int* mem_addr, int a)
void _mm_stream_si64 (__int64* mem_addr, __int64 a)

/// __m128d _mm_cmpgt_sd (__m128d a, __m128d b)
/// </summary>
public static Vector128<double> CompareGreaterThanScalar(Vector128<double> left, Vector128<double> right) => CompareGreaterThanScalar(left, right);

Reviewer:

And int _mm_ucomineq_sd (__m128d a, __m128d b).

@tannergooding (Member Author)

They are at https://github.com/dotnet/coreclr/pull/15341/files/b3a5840ff609dbd8212c631da659b0154917dfc4#diff-0c40c8a7f5df6b8bc03aef2dea8f0884R316

I ended up going with the bool Compare*OrderedScalar and bool Compare*UnorderedScalar naming convention for these operations.
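So, for example, the COMISD/UCOMISD equality pair ends up as follows (sketched from the naming convention above):

/// <summary>
/// int _mm_comieq_sd (__m128d a, __m128d b)
/// </summary>
public static bool CompareEqualOrderedScalar(Vector128<double> left, Vector128<double> right) => CompareEqualOrderedScalar(left, right);

/// <summary>
/// int _mm_ucomieq_sd (__m128d a, __m128d b)
/// </summary>
public static bool CompareEqualUnorderedScalar(Vector128<double> left, Vector128<double> right) => CompareEqualUnorderedScalar(left, right);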

Reviewer:

I see, thanks. But the comment seems incorrect: _mm_comine_sd -> _mm_comineq_sd.

@@ -946,6 +1175,11 @@ public static class Sse2
/// </summary>
public static Vector128<double> Subtract(Vector128<double> left, Vector128<double> right) => Subtract(left, right);

/// <summary>
/// __m128d _mm_sub_ss (__m128d a, __m128d b)
Reviewer:

__m128d _mm_sub_sd (__m128d a, __m128d b)

@tannergooding (Member Author)

Will fix

@fiigii commented Dec 14, 2017

I just checked my list; you have covered almost all of the APIs. But I did not find Sse.MinScalar/MaxScalar.

__m128 _mm_max_ss (__m128 a, __m128 b);
__m128 _mm_min_ss (__m128 a, __m128 b);

/// <summary>
/// __m128 _mm_max_ss (__m128 a, __m128 b)
/// </summary>
public static Vector128<float> MaxScalar(Vector128<float> left, Vector128<float> right) => MaxScalar(left, right);
@tannergooding (Member Author)

@fiigii, MaxScalar is here, MinScalar is a few lines up

@fiigii

Thank you, GitHub was hiding them.

@tannergooding (Member Author)

@fiigii, could you comment on the instructions I have labeled as Helper and Unknown here: #15341 (comment)

For the helper functions:

  • I would think the _mm_cast*_* instructions should be considered "missing".
  • The load1, store1, and set don't have instructions but look to be roughly equivalent to Set1, StoreScalar, and LoadScalar, respectively
  • The loadr and setr intrinsics look to be helpers for shuffle and load or shuffle and set, respectively
  • I'm not sure on the actual use-case for _mm_undefined (when it would be needed in real world code)

For the unknown functions:

  • sll, sra, and srl take an __m128i rather than an int
  • slli_si128 and srli_si128 look like they should be considered "missing"
  • move_epi64 looks like it might be a missing "scalar" instruction
  • loadh and loadl look like they should be "missing"
  • the stream forms look like they should be considered "missing"

It also wasn't immediately obvious when both signed and unsigned overloads should be provided for a given packed integer instruction. cmpgt and cmplt seem to be missing the unsigned versions, but cmpeq includes them.
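For example, cmpeq's managed surface carries both element signs even though the instruction is sign-agnostic (a sketch of the existing pattern):

/// <summary>
/// __m128i _mm_cmpeq_epi8 (__m128i a, __m128i b)
/// </summary>
public static Vector128<sbyte> CompareEqual(Vector128<sbyte> left, Vector128<sbyte> right) => CompareEqual(left, right);
public static Vector128<byte> CompareEqual(Vector128<byte> left, Vector128<byte> right) => CompareEqual(left, right);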

There were a couple other instructions in Sse2 where the Intrinsics Guide did not explicitly dictate signed vs unsigned but where only a single overload was provided (I haven't finished documenting them all yet).

@fiigii commented Dec 14, 2017

could you comment on the instructions I have labeled as Helper and Unknown

@tannergooding Sorry for the delay. I still need some time to investigate certain instructions.

I would think the _mm_cast*_* instructions should be considered "missing".

We have generic versions: StaticCast<T,U>, ExtendToVector256<T>, and GetLowerHalf<T>.

The load1, store1, and set don't have instructions but look to be roughly equivalent to Set1, StoreScalar, and LoadScalar, respectively

The loadr and setr intrinsics look to be helpers for shuffle and load or shuffle and set, respectively

Yes, and I think we should have as few helper functions as possible.

I'm not sure on the actual use-case for _mm_undefined (when it would be needed in real world code)

These intrinsics are usually used to get an uninitialized vector, to avoid the initialization overhead. I do not think it makes sense in .NET/CLI semantics.

slli_si128 and srli_si128 look like they should be considered "missing"

As far as I know, the codegen of slli_si128 is the same as bslli_si128. We have ShiftRightLogical128BitLane and ShiftLeftLogical128BitLane.
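A usage sketch of that existing surface (assuming the Vector128<sbyte> overload; the byte count is an immediate):

Vector128<sbyte> shifted = Sse2.ShiftLeftLogical128BitLane(value, 4); // pslldq xmm, 4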

move_epi64 looks like it might be a missing "scalar" instruction
loadh and loadl look like they should be "missing"

Good catch! Thanks, I should fix.

@fiigii commented Dec 14, 2017

cmpgt and cmplt seem to be missing the unsigned versions, but cmpeq includes them.

They are signed comparison instructions, and comparing for equality does not need sign information.

There were a couple other instructions in Sse2 where the Intrinsics Guide did not explicitly dictate signed vs unsigned but where only a single overload was provided

The sign information is documented in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2.

@tannergooding (Member Author) commented Dec 14, 2017

Ok. I have now gone through all the exposed ISAs manually and ensured all scalar intrinsic instructions are exposed in this PR.

@fiigii, do you think unsigned overloads need to be exposed for the ConvertTo<Scalar> functions?
They all use MOVD/MOVQ, which is not explicit on the data being signed or unsigned.

int _mm_cvtsi128_si32 (__m128i a)
__m128i _mm_cvtsi32_si128 (int a)
__int64 _mm_cvtsi128_si64 (__m128i a)
__m128i _mm_cvtsi64_si128 (__int64 a)
int _mm256_cvtsi256_si32 (__m256i a)

I also found the following instructions to be missing from the packed forms.
None of them have an intrinsic with the same shape that uses the same underlying hardware instruction.

int _mm_movemask_ps (__m128 a)                                      // movmskps

__m128d _mm_loadh_pd (__m128d a, double const* mem_addr)            // movhpd
__m128d _mm_loadl_pd (__m128d a, double const* mem_addr)            // movlpd
__m128i _mm_loadl_epi64 (__m128i const* mem_addr)                   // movq

void _mm_stream_si32 (int* mem_addr, int a)                         // movnti
void _mm_stream_si64 (__int64* mem_addr, __int64 a)                 // movnti

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)  // vperm2f128

__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)          // vmovntdqa

// The following 8 have intrinsics which take an imm8 and emit the same underlying instruction
__m128i _mm_sll_epi16 (__m128i a, __m128i count)                    // psllw
__m128i _mm_sll_epi32 (__m128i a, __m128i count)                    // pslld
__m128i _mm_sll_epi64 (__m128i a, __m128i count)                    // psllq
__m128i _mm_sra_epi16 (__m128i a, __m128i count)                    // psraw
__m128i _mm_sra_epi32 (__m128i a, __m128i count)                    // psrad
__m128i _mm_srl_epi16 (__m128i a, __m128i count)                    // psrlw
__m128i _mm_srl_epi32 (__m128i a, __m128i count)                    // psrld
__m128i _mm_srl_epi64 (__m128i a, __m128i count)                    // psrlq

// The following 6 have the corresponding _mm256 forms exposed
__m128i _mm_sllv_epi32 (__m128i a, __m128i count)                   // vpsllvd
__m128i _mm_sllv_epi64 (__m128i a, __m128i count)                   // vpsllvq
__m128i _mm_srav_epi32 (__m128i a, __m128i count)                   // vpsravd
__m128i _mm_srlv_epi32 (__m128i a, __m128i count)                   // vpsrlvd
__m128i _mm_srlv_epi64 (__m128i a, __m128i count)                   // vpsrlvq

Finally, the following are exposed under a different ISA than the Intrinsics Guide lists them under:

// Exposed as SSE, listed as SSE2
__m128 _mm_castpd_ps (__m128d a)
__m128i _mm_castpd_si128 (__m128d a)
__m128d _mm_castps_pd (__m128 a)
__m128i _mm_castps_si128 (__m128 a)
__m128d _mm_castsi128_pd (__m128i a)
__m128 _mm_castsi128_ps (__m128i a)

// Exposed as AVX, listed as AVX2
__int16 _mm256_extract_epi16 (__m256i a, const int index)
__int8 _mm256_extract_epi8 (__m256i a, const int index)

@fiigii commented Dec 14, 2017

do you think unsigned overloads need to be exposed for the ConvertTo functions?
They all use MOVD/MOVQ, which is not explicit on the data being signed or unsigned.

Yes, they just copy the first element, no zero/sign extension behavior.

@fiigii commented Dec 14, 2017

// Exposed as AVX, listed as AVX2
__int16 _mm256_extract_epi16 (__m256i a, const int index)
__int8 _mm256_extract_epi8 (__m256i a, const int index)

They are helper intrinsics; we have codegen solutions for both AVX and AVX2.

@fiigii commented Dec 14, 2017

// Exposed as SSE, listed as SSE2
__m128 _mm_castpd_ps (__m128d a)
__m128i _mm_castpd_si128 (__m128d a)
__m128d _mm_castps_pd (__m128 a)
__m128i _mm_castps_si128 (__m128 a)
__m128d _mm_castsi128_pd (__m128i a)
__m128 _mm_castsi128_ps (__m128i a)

These helper intrinsics do not generate any code. We have the Vector128<double/int/long/...> types with SSE, so I think they should be in Sse.

@fiigii commented Dec 14, 2017

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8) // vperm2f128

We don't encourage using vperm2f128 on integer data due to the data bypass penalty.

@fiigii commented Dec 14, 2017

void _mm_stream_si32 (int* mem_addr, int a) // movnti
void _mm_stream_si64 (__int64* mem_addr, __int64 a) // movnti

I am not sure about the usefulness of streaming stores with scalar types. Please give me more time.

@fiigii commented Dec 14, 2017

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8) // vperm2f128

We don't encourage using vperm2f128 on integer data due to the data bypass penalty.

But we could fill in the missing types to make the API simpler: Permute2x128<T>. Thoughts?
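A sketch of that generic shape (parameter names are mine, following the style of the existing generic intrinsics such as StaticCast<T, U>):

/// <summary>
/// __m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
/// </summary>
public static Vector256<T> Permute2x128<T>(Vector256<T> left, Vector256<T> right, byte control) where T : struct => Permute2x128<T>(left, right, control);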

@tannergooding (Member Author)

I've added the unsigned overloads that were missing and believe the PR is now ready for final review and merge.

@fiigii, I have logged https://github.com/dotnet/corefx/issues/25926 to continue the discussion of the "missing" APIs. We can add them in one go, in a separate PR, after we determine which ones need to be added.

@fiigii commented Dec 14, 2017

@tannergooding Thank you so much for the work.

@tannergooding (Member Author)

@eerhardt, do I need another sign-off or am I good to merge?

Also, do you want the CoreFX PR up before or after this change goes in?

/// <summary>
/// __m128d _mm_cmpneq_pd (__m128d a, __m128d b)
/// </summary>
public static Vector128<double> CompareNotEqual(Vector128<double> left, Vector128<double> right) => CompareNotEqual(left, right);

/// <summary>
/// int _mm_comine_sd (__m128d a, __m128d b)
Reviewer:

Would you like to fix this comment to _mm_comineq_sd?

@tannergooding (Member Author)

Fixed.

/// <summary>
/// __m128 _mm_cmpneq_ps (__m128 a, __m128 b)
/// </summary>
public static Vector128<float> CompareNotEqual(Vector128<float> left, Vector128<float> right) => CompareNotEqual(left, right);

/// <summary>
/// int _mm_comine_ss (__m128 a, __m128 b)
Reviewer:

And here.

@tannergooding (Member Author)

Fixed.

public static bool CompareNotEqualOrderedScalar(Vector128<float> left, Vector128<float> right) => CompareNotEqualOrderedScalar(left, right);

/// <summary>
/// int _mm_ucomine_ss (__m128 a, __m128 b)
Reviewer:

And here.

@tannergooding (Member Author)

Fixed.

public static bool CompareNotEqualOrderedScalar(Vector128<double> left, Vector128<double> right) => CompareNotEqualOrderedScalar(left, right);

/// <summary>
/// int _mm_ucomine_sd (__m128d a, __m128d b)
Reviewer:

And here.

@tannergooding (Member Author)

Fixed.

/// <summary>
/// float _mm256_cvtss_f32 (__m256 a)
/// </summary>
public static float ConvertToSingle(Vector256<float> value) => ConvertToSingle(value);
@fiigii (Dec 14, 2017)

I see you provide helper functions that convert a vector to float/double. Do we need helpers for float/double -> Vector128<float/double>?

@tannergooding (Member Author)

Do you mean providing __m128 _mm_set_ss (float a) in addition to __m128 _mm_load_ss (float const* mem_addr)?

@fiigii (Dec 14, 2017)

Yes, SetScalar can sometimes avoid a memory access that LoadScalar would incur.
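A side-by-side sketch of the difference:

float x = 1.0f;
Vector128<float> viaSet = Sse.SetScalar(x);        // value can stay in registers
unsafe
{
    Vector128<float> viaLoad = Sse.LoadScalar(&x); // movss xmm, [mem]: forces a memory round-trip
}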

@tannergooding (Member Author)

👍, will add.

@tannergooding (Member Author)

Added Vector128<float> Sse.SetScalar(float value) and Vector128<double> Sse2.SetScalar(double value)

@tannergooding (Member Author)

CoreFX side of this PR is dotnet/corefx#26095

@jkotas (Member) commented Dec 29, 2017

@tannergooding Feel free to merge this if it is ready to go.

@tannergooding (Member Author)

@jkotas, thanks!

Will merge after the two pending jobs come back green (looks like they were kicked off with my last comment, so they are probably new).
