This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Adding scalar hardware intrinsics for x86. #15341

Merged · 1 commit · Dec 29, 2017

Conversation

@tannergooding (Member Author)

Still need to update the *.PlatformNotSupported.cs files accordingly.

@tannergooding (Member Author)

I haven't done COMIS or UCOMIS yet, as I am not sure of the naming. I was thinking bool CheckEqualScalar(Vector128<float> left, Vector128<float> right).
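A minimal sketch of that proposed shape, using the name floated above (not a final API; COMISS sets EFLAGS, so the managed surface would return a bool):

/// <summary>
/// int _mm_comieq_ss (__m128 a, __m128 b)
/// </summary>
public static bool CheckEqualScalar(Vector128<float> left, Vector128<float> right) => CheckEqualScalar(left, right);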

@tannergooding (Member Author)

I also noticed a number of places in the existing files where we are not being consistent (either in naming or in following the API naming guidelines).

Ex: We have ReciprocalSquareRoot, Sqrt, and ReciprocalSqrt, depending on where you look.

We are also using Int, Float, Long, etc., when we should be using Int32, Single, Int64, etc.

/// <summary>
/// __m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8)
/// </summary>
public static Vector128<float> CompareScalar(Vector128<float> left, Vector128<float> right, FloatComparisonMode mode) => CompareScalar(left, right, mode);
@tannergooding (Member Author)

Was wondering why we have FloatComparisonMode here, but not in the Sse/Sse2 files?

/// <summary>
/// __m128 _mm_cmpunord_ps (__m128 a, __m128 b)
/// </summary>
public static Vector128<float> CompareUnordered(Vector128<float> left, Vector128<float> right) => CompareUnordered(left, right);

@tannergooding (Member Author)

These files all contain a number of lines that are just whitespace... We should probably clean that up separately.

/// <summary>
/// __m128 _mm_sqrt_ss (__m128 a)
/// </summary>
public static Vector128<float> SqrtScalar(Vector128<float> value) => SqrtScalar(value);
@tannergooding (Member Author)

The Sse2 form for double takes two arguments, but we only take one here (this is matching the C/C++ intrinsics). Perhaps we should expose both or just the one that takes two arguments?

/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a, __m128d b)
/// </summary>
public static Vector128<double> SqrtScalar(Vector128<double> a, Vector128<double> b) => SqrtScalar(a, b);
@tannergooding (Member Author)

I was thinking of calling a and b upper and value, respectively, since b is the value we perform the operation on and a is the value we fill the upper bits from. Thoughts?
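Under that proposal, the signature would read as follows (a sketch of the suggested renaming, not the merged shape):

/// <summary>
/// __m128d _mm_sqrt_sd (__m128d a, __m128d b)
/// </summary>
public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value) => SqrtScalar(upper, value);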

@fiigii (Dec 2, 2017)

Do we really need to specifically fill the upper bits of the result in practice?
From a performance perspective, we always recommend using the same register for the source and the upper argument.
In particular, if we decide to support the two-parameter version of the Sqrt intrinsic, then on non-AVX machines the compiler may have to insert unpack or shuffle instructions to implement this semantic, and those are both long-latency instructions.
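For illustration, a usage sketch of the same-register recommendation, assuming the two-parameter form were exposed (note that SetScalar is only added later in this PR):

Vector128<double> v = Sse2.SetScalar(42.0);
Vector128<double> r = Sse2.SqrtScalar(v, v); // same register as source and upper: a plain sqrtsd, no unpack/shuffle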


In summary, I am suggesting that we expose only the one-parameter intrinsic for SQRTSS and SQRTSD.

@tannergooding (Member Author)

Just exposing the single-operand intrinsic version is probably fine. I actually missed that the two-operand form is only on AVX and above; the Intel Intrinsics Guide lists it as SSE2: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=Sqrt&techs=SSE,SSE2

@tannergooding (Member Author)

The Software Developer's Manual lists the information correctly.

/// void _mm_store_ss (float* mem_addr, __m128 a)
/// </summary>
public static unsafe void StoreScalar(float* address, Vector128<float> source) => StoreScalar(address, source);

/// <summary>
/// __m128d _mm_sub_ps (__m128d a, __m128d b)
@tannergooding (Member Author)

Note: the files have several typos, such as this one, where the wrong type is used in the intrinsic comment.

Reviewer:

Good catch. These comment typos do not impact the CoreCLR/CoreFX interface, so we can fix them later.

@@ -352,6 +352,11 @@ public static class Avx2
/// </summary>
public static Vector256<ulong> ConvertToVector256ULong(Vector128<uint> value) => ConvertToVector256ULong(value);
@tannergooding (Member Author)

Note: this is an example of an API that doesn't follow the general .NET API naming conventions. It should probably be ConvertToVector256Int64.

@fiigii commented Dec 2, 2017

I also noticed a number of places in the existing files where we are not being consistent
We have ReciprocalSquareRoot, Sqrt, and ReciprocalSqrt, depending on where you look.
We are also using Int, Float, Long, etc., when we should be using Int32, Single, Int64, etc.

Thank you for pointing this out. If the .NET API convention always prefers Int64 over Long, we definitely should fix it.

@tannergooding (Member Author) commented Dec 2, 2017

Just as an FYI, the exact guideline I am referring to is Avoiding Language Specific Names.

/// __m128 _mm_round_ss (__m128 a, int rounding)
/// _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC
/// </summary>
public static Vector128<float> RoundToNearestIntegerScalar(Vector128<float> value) => RoundToNearestIntegerScalar(value);
Reviewer:

It would be more consistent to use a RoundingMode immediate parameter here (similar to the comparisons, for example). Then it would be 4 functions (Round/RoundScalar * float/double) that map directly to four machine instructions (roundpd/ps/sd/ss) instead of 20. The fully named helper functions could be defined on top of these basic instructions somewhere else.

@tannergooding (Member Author)

@fiigii, thoughts? I was following the existing convention you had set up for the packed forms.

Reviewer:

AFAIR, in discussions on the Intrinsics API the consensus was to create a direct mapping to processor instructions, for a multitude of reasons. Therefore, I would avoid creating any APIs which do not map to, or which omit, processor instructions. That means that if we have a 3-argument scalar instruction on AVX and above alongside a 2-argument SSE equivalent, we should have both; or, here, we should use an immediate parameter for defining the rounding mode.

@fiigii (Dec 3, 2017)

I believe that the current design (which encodes the rounding mode into the intrinsic names) has better static semantics.
For example, a RoundingMode immediate parameter requires 1) a const-parameter language feature to rule out non-literal values, and 2) compile-time error reporting and runtime exceptions for invalid values from Roslyn/CoreCLR: https://github.com/dotnet/corefx/issues/22940#issuecomment-320122766.
Each rounding instruction has just a few pre-defined modes, so I thought this was a good opportunity to provide intrinsics with more stable runtime behavior and a friendlier development experience. Meanwhile, it does not lose any flexibility.

@pentp (Dec 3, 2017)

We could still implement rounding functions with an immediate parameter as private intrinsics and expose them through wrapper functions that just forward to the actual intrinsic if this simplifies the implementation.

// Rounding-control constants matching the C _MM_FROUND_* values (added so the sketch compiles)
private const byte _MM_FROUND_TO_NEAREST_INT = 0x00;
private const byte _MM_FROUND_TO_NEG_INF = 0x01;
private const byte _MM_FROUND_NO_EXC = 0x08;

// __m128 _mm_round_ss (__m128 a, int rounding)
private static Vector128<float> RoundScalar(Vector128<float> value, byte rounding) => RoundScalar(value, rounding);

// _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC
public static Vector128<float> RoundToNearestIntegerScalar(Vector128<float> value) => RoundScalar(value, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

// _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC
public static Vector128<float> RoundToNegativeInfinityScalar(Vector128<float> value) => RoundScalar(value, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);

Reviewer:

if this simplifies the implementation.

No, the current runtime always expands all the APIs as intrinsics.

/// <summary>
/// __m128d _mm_cvtss_sd (__m128d a, __m128 b)
/// </summary>
public static Vector128<double> ConvertToDoubleScalar(Vector128<double> a, Vector128<float> b) => ConvertToDoubleScalar(a, b);
Reviewer:

MOVD/MOVQ instructions are missing from here (and AVX/AVX2). I propose something like this:

// __m128i _mm_cvtsi32_si128 (int a)
public static Vector128<int> CopyInt32(int value);
// int _mm_cvtsi128_si32 ( __m128i a)
public static int CopyInt32(Vector128<int> value);
// __m128i _mm_cvtsi64_si128(__int64)
public static Vector128<long> CopyInt64(long value);
// __int64 _mm_cvtsi128_si64(__m128i)
public static long CopyInt64(Vector128<long> value);

@tannergooding (Member Author)

Fixed. I went with the existing ConvertTo naming convention.
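For illustration, the ConvertTo shape for the MOVD pair might look like this (the exact names here are my sketch of the convention, not necessarily the merged surface):

/// <summary>
/// __m128i _mm_cvtsi32_si128 (int a)
/// </summary>
public static Vector128<int> ConvertToVector128Int32(int value) => ConvertToVector128Int32(value);

/// <summary>
/// int _mm_cvtsi128_si32 (__m128i a)
/// </summary>
public static int ConvertToInt32(Vector128<int> value) => ConvertToInt32(value);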

@4creators
I have not seen any conversion intrinsics which would allow converting from and to Half. The fact that we do not have Half support in the CLR should not stop us from having intrinsics which would act as an interface between the CLR and other runtimes which support Half-sized floating-point types. Additionally, although it is not directly related to this PR, it could be very helpful to have support for 8-bit floating/binary sized numbers as well.

Related issues:

https://github.com/dotnet/corefx/issues/17267

https://github.com/dotnet/coreclr/issues/11948

@tannergooding (Member Author)

@4creators, I believe the FP16C instructions haven't gone for design review yet.

The initial set that has gone for review covers both packed and scalar (as of this PR) instructions for SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, AES, BMI1, BMI2, LZCNT, PCLMULQDQ, and POPCNT.

The intrinsic sets that have not gone to review (based on those listed under the Intel Intrinsics Guide) are MMX, AVX-512, KNC, SVML, ADX, CLFLUSHOPT, CLWB, FP16C, FSGSBASE, FXSR, INVPCID, MONITOR, MPX, PREFETCHWT1, RDPID, RDRAND, RDSEED, RDTSCP, RTM, SHA, TSC, XSAVE, XSAVEC, XSAVEOPT, XSS

A number of those intrinsic sets won't/shouldn't go to review because:

  • Not all of those (such as the SVML intrinsics) represent actual hardware instructions.
  • Some of them (such as MMX) are legacy instructions.
  • Others (like XSAVE) are targeted for OS, not necessarily for applications.

The others likely just need to be proposed and go up for review. For FP16C in particular, we at the very least need a Half data type so that it can be properly represented as an API (Vector128<Half>), so the review will be slightly more involved.

@4creators
I believe the FP16C instructions haven't gone for design review yet.

@tannergooding Yep, you are right. Perhaps it's time to make both Half and missing intrinsics proposal :)

@4creators
@dotnet-bot test Windows_NT x86 Checked Innerloop Build and Test
@dotnet-bot test Windows_NT x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test CentOS7.1 x64 Debug Innerloop Build
@dotnet-bot test CentOS7.1 x64 Checked Innerloop Build and Test
@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test

@tannergooding (Member Author)

Updated the *.PlatformNotSupported.cs files and added the remaining APIs that I was aware were missing.

/// <summary>
/// __int64 _mm_cvtsd_si64 (__m128d a)
/// </summary>
public static long ConvertToInt64(Vector128<double> value) => ConvertToInt64(value);
@tannergooding (Member Author)

@fiigii, For instructions like this, which have an additional encoding on x64, how do we want to expose them?

@fiigii

an additional encoding on x64

Do you mean the 64-bit register in cvtsd2si r64, xmm?
These intrinsics are only available in 64-bit mode, and calling them in 32-bit mode should throw PlatformNotSupportedException.

if (Sse2.IsSupported && Environment.Is64BitProcess)
{
    // cvtsd2si r64, xmm is only encodable in 64-bit mode
    long res = Sse2.ConvertToInt64(vec);
}

@tannergooding (Member Author)

Alright, that sounds good to me.

I mostly just wanted to confirm that we are exposing them in X86.Sse2 and not under some X64-specific sub-class.

@tannergooding changed the title from "[WIP] Adding scalar hardware intrinsics for x86." to "Adding scalar hardware intrinsics for x86." on Dec 11, 2017
@tannergooding (Member Author)

Should be ready for review.

Still todo (but will be in separate PRs):

  • Update CoreFX with the new APIs
  • Plumb through the JIT support for these APIs
    • @fiigii, are there any pending PRs from you that I should wait on before starting on this?

@eerhardt (Member) left a comment:

I'm no expert here, but this looks good to me.

@tannergooding (Member Author)

@fiigii, is there a list of instructions/intrinsics that don't have corresponding managed APIs exposed yet?

I have found, at the very least, _mm_movemask_ps (which is useful for equality comparisons, for example) to be missing. But just looking through the current list as compared to the numbers listed in the Intel Intrinsics Guide (filtering on a per-class basis), it seems like there may be more.

@fiigii commented Dec 13, 2017

I have found, at the very least, _mm_movemask_ps (which is useful for equality comparisons, for example) to be missing

I am pretty sure that this one (_mm_movemask_ps) is my mistake and we should have it. As you can see, we have all other MoveMask in Sse2, Avx, and Avx2. Thank you so much for pointing it out.

is there a list of instructions/intrinsics that don't have corresponding managed APIs exposed yet?

But just looking through the current list as compared to the numbers listed in the Intel Intrinsics Guide (filtering on a per-class basis), it seems like there may be more.

Yes, I have one. I remember that I did not expose the legacy 64-bit SSE SIMD, scalar floating-point, and certain comparison intrinsics. This is a good chance to complete the intrinsic APIs and fix other mistakes. Let me update my list based on your scalar intrinsic work. I will try my best to put them together tomorrow.

@tannergooding Thanks again for your find and reminder.

@tannergooding (Member Author) commented Dec 14, 2017

I went through the exposed ISAs manually and found the following "missing".

MMX Interop Instructions (68)

These are instructions that take or return __m64. I don't think we will ever need these exposed and should just instruct users to port their code to SSE or higher.

Control Instructions (10)

These may be useful for some scenarios but also have a good chance of breaking other code/assumptions. That being said, you could already do these with your own P/Invoke.

I would vote not to expose these via intrinsics and instead expose a general API (after appropriate review) to control this (which the IEEE 754:2008 spec suggests is required; see 4.1 Attribute specification).

unsigned int _MM_GET_EXCEPTION_MASK ()
unsigned int _MM_GET_EXCEPTION_STATE ()
unsigned int _MM_GET_FLUSH_ZERO_MODE ()
unsigned int _MM_GET_ROUNDING_MODE ()
unsigned int _mm_getcsr (void)
void _MM_SET_EXCEPTION_MASK (unsigned int a)
void _MM_SET_EXCEPTION_STATE (unsigned int a)
void _MM_SET_FLUSH_ZERO_MODE (unsigned int a)
void _MM_SET_ROUNDING_MODE (unsigned int a)
void _mm_setcsr (unsigned int a)

Helper Functions (26)

These don't directly map to any specific instruction, but can be helpful in some scenarios.

Not listed in the Intel Intrinsics Guide, but exposed by MSVC, are other helper functions (such as _MM_SHUFFLE), which may also be worth exposing (given that C# doesn't support macros for building immediate values in some cases).
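For reference, a managed equivalent of the MSVC _MM_SHUFFLE macro is a one-liner (the name MmShuffle is mine); it packs four 2-bit element selectors into a single immediate byte:

private static byte MmShuffle(byte z, byte y, byte x, byte w)
    => (byte)((z << 6) | (y << 4) | (x << 2) | w);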

Possibly worth further discussion.

_MM_TRANSPOSE4_PS (__m128 row0, __m128 row1, __m128 row2, __m128 row3)

__m128 _mm_undefined_ps (void)
__m128d _mm_undefined_pd (void)
__m128i _mm_undefined_si128 (void)

__m128 _mm_load1_ps (float const* mem_addr)
__m128 _mm_loadr_ps (float const* mem_addr)
__m128 _mm_set_ss (float a)
__m128 _mm_setr_ps (float e3, float e2, float e1, float e0)
void _mm_store1_ps (float* mem_addr, __m128 a)
void _mm_storer_ps (float* mem_addr, __m128 a)

__m128d _mm_load1_pd (double const* mem_addr)
__m128d _mm_loadr_pd (double const* mem_addr)
__m128d _mm_set_sd (double a)
__m128i _mm_setr_epi8 (char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)
__m128i _mm_setr_epi16 (short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)
__m128i _mm_setr_epi32 (int e3, int e2, int e1, int e0)
__m128i _mm_setr_epi64 (__m64 e1, __m64 e0)
__m128d _mm_setr_pd (double e1, double e0)
void _mm_store1_pd (double* mem_addr, __m128d a)
void _mm_storer_pd (double* mem_addr, __m128d a)

__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)
__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)
__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)

void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)
void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)
void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)

Advanced Memory (6)

These are instructions for advanced memory operations. Probably useful. Could they negatively impact the GC?

Potentially worth further discussion

void _mm_clflush (void const* p)
void _mm_lfence (void)
void _mm_mfence (void)
void _mm_pause (void)
void _mm_prefetch (char const* p, int i)
void _mm_sfence (void)

Missing (1)

This instruction is actually missing and needs to be added. I will add it in this PR.

int _mm_movemask_ps (__m128 a)
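Given that MoveMask already exists in Sse2, Avx, and Avx2 (as noted earlier in the thread), the missing Sse API would presumably mirror that shape:

/// <summary>
/// int _mm_movemask_ps (__m128 a)
/// </summary>
public static int MoveMask(Vector128<float> value) => MoveMask(value);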

Unknown (16)

Not quite sure what area some of these fall under.

__m128i _mm_sll_epi16 (__m128i a, __m128i count)
__m128i _mm_sll_epi32 (__m128i a, __m128i count)
__m128i _mm_sll_epi64 (__m128i a, __m128i count)
__m128i _mm_slli_si128 (__m128i a, int imm8)
__m128i _mm_sra_epi16 (__m128i a, __m128i count)
__m128i _mm_sra_epi32 (__m128i a, __m128i count)
__m128i _mm_srl_epi16 (__m128i a, __m128i count)
__m128i _mm_srl_epi32 (__m128i a, __m128i count)
__m128i _mm_srl_epi64 (__m128i a, __m128i count)
__m128i _mm_srli_si128 (__m128i a, int imm8)

__m128i _mm_move_epi64 (__m128i a)

__m128d _mm_loadh_pd (__m128d a, double const* mem_addr)
__m128d _mm_loadl_pd (__m128d a, double const* mem_addr)
__m128i _mm_loadl_epi64 (__m128i const* mem_addr)

void _mm_stream_si32 (int* mem_addr, int a)
void _mm_stream_si64 (__int64* mem_addr, __int64 a)

/// __m128d _mm_cmpgt_sd (__m128d a, __m128d b)
/// </summary>
public static Vector128<double> CompareGreaterThanScalar(Vector128<double> left, Vector128<double> right) => CompareGreaterThanScalar(left, right);

Reviewer:

And int _mm_ucomineq_sd (__m128d a, __m128d b).

@tannergooding (Member Author)

They are at https://github.com/dotnet/coreclr/pull/15341/files/b3a5840ff609dbd8212c631da659b0154917dfc4#diff-0c40c8a7f5df6b8bc03aef2dea8f0884R316

I ended up going with the bool Compare*OrderedScalar and bool Compare*UnorderedScalar naming convention for these operations.
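So, for example, the COMISD/UCOMISD equality pair ends up as follows (sketched from the naming convention above):

/// <summary>
/// int _mm_comieq_sd (__m128d a, __m128d b)
/// </summary>
public static bool CompareEqualOrderedScalar(Vector128<double> left, Vector128<double> right) => CompareEqualOrderedScalar(left, right);

/// <summary>
/// int _mm_ucomieq_sd (__m128d a, __m128d b)
/// </summary>
public static bool CompareEqualUnorderedScalar(Vector128<double> left, Vector128<double> right) => CompareEqualUnorderedScalar(left, right);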

Reviewer:

I see, thanks. But the comment seems incorrect: _mm_comine_sd -> _mm_comineq_sd.

@@ -946,6 +1175,11 @@ public static class Sse2
/// </summary>
public static Vector128<double> Subtract(Vector128<double> left, Vector128<double> right) => Subtract(left, right);

/// <summary>
/// __m128d _mm_sub_ss (__m128d a, __m128d b)
Reviewer:

__m128d _mm_sub_sd (__m128d a, __m128d b)

@tannergooding (Member Author)

Will fix

@fiigii commented Dec 14, 2017

I just checked my list; you have covered almost all of the APIs. But I did not find Sse.MinScalar/MaxScalar.

__m128 _mm_max_ss (__m128 a, __m128 b);
__m128 _mm_min_ss (__m128 a, __m128 b);

/// <summary>
/// __m128 _mm_max_ss (__m128 a, __m128 b)
/// </summary>
public static Vector128<float> MaxScalar(Vector128<float> left, Vector128<float> right) => MaxScalar(left, right);
@tannergooding (Member Author)

@fiigii, MaxScalar is here, MinScalar is a few lines up

@fiigii

Thank you, GitHub was hiding them.

@tannergooding (Member Author)

@fiigii, could you comment on the instructions I have labeled as Helper and Unknown here: #15341 (comment)

For the helper functions:

  • I would think the _mm_cast*_* instructions should be considered "missing".
  • The load1, store1, and set don't have instructions but look to be roughly equivalent to Set1, StoreScalar, and LoadScalar, respectively
  • The loadr and setr intrinsics look to be helpers for shuffle and load or shuffle and set, respectively
  • I'm not sure on the actual use-case for _mm_undefined (when it would be needed in real world code)

For the unknown functions:

  • sll, sra, and srl take an __m128i rather than an int
  • slli_si128 and srli_si128 look like they should be considered "missing"
  • move_epi64 looks like it might be a missing "scalar" instruction
  • loadh and loadl look like they should be "missing"
  • the stream forms look like they should be considered "missing"

It also wasn't immediately obvious when both signed and unsigned overloads should be provided for a given packed integer instruction. cmpgt and cmplt seem to be missing the unsigned versions, but cmpeq includes them.
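For example, cmpeq's managed surface carries both element signs even though the instruction is sign-agnostic (a sketch of the existing pattern):

/// <summary>
/// __m128i _mm_cmpeq_epi8 (__m128i a, __m128i b)
/// </summary>
public static Vector128<sbyte> CompareEqual(Vector128<sbyte> left, Vector128<sbyte> right) => CompareEqual(left, right);
public static Vector128<byte> CompareEqual(Vector128<byte> left, Vector128<byte> right) => CompareEqual(left, right);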

There were a couple other instructions in Sse2 where the Intrinsics Guide did not explicitly dictate signed vs unsigned but where only a single overload was provided (I haven't finished documenting them all yet).

@fiigii commented Dec 14, 2017

could you comment on the instructions I have labeled as Helper and Unknown

@tannergooding Sorry for the delay. I still need some time to investigate certain instructions.

I would think the _mm_cast*_* instructions should be considered "missing".

We have generic versions: StaticCast<T,U>, ExtendToVector256<T>, and GetLowerHalf<T>.

The load1, store1, and set don't have instructions but look to be roughly equivalent to Set1, StoreScalar, and LoadScalar, respectively

The loadr and setr intrinsics look to be helpers for shuffle and load or shuffle and set, respectively

Yes, and I think we should have as few helper functions as possible.

I'm not sure on the actual use-case for _mm_undefined (when it would be needed in real world code)

These intrinsics are usually used to get an uninitialized vector, to avoid the initialization overhead. I do not think it makes sense in .NET/CLI semantics.

slli_si128 and srli_si128 look like they should be considered "missing"

As far as I know, the codegen of slli_si128 is the same as bslli_si128. We have ShiftRightLogical128BitLane and ShiftLeftLogical128BitLane.
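A usage sketch of that existing surface (assuming the Vector128<sbyte> overload; the byte count is an immediate):

Vector128<sbyte> shifted = Sse2.ShiftLeftLogical128BitLane(value, 4); // pslldq xmm, 4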

move_epi64 looks like it might be a missing "scalar" instruction
loadh and loadl look like they should be "missing"

Good catch! Thanks, I should fix.

@fiigii commented Dec 14, 2017

cmpgt and cmplt seem to be missing the unsigned versions, but cmpeq includes them.

They are signed comparison instructions, and comparing for equality does not need sign information.

There were a couple other instructions in Sse2 where the Intrinsics Guide did not explicitly dictate signed vs unsigned but where only a single overload was provided

The sign information is documented in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2.

@tannergooding (Member Author) commented Dec 14, 2017

Ok. I have now gone through all the exposed ISAs manually and ensured all scalar intrinsic instructions are exposed in this PR.

@fiigii, do you think unsigned overloads need to be exposed for the ConvertTo<Scalar> functions?
They all use MOVD/MOVQ, which is not explicit on the data being signed or unsigned.

int _mm_cvtsi128_si32 (__m128i a)
__m128i _mm_cvtsi32_si128 (int a)
__int64 _mm_cvtsi128_si64 (__m128i a)
__m128i _mm_cvtsi64_si128 (__int64 a)
int _mm256_cvtsi256_si32 (__m256i a)

I also found the following instructions to be missing from the packed forms.
None of them have an intrinsic with the same shape that uses the same underlying hardware instruction.

int _mm_movemask_ps (__m128 a)                                      // movmskps

__m128d _mm_loadh_pd (__m128d a, double const* mem_addr)            // movhpd
__m128d _mm_loadl_pd (__m128d a, double const* mem_addr)            // movlpd
__m128i _mm_loadl_epi64 (__m128i const* mem_addr)                   // movq

void _mm_stream_si32 (int* mem_addr, int a)                         // movnti
void _mm_stream_si64 (__int64* mem_addr, __int64 a)                 // movnti

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)  // vperm2f128

__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)          // vmovntdqa

// The following 8 have intrinsics which take an imm8 and emit the same underlying instruction
__m128i _mm_sll_epi16 (__m128i a, __m128i count)                    // psllw
__m128i _mm_sll_epi32 (__m128i a, __m128i count)                    // pslld
__m128i _mm_sll_epi64 (__m128i a, __m128i count)                    // psllq
__m128i _mm_sra_epi16 (__m128i a, __m128i count)                    // psraw
__m128i _mm_sra_epi32 (__m128i a, __m128i count)                    // psrad
__m128i _mm_srl_epi16 (__m128i a, __m128i count)                    // psrlw
__m128i _mm_srl_epi32 (__m128i a, __m128i count)                    // psrld
__m128i _mm_srl_epi64 (__m128i a, __m128i count)                    // psrlq

// The following 6 have the corresponding _mm256 forms exposed
__m128i _mm_sllv_epi32 (__m128i a, __m128i count)                   // vpsllvd
__m128i _mm_sllv_epi64 (__m128i a, __m128i count)                   // vpsllvq
__m128i _mm_srav_epi32 (__m128i a, __m128i count)                   // vpsravd
__m128i _mm_srlv_epi32 (__m128i a, __m128i count)                   // vpsrlvd
__m128i _mm_srlv_epi64 (__m128i a, __m128i count)                   // vpsrlvq

Finally, the following are exposed under a different ISA than the Intrinsics Guide lists them under:

// Exposed as SSE, listed as SSE2
__m128 _mm_castpd_ps (__m128d a)
__m128i _mm_castpd_si128 (__m128d a)
__m128d _mm_castps_pd (__m128 a)
__m128i _mm_castps_si128 (__m128 a)
__m128d _mm_castsi128_pd (__m128i a)
__m128 _mm_castsi128_ps (__m128i a)

// Exposed as AVX, listed as AVX2
__int16 _mm256_extract_epi16 (__m256i a, const int index)
__int8 _mm256_extract_epi8 (__m256i a, const int index)

@fiigii commented Dec 14, 2017

do you think unsigned overloads need to be exposed for the ConvertTo functions?
They all use MOVD/MOVQ, which is not explicit on the data being signed or unsigned.

Yes, they just copy the first element, no zero/sign extension behavior.

@fiigii commented Dec 14, 2017

// Exposed as AVX, listed as AVX2
__int16 _mm256_extract_epi16 (__m256i a, const int index)
__int8 _mm256_extract_epi8 (__m256i a, const int index)

They are helper intrinsics; we have codegen solutions for both AVX and AVX2.

@fiigii commented Dec 14, 2017

// Exposed as SSE, listed as SSE2
__m128 _mm_castpd_ps (__m128d a)
__m128i _mm_castpd_si128 (__m128d a)
__m128d _mm_castps_pd (__m128 a)
__m128i _mm_castps_si128 (__m128 a)
__m128d _mm_castsi128_pd (__m128i a)
__m128 _mm_castsi128_ps (__m128i a)

These helper intrinsics do not generate any code. We have the Vector128<double/int/long/...> types with SSE, so I think they should be in Sse.

@fiigii commented Dec 14, 2017

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8) // vperm2f128

We don't encourage using vperm2f128 on integer data due to the data bypass penalty.

@fiigii commented Dec 14, 2017

void _mm_stream_si32 (int* mem_addr, int a) // movnti
void _mm_stream_si64 (__int64* mem_addr, __int64 a) // movnti

I am not sure about the usefulness of streaming stores with scalar types. Please give me more time.

@fiigii commented Dec 14, 2017

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8) // vperm2f128

We don't encourage using vperm2f128 on integer data due to the data bypass penalty.

But we could fill in the missing types to make the API simpler: Permute2x128<T>. Thoughts?
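A sketch of that generic shape (parameter names are mine, following the style of the existing generic intrinsics such as StaticCast<T, U>):

/// <summary>
/// __m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
/// </summary>
public static Vector256<T> Permute2x128<T>(Vector256<T> left, Vector256<T> right, byte control) where T : struct => Permute2x128<T>(left, right, control);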

@tannergooding (Member Author)

I've added the unsigned overloads that were missing and believe the PR is now ready for final review and merge.

@fiigii, I have logged https://github.com/dotnet/corefx/issues/25926 to continue the discussion of the "missing" APIs. We can add them in one go, in a separate PR, after we determine which ones need to be added.

@fiigii commented Dec 14, 2017

@tannergooding Thank you so much for the work.

@tannergooding (Member Author)

@eerhardt, do I need another sign-off or am I good to merge?

Also, do you want the CoreFX PR up before or after this change goes in?

/// <summary>
/// __m128d _mm_cmpneq_pd (__m128d a, __m128d b)
/// </summary>
public static Vector128<double> CompareNotEqual(Vector128<double> left, Vector128<double> right) => CompareNotEqual(left, right);

/// <summary>
/// int _mm_comine_sd (__m128d a, __m128d b)
Reviewer:

Would you like to fix this comment to _mm_comineq_sd?

@tannergooding (Member Author)

Fixed.

/// <summary>
/// __m128 _mm_cmpneq_ps (__m128 a, __m128 b)
/// </summary>
public static Vector128<float> CompareNotEqual(Vector128<float> left, Vector128<float> right) => CompareNotEqual(left, right);

/// <summary>
/// int _mm_comine_ss (__m128 a, __m128 b)
Reviewer:

And here.

@tannergooding (Member Author)

Fixed.

public static bool CompareNotEqualOrderedScalar(Vector128<float> left, Vector128<float> right) => CompareNotEqualOrderedScalar(left, right);

/// <summary>
/// int _mm_ucomine_ss (__m128 a, __m128 b)
Reviewer:

And here.

@tannergooding (Member Author)

Fixed.

public static bool CompareNotEqualOrderedScalar(Vector128<double> left, Vector128<double> right) => CompareNotEqualOrderedScalar(left, right);

/// <summary>
/// int _mm_ucomine_sd (__m128d a, __m128d b)
Reviewer:

And here.

@tannergooding (Member Author)

Fixed.

/// <summary>
/// float _mm256_cvtss_f32 (__m256 a)
/// </summary>
public static float ConvertToSingle(Vector256<float> value) => ConvertToSingle(value);
@fiigii (Dec 14, 2017)

I see you provide helper functions that convert a vector to float/double. Do we need helpers for float/double -> Vector128<float/double>?

@tannergooding (Member Author)

Do you mean providing __m128 _mm_set_ss (float a) in addition to __m128 _mm_load_ss (float const* mem_addr)?

@fiigii (Dec 14, 2017)

Yes, SetScalar can sometimes avoid a memory access that LoadScalar would incur.
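A side-by-side sketch of the difference:

float x = 1.0f;
Vector128<float> viaSet = Sse.SetScalar(x);        // value can stay in registers
unsafe
{
    Vector128<float> viaLoad = Sse.LoadScalar(&x); // movss xmm, [mem]: forces a memory round-trip
}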

@tannergooding (Member Author)

👍, will add.

@tannergooding (Member Author)

Added Vector128<float> Sse.SetScalar(float value) and Vector128<double> Sse2.SetScalar(double value)

@tannergooding (Member Author)

CoreFX side of this PR is dotnet/corefx#26095

@jkotas (Member) commented Dec 29, 2017

@tannergooding Feel free to merge this if it is ready to go.

@tannergooding (Member Author)

@jkotas, thanks!

Will merge after the two pending jobs come back green (looks like they were kicked off with my last comment, so they are probably new).
