-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Adding scalar hardware intrinsics for x86. #15341
Conversation
Still need to update the |
I haven't done |
I also noticed a number of places in the existing files where we are not being consistent (either in naming or in following the API naming guidelines). Ex: We have We are also doing |
/// <summary> | ||
/// __m128 _mm_cmp_ss (__m128 a, __m128 b, const int imm8) | ||
/// </summary> | ||
public static Vector128<float> CompareScalar(Vector128<float> left, Vector128<float> right, FloatComparisonMode mode) => CompareScalar(left, right, mode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was wondering why we have FloatComparisonMode
here, but not in the Sse
/Sse2
files?
/// <summary> | ||
/// __m128 _mm_cmpunord_ps (__m128 a, __m128 b) | ||
/// </summary> | ||
public static Vector128<float> CompareUnordered(Vector128<float> left, Vector128<float> right) => CompareUnordered(left, right); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These files all contain a bunch of lines that are just whitespace... We should probably cleanup separately.
/// <summary> | ||
/// __m128 _mm_sqrt_ss (__m128 a) | ||
/// </summary> | ||
public static Vector128<float> SqrtScalar(Vector128<float> value) => SqrtScalar(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Sse2
form for double
takes two arguments, but we only take one here (this is matching the C/C++ intrinsics). Perhaps we should expose both or just the one that takes two arguments?
/// <summary> | ||
/// __m128d _mm_sqrt_sd (__m128d a, __m128d b) | ||
/// </summary> | ||
public static Vector128<double> SqrtScalar(Vector128<double> a, Vector128<double> b) => SqrtScalar(a, b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of calling a
and b
, upper
and value
respectively. Since b
is the value we perform the operation on and a
is the value we fill in the upper bits from. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to specifically fill the upper bits of the result in practice?
From the performance perspective, we always recommend using the same register as the source and upper argument.
Especially, if we decide to support the two-parameter version of Sqrt
intrinsic, on non-AVX machines, the compiler may have to insert unpack
or shuffle
instructions to implement this semantic, which they are both long latency instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In summary, I am suggesting that only expose the one-parameter intrinsic for SQRTSS and SQRTSD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just exposing the single intrinsic version is probably fine. I actually missed that the two operand form is only on AVX and above, the Intel Intrinsics Guide lists it as SSE2: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=Sqrt&techs=SSE,SSE2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Software Developers Manual lists the information correctly.
/// void _mm_store_ss (float* mem_addr, __m128 a) | ||
/// </summary> | ||
public static unsafe void StoreScalar(float* address, Vector128<float> source) => StoreScalar(address, source); | ||
|
||
/// <summary> | ||
/// __m128d _mm_sub_ps (__m128d a, __m128d b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: the files have several typos, such as this, where the wrong type is used in the intrinsic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. For these comment typos, we can fix them later that do not impact the CoreCLR/CoreFX interface.
@@ -352,6 +352,11 @@ public static class Avx2 | |||
/// </summary> | |||
public static Vector256<ulong> ConvertToVector256ULong(Vector128<uint> value) => ConvertToVector256ULong(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: This is an example of an API that doesn't follow the general .NET API naming conventions. It should probably be ConvertToVector256Int64
Thank you for pointing this out. If the .NET API convention always prefers |
Just as an FYI. The exact guideline I am referring to is Avoiding Language Specific Names |
/// __m128 _mm_round_ss (__m128 a, int rounding) | ||
/// _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC | ||
/// </summary> | ||
public static Vector128<float> RoundToNearestIntegerScalar(Vector128<float> value) => RoundToNearestIntegerScalar(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more consistent to use a RoundingMode immediate parameter here (similarly to comparisons for example). Then it would be 4 functions (Round/RoundScalar * float/double) that map directly to four machine instructions (roundpd/ps/sd/ss) instead of 20. The fully named helper functions could be defined on top of these basic instructions somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fiigii, thoughts? I was following the existing convention you had setup for the packed forms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR in discussions on Intrinsics API the consensus was to create direct mapping to processor instructions due to mutlitude of reasons. Therefore, I would avoid creating any APIs which do not map or omit any processor instructions. It means that if we have 3 argument scalar AVX or above instruction while having 2 argument SSE equivalent we should have both or here we should use immediate parameter for defining rounding mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the current design (encodes rounding mode into intrinsic names) has better static semantics.
For example, RoundingMode immediate parameter requires 1) const
parameter from language feature support to avoid non-literal values, 2) compile error reporting and runtime exception for invalid values from Roslyn/CoreCLR https://github.com/dotnet/corefx/issues/22940#issuecomment-320122766.
Each round just has a few pre-defined modes, so I thought this is a good opportunity to provide intrinsic with more stable runtime behaviors and friendly development experience. Meanwhile, it does not lose any flexibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could still implement rounding functions with an immediate parameter as private intrinsics and expose them through wrapper functions that just forward to the actual intrinsic if this simplifies the implementation.
// __m128 _mm_round_ss (__m128 a, int rounding)
private static Vector128<float> RoundScalar(Vector128<float> value, byte rounding) => RoundScalar(value, rounding);
// _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC
public static Vector128<float> RoundToNearestIntegerScalar(Vector128<float> value) => RoundScalar(value, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
// _MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC
public static Vector128<float> RoundToNegativeInfinityScalar(Vector128<float> value) => RoundScalar(value, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this simplifies the implementation.
No, the current runtime always expands all the APIs as intrinsics.
/// <summary> | ||
/// __m128d _mm_cvtss_sd (__m128d a, __m128 b) | ||
/// </summary> | ||
public static Vector128<double> ConvertToDoubleScalar(Vector128<double> a, Vector128<float> b) => ConvertToDoubleScalar(a, b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MOVD/MOVQ instructions are missing from here (and AVX/AVX2). I propose something like this:
// __m128i _mm_cvtsi32_si128 (int a)
public static Vector128<int> CopyInt32(int value);
// int _mm_cvtsi128_si32 ( __m128i a)
public static int CopyInt32(Vector128<int> value);
// __m128i _mm_cvtsi64_si128(__int64)
public static Vector128<long> CopyInt64(long value);
// __int64 _mm_cvtsi128_si64(__m128i)
public static long CopyInt64(Vector128<long> value);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Went with the existing ConvertTo
naming convention
I have not seen any conversion intrinsics which would allow converting from and to Half. Fact that we do not have Half support in CLR should not stop us form having intrinsics which would act just as an interface between CLR and other runtimes which support Half sized floating types. Additionally, despite it is not directly related to this PR, it could be very helpful to have support for 8bit floating/binary sized numbers as well. Related issues: |
@4creators, I believe the FP16C instructions haven't gone for design review yet. The initial set that has gone for review covers both packed and scalar (as of this PR) instructions for SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, AES, BMI1, BMI2, LZCNT, PCLMULQDQ, and POPCNT. The intrinsic sets that have not gone to review (based on those listed under the Intel Intrinsics Guide) are MMX, AVX-512, KNC, SVML, ADX, CLFLUSHOPT, CLWB, FP16C, FSGSBASE, FXSR, INVPCID, MONITOR, MPX, PREFETCHWT1, RDPID, RDRAND, RDSEED, RDTSCP, RTM, SHA, TSC, XSAVE, XSAVEC, XSAVEOPT, XSS A number of those intrinsic sets won't/shouldn't go to review because:
The others likely just need to be proposed and go up for review. For FP16C in particular, we at the very least need a |
@tannergooding Yep, you are right. Perhaps it's time to make both Half and missing intrinsics proposal :) |
@dotnet-bot test Windows_NT x86 Checked Innerloop Build and Test |
Updated the |
/// <summary> | ||
/// __int64 _mm_cvtsd_si64 (__m128d a) | ||
/// </summary> | ||
public static long ConvertToInt64(Vector128<double> value) => ConvertToInt64(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fiigii, For instructions like this, which have an additional encoding on x64, how do we want to expose them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an additional encoding on x64
Do you mean the 64-bit register in cvtsd2si r64, xmm
?
These intrinsics are only available in 64-bit mode, and calling them in 32-bit should throw PlatformNotSupportExeception
.
if (Sse2.IsSupported && Environment.Is64BitProcess)
{
ulong res = Sse2.ConvertToInt64(vec);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, that sounds good to me.
I was mostly just wanting to confirm we were exposing them in X86.Sse2
and not under some X64
specific sub-class
Should be ready for review. Still todo (but will be in separate PRs):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm no expert here, but this looks good to me.
@fiigii, is there a list of instructions/intrinsics that don't have corresponding managed APIs exposed yet? I have found, at the very least, |
I am pretty sure that this one (
Yes, I had. I remember that I did not expose legacy 64-bit SSE SIMD, scalar floating point, and certain comparison intrinsics. This is a good chance to complement the intrinsic APIs and fix other mistakes. Let me update my list based on you scalar intrinsic work. I will try my best to put them together tomorrow. @tannergooding Thanks again for your find and reminder. |
I went through the exposed ISAs manually and found the following "missing". MMX Interop Instructions (68)These are instructions that take or return Control Instructions (10)These may be useful for some scenarios but also have a good chance for breaking other code/assumptions. That being said, you could already do these with your own P/Invoke. I would vote to not expose these via Intrinsics and instead expose a general API (after appropriate review) to control this (which is
Helper Functions (26)These don't directly map to any specific instruction, but can be helpful in some scenarios. Not listed on the Intel Intrinsics Guide, but exposed by MSVC are other helper functions (such as Possibly worth further discussion.
Advanced Memory (6)These are instructions for advanced memory operations. Probably useful. Could they negatively impact the GC? Potentially worth further discussion
Missing (1)These instructions are actually missing and need to be added. I will add them in this PR
Unknown (16)Not quite sure what area some of these fall under.
|
/// __m128d _mm_cmpgt_sd (__m128d a, __m128d b) | ||
/// </summary> | ||
public static Vector128<double> CompareGreaterThanScalar(Vector128<double> left, Vector128<double> right) => CompareGreaterThanScalar(left, right); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you miss int _mm_comineq_sd (__m128d a, __m128d b)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And int _mm_ucomineq_sd (__m128d a, __m128d b)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up going with the bool Compare*OrderedScalar
and bool Compare*UnorderedScalar
naming convention for these operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks. But the comment seems incorrect _mm_comine_sd
-> _mm_comineq_sd
.
@@ -946,6 +1175,11 @@ public static class Sse2 | |||
/// </summary> | |||
public static Vector128<double> Subtract(Vector128<double> left, Vector128<double> right) => Subtract(left, right); | |||
|
|||
/// <summary> | |||
/// __m128d _mm_sub_ss (__m128d a, __m128d b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__m128d _mm_sub_sd (__m128d a, __m128d b)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix
I just checked my list that you have almost complemented the APIs. But I did not find __m128 _mm_max_ss (__m128 a, __m128 b);
__m128 _mm_min_ss (__m128 a, __m128 b); |
/// <summary> | ||
/// __m128 _mm_max_ss (__m128 a, __m128 b) | ||
/// </summary> | ||
public static Vector128<float> MaxScalar(Vector128<float> left, Vector128<float> right) => MaxScalar(left, right); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fiigii, MaxScalar
is here, MinScalar
is a few lines up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Github was hiding them.
@fiigii, could you comment on the instructions I have labeled as For the helper functions:
For the unknown functions:
It also wasn't immediately obvious as to when both signed and unsigned overloads should be provided for a given There were a couple other instructions in Sse2 where the Intrinsics Guide did not explicitly dictate signed vs unsigned but where only a single overload was provided (I haven't finished documenting them all yet). |
@tannergooding Sorry for the delay. I still need some time to investigate certain instructions.
We have generic versions:
Yes, and I think we should have helper functions as less as possible.
These intrinsics are usually used to get a uninitialized vector to avoid the init overhead. I do not think it makes sense in .NET/CLI semantics.
As I know, the codgen of
Good catch! Thanks, I should fix. |
They are signed comparison instructions, and comparing for equal does not need sign info.
The sign information is documented in Intel® 64 and IA-32 architectures software developer’s manual volume 2 |
Ok. I have now gone through all the exposed ISAs manually and ensured all scalar intrinsic instructions are exposed in this PR. @fiigii, do you think
I also found the following instructions to be missing from the
Finally, the following are exposed, but under a different ISA than the Intrinsic Guide lists them:
|
Yes, they just copy the first element, no zero/sign extension behavior. |
They are helper intrinsics, we have AVX and AVX2 codegen solution both. |
These helper intriniscs do not generate any code. We have the type |
We don't encourage using |
I am not sure the usefulness of streaming store with scalar types. Please give me more time. |
But we can complement the types to make the API simpler |
I've added the unsigned overloads that were missing and believe the PR is now ready for final review and merge. @fiigii, I have logged https://github.com/dotnet/corefx/issues/25926 to continue the discussion of the "missing" APIs. We can add them in one go, in a separate PR, after we determine which ones need to be added. |
@tannergooding Thank you so much for the work. |
@eerhardt, do I need another sign-off or am I good to merge? Also, do you want the CoreFX PR up before or after this change goes in? |
/// <summary> | ||
/// __m128d _mm_cmpneq_pd (__m128d a, __m128d b) | ||
/// </summary> | ||
public static Vector128<double> CompareNotEqual(Vector128<double> left, Vector128<double> right) => CompareNotEqual(left, right); | ||
|
||
/// <summary> | ||
/// int _mm_comine_sd (__m128d a, __m128d b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like to fix this comment to _mm_comineq_sd
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
/// <summary> | ||
/// __m128 _mm_cmpneq_ps (__m128 a, __m128 b) | ||
/// </summary> | ||
public static Vector128<float> CompareNotEqual(Vector128<float> left, Vector128<float> right) => CompareNotEqual(left, right); | ||
|
||
/// <summary> | ||
/// int _mm_comine_ss (__m128 a, __m128 b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
public static bool CompareNotEqualOrderedScalar(Vector128<float> left, Vector128<float> right) => CompareNotEqualOrderedScalar(left, right); | ||
|
||
/// <summary> | ||
/// int _mm_ucomine_ss (__m128 a, __m128 b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
public static bool CompareNotEqualOrderedScalar(Vector128<double> left, Vector128<double> right) => CompareNotEqualOrderedScalar(left, right); | ||
|
||
/// <summary> | ||
/// int _mm_ucomine_sd (__m128d a, __m128d b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
/// <summary> | ||
/// float _mm256_cvtss_f32 (__m256 a) | ||
/// </summary> | ||
public static float ConvertToSingle(Vector256<float> value) => ConvertToSingle(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you provide the helper functions that convert vector to float/double. Do we need helpers for float/double
-> Vector128<float/double>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean providing __m128 _mm_set_ss (float a)
in addition to __m128 _mm_load_ss (float const* mem_addr)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, SetScalar
sometimes can avoid memory access than LoadScalar
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍, will add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added Vector128<float> Sse.SetScalar(float value)
and Vector128<double> Sse2.SetScalar(double value)
CoreFX side of this PR is dotnet/corefx#26095 |
@tannergooding Feel free to merge this if it is ready to go. |
@jkotas, thanks! Will merge after the two pending jobs come back green (looks like they were kicked off with my last comment, so they are probably new). |
This is the start of https://github.com/dotnet/corefx/issues/23519.
FYI. @fiigii, @eerhardt, @ViktorHofer