Implement 64-bit-only hardware intrinsic #21264
Conversation
@CarolEidt @tannergooding the main JIT change is in the commit "Implement 64-bit-only intrinsic".
@jkotas @AndyAyersMS Could you please take a look at the JIT/EE interface change and its SuperPMI code (the first two commits)?
src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Sse2.cs
public static long ConvertToInt64(Vector128<double> value) => ConvertToInt64(value);

/// <summary>
/// __int64 _mm_cvtsi128_si64 (__m128i a)
/// MOVQ reg/m64, xmm
/// This intrinsic is only available in 64-bit processes
/// </summary>
[Intrinsic]
public static long ConvertToInt64(Vector128<long> value) => ConvertToInt64(value);
Ah, this is a comment mistake. This intrinsic generates cvtss2si.
Why would this intrinsic generate cvtss2si (which takes a scalar float and returns an integer)? The intrinsic is _mm_cvtsi128_si64, which takes a scalar integer and returns an integer, and generates the movq instruction as per: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvtsi128_si64&expand=1851
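The extraction semantics being described can be checked directly against the C++ intrinsic. A minimal sketch, where `LowLane` is a hypothetical wrapper name:

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Hypothetical wrapper around the C++ intrinsic the doc comment cites:
// _mm_cvtsi128_si64 reads the low 64-bit lane of the vector and, on
// x64 targets, compiles down to a single `movq r64, xmm`.
int64_t LowLane(__m128i value)
{
    return _mm_cvtsi128_si64(value);
}
```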
Oh, my mistake, you are talking about ConvertToInt64(Vector128<long> value), not ConvertToInt64(Vector128<double> value).

If we match C++, this should be a 64-bit-only intrinsic, as C++ generates movq r64, xmm.
I think this is a case where it would be fine, on x86 machines, to just force the store to memory. On x64 machines, we can do whichever is more appropriate (store to memory or to register).
We already need to provide the ToScalar functionality on x86 anyway (via the helper intrinsics), so it would just make that simpler overall.
Yes, exactly. ToScalar and GetElement are implemented via Unsafe on unsupported platforms, so we should optimize Unsafe CQ rather than complicate the whole JIT again for ConvertToInt64.
As I say, I don't have a strong opinion, but I'm confused about the complexity. As @tannergooding points out, we already have the ToScalar helper function. What am I missing that would cause this to be more complex?

And in any case, if it is equivalent to ToScalar then do we need this intrinsic at all?
What am I missing that would cause this to be more complex?

As far as I can tell, ToScalar has a "software" fallback. ConvertToInt64 is supposed to be an intrinsic, so it should not have a fallback.

And in any case, if it is equivalent to ToScalar then do we need this intrinsic at all?

Looks to me that this whole intrinsic affair morphed from exposing the relevant instructions as they are, without "added value", into a mish-mash of actual intrinsics and various helpers with software fallbacks. Whatever. But if this results in unnecessary complexity in the JIT then I'd say that it's a bad affair.
we already have the ToScalar helper function. What am I missing that would cause this to be more complex?

ToScalar is a platform-agnostic helper that does not guarantee the optimal codegen now. @tannergooding is suggesting generating movq m64, xmm for ConvertToInt64 on 32-bit platforms.

Of course, we can implement ToScalar via ConvertToInt64, but not the inverse.
Right. ToScalar is a helper function that, outside of float and double (which can have special semantics due to staying in the same register), will be implemented in software, in terms of calling the existing intrinsics.
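The store-to-memory path suggested above can be sketched in C++ terms. This is a hypothetical illustration (`ToScalarFallback` is not a real API; the managed fallback actually goes through Unsafe):

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Hypothetical sketch of a "software" ToScalar: spill the vector to
// memory and read back the low 8 bytes, which is what a
// `movq m64, xmm` store-to-memory path amounts to on 32-bit targets.
int64_t ToScalarFallback(__m128i value)
{
    alignas(16) int64_t lanes[2];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), value);
    return lanes[0];
}
```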
public static long ConvertToInt64(Vector128<long> value) => ConvertToInt64(value);

/// <summary>
/// __int64 _mm_cvtsi128_si64 (__m128i a)
/// MOVQ reg/m64, xmm
/// This intrinsic is only available in 64-bit processes
/// </summary>
[Intrinsic]
public static ulong ConvertToUInt64(Vector128<ulong> value) => ConvertToUInt64(value);
Same with this and other instructions that use movq
Can you link this PR to the "spec" that inspired this change?
The JIT changes LGTM, though I'd like to see us resolve the issues about the intrinsics that are moving between register files.
Addressed feedback. @jkotas @AndyAyersMS @sandreenko @tannergooding PTAL
@dotnet-bot test this please
key.ftn = (DWORDLONG)ftn;
key.className = (moduleName != nullptr);
key.namespaceName = (namespaceName != nullptr);
key.enclosingClassName = (enclosingClassName != nullptr);
I'm not quite sure why we (probably me) implemented things this way.
I have always thought that, for interface methods with optional out parameters, SPMI recording should assume the most general case and pass non-null internally for all those out params, then record all the values.
And on replay give back only what the caller asked for.
If we do this, then the key would only involve ftn, not whether or not the optional out params were also asked for.
In practice it may not matter as we probably always ask for all the results.
@sandreenko thoughts?
There are many cases where we do it like this (the key contains a bool for each optional param that shows whether it was asked for). I do not see an issue here. It can increase the map size a bit, but it can also protect us from collisions that happen because not all JIT-EE interface functions are clear.
so LGTM.
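The flag-per-out-param key being discussed might look roughly like this. A hypothetical sketch with invented field names; the real SuperPMI key layout lives in the methodcontext code:

```cpp
#include <cstdint>

// Hypothetical sketch: a SuperPMI-style lookup key that records, for each
// optional out parameter, whether the caller asked for it, so queries with
// different out-parameter sets never collide in the recorded map.
struct GetMethodNameKey
{
    uint64_t ftn;                 // method handle
    bool wantsClassName;          // moduleName != nullptr
    bool wantsNamespaceName;      // namespaceName != nullptr
    bool wantsEnclosingClassName; // enclosingClassName != nullptr

    bool operator==(const GetMethodNameKey& other) const
    {
        return ftn == other.ftn && wantsClassName == other.wantsClassName &&
               wantsNamespaceName == other.wantsNamespaceName &&
               wantsEnclosingClassName == other.wantsEnclosingClassName;
    }
};
```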
@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test
IfFailThrow(GetMDImport()->GetNameOfTypeDef(bmtInternal->pType->GetEnclosingTypeToken(), NULL, &nameSpace));
}

if (hr == S_OK && (strcmp(nameSpace, "System.Runtime.Intrinsics.X86") == 0))
Changed to get the enclosing class's namespace and unify the condition.
@AndyAyersMS @tannergooding @jkotas
@dotnet-bot test Windows_NT x64 Checked jitsse2only
@dotnet-bot test Windows_NT x86 Checked jitsse2only
@dotnet-bot test Ubuntu x64 Checked jitsse2only
/// <summary>
/// __int64 _mm_popcnt_u64 (unsigned __int64 a)
/// POPCNT reg64, reg/m64
/// This intrinsic is only available in 64-bit processes
/// </summary>
public static ulong PopCount(ulong value) => PopCount(value);
public static ulong PopCount(ulong value) { throw new PlatformNotSupportedException(); }
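For reference, the semantics of the 64-bit POPCNT the doc comment cites can be mimicked portably. This sketch uses the GCC/Clang builtin rather than `_mm_popcnt_u64`, so it runs without POPCNT compile flags; `PopCount64` is a hypothetical name:

```cpp
#include <cstdint>

// Portable stand-in for `POPCNT reg64, reg/m64`: count the set bits of a
// 64-bit value. Uses the GCC/Clang builtin instead of _mm_popcnt_u64.
uint64_t PopCount64(uint64_t value)
{
    return static_cast<uint64_t>(__builtin_popcountll(value));
}
```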
Fixed a mistake from previous PRs.
CI gets all green. Can the PR get approved?
@@ -1398,6 +1406,16 @@ void CodeGen::genSSEIntrinsic(GenTreeHWIntrinsic* node)
break;
}

case NI_SSE_X64_ConvertScalarToVector128Single:
Can't this just be grouped with NI_SSE2_X64_ConvertScalarToVector128Double? They look to have the same implementation.
I remember that we discussed merging genSSEIntrinsic and genSSE2Intrinsic (we already did for AVX and AVX2). Let me do it in the next PR.
Was the intrinsic previously misnamed? It looks like it used to be NI_SSE2_ConvertScalrToVector128Single: https://github.com/dotnet/coreclr/pull/21264/files#diff-847442c7efdf7a78e78711bea30fd4b2L1570
We have both. I modified the table so that SSE2 ConvertScalarToVector128Single could be table-driven: https://github.com/dotnet/coreclr/pull/21264/files#diff-607364c1c16999989209b5ee9773829aR213

// Sse
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, int value);
// Sse.X64
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, long value);
// Sse2
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, Vector128<double> value);
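On a 64-bit target, the Sse.X64 overload corresponds to the C++ intrinsic `_mm_cvtsi64_ss`. A minimal sketch, where `InsertInt64AsFloat` is a hypothetical wrapper name:

```cpp
#include <xmmintrin.h> // SSE

// Hypothetical wrapper: _mm_cvtsi64_ss (CVTSI2SS xmm, r64; 64-bit only)
// converts the integer to float in the lowest lane and keeps the upper
// three lanes of `upper` unchanged.
__m128 InsertInt64AsFloat(__m128 upper, long long value)
{
    return _mm_cvtsi64_ss(upper, value);
}
```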
src/jit/hwintrinsicxarch.cpp
if (className[0] == 'A')
if (strcmp(className, "X64") == 0)
{
    assert(enclosingClassName != nullptr);
Nit: I find the partially recursive calls like this hard to read. I would change the signature of lookupIsa to just lookupIsa(const char* className), and the callsite to:
if (strcmp(className, "X64") == 0)
{
isa = X64VersionOfIsa(lookupIsa(enclosingClassName));
}
else
{
isa = lookupIsa(className);
}
Thanks, done by splitting the code into two functions.
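The split might look roughly like this. A hypothetical sketch trimmed to a single ISA; the real lookup covers every nested class and intrinsic table entry:

```cpp
#include <cstring>

// Hypothetical sketch of the two-function split suggested above:
// lookupInstructionSet resolves a plain class name, and lookupIsa maps a
// nested "X64" class through its enclosing class before applying the
// 64-bit variant.
enum InstructionSet
{
    InstructionSet_SSE,
    InstructionSet_SSE_X64,
    InstructionSet_ILLEGAL
};

static InstructionSet lookupInstructionSet(const char* className)
{
    return (strcmp(className, "Sse") == 0) ? InstructionSet_SSE : InstructionSet_ILLEGAL;
}

static InstructionSet X64VersionOfIsa(InstructionSet isa)
{
    return (isa == InstructionSet_SSE) ? InstructionSet_SSE_X64 : InstructionSet_ILLEGAL;
}

InstructionSet lookupIsa(const char* className, const char* enclosingClassName)
{
    if (strcmp(className, "X64") == 0)
    {
        return X64VersionOfIsa(lookupInstructionSet(enclosingClassName));
    }
    return lookupInstructionSet(className);
}
```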
@dotnet-bot test this please
CoreFX API surface change is at dotnet/corefx#33805
@@ -141,7 +141,7 @@ namespace JIT.HardwareIntrinsics.X86
_dataTable = new SimdScalarUnaryOpTest__DataTable<{Op1BaseType}>(_data, LargestVectorSize);
}

public bool IsSupported => {Isa}.IsSupported && (Environment.Is64BitProcess || ((typeof({RetBaseType}) != typeof(long)) && (typeof({RetBaseType}) != typeof(ulong))));
Did we update all the templates that had this check?
Ah, a few templates still have this check. Can I update it in the next PR?
Can I update it in the next PR?
And I will move more tests to the template framework.
Yes, that would be great. Thanks!
HWIntrinsic changes (in the JIT and the tests) LGTM
@CarolEidt, was all of your feedback resolved? I'm not quite sure what you meant by:
So I thought I'd double-check before merging.
As I see it, we are trying to achieve a balance between sometimes-competing objectives such as:
I'm sure that there are a number of places where this balance is off, but I think we need to be mindful about this balance, so as not to optimize one aspect to the detriment of others.
What I meant by this was that I'd like to see some consensus specifically about how we approach an intrinsic like this. That said, I'm fine with merging this as-is.
LGTM
Thank you all for reviewing 😄
Is there any way to reference the relevant
@glenn-slayden No, .NET Framework doesn't support HW intrinsics.
Implement 64-bit-only hardware intrinsic Commit migrated from dotnet/coreclr@a089f64
This PR implements 64-bit-only hardware intrinsics that are exposed in X64 nested classes. In the next PR, I will implement the remaining BMI1/2 intrinsics and move more tests to the template framework.
Implements #20146 in the JIT and closes https://github.com/dotnet/coreclr/issues/21042.