This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Implement 64-bit-only hardware intrinsic #21264

Merged
merged 6 commits into from
Dec 3, 2018

@fiigii fiigii commented Nov 29, 2018

This PR implements the 64-bit-only hardware intrinsics that are exposed in X64 nested classes.

In the next PR, I will implement the remaining BMI1/2 intrinsics and move more tests to the template framework.

Implements #20146 in the JIT and closes https://github.com/dotnet/coreclr/issues/21042.

Author

fiigii commented Nov 29, 2018

@CarolEidt @tannergooding the main JIT change is in the commit "Implement 64-bit-only intrinsic".

Author

fiigii commented Nov 29, 2018

@jkotas @AndyAyersMS Could you please take a look at the JIT/EE interface change and its superPMI code (the first two commits)?

public static long ConvertToInt64(Vector128<double> value) => ConvertToInt64(value);
/// <summary>
/// __int64 _mm_cvtsi128_si64 (__m128i a)
/// MOVQ reg/m64, xmm
/// This intrinsic is only available on 64-bit processes
/// </summary>
[Intrinsic]
public static long ConvertToInt64(Vector128<long> value) => ConvertToInt64(value);
Member

This one shouldn't be x64 specific.

We can still emit the standalone MOVQ instruction (which requires the result to be stored in memory).

This is different from the movd instruction, which has a movq overload that is 64-bit only.

Author

@fiigii fiigii Nov 29, 2018

Ah, this is a mistake in the comment. This intrinsic generates cvtss2si.

Member

Why would this intrinsic generate cvtss2si (which takes a scalar float and returns an integer)?

The intrinsic is _mm_cvtsi128_si64 which takes a scalar integer and returns an integer, which generates the movq instruction as per: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvtsi128_si64&expand=1851

Author

@fiigii fiigii Nov 29, 2018

Oh, my mistake, you are talking about ConvertToInt64(Vector128<long> value), not ConvertToInt64(Vector128<double> value).

If we match C++, this should be a 64-bit-only intrinsic, as C++ generates movq r64, xmm.

Member

I think this is a case where it would be fine, on x86 machines, to just force the store to memory. On x64 machines, we can do whichever is more appropriate (store to memory or to register).

We already need to provide the ToScalar functionality on x86 anyway (via the helper intrinsics), so it would just make that simpler overall.
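
For readers who want to see the two code paths side by side, here is a minimal C++ sketch of the idea discussed above: use the register-to-register form on 64-bit targets and fall back to the store-to-memory form elsewhere. The helper name `ExtractLowInt64` is hypothetical and not part of this PR; only the `_mm_cvtsi128_si64` and `_mm_storel_epi64` intrinsics from `<emmintrin.h>` are real. It assumes an x86/x64 compiler with SSE2 enabled.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper (not from the PR): returns the low 64 bits of a vector.
inline std::int64_t ExtractLowInt64(__m128i value)
{
#if defined(__x86_64__) || defined(_M_X64)
    // 64-bit target: the value can move directly from an XMM register
    // to a general-purpose register.
    return _mm_cvtsi128_si64(value);
#else
    // 32-bit target: no 64-bit general-purpose register is available,
    // so store the low 64 bits to memory and read them back.
    std::int64_t result;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&result), value);
    return result;
#endif
}
```

Both branches yield the same value; the open question in this thread is whether the managed ConvertToInt64 API should expose the 32-bit path at all.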

Author

Yes, exactly. ToScalar and GetElement are implemented via Unsafe on unsupported platforms, so we should optimize Unsafe CQ rather than complicate the whole JIT again for ConvertToInt64.


As I say, I don't have a strong opinion, but I'm confused about the complexity. As @tannergooding points out, we already have the ToScalar helper function. What am I missing that would cause this to be more complex?
And in any case, if it is equivalent to ToScalar then do we need this intrinsic at all?


> What am I missing that would cause this to be more complex?

As far as I can tell, ToScalar has a "software" fallback. ConvertToInt64 is supposed to be an intrinsic, so it should not have a fallback.

> And in any case, if it is equivalent to ToScalar then do we need this intrinsic at all?

Looks to me that all this intrinsic affair morphed from exposing the relevant instructions as they are, without "added value" to a mish-mash of actual intrinsic and various helpers with software fallback. Whatever. But if this results in unnecessary complexity in the JIT then I'd say that it's a bad affair.

Author

> we already have the ToScalar helper function. What am I missing that would cause this to be more complex?

ToScalar is a platform-agnostic helper that does not currently guarantee optimal codegen. @tannergooding is suggesting generating movq m64, xmm for ConvertToInt64 on 32-bit platforms.
Of course, we can implement ToScalar via ConvertToInt64, but not the inverse.

Member

Right. ToScalar is a helper function that, outside of float and double (which can have special semantics due to staying in the same register) will be implemented in software and in terms of calling the existing intrinsics.

public static long ConvertToInt64(Vector128<long> value) => ConvertToInt64(value);

/// <summary>
/// __int64 _mm_cvtsi128_si64 (__m128i a)
/// MOVQ reg/m64, xmm
/// This intrinsic is only available on 64-bit processes
/// </summary>
[Intrinsic]
public static ulong ConvertToUInt64(Vector128<ulong> value) => ConvertToUInt64(value);
Member

Same with this and other instructions that use movq

Member

@AndyAyersMS AndyAyersMS left a comment

Can you link this PR to the "spec" that inspired this change?

@CarolEidt CarolEidt left a comment

The JIT changes LGTM, though I'd like to see us resolve the issues about the intrinsics that are moving between register files.

Author

fiigii commented Nov 29, 2018

Addressed feedback, @jkotas @AndyAyersMS @sandreenko @tannergooding PTAL

Author

fiigii commented Nov 30, 2018

@dotnet-bot test this please

key.ftn = (DWORDLONG)ftn;
key.className = (moduleName != nullptr);
key.namespaceName = (namespaceName != nullptr);
key.enclosingClassName = (enclosingClassName != nullptr);
Member

I'm not quite sure why we (probably me) implemented things this way.

I always have thought that for interface methods with optional out parameters that SPMI recording should assume the most general case and pass non-null internally for all those out params. Then record all the values.

And on replay give back only what the caller asked for.

If we do this, then the key would only involve ftn, not whether or not the optional out params were also asked for.

In practice it may not matter as we probably always ask for all the results.

@sandreenko thoughts?


There are many cases where we do it like this (the key contains a bool for each optional param that shows whether it was asked for). I do not see an issue here. It can increase the map size a bit, but it can also protect us from collisions that happen because not all JitEEInterface functions are clear.

So LGTM.
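
To make the trade-off in this thread concrete, here is a small C++ sketch of a record/replay map whose key includes one flag per optional out parameter. Every type and member name below (`MethodNameKey`, `MethodNameValue`, `MethodNameRecorder`) is a hypothetical simplification for illustration and is not the actual SuperPMI code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Hypothetical key: the method handle plus one flag per optional out parameter.
struct MethodNameKey
{
    std::uint64_t ftn;
    bool          wantsClassName;
    bool          wantsNamespaceName;
    bool          wantsEnclosingClassName;

    bool operator<(const MethodNameKey& other) const
    {
        return std::tie(ftn, wantsClassName, wantsNamespaceName, wantsEnclosingClassName) <
               std::tie(other.ftn, other.wantsClassName, other.wantsNamespaceName,
                        other.wantsEnclosingClassName);
    }
};

// Hypothetical recorded result for one call.
struct MethodNameValue
{
    std::string methodName;
    std::string className;
    std::string namespaceName;
    std::string enclosingClassName;
};

// Records results at collection time and hands them back at replay time.
class MethodNameRecorder
{
public:
    void Record(const MethodNameKey& key, const MethodNameValue& value)
    {
        m_map[key] = value;
    }

    // Returns nullptr when this exact call shape was never recorded.
    const MethodNameValue* Replay(const MethodNameKey& key) const
    {
        auto it = m_map.find(key);
        return (it == m_map.end()) ? nullptr : &it->second;
    }

    std::size_t Size() const
    {
        return m_map.size();
    }

private:
    std::map<MethodNameKey, MethodNameValue> m_map;
};
```

With this shape, the same `ftn` queried with different sets of out parameters produces separate entries, which is the "map size" cost mentioned above. Keying on `ftn` alone, as the alternative suggests, would need the recorder to always capture every out value and trim the answer on replay.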

Author

fiigii commented Nov 30, 2018

@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu16.04 arm64 Cross Checked Innerloop Build and Test

IfFailThrow(GetMDImport()->GetNameOfTypeDef(bmtInternal->pType->GetEnclosingTypeToken(), NULL, &nameSpace));
}

if (hr == S_OK && (strcmp(nameSpace, "System.Runtime.Intrinsics.X86") == 0))
Author

Changed this to get the enclosing class's namespace and unify the condition.
@AndyAyersMS @tannergooding @jkotas
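
As a rough illustration of the unified condition, the sketch below checks the namespace of the enclosing type when a type is nested, and the type's own namespace otherwise. `SketchType` and `IsX86IntrinsicType` are hypothetical names, and the assumption that a nested type's own namespace is empty is a simplification made for this example rather than something taken from the PR.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, simplified stand-in for a type's metadata.
struct SketchType
{
    const char*       nameSpace; // namespace stored for this type ("" for nested types here)
    const SketchType* enclosing; // enclosing type, or nullptr for a top-level type
};

// Returns true when the type, or the type it is nested in, lives in the
// x86 hardware-intrinsics namespace.
inline bool IsX86IntrinsicType(const SketchType& type)
{
    // For a nested class such as Sse2.X64, consult the enclosing class.
    const SketchType& source = (type.enclosing != nullptr) ? *type.enclosing : type;
    return std::strcmp(source.nameSpace, "System.Runtime.Intrinsics.X86") == 0;
}
```

A single comparison then covers both the top-level ISA classes and their nested X64 classes.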

Author

fiigii commented Nov 30, 2018

@dotnet-bot test Windows_NT x64 Checked jitsse2only
@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic

@dotnet-bot test Windows_NT x86 Checked jitsse2only
@dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x86 Checked jitnox86hwintrinsic

@dotnet-bot test Ubuntu x64 Checked jitsse2only
@dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic

/// <summary>
/// __int64 _mm_popcnt_u64 (unsigned __int64 a)
/// POPCNT reg64, reg/m64
/// This intrinsic is only available on 64-bit processes
/// </summary>
public static ulong PopCount(ulong value) => PopCount(value);
public static ulong PopCount(ulong value) { throw new PlatformNotSupportedException(); }
Author

Fixed a mistake from previous PRs.

Author

fiigii commented Dec 1, 2018

CI is all green. Can the PR get approved?

@@ -1398,6 +1406,16 @@ void CodeGen::genSSEIntrinsic(GenTreeHWIntrinsic* node)
break;
}

case NI_SSE_X64_ConvertScalarToVector128Single:
Member

Can't this just be grouped with NI_SSE2_X64_ConvertScalarToVector128Double? They look to have the same implementation.

Author

I remember that we discussed merging genSSEIntrinsic and genSSE2Intrinsic (we already did this for AVX and AVX2). Let me do it in the next PR.

Member

Was the intrinsic previously misnamed? It looks like it used to be NI_SSE2_ConvertScalarToVector128Single: https://github.com/dotnet/coreclr/pull/21264/files#diff-847442c7efdf7a78e78711bea30fd4b2L1570

Author

We have both. I modified the table so that SSE2 ConvertScalarToVector128Single could be table-driven https://github.com/dotnet/coreclr/pull/21264/files#diff-607364c1c16999989209b5ee9773829aR213

// Sse
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, int value);

// Sse.X64
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, long value);

// Sse2
public static Vector128<float> ConvertScalarToVector128Single(Vector128<float> upper, Vector128<double> value);

if (className[0] == 'A')
if (strcmp(className, "X64") == 0)
{
assert(enclosingClassName != nullptr);
Member

@jkotas jkotas Dec 1, 2018

Nit: I find the partially recursive calls like this hard to read. I would change the signature of lookupIsa to just lookupIsa(const char* className), and the callsite to:

if (strcmp(className, "X64") == 0)
{
    isa = X64VersionOfIsa(lookupIsa(enclosingClassName));
}
else
{
    isa = lookupIsa(className);
}

Author

Thanks, done by splitting the code into two functions.
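
A self-contained C++ sketch of the two-function shape suggested above might look like the following. The `Isa` enum, its members, and the function names are hypothetical placeholders chosen for this example and do not reflect the JIT's actual instruction-set enumeration or the code that was merged.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical ISA identifiers, for illustration only.
enum class Isa
{
    Unknown,
    SSE,
    SSE2,
    SSE_X64,
    SSE2_X64
};

// Maps a top-level class name to its ISA; no recursion involved.
inline Isa LookupIsa(const char* className)
{
    if (std::strcmp(className, "Sse") == 0)
    {
        return Isa::SSE;
    }
    if (std::strcmp(className, "Sse2") == 0)
    {
        return Isa::SSE2;
    }
    return Isa::Unknown;
}

// Maps an ISA to the ISA of its nested X64 class.
inline Isa X64VersionOfIsa(Isa isa)
{
    switch (isa)
    {
        case Isa::SSE:
            return Isa::SSE_X64;
        case Isa::SSE2:
            return Isa::SSE2_X64;
        default:
            return Isa::Unknown;
    }
}

// Call site: a nested "X64" class is resolved through its enclosing class.
inline Isa LookupIsaForClass(const char* className, const char* enclosingClassName)
{
    if (std::strcmp(className, "X64") == 0)
    {
        assert(enclosingClassName != nullptr);
        return X64VersionOfIsa(LookupIsa(enclosingClassName));
    }
    return LookupIsa(className);
}
```

Keeping the name lookup and the X64 mapping as separate functions means neither one calls itself, which addresses the readability concern raised in the review.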

Member

jkotas commented Dec 1, 2018

getMethodNameFromMetadata and related code looks good to me.

Author

fiigii commented Dec 1, 2018

@dotnet-bot test this please

Author

fiigii commented Dec 3, 2018

CoreFX API surface change is at dotnet/corefx#33805

@@ -141,7 +141,7 @@ namespace JIT.HardwareIntrinsics.X86
_dataTable = new SimdScalarUnaryOpTest__DataTable<{Op1BaseType}>(_data, LargestVectorSize);
}

public bool IsSupported => {Isa}.IsSupported && (Environment.Is64BitProcess || ((typeof({RetBaseType}) != typeof(long)) && (typeof({RetBaseType}) != typeof(ulong))));
Member

Did we update all the templates that had this check?

Author

Ah, a few templates still have this check. Can I update it in the next PR?

Author

> Can I update it in the next PR?

And will move more tests to the template framework.

Member

Yes, that would be great. Thanks!

Member

@tannergooding tannergooding left a comment

HWIntrinsic changes (in the JIT and the tests) LGTM

@tannergooding
Member

@CarolEidt, was all of your feedback resolved?

I'm not quite sure what you meant by:

> though I'd like to see us resolve the issues about the intrinsics that are moving between register files.

So I thought I'd double-check before merging.

@CarolEidt

> all this intrinsic affair morphed from exposing the relevant instructions as they are, without "added value" to a mish-mash of actual intrinsic and various helpers with software fallback.

As I see it, we are trying to achieve a balance between sometimes-competing objectives such as:

  • "exposing the relevant instructions as they are",
  • ensuring good code generation (e.g. for places where we'd like to contain the memory op), and
  • making the API as usable as possible.

I'm sure that there are a number of places where this balance is off, but I think we need to be mindful about this balance, so as not to optimize one aspect to the detriment of others.

> I'd like to see us resolve the issues about the intrinsics that are moving between register files.

What I meant by this was that I'd like to see some consensus specifically about how we approach an intrinsic like ConvertToInt64 that has different constraints on different targets (i.e., is available with a memory destination on 32-bit systems). It doesn't seem to me that we closed on this, and perhaps we need some better articulation of the API guidelines.

That said, I'm fine with merging this as-is.

@CarolEidt CarolEidt left a comment

LGTM

@CarolEidt CarolEidt merged commit a089f64 into dotnet:master Dec 3, 2018
@fiigii fiigii deleted the x64only branch December 3, 2018 20:53
Author

fiigii commented Dec 3, 2018

Thank you all for reviewing 😄

@glenn-slayden

Is there any way to reference the relevant System.Private.CoreLib.dll library in order to invoke these intrinsics from .NET Framework 4.7.2 (i.e., "desktop") x64? If not, is such a backport planned, and if so, what would be the timeframe?

Author

fiigii commented Jan 2, 2019

@glenn-slayden No, .NET Framework doesn't support HW intrinsics.

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
 Implement 64-bit-only hardware intrinsic

Commit migrated from dotnet/coreclr@a089f64
Development

Successfully merging this pull request may close these issues.

Update JIT/EE interface to get the enclosing class name from metadata
9 participants