Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine how to expose hwintrinsics that are only supported in 64-bit mode #10617

Closed
tannergooding opened this issue Jul 2, 2018 · 7 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI design-discussion Ongoing discussion about design without consensus
Milestone

Comments

@tannergooding
Copy link
Member

There are several hwintrinsics exposed that are only emittable in 64-bit mode and throw a PNSE exception if invoked in 32-bit mode (regardless of the fact that the general Isa.IsSupported check returns true).

We should have a deeper discussion on how to properly expose this data to the consumer of these APIs.

@tannergooding
Copy link
Member Author

FYI. @CarolEidt, @fiigii, @eerhardt, @terrajobst

This tracks the discussion raised here: dotnet/coreclr#18734 (comment)

@pentp
Copy link
Contributor

pentp commented Jul 10, 2018

At least ConvertScalarToVector128(U)Int64 and ConvertTo(U)Int64 should be made to work in 32-bit mode. I have tried to use intrinsics in some Decimal calculations and ended up with many #if blocks like this:

#if BIT64
  var x1 = Sse2.ConvertScalarToVector128UInt64(pdecR.Low64);
#else
  var x1 = Sse2.LoadScalarVector128((ulong*)Unsafe.AsPointer(ref Unsafe.AddByteOffset(ref pdecR, (IntPtr)8)));
#endif

The JIT is better positioned to choose an optimal solution than user code - it can load directly from memory with MOVQ or spill the low+high registers and then use MOVQ or use 2xMOVD + PUNPCKLDQ or use MOVD + PINSRD (with SSE4.1). For the reverse it can choose between storing directly to memory with MOVQ or to registers with MOVD + PEXTRD (with SSE4.1).

@tannergooding
Copy link
Member Author

At least ConvertScalarToVector128(U)Int64 and ConvertTo(U)Int64 should be made to work in 32-bit mode.

The general principle of the intrinsics so far is that they are not emulated, ever. This simplifies the implementation overall and provides some very nice guarantees about what the intrinsics will do. The downside is that the consumer ends up having to do a bit more, but I believe that can/will be alleviated by having some separate "helper" library. Such a library could do things like:

  • abstract away ARM vs x86 differences
  • abstract away 32-bit vs 64-bit differences
  • providing wrappers around common functionality
    • Such as providing a Permute wrapper that does a shuffle on SSE hardware and a Permute on AVX hardware
  • etc

@pentp
Copy link
Contributor

pentp commented Jul 10, 2018

These proposed libraries would be sub-optimal because a long is internally two int registers or a memory location, but managed code can't access these registers directly nor affect register allocation/spilling and has to either always spill or always load in two parts. If fallbacks for intrinsics are undesirable, then at least there could be a helper intrinsic in Sse2 class.

@tannergooding
Copy link
Member Author

but managed code can't access these registers directly nor affect register allocation/spilling and has to either always spill or always load in two parts

This is something that is much more general than just hardware intrinsics. If there is an issue here, it is probably worth addressing more generally (or via some separate intrinsic).

@CarolEidt
Copy link
Contributor

There is a general issue of the JIT not generally being able to determine whether it is best to use a value from memory, the "natural" register type for the value (generally 2 int regs for a long), or an xmm register. As @tannergooding points out, this is a general issue, though certainly of greater impact for code like this.

As a design issue, I think it's clearly best to keep the intrinsics as true to the hw as possible. Then, as we identify scenarios (esp. scenarios that can be illustrated to have some impact on real-world-code) that aren't well addressed, we can determine how best to get the desired performance. Creating more and more intrinsics doesn't feel like the right answer to me, but we'd clearly have to do some more analysis first.

@tannergooding
Copy link
Member Author

We discuseed and resolved this as part of dotnet/corefx#32721

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI design-discussion Ongoing discussion about design without consensus
Projects
None yet
Development

No branches or pull requests

4 participants