Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sse2.Set[All]Vector128([u]long, [u]long) crashes with PNSE when run in 32-bit process #10453

Closed
voinokin opened this issue Jun 5, 2018 · 13 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI question Answer questions and provide assistance, not an issue with source code or documentation.
Milestone

Comments

@voinokin
Copy link

voinokin commented Jun 5, 2018

My guess is - these method have managed implementation behind them involving ConvertScalarToVector128[U]Int64() (MOVQ xmm, r64) unavailable in 32-bit mode.
Since these are all just helper methods not necessarily mapped to some specific HW intrinsic directly, my understanding is that the implementation should behave considering current process bitness - eg. use set of 32-bit HW intrinsics to setup the result when in 32-bit mode.

@tannergooding
Copy link
Member

ConvertScalarToVector128 is not considered one of the helper methods and directly maps to the MOVQ instruction.

However, it is also one of the Int64 intrinsics that can work on 32-bit, since there is a movq xmm, m64 encoding available there. It would require us to ensure that op2 is always spilled.

@CarolEidt, thoughts?

@tannergooding
Copy link
Member

Also, for reference, the helper methods are currently:

  • SSE.ConvertToSingle
  • SSE.SetScalarVector128
  • SSE.SetZeroVector128
  • SSE.StaticCast
  • SSE2.ConvertToDouble
  • SSE2.SetScalarVector128
  • SSE2.SetZeroVector128
  • AVX.ExtendToVector256
  • AVX.GetLowerHalf
  • AVX.SetVector256
  • AVX.SetHighLow
  • AVX.SetAllVector256
  • AVX.SetZeroVector256
  • AVX.StaticCast

@voinokin
Copy link
Author

voinokin commented Jun 5, 2018

@tannergooding - please check the subject of the issue, it tells exactly about SetAllVector128() and SetVector128() for 64-bit values. For some reason these two are not on your list.

@tannergooding
Copy link
Member

please check the subject of the issue, it tells exactly about SetAllVector128() and SetVector128() for 64-bit values

@voinokin, right. The failure is because ConvertScalarToVector128Int64 isn't aware that it can still emit on 32-bit architecture if it always uses the movq xmm, m64 encoding. Given that it isn't one of the helper methods, we'll want to decide if we want to:

  • Always emit the movq xmm, m64 encoding on 32-bit
    -or-
  • Modify SetVector128 to have a different implementation on 32-bit

(I would imagine the former).

For some reason these two are not on your list.

Looks like my list is missing the ones implemented in managed code. I'll update it.

@tannergooding
Copy link
Member

tannergooding commented Jun 5, 2018

For Modify SetVector128 to have a different implementation on 32-bit, the change would basically be to just call LoadScalarVector128(long*) instead, as that will have to use the movq xmm, m64 encoding.

@voinokin
Copy link
Author

voinokin commented Jun 5, 2018

Taking a note on availability on MOVQ xmm, m64, it appears more intrinsics for operations are missing in API, the ones that work on scalar value in memory and I can't confirm (and rest assured :-) ) they are handled by containment support you're implementing.

Here they are at least for SSE, can't tell for AVX since it's incomplete anyway:

I list these here to not create new issue record.
I hope I really found them all...

@tannergooding
Copy link
Member

hey are handled by containment support you're implementing.

For the most part, yes. The containment support basically allows us to take a:

movaps op1Reg, [mem]
ins targetReg, op1Reg
movaps [mem], targetReg

and convert it into either

ins targetReg, [mem]
movaps [mem], targetReg

or

movaps op1Reg, [mem]
ins [mem], op1Reg

Depending on what the instruction supports. For example, Insert falls into the first and Extract falls into the latter.

@tannergooding
Copy link
Member

There are still some cases where the API shape doesn't exactly match the semantics of the underlying instruction, however (such as Sse2.Insert(Vector128, Single) taking a Single instead of another Vector128, since the underlying instruction lets you pick a given index from the second operand).

@voinokin
Copy link
Author

voinokin commented Jun 5, 2018

Modify SetVector128 to have a different implementation on 32-bit

I vote for this option.
In C++ intrinsics _mm_cvtsi128_si64x() and _mm_set1_epi64x() are implemented so that they don't directly map to MOVQ. See https://godbolt.org/g/DqpT4E - ICC converts this to MOVDQU from memory.

And they are also marked as "HELPER" methods in ref assembly ( https://github.com/dotnet/coreclr/blob/73369eb914dc7df2118727a36f23e8c5e5d119f5/src/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Sse2.cs#L1259 and https://github.com/dotnet/coreclr/blob/73369eb914dc7df2118727a36f23e8c5e5d119f5/src/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Sse2.cs#L1067)

So I wonder why these methods are not called helper anymore :-)

@voinokin
Copy link
Author

voinokin commented Jun 5, 2018

For Modify SetVector128 to have a different implementation on 32-bit, the change would basically be to just call LoadScalarVector128(long*) instead, as that will have to use the movq xmm, m64 encoding.

In my case the throughoutput is everything, so issueing extra memory access is not an option - I'll have to work this around with my own helper which will use just xmm regs.

@tannergooding
Copy link
Member

So I wonder why these methods are not called helper anymore :-)

Like I said above, it was just that my reference list was missing the helper methods implemented in managed code.

In my case the throughoutput is everything, so issueing extra memory access is not an option - I'll have to work this around with my own helper which will use just xmm regs.

It's probably worth profiling whether or not it hurts performance. The general way to do this is (and what the native compilers do) is to generate a 16-byte constant and directly load that from memory. The JIT doesn't currently support creating these 16-byte constants and it is a TODO work item.

@4creators
Copy link
Contributor

4creators commented Jun 5, 2018

issueing extra memory access is not an option

The reason this API was left unimplemented is exactly one you have indicated. If you want to use 64 bit scalar values to set xmm registers workaround is to use Set(All)Vector128 with 32 bit integers splitted from 64 bit integers, otherwise, you will have to use memory based approach which may or may not be very effective (if you will load from cache it can be done in 1 CPU cycle).

It is possible to use 64 bit integer handling JIT implementation for x86 to work with 64 bit integers in xmm registers but due to time limitations before 2.1 commit window close it was not implemented.

@fiigii is planning to do some work on moving managed implementation of Set(All)Vector128 to JIT codegen to increase flexibility and it seems a good opportunity to implement missing functionality.

@fiigii
Copy link
Contributor

fiigii commented Dec 6, 2018

This is solved by the platform-agnostic helpers, please close.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI question Answer questions and provide assistance, not an issue with source code or documentation.
Projects
None yet
Development

No branches or pull requests

5 participants