Sse2.Set[All]Vector128([u]long, [u]long) crashes with PNSE when run in 32-bit process #10453

voinokin · 2018-06-05T15:34:28Z

My guess is that these methods have a managed implementation behind them that uses ConvertScalarToVector128[U]Int64() (MOVQ xmm, r64), which is unavailable in 32-bit mode.
Since these are all just helper methods that do not necessarily map directly to a specific HW intrinsic, my understanding is that the implementation should take the current process bitness into account, e.g. use a set of 32-bit HW intrinsics to build the result when in 32-bit mode.

tannergooding · 2018-06-05T15:40:51Z

ConvertScalarToVector128 is not considered one of the helper methods and directly maps to the MOVQ instruction.

However, it is also one of the Int64 intrinsics that can work on 32-bit, since there is a movq xmm, m64 encoding available there. It would require us to ensure that op2 is always spilled.

@CarolEidt, thoughts?

tannergooding · 2018-06-05T15:41:38Z

Also, for reference, the helper methods are currently:

SSE.ConvertToSingle
SSE.SetScalarVector128
SSE.SetZeroVector128
SSE.StaticCast
SSE2.ConvertToDouble
SSE2.SetScalarVector128
SSE2.SetZeroVector128
AVX.ExtendToVector256
AVX.GetLowerHalf
AVX.SetVector256
AVX.SetHighLow
AVX.SetAllVector256
AVX.SetZeroVector256
AVX.StaticCast

voinokin · 2018-06-05T15:43:33Z

@tannergooding - please check the subject of the issue: it is specifically about SetAllVector128() and SetVector128() for 64-bit values. For some reason these two are not on your list.

tannergooding · 2018-06-05T15:59:39Z

please check the subject of the issue, it tells exactly about SetAllVector128() and SetVector128() for 64-bit values

@voinokin, right. The failure is because ConvertScalarToVector128Int64 isn't aware that it can still be emitted on 32-bit architectures if it always uses the movq xmm, m64 encoding. Given that it isn't one of the helper methods, we'll need to decide whether to:

Always emit the movq xmm, m64 encoding on 32-bit
-or-
Modify SetVector128 to have a different implementation on 32-bit

(I would imagine the former).

For some reason these two are not on your list.

Looks like my list is missing the ones implemented in managed code. I'll update it.

tannergooding · 2018-06-05T16:01:24Z

For "Modify SetVector128 to have a different implementation on 32-bit", the change would basically be to just call LoadScalarVector128(long*) instead, as that will have to use the movq xmm, m64 encoding.
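For illustration, a C analogue of that memory-based approach might look like the sketch below. It assumes the C intrinsic `_mm_loadl_epi64` as a rough counterpart of LoadScalarVector128(long*); the helper name `set_scalar_i64_via_memory` is hypothetical and not part of any API discussed in this thread.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: puts a 64-bit value into the low lane and zeroes the
   high lane by loading it from memory. This typically compiles to
   MOVQ xmm, m64, which is encodable in 32-bit mode. */
static inline __m128i set_scalar_i64_via_memory(const int64_t *value)
{
    assert(value != NULL);  /* the load dereferences the pointer */
    return _mm_loadl_epi64((const __m128i *)value);
}
```

The cost of this form is the extra memory access, which is the trade-off debated in the comments that follow.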

voinokin · 2018-06-05T16:23:05Z

Speaking of the availability of MOVQ xmm, m64, it appears that more intrinsics are missing from the API, namely the ones that operate on a scalar value in memory, and I can't confirm (and rest assured :-) ) that they are handled by the containment support you're implementing.

Here they are, at least for SSE; I can't tell for AVX since it's incomplete anyway:

PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
PEXTRB/D/W + EXTRACTPS [m], xmm, i - spill single element from xmm to [m]
PINSRB/D/W + INSERTPS xmm, [m], i - merge single element from [m] into xmm. There is a separate issue open on the API for INSERTPS ( HW intrinsics API declaration is incorrect for Sse41.Insert() that operates on vector of 32-bit floats #10383 ).

I'm listing these here to avoid creating a new issue.
I hope I really found them all...

tannergooding · 2018-06-05T16:33:49Z

they are handled by containment support you're implementing.

For the most part, yes. The containment support basically allows us to take a:

movaps op1Reg, [mem]
ins targetReg, op1Reg
movaps [mem], targetReg

and convert it into either

ins targetReg, [mem]
movaps [mem], targetReg

or

movaps op1Reg, [mem]
ins [mem], op1Reg

Depending on what the instruction supports. For example, Insert falls into the first and Extract falls into the latter.
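As a rough C analogue of the first form, the sketch below inserts a 16-bit element that is read from memory. It assumes an SSE2-capable x86 compiler; whether the load is actually folded into the instruction's memory operand is up to the compiler, and the helper name `insert_lane3_from_memory` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: replaces 16-bit lane 3 of v with a value read from
   memory. A compiler may fold the load into the instruction
   (PINSRW xmm, m16, 3) rather than loading into a register first. */
static inline __m128i insert_lane3_from_memory(__m128i v, const uint16_t *src)
{
    return _mm_insert_epi16(v, *src, 3);
}
```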

tannergooding · 2018-06-05T16:35:41Z

There are still some cases where the API shape doesn't exactly match the semantics of the underlying instruction, however. One example is Sse41.Insert(Vector128, Single), which takes a Single instead of another Vector128, even though the underlying instruction (INSERTPS) lets you pick a given index from the second operand.

voinokin · 2018-06-05T16:47:43Z

Modify SetVector128 to have a different implementation on 32-bit

I vote for this option.
In C++, the intrinsics _mm_cvtsi128_si64x() and _mm_set1_epi64x() are implemented so that they don't map directly to MOVQ. See https://godbolt.org/g/DqpT4E - ICC compiles this to a MOVDQU load from memory.

And they are also marked as "HELPER" methods in ref assembly ( https://github.com/dotnet/coreclr/blob/73369eb914dc7df2118727a36f23e8c5e5d119f5/src/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Sse2.cs#L1259 and https://github.com/dotnet/coreclr/blob/73369eb914dc7df2118727a36f23e8c5e5d119f5/src/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Sse2.cs#L1067)

So I wonder why these methods are not called helper anymore :-)

voinokin · 2018-06-05T16:54:18Z

For "Modify SetVector128 to have a different implementation on 32-bit", the change would basically be to just call LoadScalarVector128(long*) instead, as that will have to use the movq xmm, m64 encoding.

In my case throughput is everything, so issuing an extra memory access is not an option - I'll have to work around this with my own helper that uses just xmm regs.

tannergooding · 2018-06-05T16:59:21Z

So I wonder why these methods are not called helper anymore :-)

Like I said above, it was just that my reference list was missing the helper methods implemented in managed code.

In my case throughput is everything, so issuing an extra memory access is not an option - I'll have to work around this with my own helper that uses just xmm regs.

It's probably worth profiling whether or not it hurts performance. The general way to do this (and what the native compilers do) is to generate a 16-byte constant and load it directly from memory. The JIT doesn't currently support creating these 16-byte constants; it is a TODO work item.
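A minimal C sketch of that constant-load pattern, written by hand rather than generated by a compiler, could look like this. The values and the names `k_pair` and `load_pair_constant` are arbitrary and only for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Arbitrary illustrative values; a native compiler would place an
   equivalent 16-byte constant in read-only data. */
static const int64_t k_pair[2] = { 0x0102030405060708LL, 0x1112131415161718LL };

/* Loads both 64-bit lanes with a single unaligned 16-byte load
   (typically MOVDQU xmm, m128), so no 64-bit GPR is needed. */
static inline __m128i load_pair_constant(void)
{
    return _mm_loadu_si128((const __m128i *)k_pair);
}
```

This only helps when the values are known at compile time; runtime values still need one of the other approaches discussed here.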

4creators · 2018-06-05T17:09:40Z

issuing an extra memory access is not an option

The reason this API was left unimplemented is exactly the one you have indicated. If you want to use 64-bit scalar values to set xmm registers, the workaround is to use Set(All)Vector128 with 32-bit integers split from the 64-bit integers; otherwise, you will have to use a memory-based approach, which may or may not be very effective (if you load from cache it can be done in 1 CPU cycle).

It is possible to use the JIT's 64-bit integer handling for x86 to work with 64-bit integers in xmm registers, but it was not implemented due to time constraints before the 2.1 commit window closed.

@fiigii is planning to do some work on moving the managed implementation of Set(All)Vector128 to JIT codegen to increase flexibility, and that seems like a good opportunity to implement the missing functionality.
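The split-into-halves workaround can be sketched in C as follows. It assumes the SSE2 intrinsic `_mm_set_epi32` as the analogue of the 32-bit SetVector128 overload; both helper names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: builds a vector whose high 64-bit lane is e1 and low
   64-bit lane is e0, using only 32-bit element setters. */
static inline __m128i set_i64x2_from_halves(int64_t e1, int64_t e0)
{
    uint64_t u1 = (uint64_t)e1;
    uint64_t u0 = (uint64_t)e0;
    /* _mm_set_epi32 takes elements from highest to lowest; x86 is
       little-endian, so each low half goes in the lower 32-bit lane. */
    return _mm_set_epi32((int32_t)(uint32_t)(u1 >> 32), (int32_t)(uint32_t)u1,
                         (int32_t)(uint32_t)(u0 >> 32), (int32_t)(uint32_t)u0);
}

/* Hypothetical helper: broadcasts one 64-bit value to both lanes. */
static inline __m128i set_all_i64_from_halves(int64_t e)
{
    return set_i64x2_from_halves(e, e);
}
```

How the compiler materializes the four 32-bit elements is up to its code generator, so the form should be profiled against the memory-based alternative.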

fiigii · 2018-12-06T19:48:00Z

This is solved by the platform-agnostic helpers, please close.

tannergooding closed this as completed Dec 6, 2018

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the 3.0 milestone Jan 31, 2020

JosephTremoulet mentioned this issue Aug 10, 2018

JIT optimization: value re-numbering in CSE doesn't help span loops as much as array loops #8177

Open

ghost locked as resolved and limited conversation to collaborators Dec 16, 2020
