
Some integer scalar and vectored methods/overloads are not exposed in HW intrinsics API #10385

Closed
voinokin opened this issue May 27, 2018 · 37 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI design-discussion Ongoing discussion about design without consensus
Comments

@voinokin

  1. Currently, the intrinsics are exposed for signed int -> signed int and for unsigned int -> signed int widening conversion when both src and dest are 128-bit vectors. But no such API method is directly exposed for unsigned int -> unsigned int, so one has to either use UnpackLow() (which is a different operation) or wrap the existing signed conversion in StaticCast<signed, unsigned>(), which may introduce overhead (see Sub-optimal codegen when using Sse.StaticCast<TFrom,TTo>() with non-VEX encoded HW intrinsics #10357).
    The suggestion is to add the following convenience methods:
  • Sse41.ConvertToVector128UInt16(Vector128(byte)) for [V]PMOVZXBW xmm, xmm (same insn as for the existing ConvertToVector128Int16())
  • Sse41.ConvertToVector128UInt32(Vector128(byte/ushort)) for [V]PMOVZXBD/WD xmm, xmm (same insn as for the existing ConvertToVector128Int32())
  • Sse41.ConvertToVector128UInt64(Vector128(byte/ushort/uint)) for [V]PMOVZXBQ/WQ/DQ xmm, xmm (same insn as for the existing ConvertToVector128Int64())
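
For illustration, a minimal sketch of the workaround this forces today, using the preview-era names quoted above (the helper name is hypothetical):

unsafe static Vector128<ushort> WidenLowBytes(byte* p)  // hypothetical helper
{
    Vector128<byte> src = Sse2.LoadVector128(p);
    // Today: widen via the signed overload ([V]PMOVZXBW), then reinterpret
    // the element type; the StaticCast should ideally be a no-op.
    return Sse.StaticCast<short, ushort>(Sse41.ConvertToVector128Int16(src));
    // Proposed: return Sse41.ConvertToVector128UInt16(src); // same instruction
}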

  2. Similarly, no direct means are exposed in the API for unsigned int -> signed int widening conversion when the src is a 128-bit vector and the dest is a 256-bit vector.
    The methods being suggested are:
  • Avx2.ConvertToVector256Int16(Vector128(byte)) for VPMOVZXBW ymm, xmm (same insn as for the existing ConvertToVector256UInt16())
  • Avx2.ConvertToVector256Int32(Vector128(byte/ushort)) for VPMOVZXBD/WD ymm, xmm (same insn as for the existing ConvertToVector256UInt32())
  • Avx2.ConvertToVector256Int64(Vector128(byte/ushort/uint)) for VPMOVZXBQ/WQ/DQ ymm, xmm (same insn as for the existing ConvertToVector256UInt64())
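
The same pattern at 256 bits, again purely illustrative and assuming the preview-era Avx.StaticCast:

unsafe static Vector256<short> WidenBytesToInt16(byte* p)  // hypothetical helper
{
    Vector128<byte> src = Sse2.LoadVector128(p);
    // Today: the unsigned-destination overload (VPMOVZXBW ymm, xmm) plus a reinterpreting cast.
    return Avx.StaticCast<ushort, short>(Avx2.ConvertToVector256UInt16(src));
    // Proposed: return Avx2.ConvertToVector256Int16(src); // same instruction
}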

  3. I was not able to find a reasonably concise way to convert a 32/64-bit scalar value into a 256-bit vector in a YMM reg. It is possible to set a 128-bit vector with Sse2.ConvertScalarToVector128UInt32/64(), which produces (MOV r32/r64, imm + MOVD/MOVQ XMM, r32/r64), but then a MOVDQA XMMd, XMMs is automatically issued when one attempts to use the helper method Avx.ExtendToVector256() to get the 256-bit vector. To my understanding, the helper method was intended to act as a pure type conversion and produce a no-op in such cases, since MOVD/MOVQ X/YMM, r32/r64 zeroes the upper portion of the dest reg. Below is an example of the issue I'm trying to explain:
var v = Avx.ExtendToVector256(Sse2.ConvertScalarToVector128UInt64(0x12345678UL));
00007FF989272625  mov         ecx,12345678h  
00007FF98927262A  vmovq       xmm0,rcx  
00007FF98927262F  vmovdqa     xmm6,xmm0  <======= this is not required

OTOH, the following conversion in the reverse direction produces code that looks fine/optimal:

var v1 = Sse2.ConvertToUInt64(Avx.GetLowerHalf(Avx.SetZeroVector256<ulong>()));
00007FF989282618  vpxor       ymm0,ymm0,ymm0  
00007FF98928261D  vmovq       rsi,xmm0

  4. There exists just one overload of Ssse3.AlignRight(), which works on sbyte. I believe it makes sense to add overloads for the other integer types, the same way it was done for Sse2.ShiftRightLogical128BitLane(), which is quite similar in operation. Otherwise developers will have to resort to type casting, as in the sketch below.
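
A minimal sketch of that casting workaround (the wrapper itself is hypothetical; it uses the preview-era StaticCast):

static Vector128<ushort> AlignRightUInt16(Vector128<ushort> left, Vector128<ushort> right, byte mask)  // hypothetical wrapper
    => Sse.StaticCast<sbyte, ushort>(
           Ssse3.AlignRight(Sse.StaticCast<ushort, sbyte>(left),
                            Sse.StaticCast<ushort, sbyte>(right),
                            mask)); // still a single PALIGNR; the shift count stays byte-granular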

  5. There exists just one overload of Ssse3.Shuffle(), which works on sbyte. I believe it makes sense to add an overload that also works on byte (see the sketch after this list).

Adding something mentioned in https://github.com/dotnet/coreclr/issues/18300#issuecomment-394772776

I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only hardware, which provides nothing closer for implementing "gather" and "scatter" operations (actually "scatter" only appears in AVX-512, and the granularity there is 32 or 64 bits anyway, IIRC):

  6. PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
  7. PEXTRB/D/W + EXTRACTPS [m], xmm, i - spill a single element from xmm to [m] (see the sketch after this list)
  8. PINSRB/D/W + INSERTPS xmm, [m], i - merge a single element from [m] into xmm. There is a separate issue open on the API for INSERTPS (HW intrinsics API declaration is incorrect for Sse41.Insert() that operates on vector of 32-bit floats #10383).
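
For item 5 above, a hypothetical byte wrapper over the existing sbyte-only Shuffle (PSHUFB), again via the preview-era StaticCast:

static Vector128<byte> ShuffleBytes(Vector128<byte> value, Vector128<byte> mask)  // hypothetical wrapper
    => Sse.StaticCast<sbyte, byte>(
           Ssse3.Shuffle(Sse.StaticCast<byte, sbyte>(value),
                         Sse.StaticCast<byte, sbyte>(mask)));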
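
And for item 7, a sketch of a per-element spill on SSE4.1-only hardware, assuming the preview-era Sse41.Extract overload for int; whether the store folds into a single PEXTRD [m32], xmm, imm8 is exactly the memory-encoding question raised here:

unsafe static void SpillThirdElement(int* dst, Vector128<int> v)  // hypothetical helper
{
    // Ideally one PEXTRD [m32], xmm, 2 rather than PEXTRD r32, xmm, 2 + MOV [m32], r32.
    dst[0] = Sse41.Extract(v, 2);
}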

category:testing
theme:intrinsics
skill-level:intermediate
cost:medium

@RussKeldorph
Contributor

@CarolEidt @fiigii @tannergooding @4creators @eerhardt Should this be discussed in corefx first?

@tannergooding
Member

@RussKeldorph, some of this (such as the inefficient codegen) should be resolved by a PR I am currently working on.

As for the new APIs, it is probably worth discussion during the next HWIntrinsic design review (I sent an e-mail on this a couple days ago).

@tannergooding
Member

Some of the inefficient codegen was cleaned up with dotnet/coreclr#18262

More of the remaining inefficient codegen will be cleaned up with dotnet/coreclr#18297

After dotnet/coreclr#18297, there is a bit more work that involves special handling for various intrinsics, but we are getting closer.

@voinokin
Author

voinokin commented Jun 5, 2018

@tannergooding, thanks for the good news!
I wonder whether there will be any way for me to test these changes independently before the first 2.2.0 preview becomes available?

@saucecontrol
Member

saucecontrol commented Jun 5, 2018

@voinokin you can use the SDK daily builds with the daily Intrinsics packages from myget.
https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/dogfooding.md

Or you can build the runtime from source if you want to be able to do JitDumps and the like
https://github.com/dotnet/coreclr/blob/master/Documentation/building/windows-instructions.md
https://github.com/dotnet/coreclr/blob/master/Documentation/workflow/UsingYourBuild.md

@voinokin
Author

voinokin commented Jun 5, 2018

@saucecontrol Thanks! I'll give it a try in some time.

@voinokin voinokin changed the title Some integer scalar and vectored conversion methods are not exposed in HW intrinsics API Some integer scalar and vectored methods are not exposed in HW intrinsics API Jun 7, 2018
@voinokin voinokin changed the title Some integer scalar and vectored methods are not exposed in HW intrinsics API Some integer scalar and vectored methods/overloads are not exposed in HW intrinsics API Jun 7, 2018
@voinokin
Author

voinokin commented Jun 7, 2018

Adding these:

  4. There exists just one overload of Ssse3.AlignRight(), which works on sbyte. I believe it makes sense to add overloads for the other integer types, the same way it was done for Sse2.ShiftRightLogical128BitLane(), which is quite similar in operation. Otherwise developers will have to resort to type casting.

  5. There exists just one overload of Ssse3.Shuffle(), which works on sbyte. I believe it makes sense to add an overload that also works on byte.

@tannergooding
Member

There exists just one overload of Ssse3.AlignRight(), which works on sbyte. I believe it makes sense to add overloads for the other integer types, the same way it was done for Sse2.ShiftRightLogical128BitLane(), which is quite similar in operation. Otherwise developers will have to resort to type casting.

We will want to be careful with this one, since the instruction explicitly operates on byte values.

There exists just one overload of Ssse3.Shuffle(), which works on sbyte. I believe it makes sense to add an overload that also works on byte.

It definitely makes sense to ensure that both the signed and unsigned versions are exposed here.

@voinokin
Author

voinokin commented Jun 7, 2018

@tannergooding

We will want to be careful with this one, since the instruction explicitly operates on byte values.

It is the same for ShiftRightLogical128BitLane() - that's my point. BTW, the method has proven quite useful without type casting :-).
I suggest renaming the argument of AlignRight(), which is now called mask (?), to something like numBytes, again the same way it's done for ShiftRightLogical128BitLane().

@saucecontrol
Member

We will want to be careful with this one, since the instruction explicitly operates on byte values.

Really, all the masked byte-shuffle instructions work on the minimum element size that can be represented by the mask, but they're not necessarily most often used on that size.

I have an example here that uses AlignRight on ulong values. Something went funny in the codegen with all the casts I had to do, though. Even though I have a managed helper for that here, I couldn't use it in that context without speed taking a dive. It would be nice if we could have managed helper overloads for cases like that, but we'd have to be assured that they boil down to the same instruction during JIT. I don't know whether such a managed helper should be part of the Intrinsics API or whether we should have to roll our own, but it will be a common use case, no doubt.

@voinokin
Author

voinokin commented Jun 7, 2018

Something went funny in the codegen with all the casts I had to do, though. Even though I have a managed helper for that here, I couldn't use it in that context without speed taking a dive.

Try using the ref modifier on your method's parameters - sometimes that helps when the current JIT version inlines. My advice is to check the disasm after that; I've seen a lot of it by now... and have logged some strange points here too.

@saucecontrol
Member

Ah yes, thanks. I do that in a lot of cases but didn't try it on those tiny cast helpers.

@tannergooding
Member

A good bit of the "bad codegen" was because we didn't support the ins reg, [mem] encodings for a good number of the intrinsics.

The latest builds out of master should have much better codegen and the last of the non load/store intrinsics should support containment with: dotnet/coreclr#18349.

There is, of course, still some more work to be done, but hopefully you will see much better results.

@tannergooding
Member

(StaticCast itself does still need a change, but I will be working on that next).

@voinokin
Author

voinokin commented Jun 8, 2018

Not sure whether this is the proper place to discuss API that is already defined (?)... Anyway, I stumbled over the names of the API methods that load and store things.

Some background:

  • the overloads to move (parts of) vectors FROM memory are mostly called LoadScalarVectorNNN(), LoadVectorNNN() / LoadAlignedVectorNNN() / LoadDquVectorNNN() / LoadAlignedVectorNNNNonTemporal().
  • the overloads to move vectors TO memory are mostly called: Store() / StoreAligned() / StoreAlignedNonTemporal().

Here are my points:

  1. My feeling is that there could be better wording for ...Dqu... overloads.
  2. There is no word "Vector" in the methods that store entire vectors, so their names look like those of the partial-store operations. That seems OK until you find out that there exist overloads of StoreNonTemporal() which do not operate on vectors at all, taking a GP reg as input (MOVNTI [m], reg); maybe these specific ones need clearer names?
  3. There exist LoadScalarVectorNNN(int/uint/long/ulong/float/double) overloads that load just one element according to the data type and then zero the remaining part of the vector. OTOH, the methods intended for the opposite operation have their names split: StoreScalar(float/double) (MOVSS/MOVSD [m], xmm), StoreLow(long/ulong) (MOVQ [m], xmm, which in fact stores one 64-bit element, call it a scalar), and I found no method exposed to store 32-bit ints as scalars (MOVD [m], xmm). The snippet below spells out the asymmetry.
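
A sketch of that asymmetry in code, using the names listed above:

unsafe static void Demo(int* pi, float* pf, long* pl,
                        Vector128<int> vi, Vector128<float> vf, Vector128<long> vl)
{
    Vector128<int> a = Sse2.LoadScalarVector128(pi); // MOVD xmm, [m32]
    Sse.StoreScalar(pf, vf);                         // MOVSS [m32], xmm
    Sse2.StoreLow(pl, vl);                           // MOVQ  [m64], xmm
    // No single-element store counterpart for vi (MOVD [m32], xmm) is exposed.
}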

@voinokin
Author

Adding something mentioned in https://github.com/dotnet/coreclr/issues/18300#issuecomment-394772776

I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only hardware, which provides nothing closer for implementing "gather" and "scatter" operations (actually "scatter" only appears in AVX-512, and the granularity there is 32 or 64 bits anyway, IIRC):


  6. PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
  7. PEXTRB/D/W + EXTRACTPS [m], xmm, i - spill a single element from xmm to [m]
  8. PINSRB/D/W + INSERTPS xmm, [m], i - merge a single element from [m] into xmm. There is a separate issue open on the API for INSERTPS (HW intrinsics API declaration is incorrect for Sse41.Insert() that operates on vector of 32-bit floats #10383).

@fiigii
Contributor

fiigii commented Jun 12, 2018

My feeling is that there could be better wording for ...Dqu... overloads.

Ah, actually I tried to find a better name for these guys, but I don't think any single word can express the semantics well, so I just followed the C++ names... Do you have suggestions?

There is no word "Vector" in the methods that store entire vectors, so their names look like those of the partial-store operations.

Store* intrinsics take the source as a parameter, so the different vector lengths can be resolved by overload resolution, which keeps the API simpler.

OTOH, the methods intended for the opposite operation have their names split: StoreScalar(float/double) (MOVSS/MOVSD [m], xmm), StoreLow(long/ulong) (MOVQ [m], xmm, which in fact stores one 64-bit element, call it a scalar),

The Scalar suffix only makes sense with floating-point types (float and double) because x86/x64 architectures execute floating-point computation via the SIMD (SSE2) units.

@fiigii
Contributor

fiigii commented Jun 12, 2018

and I found no method exposed to store 32-bit ints as scalars (MOVD [m], xmm).

I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only

In the current design, we avoid exposing "memory-access encodings" as much as possible, and we plan to generate these encodings via containment optimization (i.e., merging ins(load(address)) or store(address, ins(...)) into a single instruction). I think it also works for scalar-type containment (e.g., folding a[i] = Sse2.ConvertToInt32(v) into MOVD [m], xmm).
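
Spelled out as code, the folding described above would look like this (a sketch; whether the JIT actually performs it is the point under discussion):

unsafe static void StoreElement(int* a, int i, Vector128<int> v)
{
    // With store(address, ins(...)) containment this becomes a single
    // MOVD [m32], xmm instead of MOVD r32, xmm + MOV [m32], r32.
    a[i] = Sse2.ConvertToInt32(v);
}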

@voinokin
Author

voinokin commented Jun 12, 2018

My feeling is that there could be better wording for ...Dqu... overloads.

Ah, actually I tried to find a better name for these guys, but I don't think any single word can express the semantics well, so I just followed the C++ names... Do you have suggestions?

For [V]LDDQU I suggest using LoadUnalignedVectorNNN, explicitly stating that the operation is intended as a special case of unaligned loads. (I believe it behaves just like MOVDQU on current CPUs, though.)

In the current design, we avoid exposing "memory-access encodings" as much as possible, and we plan to generate these encodings via containment optimization (i.e., merging ins(load(address)) or store(address, ins(...)) into a single instruction). I think it also works for scalar-type containment (e.g., folding a[i] = Sse2.ConvertToInt32(v) into MOVD [m], xmm).

Given this will work, some unclarities still remain in the API:

  1. The 128-bit form of [V]PMOVZX/SXBQ xmm, [m16] - to my knowledge there is no other single operation that load-extends 2x 8-bit values into a vector. All other forms of [V]PMOVZX/SX... take at least 32 bits of input, which will be achievable through containment support with some type casting + LoadVectorNNN() or ConvertScalarToVectorXXX(), but this specific form takes only 16 bits of input.
    [UPD] Avx.Broadcast(short*/ushort*) (VPBROADCASTW xmm/ymm, [m16]) also exists, but that is a different operation, and AVX2.
  2. For MOVD/Q xmm, [m] - LoadScalarVector128(int/uint/long/ulong) already exist, so why not remove them in favor of Sse2.ConvertScalarToVector128[U]Int32/64(indir)? For now it looks asymmetrical for 32-bit ints: a direct load operation is exposed, but not the corresponding direct store operation.
  3. For INSERTPS xmm, [m32], i - related: HW intrinsics API declaration is incorrect for Sse41.Insert() that operates on vector of 32-bit floats #10383, supposedly fixed by Improve Intel hardware intrinsic APIs coreclr#17637. I can't see the final version, but if the existing overload taking just one scalar value is replaced with the overload taking a vector, then this instruction encoding will become unavailable (I had no understanding of the containment being introduced when I logged HW intrinsics API declaration is incorrect for Sse41.Insert() that operates on vector of 32-bit floats #10383).

@fiigii
Contributor

fiigii commented Jun 12, 2018

For [V]LDDQU I suggest using LoadUnalignedVectorNNN

LoadUnaligned* is not enough to express the lddqu semantics. In x86 SIMD programming, "unaligned" is usually associated with instructions like movups, movdqu, etc. So that may be confusing...

@voinokin
Author

voinokin commented Jun 12, 2018

For [V]LDDQU I suggest using LoadUnalignedVectorNNN

LoadUnaligned* is not enough to express the lddqu semantics. In x86 SIMD programming, "unaligned" is usually associated with instructions like movups, movdqu, etc. So that may be confusing...

Then the last remaining idea from me would be to extend the existing LoadVector128/256(type* ptr) overloads with an optional parameter, so they would become LoadVector128/256(type* ptr, bool forceUnaligned = false). Looks a bit ugly, though...

@mikedn
Contributor

mikedn commented Jun 12, 2018

Perhaps LoadUnalignedSplit*. Though I tend to think that coming up with such fancy names for already-established instructions/intrinsics does more harm than good. And LDDQU is kind of useless these days...

@tannergooding
Member

I think we've addressed some of this already. Could the original post be updated with anything still relevant or the issue be otherwise closed?

@fiigii
Contributor

fiigii commented Dec 10, 2018

I think we've addressed some of this already

Right, I think we can close this issue and open a new issue for "folding store".

@voinokin
Author

I think we've addressed some of this already. Could the original post be updated with anything still relevant or the issue be otherwise closed?

Tell me which issues remain and I will update the original post. Thanks.

@AndyAyersMS
Member

@tannergooding can you help get this sorted out? Hoping there is no work left here.
cc @CarolEidt

@tannergooding
Member

No.1 and No.2 haven't been resolved and need a separate proposal addressing them, logged against CoreFX and in the recommended format (https://github.com/dotnet/corefx/issues/35768 tracks some of the issues raised).
The primary issue here is that PMOV* has both sign-extending and zero-extending versions. We need to ensure these take the appropriate types and are exposed in a manner familiar to existing .NET users.

No.3 is meant to be covered by the 128-bit conversion followed by a widening conversion to 256-bit via the ToVector256 or ToVector256Unsafe method, as sketched below.
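
A sketch of that pattern, assuming the 3.0-era helper surface (ToVector256 zeroes the upper 128 bits, while ToVector256Unsafe leaves them undefined):

// Assuming the 3.0-era helpers Vector128.CreateScalar and ToVector256:
Vector256<ulong> v = Vector128.CreateScalar(0x12345678UL).ToVector256();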

No.4 and No.5 have been resolved.

For No.6, No.7, and No.8, we aren't currently looking at providing helper methods like these.

@voinokin
Author

voinokin commented Mar 12, 2019

  6. PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.

Regarding No.6 - my point is that it's not a helper method, but rather a separate operation which loads values and extends them to 16/32/64 bits. It can currently be replaced with several ops using type casting (see the sketch after this list):

  • 16-bit operands: load [m16x2] via the pointer cast to u32*/i32* as a scalar into a vector + a reverse type cast + Z/S-extend
  • 32-bit operands: load [m32x2] via the pointer cast to u64*/i64* as a scalar into a vector + a reverse type cast + Z/S-extend
  • But in the case of 8-bit operands I can't see anything shorter than this: load [m8x2] via the pointer cast to u16*/i16* into a 16-bit GPR + extend to a 32- or 64-bit GPR + move the value into a vector + a reverse type cast + Z/S-extend. To my memory, 5-6 months ago when I last checked, the JIT was not emitting MOVZX/SX GPR, [m8/16/32], which would compress the first two instructions into just one. Even if this feature is implemented now, it leaves 3 instructions (MOVZX + MOVQ + PMOVZX/SX) instead of just one; if it's not, we have 4 instructions.
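
The 8-bit case spelled out (a sketch; the helper name is hypothetical, and the reinterpreting cast uses the preview-era StaticCast):

unsafe static Vector128<long> LoadExtendTwoSBytes(sbyte* p)  // hypothetical helper
{
    int two = *(short*)p;                                        // 2 bytes -> GPR (ideally MOVSX r32, [m16])
    Vector128<int> v = Sse2.ConvertScalarToVector128Int32(two);  // MOVD xmm, r32
    // PMOVSXBQ reads only the low 2 bytes, so the sign-extended upper bytes are ignored.
    return Sse41.ConvertToVector128Int64(Sse.StaticCast<int, sbyte>(v)); // PMOVSXBQ xmm, xmm
}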

My use case for the 8-bit version is decoding a stream of compressed bytes.

@tannergooding
Member

my point is that it's not a helper method, but rather a separate operation which loads values and extends them to 16/32/64 bits.

Might be misunderstanding, but this isn't a singular hardware instruction; so it would be classified as a helper (it is implemented in terms of the actual intrinsics) rather than being an actual hardware intrinsic itself.

Given that it isn't a singular hardware instruction, and it isn't considered one of the "core" operations (which is basically just creating a vector and accessing individual elements), it likely wouldn't be considered at this point (users should be able to provide their own implementation in the interim).

@saucecontrol
Member

a separate operation which loads values and extends them to 16/32/64 bits

That should be just the xmm/mem encoded versions of PMOV[ZS]X[BWD][WDQ]. These are already handled by containment, but the correct mem overloads are addressed in https://github.com/dotnet/corefx/issues/35768

@voinokin
Author

voinokin commented Mar 12, 2019

Might be misunderstanding, but this isn't a singular hardware instruction; so it would be classified as a helper (it is implemented in terms of the actual intrinsics) rather than being an actual hardware intrinsic itself. Given that it isn't a singular hardware instruction, and it isn't considered one of the "core" operations (which is basically just creating a vector and accessing individual elements), it likely wouldn't be considered at this point (users should be able to provide their own implementation in the interim).

Here you have it (sorry, I found no better way for now):
https://gcc.godbolt.org/z/TglHbD

Also, check the 3rd form from the top at https://www.felixcloutier.com/x86/pmovzx:
66 0f 38 32 /r --- PMOVZXBQ xmm1, xmm2/m16 --- Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.

I mean, it IS a singular hardware instruction.

@tannergooding
Member

PMOVZXBQ is covered by ConvertToVector128Int64; the encoding of the memory operand is being tracked by https://github.com/dotnet/corefx/issues/35768.
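
If the overloads tracked there land as proposed, usage would collapse to a single call taking the address directly (a sketch under that assumption):

unsafe static Vector128<long> LoadExtend(byte* p)
    => Sse41.ConvertToVector128Int64(p);  // PMOVZXBQ xmm, [m16]; assumes the pointer-taking overload from dotnet/corefx#35768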

@voinokin
Author

I confirm - dotnet/corefx#35768 covers my understanding expressed in item No.6.
It's nice we will have these APIs implemented :-)

@RussKeldorph
Contributor

@tannergooding Can this be closed in favor of other issues? If there is remaining work here, could you open separate issues to make it very clear what work remains for 3.0?

@tannergooding
Member

Yes, I think this could be closed as I believe all issues are either resolved or tracked by other existing issues.

@voinokin, feel free to clarify if you don't believe that is the case.

@damageboy
Contributor

I know this issue is officially closed, and I'm late to the show, but I am a bit confused by the current state of preview5...

All issues seem to be resolved and all PRs merged, yet PMOVZXB{D,Q} and others do not seem to be generated, and the current master branch shows this unwelcoming comment:

https://github.com/dotnet/coreclr/blob/1a495118c005b9a5409c81fea1813bd2b3044cbd/src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Avx2.cs#L750-L755

Which seems to imply it isn't really supported at this stage...

@CarolEidt
Contributor

Is it this part that's confusing: "The native signature does not exist."?

If so, that just means that there's no corresponding native (C++) intrinsic. Notice that for other intrinsics the equivalent C++ intrinsic is shown in addition to the target instruction; for example, a little further down we have:

        /// <summary>
        /// __m128i _mm256_extracti128_si256 (__m256i a, const int imm8)
        ///   VEXTRACTI128 xmm, ymm, imm8
        /// </summary>
        public new static Vector128<sbyte> ExtractVector128(Vector256<sbyte> value, byte index) => ExtractVector128(value, index);

The second line is the native (C++) intrinsic.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020