Some integer scalar and vectored methods/overloads are not exposed in HW intrinsics API #10385
Comments
@CarolEidt @fiigii @tannergooding @4creators @eerhardt Should this be discussed in corefx first? |
@RussKeldorph, some of this (such as the inefficient codegen) should be resolved by a PR I am currently working on. As for the new APIs, it is probably worth discussion during the next HWIntrinsic design review (I sent an e-mail on this a couple days ago). |
Some of the inefficient codegen was cleaned up with dotnet/coreclr#18262. More of the remaining inefficient codegen will be cleaned up with dotnet/coreclr#18297. After dotnet/coreclr#18297, there is a bit more work that involves special handling for various intrinsics, but we are getting closer. |
@tannergooding, thanks for the good news! |
@voinokin you can use the SDK daily builds with the daily Intrinsics packages from myget. Or you can build the runtime from source if you want to be able to do JitDumps and the like |
@saucecontrol Thanks! I'll give it a try in some time. |
Adding these:
|
We will want to be careful with this one, since the instruction explicitly operates on byte values.
It definitely makes sense to ensure that both the signed and unsigned versions are exposed here. |
It is the same for |
Really, all the masked byte-shuffle instructions work on the minimum element size that can be represented by the mask, but they're not necessarily most-often used on that size. I have an example here that uses |
Try using |
Ah yes, thanks. I do that in a lot of cases but didn't try it on those tiny cast helpers. |
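The point above, that byte-granular instructions are often used on wider element types, can be sketched as follows. This is a minimal illustration, not code from the thread: `AlignRightDemo` and `ShiftInOneInt32` are hypothetical names, and the reinterpret casts use the `AsSByte()`/`AsInt32()` helpers as shipped in .NET Core 3.0 rather than the `StaticCast` helper mentioned in this discussion.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class AlignRightDemo
{
    // Hypothetical helper: treats left:right as one 256-bit value (left is the
    // high half), shifts it right by one 32-bit element and keeps the low 128 bits.
    public static Vector128<int> ShiftInOneInt32(Vector128<int> left, Vector128<int> right)
    {
        if (Ssse3.IsSupported)
        {
            // PALIGNR counts in bytes, so one int element is a shift of 4.
            // The casts only reinterpret the bits; they do not convert values.
            return Ssse3.AlignRight(left.AsSByte(), right.AsSByte(), 4).AsInt32();
        }

        // Scalar fallback with the same result on hardware without SSSE3.
        return Vector128.Create(
            right.GetElement(1), right.GetElement(2), right.GetElement(3), left.GetElement(0));
    }

    public static void Main()
    {
        Vector128<int> r = ShiftInOneInt32(Vector128.Create(4, 5, 6, 7), Vector128.Create(0, 1, 2, 3));
        // Elements are 1, 2, 3, 4: three from `right`, then the first of `left`.
        for (int i = 0; i < 4; i++)
        {
            Console.WriteLine(r.GetElement(i));
        }
    }
}
```

Typed overloads would remove the two casts from the call site; the instruction emitted should be the same either way.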
A good bit of the "bad codegen" was because we didn't support the The latest builds out of master should have much better codegen and the last of the non load/store intrinsics should support containment with: dotnet/coreclr#18349. There is, of course, still some more work to be done, but hopefully you will see much better results. |
( |
Not sure this is the proper place to discuss an API that is already defined (?)... Anyway, I stumbled over the names of the API methods that load and store something. Some background:
Here are my points:
|
Adding something mentioned in https://github.com/dotnet/coreclr/issues/18300#issuecomment-394772776 I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only hardware which does not provide anything closer to implement "gather" and "scatter" operations (actually "scatter" only appears in AVX512, and anyway granularity is 32 or 64 bits IIRC):
|
Ah, actually I tried to find a better name for these guys, but I thought there is no single word that can explain the semantics very well, so I just followed the C++ names... Do you have suggestions?
The |
In the current design, we are avoiding exposing "memory-access encoding" as much as possible, and we plan to generate these encodes via containment optimization (i.e. merging ins(load(address)) or store(address, ins(...)) in a single instruction). I think it also works for scalar type containment (i.e., folding |
For
Given this will work, still some unclarities remain with API:
|
|
Then, the last remaining idea from me would be to extend existing LoadVector128/256(type* ptr) overloads with optional parameter so that it would become LoadVector128/256(type* ptr, bool forceUnaligned=false). Looks a bit ugly though.... |
Perhaps |
I think we've addressed some of this already. Could the original post be updated with anything still relevant or the issue be otherwise closed? |
Right, I think we can close this issue and open a new issue for "folding store". |
Tell me which issues remain and I will update the original post. Thanks. |
@tannergooding can you help get this sorted out? Hoping there is no work left here. |
No.1 and No.2 haven't been resolved and need a separate proposal addressing them logged against CoreFX and in the recommended format (https://github.com/dotnet/corefx/issues/35768 tracks some of the issues raised). No.3 is meant to be covered by the 128-bit conversion and then a widening conversion to 256-bit via the No.4 and No.5 have been resolved. For No.6, No.7, and No.8, we aren't currently looking at providing helper methods like these. |
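For No.3, the two-step approach (128-bit scalar conversion, then widening to 256 bits) might look like the sketch below. The name of the widening helper is cut off in the comment above, so the use of `Vector128.ToVector256()` here is an assumption based on the API as shipped in .NET Core 3.0, not necessarily the method that comment named. `ScalarWidenDemo` and `ScalarToVector256` are hypothetical names.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ScalarWidenDemo
{
    // Hypothetical helper: places a scalar in element 0 of a 256-bit vector.
    public static Vector256<uint> ScalarToVector256(uint value)
    {
        // Step 1: 128-bit conversion (MOVD xmm, r32 when SSE2 is available).
        Vector128<uint> low = Sse2.IsSupported
            ? Sse2.ConvertScalarToVector128UInt32(value)
            : Vector128.CreateScalar(value);

        // Step 2: widen to 256 bits. ToVector256 zeroes the upper 128 bits;
        // ToVector256Unsafe would leave them undefined instead.
        return low.ToVector256();
    }

    public static void Main()
    {
        Vector256<uint> v = ScalarToVector256(42u);
        Console.WriteLine(v.GetElement(0)); // the scalar
        Console.WriteLine(v.GetElement(7)); // an upper lane, zeroed
    }
}
```

Whether the widening step costs an extra register move is a JIT codegen question and cannot be read off the C# source.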
Regarding No. 6 - my point is it's not helper method, but rather a separate operation which loads values and extends them to 16/32/64 bits. This can currently be replaced with several ops using typecasting:
My use cases for 8-bit version are decoding stream of compressed bytes. |
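As a rough illustration of the byte-decoding use case, the sketch below zero-extends eight bytes to eight 16-bit values using separate load, reinterpret, and convert steps. It is not the code from the comment above (that snippet is missing here), and `ZeroExtendDemo` and `WidenBytes` are hypothetical names. It assumes the API names as shipped in .NET Core 3.0 (`AsByte()`/`AsUInt16()` rather than `StaticCast`).

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ZeroExtendDemo
{
    // Hypothetical helper: widens the first 8 bytes of src to 8 ushort lanes.
    // src must contain at least 8 bytes.
    public static Vector128<ushort> WidenBytes(ReadOnlySpan<byte> src)
    {
        if (Sse41.IsSupported)
        {
            // Load 8 bytes into the low half of an XMM register.
            ulong packed = BinaryPrimitives.ReadUInt64LittleEndian(src);
            Vector128<byte> bytes = Vector128.CreateScalar(packed).AsByte();

            // PMOVZXBW zero-extends; the API returns Vector128<short>,
            // so reinterpret the result as unsigned.
            return Sse41.ConvertToVector128Int16(bytes).AsUInt16();
        }

        // Scalar fallback with the same result.
        return Vector128.Create(
            (ushort)src[0], (ushort)src[1], (ushort)src[2], (ushort)src[3],
            (ushort)src[4], (ushort)src[5], (ushort)src[6], (ushort)src[7]);
    }

    public static void Main()
    {
        byte[] data = { 1, 2, 3, 200, 255, 0, 7, 9 };
        Vector128<ushort> wide = WidenBytes(data);
        Console.WriteLine(wide.GetElement(3)); // 200, not a negative value
        Console.WriteLine(wide.GetElement(4)); // 255
    }
}
```

A pointer-taking overload would let the load and the extension be expressed as one call, which is the `PMOVZX xmm, [m]` form being requested.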
Might be misunderstanding, but this isn't a singular hardware instruction; so it would be classified as a helper (it is implemented in terms of the actual intrinsics) rather than being an actual hardware intrinsic itself. Given that it isn't a singular hardware instruction, and it isn't considered one of the "core" operations (which is basically just creating a vector and accessing individual elements), it likely wouldn't be considered at this point (users should be able to provide their own implementation in the interim). |
That should be just the |
Here you have it (sorry, found no better way for now): Also, check 3rd form from the top https://www.felixcloutier.com/x86/pmovzx I mean, IT IS singular hardware instruction. |
|
I confirm - dotnet/corefx#35768 covers my understanding expressed in item No. 6 |
@tannergooding Can this be closed in favor of other issues? If there is remaining work here, could you open separate issues to make it very clear what work remains for 3.0? |
Yes, I think this could be closed as I believe all issues are either resolved or tracked by other existing issues. @voinokin, feel free to clarify if you don't believe that is the case. |
I know this issue is officially closed, and I'm late to the show, but I am a bit confused by the current state of preview5...: All issues seem to be resolved, all PRs merged, yet Which seems to imply it isn't really supported at this stage... |
Is it this part that's confusing: "The native signature does not exist."? If so, that just means that there's no corresponding native (C++) intrinsic. You notice that for other intrinsics the equivalent C++ intrinsic is shown in addition to the target instruction, for example, a little further down we have:
The second line is the native (C++) intrinsic. |
UnpackLow() (which is different operation), or wrap existing signed upconversion with StaticCast<signed, unsigned>() which may provide overhead (see Sub-optimal codegen when using Sse.StaticCast<TFrom,TTo>() with non-VEX encoded HW intrinsics #10357). The suggestion is to add following convenience methods:
[V]PMOVZXBW xmm, xmm (same insn as for existing ConvertToVector128Int16())
[V]PMOVZXBD/WD xmm, xmm (same insn as for existing ConvertToVector128Int32())
[V]PMOVZXBQ/WQ/DQ xmm, xmm (same insn as for existing ConvertToVector128Int64())
The methods being suggested are:
VPMOVZXBW ymm, xmm (same insn as for existing ConvertToVector256UInt16())
VPMOVZXBD/WD ymm, xmm (same insn as for existing ConvertToVector256UInt32())
VPMOVZXBQ/WQ/DQ ymm, xmm (same insn as for existing ConvertToVector256UInt64())
Sse2.ConvertScalarToVector128UInt32/64() which produces (MOV r32/r64, imm + MOVD/MOVQ XMM, r32/r64), but then MOVDQA XMMd, XMMs is automatically issued when one attempts to use helper method Avx.ExtendToVector256() to get 256-bit vector. To my understanding the helper method was intended to be used as type conversion and produce no-op in such cases, since MOVD/MOVQ X/YMM, r32/r64 zeroes upper portion of dest reg. Below is an example of the issue I'm trying to explain:
OTOH, the following conversion in reverse direction produces code that looks fine/optimal:
Ssse3.AlignRight() that works on sbyte. I believe it makes sense to add overloads for other integer types, the same way as it was implemented for Sse2.ShiftRightLogical128BitLane() which is quite similar in operation. Otherwise developers will have to use type casting.
Ssse3.Shuffle() that works on sbyte. I believe it makes sense to add overload that will also work on byte.
Adding something mentioned in https://github.com/dotnet/coreclr/issues/18300#issuecomment-394772776
I believe these versions deserve their own overloads since conceptually they can be used on SSExx-only hardware which does not provide anything closer to implement "gather" and "scatter" operations (actually "scatter" only appears in AVX512, and anyway granularity is 32 or 64 bits IIRC):
PMOVZX/SX... xmm, [m] - these load from [m] and extend at once, a nice fusion. Esp. note the 2x 8-bit version.
PEXTRB/D/W + EXTRACTPS [m], xmm, i - spill single element from xmm to [m]
PINSRB/D/W + INSERTPS xmm, [m], i - merge single element from [m] into xmm. There is special issue open on API for INSERTPS (HW intrinsics API declaration is incorrect for Sse41.Insert() that operates on vector of 32-bit floats #10383).
category:testing
theme:intrinsics
skill-level:intermediate
cost:medium