[Arm64] SIMD HW Intrinsic API scope and high level design #24790
@RussKeldorph & @4creators have asked for a somewhat exhaustive API so that we can get the design right. @CarolEidt raises the legitimate issue that we should not overwhelm the API review team. This is intended as a middle ground. I just exhaustively surveyed the ARMv8 ARM for ARM64 SIMD class instructions. I was specifically looking for cases where design and/or naming would be challenging or the API would not be trivial. Due to the load-store nature of ARM ISAs it looks like there will not be much issue. I think the biggest design issues will be around Loads and Stores. I have drafted a naming convention. I expect SIMD will not be complete in time for netcoreapp2.1. However, a minimal set is already proposed and implementation is under way. The biggest implementation issues will be in CoreCLR:
Intrinsics API Proposal will lead/lag implementation slightly. I expect to propose APIs which will be implemented in < 4 weeks. I expect API review will be trivial, the only issue being X86 naming similarity. @eerhardt @jkotas @dotnet/arm64-contrib @dotnet/jit-contrib @tannergooding FYI.
We discussed this today and created the following list of open issues:
We also made these decisions:
@tannergooding @danmosemsft How do we want to track these? Is @tannergooding on the hook with driving getting the open issues sorted out?
@tannergooding are you a good person to drive this? If so go ahead.
Probably. I had a few of these on the backlog to discuss already, for the x86 side. I'll make sure issues are logged for these.
Just writing up the proposals for the structure loads and permute operations, was there ever an agreement on what to do when the instruction modifies multiple registers? The proposal above has them using Also does that work when the instruction has multiple destructive operands? like
I don't believe so. We should likely get input from both @CarolEidt and the API review team to determine if these should be (or w/e names we agree upon):
This also likely has some impact on the register allocator, as it appears these must be sequential registers (but can wrap around; e.g. 30, 31, 0, and 1 are valid). -- The ARM native intrinsics deal with this by defining
Yeah that makes sense. I'll hold off on defining these for now then till we have some resolution.
That will definitely be painful, so we should try to get these in ASAP so we can work through the inevitable issues. I'm mildly tempted to rework the existing handling of double for ARM32, which was also quite painful, and I'm still not happy with that design. For the API shape, it would be good to avoid the Pointer or Out solutions, as the JIT still has some difficulty in dealing with things that are "address-taken but not really". However, it should really be an API design choice. In any event we should ensure that we get the codegen that we expect, and do what's needed (presumably in the code that imports the intrinsics) to avoid spurious address-taken annotations.
Thanks. I'll try to schedule an API review with @terrajobst and will work with @TamarChristinaArm to ensure we have all of the topic points (including this) covered.
Most of the intrinsics with a clear and exact match to X86 have been proposed and have open issues.
This is intended as a draft design/scoping exercise for the SIMD class to help ease further API reviews.
Naming conventions
Floating, Signed, unsigned. These will be handled by the type system
*Add, *Subtract postfix for accumulating forms
Absolute, Halving, Numeric, Extend, Polynomial, Saturating, Rounding, Doubling, High, Long, Wide, Narrow, Upper, Lower
RoundEven, RoundZero, RoundPos, RoundNeg, RoundAway
SQDMULH (Signed saturating doubling multiply returning high half) would naturally become SaturatingDoublingMultiplyHigh, and that would be the proposed intrinsic name
Argument conventions
left and right arguments
value argument
target register to be inserted into. This will typically be the first argument. The method name will typically have a suffix Upper
acc register. This will be the left operand in the add/subtract.
Lowering/Containment
Scope/state of instructions/intrinsics
Outline follows ARMv8 ARM Reference Manual C3. A64 Instruction Set Overview, with focus on SIMD.
If intrinsic design looks straightforward, no comments are shown.
Outline is exhaustive to allow for discussion.
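To make the SQDMULH naming example from the naming conventions concrete, here is a minimal scalar sketch of what one 16-bit lane of that instruction computes. The class name and the scalar `short` signature are assumptions made for illustration only; the proposed intrinsic would operate on vector types, and this is not its implementation.

```csharp
using System;

// Hypothetical scalar model, for illustration only.
public static class NamingSketch
{
    // Models one signed 16-bit lane of SQDMULH.
    // Uses the left/right argument names from the argument conventions.
    public static short SaturatingDoublingMultiplyHigh(short left, short right)
    {
        // "Doubling" + "Multiply": form 2 * left * right in a wider type.
        long doubled = 2L * left * right;

        // "High": keep the upper 16 bits of the 32-bit doubled product.
        long high = doubled >> 16;

        // "Saturating": clamp to the lane range instead of wrapping.
        // The only overflowing input is left == right == short.MinValue.
        if (high > short.MaxValue) return short.MaxValue;
        if (high < short.MinValue) return short.MinValue;
        return (short)high;
    }
}
```

Each word of the method name maps onto one step of the computation, which is the motivation for deriving names from the instruction description rather than from the mnemonic.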
Recommendation:
ValueTuple<Vector64<A>, Vector64<B>> LoadVector64Pair<A,B>(void * address)
void Store<A,B>(void * address, ValueTuple<Vector64<A>, Vector64<B>>)
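As a rough illustration of the tuple-returning shape recommended above, the following sketch models a pair load and store in plain C#. It is an assumption-laden model, not the proposed API: arrays of `int` stand in for `Vector64<A>` and `Vector64<B>`, an array plus offset stands in for `void * address`, and the names `PairLoadSketch`, `LoadPair`, and `StorePair` are hypothetical.

```csharp
using System;

// Hypothetical model, for illustration only.
public static class PairLoadSketch
{
    // A 64-bit vector of 32-bit elements has two lanes.
    public const int Lanes = 2;

    // Reads two consecutive two-lane vectors starting at memory[offset]
    // and returns both in a single ValueTuple, with no out parameters.
    public static ValueTuple<int[], int[]> LoadPair(int[] memory, int offset)
    {
        var first = new int[Lanes];
        var second = new int[Lanes];
        Array.Copy(memory, offset, first, 0, Lanes);
        Array.Copy(memory, offset + Lanes, second, 0, Lanes);
        return ValueTuple.Create(first, second);
    }

    // Writes both vectors of the tuple back to consecutive memory.
    public static void StorePair(int[] memory, int offset, ValueTuple<int[], int[]> value)
    {
        Array.Copy(value.Item1, 0, memory, offset, Lanes);
        Array.Copy(value.Item2, 0, memory, offset + Lanes, Lanes);
    }
}
```

Returning a `ValueTuple` keeps both results as ordinary values, which lines up with the preference expressed in the discussion for avoiding pointer or out-parameter shapes.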
Recommendation:
ValueTuple<Vector64<A>, Vector64<B>> LoadVector64NonTemporalPair<A,B>(void * address)
void StoreNonTemporal<A,B>(void * address, ValueTuple<Vector64<A>, Vector64<B>>)
Recommendation:
Vector64<A> LoadVector64<A>(void * address, Vector64<A> target)
ValueTuple<Vector64<A>, ... > LoadVector64Tuple<A,B,C,D>(void * address, ValueTuple<...> target)
void Store<A>(void * address, Vector64<A> target)
void Store<A,B,C,D>(void * address, ValueTuple<Vector...> target)
Recommendation:
Vector64<A> LoadVector64<A>(void * address, Vector64<A> target, byte index)
ValueTuple<Vector64<A>, ... > LoadVector64Tuple<A,B,C,D>(void * address, ValueTuple<...> target, byte index)
void Store<A>(void * address, Vector64<A> target, byte index)
void Store<A,B,C,D>(void * address, ValueTuple<Vector...> target, byte index)
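The indexed signatures above can be read as single-lane operations: the load replaces one lane of `target` and leaves the rest untouched, and the store writes one lane out. That reading is an assumption, as are the names below; the sketch uses `int` arrays in place of `Vector64<A>` and an array plus offset in place of `void * address`.

```csharp
using System;

// Hypothetical model, for illustration only.
public static class LaneLoadSketch
{
    // Returns a copy of target with lane `index` replaced by memory[offset].
    // All other lanes keep the values they had in target.
    public static int[] LoadLane(int[] memory, int offset, int[] target, byte index)
    {
        if (index >= target.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        var result = (int[])target.Clone();
        result[index] = memory[offset];
        return result;
    }

    // Writes lane `index` of source to memory[offset].
    public static void StoreLane(int[] memory, int offset, int[] source, byte index)
    {
        if (index >= source.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        memory[offset] = source[index];
    }
}
```

Passing `target` in and returning the updated vector is one way to express an instruction that both reads and writes the same register without resorting to a ref or out parameter.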
Recommendation:
Vector64<A> LoadAllVector64<A>(void * address)
ValueTuple<Vector64<A>, ... Vector64<D>> LoadAllVector64Tuple<A,B,C,D>(void * address)
Recommendation:
Vector64<float> ConvertToVector64Single(Vector64<int> a)
Vector128<double> ConvertToVector128Double(Vector128<ulong> a)
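The two conversion signatures above convert values lane by lane; they do not reinterpret bits. A minimal sketch of that behaviour follows, assuming arrays in place of the vector types and hypothetical method names.

```csharp
using System;

// Hypothetical model, for illustration only.
public static class ConvertSketch
{
    // Models ConvertToVector64Single: each signed integer lane becomes
    // the float with the same numeric value (subject to float rounding).
    public static float[] ConvertToSingle(int[] value)
    {
        return Array.ConvertAll(value, x => (float)x);
    }

    // Models ConvertToVector128Double: each unsigned 64-bit lane becomes
    // the double with the same numeric value (subject to double rounding).
    public static double[] ConvertToDouble(ulong[] value)
    {
        return Array.ConvertAll(value, x => (double)x);
    }
}
```

Signedness is carried by the element type of the argument, so a single `ConvertTo*` name can cover both the signed and unsigned forms, in keeping with the naming convention of leaving that to the type system.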
Use ReverseElementBits for REV
Use ReverseElementBytes for REV16, REV32, REV64 (separate names would make implementation slightly simpler.)
Whenever possible treat the element as the base type & contain the Extract element intrinsic
Handle these when feasible by containment/lowering
ConvertTo* i.e. ConvertToSingleRoundNearest
Use *Across per ARM convention (or Horizontal* per X86 convention.)
*Pairwise
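The difference between the *Across and *Pairwise forms is easiest to see with addition. The sketch below models both on four-lane `int` arrays; the method names are hypothetical, and the pairwise layout follows the ARM ADDP behaviour (pairs from the first operand fill the low lanes, pairs from the second fill the high lanes).

```csharp
using System;

// Hypothetical model, for illustration only.
public static class ReduceSketch
{
    // "Across": reduce every lane of one vector to a single scalar.
    public static int AddAcross(int[] value)
    {
        int sum = 0;
        foreach (int lane in value)
            sum = unchecked(sum + lane); // wraps on overflow, like the lane arithmetic
        return sum;
    }

    // "Pairwise": add adjacent lanes. Results from `left` fill the low
    // half of the output, results from `right` fill the high half.
    public static int[] AddPairwise(int[] left, int[] right)
    {
        if (left.Length != right.Length || left.Length % 2 != 0)
            throw new ArgumentException("vectors must have equal, even lane counts");

        int half = left.Length / 2;
        var result = new int[left.Length];
        for (int i = 0; i < half; i++)
        {
            result[i] = unchecked(left[2 * i] + left[2 * i + 1]);
            result[half + i] = unchecked(right[2 * i] + right[2 * i + 1]);
        }
        return result;
    }
}
```

An across operation produces one value from one vector, while a pairwise operation produces a full vector from two, so the two suffixes describe different result shapes and not just different names.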