[Arm64] SIMD HW Intrinsic API scope and high level design #24790

Closed
18 of 25 tasks
sdmaclea opened this issue Jan 24, 2018 · 9 comments

sdmaclea (Contributor) commented Jan 24, 2018

Most of the intrinsics with a clear and exact match to X86 have already been proposed and have open issues.

This is intended as a draft design/scoping exercise for the SIMD class to help ease further API reviews.

Naming conventions

  • Intrinsic names will roughly follow the instruction descriptions in the ARMv8 ARM tables from section C3, A64 Instruction Set Overview
  • Drop the adjectives Floating, Signed, and Unsigned; these will be handled by the type system
  • Use the *Add and *Subtract postfixes for accumulating forms
  • Use modifiers without abbreviation:
    Absolute, Halving, Numeric, Extend, Polynomial, Saturating, Rounding, Doubling,
    High, Long, Wide, Narrow, Upper, Lower,
    RoundEven, RoundZero, RoundPos, RoundNeg, RoundAway
  • For example, SQDMULH (signed saturating doubling multiply returning high half) would naturally become SaturatingDoublingMultiplyHigh, and that would be the proposed intrinsic name (see the sketch below)
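
As a purely illustrative sketch of the convention (the containing class name and the exact overload set are assumptions, not a reviewed API surface), SQDMULH would surface roughly as:

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical sketch only: the class name and overloads are placeholders
    // used to illustrate the naming convention, not a shipped API.
    public static class Arm64SimdNamingSketch
    {
        // SQDMULH: signed saturating doubling multiply returning high half.
        // "Signed" is dropped because the element type (short/int) already carries it.
        public static Vector128<short> SaturatingDoublingMultiplyHigh(
            Vector128<short> left, Vector128<short> right) =>
            throw new PlatformNotSupportedException();

        public static Vector128<int> SaturatingDoublingMultiplyHigh(
            Vector128<int> left, Vector128<int> right) =>
            throw new PlatformNotSupportedException();
    }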

Argument conventions

  • Binary operators will take left and right arguments
  • Unary operators will take a value argument
  • Instructions which insert into the high half will take a source operand, which is the target register to be inserted into. This will typically be the first argument, and the method name will typically have an Upper suffix
  • Instructions with adding or subtracting accumulators will take a source operand, which is the accumulator register. This will be the left operand in the add/subtract (see the sketch after this list)
  • Argument order will typically be left-to-right, following ARM assembly conventions. Exceptions can and will occur, especially when mirroring an existing X86 C# API.
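
A minimal sketch of the accumulator convention (the class and method names are assumptions; MLA is used only as an example of a multiply-accumulate form):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical sketch: the accumulator (the register being added into) comes
    // first, followed by the remaining operands in left-to-right assembly order.
    public static class Arm64ArgumentOrderSketch
    {
        // MLA-style semantics: result = acc + (left * right), so acc is the
        // left operand of the add, as described in the convention above.
        public static Vector128<int> MultiplyAdd(
            Vector128<int> acc, Vector128<int> left, Vector128<int> right) =>
            throw new PlatformNotSupportedException();
    }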

Lowering/Containment

  • Whenever an intrinsic can easily be expressed through containment without loss, it should be dropped
  • If there are intermediate truncation/rounding/overflow issues, containment is rejected, since identical results cannot be guaranteed
  • By-element forms will typically be exposed through containment (see the sketch below)
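
A minimal sketch of the by-element containment idea, assuming placeholder Multiply and Extract intrinsics (neither name is a reviewed API): rather than exposing a separate multiply-by-element intrinsic, the JIT could recognize the broadcast-of-an-extracted-element pattern and emit the by-element form of MUL (e.g. MUL Vd.4S, Vn.4S, Vm.S[2]) directly.

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical sketch of containment; Multiply and Extract are placeholders.
    public static class Arm64ContainmentSketch
    {
        public static Vector128<int> Multiply(Vector128<int> left, Vector128<int> right) =>
            throw new PlatformNotSupportedException();

        public static int Extract(Vector128<int> value, byte index) =>
            throw new PlatformNotSupportedException();

        // The broadcast of a single extracted element is the containment candidate:
        // the JIT could fold the Extract + Create pair into the Vm.S[2] operand of MUL.
        public static Vector128<int> MultiplyByElementExample(
            Vector128<int> left, Vector128<int> right) =>
            Multiply(left, Vector128.Create(Extract(right, 2)));
    }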

Scope/state of instructions/intrinsics
The outline follows the ARMv8 ARM reference manual section C3, A64 Instruction Set Overview, with a focus on SIMD.
If the intrinsic design looks straightforward, no comments are shown.
The outline is exhaustive to allow for discussion.

  • Load/Store scalar SIMD
    • Load/Store scalar SIMD API Proposal : Arm64 Load & Store #24771
    • Load/Store scalar SIMD register pair
      Recommendation (a usage sketch follows this outline):
      ValueTuple<Vector64<A>, Vector64<B>> LoadVector64Pair<A,B>(void * address)
      void Store<A,B>(void * address, ValueTuple<Vector64<A>, Vector64<B>>)
    • Load/Store scalar SIMD register Non-temporal pair
      Recommendation:
      ValueTuple<Vector64<A>, Vector64<B>> LoadVector64NonTemporalPair<A,B>(void * address)
      void StoreNonTemporal<A,B>(void * address, ValueTuple<Vector64<A>, Vector64<B>>)
  • Load/Store Vector
    • Load/Store structures (multiple structures)
      Recommendation:
      Vector64<A> LoadVector64<A>(void * address, Vector64<A> target)
      ValueTuple<Vector64<A>, ... > LoadVector64Tuple<A,B,C,D>(void * address, ValueTuple<...> target)
      void Store<A>(void * address, Vector64<A> target)
      void Store<A,B,C,D>(void * address, ValueTuple<Vector...> target)
    • Load/Store structures (single structures)
      Recommendation:
      Vector64<A> LoadVector64<A>(void * address, Vector64<A> target, byte index)
      ValueTuple<Vector64<A>, ... > LoadVector64Tuple<A,B,C,D>(void * address, ValueTuple<...> target, byte index)
      void Store<A>(void * address, Vector64<A> target, byte index)
      void Store<A,B,C,D>(void * address, ValueTuple<Vector...> target, byte index)
    • Load single structure and replicate
      Recommendation:
      Vector64<A> LoadAllVector64<A>(void * address)
      ValueTuple<Vector64<A>, ... Vector64<D>> LoadAllVector64Tuple<A,B,C,D>(void * address)
  • Floating-point conversion
    • convert to floating-point
      Recommendation:
      Vector64<float> ConvertToVector64Single(Vector64<int> a)
      Vector128<double> ConvertToVector128Double(Vector128<ulong> a)
  • SIMD move
  • SIMD arithmetic
  • SIMD compare
  • SIMD widening and narrowing arithmetic
  • SIMD unary arithmetic
    Use ReverseElementBits for REV
    Use ReverseElementBytes for REV16, REV32, REV64 (separate names would make the implementation slightly simpler)
  • SIMD by element arithmetic
    Whenever possible, treat the element as the base type and contain the Extract-element intrinsic
  • SIMD permute
  • SIMD immediate
    Handle these when feasible by containment/lowering
  • SIMD shift (immediate)
  • SIMD floating-point and integer conversion
    ConvertTo*, e.g. ConvertToSingleRoundNearest
  • SIMD reduce (across vector lanes)
    Use *Across per the ARM convention (or Horizontal* per the X86 convention)
  • SIMD pairwise arithmetic
    *Pairwise
  • SIMD table lookup
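
As a purely illustrative consumption example for the ValueTuple-based pair-load recommendation in the outline above (LoadVector64Pair is the proposed shape from this issue, not a finalized API; the float instantiation is chosen arbitrarily):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical sketch of consuming the proposed pair-load shape.
    public static unsafe class PairLoadSketch
    {
        // Proposed shape (placeholder body): an LDP-style load of two
        // consecutive 64-bit vectors returned as a ValueTuple.
        public static ValueTuple<Vector64<float>, Vector64<float>> LoadVector64Pair(void* address) =>
            throw new PlatformNotSupportedException();

        public static Vector64<float> Example(float* address)
        {
            // The caller deconstructs the tuple into the two loaded registers.
            (Vector64<float> low, Vector64<float> high) = LoadVector64Pair(address);
            return low; // high would be used similarly
        }
    }
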
sdmaclea (Contributor, Author) commented Jan 24, 2018

@RussKeldorph & @4creators have asked for a somewhat exhaustive API so that we can get the design right.

@CarolEidt raises the legitimate issue that we should not overwhelm the API review team.

This is intended as a middle ground. I just exhaustively surveyed the ARMv8 ARM for ARM64 SIMD-class instructions, specifically looking for cases where the design and/or naming would be challenging or the API would not be trivial. Due to the load/store nature of ARM ISAs, it looks like there will not be many issues.

I think the biggest design issues will be around Load and Stores.

I have drafted a naming convention, and I have drafted APIs for the cases which I think will be contentious.

I expect SIMD will not be complete in time for netcoreapp2.1. However a minimal set is already proposed and implementation is under way.

The biggest implementation issues will be in CoreCLR:

  • Handling the Short Vector and Homogeneous Short Vector Aggregates ABI calling conventions correctly.
  • The sheer volume of instruction emitters, JIT code, and test code to be implemented.

The intrinsics API proposal will lead/lag the implementation slightly. I expect to propose APIs which will be implemented in < 4 weeks. I expect API review will be trivial, the only issue being X86 naming similarity.

@eerhardt @jkotas @dotnet/arm64-contrib @dotnet/jit-contrib @tannergooding FYI.

terrajobst (Contributor) commented Sep 5, 2018

We discussed this today and created the following list of open issues:

  • Should we have generalized methods on the vector types or should they live on the ISA-specific types?
    • Specifically: initialization and reinterpret-cast-style methods
  • Should the generalized vector types be constrained to specific types (e.g. via an if-check in the static constructor)?
  • Should we add methods with types that aren't supported and mark them as [Obsolete(IsError: true)]?
    • The benefit is that this allows us to give a better error message than "the overload cannot be found" (see the sketch after this list).
  • Should methods dealing with pointers accept Span<T> or ref T?
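
A minimal sketch of the [Obsolete]-as-error idea (the class, method, and element types are placeholders chosen only to illustrate the diagnostic; this is not a proposed API surface):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical illustration of surfacing unsupported instantiations.
    public static class ObsoleteOverloadSketch
    {
        // Supported instantiation.
        public static Vector64<float> Abs(Vector64<float> value) =>
            throw new PlatformNotSupportedException();

        // Unsupported instantiation exposed only so the C# compiler reports a
        // descriptive error instead of "no suitable overload found".
        [Obsolete("Vector64<bool> is not a supported vector type.", error: true)]
        public static Vector64<bool> Abs(Vector64<bool> value) =>
            throw new PlatformNotSupportedException();
    }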

We also made these decisions:

  • We should expand all generics, given that we're planning on adding a new primitive (Half). This makes it very clear what is supported.
    • However, this might expand the metadata; we should measure the impact.
  • ARM should live in System.Runtime.Intrinsics.Arm, i.e. we should merge the Arm namespace with Arm64. We should resolve type conflicts between 32 and 64 by prefixes.

@tannergooding @danmosemsft How do we want to track these? Is @tannergooding on the hook for driving getting the open issues sorted out?

danmoseley (Member) commented

@tannergooding are you a good person to drive this? If so, go ahead.

tannergooding (Member) commented

Probably. I had a few of these on the backlog to discuss already, for the x86 side.

I'll make sure issues are logged for these.

TamarChristinaArm (Contributor) commented Oct 10, 2019

I'm just writing up the proposals for the structure loads and permute operations; was there ever an agreement on what to do when the instruction modifies multiple registers? The proposal above has them using ValueTuple, but I see no checkbox next to it.

Also, does that work when the instruction has multiple destructive operands, like VUZP in AArch32?

tannergooding (Member) commented Oct 10, 2019

was there ever an agreement on what to do when the instruction modifies multiple registers

I don't believe so.

We should likely get input from both @CarolEidt and the API review team to determine which of these shapes (or whatever names we agree upon) these should be; a placeholder sketch of the struct option appears at the end of this comment:

  • Struct: Vector64x2<T> LoadVector64x2(T* address)
  • Value tuple: (Vector64<T>, Vector64<T>) LoadVector64x2(T* address)
  • Pointer: void LoadVector128(T* address, Vector64<T>* firstReg, Vector64<T>* secondReg)
  • Out: void LoadVector128(T* address, out Vector64<T> firstReg, out Vector64<T> secondReg)

This also likely has some impact on the register allocator, as it appears these must be sequential registers (but can wrap around; e.g. 30, 31, 0, and 1 are valid). -- The ARM native intrinsics deal with this by defining int8x8x2_t (2x Vector64), int8x8x3_t, int8x8x4_t, int8x16x2_t (2x Vector128), int8x16x3_t, and int8x16x4_t, for example.
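
For illustration only, the struct option above would roughly mirror the ARM C-language multi-vector types (int8x8x2_t and friends); Vector64x2<T> and LoadVector64x2 are candidate names from the list above, not approved APIs:

    using System;
    using System.Runtime.Intrinsics;

    // Placeholder sketch of the "struct" option, mirroring ARM's int8x8x2_t.
    public readonly struct Vector64x2<T> where T : struct
    {
        public readonly Vector64<T> Value1;
        public readonly Vector64<T> Value2;

        public Vector64x2(Vector64<T> value1, Vector64<T> value2)
        {
            Value1 = value1;
            Value2 = value2;
        }
    }

    public static unsafe class MultiRegisterLoadSketch
    {
        // LD1 { v0.8b, v1.8b }, [x0] style load returning both destination registers.
        public static Vector64x2<byte> LoadVector64x2(byte* address) =>
            throw new PlatformNotSupportedException();
    }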

TamarChristinaArm (Contributor) commented

Yeah, that makes sense. I'll hold off on defining these for now, then, until we have some resolution.

CarolEidt (Contributor) commented

This also likely has some impact on the register allocator, as it appears these must be sequential registers (but can wrap around; e.g. 30, 31, 0, and 1 are valid).

That will definitely be painful, so we should try to get these in ASAP so we can work through the inevitable issues. I'm mildly tempted to rework the existing handling of double for ARM32, which was also quite painful, and I'm still not happy with that design.

For the API shape, it would be good to avoid the Pointer or Out solutions, as the JIT still has some difficulty in dealing with things that are "address-taken but not really". However, it should really be an API design choice. In any event we should ensure that we get the codegen that we expect, and do what's needed (presumably in the code that imports the intrinsics) to avoid spurious address-taken annotations.

tannergooding (Member) commented

Thanks. I'll try to schedule an API review with @terrajobst and will work with @TamarChristinaArm to ensure we have all of the topic points (including this) covered.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jul 11, 2021