[Arm64] SIMD HW Intrinsic API scope and high level design #24790

Closed
18 of 25 tasks
sdmaclea opened this issue Jan 24, 2018 · 9 comments

sdmaclea (Contributor) commented Jan 24, 2018

Most of the intrinsics with a clear and exact match to X86 have already been proposed and have open issues.

This is intended as a draft design/scoping exercise for the SIMD class to help ease further API reviews.

Naming conventions

  • Intrinsic names will roughly follow the instruction descriptions in the ARMv8 ARM tables from section C3, A64 Instruction Set Overview
  • Drop the adjectives Floating, Signed, and Unsigned; these will be handled by the type system
  • Use the *Add and *Subtract postfixes for accumulating forms
  • Use modifiers without abbreviation:
    Absolute, Halving, Numeric, Extend, Polynomial, Saturating, Rounding, Doubling,
    High, Long, Wide, Narrow, Upper, Lower,
    RoundEven, RoundZero, RoundPos, RoundNeg, RoundAway
  • For example, SQDMULH (signed saturating doubling multiply returning high half) would naturally become SaturatingDoublingMultiplyHigh, and that would be the proposed intrinsic name (see the sketch below)
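
As a purely illustrative sketch of the convention (the containing class name and the exact overload set are assumptions, not a reviewed API surface), SQDMULH would surface roughly as:

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical sketch only: the class name and overloads are placeholders
    // used to illustrate the naming convention, not a shipped API.
    public static class Arm64SimdNamingSketch
    {
        // SQDMULH: signed saturating doubling multiply returning high half.
        // "Signed" is dropped because the element type (short/int) already carries it.
        public static Vector128<short> SaturatingDoublingMultiplyHigh(
            Vector128<short> left, Vector128<short> right) =>
            throw new PlatformNotSupportedException();

        public static Vector128<int> SaturatingDoublingMultiplyHigh(
            Vector128<int> left, Vector128<int> right) =>
            throw new PlatformNotSupportedException();
    }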

Argument conventions

  • Binary operators will take left and right arguments
  • Unary operators will take a value argument
  • Instructions which insert into the high half will take a source operand, which is the target register to be inserted into. This will typically be the first argument, and the method name will typically have an Upper suffix
  • Instructions with adding or subtracting accumulators will take a source operand, which is the accumulator register. This will be the left operand in the add/subtract (see the sketch after this list)
  • Argument order will typically be left-to-right, following ARM assembly conventions. Exceptions can and will occur, especially when mirroring an existing X86 C# API.
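
A minimal sketch of the accumulator convention (the class and method names are assumptions; MLA is used only as an example of a multiply-accumulate form):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical sketch: the accumulator (the register being added into) comes
    // first, followed by the remaining operands in left-to-right assembly order.
    public static class Arm64ArgumentOrderSketch
    {
        // MLA-style semantics: result = acc + (left * right), so acc is the
        // left operand of the add, as described in the convention above.
        public static Vector128<int> MultiplyAdd(
            Vector128<int> acc, Vector128<int> left, Vector128<int> right) =>
            throw new PlatformNotSupportedException();
    }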

Lowering/Containment

  • Whenever an intrinsic can easily be expressed through containment without loss, it should be dropped
  • If there are intermediate truncation/rounding/overflow issues, containment is rejected, since identical results cannot be guaranteed
  • By-element forms will typically be exposed through containment (see the sketch below)
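
A minimal sketch of the by-element containment idea, assuming placeholder Multiply and Extract intrinsics (neither name is a reviewed API): rather than exposing a separate multiply-by-element intrinsic, the JIT could recognize the broadcast-of-an-extracted-element pattern and emit the by-element form of MUL (e.g. MUL Vd.4S, Vn.4S, Vm.S[2]) directly.

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical sketch of containment; Multiply and Extract are placeholders.
    public static class Arm64ContainmentSketch
    {
        public static Vector128<int> Multiply(Vector128<int> left, Vector128<int> right) =>
            throw new PlatformNotSupportedException();

        public static int Extract(Vector128<int> value, byte index) =>
            throw new PlatformNotSupportedException();

        // The broadcast of a single extracted element is the containment candidate:
        // the JIT could fold the Extract + Create pair into the Vm.S[2] operand of MUL.
        public static Vector128<int> MultiplyByElementExample(
            Vector128<int> left, Vector128<int> right) =>
            Multiply(left, Vector128.Create(Extract(right, 2)));
    }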

Scope/state of instructions/intrinsics
The outline follows the ARMv8 ARM reference manual section C3, A64 Instruction Set Overview, with a focus on SIMD.
If the intrinsic design looks straightforward, no comments are shown.
The outline is exhaustive to allow for discussion.

  • Load/Store scalar SIMD
    • Load/Store scalar SIMD API Proposal : Arm64 Load & Store #24771
    • Load/Store scalar SIMD register pair
      Recommendation (a usage sketch follows this outline):
      ValueTuple<Vector64<A>, Vector64<B>> LoadVector64Pair<A,B>(void * address)
      void Store<A,B>(void * address, ValueTuple<Vector64<A>, Vector64<B>>)
    • Load/Store scalar SIMD register Non-temporal pair
      Recommendation:
      ValueTuple<Vector64<A>, Vector64<B>> LoadVector64NonTemporalPair<A,B>(void * address)
      void StoreNonTemporal<A,B>(void * address, ValueTuple<Vector64<A>, Vector64<B>>)
  • Load/Store Vector
    • Load/Store structures (multiple structures)
      Recommendation:
      Vector64<A> LoadVector64<A>(void * address, Vector64<A> target)
      ValueTuple<Vector64<A>, ... > LoadVector64Tuple<A,B,C,D>(void * address, ValueTuple<...> target)
      void Store<A>(void * address, Vector64<A> target)
      void Store<A,B,C,D>(void * address, ValueTuple<Vector...> target)
    • Load/Store structures (single structures)
      Recommendation:
      Vector64<A> LoadVector64<A>(void * address, Vector64<A> target, byte index)
      ValueTuple<Vector64<A>, ... > LoadVector64Tuple<A,B,C,D>(void * address, ValueTuple<...> target, byte index)
      void Store<A>(void * address, Vector64<A> target, byte index)
      void Store<A,B,C,D>(void * address, ValueTuple<Vector...> target, byte index)
    • Load single structure and replicate
      Recommendation:
      Vector64<A> LoadAllVector64<A>(void * address)
      ValueTuple<Vector64<A>, ... Vector64<D>> LoadAllVector64Tuple<A,B,C,D>(void * address)
  • Floating-point conversion
    • convert to floating-point
      Recommendation:
      Vector64<float> ConvertToVector64Single(Vector64<int> a)
      Vector128<double> ConvertToVector128Double(Vector128<ulong> a)
  • SIMD move
  • SIMD arithmetic
  • SIMD compare
  • SIMD widening and narrowing arithmetic
  • SIMD unary arithmetic
    Use ReverseElementBits for REV
    Use ReverseElementBytes for REV16, REV32, REV64 (separate names would make the implementation slightly simpler)
  • SIMD by element arithmetic
    Whenever possible, treat the element as the base type and contain the Extract-element intrinsic
  • SIMD permute
  • SIMD immediate
    Handle these when feasible by containment/lowering
  • SIMD shift (immediate)
  • SIMD floating-point and integer conversion
    ConvertTo*, e.g. ConvertToSingleRoundNearest
  • SIMD reduce (across vector lanes)
    Use *Across per the ARM convention (or Horizontal* per the X86 convention)
  • SIMD pairwise arithmetic
    *Pairwise
  • SIMD table lookup
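
As a purely illustrative consumption example for the ValueTuple-based pair-load recommendation in the outline above (LoadVector64Pair is the proposed shape from this issue, not a finalized API; the float instantiation is chosen arbitrarily):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical sketch of consuming the proposed pair-load shape.
    public static unsafe class PairLoadSketch
    {
        // Proposed shape (placeholder body): an LDP-style load of two
        // consecutive 64-bit vectors returned as a ValueTuple.
        public static ValueTuple<Vector64<float>, Vector64<float>> LoadVector64Pair(void* address) =>
            throw new PlatformNotSupportedException();

        public static Vector64<float> Example(float* address)
        {
            // The caller deconstructs the tuple into the two loaded registers.
            (Vector64<float> low, Vector64<float> high) = LoadVector64Pair(address);
            return low; // high would be used similarly
        }
    }
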
sdmaclea (Contributor, Author) commented Jan 24, 2018

@RussKeldorph & @4creators have asked for a somewhat exhaustive API so that we can get the design right.

@CarolEidt raises the legitimate issue that we should not overwhelm the API review team.

This is intended as a middle ground. I just exhaustively surveyed the ARMv8 ARM for ARM64 SIMD-class instructions, specifically looking for cases where the design and/or naming would be challenging or the API would not be trivial. Due to the load/store nature of ARM ISAs, it looks like there will not be many issues.

I think the biggest design issues will be around Load and Stores.

I have drafted a naming convention, and I have drafted APIs for the cases which I think will be contentious.

I expect SIMD will not be complete in time for netcoreapp2.1. However a minimal set is already proposed and implementation is under way.

The biggest implementation issues will be in CoreCLR:

  • Handling the Short Vector and Homogeneous Short Vector Aggregates ABI calling conventions correctly.
  • The sheer volume of instruction emitters, JIT code, and test code to be implemented.

The intrinsics API proposal will lead/lag the implementation slightly. I expect to propose APIs which will be implemented in < 4 weeks. I expect API review will be trivial, the only issue being X86 naming similarity.

@eerhardt @jkotas @dotnet/arm64-contrib @dotnet/jit-contrib @tannergooding FYI.

terrajobst (Contributor) commented Sep 5, 2018

We discussed this today and created the following list of open issues:

  • Should we have generalized methods on the vector types or should they live on the ISA-specific types?
    • Specifically: initialization and reinterpret-cast-style methods
  • Should the generalized vector types be constrained to specific types (e.g. via an if-check in the static constructor)?
  • Should we add methods with types that aren't supported and mark them as [Obsolete(IsError: true)]?
    • The benefit is that this allows us to give a better error message than "the overload cannot be found" (see the sketch after this list).
  • Should methods dealing with pointers accept Span<T> or ref T?
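
A minimal sketch of the [Obsolete]-as-error idea (the class, method, and element types are placeholders chosen only to illustrate the diagnostic; this is not a proposed API surface):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical illustration of surfacing unsupported instantiations.
    public static class ObsoleteOverloadSketch
    {
        // Supported instantiation.
        public static Vector64<float> Abs(Vector64<float> value) =>
            throw new PlatformNotSupportedException();

        // Unsupported instantiation exposed only so the C# compiler reports a
        // descriptive error instead of "no suitable overload found".
        [Obsolete("Vector64<bool> is not a supported vector type.", error: true)]
        public static Vector64<bool> Abs(Vector64<bool> value) =>
            throw new PlatformNotSupportedException();
    }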

We also made these decisions:

  • We should expand all generics, given that we're planning on adding a new primitive (Half). This makes it very clear what is supported.
    • However, this might expand the metadata; we should measure the impact.
  • ARM should live in System.Runtime.Intrinsics.Arm, i.e. we should merge the Arm namespace with Arm64. We should resolve type conflicts between 32 and 64 by prefixes.

@tannergooding @danmosemsft How do we want to track these? Is @tannergooding on the hook for driving getting the open issues sorted out?

danmoseley (Member) commented

@tannergooding are you a good person to drive this? If so, go ahead.

tannergooding (Member) commented

Probably. I had a few of these on the backlog to discuss already, for the x86 side.

I'll make sure issues are logged for these.

TamarChristinaArm (Contributor) commented Oct 10, 2019

I'm just writing up the proposals for the structure loads and permute operations; was there ever an agreement on what to do when the instruction modifies multiple registers? The proposal above has them using ValueTuple, but I see no checkbox next to it.

Also, does that work when the instruction has multiple destructive operands, like VUZP in AArch32?

tannergooding (Member) commented Oct 10, 2019

was there ever an agreement on what to do when the instruction modifies multiple registers

I don't believe so.

We should likely get input from both @CarolEidt and the API review team to determine which of these shapes (or whatever names we agree upon) these should be; a placeholder sketch of the struct option appears at the end of this comment:

  • Struct: Vector64x2<T> LoadVector64x2(T* address)
  • Value tuple: (Vector64<T>, Vector64<T>) LoadVector64x2(T* address)
  • Pointer: void LoadVector128(T* address, Vector64<T>* firstReg, Vector64<T>* secondReg)
  • Out: void LoadVector128(T* address, out Vector64<T> firstReg, out Vector64<T> secondReg)

This also likely has some impact on the register allocator, as it appears these must be sequential registers (but can wrap around; e.g. 30, 31, 0, and 1 are valid). -- The ARM native intrinsics deal with this by defining int8x8x2_t (2x Vector64), int8x8x3_t, int8x8x4_t, int8x16x2_t (2x Vector128), int8x16x3_t, and int8x16x4_t, for example.
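
For illustration only, the struct option above would roughly mirror the ARM C-language multi-vector types (int8x8x2_t and friends); Vector64x2<T> and LoadVector64x2 are candidate names from the list above, not approved APIs:

    using System;
    using System.Runtime.Intrinsics;

    // Placeholder sketch of the "struct" option, mirroring ARM's int8x8x2_t.
    public readonly struct Vector64x2<T> where T : struct
    {
        public readonly Vector64<T> Value1;
        public readonly Vector64<T> Value2;

        public Vector64x2(Vector64<T> value1, Vector64<T> value2)
        {
            Value1 = value1;
            Value2 = value2;
        }
    }

    public static unsafe class MultiRegisterLoadSketch
    {
        // LD1 { v0.8b, v1.8b }, [x0] style load returning both destination registers.
        public static Vector64x2<byte> LoadVector64x2(byte* address) =>
            throw new PlatformNotSupportedException();
    }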

TamarChristinaArm (Contributor) commented

Yeah, that makes sense. I'll hold off on defining these for now, then, until we have some resolution.

CarolEidt (Contributor) commented

This also likely has some impact on the register allocator, as it appears these must be sequential registers (but can wrap around; e.g. 30, 31, 0, and 1 are valid).

That will definitely be painful, so we should try to get these in ASAP so we can work through the inevitable issues. I'm mildly tempted to rework the existing handling of double for ARM32, which was also quite painful, and I'm still not happy with that design.

For the API shape, it would be good to avoid the Pointer or Out solutions, as the JIT still has some difficulty in dealing with things that are "address-taken but not really". However, it should really be an API design choice. In any event we should ensure that we get the codegen that we expect, and do what's needed (presumably in the code that imports the intrinsics) to avoid spurious address-taken annotations.

tannergooding (Member) commented

Thanks. I'll try to schedule an API review with @terrajobst and will work with @TamarChristinaArm to ensure we have all of the topic points (including this) covered.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jul 11, 2021