API Proposal : Arm Shift and Permute intrinsics #31324
Thanks @TamarChristinaArm. I've added this to the list of APIs to cover tomorrow.
API:

namespace System.Runtime.Intrinsics.Arm
{
public partial class ArmBase
{
/// <summary>
/// vslid_n_[su]64
///
/// A64: SLI
/// A32: VSLI
/// </summary>
public static Vector64<long> ShiftLeftLogicalAndInsertScalar(Vector64<long> left, Vector64<long> right, byte shift);
public static Vector64<ulong> ShiftLeftLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);
/// <summary>
/// vsrid_n_[su]64
///
/// A64: SRI
/// A32: VSRI
/// </summary>
public static Vector64<long> ShiftRightLogicalAndInsertScalar(Vector64<long> left, Vector64<long> right, byte shift);
public static Vector64<ulong> ShiftRightLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);
}
public partial class AdvSimd
{
/// <summary>
/// vsli[q]_n_[su][8,16,32,64]
///
/// A64: SLI
/// A32: VSLI
/// </summary>
public static Vector64<byte> ShiftLeftLogicalAndInsert(Vector64<byte> left, Vector64<byte> right, byte shift);
public static Vector64<ushort> ShiftLeftLogicalAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
public static Vector64<uint> ShiftLeftLogicalAndInsert(Vector64<uint> left, Vector64<uint> right, byte shift);
public static Vector64<sbyte> ShiftLeftLogicalAndInsert(Vector64<sbyte> left, Vector64<sbyte> right, byte shift);
public static Vector64<short> ShiftLeftLogicalAndInsert(Vector64<short> left, Vector64<short> right, byte shift);
public static Vector64<int> ShiftLeftLogicalAndInsert(Vector64<int> left, Vector64<int> right, byte shift);
public static Vector128<byte> ShiftLeftLogicalAndInsert(Vector128<byte> left, Vector128<byte> right, byte shift);
public static Vector128<ushort> ShiftLeftLogicalAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
public static Vector128<uint> ShiftLeftLogicalAndInsert(Vector128<uint> left, Vector128<uint> right, byte shift);
public static Vector128<ulong> ShiftLeftLogicalAndInsert(Vector128<ulong> left, Vector128<ulong> right, byte shift);
public static Vector128<sbyte> ShiftLeftLogicalAndInsert(Vector128<sbyte> left, Vector128<sbyte> right, byte shift);
public static Vector128<short> ShiftLeftLogicalAndInsert(Vector128<short> left, Vector128<short> right, byte shift);
public static Vector128<int> ShiftLeftLogicalAndInsert(Vector128<int> left, Vector128<int> right, byte shift);
public static Vector128<long> ShiftLeftLogicalAndInsert(Vector128<long> left, Vector128<long> right, byte shift);
/// <summary>
/// vsri[q]_n_[su][8,16,32,64]
///
/// A64: SRI
/// A32: VSRI
/// </summary>
public static Vector64<byte> ShiftRightAndInsert(Vector64<byte> left, Vector64<byte> right, byte shift);
public static Vector64<ushort> ShiftRightAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
public static Vector64<uint> ShiftRightAndInsert(Vector64<uint> left, Vector64<uint> right, byte shift);
public static Vector64<sbyte> ShiftRightAndInsert(Vector64<sbyte> left, Vector64<sbyte> right, byte shift);
public static Vector64<short> ShiftRightAndInsert(Vector64<short> left, Vector64<short> right, byte shift);
public static Vector64<int> ShiftRightAndInsert(Vector64<int> left, Vector64<int> right, byte shift);
public static Vector128<byte> ShiftRightAndInsert(Vector128<byte> left, Vector128<byte> right, byte shift);
public static Vector128<ushort> ShiftRightAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
public static Vector128<uint> ShiftRightAndInsert(Vector128<uint> left, Vector128<uint> right, byte shift);
public static Vector128<ulong> ShiftRightAndInsert(Vector128<ulong> left, Vector128<ulong> right, byte shift);
public static Vector128<sbyte> ShiftRightAndInsert(Vector128<sbyte> left, Vector128<sbyte> right, byte shift);
public static Vector128<short> ShiftRightAndInsert(Vector128<short> left, Vector128<short> right, byte shift);
public static Vector128<int> ShiftRightAndInsert(Vector128<int> left, Vector128<int> right, byte shift);
public static Vector128<long> ShiftRightAndInsert(Vector128<long> left, Vector128<long> right, byte shift);
/// <summary>
/// vmovn_[su][16,32,64]
///
/// A64: XTN
/// A32: VMOVN
/// </summary>
public static Vector64<sbyte> ExtractAndNarrowLow(Vector128<short> value);
public static Vector64<short> ExtractAndNarrowLow(Vector128<int> value);
public static Vector64<int> ExtractAndNarrowLow(Vector128<long> value);
public static Vector64<byte> ExtractAndNarrowLow(Vector128<ushort> value);
public static Vector64<ushort> ExtractAndNarrowLow(Vector128<uint> value);
public static Vector64<uint> ExtractAndNarrowLow(Vector128<ulong> value);
/// <summary>
/// vmovn_high_[su][16,32,64]
///
/// A64: XTN2
/// A32: VMOVN
/// </summary>
public static Vector128<sbyte> ExtractAndNarrowHigh(Vector64<sbyte> lower, Vector128<short> value);
public static Vector128<short> ExtractAndNarrowHigh(Vector64<short> lower, Vector128<int> value);
public static Vector128<int> ExtractAndNarrowHigh(Vector64<int> lower, Vector128<long> value);
public static Vector128<byte> ExtractAndNarrowHigh(Vector64<byte> lower, Vector128<ushort> value);
public static Vector128<ushort> ExtractAndNarrowHigh(Vector64<ushort> lower, Vector128<uint> value);
public static Vector128<uint> ExtractAndNarrowHigh(Vector64<uint> lower, Vector128<ulong> value);
public partial class Arm64
{
/// <summary>
/// vuzp1[q]_[suf][8,16,32,64]
///
/// A64: UZP1
/// </summary>
public static Vector64<sbyte> UnzipEven(Vector64<sbyte> lower, Vector64<sbyte> upper);
public static Vector64<short> UnzipEven(Vector64<short> lower, Vector64<short> upper);
public static Vector64<int> UnzipEven(Vector64<int> lower, Vector64<int> upper);
public static Vector64<byte> UnzipEven(Vector64<byte> lower, Vector64<byte> upper);
public static Vector64<ushort> UnzipEven(Vector64<ushort> lower, Vector64<ushort> upper);
public static Vector64<uint> UnzipEven(Vector64<uint> lower, Vector64<uint> upper);
public static Vector64<float> UnzipEven(Vector64<float> lower, Vector64<float> upper);
public static Vector128<sbyte> UnzipEven(Vector128<sbyte> lower, Vector128<sbyte> upper);
public static Vector128<short> UnzipEven(Vector128<short> lower, Vector128<short> upper);
public static Vector128<int> UnzipEven(Vector128<int> lower, Vector128<int> upper);
public static Vector128<long> UnzipEven(Vector128<long> lower, Vector128<long> upper);
public static Vector128<byte> UnzipEven(Vector128<byte> lower, Vector128<byte> upper);
public static Vector128<ushort> UnzipEven(Vector128<ushort> lower, Vector128<ushort> upper);
public static Vector128<uint> UnzipEven(Vector128<uint> lower, Vector128<uint> upper);
public static Vector128<ulong> UnzipEven(Vector128<ulong> lower, Vector128<ulong> upper);
public static Vector128<float> UnzipEven(Vector128<float> lower, Vector128<float> upper);
public static Vector128<double> UnzipEven(Vector128<double> lower, Vector128<double> upper);
/// <summary>
/// vuzp2[q]_[suf][8,16,32,64]
///
/// A64: UZP2
/// </summary>
public static Vector64<sbyte> UnzipOdd(Vector64<sbyte> lower, Vector64<sbyte> upper);
public static Vector64<short> UnzipOdd(Vector64<short> lower, Vector64<short> upper);
public static Vector64<int> UnzipOdd(Vector64<int> lower, Vector64<int> upper);
public static Vector64<byte> UnzipOdd(Vector64<byte> lower, Vector64<byte> upper);
public static Vector64<ushort> UnzipOdd(Vector64<ushort> lower, Vector64<ushort> upper);
public static Vector64<uint> UnzipOdd(Vector64<uint> lower, Vector64<uint> upper);
public static Vector64<float> UnzipOdd(Vector64<float> lower, Vector64<float> upper);
public static Vector128<sbyte> UnzipOdd(Vector128<sbyte> lower, Vector128<sbyte> upper);
public static Vector128<short> UnzipOdd(Vector128<short> lower, Vector128<short> upper);
public static Vector128<int> UnzipOdd(Vector128<int> lower, Vector128<int> upper);
public static Vector128<long> UnzipOdd(Vector128<long> lower, Vector128<long> upper);
public static Vector128<byte> UnzipOdd(Vector128<byte> lower, Vector128<byte> upper);
public static Vector128<ushort> UnzipOdd(Vector128<ushort> lower, Vector128<ushort> upper);
public static Vector128<uint> UnzipOdd(Vector128<uint> lower, Vector128<uint> upper);
public static Vector128<ulong> UnzipOdd(Vector128<ulong> lower, Vector128<ulong> upper);
public static Vector128<float> UnzipOdd(Vector128<float> lower, Vector128<float> upper);
public static Vector128<double> UnzipOdd(Vector128<double> lower, Vector128<double> upper);
/// <summary>
/// vzip1[q]_[suf][8,16,32,64]
///
/// A64: ZIP1
/// </summary>
public static Vector64<sbyte> ZipLow(Vector64<sbyte> left, Vector64<sbyte> right);
public static Vector64<short> ZipLow(Vector64<short> left, Vector64<short> right);
public static Vector64<int> ZipLow(Vector64<int> left, Vector64<int> right);
public static Vector64<byte> ZipLow(Vector64<byte> left, Vector64<byte> right);
public static Vector64<ushort> ZipLow(Vector64<ushort> left, Vector64<ushort> right);
public static Vector64<uint> ZipLow(Vector64<uint> left, Vector64<uint> right);
public static Vector64<float> ZipLow(Vector64<float> left, Vector64<float> right);
public static Vector128<sbyte> ZipLow(Vector128<sbyte> left, Vector128<sbyte> right);
public static Vector128<short> ZipLow(Vector128<short> left, Vector128<short> right);
public static Vector128<int> ZipLow(Vector128<int> left, Vector128<int> right);
public static Vector128<long> ZipLow(Vector128<long> left, Vector128<long> right);
public static Vector128<byte> ZipLow(Vector128<byte> left, Vector128<byte> right);
public static Vector128<ushort> ZipLow(Vector128<ushort> left, Vector128<ushort> right);
public static Vector128<uint> ZipLow(Vector128<uint> left, Vector128<uint> right);
public static Vector128<ulong> ZipLow(Vector128<ulong> left, Vector128<ulong> right);
public static Vector128<float> ZipLow(Vector128<float> left, Vector128<float> right);
public static Vector128<double> ZipLow(Vector128<double> left, Vector128<double> right);
/// <summary>
/// vzip2[q]_[suf][8,16,32,64]
///
/// A64: ZIP2
/// </summary>
public static Vector64<sbyte> ZipHigh(Vector64<sbyte> left, Vector64<sbyte> right);
public static Vector64<short> ZipHigh(Vector64<short> left, Vector64<short> right);
public static Vector64<int> ZipHigh(Vector64<int> left, Vector64<int> right);
public static Vector64<byte> ZipHigh(Vector64<byte> left, Vector64<byte> right);
public static Vector64<ushort> ZipHigh(Vector64<ushort> left, Vector64<ushort> right);
public static Vector64<uint> ZipHigh(Vector64<uint> left, Vector64<uint> right);
public static Vector64<float> ZipHigh(Vector64<float> left, Vector64<float> right);
public static Vector128<sbyte> ZipHigh(Vector128<sbyte> left, Vector128<sbyte> right);
public static Vector128<short> ZipHigh(Vector128<short> left, Vector128<short> right);
public static Vector128<int> ZipHigh(Vector128<int> left, Vector128<int> right);
public static Vector128<long> ZipHigh(Vector128<long> left, Vector128<long> right);
public static Vector128<byte> ZipHigh(Vector128<byte> left, Vector128<byte> right);
public static Vector128<ushort> ZipHigh(Vector128<ushort> left, Vector128<ushort> right);
public static Vector128<uint> ZipHigh(Vector128<uint> left, Vector128<uint> right);
public static Vector128<ulong> ZipHigh(Vector128<ulong> left, Vector128<ulong> right);
public static Vector128<float> ZipHigh(Vector128<float> left, Vector128<float> right);
public static Vector128<double> ZipHigh(Vector128<double> left, Vector128<double> right);
/// <summary>
/// vtrn1[q]_[suf][8,16,32,64]
///
/// A64: TRN1
/// </summary>
public static Vector64<sbyte> TransposeEven(Vector64<sbyte> left, Vector64<sbyte> right);
public static Vector64<short> TransposeEven(Vector64<short> left, Vector64<short> right);
public static Vector64<int> TransposeEven(Vector64<int> left, Vector64<int> right);
public static Vector64<byte> TransposeEven(Vector64<byte> left, Vector64<byte> right);
public static Vector64<ushort> TransposeEven(Vector64<ushort> left, Vector64<ushort> right);
public static Vector64<uint> TransposeEven(Vector64<uint> left, Vector64<uint> right);
public static Vector64<float> TransposeEven(Vector64<float> left, Vector64<float> right);
public static Vector128<sbyte> TransposeEven(Vector128<sbyte> left, Vector128<sbyte> right);
public static Vector128<short> TransposeEven(Vector128<short> left, Vector128<short> right);
public static Vector128<int> TransposeEven(Vector128<int> left, Vector128<int> right);
public static Vector128<long> TransposeEven(Vector128<long> left, Vector128<long> right);
public static Vector128<byte> TransposeEven(Vector128<byte> left, Vector128<byte> right);
public static Vector128<ushort> TransposeEven(Vector128<ushort> left, Vector128<ushort> right);
public static Vector128<uint> TransposeEven(Vector128<uint> left, Vector128<uint> right);
public static Vector128<ulong> TransposeEven(Vector128<ulong> left, Vector128<ulong> right);
public static Vector128<float> TransposeEven(Vector128<float> left, Vector128<float> right);
public static Vector128<double> TransposeEven(Vector128<double> left, Vector128<double> right);
/// <summary>
/// vtrn2[q]_[suf][8,16,32,64]
///
/// A64: TRN2
/// </summary>
public static Vector64<sbyte> TransposeOdd(Vector64<sbyte> left, Vector64<sbyte> right);
public static Vector64<short> TransposeOdd(Vector64<short> left, Vector64<short> right);
public static Vector64<int> TransposeOdd(Vector64<int> left, Vector64<int> right);
public static Vector64<byte> TransposeOdd(Vector64<byte> left, Vector64<byte> right);
public static Vector64<ushort> TransposeOdd(Vector64<ushort> left, Vector64<ushort> right);
public static Vector64<uint> TransposeOdd(Vector64<uint> left, Vector64<uint> right);
public static Vector64<float> TransposeOdd(Vector64<float> left, Vector64<float> right);
public static Vector128<sbyte> TransposeOdd(Vector128<sbyte> left, Vector128<sbyte> right);
public static Vector128<short> TransposeOdd(Vector128<short> left, Vector128<short> right);
public static Vector128<int> TransposeOdd(Vector128<int> left, Vector128<int> right);
public static Vector128<long> TransposeOdd(Vector128<long> left, Vector128<long> right);
public static Vector128<byte> TransposeOdd(Vector128<byte> left, Vector128<byte> right);
public static Vector128<ushort> TransposeOdd(Vector128<ushort> left, Vector128<ushort> right);
public static Vector128<uint> TransposeOdd(Vector128<uint> left, Vector128<uint> right);
public static Vector128<ulong> TransposeOdd(Vector128<ulong> left, Vector128<ulong> right);
public static Vector128<float> TransposeOdd(Vector128<float> left, Vector128<float> right);
public static Vector128<double> TransposeOdd(Vector128<double> left, Vector128<double> right);
}
}
}
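As a reading aid, the narrowing behaviour of the ExtractAndNarrowLow/High APIs above (XTN/XTN2) can be modeled per lane in plain C. This is a sketch of my understanding of the semantics, not the proposed implementation, and the helper names are made up:

```c
#include <stdint.h>

/* XTN (ExtractAndNarrowLow sketch): truncate each 16-bit lane to its
 * low 8 bits, producing a 64-bit result vector. */
static void xtn_16to8(uint8_t out[8], const uint16_t in[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)in[i];          /* keep the low byte only */
}

/* XTN2 (ExtractAndNarrowHigh sketch): the same truncation, written to
 * the upper half of a 128-bit result while `lower` fills the low half. */
static void xtn2_16to8(uint8_t out[16], const uint8_t lower[8],
                       const uint16_t in[8])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = lower[i];            /* low half passed through */
        out[8 + i] = (uint8_t)in[i];      /* narrowed lanes on top */
    }
}
```

This mirrors why the full 128-bit narrow needs the XTN/XTN2 pairing on A64: each instruction only produces 64 bits of narrowed data.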
@TamarChristinaArm, looking at the encoding + decoding for Specifically, the
@tannergooding Yeah I believe that's correct, I had missed the
@tannergooding @echesakovMSFT I'm wondering about the
Non-constant inputs are handled by dropping back to a call which contains a jump table handling all possible cases (which can be between 1 and 256); it's part of the reason the APIs are recursive. Intrinsics which take an 8-bit immediate are marked as
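To illustrate the shape of that fallback, here is a minimal C analogue (a sketch only; the helper is hypothetical, not the JIT's actual code). The jump table conceptually maps the non-constant immediate back onto one of the constant-encoded instruction forms:

```c
#include <stdint.h>

/* Hypothetical stand-in for an instruction whose shift amount must be
 * encoded as an immediate: each case corresponds to one constant encoding. */
static uint32_t shift_left_fallback(uint32_t value, uint8_t imm)
{
    switch (imm & 31) {            /* clamp to the legal immediate range */
    case 0:  return value;         /* inst Vd, Vn, #0 */
    case 1:  return value << 1;    /* inst Vd, Vn, #1 */
    case 2:  return value << 2;    /* inst Vd, Vn, #2 */
    case 3:  return value << 3;    /* inst Vd, Vn, #3 */
    /* ... one case per legal immediate value ... */
    default: return value << (imm & 31);
    }
}
```

In the real JIT the "cases" are emitted instructions selected by an indirect branch, not C code, but the dispatch structure is the same.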
The important methods for importation are then:
The importation logic currently assumes the "immediate" is the last operand in the list: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.cpp#L648. It then uses the above methods to determine how to handle things. This includes things like directly expanding the relevant intrinsic, generating an equivalent intrinsic fallback, or falling back to a method call. If it falls back to the method call it will see the method is recursively calling itself and force expansion, and the node will carry a non-constant input through to codegen. It will also insert the relevant range check if one is needed. An example of how this is handled in codegen for x86 is here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiccodegenxarch.cpp#L227-L254.
This ultimately allows things like reflection or debugging to "just work", and the expectation is users won't use it in actual perf-critical paths. Users are expected to be profiling this code, so they should catch the issue relatively quickly if one does exist. There are a few cases that this won't catch (inputs that the JIT will eventually determine to be constant but which aren't "constant" during importation) and we have an issue tracking improving that: #9989 and #11062. Ideally we would delay the decision to be a "fallback method call" until later in the pipeline (such as We already have some support for doing that with existing
Awesome, thanks, I'll get on those then :)
@TamarChristinaArm I am currently working on supporting intrinsic immediate operands on arm64 - I needed this for Extract, Insert and ExtractVector64/128 intrinsics - will have a PR soon
@echesakovMSFT Ah great, I'll do the single-register TBL in the meantime then.
The following APIs are yet to be implemented:

namespace System.Runtime.Intrinsics.Arm
{
public partial class ArmBase
{
/// <summary>
/// vslid_n_[su]64
///
/// A64: SLI
/// A32: VSLI
/// </summary>
public static Vector64<long> ShiftLeftLogicalAndInsertScalar(Vector64<long> left, Vector64<long> right, byte shift);
public static Vector64<ulong> ShiftLeftLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);
/// <summary>
/// vsrid_n_[su]64
///
/// A64: SRI
/// A32: VSRI
/// </summary>
public static Vector64<long> ShiftRightLogicalAndInsertScalar(Vector64<long> left, Vector64<long> right, byte shift);
public static Vector64<ulong> ShiftRightLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);
}
public partial class AdvSimd
{
/// <summary>
/// vsli[q]_n_[su][8,16,32,64]
///
/// A64: SLI
/// A32: VSLI
/// </summary>
public static Vector64<byte> ShiftLeftLogicalAndInsert(Vector64<byte> left, Vector64<byte> right, byte shift);
public static Vector64<ushort> ShiftLeftLogicalAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
public static Vector64<uint> ShiftLeftLogicalAndInsert(Vector64<uint> left, Vector64<uint> right, byte shift);
public static Vector64<sbyte> ShiftLeftLogicalAndInsert(Vector64<sbyte> left, Vector64<sbyte> right, byte shift);
public static Vector64<short> ShiftLeftLogicalAndInsert(Vector64<short> left, Vector64<short> right, byte shift);
public static Vector64<int> ShiftLeftLogicalAndInsert(Vector64<int> left, Vector64<int> right, byte shift);
public static Vector128<byte> ShiftLeftLogicalAndInsert(Vector128<byte> left, Vector128<byte> right, byte shift);
public static Vector128<ushort> ShiftLeftLogicalAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
public static Vector128<uint> ShiftLeftLogicalAndInsert(Vector128<uint> left, Vector128<uint> right, byte shift);
public static Vector128<ulong> ShiftLeftLogicalAndInsert(Vector128<ulong> left, Vector128<ulong> right, byte shift);
public static Vector128<sbyte> ShiftLeftLogicalAndInsert(Vector128<sbyte> left, Vector128<sbyte> right, byte shift);
public static Vector128<short> ShiftLeftLogicalAndInsert(Vector128<short> left, Vector128<short> right, byte shift);
public static Vector128<int> ShiftLeftLogicalAndInsert(Vector128<int> left, Vector128<int> right, byte shift);
public static Vector128<long> ShiftLeftLogicalAndInsert(Vector128<long> left, Vector128<long> right, byte shift);
/// <summary>
/// vsri[q]_n_[su][8,16,32,64]
///
/// A64: SRI
/// A32: VSRI
/// </summary>
public static Vector64<byte> ShiftRightAndInsert(Vector64<byte> left, Vector64<byte> right, byte shift);
public static Vector64<ushort> ShiftRightAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
public static Vector64<uint> ShiftRightAndInsert(Vector64<uint> left, Vector64<uint> right, byte shift);
public static Vector64<sbyte> ShiftRightAndInsert(Vector64<sbyte> left, Vector64<sbyte> right, byte shift);
public static Vector64<short> ShiftRightAndInsert(Vector64<short> left, Vector64<short> right, byte shift);
public static Vector64<int> ShiftRightAndInsert(Vector64<int> left, Vector64<int> right, byte shift);
public static Vector128<byte> ShiftRightAndInsert(Vector128<byte> left, Vector128<byte> right, byte shift);
public static Vector128<ushort> ShiftRightAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
public static Vector128<uint> ShiftRightAndInsert(Vector128<uint> left, Vector128<uint> right, byte shift);
public static Vector128<ulong> ShiftRightAndInsert(Vector128<ulong> left, Vector128<ulong> right, byte shift);
public static Vector128<sbyte> ShiftRightAndInsert(Vector128<sbyte> left, Vector128<sbyte> right, byte shift);
public static Vector128<short> ShiftRightAndInsert(Vector128<short> left, Vector128<short> right, byte shift);
public static Vector128<int> ShiftRightAndInsert(Vector128<int> left, Vector128<int> right, byte shift);
public static Vector128<long> ShiftRightAndInsert(Vector128<long> left, Vector128<long> right, byte shift);
}
}
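To make the SLI/SRI semantics concrete: each instruction shifts one operand and inserts it into the other, preserving the destination bits not covered by the shift. A scalar C sketch of my reading (assuming `left` plays the role of the destination register whose bits are preserved; restricted to shifts 0..63 to stay within defined C behaviour):

```c
#include <stdint.h>

/* SLI sketch: `right` shifted left by `shift`; the low `shift` bits of
 * `left` survive in the result. */
static uint64_t sli64(uint64_t left, uint64_t right, unsigned shift)
{
    uint64_t keep = (shift == 0) ? 0 : ((UINT64_C(1) << shift) - 1);
    return (right << shift) | (left & keep);
}

/* SRI sketch: `right` shifted right by `shift`; the high `shift` bits
 * of `left` survive in the result. */
static uint64_t sri64(uint64_t left, uint64_t right, unsigned shift)
{
    uint64_t keep = (shift == 0) ? 0 : ~(UINT64_MAX >> shift);
    return (right >> shift) | (left & keep);
}
```

This "partial merge" behaviour is what distinguishes these from the plain shift intrinsics and is why both vector operands are needed.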
@echesakovMSFT I see
@TamarChristinaArm Yes, I think so.
@tannergooding @echesakovMSFT I believe we need to move the entries in
Yes, I believe so as
@echesakovMSFT the comment in
@TamarChristinaArm If you pass a non-const immediate operand and the instruction does not have a non-const form (i.e. does not accept a register operand instead of the immediate operand; I can't think of an instruction on Arm64 that has such a fallback, but I believe this was the case on x64 for many instructions - look for intrinsics marked as HW_Flag_MaybeIMM in hwintrinsiclistxarch.h), the JIT still needs to be able to compile the intrinsic, so it generates a "switch" table that conceptually does this:

switch (nonConstImm)
{
case 0:
inst Vd, Vn, #0;
break;
case 1:
inst Vd, Vn, #1;
break;
case 2:
inst Vd, Vn, #2;
break;
case 3:
inst Vd, Vn, #3;
break;
}

For example, for ExtractVector128(Vector128<byte>) the code will look as follows:

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Byte],System.Runtime.Intrinsics.Vector128`1[Byte],ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 3, 3 ) simd16 -> [fp+0x20] HFA(simd16) do-not-enreg[XS] addr-exposed
; V01 arg1 [V01 ] ( 3, 3 ) simd16 -> [fp+0x10] HFA(simd16) do-not-enreg[XS] addr-exposed
; V02 arg2 [V02,T00] ( 3, 3 ) ubyte -> x0
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V04 cse0 [V04,T01] ( 3, 3 ) int -> x0 "CSE - aggressive"
;
; Lcl frame size = 32
G_M54180_IG01:
A9BD7BFD stp fp, lr, [sp,#-48]!
910003FD mov fp, sp
3D800BA0 str q0, [fp,#32]
3D8007A1 str q1, [fp,#16]
;; bbWeight=1 PerfScore 3.50
G_M54180_IG02:
3DC00BB0 ldr q16, [fp,#32]
3DC007B1 ldr q17, [fp,#16]
53001C00 uxtb w0, w0
7100401F cmp w0, #16
540004C2 bhs G_M54180_IG21
10000061 adr x1, [G_M54180_IG03]
8B000C21 add x1, x1, x0, LSL #3
D61F0020 br x1
;; bbWeight=1 PerfScore 8.50
G_M54180_IG03:
6E110210 ext v16.16b, v16.16b, v17.16b, #0
1400001E b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG04:
6E110A10 ext v16.16b, v16.16b, v17.16b, #1
1400001C b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG05:
6E111210 ext v16.16b, v16.16b, v17.16b, #2
1400001A b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG06:
6E111A10 ext v16.16b, v16.16b, v17.16b, #3
14000018 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG07:
6E112210 ext v16.16b, v16.16b, v17.16b, #4
14000016 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG08:
6E112A10 ext v16.16b, v16.16b, v17.16b, #5
14000014 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG09:
6E113210 ext v16.16b, v16.16b, v17.16b, #6
14000012 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG10:
6E113A10 ext v16.16b, v16.16b, v17.16b, #7
14000010 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG11:
6E114210 ext v16.16b, v16.16b, v17.16b, #8
1400000E b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG12:
6E114A10 ext v16.16b, v16.16b, v17.16b, #9
1400000C b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG13:
6E115210 ext v16.16b, v16.16b, v17.16b, #10
1400000A b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG14:
6E115A10 ext v16.16b, v16.16b, v17.16b, #11
14000008 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG15:
6E116210 ext v16.16b, v16.16b, v17.16b, #12
14000006 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG16:
6E116A10 ext v16.16b, v16.16b, v17.16b, #13
14000004 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG17:
6E117210 ext v16.16b, v16.16b, v17.16b, #14
14000002 b G_M54180_IG19
;; bbWeight=1 PerfScore 2.00
G_M54180_IG18:
6E117A10 ext v16.16b, v16.16b, v17.16b, #15
;; bbWeight=1 PerfScore 1.00
G_M54180_IG19:
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 0.50
G_M54180_IG20:
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
G_M54180_IG21:
97FEEB1E bl CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
D43E0000 bkpt
;; bbWeight=0 PerfScore 0.00
; Total bytes of code 192, prolog size 8, PerfScore 64.70, (MethodHash=08b52c5b) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Byte],System.Runtime.Intrinsics.Vector128`1[Byte],ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; ============================================================

This requires allocating x1 as a branch target. However, for cases where the immediate is 0 or 1 (e.g. ExtractVector128(Vector128<double>)), we can branch with cbnz and don't need to allocate the additional register, as in the following example:

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector128`1[Double]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 3, 3 ) simd16 -> [fp+0x20] HFA(simd16) do-not-enreg[XS] addr-exposed
; V01 arg1 [V01 ] ( 3, 3 ) simd16 -> [fp+0x10] HFA(simd16) do-not-enreg[XS] addr-exposed
; V02 arg2 [V02,T00] ( 3, 3 ) ubyte -> x0
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V04 cse0 [V04,T01] ( 3, 3 ) int -> x0 "CSE - aggressive"
;
; Lcl frame size = 32
G_M55355_IG01:
A9BD7BFD stp fp, lr, [sp,#-48]!
910003FD mov fp, sp
3D800BA0 str q0, [fp,#32]
3D8007A1 str q1, [fp,#16]
;; bbWeight=1 PerfScore 3.50
G_M55355_IG02:
3DC00BB0 ldr q16, [fp,#32]
3DC007B1 ldr q17, [fp,#16]
53001C00 uxtb w0, w0
7100081F cmp w0, #2
54000102 bhs G_M55355_IG07
35000060 cbnz w0, G_M55355_IG04
;; bbWeight=1 PerfScore 7.00
G_M55355_IG03:
6E110210 ext v16.16b, v16.16b, v17.16b, #0
14000002 b G_M55355_IG05
;; bbWeight=1 PerfScore 2.00
G_M55355_IG04:
6E114210 ext v16.16b, v16.16b, v17.16b, #8
;; bbWeight=1 PerfScore 1.00
G_M55355_IG05:
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 0.50
G_M55355_IG06:
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
G_M55355_IG07:
97FEEB06 bl CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
D43E0000 bkpt
;; bbWeight=0 PerfScore 0.00
; Total bytes of code 72, prolog size 8, PerfScore 23.20, (MethodHash=721027c4) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector128`1[Double]
; ============================================================

In your case (i.e. sri and sli), you will need to mark the intrinsic with HW_Category_IMM, compute an upper bound for the immediate operand in HWIntrinsicInfo::lookupImmUpperBound(), and add two cases in the switch in LinearScan::BuildHWIntrinsic that will set
@TamarChristinaArm I hope my explanation helped. If not, I would value any feedback on how I can rephrase this comment to make it clearer. Also feel free to ping me offline if you want to chat more about this.
@echesakovMSFT It did! I hadn't realized the JIT was emitting a runtime dispatch for this case, which makes sense; that's where my initial confusion came from :) Thanks for the explanation!
@TamarChristinaArm since you are working on this - I have assigned the issue to you.
The A32 variants of these are blocked pending resolution of the <lanes>x<copies> implementation in #24790 (e.g. int32x2x2).

The permute instructions such as ZIP1 and ZIP2 present an interesting challenge. Since intrinsics in CoreCLR/CoreFX are supposed to map down to a single hardware instruction, this makes it a bit awkward: on A32, ZIP, TRN and UZP are destructive operations which perform both the Odd and Even shuffles at the same time. So while you could implement the intrinsics for A32 by copying the vector and ignoring one of the outputs, I believe that goes counter to the philosophy here (unless I'm mistaken). It also means that if they were to be implemented on A32, for efficiency a ZIP1, ZIP2 combo should be combined into ZIP and the moves not generated.

This also means that the Arm ZIP, TRN, UZP intrinsics can't be implemented on A64 as a single intrinsic; rather, the user needs to make two calls. This is the reason that in this proposal the intrinsics are A64 only, but it makes intrinsics code between A32 and A64 a bit less portable in this case.

Also, to make things easier to read, I combined the documentation headers for the proposal. They will of course be separated out in the actual implementation.
cc @tannergooding @CarolEidt @echesakovMSFT