[API Proposal]: Arm64 [Load/Store]Vector64 and [Load/Store]Vector128 for 2,3 and 4 variants #84510

Closed
Tracked by #94464
kunalspathak opened this issue Apr 8, 2023 · 18 comments
Labels: api-approved (API was approved in API review, it can be implemented), area-System.Runtime.Intrinsics

kunalspathak (Member) commented Apr 8, 2023

Background and motivation

These APIs provide a way to load Vector64 and Vector128 values from an address. The x2, x3 and x4 variants provide a way to load 2, 3 and 4 vectors at once.
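
To make the difference between the instruction families in this proposal concrete, here is a small scalar model of two of them. As I understand the Arm documentation, the two-register form of LD1 fills its registers from consecutive memory, while LD2 treats memory as a sequence of two-element structures and de-interleaves them. This is only an illustrative sketch, not the proposed intrinsics: the names `StructuredLoadModel`, `LoadConsecutiveX2` and `LoadDeinterleavedX2` are hypothetical, and arrays stand in for vector registers.

```csharp
using System;

// Scalar reference model only; this is NOT the proposed intrinsic API.
// All type and method names in this block are hypothetical.
public static class StructuredLoadModel
{
    // Models the two-register form of LD1 (multiple structures):
    // the first `lanes` elements fill register 1, the next `lanes` fill register 2.
    public static (byte[] Value1, byte[] Value2) LoadConsecutiveX2(ReadOnlySpan<byte> memory, int lanes)
    {
        byte[] value1 = memory.Slice(0, lanes).ToArray();
        byte[] value2 = memory.Slice(lanes, lanes).ToArray();
        return (value1, value2);
    }

    // Models LD2 (multiple structures): memory holds `lanes` two-element structures.
    // Element 0 of each structure goes to register 1, element 1 to register 2.
    public static (byte[] Value1, byte[] Value2) LoadDeinterleavedX2(ReadOnlySpan<byte> memory, int lanes)
    {
        byte[] value1 = new byte[lanes];
        byte[] value2 = new byte[lanes];
        for (int i = 0; i < lanes; i++)
        {
            value1[i] = memory[2 * i];
            value2[i] = memory[2 * i + 1];
        }
        return (value1, value2);
    }
}
```

With 16 bytes holding 0 through 15 and `lanes` set to 8 (the lane count of a `Vector64<byte>`), the consecutive model yields (0..7, 8..15), whereas the de-interleaving model yields the even bytes in the first array and the odd bytes in the second. The x3 and x4 forms extend the same pattern to three and four registers.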

API Proposal

namespace System.Runtime.Intrinsics.Arm;

public abstract partial class AdvSimd
{
    // LD1 (multiple structures)
    // LoadVector64 already present

    // LD1 (multiple structures) 2 register variant
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2) LoadVector64x2AndUnzip(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) LoadVector64x2AndUnzip(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2) LoadVector64x2AndUnzip(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2) LoadVector64x2AndUnzip(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2) LoadVector64x2AndUnzip(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2) LoadVector64x2AndUnzip(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2) LoadVector64x2AndUnzip(float*  address);

    // LD1 (multiple structures) 3 register variant
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) LoadVector64x3AndUnzip(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) LoadVector64x3AndUnzip(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) LoadVector64x3AndUnzip(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) LoadVector64x3AndUnzip(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) LoadVector64x3AndUnzip(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) LoadVector64x3AndUnzip(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) LoadVector64x3AndUnzip(float*  address);
    
    // LD1 (multiple structures) 4 register variant            
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) LoadVector64x4AndUnzip(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) LoadVector64x4AndUnzip(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) LoadVector64x4AndUnzip(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) LoadVector64x4AndUnzip(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) LoadVector64x4AndUnzip(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) LoadVector64x4AndUnzip(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) LoadVector64x4AndUnzip(float*  address);
    
    // LD1 (single structure)
    // LoadAndInsertScalar already present

    // LD1R
    // LoadAndReplicateToVector64 already present
    
    // LD2 (multiple structures)
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2) LoadVector64x2(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) LoadVector64x2(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2) LoadVector64x2(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2) LoadVector64x2(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2) LoadVector64x2(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2) LoadVector64x2(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2) LoadVector64x2(float*  address);

    // LD2 (single structure)
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2) LoadVectorAndInsertScalar64x2((Vector64<byte>   Value1, Vector64<byte>   Value2) value, byte index, byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) LoadVectorAndInsertScalar64x2((Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) value, byte index, sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2) LoadVectorAndInsertScalar64x2((Vector64<short>  Value1, Vector64<short>  Value2) value, byte index, short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2) LoadVectorAndInsertScalar64x2((Vector64<ushort> Value1, Vector64<ushort> Value2) value, byte index, ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2) LoadVectorAndInsertScalar64x2((Vector64<int>    Value1, Vector64<int>    Value2) value, byte index, int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2) LoadVectorAndInsertScalar64x2((Vector64<uint>   Value1, Vector64<uint>   Value2) value, byte index, uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2) LoadVectorAndInsertScalar64x2((Vector64<float>  Value1, Vector64<float>  Value2) value, byte index, float*  address);

    // LD2R
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2) LoadAndReplicateToVector64x2(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) LoadAndReplicateToVector64x2(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2) LoadAndReplicateToVector64x2(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2) LoadAndReplicateToVector64x2(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2) LoadAndReplicateToVector64x2(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2) LoadAndReplicateToVector64x2(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2) LoadAndReplicateToVector64x2(float*  address);

    // LD3 (multiple structures)
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) LoadVector64x3(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) LoadVector64x3(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) LoadVector64x3(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) LoadVector64x3(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) LoadVector64x3(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) LoadVector64x3(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) LoadVector64x3(float*  address);

    // LD3 (single structure)
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) LoadVectorAndInsertScalar64x3((Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) value, byte index, byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) LoadVectorAndInsertScalar64x3((Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) value, byte index, sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) LoadVectorAndInsertScalar64x3((Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) value, byte index, short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) LoadVectorAndInsertScalar64x3((Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) value, byte index, ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) LoadVectorAndInsertScalar64x3((Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) value, byte index, int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) LoadVectorAndInsertScalar64x3((Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) value, byte index, uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) LoadVectorAndInsertScalar64x3((Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) value, byte index, float*  address);

    // LD3R
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) LoadAndReplicateToVector64x3(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) LoadAndReplicateToVector64x3(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) LoadAndReplicateToVector64x3(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) LoadAndReplicateToVector64x3(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) LoadAndReplicateToVector64x3(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) LoadAndReplicateToVector64x3(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) LoadAndReplicateToVector64x3(float*  address);

    // LD4 (multiple structures)
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) LoadVector64x4(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) LoadVector64x4(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) LoadVector64x4(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) LoadVector64x4(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) LoadVector64x4(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) LoadVector64x4(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) LoadVector64x4(float*  address);

    // LD4 (single structure)
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) LoadVectorAndInsertScalar64x4((Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) value, byte index, byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) LoadVectorAndInsertScalar64x4((Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) value, byte index, sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) LoadVectorAndInsertScalar64x4((Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) value, byte index, short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) LoadVectorAndInsertScalar64x4((Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) value, byte index, ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) LoadVectorAndInsertScalar64x4((Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) value, byte index, int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) LoadVectorAndInsertScalar64x4((Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) value, byte index, uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) LoadVectorAndInsertScalar64x4((Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) value, byte index, float*  address);

    // LD4R
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) LoadAndReplicateToVector64x4(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) LoadAndReplicateToVector64x4(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) LoadAndReplicateToVector64x4(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) LoadAndReplicateToVector64x4(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) LoadAndReplicateToVector64x4(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) LoadAndReplicateToVector64x4(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) LoadAndReplicateToVector64x4(float*  address);

    // ST1 (multiple structures)
    // StoreVector already present

    // ST1 (multiple structures) 2 register variant
    public static unsafe void StoreVector64x2AndUnzip(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2) value);
    public static unsafe void StoreVector64x2AndUnzip(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) value);
    public static unsafe void StoreVector64x2AndUnzip(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2) value);
    public static unsafe void StoreVector64x2AndUnzip(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2) value);
    public static unsafe void StoreVector64x2AndUnzip(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2) value);
    public static unsafe void StoreVector64x2AndUnzip(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2) value);
    public static unsafe void StoreVector64x2AndUnzip(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2) value);

    // ST1 (multiple structures) 3 register variant
    public static unsafe void StoreVector64x3AndUnzip(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) value);
    public static unsafe void StoreVector64x3AndUnzip(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) value);
    public static unsafe void StoreVector64x3AndUnzip(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) value);
    public static unsafe void StoreVector64x3AndUnzip(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) value);
    public static unsafe void StoreVector64x3AndUnzip(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) value);
    public static unsafe void StoreVector64x3AndUnzip(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) value);
    public static unsafe void StoreVector64x3AndUnzip(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) value);
    
    // ST1 (multiple structures) 4 register variant            
    public static unsafe void StoreVector64x4AndUnzip(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) value);
    public static unsafe void StoreVector64x4AndUnzip(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) value);
    public static unsafe void StoreVector64x4AndUnzip(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) value);
    public static unsafe void StoreVector64x4AndUnzip(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) value);
    public static unsafe void StoreVector64x4AndUnzip(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) value);
    public static unsafe void StoreVector64x4AndUnzip(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) value);
    public static unsafe void StoreVector64x4AndUnzip(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) value);

    // ST1 (single structure)
    // StoreSelectedScalar already present
    
    // ST2 (multiple structures)
    public static unsafe void StoreVector64x2(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2) value);
    public static unsafe void StoreVector64x2(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) value);
    public static unsafe void StoreVector64x2(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2) value);
    public static unsafe void StoreVector64x2(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2) value);
    public static unsafe void StoreVector64x2(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2) value);
    public static unsafe void StoreVector64x2(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2) value);
    public static unsafe void StoreVector64x2(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2) value);

    // ST2 (single structure)
    public static unsafe void StoreSelectedScalar64x2(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2) value, byte index);

    // ST3 (multiple structures)
    public static unsafe void StoreVector64x3(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) value);
    public static unsafe void StoreVector64x3(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) value);
    public static unsafe void StoreVector64x3(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) value);
    public static unsafe void StoreVector64x3(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) value);
    public static unsafe void StoreVector64x3(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) value);
    public static unsafe void StoreVector64x3(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) value);
    public static unsafe void StoreVector64x3(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) value);

    // ST3 (single structure)
    public static unsafe void StoreSelectedScalar64x3(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) value, byte index);

    // ST4 (multiple structures)
    public static unsafe void StoreVector64x4(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) value);
    public static unsafe void StoreVector64x4(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) value);
    public static unsafe void StoreVector64x4(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) value);
    public static unsafe void StoreVector64x4(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) value);
    public static unsafe void StoreVector64x4(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) value);
    public static unsafe void StoreVector64x4(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) value);
    public static unsafe void StoreVector64x4(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) value);

    // ST4 (single structure)
    public static unsafe void StoreSelectedScalar64x4(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) value, byte index);

    public partial class Arm64
    {
        // LD1 (multiple structures)
        // LoadVector128 already present

        // LD1 (multiple structures) 2 register variant
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2) LoadVector128x2AndUnzip(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) LoadVector128x2AndUnzip(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2) LoadVector128x2AndUnzip(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadVector128x2AndUnzip(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2) LoadVector128x2AndUnzip(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2) LoadVector128x2AndUnzip(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2) LoadVector128x2AndUnzip(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2) LoadVector128x2AndUnzip(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2) LoadVector128x2AndUnzip(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2) LoadVector128x2AndUnzip(double* address);

        // LD1 (multiple structures) 3 register variant
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) LoadVector128x3AndUnzip(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) LoadVector128x3AndUnzip(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) LoadVector128x3AndUnzip(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) LoadVector128x3AndUnzip(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) LoadVector128x3AndUnzip(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) LoadVector128x3AndUnzip(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) LoadVector128x3AndUnzip(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) LoadVector128x3AndUnzip(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) LoadVector128x3AndUnzip(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) LoadVector128x3AndUnzip(double* address);
        
        // LD1 (multiple structures) 4 register variant            
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) LoadVector128x4AndUnzip(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) LoadVector128x4AndUnzip(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) LoadVector128x4AndUnzip(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) LoadVector128x4AndUnzip(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) LoadVector128x4AndUnzip(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) LoadVector128x4AndUnzip(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) LoadVector128x4AndUnzip(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) LoadVector128x4AndUnzip(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) LoadVector128x4AndUnzip(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) LoadVector128x4AndUnzip(double* address);

        // LD1 (single structure)
        // LoadAndInsertScalar already present

        // LD1R
        // LoadAndReplicateToVector128 already present
        
        // LD2 (multiple structures)
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2) LoadVector128x2(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) LoadVector128x2(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2) LoadVector128x2(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadVector128x2(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2) LoadVector128x2(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2) LoadVector128x2(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2) LoadVector128x2(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2) LoadVector128x2(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2) LoadVector128x2(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2) LoadVector128x2(double* address);

        // LD2 (single structure)
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2) LoadVectorAndInsertScalar128x2((Vector128<byte>   Value1, Vector128<byte>   Value2) value, byte index, byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) LoadVectorAndInsertScalar128x2((Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) value, byte index, sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2) LoadVectorAndInsertScalar128x2((Vector128<short>  Value1, Vector128<short>  Value2) value, byte index, short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadVectorAndInsertScalar128x2((Vector128<ushort> Value1, Vector128<ushort> Value2) value, byte index, ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2) LoadVectorAndInsertScalar128x2((Vector128<int>    Value1, Vector128<int>    Value2) value, byte index, int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2) LoadVectorAndInsertScalar128x2((Vector128<uint>   Value1, Vector128<uint>   Value2) value, byte index, uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2) LoadVectorAndInsertScalar128x2((Vector128<long>   Value1, Vector128<long>   Value2) value, byte index, long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2) LoadVectorAndInsertScalar128x2((Vector128<ulong>  Value1, Vector128<ulong>  Value2) value, byte index, ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2) LoadVectorAndInsertScalar128x2((Vector128<float>  Value1, Vector128<float>  Value2) value, byte index, float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2) LoadVectorAndInsertScalar128x2((Vector128<double> Value1, Vector128<double> Value2) value, byte index, double* address);

        // LD2R
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2) LoadAndReplicateToVector128x2(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) LoadAndReplicateToVector128x2(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2) LoadAndReplicateToVector128x2(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadAndReplicateToVector128x2(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2) LoadAndReplicateToVector128x2(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2) LoadAndReplicateToVector128x2(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2) LoadAndReplicateToVector128x2(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2) LoadAndReplicateToVector128x2(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2) LoadAndReplicateToVector128x2(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2) LoadAndReplicateToVector128x2(double* address);

        // LD3 (multiple structures)
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) LoadVector128x3(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) LoadVector128x3(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) LoadVector128x3(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) LoadVector128x3(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) LoadVector128x3(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) LoadVector128x3(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) LoadVector128x3(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) LoadVector128x3(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) LoadVector128x3(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) LoadVector128x3(double* address);

        // LD3 (single structure)
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) LoadVectorAndInsertScalar128x3((Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) value, byte index, byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) LoadVectorAndInsertScalar128x3((Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) value, byte index, sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) LoadVectorAndInsertScalar128x3((Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) value, byte index, short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) LoadVectorAndInsertScalar128x3((Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) value, byte index, ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) LoadVectorAndInsertScalar128x3((Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) value, byte index, int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) LoadVectorAndInsertScalar128x3((Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) value, byte index, uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) LoadVectorAndInsertScalar128x3((Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) value, byte index, long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) LoadVectorAndInsertScalar128x3((Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) value, byte index, ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) LoadVectorAndInsertScalar128x3((Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) value, byte index, float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) LoadVectorAndInsertScalar128x3((Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) value, byte index, double* address);

        // LD3R
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) LoadAndReplicateToVector128x3(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) LoadAndReplicateToVector128x3(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) LoadAndReplicateToVector128x3(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) LoadAndReplicateToVector128x3(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) LoadAndReplicateToVector128x3(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) LoadAndReplicateToVector128x3(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) LoadAndReplicateToVector128x3(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) LoadAndReplicateToVector128x3(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) LoadAndReplicateToVector128x3(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) LoadAndReplicateToVector128x3(double* address);

        // LD4 (multiple structures)
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) LoadVector128x4(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) LoadVector128x4(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) LoadVector128x4(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) LoadVector128x4(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) LoadVector128x4(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) LoadVector128x4(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) LoadVector128x4(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) LoadVector128x4(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) LoadVector128x4(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) LoadVector128x4(double* address);

        // LD4 (single structure)
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) LoadVectorAndInsertScalar128x4((Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) value, byte index, byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) LoadVectorAndInsertScalar128x4((Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) value, byte index, sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) LoadVectorAndInsertScalar128x4((Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) value, byte index, short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) LoadVectorAndInsertScalar128x4((Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) value, byte index, ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) LoadVectorAndInsertScalar128x4((Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) value, byte index, int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) LoadVectorAndInsertScalar128x4((Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) value, byte index, uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) LoadVectorAndInsertScalar128x4((Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) value, byte index, long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) LoadVectorAndInsertScalar128x4((Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) value, byte index, ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) LoadVectorAndInsertScalar128x4((Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) value, byte index, float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) LoadVectorAndInsertScalar128x4((Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) value, byte index, double* address);

        // LD4R
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) LoadAndReplicateToVector128x4(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) LoadAndReplicateToVector128x4(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) LoadAndReplicateToVector128x4(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) LoadAndReplicateToVector128x4(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) LoadAndReplicateToVector128x4(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) LoadAndReplicateToVector128x4(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) LoadAndReplicateToVector128x4(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) LoadAndReplicateToVector128x4(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) LoadAndReplicateToVector128x4(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) LoadAndReplicateToVector128x4(double* address);

        // ST1 (multiple structures)
        // StoreVector already present

        // ST1 (multiple structures) 2 register variant
        public static unsafe void StoreVector128x2AndUnzip(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2) value);
        public static unsafe void StoreVector128x2AndUnzip(double* address, (Vector128<double> Value1, Vector128<double> Value2) value);

        // ST1 (multiple structures) 3 register variant
        public static unsafe void StoreVector128x3AndUnzip(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) value);
        public static unsafe void StoreVector128x3AndUnzip(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) value);
        
        // ST1 (multiple structures) 4 register variant
        public static unsafe void StoreVector128x4AndUnzip(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) value);
        public static unsafe void StoreVector128x4AndUnzip(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) value);

        // ST1 (single structure)
        // StoreSelectedScalar already present
        
        // ST2 (multiple structures)
        public static unsafe void StoreVector128x2(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2) value);
        public static unsafe void StoreVector128x2(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) value);
        public static unsafe void StoreVector128x2(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2) value);
        public static unsafe void StoreVector128x2(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2) value);
        public static unsafe void StoreVector128x2(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2) value);
        public static unsafe void StoreVector128x2(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2) value);
        public static unsafe void StoreVector128x2(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2) value);
        public static unsafe void StoreVector128x2(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2) value);
        public static unsafe void StoreVector128x2(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2) value);
        public static unsafe void StoreVector128x2(double* address, (Vector128<double> Value1, Vector128<double> Value2) value);

        // ST2 (single structure)
        public static unsafe void StoreSelectedScalar128x2(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(double* address, (Vector128<double> Value1, Vector128<double> Value2) value, byte index);

        // ST3 (multiple structures)
        public static unsafe void StoreVector128x3(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) value);
        public static unsafe void StoreVector128x3(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) value);
        public static unsafe void StoreVector128x3(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) value);
        public static unsafe void StoreVector128x3(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) value);
        public static unsafe void StoreVector128x3(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) value);
        public static unsafe void StoreVector128x3(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) value);
        public static unsafe void StoreVector128x3(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) value);
        public static unsafe void StoreVector128x3(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) value);
        public static unsafe void StoreVector128x3(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) value);
        public static unsafe void StoreVector128x3(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) value);

        // ST3 (single structure)
        public static unsafe void StoreSelectedScalar128x3(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) value, byte index);

        // ST4 (multiple structures)
        public static unsafe void StoreVector128x4(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) value);
        public static unsafe void StoreVector128x4(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) value);
        public static unsafe void StoreVector128x4(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) value);
        public static unsafe void StoreVector128x4(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) value);
        public static unsafe void StoreVector128x4(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) value);
        public static unsafe void StoreVector128x4(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) value);
        public static unsafe void StoreVector128x4(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) value);
        public static unsafe void StoreVector128x4(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) value);
        public static unsafe void StoreVector128x4(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) value);
        public static unsafe void StoreVector128x4(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) value);

        // ST4 (single structure)
        public static unsafe void StoreSelectedScalar128x4(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) value, byte index);
    }
}

API Usage

// Fetch the values
var v = LoadVector128x2(address);

// Getting the values out
Console.WriteLine(v.Item1);
Console.WriteLine(v.Item2);
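
The snippet above needs Arm64 hardware to run, so it may help to see what the three load families return using a scalar model. The sketch below assumes the mapping given by the comments in the proposal (`// LD1 (multiple structures) 2 register variant`, `// LD2 (multiple structures)`, `// LD2R`) and models the behaviour of those instructions on plain `int[]` arrays standing in for `Vector128<int>`. The class `StructuredLoadModel` and its method names are hypothetical and not part of the proposed API.

```csharp
using System;

// Memory holds four interleaved (a, b) pairs: a0 b0 a1 b1 a2 b2 a3 b3.
int[] memory = { 0, 1, 2, 3, 4, 5, 6, 7 };

var (c1, c2) = StructuredLoadModel.LoadContiguousX2(memory);
Console.WriteLine(string.Join(", ", c1)); // 0, 1, 2, 3
Console.WriteLine(string.Join(", ", c2)); // 4, 5, 6, 7

var (d1, d2) = StructuredLoadModel.LoadDeinterleavedX2(memory);
Console.WriteLine(string.Join(", ", d1)); // 0, 2, 4, 6
Console.WriteLine(string.Join(", ", d2)); // 1, 3, 5, 7

var (r1, r2) = StructuredLoadModel.LoadAndReplicateX2(memory);
Console.WriteLine(string.Join(", ", r1)); // 0, 0, 0, 0
Console.WriteLine(string.Join(", ", r2)); // 1, 1, 1, 1

// Scalar model only (assumption): arrays stand in for Vector128<int> lanes,
// so this runs on any CPU and does not call the real intrinsics.
static class StructuredLoadModel
{
    private const int Lanes = 4; // a Vector128<int> holds four 32-bit lanes

    // Models LD1 with two registers: two consecutive vectors, no de-interleaving.
    public static (int[] Value1, int[] Value2) LoadContiguousX2(int[] memory)
    {
        var v1 = new int[Lanes];
        var v2 = new int[Lanes];
        Array.Copy(memory, 0, v1, 0, Lanes);
        Array.Copy(memory, Lanes, v2, 0, Lanes);
        return (v1, v2);
    }

    // Models LD2 (multiple structures): element 2*i goes to Value1[i],
    // element 2*i + 1 goes to Value2[i].
    public static (int[] Value1, int[] Value2) LoadDeinterleavedX2(int[] memory)
    {
        var v1 = new int[Lanes];
        var v2 = new int[Lanes];
        for (int i = 0; i < Lanes; i++)
        {
            v1[i] = memory[2 * i];
            v2[i] = memory[2 * i + 1];
        }
        return (v1, v2);
    }

    // Models LD2R: reads one 2-element structure and broadcasts each element
    // across every lane of its own vector.
    public static (int[] Value1, int[] Value2) LoadAndReplicateX2(int[] memory)
    {
        var v1 = new int[Lanes];
        var v2 = new int[Lanes];
        Array.Fill(v1, memory[0]);
        Array.Fill(v2, memory[1]);
        return (v1, v2);
    }
}
```

The x3 and x4 variants follow the same pattern with a stride of 3 or 4 elements per structure.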

Alternative Designs

No response

Risks

No response

@kunalspathak kunalspathak added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Apr 8, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Apr 8, 2023
@ghost
ghost commented Apr 8, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

These APIs provide a way to load Vector64 and Vector128 values from an address. The x2, x3 and x4 variants provide a way to load 2, 3 and 4 vectors simultaneously.

API Proposal

namespace System.Runtime.Intrinsics.Arm
{
    public static class AdvSimd
    {
        public partial class Arm
        {
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2)    LoadVector64x2(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2)   LoadVector64x2(sbyte*  address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2)   LoadVector64x2(short*  address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2)  LoadVector64x2(ushort* address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)     LoadVector64x2(int*    address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)    LoadVector64x2(uint*   address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)   LoadVector64x2(float*  address);

            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3)      LoadVector64x3(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3)     LoadVector64x3(sbyte*  address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3)     LoadVector64x3(short*  address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3)    LoadVector64x3(ushort*  address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3)       LoadVector64x3(int*  address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3)      LoadVector64x3(uint*  address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2,     Vector64<float> Value3)     LoadVector64x3(float*  address);


            public static unsafe (Vector64<byte> Value1,        Vector64<byte> Value2,      Vector64<byte> Value3,      Vector64<byte> Value4)      LoadVector64x4(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,       Vector64<sbyte> Value2,     Vector64<sbyte> Value3,     Vector64<sbyte> Value4)     LoadVector64x4(sbyte*   address);
            public static unsafe (Vector64<short> Value1,       Vector64<short> Value2,     Vector64<short> Value3,     Vector64<short> Value4)     LoadVector64x4(short*   address);
            public static unsafe (Vector64<ushort> Value1,      Vector64<ushort> Value2,    Vector64<ushort> Value3,    Vector64<ushort> Value4)    LoadVector64x4(ushort*   address);
            public static unsafe (Vector64<int> Value1,         Vector64<int> Value2,       Vector64<int> Value3,       Vector64<int> Value4)       LoadVector64x4(int*   address);
            public static unsafe (Vector64<uint> Value1,        Vector64<uint> Value2,      Vector64<uint> Value3,      Vector64<uint> Value4)      LoadVector64x4(uint*   address);
            public static unsafe (Vector64<float> Value1,       Vector64<float> Value2,     Vector64<float> Value3,     Vector64<float> Value4)     LoadVector64x4(float*   address);
        }

        public partial class Arm64
        {
            public static unsafe (Vector128<byte> Value1,   Vector128<byte> Value2)     LoadVector128x2(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,  Vector128<sbyte> Value2)    LoadVector128x2(sbyte*  address);
            public static unsafe (Vector128<short> Value1,  Vector128<short> Value2)    LoadVector128x2(short*  address);
            public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2)   LoadVector128x2(ushort* address);
            public static unsafe (Vector128<int> Value1,    Vector128<int> Value2)      LoadVector128x2(int*    address);
            public static unsafe (Vector128<uint> Value1,   Vector128<uint> Value2)     LoadVector128x2(uint*   address);
            public static unsafe (Vector128<float> Value1,  Vector128<float> Value2)    LoadVector128x2(float*  address);
            public static unsafe (Vector128<long> Value1,   Vector128<long> Value2)     LoadVector128x2(long*  address);
            public static unsafe (Vector128<ulong> Value1,  Vector128<ulong> Value2)    LoadVector128x2(ulong*  address);
            public static unsafe (Vector128<double> Value1, Vector128<double> Value2)   LoadVector128x2(double*  address);

            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2,      Vector128<byte> Value3)      LoadVector128x3(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2,     Vector128<sbyte> Value3)     LoadVector128x3(sbyte*  address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2,     Vector128<short> Value3)     LoadVector128x3(short*  address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2,    Vector128<ushort> Value3)    LoadVector128x3(ushort*  address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2,       Vector128<int> Value3)       LoadVector128x3(int*  address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2,      Vector128<uint> Value3)      LoadVector128x3(uint*  address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2,     Vector128<float> Value3)     LoadVector128x3(float*  address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2,      Vector128<long> Value3)      LoadVector128x3(long*  address);
            public static unsafe (Vector128<ulong> Value1,   Vector128<ulong> Value2,     Vector128<ulong> Value3)     LoadVector128x3(ulong*  address);
            public static unsafe (Vector128<double> Value1,  Vector128<double> Value2,    Vector128<double> Value3)    LoadVector128x3(double*  address);


            public static unsafe (Vector128<byte> Value1,   Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4)      LoadVector128x4(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,  Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4)     LoadVector128x4(sbyte*   address);
            public static unsafe (Vector128<short> Value1,  Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4)     LoadVector128x4(short*   address);
            public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4)    LoadVector128x4(ushort*   address);
            public static unsafe (Vector128<int> Value1,    Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4)       LoadVector128x4(int*   address);
            public static unsafe (Vector128<uint> Value1,   Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4)      LoadVector128x4(uint*   address);
            public static unsafe (Vector128<float> Value1,  Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4)     LoadVector128x4(float*   address);
            public static unsafe (Vector128<long> Value1,   Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4)      LoadVector128x4(long*  address);
            public static unsafe (Vector128<ulong> Value1,  Vector128<ulong> Value2,     Vector128<ulong> Value3,     Vector128<ulong> Value4)     LoadVector128x4(ulong*  address);
            public static unsafe (Vector128<double> Value1, Vector128<double> Value2,    Vector128<double> Value3,    Vector128<double> Value4)    LoadVector128x4(double*  address);            
        }
    }
}

API Usage

// Fancy the value
var v = LoadVector128x2(address);

// Getting the values out
Console.WriteLine(v.Item1);
Console.WriteLine(v.Item2);

Alternative Designs

No response

Risks

No response

Author: kunalspathak
Assignees: -
Labels:

api-suggestion, area-System.Runtime.Intrinsics, untriaged

Milestone: -

@kunalspathak
Member Author

@tannergooding

@MichalPetryka
Contributor

MichalPetryka commented Apr 8, 2023

There are already LoadPairVector64 and LoadPairVector128, how is your LoadVector64x2 different from those? Does ARM64 provide instructions to load 3 or 4 vectors at once?

@tannergooding
Member

There are tons of differing semantics between the relevant instructions.

We basically have:

  • LD1 (multiple structures)
  • LD1 (multiple structures) 2/3/4-register variant
  • LD1 (single structure) - 8/16/32/64-bit variant
  • LD1R
  • LD2 (multiple structures)
  • LD2 (single structure) - 8/16/32/64-bit variant
  • LD2R
  • LD3 (multiple structures)
  • LD3 (single structure) - 8/16/32/64-bit variant
  • LD3R
  • LD4 (multiple structures)
  • LD4 (single structure) - 8/16/32/64-bit variant
  • LD4R
  • LDP

For LD1/2/3/4 (single structure) - 8/16/32/64-bit variant we load up to four consecutive elements into up to four consecutive registers at the specified index:

return (
    Value0.WithElement(index, address[0]),    // LD1/2/3/4
    Value1.WithElement(index, address[1]),    // LD  2/3/4
    Value2.WithElement(index, address[2]),    // LD    3/4
    Value3.WithElement(index, address[3])     // LD      4
);

Thus the signatures are effectively:

public static unsafe Vector128<T> LoadAndInsertScalar(Vector128<T> value, byte index, T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1) LoadAndInsertScalar((Vector128<T> Value0, Vector128<T> Value1) values, byte index, T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2) LoadAndInsertScalar((Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2) values, byte index, T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2, Vector128<T> Value3) LoadAndInsertScalar((Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2, Vector128<T> Value3) values, byte index, T* address);
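
To make the lane-insert semantics concrete, here is a minimal software model of the two-register case. It is only a sketch: `InsertScalarModel` and `LoadAndInsertScalarX2` are hypothetical names, a `ReadOnlySpan<int>` stands in for the raw pointer, and nothing here calls the proposed intrinsics.

```csharp
using System;
using System.Runtime.Intrinsics;

static class InsertScalarModel
{
    // Hypothetical model of LD2 (single structure): read two consecutive
    // elements and write one into lane `index` of each input vector.
    // All other lanes of the inputs are preserved.
    public static (Vector128<int> Value0, Vector128<int> Value1) LoadAndInsertScalarX2(
        (Vector128<int> Value0, Vector128<int> Value1) values, byte index, ReadOnlySpan<int> memory)
    {
        return (
            values.Value0.WithElement(index, memory[0]),
            values.Value1.WithElement(index, memory[1]));
    }

    static void Main()
    {
        var values = (Vector128<int>.Zero, Vector128.Create(10, 11, 12, 13));
        int[] memory = { 100, 200 };

        var result = LoadAndInsertScalarX2(values, 2, memory);

        Console.WriteLine(result.Value0); // lanes: 0, 0, 100, 0
        Console.WriteLine(result.Value1); // lanes: 10, 11, 200, 13
    }
}
```

The inputs are passed in and returned because only one lane per register changes, which is why the signatures above take the existing vectors as a parameter.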

For LD1/2/3/4R, we basically have the same thing but we replicate the value to all elements:

return (
    LoadAndReplicateToVector128(address + 0), // LD1/2/3/4R
    LoadAndReplicateToVector128(address + 1), // LD  2/3/4R
    LoadAndReplicateToVector128(address + 2), // LD    3/4R
    LoadAndReplicateToVector128(address + 3)  // LD      4R
);

Thus the signatures are effectively:

public static unsafe Vector128<T> LoadAndReplicateToVector128(T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1) LoadAndReplicateToVector128x2(T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2) LoadAndReplicateToVector128x3(T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2, Vector128<T> Value3) LoadAndReplicateToVector128x4(T* address);
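
As a rough illustration of the replicate form, the two-register case can be modeled in plain C#. This is a sketch under assumptions: `ReplicateModel` and `LoadAndReplicateX2` are hypothetical names, and a span replaces the pointer so the code runs on any platform.

```csharp
using System;
using System.Runtime.Intrinsics;

static class ReplicateModel
{
    // Hypothetical model of LD2R: read two consecutive elements and
    // broadcast each one to every lane of its own vector.
    public static (Vector128<int> Value0, Vector128<int> Value1) LoadAndReplicateX2(
        ReadOnlySpan<int> memory)
    {
        return (Vector128.Create(memory[0]), Vector128.Create(memory[1]));
    }

    static void Main()
    {
        int[] memory = { 7, 9 };

        var result = LoadAndReplicateX2(memory);

        Console.WriteLine(result.Value0); // every lane holds 7
        Console.WriteLine(result.Value1); // every lane holds 9
    }
}
```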

For LD1/2/3/4 (multiple structures) we start getting into more interesting considerations as we get de-interleaving:

nuint index = 0;

for (int e = 0; e < Vector128<T>.Count; e++)
{
    // WithElement returns a new vector, so the result must be assigned back.
    Value0 = Value0.WithElement(e, address[index++]);  // LD1/2/3/4
    Value1 = Value1.WithElement(e, address[index++]);  // LD  2/3/4
    Value2 = Value2.WithElement(e, address[index++]);  // LD    3/4
    Value3 = Value3.WithElement(e, address[index++]);  // LD      4
}

return (Value0, Value1, Value2, Value3);

We also want to consider LD1 (multiple structures) - 2/3/4-register variant which is effectively 2, 3, or 4 consecutive LD1 calls:

return (
    LoadVector128(address + (count * 0)),     // LD1 - 2/3/4 register variant
    LoadVector128(address + (count * 1)),     // LD1 - 2/3/4 register variant
    LoadVector128(address + (count * 2)),     // LD1 -   3/4 register variant
    LoadVector128(address + (count * 3))      // LD1 -     4 register variant
);

Thus we have two sets of signatures:

public static unsafe Vector128<T> LoadVector128(T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1) LoadVector128x2(T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2) LoadVector128x3(T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2, Vector128<T> Value3) LoadVector128x4(T* address);

public static unsafe (Vector128<T> Value0, Vector128<T> Value1) LoadVector128x2AndUnzip(T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2) LoadVector128x3AndUnzip(T* address);
public static unsafe (Vector128<T> Value0, Vector128<T> Value1, Vector128<T> Value2, Vector128<T> Value3) LoadVector128x4AndUnzip(T* address);
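
The difference between the two sets is easiest to see side by side. The following is a software model only, assuming hypothetical helper names (`LoadModel`, `LoadX2`, `LoadX2AndUnzip`) and a span in place of the pointer; it does not use the proposed intrinsics.

```csharp
using System;
using System.Runtime.Intrinsics;

static class LoadModel
{
    // Hypothetical model of LD1 (two-register variant): two back-to-back
    // vector loads, so memory order is preserved.
    public static (Vector128<int> Value0, Vector128<int> Value1) LoadX2(ReadOnlySpan<int> memory)
    {
        int count = Vector128<int>.Count;
        Vector128<int> v0 = Vector128<int>.Zero, v1 = Vector128<int>.Zero;
        for (int e = 0; e < count; e++)
        {
            v0 = v0.WithElement(e, memory[e]);
            v1 = v1.WithElement(e, memory[count + e]);
        }
        return (v0, v1);
    }

    // Hypothetical model of LD2 (multiple structures): even-indexed elements
    // go to the first vector, odd-indexed elements to the second.
    public static (Vector128<int> Value0, Vector128<int> Value1) LoadX2AndUnzip(ReadOnlySpan<int> memory)
    {
        int count = Vector128<int>.Count;
        Vector128<int> v0 = Vector128<int>.Zero, v1 = Vector128<int>.Zero;
        for (int e = 0; e < count; e++)
        {
            v0 = v0.WithElement(e, memory[2 * e]);
            v1 = v1.WithElement(e, memory[2 * e + 1]);
        }
        return (v0, v1);
    }

    static void Main()
    {
        int[] memory = { 0, 1, 2, 3, 4, 5, 6, 7 };

        var plain = LoadX2(memory);
        var unzipped = LoadX2AndUnzip(memory);

        Console.WriteLine(plain.Value0);    // lanes: 0, 1, 2, 3
        Console.WriteLine(plain.Value1);    // lanes: 4, 5, 6, 7
        Console.WriteLine(unzipped.Value0); // lanes: 0, 2, 4, 6
        Console.WriteLine(unzipped.Value1); // lanes: 1, 3, 5, 7
    }
}
```

The unzip form is the one that suits interleaved data such as x/y pairs, since each output vector ends up holding a single component.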

Finally we get to LDP. It does ultimately load two vectors from memory, much like the LD1 two-register variant, but the documented operation is two memory accesses of size 32, 64, or 128 bits. This is different from LD1, which documents itself as multiple single-element memory accesses being combined.

This difference may or may not be important, depending on the developer's exact intent and the hardware involved, so exposing both and ensuring we use the correct optimizations to preserve semantics can be important.

@tannergooding
Member

We should likely cover all the missing load functionality and ensure the corresponding store APIs are also covered as part of this proposal.

@tannergooding
Member

@kunalspathak, could you update to cover the missing functionality so we can review it all at once?

@kunalspathak
Member Author

@kunalspathak, could you update to cover the missing functionality so we can review it all at once?

Sorry @tannergooding I got occupied with other things. I will put this back on my list.

@kunalspathak
Member Author

kunalspathak commented May 21, 2023

A nice reference for these instructions is in https://eclecticlight.co/2021/08/23/code-in-arm-assembly-lanes-and-loads-in-neon/

ld1-ld4

namespace System.Runtime.Intrinsics.Arm
{
    public static class AdvSimd
    {
        public partial class Arm
        {
            // LD1 (multiple structures)
            // LoadVector64 already present

            // LD1 (multiple structures) 2 register variant
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2)    LoadVector64x2AndUnzip(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2)   LoadVector64x2AndUnzip(sbyte*  address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2)   LoadVector64x2AndUnzip(short*  address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2)  LoadVector64x2AndUnzip(ushort* address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)     LoadVector64x2AndUnzip(int*    address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)    LoadVector64x2AndUnzip(uint*   address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)   LoadVector64x2AndUnzip(float*  address);

            // LD1 (multiple structures) 3 register variant
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3)      LoadVector64x3AndUnzip(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3)     LoadVector64x3AndUnzip(sbyte*  address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3)     LoadVector64x3AndUnzip(short*  address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3)    LoadVector64x3AndUnzip(ushort*  address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3)       LoadVector64x3AndUnzip(int*  address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3)      LoadVector64x3AndUnzip(uint*  address);            
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2,    Vector64<float> Value3)      LoadVector64x3AndUnzip(float*  address);            
            
            // LD1 (multiple structures) 4 register variant            
            public static unsafe (Vector64<byte> Value1,        Vector64<byte> Value2,      Vector64<byte> Value3,      Vector64<byte> Value4)      LoadVector64x4AndUnzip(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,       Vector64<sbyte> Value2,     Vector64<sbyte> Value3,     Vector64<sbyte> Value4)     LoadVector64x4AndUnzip(sbyte*   address);
            public static unsafe (Vector64<short> Value1,       Vector64<short> Value2,     Vector64<short> Value3,     Vector64<short> Value4)     LoadVector64x4AndUnzip(short*   address);
            public static unsafe (Vector64<ushort> Value1,      Vector64<ushort> Value2,    Vector64<ushort> Value3,    Vector64<ushort> Value4)    LoadVector64x4AndUnzip(ushort*   address);
            public static unsafe (Vector64<int> Value1,         Vector64<int> Value2,       Vector64<int> Value3,       Vector64<int> Value4)       LoadVector64x4AndUnzip(int*   address);
            public static unsafe (Vector64<uint> Value1,        Vector64<uint> Value2,      Vector64<uint> Value3,      Vector64<uint> Value4)      LoadVector64x4AndUnzip(uint*   address);
            public static unsafe (Vector64<float> Value1,       Vector64<float> Value2,     Vector64<float> Value3,     Vector64<float> Value4)     LoadVector64x4AndUnzip(float*   address);    
            
            // LD1 (single structure)
            // LoadAndInsertScalar already present

            // LD1R
            // LoadAndReplicateToVector64 already present
            
            // LD2 (multiple structures)
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2)    LoadVector64x2(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2)   LoadVector64x2(sbyte*  address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2)   LoadVector64x2(short*  address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2)  LoadVector64x2(ushort* address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)     LoadVector64x2(int*    address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)    LoadVector64x2(uint*   address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)   LoadVector64x2(float*  address);

            // LD2 (single structure)
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2)     LoadVectorAndInsertScalar64x2((Vector64<byte> Value1,   Vector64<byte> Value2), byte index, byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2)    LoadVectorAndInsertScalar64x2((Vector64<sbyte> Value1,  Vector64<sbyte> Value2), byte index, sbyte*   address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2)    LoadVectorAndInsertScalar64x2((Vector64<short> Value1,  Vector64<short> Value2), byte index, short*   address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2)   LoadVectorAndInsertScalar64x2((Vector64<ushort> Value1, Vector64<ushort> Value2), byte index, ushort*   address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)      LoadVectorAndInsertScalar64x2((Vector64<int> Value1,    Vector64<int> Value2), byte index, int*   address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)     LoadVectorAndInsertScalar64x2((Vector64<uint> Value1,   Vector64<uint> Value2), byte index, uint*   address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)    LoadVectorAndInsertScalar64x2((Vector64<float> Value1,  Vector64<float> Value2), byte index, float*   address);

            // LD2R
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2)     LoadAndReplicateToVector64x2(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2)    LoadAndReplicateToVector64x2(sbyte*   address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2)    LoadAndReplicateToVector64x2(short*   address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2)   LoadAndReplicateToVector64x2(ushort*   address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)      LoadAndReplicateToVector64x2(int*   address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)     LoadAndReplicateToVector64x2(uint*   address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)    LoadAndReplicateToVector64x2(float*   address);

            // LD3 (multiple structures)
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3)      LoadVector64x3(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3)     LoadVector64x3(sbyte*  address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3)     LoadVector64x3(short*  address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3)    LoadVector64x3(ushort*  address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3)       LoadVector64x3(int*  address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3)      LoadVector64x3(uint*  address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2,     Vector64<float> Value3)     LoadVector64x3(float*  address);

            // LD3 (single structure)
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3)      LoadVectorAndInsertScalar64x3((Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3), byte index, byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3)     LoadVectorAndInsertScalar64x3((Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3), byte index, sbyte*  address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3)     LoadVectorAndInsertScalar64x3((Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3), byte index, short*  address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3)    LoadVectorAndInsertScalar64x3((Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3), byte index, ushort*  address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3)       LoadVectorAndInsertScalar64x3((Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3), byte index, int*  address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3)      LoadVectorAndInsertScalar64x3((Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3), byte index, uint*  address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2,     Vector64<float> Value3)     LoadVectorAndInsertScalar64x3((Vector64<float> Value1,   Vector64<float> Value2,     Vector64<float> Value3), byte index, float*  address);

            // LD3R
            public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3)      LoadAndReplicateToVector64x3(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3)     LoadAndReplicateToVector64x3(sbyte*  address);
            public static unsafe (Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3)     LoadAndReplicateToVector64x3(short*  address);
            public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3)    LoadAndReplicateToVector64x3(ushort*  address);
            public static unsafe (Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3)       LoadAndReplicateToVector64x3(int*  address);
            public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3)      LoadAndReplicateToVector64x3(uint*  address);
            public static unsafe (Vector64<float> Value1,   Vector64<float> Value2,     Vector64<float> Value3)     LoadAndReplicateToVector64x3(float*  address);

            // LD4 (multiple structures)
            public static unsafe (Vector64<byte> Value1,        Vector64<byte> Value2,      Vector64<byte> Value3,      Vector64<byte> Value4)      LoadVector64x4(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,       Vector64<sbyte> Value2,     Vector64<sbyte> Value3,     Vector64<sbyte> Value4)     LoadVector64x4(sbyte*   address);
            public static unsafe (Vector64<short> Value1,       Vector64<short> Value2,     Vector64<short> Value3,     Vector64<short> Value4)     LoadVector64x4(short*   address);
            public static unsafe (Vector64<ushort> Value1,      Vector64<ushort> Value2,    Vector64<ushort> Value3,    Vector64<ushort> Value4)    LoadVector64x4(ushort*   address);
            public static unsafe (Vector64<int> Value1,         Vector64<int> Value2,       Vector64<int> Value3,       Vector64<int> Value4)       LoadVector64x4(int*   address);
            public static unsafe (Vector64<uint> Value1,        Vector64<uint> Value2,      Vector64<uint> Value3,      Vector64<uint> Value4)      LoadVector64x4(uint*   address);
            public static unsafe (Vector64<float> Value1,       Vector64<float> Value2,     Vector64<float> Value3,     Vector64<float> Value4)     LoadVector64x4(float*   address);

            // LD4 (single structure)
            public static unsafe (Vector64<byte> Value1,        Vector64<byte> Value2,      Vector64<byte> Value3,      Vector64<byte> Value4)      LoadVectorAndInsertScalar64x4((Vector64<byte> Value1,        Vector64<byte> Value2,      Vector64<byte> Value3,      Vector64<byte> Value4), byte index, byte*   address);
            public static unsafe (Vector64<sbyte> Value1,       Vector64<sbyte> Value2,     Vector64<sbyte> Value3,     Vector64<sbyte> Value4)     LoadVectorAndInsertScalar64x4((Vector64<sbyte> Value1,       Vector64<sbyte> Value2,     Vector64<sbyte> Value3,     Vector64<sbyte> Value4), byte index, sbyte*   address);
            public static unsafe (Vector64<short> Value1,       Vector64<short> Value2,     Vector64<short> Value3,     Vector64<short> Value4)     LoadVectorAndInsertScalar64x4((Vector64<short> Value1,       Vector64<short> Value2,     Vector64<short> Value3,     Vector64<short> Value4), byte index, short*   address);
            public static unsafe (Vector64<ushort> Value1,      Vector64<ushort> Value2,    Vector64<ushort> Value3,    Vector64<ushort> Value4)    LoadVectorAndInsertScalar64x4((Vector64<ushort> Value1,      Vector64<ushort> Value2,    Vector64<ushort> Value3,    Vector64<ushort> Value4), byte index, ushort*   address);
            public static unsafe (Vector64<int> Value1,         Vector64<int> Value2,       Vector64<int> Value3,       Vector64<int> Value4)       LoadVectorAndInsertScalar64x4((Vector64<int> Value1,         Vector64<int> Value2,       Vector64<int> Value3,       Vector64<int> Value4), byte index, int*   address);
            public static unsafe (Vector64<uint> Value1,        Vector64<uint> Value2,      Vector64<uint> Value3,      Vector64<uint> Value4)      LoadVectorAndInsertScalar64x4((Vector64<uint> Value1,        Vector64<uint> Value2,      Vector64<uint> Value3,      Vector64<uint> Value4), byte index, uint*   address);
            public static unsafe (Vector64<float> Value1,       Vector64<float> Value2,     Vector64<float> Value3,     Vector64<float> Value4)     LoadVectorAndInsertScalar64x4((Vector64<float> Value1,       Vector64<float> Value2,     Vector64<float> Value3,     Vector64<float> Value4), byte index, float*   address);

            // LD4R
            public static unsafe (Vector64<byte> Value1,        Vector64<byte> Value2,      Vector64<byte> Value3,      Vector64<byte> Value4)      LoadAndReplicateToVector64x4(byte*   address);
            public static unsafe (Vector64<sbyte> Value1,       Vector64<sbyte> Value2,     Vector64<sbyte> Value3,     Vector64<sbyte> Value4)     LoadAndReplicateToVector64x4(sbyte*   address);
            public static unsafe (Vector64<short> Value1,       Vector64<short> Value2,     Vector64<short> Value3,     Vector64<short> Value4)     LoadAndReplicateToVector64x4(short*   address);
            public static unsafe (Vector64<ushort> Value1,      Vector64<ushort> Value2,    Vector64<ushort> Value3,    Vector64<ushort> Value4)    LoadAndReplicateToVector64x4(ushort*   address);
            public static unsafe (Vector64<int> Value1,         Vector64<int> Value2,       Vector64<int> Value3,       Vector64<int> Value4)       LoadAndReplicateToVector64x4(int*   address);
            public static unsafe (Vector64<uint> Value1,        Vector64<uint> Value2,      Vector64<uint> Value3,      Vector64<uint> Value4)      LoadAndReplicateToVector64x4(uint*   address);
            public static unsafe (Vector64<float> Value1,       Vector64<float> Value2,     Vector64<float> Value3,     Vector64<float> Value4)     LoadAndReplicateToVector64x4(float*   address);
        }

        public partial class Arm64
        {
            // LD1 (multiple structures)
            // LoadVector128 already present

            // LD1 (multiple structures) 2 register variant
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)    LoadVector128x2AndUnzip(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2)   LoadVector128x2AndUnzip(sbyte*  address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2)   LoadVector128x2AndUnzip(short*  address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2)  LoadVector128x2AndUnzip(ushort* address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2)     LoadVector128x2AndUnzip(int*    address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2)    LoadVector128x2AndUnzip(uint*   address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2)    LoadVector128x2AndUnzip(long*   address);
            public static unsafe (Vector128<ulong> Value1,   Vector128<ulong> Value2)  LoadVector128x2AndUnzip(ulong*   address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2)   LoadVector128x2AndUnzip(float*  address);
            public static unsafe (Vector128<double> Value1,  Vector128<double> Value2) LoadVector128x2AndUnzip(double*  address);

            // LD1 (multiple structures) 3 register variant
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2,      Vector128<byte> Value3)      LoadVector128x3AndUnzip(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2,     Vector128<sbyte> Value3)     LoadVector128x3AndUnzip(sbyte*  address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2,     Vector128<short> Value3)     LoadVector128x3AndUnzip(short*  address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2,    Vector128<ushort> Value3)    LoadVector128x3AndUnzip(ushort*  address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2,       Vector128<int> Value3)       LoadVector128x3AndUnzip(int*  address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2,      Vector128<uint> Value3)      LoadVector128x3AndUnzip(uint*  address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2,      Vector128<long> Value3)      LoadVector128x3AndUnzip(long*  address);
            public static unsafe (Vector128<ulong> Value1,   Vector128<ulong> Value2,     Vector128<ulong> Value3)     LoadVector128x3AndUnzip(ulong*  address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2,     Vector128<float> Value3)     LoadVector128x3AndUnzip(float*  address);
            public static unsafe (Vector128<double> Value1,  Vector128<double> Value2,   Vector128<double> Value3)  LoadVector128x3AndUnzip(double*  address);
            
            // LD1 (multiple structures) 4 register variant            
            public static unsafe (Vector128<byte> Value1,        Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4)      LoadVector128x4AndUnzip(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,       Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4)     LoadVector128x4AndUnzip(sbyte*   address);
            public static unsafe (Vector128<short> Value1,       Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4)     LoadVector128x4AndUnzip(short*   address);
            public static unsafe (Vector128<ushort> Value1,      Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4)    LoadVector128x4AndUnzip(ushort*   address);
            public static unsafe (Vector128<int> Value1,         Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4)       LoadVector128x4AndUnzip(int*   address);
            public static unsafe (Vector128<uint> Value1,        Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4)      LoadVector128x4AndUnzip(uint*   address);
            public static unsafe (Vector128<long> Value1,        Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4)      LoadVector128x4AndUnzip(long*   address);
            public static unsafe (Vector128<ulong> Value1,       Vector128<ulong> Value2,    Vector128<ulong> Value3,     Vector128<ulong> Value4)      LoadVector128x4AndUnzip(ulong*   address);
            public static unsafe (Vector128<float> Value1,       Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4)     LoadVector128x4AndUnzip(float*   address);
            public static unsafe (Vector128<double> Value1,      Vector128<double> Value2,   Vector128<double> Value3,    Vector128<double> Value4)     LoadVector128x4AndUnzip(double*   address);

            // LD1 (single structure)
            // LoadAndInsertScalar already present

            // LD1R
            // LoadAndReplicateToVector128 already present
            
            // LD2 (multiple structures)
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)    LoadVector128x2(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2)   LoadVector128x2(sbyte*  address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2)   LoadVector128x2(short*  address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2)  LoadVector128x2(ushort* address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2)     LoadVector128x2(int*    address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2)    LoadVector128x2(uint*   address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2)    LoadVector128x2(long*   address);
            public static unsafe (Vector128<ulong> Value1,    Vector128<ulong> Value2)  LoadVector128x2(ulong*   address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2)   LoadVector128x2(float*  address);
            public static unsafe (Vector128<double> Value1,   Vector128<double> Value2) LoadVector128x2(double*  address);

            // LD2 (single structure)
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)     LoadVectorAndInsertScalar128x2((Vector128<byte> Value1,   Vector128<byte> Value2), byte index, byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2)    LoadVectorAndInsertScalar128x2((Vector128<sbyte> Value1,  Vector128<sbyte> Value2), byte index, sbyte*   address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2)    LoadVectorAndInsertScalar128x2((Vector128<short> Value1,  Vector128<short> Value2), byte index, short*   address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2)   LoadVectorAndInsertScalar128x2((Vector128<ushort> Value1, Vector128<ushort> Value2), byte index, ushort*   address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2)      LoadVectorAndInsertScalar128x2((Vector128<int> Value1,    Vector128<int> Value2), byte index, int*   address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2)     LoadVectorAndInsertScalar128x2((Vector128<uint> Value1,   Vector128<uint> Value2), byte index, uint*   address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2)     LoadVectorAndInsertScalar128x2((Vector128<long> Value1,   Vector128<long> Value2), byte index, long*   address);
            public static unsafe (Vector128<ulong> Value1,   Vector128<ulong> Value2)    LoadVectorAndInsertScalar128x2((Vector128<ulong> Value1,  Vector128<ulong> Value2), byte index, ulong*   address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2)    LoadVectorAndInsertScalar128x2((Vector128<float> Value1,  Vector128<float> Value2), byte index, float*   address);
            public static unsafe (Vector128<double> Value1,  Vector128<double> Value2)   LoadVectorAndInsertScalar128x2((Vector128<double> Value1, Vector128<double> Value2), byte index, double*   address);

            // LD2R
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)     LoadAndReplicateToVector128x2(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2)    LoadAndReplicateToVector128x2(sbyte*   address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2)    LoadAndReplicateToVector128x2(short*   address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2)   LoadAndReplicateToVector128x2(ushort*   address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2)      LoadAndReplicateToVector128x2(int*   address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2)     LoadAndReplicateToVector128x2(uint*   address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2)     LoadAndReplicateToVector128x2(long*   address);
            public static unsafe (Vector128<ulong> Value1,   Vector128<ulong> Value2)    LoadAndReplicateToVector128x2(ulong*   address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2)    LoadAndReplicateToVector128x2(float*   address);
            public static unsafe (Vector128<double> Value1,  Vector128<double> Value2)   LoadAndReplicateToVector128x2(double*   address);

            // LD3 (multiple structures)
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2,      Vector128<byte> Value3)      LoadVector128x3(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2,     Vector128<sbyte> Value3)     LoadVector128x3(sbyte*  address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2,     Vector128<short> Value3)     LoadVector128x3(short*  address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2,    Vector128<ushort> Value3)    LoadVector128x3(ushort*  address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2,       Vector128<int> Value3)       LoadVector128x3(int*  address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2,      Vector128<uint> Value3)      LoadVector128x3(uint*  address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2,      Vector128<long> Value3)      LoadVector128x3(long*  address);
            public static unsafe (Vector128<ulong> Value1,   Vector128<ulong> Value2,     Vector128<ulong> Value3)     LoadVector128x3(ulong*  address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2,     Vector128<float> Value3)     LoadVector128x3(float*  address);
            public static unsafe (Vector128<double> Value1,  Vector128<double> Value2,    Vector128<double> Value3)    LoadVector128x3(double*  address);

            // LD3 (single structure)
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2,      Vector128<byte> Value3)      LoadVectorAndInsertScalar128x3((Vector128<byte> Value1,    Vector128<byte> Value2,      Vector128<byte> Value3), byte index, byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2,     Vector128<sbyte> Value3)     LoadVectorAndInsertScalar128x3((Vector128<sbyte> Value1,   Vector128<sbyte> Value2,     Vector128<sbyte> Value3), byte index, sbyte*  address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2,     Vector128<short> Value3)     LoadVectorAndInsertScalar128x3((Vector128<short> Value1,   Vector128<short> Value2,     Vector128<short> Value3), byte index, short*  address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2,    Vector128<ushort> Value3)    LoadVectorAndInsertScalar128x3((Vector128<ushort> Value1,  Vector128<ushort> Value2,    Vector128<ushort> Value3), byte index, ushort*  address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2,       Vector128<int> Value3)       LoadVectorAndInsertScalar128x3((Vector128<int> Value1,     Vector128<int> Value2,       Vector128<int> Value3), byte index, int*  address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2,      Vector128<uint> Value3)      LoadVectorAndInsertScalar128x3((Vector128<uint> Value1,    Vector128<uint> Value2,      Vector128<uint> Value3), byte index, uint*  address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2,      Vector128<long> Value3)      LoadVectorAndInsertScalar128x3((Vector128<long> Value1,    Vector128<long> Value2,      Vector128<long> Value3), byte index, long*  address);
            public static unsafe (Vector128<ulong> Value1,    Vector128<ulong> Value2,    Vector128<ulong> Value3)     LoadVectorAndInsertScalar128x3((Vector128<ulong> Value1,   Vector128<ulong> Value2,     Vector128<ulong> Value3), byte index, ulong*  address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2,     Vector128<float> Value3)     LoadVectorAndInsertScalar128x3((Vector128<float> Value1,   Vector128<float> Value2,     Vector128<float> Value3), byte index, float*  address);
            public static unsafe (Vector128<double> Value1,   Vector128<double> Value2,   Vector128<double> Value3)    LoadVectorAndInsertScalar128x3((Vector128<double> Value1,   Vector128<double> Value2,   Vector128<double> Value3), byte index, double*  address);

            // LD3R
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2,      Vector128<byte> Value3)      LoadAndReplicateToVector128x3(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,   Vector128<sbyte> Value2,     Vector128<sbyte> Value3)     LoadAndReplicateToVector128x3(sbyte*  address);
            public static unsafe (Vector128<short> Value1,   Vector128<short> Value2,     Vector128<short> Value3)     LoadAndReplicateToVector128x3(short*  address);
            public static unsafe (Vector128<ushort> Value1,  Vector128<ushort> Value2,    Vector128<ushort> Value3)    LoadAndReplicateToVector128x3(ushort*  address);
            public static unsafe (Vector128<int> Value1,     Vector128<int> Value2,       Vector128<int> Value3)       LoadAndReplicateToVector128x3(int*  address);
            public static unsafe (Vector128<uint> Value1,    Vector128<uint> Value2,      Vector128<uint> Value3)      LoadAndReplicateToVector128x3(uint*  address);
            public static unsafe (Vector128<long> Value1,    Vector128<long> Value2,      Vector128<long> Value3)      LoadAndReplicateToVector128x3(long*  address);
            public static unsafe (Vector128<ulong> Value1,   Vector128<ulong> Value2,     Vector128<ulong> Value3)      LoadAndReplicateToVector128x3(ulong*  address);
            public static unsafe (Vector128<float> Value1,   Vector128<float> Value2,     Vector128<float> Value3)     LoadAndReplicateToVector128x3(float*  address);
            public static unsafe (Vector128<double> Value1,  Vector128<double> Value2,    Vector128<double> Value3)     LoadAndReplicateToVector128x3(double*  address);

            // LD4 (multiple structures)
            public static unsafe (Vector128<byte> Value1,        Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4)      LoadVector128x4(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,       Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4)     LoadVector128x4(sbyte*   address);
            public static unsafe (Vector128<short> Value1,       Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4)     LoadVector128x4(short*   address);
            public static unsafe (Vector128<ushort> Value1,      Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4)    LoadVector128x4(ushort*   address);
            public static unsafe (Vector128<int> Value1,         Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4)       LoadVector128x4(int*   address);
            public static unsafe (Vector128<uint> Value1,        Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4)      LoadVector128x4(uint*   address);
            public static unsafe (Vector128<long> Value1,        Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4)      LoadVector128x4(long*   address);
            public static unsafe (Vector128<ulong> Value1,       Vector128<ulong> Value2,     Vector128<ulong> Value3,     Vector128<ulong> Value4)     LoadVector128x4(ulong*   address);
            public static unsafe (Vector128<float> Value1,       Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4)     LoadVector128x4(float*   address);
            public static unsafe (Vector128<double> Value1,      Vector128<double> Value2,    Vector128<double> Value3,    Vector128<double> Value4)    LoadVector128x4(double*   address);

            // LD4 (single structure)
            public static unsafe (Vector128<byte> Value1,        Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4)      LoadVectorAndInsertScalar128x4((Vector128<byte> Value1,        Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4), byte index, byte*   address);
            public static unsafe (Vector128<sbyte> Value1,       Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4)     LoadVectorAndInsertScalar128x4((Vector128<sbyte> Value1,       Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4), byte index, sbyte*   address);
            public static unsafe (Vector128<short> Value1,       Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4)     LoadVectorAndInsertScalar128x4((Vector128<short> Value1,       Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4), byte index, short*   address);
            public static unsafe (Vector128<ushort> Value1,      Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4)    LoadVectorAndInsertScalar128x4((Vector128<ushort> Value1,      Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4), byte index, ushort*   address);
            public static unsafe (Vector128<int> Value1,         Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4)       LoadVectorAndInsertScalar128x4((Vector128<int> Value1,         Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4), byte index, int*   address);
            public static unsafe (Vector128<uint> Value1,        Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4)      LoadVectorAndInsertScalar128x4((Vector128<uint> Value1,        Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4), byte index, uint*   address);
            public static unsafe (Vector128<long> Value1,        Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4)      LoadVectorAndInsertScalar128x4((Vector128<long> Value1,        Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4), byte index, long*   address);
            public static unsafe (Vector128<ulong> Value1,       Vector128<ulong> Value2,     Vector128<ulong> Value3,     Vector128<ulong> Value4)     LoadVectorAndInsertScalar128x4((Vector128<ulong> Value1,       Vector128<ulong> Value2,     Vector128<ulong> Value3,      Vector128<ulong> Value4), byte index, ulong*   address);
            public static unsafe (Vector128<float> Value1,       Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4)     LoadVectorAndInsertScalar128x4((Vector128<float> Value1,       Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4), byte index, float*   address);
            public static unsafe (Vector128<double> Value1,      Vector128<double> Value2,    Vector128<double> Value3,    Vector128<double> Value4)    LoadVectorAndInsertScalar128x4((Vector128<double> Value1,      Vector128<double> Value2,    Vector128<double> Value3,     Vector128<double> Value4), byte index, double*   address);

            // LD4R
            public static unsafe (Vector128<byte> Value1,        Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4)      LoadAndReplicateToVector128x4(byte*   address);
            public static unsafe (Vector128<sbyte> Value1,       Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4)     LoadAndReplicateToVector128x4(sbyte*   address);
            public static unsafe (Vector128<short> Value1,       Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4)     LoadAndReplicateToVector128x4(short*   address);
            public static unsafe (Vector128<ushort> Value1,      Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4)    LoadAndReplicateToVector128x4(ushort*   address);
            public static unsafe (Vector128<int> Value1,         Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4)       LoadAndReplicateToVector128x4(int*   address);
            public static unsafe (Vector128<uint> Value1,        Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4)      LoadAndReplicateToVector128x4(uint*   address);
            public static unsafe (Vector128<long> Value1,        Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4)      LoadAndReplicateToVector128x4(long*   address);
            public static unsafe (Vector128<ulong> Value1,       Vector128<ulong> Value2,     Vector128<ulong> Value3,     Vector128<ulong> Value4)     LoadAndReplicateToVector128x4(ulong*   address);
            public static unsafe (Vector128<float> Value1,       Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4)     LoadAndReplicateToVector128x4(float*   address);        
            public static unsafe (Vector128<double> Value1,      Vector128<double> Value2,    Vector128<double> Value3,    Vector128<double> Value4)    LoadAndReplicateToVector128x4(double*   address);        
        }
    }
}
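For readers less familiar with the instructions named in the `// LD1`, `// LD2`, `// LD2R` and `// LD2 (single structure)` comments above, here is a minimal scalar sketch of how the two-register forms are expected to fill their lanes. It models registers as plain arrays and does not call any intrinsic; `LaneModel` and its method names are hypothetical and exist only for illustration, and the three- and four-register forms follow the same pattern with a stride of 3 or 4.

```csharp
using System;

// Scalar model of the AArch64 two-register load forms.
// "Registers" are plain arrays with `lanes` elements each.
// Every name in this class is hypothetical (illustration only).
public static class LaneModel
{
    // LD1 {Vt, Vt2}: consecutive memory fills the first register, then the second.
    public static (T[] Value1, T[] Value2) Ld1x2<T>(T[] memory, int lanes)
    {
        return (memory.AsSpan(0, lanes).ToArray(),
                memory.AsSpan(lanes, lanes).ToArray());
    }

    // LD2 {Vt, Vt2}: memory holds 2-element structures; element 0 of each
    // structure goes to the first register, element 1 to the second.
    public static (T[] Value1, T[] Value2) Ld2<T>(T[] memory, int lanes)
    {
        var v1 = new T[lanes];
        var v2 = new T[lanes];
        for (int i = 0; i < lanes; i++)
        {
            v1[i] = memory[2 * i];
            v2[i] = memory[2 * i + 1];
        }
        return (v1, v2);
    }

    // LD2R: a single 2-element structure is read and each element is
    // replicated to every lane of its register.
    public static (T[] Value1, T[] Value2) Ld2R<T>(T[] memory, int lanes)
    {
        var v1 = new T[lanes];
        var v2 = new T[lanes];
        Array.Fill(v1, memory[0]);
        Array.Fill(v2, memory[1]);
        return (v1, v2);
    }

    // LD2 (single structure): a single 2-element structure is read into lane
    // `index` of both registers; all other lanes keep their incoming values.
    public static (T[] Value1, T[] Value2) Ld2Lane<T>(
        (T[] Value1, T[] Value2) values, int index, T[] memory)
    {
        var v1 = (T[])values.Value1.Clone();
        var v2 = (T[])values.Value2.Clone();
        v1[index] = memory[0];
        v2[index] = memory[1];
        return (v1, v2);
    }
}
```

With memory holding the bytes 0 through 7 and four lanes per register, `Ld1x2` yields (0, 1, 2, 3) and (4, 5, 6, 7), while `Ld2` yields (0, 2, 4, 6) and (1, 3, 5, 7).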

@kunalspathak
Member Author

st1-st4

namespace System.Runtime.Intrinsics.Arm
{
    public static class AdvSimd
    {
        public partial class Arm
        {
            // ST1 (multiple structures)
            // StoreVector already present

            // ST1 (multiple structures) 2 register variant
            public static unsafe void StoreVector64x2AndUnzip(byte*   address, (Vector64<byte> Value1,    Vector64<byte> Value2));
            public static unsafe void StoreVector64x2AndUnzip(sbyte*  address, (Vector64<sbyte> Value1,   Vector64<sbyte> Value2));
            public static unsafe void StoreVector64x2AndUnzip(short*  address, (Vector64<short> Value1,   Vector64<short> Value2));
            public static unsafe void StoreVector64x2AndUnzip(ushort* address, (Vector64<ushort> Value1,  Vector64<ushort> Value2));
            public static unsafe void StoreVector64x2AndUnzip(int*    address, (Vector64<int> Value1,     Vector64<int> Value2));
            public static unsafe void StoreVector64x2AndUnzip(uint*   address, (Vector64<uint> Value1,    Vector64<uint> Value2));
            public static unsafe void StoreVector64x2AndUnzip(float*  address, (Vector64<float> Value1,   Vector64<float> Value2));

            // ST1 (multiple structures) 3 register variant
            public static unsafe void StoreVector64x3AndUnzip(byte*   address, (Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3));
            public static unsafe void StoreVector64x3AndUnzip(sbyte*  address, (Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3));
            public static unsafe void StoreVector64x3AndUnzip(short*  address, (Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3));
            public static unsafe void StoreVector64x3AndUnzip(ushort*  address, (Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3));
            public static unsafe void StoreVector64x3AndUnzip(int*  address, (Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3));
            public static unsafe void StoreVector64x3AndUnzip(uint*  address, (Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3));
            public static unsafe void StoreVector64x3AndUnzip(float*  address, (Vector64<float> Value1,   Vector64<float> Value2,    Vector64<float> Value3));
            
            // ST1 (multiple structures) 4 register variant            
            public static unsafe void StoreVector64x4AndUnzip(byte*   address, (Vector64<byte> Value1,       Vector64<byte> Value2,    Vector64<byte> Value3,      Vector64<byte> Value4));
            public static unsafe void StoreVector64x4AndUnzip(sbyte*   address, (Vector64<sbyte> Value1,     Vector64<sbyte> Value2,   Vector64<sbyte> Value3,     Vector64<sbyte> Value4));
            public static unsafe void StoreVector64x4AndUnzip(short*   address, (Vector64<short> Value1,     Vector64<short> Value2,   Vector64<short> Value3,     Vector64<short> Value4));
            public static unsafe void StoreVector64x4AndUnzip(ushort*   address, (Vector64<ushort> Value1,   Vector64<ushort> Value2,  Vector64<ushort> Value3,    Vector64<ushort> Value4));
            public static unsafe void StoreVector64x4AndUnzip(int*   address, (Vector64<int> Value1,         Vector64<int> Value2,     Vector64<int> Value3,       Vector64<int> Value4));
            public static unsafe void StoreVector64x4AndUnzip(uint*   address, (Vector64<uint> Value1,       Vector64<uint> Value2,    Vector64<uint> Value3,      Vector64<uint> Value4));
            public static unsafe void StoreVector64x4AndUnzip(float*   address, (Vector64<float> Value1,     Vector64<float> Value2,   Vector64<float> Value3,     Vector64<float> Value4));

            // ST1 (single structure)
            // StoreSelectedScalar already present
            
            // ST2 (multiple structures)
            public static unsafe void StoreVector64x2(byte*   address, (Vector64<byte> Value1,    Vector64<byte> Value2));
            public static unsafe void StoreVector64x2(sbyte*  address, (Vector64<sbyte> Value1,   Vector64<sbyte> Value2));
            public static unsafe void StoreVector64x2(short*  address, (Vector64<short> Value1,   Vector64<short> Value2));
            public static unsafe void StoreVector64x2(ushort* address, (Vector64<ushort> Value1,  Vector64<ushort> Value2));
            public static unsafe void StoreVector64x2(int*    address, (Vector64<int> Value1,     Vector64<int> Value2));
            public static unsafe void StoreVector64x2(uint*   address, (Vector64<uint> Value1,    Vector64<uint> Value2));
            public static unsafe void StoreVector64x2(float*  address, (Vector64<float> Value1,   Vector64<float> Value2));

            // ST2 (single structure)
            public static unsafe void StoreSelectedScalar64x2(byte*   address, (Vector64<byte> Value1,   Vector64<byte> Value2), byte index);
            public static unsafe void StoreSelectedScalar64x2(sbyte*   address, (Vector64<sbyte> Value1,  Vector64<sbyte> Value2), byte index);
            public static unsafe void StoreSelectedScalar64x2(short*   address, (Vector64<short> Value1,  Vector64<short> Value2), byte index);
            public static unsafe void StoreSelectedScalar64x2(ushort*   address, (Vector64<ushort> Value1, Vector64<ushort> Value2), byte index);
            public static unsafe void StoreSelectedScalar64x2(int*   address, (Vector64<int> Value1,    Vector64<int> Value2), byte index);
            public static unsafe void StoreSelectedScalar64x2(uint*   address, (Vector64<uint> Value1,   Vector64<uint> Value2), byte index);
            public static unsafe void StoreSelectedScalar64x2(float*   address, (Vector64<float> Value1,  Vector64<float> Value2), byte index);

            // ST3 (multiple structures)
            public static unsafe void StoreVector64x3(byte*   address, (Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3));
            public static unsafe void StoreVector64x3(sbyte*  address, (Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3));
            public static unsafe void StoreVector64x3(short*  address, (Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3));
            public static unsafe void StoreVector64x3(ushort*  address, (Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3));
            public static unsafe void StoreVector64x3(int*  address, (Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3));
            public static unsafe void StoreVector64x3(uint*  address, (Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3));
            public static unsafe void StoreVector64x3(float*  address, (Vector64<float> Value1,   Vector64<float> Value2,     Vector64<float> Value3));

            // ST3 (single structure)
            public static unsafe void StoreSelectedScalar64x3(byte*   address, (Vector64<byte> Value1,    Vector64<byte> Value2,      Vector64<byte> Value3), byte index);
            public static unsafe void StoreSelectedScalar64x3(sbyte*  address, (Vector64<sbyte> Value1,   Vector64<sbyte> Value2,     Vector64<sbyte> Value3), byte index);
            public static unsafe void StoreSelectedScalar64x3(short*  address, (Vector64<short> Value1,   Vector64<short> Value2,     Vector64<short> Value3), byte index);
            public static unsafe void StoreSelectedScalar64x3(ushort*  address, (Vector64<ushort> Value1,  Vector64<ushort> Value2,    Vector64<ushort> Value3), byte index);
            public static unsafe void StoreSelectedScalar64x3(int*  address, (Vector64<int> Value1,     Vector64<int> Value2,       Vector64<int> Value3), byte index);
            public static unsafe void StoreSelectedScalar64x3(uint*  address, (Vector64<uint> Value1,    Vector64<uint> Value2,      Vector64<uint> Value3), byte index);
            public static unsafe void StoreSelectedScalar64x3(float*  address, (Vector64<float> Value1,   Vector64<float> Value2,     Vector64<float> Value3), byte index);

            // ST4 (multiple structures)
            public static unsafe void StoreVector64x4(byte*   address, (Vector64<byte> Value1,        Vector64<byte> Value2,      Vector64<byte> Value3,      Vector64<byte> Value4));
            public static unsafe void StoreVector64x4(sbyte*   address, (Vector64<sbyte> Value1,       Vector64<sbyte> Value2,     Vector64<sbyte> Value3,     Vector64<sbyte> Value4));
            public static unsafe void StoreVector64x4(short*   address, (Vector64<short> Value1,       Vector64<short> Value2,     Vector64<short> Value3,     Vector64<short> Value4));
            public static unsafe void StoreVector64x4(ushort*   address, (Vector64<ushort> Value1,      Vector64<ushort> Value2,    Vector64<ushort> Value3,    Vector64<ushort> Value4));
            public static unsafe void StoreVector64x4(int*   address, (Vector64<int> Value1,         Vector64<int> Value2,       Vector64<int> Value3,       Vector64<int> Value4));
            public static unsafe void StoreVector64x4(uint*   address, (Vector64<uint> Value1,        Vector64<uint> Value2,      Vector64<uint> Value3,      Vector64<uint> Value4));
            public static unsafe void StoreVector64x4(float*   address, (Vector64<float> Value1,       Vector64<float> Value2,     Vector64<float> Value3,     Vector64<float> Value4));

            // ST4 (single structure)
            public static unsafe void StoreSelectedScalar64x4(byte*   address, (Vector64<byte> Value1,        Vector64<byte> Value2,      Vector64<byte> Value3,      Vector64<byte> Value4), byte index);
            public static unsafe void StoreSelectedScalar64x4(sbyte*   address, (Vector64<sbyte> Value1,       Vector64<sbyte> Value2,     Vector64<sbyte> Value3,     Vector64<sbyte> Value4), byte index);
            public static unsafe void StoreSelectedScalar64x4(short*   address, (Vector64<short> Value1,       Vector64<short> Value2,     Vector64<short> Value3,     Vector64<short> Value4), byte index);
            public static unsafe void StoreSelectedScalar64x4(ushort*   address, (Vector64<ushort> Value1,      Vector64<ushort> Value2,    Vector64<ushort> Value3,    Vector64<ushort> Value4), byte index);
            public static unsafe void StoreSelectedScalar64x4(int*   address, (Vector64<int> Value1,         Vector64<int> Value2,       Vector64<int> Value3,       Vector64<int> Value4), byte index);
            public static unsafe void StoreSelectedScalar64x4(uint*   address, (Vector64<uint> Value1,        Vector64<uint> Value2,      Vector64<uint> Value3,      Vector64<uint> Value4), byte index);
            public static unsafe void StoreSelectedScalar64x4(float*   address, (Vector64<float> Value1,       Vector64<float> Value2,     Vector64<float> Value3,     Vector64<float> Value4), byte index);
        }

        public partial class Arm64
        {
            // ST1 (multiple structures)
            // StoreVector already present

            // ST1 (multiple structures) 2 register variant
            public static unsafe void StoreVector128x2AndUnzip(byte*   address, (Vector128<byte> Value1,    Vector128<byte> Value2));
            public static unsafe void StoreVector128x2AndUnzip(sbyte*  address, (Vector128<sbyte> Value1,   Vector128<sbyte> Value2));
            public static unsafe void StoreVector128x2AndUnzip(short*  address, (Vector128<short> Value1,   Vector128<short> Value2));
            public static unsafe void StoreVector128x2AndUnzip(ushort* address, (Vector128<ushort> Value1,  Vector128<ushort> Value2));
            public static unsafe void StoreVector128x2AndUnzip(int*    address, (Vector128<int> Value1,     Vector128<int> Value2));
            public static unsafe void StoreVector128x2AndUnzip(uint*   address, (Vector128<uint> Value1,    Vector128<uint> Value2));
            public static unsafe void StoreVector128x2AndUnzip(long*   address, (Vector128<long> Value1,    Vector128<long> Value2));
            public static unsafe void StoreVector128x2AndUnzip(ulong*   address, (Vector128<ulong> Value1,  Vector128<ulong> Value2));
            public static unsafe void StoreVector128x2AndUnzip(float*  address, (Vector128<float> Value1,   Vector128<float> Value2));
            public static unsafe void StoreVector128x2AndUnzip(double*  address, (Vector128<double> Value1, Vector128<double> Value2));

            // ST1 (multiple structures) 3 register variant
            public static unsafe void StoreVector128x3AndUnzip(byte*   address, (Vector128<byte> Value1,    Vector128<byte> Value2,      Vector128<byte> Value3));
            public static unsafe void StoreVector128x3AndUnzip(sbyte*  address, (Vector128<sbyte> Value1,   Vector128<sbyte> Value2,     Vector128<sbyte> Value3));
            public static unsafe void StoreVector128x3AndUnzip(short*  address, (Vector128<short> Value1,   Vector128<short> Value2,     Vector128<short> Value3));
            public static unsafe void StoreVector128x3AndUnzip(ushort*  address, (Vector128<ushort> Value1, Vector128<ushort> Value2,    Vector128<ushort> Value3));
            public static unsafe void StoreVector128x3AndUnzip(int*  address, (Vector128<int> Value1,       Vector128<int> Value2,       Vector128<int> Value3));
            public static unsafe void StoreVector128x3AndUnzip(uint*  address, (Vector128<uint> Value1,     Vector128<uint> Value2,      Vector128<uint> Value3));
            public static unsafe void StoreVector128x3AndUnzip(long*  address, (Vector128<long> Value1,     Vector128<long> Value2,      Vector128<long> Value3));
            public static unsafe void StoreVector128x3AndUnzip(ulong*  address, (Vector128<ulong> Value1,   Vector128<ulong> Value2,     Vector128<ulong> Value3));
            public static unsafe void StoreVector128x3AndUnzip(float*  address, (Vector128<float> Value1,   Vector128<float> Value2,     Vector128<float> Value3));
            public static unsafe void StoreVector128x3AndUnzip(double*  address, (Vector128<double> Value1, Vector128<double> Value2,    Vector128<double> Value3));
            
            // ST1 (multiple structures) 4 register variant            
            public static unsafe void StoreVector128x4AndUnzip(byte*   address, (Vector128<byte> Value1,       Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4));
            public static unsafe void StoreVector128x4AndUnzip(sbyte*   address, (Vector128<sbyte> Value1,     Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4));
            public static unsafe void StoreVector128x4AndUnzip(short*   address, (Vector128<short> Value1,     Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4));
            public static unsafe void StoreVector128x4AndUnzip(ushort*   address, (Vector128<ushort> Value1,   Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4));
            public static unsafe void StoreVector128x4AndUnzip(int*   address, (Vector128<int> Value1,         Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4));
            public static unsafe void StoreVector128x4AndUnzip(uint*   address, (Vector128<uint> Value1,       Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4));
            public static unsafe void StoreVector128x4AndUnzip(long*   address, (Vector128<long> Value1,       Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4));
            public static unsafe void StoreVector128x4AndUnzip(ulong*   address, (Vector128<ulong> Value1,     Vector128<ulong> Value2,     Vector128<ulong> Value3,     Vector128<ulong> Value4));
            public static unsafe void StoreVector128x4AndUnzip(float*   address, (Vector128<float> Value1,     Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4));
            public static unsafe void StoreVector128x4AndUnzip(double*   address, (Vector128<double> Value1,   Vector128<double> Value2,    Vector128<double> Value3,    Vector128<double> Value4));

            // ST1 (single structure)
            // StoreSelectedScalar already present
            
            // ST2 (multiple structures)
            public static unsafe void StoreVector128x2(byte*   address, (Vector128<byte> Value1,    Vector128<byte> Value2));
            public static unsafe void StoreVector128x2(sbyte*  address, (Vector128<sbyte> Value1,   Vector128<sbyte> Value2));
            public static unsafe void StoreVector128x2(short*  address, (Vector128<short> Value1,   Vector128<short> Value2));
            public static unsafe void StoreVector128x2(ushort* address, (Vector128<ushort> Value1,  Vector128<ushort> Value2));
            public static unsafe void StoreVector128x2(int*    address, (Vector128<int> Value1,     Vector128<int> Value2));
            public static unsafe void StoreVector128x2(uint*   address, (Vector128<uint> Value1,    Vector128<uint> Value2));
            public static unsafe void StoreVector128x2(long*   address, (Vector128<long> Value1,    Vector128<long> Value2));
            public static unsafe void StoreVector128x2(ulong*   address, (Vector128<ulong> Value1,  Vector128<ulong> Value2));
            public static unsafe void StoreVector128x2(float*  address, (Vector128<float> Value1,   Vector128<float> Value2));
            public static unsafe void StoreVector128x2(double*  address, (Vector128<double> Value1,  Vector128<double> Value2));

            // ST2 (single structure)
            public static unsafe void StoreSelectedScalar128x2(byte*   address, (Vector128<byte> Value1,   Vector128<byte> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(sbyte*   address, (Vector128<sbyte> Value1,  Vector128<sbyte> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(short*   address, (Vector128<short> Value1,  Vector128<short> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(ushort*   address, (Vector128<ushort> Value1, Vector128<ushort> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(int*   address, (Vector128<int> Value1,    Vector128<int> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(uint*   address, (Vector128<uint> Value1,   Vector128<uint> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(long*   address, (Vector128<long> Value1,   Vector128<long> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(ulong*   address, (Vector128<ulong> Value1,   Vector128<ulong> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(float*   address, (Vector128<float> Value1,  Vector128<float> Value2), byte index);
            public static unsafe void StoreSelectedScalar128x2(double*   address, (Vector128<double> Value1,  Vector128<double> Value2), byte index);

            // ST3 (multiple structures)
            public static unsafe void StoreVector128x3(byte*   address, (Vector128<byte> Value1,     Vector128<byte> Value2,    Vector128<byte> Value3));
            public static unsafe void StoreVector128x3(sbyte*  address, (Vector128<sbyte> Value1,    Vector128<sbyte> Value2,   Vector128<sbyte> Value3));
            public static unsafe void StoreVector128x3(short*  address, (Vector128<short> Value1,    Vector128<short> Value2,   Vector128<short> Value3));
            public static unsafe void StoreVector128x3(ushort*  address, (Vector128<ushort> Value1,  Vector128<ushort> Value2,  Vector128<ushort> Value3));
            public static unsafe void StoreVector128x3(int*  address, (Vector128<int> Value1,        Vector128<int> Value2,     Vector128<int> Value3));
            public static unsafe void StoreVector128x3(uint*  address, (Vector128<uint> Value1,      Vector128<uint> Value2,    Vector128<uint> Value3));
            public static unsafe void StoreVector128x3(long*  address, (Vector128<long> Value1,      Vector128<long> Value2,    Vector128<long> Value3));
            public static unsafe void StoreVector128x3(ulong*  address, (Vector128<ulong> Value1,    Vector128<ulong> Value2,   Vector128<ulong> Value3));
            public static unsafe void StoreVector128x3(float*  address, (Vector128<float> Value1,    Vector128<float> Value2,   Vector128<float> Value3));
            public static unsafe void StoreVector128x3(double*  address, (Vector128<double> Value1,  Vector128<double> Value2,  Vector128<double> Value3));

            // ST3 (single structure)
            public static unsafe void StoreSelectedScalar128x3(byte*   address, (Vector128<byte> Value1,    Vector128<byte> Value2,      Vector128<byte> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(sbyte*  address, (Vector128<sbyte> Value1,   Vector128<sbyte> Value2,     Vector128<sbyte> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(short*  address, (Vector128<short> Value1,   Vector128<short> Value2,     Vector128<short> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(ushort*  address, (Vector128<ushort> Value1,  Vector128<ushort> Value2,    Vector128<ushort> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(int*  address, (Vector128<int> Value1,     Vector128<int> Value2,       Vector128<int> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(uint*  address, (Vector128<uint> Value1,    Vector128<uint> Value2,      Vector128<uint> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(long*  address, (Vector128<long> Value1,    Vector128<long> Value2,      Vector128<long> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(ulong*  address, (Vector128<ulong> Value1,    Vector128<ulong> Value2,      Vector128<ulong> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(float*  address, (Vector128<float> Value1,   Vector128<float> Value2,     Vector128<float> Value3), byte index);
            public static unsafe void StoreSelectedScalar128x3(double*  address, (Vector128<double> Value1,   Vector128<double> Value2,     Vector128<double> Value3), byte index);

            // ST4 (multiple structures)
            public static unsafe void StoreVector128x4(byte*   address, (Vector128<byte> Value1,       Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4));
            public static unsafe void StoreVector128x4(sbyte*   address, (Vector128<sbyte> Value1,     Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4));
            public static unsafe void StoreVector128x4(short*   address, (Vector128<short> Value1,     Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4));
            public static unsafe void StoreVector128x4(ushort*   address, (Vector128<ushort> Value1,   Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4));
            public static unsafe void StoreVector128x4(int*   address, (Vector128<int> Value1,         Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4));
            public static unsafe void StoreVector128x4(uint*   address, (Vector128<uint> Value1,       Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4));
            public static unsafe void StoreVector128x4(long*   address, (Vector128<long> Value1,       Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4));
            public static unsafe void StoreVector128x4(ulong*   address, (Vector128<ulong> Value1,     Vector128<ulong> Value2,     Vector128<ulong> Value3,     Vector128<ulong> Value4));
            public static unsafe void StoreVector128x4(float*   address, (Vector128<float> Value1,     Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4));
            public static unsafe void StoreVector128x4(double*   address, (Vector128<double> Value1,   Vector128<double> Value2,    Vector128<double> Value3,    Vector128<double> Value4));

            // ST4 (single structure)
            public static unsafe void StoreSelectedScalar128x4(byte*   address, (Vector128<byte> Value1,        Vector128<byte> Value2,      Vector128<byte> Value3,      Vector128<byte> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(sbyte*   address, (Vector128<sbyte> Value1,       Vector128<sbyte> Value2,     Vector128<sbyte> Value3,     Vector128<sbyte> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(short*   address, (Vector128<short> Value1,       Vector128<short> Value2,     Vector128<short> Value3,     Vector128<short> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(ushort*   address, (Vector128<ushort> Value1,      Vector128<ushort> Value2,    Vector128<ushort> Value3,    Vector128<ushort> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(int*   address, (Vector128<int> Value1,         Vector128<int> Value2,       Vector128<int> Value3,       Vector128<int> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(uint*   address, (Vector128<uint> Value1,        Vector128<uint> Value2,      Vector128<uint> Value3,      Vector128<uint> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(long*   address, (Vector128<long> Value1,        Vector128<long> Value2,      Vector128<long> Value3,      Vector128<long> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(ulong*   address, (Vector128<ulong> Value1,        Vector128<ulong> Value2,      Vector128<ulong> Value3,      Vector128<ulong> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(float*   address, (Vector128<float> Value1,       Vector128<float> Value2,     Vector128<float> Value3,     Vector128<float> Value4), byte index);
            public static unsafe void StoreSelectedScalar128x4(double*   address, (Vector128<double> Value1,       Vector128<double> Value2,     Vector128<double> Value3,     Vector128<double> Value4), byte index);      
        }
    }
}

@kunalspathak
Member Author

cc: @a74nh

@a74nh
Contributor

a74nh commented May 24, 2023

Just looking at the LD2 variants for the moment:

            // LD2 (multiple structures)
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)    LoadVector128x2(byte*   address);
  • It's not immediately obvious to me that this is the LD2 that will de-interleave. LoadAndDeinterleaveVector128x2() or LoadVector128x2AndDeinterleave is more accurate but quite verbose.
  • Is there a function already for LDP? Because LDP is essentially LD2 without the deinterleave. Want to make sure the two can't be confused.
            // LD2 (single structure)
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)     LoadVectorAndInsertScalar128x2((Vector128<byte> Value1,   Vector128<byte> Value2), byte index, byte*   address);
  • Just to confirm: after the call, the contents of Value1 and Value2 that were passed into the function are unchanged. So if those values are going to be used again in C#, the compiler first needs to copy Value1 and Value2 into new registers.
  • Maybe a quick description in a comment would be useful: Loads a structure of type { byte X, byte Y }, then sets value1[index] = X and value2[index] = Y.
  • The name is a little confusing, Loadx2AndInsertIntoVector128() is more accurate.
            // LD2R
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)     LoadAndReplicateToVector128x2(byte*   address);
  • Description, something like: Loads a structure of type { byte X, byte Y }, then sets value1 = {X,X,X...} and value2 = {Y,Y,Y...}.
  • Name would be more accurate as: Loadx2AndReplicateIntoVector128
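A scalar sketch of the behaviours described in the bullets above may make the differences easier to see. It is only an emulation on plain arrays, not the proposed intrinsics: the class and method names are hypothetical, `ReadOnlySpan<byte>` stands in for the `byte*` address, and `lanes` stands in for the vector length.

```csharp
using System;

// Hypothetical scalar emulation; none of these names are part of the proposal.
public static class StructuredLoadSketch
{
    // LDP-like: two consecutive vectors are read, with no de-interleave.
    public static (byte[] Value1, byte[] Value2) LoadPair(ReadOnlySpan<byte> memory, int lanes)
    {
        return (memory.Slice(0, lanes).ToArray(), memory.Slice(lanes, lanes).ToArray());
    }

    // LD2 (multiple structures)-like: memory holds { X, Y } pairs;
    // every X goes to Value1 and every Y goes to Value2.
    public static (byte[] Value1, byte[] Value2) LoadAndDeinterleave(ReadOnlySpan<byte> memory, int lanes)
    {
        var value1 = new byte[lanes];
        var value2 = new byte[lanes];
        for (int i = 0; i < lanes; i++)
        {
            value1[i] = memory[2 * i];
            value2[i] = memory[2 * i + 1];
        }
        return (value1, value2);
    }

    // LD2 (single structure)-like: one { X, Y } pair is read and written
    // into lane `index` of copies of the two input vectors.
    public static (byte[] Value1, byte[] Value2) LoadAndInsert(
        (byte[] Value1, byte[] Value2) values, int index, ReadOnlySpan<byte> memory)
    {
        var value1 = (byte[])values.Value1.Clone();
        var value2 = (byte[])values.Value2.Clone();
        value1[index] = memory[0];
        value2[index] = memory[1];
        return (value1, value2);
    }

    // LD2R-like: one { X, Y } pair is read; X fills Value1 and Y fills Value2.
    public static (byte[] Value1, byte[] Value2) LoadAndReplicate(ReadOnlySpan<byte> memory, int lanes)
    {
        var value1 = new byte[lanes];
        var value2 = new byte[lanes];
        Array.Fill(value1, memory[0]);
        Array.Fill(value2, memory[1]);
        return (value1, value2);
    }

    // Small demonstration; call it from any entry point.
    public static void Demo()
    {
        byte[] memory = { 0, 1, 2, 3, 4, 5, 6, 7 };
        var pair = LoadPair(memory, 4);
        var unzipped = LoadAndDeinterleave(memory, 4);
        Console.WriteLine(string.Join(",", pair.Value1));     // 0,1,2,3
        Console.WriteLine(string.Join(",", unzipped.Value1)); // 0,2,4,6
        Console.WriteLine(string.Join(",", unzipped.Value2)); // 1,3,5,7
    }
}
```

With the same eight bytes of input, the pair load and the de-interleaving load return different vectors, which is why the two names should not be confusable.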

Am I right in thinking it's intentional that there's no direct support for the post index versions?

@kunalspathak
Member Author

kunalspathak commented May 24, 2023

Thanks @a74nh for your feedback.

Just looking at the LD2 variants for the moment:

            // LD2 (multiple structures)
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)    LoadVector128x2(byte*   address);
  • It's not immediately obvious to me that this is the LD2 that will de-interleave. LoadAndDeinterleaveVector128x2() or LoadVector128x2AndDeinterleave is more accurate but quite verbose.

I agree it will be too verbose, but looking at other variants, we can have LoadAndDeinterleaveVector128x2() and likewise for 3/4 registers as well. I will leave it to the API review committee to decide.

  • Is there a function already for LDP? Because LDP is essentially LD2 without the deinterleave. Want to make sure the two can't be confused.

There is no API for LDP.
Edit: Ah, didn't realize we have LoadPairVector. Thanks Tanner.

            // LD2 (single structure)
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)     LoadVectorAndInsertScalar128x2((Vector128<byte> Value1,   Vector128<byte> Value2), byte index, byte*   address);
  • Just to confirm: after the call, the contents of Value1 and Value2 that were passed into the function are unchanged. So if those values are going to be used again in C#, the compiler first needs to copy Value1 and Value2 into new registers.

Just like the instruction semantics, it should overwrite the destination register, and hence the values passed as parameters should change. We have similar APIs that overwrite their parameters, like InsertSelectedScalar. @tannergooding, is that the correct understanding? However, in that case, VectorTableLookupExtension() doesn't follow the instruction semantics: defaultValues are supposed to get used and updated as a result, but we don't change them.

  • Maybe a quick description in a comment would be useful: Loads a structure of type { byte X, byte Y }, then sets value1[index] = X and value2[index] = Y.
  • The name is a little confusing, Loadx2AndInsertIntoVector128() is more accurate.
            // LD2R
            public static unsafe (Vector128<byte> Value1,    Vector128<byte> Value2)     LoadAndReplicateToVector128x2(byte*   address);
  • Description, something like: Loads a structure of type { byte X, byte Y }, then sets value1 = {X,X,X...} and value2 = {Y,Y,Y...}.
  • Name would be more accurate as: Loadx2AndReplicateIntoVector128

Regarding Loadx2AndInsertIntoVector128 and Loadx2AndReplicateIntoVector128 naming, I was trying to have x2, x3 and x4 towards the end of the method name for better visibility and have that pattern consistent in most of the other APIs.

Am I right in thinking it's intentional that there's no direct support for the post index versions?

That's right. I did not include the post index versions in this.

@tannergooding
Member

It's not immediately obvious to me that this is the LD2 that will de-interleave. LoadAndDeinterleaveVector128x2() or LoadVector128x2AndDeinterleave is more accurate but quite verbose.

The name given above, which matches the general Arm64 terminology for other intrinsics, is Unzip.

Is there a function already for LDP? Because LDP is essentially LD2 without the deinterleave. Want to make sure the two can't be confused.

Yes, it's LoadPairVector128.

Just to confirm: after the call, the contents of Value1 and Value2 that were passed into the function are unchanged. So if those values are going to be used again in C#, the compiler first needs to copy Value1 and Value2 into new registers.

Yes, that's how RMW instructions work in general.
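Seen from the C# side, this can be sketched with a software stand-in, since `Vector128<T>` and tuples of it are value types. The method below is hypothetical and is not the proposed intrinsic: it takes a `ReadOnlySpan<byte>` instead of `byte*` so that it compiles without unsafe code, and it uses the portable `WithElement` helper rather than an LD2 instruction.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical software stand-in; not the proposed AdvSimd API.
public static class InsertScalarSketch
{
    // Reads one { X, Y } pair and returns new vectors with lane `index` replaced.
    // WithElement returns a modified copy, so the caller's tuple is left untouched.
    public static (Vector128<byte> Value1, Vector128<byte> Value2) LoadAndInsertScalarEmulated(
        (Vector128<byte> Value1, Vector128<byte> Value2) values, byte index, ReadOnlySpan<byte> memory)
    {
        return (values.Value1.WithElement(index, memory[0]),
                values.Value2.WithElement(index, memory[1]));
    }

    // Small demonstration; call it from any entry point.
    public static void Demo()
    {
        var input = (Value1: Vector128.Create((byte)1), Value2: Vector128.Create((byte)2));
        byte[] memory = { 10, 20 };

        var result = LoadAndInsertScalarEmulated(input, 3, memory);

        Console.WriteLine(input.Value1.GetElement(3));  // 1  (input is unchanged)
        Console.WriteLine(result.Value1.GetElement(3)); // 10
        Console.WriteLine(result.Value2.GetElement(3)); // 20
    }
}
```

If the caller keeps using `input` after the call, both the old and the new values are live at once, which is the situation where a real read-modify-write instruction would need the inputs copied into fresh registers first.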

The name is a little confusing, Loadx2AndInsertIntoVector128() is more accurate.

We shouldn't need to include the vector size here. The name is only required when we have to differentiate on return size. Since this takes a differentiating parameter (that is a tuple of Vector128) we can just keep it LoadAndInsertScalar like the current name

Am I right in thinking it's intentional that there's no direct support for the post index versions?

Yes, post index is a JIT optimization based on the addressing mode of the pointer/parameters/etc

@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation untriaged New issue has not been triaged by the area owner labels Jun 5, 2023
@tannergooding tannergooding added this to the 8.0.0 milestone Jun 5, 2023
@kunalspathak kunalspathak added the blocking Marks issues that we want to fast track in order to unblock other important work label Jun 5, 2023
@terrajobst
Member

terrajobst commented Jun 6, 2023

Video

  • Looks good as proposed; Store...AndUnzip was fixed to be Store...AndZip
namespace System.Runtime.Intrinsics.Arm;

public abstract partial class AdvSimd
{
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2) LoadVector64x2AndUnzip(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) LoadVector64x2AndUnzip(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2) LoadVector64x2AndUnzip(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2) LoadVector64x2AndUnzip(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2) LoadVector64x2AndUnzip(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2) LoadVector64x2AndUnzip(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2) LoadVector64x2AndUnzip(float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) LoadVector64x3AndUnzip(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) LoadVector64x3AndUnzip(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) LoadVector64x3AndUnzip(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) LoadVector64x3AndUnzip(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) LoadVector64x3AndUnzip(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) LoadVector64x3AndUnzip(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) LoadVector64x3AndUnzip(float*  address);
    
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) LoadVector64x4AndUnzip(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) LoadVector64x4AndUnzip(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) LoadVector64x4AndUnzip(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) LoadVector64x4AndUnzip(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) LoadVector64x4AndUnzip(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) LoadVector64x4AndUnzip(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) LoadVector64x4AndUnzip(float*  address);
    
    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2) LoadVector64x2(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) LoadVector64x2(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2) LoadVector64x2(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2) LoadVector64x2(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2) LoadVector64x2(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2) LoadVector64x2(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2) LoadVector64x2(float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2) LoadVectorAndInsertScalar64x2((Vector64<byte>   Value1, Vector64<byte>   Value2) value, byte index, byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) LoadVectorAndInsertScalar64x2((Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) value, byte index, sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2) LoadVectorAndInsertScalar64x2((Vector64<short>  Value1, Vector64<short>  Value2) value, byte index, short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2) LoadVectorAndInsertScalar64x2((Vector64<ushort> Value1, Vector64<ushort> Value2) value, byte index, ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2) LoadVectorAndInsertScalar64x2((Vector64<int>    Value1, Vector64<int>    Value2) value, byte index, int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2) LoadVectorAndInsertScalar64x2((Vector64<uint>   Value1, Vector64<uint>   Value2) value, byte index, uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2) LoadVectorAndInsertScalar64x2((Vector64<float>  Value1, Vector64<float>  Value2) value, byte index, float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2) LoadAndReplicateToVector64x2(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) LoadAndReplicateToVector64x2(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2) LoadAndReplicateToVector64x2(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2) LoadAndReplicateToVector64x2(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2) LoadAndReplicateToVector64x2(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2) LoadAndReplicateToVector64x2(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2) LoadAndReplicateToVector64x2(float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) LoadVector64x3(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) LoadVector64x3(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) LoadVector64x3(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) LoadVector64x3(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) LoadVector64x3(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) LoadVector64x3(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) LoadVector64x3(float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) LoadVectorAndInsertScalar64x3((Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) value, byte index, byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) LoadVectorAndInsertScalar64x3((Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) value, byte index, sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) LoadVectorAndInsertScalar64x3((Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) value, byte index, short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) LoadVectorAndInsertScalar64x3((Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) value, byte index, ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) LoadVectorAndInsertScalar64x3((Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) value, byte index, int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) LoadVectorAndInsertScalar64x3((Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) value, byte index, uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) LoadVectorAndInsertScalar64x3((Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) value, byte index, float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) LoadAndReplicateToVector64x3(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) LoadAndReplicateToVector64x3(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) LoadAndReplicateToVector64x3(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) LoadAndReplicateToVector64x3(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) LoadAndReplicateToVector64x3(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) LoadAndReplicateToVector64x3(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) LoadAndReplicateToVector64x3(float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) LoadVector64x4(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) LoadVector64x4(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) LoadVector64x4(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) LoadVector64x4(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) LoadVector64x4(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) LoadVector64x4(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) LoadVector64x4(float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) LoadVectorAndInsertScalar64x4((Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) value, byte index, byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) LoadVectorAndInsertScalar64x4((Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) value, byte index, sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) LoadVectorAndInsertScalar64x4((Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) value, byte index, short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) LoadVectorAndInsertScalar64x4((Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) value, byte index, ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) LoadVectorAndInsertScalar64x4((Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) value, byte index, int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) LoadVectorAndInsertScalar64x4((Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) value, byte index, uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) LoadVectorAndInsertScalar64x4((Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) value, byte index, float*  address);

    public static unsafe (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) LoadAndReplicateToVector64x4(byte*   address);
    public static unsafe (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) LoadAndReplicateToVector64x4(sbyte*  address);
    public static unsafe (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) LoadAndReplicateToVector64x4(short*  address);
    public static unsafe (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) LoadAndReplicateToVector64x4(ushort* address);
    public static unsafe (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) LoadAndReplicateToVector64x4(int*    address);
    public static unsafe (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) LoadAndReplicateToVector64x4(uint*   address);
    public static unsafe (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) LoadAndReplicateToVector64x4(float*  address);

    public static unsafe void StoreVector64x2AndZip(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2) value);
    public static unsafe void StoreVector64x2AndZip(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) value);
    public static unsafe void StoreVector64x2AndZip(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2) value);
    public static unsafe void StoreVector64x2AndZip(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2) value);
    public static unsafe void StoreVector64x2AndZip(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2) value);
    public static unsafe void StoreVector64x2AndZip(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2) value);
    public static unsafe void StoreVector64x2AndZip(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2) value);

    public static unsafe void StoreVector64x3AndZip(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) value);
    public static unsafe void StoreVector64x3AndZip(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) value);
    public static unsafe void StoreVector64x3AndZip(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) value);
    public static unsafe void StoreVector64x3AndZip(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) value);
    public static unsafe void StoreVector64x3AndZip(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) value);
    public static unsafe void StoreVector64x3AndZip(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) value);
    public static unsafe void StoreVector64x3AndZip(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) value);
    
    public static unsafe void StoreVector64x4AndZip(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) value);
    public static unsafe void StoreVector64x4AndZip(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) value);
    public static unsafe void StoreVector64x4AndZip(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) value);
    public static unsafe void StoreVector64x4AndZip(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) value);
    public static unsafe void StoreVector64x4AndZip(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) value);
    public static unsafe void StoreVector64x4AndZip(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) value);
    public static unsafe void StoreVector64x4AndZip(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) value);
    
    public static unsafe void StoreVector64x2(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2) value);
    public static unsafe void StoreVector64x2(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) value);
    public static unsafe void StoreVector64x2(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2) value);
    public static unsafe void StoreVector64x2(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2) value);
    public static unsafe void StoreVector64x2(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2) value);
    public static unsafe void StoreVector64x2(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2) value);
    public static unsafe void StoreVector64x2(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2) value);

    public static unsafe void StoreSelectedScalar64x2(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2) value, byte index);
    public static unsafe void StoreSelectedScalar64x2(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2) value, byte index);

    public static unsafe void StoreVector64x3(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) value);
    public static unsafe void StoreVector64x3(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) value);
    public static unsafe void StoreVector64x3(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) value);
    public static unsafe void StoreVector64x3(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) value);
    public static unsafe void StoreVector64x3(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) value);
    public static unsafe void StoreVector64x3(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) value);
    public static unsafe void StoreVector64x3(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) value);

    public static unsafe void StoreSelectedScalar64x3(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3) value, byte index);
    public static unsafe void StoreSelectedScalar64x3(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3) value, byte index);

    public static unsafe void StoreVector64x4(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) value);
    public static unsafe void StoreVector64x4(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) value);
    public static unsafe void StoreVector64x4(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) value);
    public static unsafe void StoreVector64x4(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) value);
    public static unsafe void StoreVector64x4(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) value);
    public static unsafe void StoreVector64x4(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) value);
    public static unsafe void StoreVector64x4(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) value);

    public static unsafe void StoreSelectedScalar64x4(byte*   address, (Vector64<byte>   Value1, Vector64<byte>   Value2, Vector64<byte>   Value3, Vector64<byte>   Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(sbyte*  address, (Vector64<sbyte>  Value1, Vector64<sbyte>  Value2, Vector64<sbyte>  Value3, Vector64<sbyte>  Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(short*  address, (Vector64<short>  Value1, Vector64<short>  Value2, Vector64<short>  Value3, Vector64<short>  Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(ushort* address, (Vector64<ushort> Value1, Vector64<ushort> Value2, Vector64<ushort> Value3, Vector64<ushort> Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(int*    address, (Vector64<int>    Value1, Vector64<int>    Value2, Vector64<int>    Value3, Vector64<int>    Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(uint*   address, (Vector64<uint>   Value1, Vector64<uint>   Value2, Vector64<uint>   Value3, Vector64<uint>   Value4) value, byte index);
    public static unsafe void StoreSelectedScalar64x4(float*  address, (Vector64<float>  Value1, Vector64<float>  Value2, Vector64<float>  Value3, Vector64<float>  Value4) value, byte index);

    public partial class Arm64
    {
        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2) LoadVector128x2AndUnzip(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) LoadVector128x2AndUnzip(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2) LoadVector128x2AndUnzip(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadVector128x2AndUnzip(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2) LoadVector128x2AndUnzip(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2) LoadVector128x2AndUnzip(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2) LoadVector128x2AndUnzip(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2) LoadVector128x2AndUnzip(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2) LoadVector128x2AndUnzip(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2) LoadVector128x2AndUnzip(double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) LoadVector128x3AndUnzip(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) LoadVector128x3AndUnzip(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) LoadVector128x3AndUnzip(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) LoadVector128x3AndUnzip(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) LoadVector128x3AndUnzip(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) LoadVector128x3AndUnzip(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) LoadVector128x3AndUnzip(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) LoadVector128x3AndUnzip(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) LoadVector128x3AndUnzip(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) LoadVector128x3AndUnzip(double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) LoadVector128x4AndUnzip(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) LoadVector128x4AndUnzip(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) LoadVector128x4AndUnzip(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) LoadVector128x4AndUnzip(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) LoadVector128x4AndUnzip(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) LoadVector128x4AndUnzip(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) LoadVector128x4AndUnzip(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) LoadVector128x4AndUnzip(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) LoadVector128x4AndUnzip(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) LoadVector128x4AndUnzip(double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2) LoadVector128x2(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) LoadVector128x2(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2) LoadVector128x2(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadVector128x2(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2) LoadVector128x2(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2) LoadVector128x2(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2) LoadVector128x2(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2) LoadVector128x2(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2) LoadVector128x2(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2) LoadVector128x2(double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2) LoadVectorAndInsertScalar128x2((Vector128<byte>   Value1, Vector128<byte>   Value2) value, byte index, byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) LoadVectorAndInsertScalar128x2((Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) value, byte index, sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2) LoadVectorAndInsertScalar128x2((Vector128<short>  Value1, Vector128<short>  Value2) value, byte index, short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadVectorAndInsertScalar128x2((Vector128<ushort> Value1, Vector128<ushort> Value2) value, byte index, ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2) LoadVectorAndInsertScalar128x2((Vector128<int>    Value1, Vector128<int>    Value2) value, byte index, int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2) LoadVectorAndInsertScalar128x2((Vector128<uint>   Value1, Vector128<uint>   Value2) value, byte index, uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2) LoadVectorAndInsertScalar128x2((Vector128<long>   Value1, Vector128<long>   Value2) value, byte index, long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2) LoadVectorAndInsertScalar128x2((Vector128<ulong>  Value1, Vector128<ulong>  Value2) value, byte index, ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2) LoadVectorAndInsertScalar128x2((Vector128<float>  Value1, Vector128<float>  Value2) value, byte index, float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2) LoadVectorAndInsertScalar128x2((Vector128<double> Value1, Vector128<double> Value2) value, byte index, double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2) LoadAndReplicateToVector128x2(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) LoadAndReplicateToVector128x2(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2) LoadAndReplicateToVector128x2(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadAndReplicateToVector128x2(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2) LoadAndReplicateToVector128x2(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2) LoadAndReplicateToVector128x2(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2) LoadAndReplicateToVector128x2(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2) LoadAndReplicateToVector128x2(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2) LoadAndReplicateToVector128x2(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2) LoadAndReplicateToVector128x2(double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) LoadVector128x3(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) LoadVector128x3(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) LoadVector128x3(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) LoadVector128x3(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) LoadVector128x3(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) LoadVector128x3(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) LoadVector128x3(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) LoadVector128x3(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) LoadVector128x3(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) LoadVector128x3(double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) LoadVectorAndInsertScalar128x3((Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) value, byte index, byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) LoadVectorAndInsertScalar128x3((Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) value, byte index, sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) LoadVectorAndInsertScalar128x3((Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) value, byte index, short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) LoadVectorAndInsertScalar128x3((Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) value, byte index, ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) LoadVectorAndInsertScalar128x3((Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) value, byte index, int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) LoadVectorAndInsertScalar128x3((Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) value, byte index, uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) LoadVectorAndInsertScalar128x3((Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) value, byte index, long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) LoadVectorAndInsertScalar128x3((Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) value, byte index, ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) LoadVectorAndInsertScalar128x3((Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) value, byte index, float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) LoadVectorAndInsertScalar128x3((Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) value, byte index, double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) LoadAndReplicateToVector128x3(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) LoadAndReplicateToVector128x3(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) LoadAndReplicateToVector128x3(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) LoadAndReplicateToVector128x3(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) LoadAndReplicateToVector128x3(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) LoadAndReplicateToVector128x3(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) LoadAndReplicateToVector128x3(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) LoadAndReplicateToVector128x3(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) LoadAndReplicateToVector128x3(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) LoadAndReplicateToVector128x3(double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) LoadVector128x4(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) LoadVector128x4(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) LoadVector128x4(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) LoadVector128x4(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) LoadVector128x4(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) LoadVector128x4(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) LoadVector128x4(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) LoadVector128x4(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) LoadVector128x4(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) LoadVector128x4(double* address);

        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) LoadVectorAndInsertScalar128x4((Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) value, byte index, byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) LoadVectorAndInsertScalar128x4((Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) value, byte index, sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) LoadVectorAndInsertScalar128x4((Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) value, byte index, short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) LoadVectorAndInsertScalar128x4((Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) value, byte index, ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) LoadVectorAndInsertScalar128x4((Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) value, byte index, int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) LoadVectorAndInsertScalar128x4((Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) value, byte index, uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) LoadVectorAndInsertScalar128x4((Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) value, byte index, long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) LoadVectorAndInsertScalar128x4((Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) value, byte index, ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) LoadVectorAndInsertScalar128x4((Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) value, byte index, float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) LoadVectorAndInsertScalar128x4((Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) value, byte index, double* address);


        public static unsafe (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) LoadAndReplicateToVector128x4(byte*   address);
        public static unsafe (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) LoadAndReplicateToVector128x4(sbyte*  address);
        public static unsafe (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) LoadAndReplicateToVector128x4(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) LoadAndReplicateToVector128x4(ushort* address);
        public static unsafe (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) LoadAndReplicateToVector128x4(int*    address);
        public static unsafe (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) LoadAndReplicateToVector128x4(uint*   address);
        public static unsafe (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) LoadAndReplicateToVector128x4(long*   address);
        public static unsafe (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) LoadAndReplicateToVector128x4(ulong*  address);
        public static unsafe (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) LoadAndReplicateToVector128x4(float*  address);
        public static unsafe (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) LoadAndReplicateToVector128x4(double* address);

        public static unsafe void StoreVector128x2AndZip(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2) value);
        public static unsafe void StoreVector128x2AndZip(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) value);
        public static unsafe void StoreVector128x2AndZip(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2) value);
        public static unsafe void StoreVector128x2AndZip(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2) value);
        public static unsafe void StoreVector128x2AndZip(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2) value);
        public static unsafe void StoreVector128x2AndZip(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2) value);
        public static unsafe void StoreVector128x2AndZip(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2) value);
        public static unsafe void StoreVector128x2AndZip(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2) value);
        public static unsafe void StoreVector128x2AndZip(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2) value);
        public static unsafe void StoreVector128x2AndZip(double* address, (Vector128<double> Value1, Vector128<double> Value2) value);

        public static unsafe void StoreVector128x3AndZip(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) value);
        public static unsafe void StoreVector128x3AndZip(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) value);
        public static unsafe void StoreVector128x3AndZip(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) value);
        public static unsafe void StoreVector128x3AndZip(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) value);
        public static unsafe void StoreVector128x3AndZip(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) value);
        public static unsafe void StoreVector128x3AndZip(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) value);
        public static unsafe void StoreVector128x3AndZip(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) value);
        public static unsafe void StoreVector128x3AndZip(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) value);
        public static unsafe void StoreVector128x3AndZip(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) value);
        public static unsafe void StoreVector128x3AndZip(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) value);

        public static unsafe void StoreVector128x4AndZip(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) value);
        public static unsafe void StoreVector128x4AndZip(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) value);
        public static unsafe void StoreVector128x4AndZip(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) value);
        public static unsafe void StoreVector128x4AndZip(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) value);
        public static unsafe void StoreVector128x4AndZip(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) value);
        public static unsafe void StoreVector128x4AndZip(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) value);
        public static unsafe void StoreVector128x4AndZip(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) value);
        public static unsafe void StoreVector128x4AndZip(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) value);
        public static unsafe void StoreVector128x4AndZip(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) value);
        public static unsafe void StoreVector128x4AndZip(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) value);

        public static unsafe void StoreVector128x2(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2) value);
        public static unsafe void StoreVector128x2(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) value);
        public static unsafe void StoreVector128x2(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2) value);
        public static unsafe void StoreVector128x2(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2) value);
        public static unsafe void StoreVector128x2(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2) value);
        public static unsafe void StoreVector128x2(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2) value);
        public static unsafe void StoreVector128x2(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2) value);
        public static unsafe void StoreVector128x2(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2) value);
        public static unsafe void StoreVector128x2(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2) value);
        public static unsafe void StoreVector128x2(double* address, (Vector128<double> Value1, Vector128<double> Value2) value);

        public static unsafe void StoreSelectedScalar128x2(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2) value, byte index);
        public static unsafe void StoreSelectedScalar128x2(double* address, (Vector128<double> Value1, Vector128<double> Value2) value, byte index);

        public static unsafe void StoreVector128x3(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) value);
        public static unsafe void StoreVector128x3(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) value);
        public static unsafe void StoreVector128x3(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) value);
        public static unsafe void StoreVector128x3(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) value);
        public static unsafe void StoreVector128x3(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) value);
        public static unsafe void StoreVector128x3(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) value);
        public static unsafe void StoreVector128x3(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) value);
        public static unsafe void StoreVector128x3(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) value);
        public static unsafe void StoreVector128x3(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) value);
        public static unsafe void StoreVector128x3(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) value);

        public static unsafe void StoreSelectedScalar128x3(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3) value, byte index);
        public static unsafe void StoreSelectedScalar128x3(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3) value, byte index);

        public static unsafe void StoreVector128x4(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) value);
        public static unsafe void StoreVector128x4(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) value);
        public static unsafe void StoreVector128x4(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) value);
        public static unsafe void StoreVector128x4(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) value);
        public static unsafe void StoreVector128x4(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) value);
        public static unsafe void StoreVector128x4(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) value);
        public static unsafe void StoreVector128x4(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) value);
        public static unsafe void StoreVector128x4(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) value);
        public static unsafe void StoreVector128x4(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) value);
        public static unsafe void StoreVector128x4(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) value);

        public static unsafe void StoreSelectedScalar128x4(byte*   address, (Vector128<byte>   Value1, Vector128<byte>   Value2, Vector128<byte>   Value3, Vector128<byte>   Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(sbyte*  address, (Vector128<sbyte>  Value1, Vector128<sbyte>  Value2, Vector128<sbyte>  Value3, Vector128<sbyte>  Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(short*  address, (Vector128<short>  Value1, Vector128<short>  Value2, Vector128<short>  Value3, Vector128<short>  Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(ushort* address, (Vector128<ushort> Value1, Vector128<ushort> Value2, Vector128<ushort> Value3, Vector128<ushort> Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(int*    address, (Vector128<int>    Value1, Vector128<int>    Value2, Vector128<int>    Value3, Vector128<int>    Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(uint*   address, (Vector128<uint>   Value1, Vector128<uint>   Value2, Vector128<uint>   Value3, Vector128<uint>   Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(long*   address, (Vector128<long>   Value1, Vector128<long>   Value2, Vector128<long>   Value3, Vector128<long>   Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(ulong*  address, (Vector128<ulong>  Value1, Vector128<ulong>  Value2, Vector128<ulong>  Value3, Vector128<ulong>  Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(float*  address, (Vector128<float>  Value1, Vector128<float>  Value2, Vector128<float>  Value3, Vector128<float>  Value4) value, byte index);
        public static unsafe void StoreSelectedScalar128x4(double* address, (Vector128<double> Value1, Vector128<double> Value2, Vector128<double> Value3, Vector128<double> Value4) value, byte index);
    }
}

@terrajobst added the api-approved label and removed the api-ready-for-review label on Jun 6, 2023
@kunalspathak removed the blocking label on Jun 6, 2023
@kunalspathak
Member Author

@a74nh, @SwapnilGaikwad - This is approved. When you get a chance, you can start with the Store APIs.

@tannergooding modified the milestones: 8.0.0, Future on Jul 24, 2023
@tannergooding
Member

This won't make .NET 8. We can start accepting PRs any time after the repo opens up for .NET 9 next month.

@kunalspathak
Member Author

All the load vector APIs are complete.

@kunalspathak
Member Author

All the APIs are implemented. Thank you @TIHan and @SwapnilGaikwad for your contributions.

@kunalspathak reopened this on Nov 16, 2023
@github-actions bot locked and limited conversation to collaborators on Dec 17, 2023