RFC: SIMD spec #9389

Open
lemaitre opened this issue Jul 15, 2021 · 6 comments

lemaitre commented Jul 15, 2021

Hello everyone, I would like to discuss the SIMD aspect of Zig.

There are many aspects to consider with SIMD. To simplify the discussion, I will use the following straw-man syntax to designate a vector: @Vector(N, T, ABI). Because masks will be important later, @Mask(N, T, ABI) designates the mask type associated with a @Vector(N, T, ABI) vector.

Types that can be elements of a SIMD vector

Vectors of arithmetic types should definitely be supported (iN, uN, fN).
I think it would also be a good idea to support vectors of pointers, especially to achieve nice scatter/gather syntax.
We could also support vectors of optionals as a way to represent a masked vector.
It would also be quite interesting to have vectors of vectors as a fancy way to do SIMD-in-SIMD computation, which is especially useful for shuffles.

However, I don't see much use for vectors of (tagged) unions or structs. It would definitely be possible to standardize those if we really wanted, but for now, let's keep them out of the discussion.

Another question is: do we want to consider vectors of non-native arithmetic types, like i30 or f128?
Also, what would be the size of a @Vector(4, i30, ABI)?
I think Zig already has an answer to that question, but it would be interesting to reassess it.

Size of a vector

What should be the allowed vector sizes (the N from @Vector(N, T, ABI))?
Every time I try to write SIMD wrappers in C or C++, I limit myself to powers of 2 because it is much simpler, and there is not much real use for non-powers of 2.

But considering that Zig already has non-power-of-2-wide integers (like i30), maybe we could allow such lengths, even though it might make some operations trickier.

Order of the elements

This is a confusing part for many people.
We tend to write integers in big endian (most significant bits on the left), while we write memory in little endian (lowest addresses, and thus least significant bytes, on the left).
Intel documentation refers to everything in big endian: integers, vector elements, memory.

I think we should keep big endian for integers and little endian for everything else, as that is the most natural, but we should avoid talking about left and right when referring to elements inside vectors. We should prefer the terms low and high instead. The order is then consistent with arrays, and no extra confusion arises.

Masks

This is a big chunk. Recent SIMD architectures have native masks (AVX512, SVE, RISC-V V). However, their interfaces are not compatible with each other: SVE masks have 1 bit per byte, AVX512 (and maybe RISC-V V) masks have 1 bit per element, and emulated masks (for archs without native masks, like SSE or Neon) have 1 bit per bit.
So it is impossible to have a single mask type for all vectors of the same size, or even of the same length; the layout depends on the target arch.

That is why I believe we need a different mask type for every single vector type (hence the T in @Mask(N, T, ABI)). In theory, we might just need to know the number of bits of the elements and not their actual types, but I think Zig has no problem handling a gazillion types.

Also, we really need to bake the ABI into the type, as a @Mask(4, i32, .AVX512) would live in a %k register, while a @Mask(4, i32, .SSE2) would live in an %xmm register, despite both masking a vector of 4 i32.
Moreover, you might want to compile code for different archs into the same binary (for dynamic dispatch, for instance).

The layout of a mask is then opaque. The ABI tag can change the actual layout used.

Semantically, a @Mask(N, T, ABI) is the same as a [N]bool or @Vector(N, bool), but optimized for masking a @Vector(N, T, ABI).
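
For illustration, current Zig (which has no ABI parameter) already models this with @Vector(N, bool), e.g. as the result of a lane-wise comparison, and lets the backend pick the layout:

const a: @Vector(4, i32) = .{ 1, 2, 3, 4 };
const b: @Vector(4, i32) = .{ 2, 2, 2, 2 };
const m = a < b; // @Vector(4, bool): { true, false, false, false }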

Operations

Arithmetic and bit operations

This is the simplest part of SIMD. Arithmetic operations are defined lane-wise (i.e., per element), and are only defined on vectors of the same length. Operations keep their scalar semantics, and the standard type coercion rules for element types apply. A short example in today's syntax follows the list below.

  • negation: -a (with wrap-around: -%a)
  • addition: a + b (with wrap-around: a +% b, saturating: a +| b)
  • subtraction: a - b (with wrap-around: a -% b, saturating: a -| b)
  • multiplication: a * b (with wrap-around: a *% b, saturating: a *| b)
  • division: a / b
  • modulo: a % b
  • binary not: ~a
  • binary and: a & b
  • binary or: a | b
  • binary xor: a ^ b
  • left bit shift: a << b (lane-wise)
  • right bit shift: a >> b (lane-wise, arithmetic for signed types and logical for unsigned types, like scalar shifts)
  • extra builtins: @byteSwap, @bitReverse, @clz, @ctz, @divExact, @divFloor, @divTrunc, @max, @min, @mod, @popCount, @rem, @shlExact, @shrExact, @sqrt, @sin, @cos, @tan, @exp, @exp2, @log, @log2, @log10, @fabs, @floor, @ceil, @trunc, @round
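
All of these operators already work lane-wise on @Vector in today's Zig (minus the ABI parameter); a minimal sketch:

test "lane-wise arithmetic" {
    const a: @Vector(4, i32) = .{ 1, 2, 3, 4 };
    const b: @Vector(4, i32) = .{ 10, 20, 30, 40 };
    const sum = a + b; // .{ 11, 22, 33, 44 }
    const wrapped = a +% b; // same values here, but wraps instead of being UB on overflow
    const masked = a & b; // lane-wise bitwise and
    _ = sum;
    _ = wrapped;
    _ = masked;
}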

It might be nice to have @addWithOverflow, @subWithOverflow, @mulWithOverflow, and @shlWithOverflow variants that return a mask instead of a bool. However, those might be tricky to implement efficiently.

As with scalar operations, all floating point operations are IEEE754 compliant unless configured otherwise with @setFloatMode. With FloatMode.Optimized, the compiler would be allowed to use approximate SIMD instructions for operations an architecture lacks, e.g. division on ARMv7.
With FloatMode.Strict (the default), division would be forced to the IEEE754 behavior even if the arch does not support it. This could be implemented by devectorizing the operation.
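
A sketch of how the mode switch might look at a call site, using today's @setFloatMode (note the enum spelling has varied across Zig versions):

fn fastDiv(a: @Vector(4, f32), b: @Vector(4, f32)) @Vector(4, f32) {
    @setFloatMode(.Optimized); // allow approximate SIMD division on e.g. ARMv7
    return a / b;
}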

Note about signaling and traps

For now, signaling and trapping operations have no special support in Zig. I think we should keep it that way for SIMD operations. This means that no operation is signaling or trapping, including overflows, integer division by zero, or the square root of a negative number. They do, however, produce detectable Undefined Behavior, like their scalar counterparts.

Conversion

Conversion is a simple case of an element-wise operation, but the input and output types can differ in every aspect except their length. A specific builtin could be used, or we could just overload the scalar ones to the same effect.
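
For instance, today's scalar cast builtins already apply lane-wise when given vector operands (current syntax):

const f: @Vector(4, f32) = .{ 1.5, 2.5, 3.5, 4.5 };
const n: @Vector(4, i32) = @intFromFloat(f); // lane-wise f32 -> i32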

Reinterpret cast (or bit cast)

It is extremely common in SIMD to reinterpret the bits of a vector from one type to another.
This is an operation we really want to be available.
Reusing @bitCast would be fine, with the same constraints.
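
For example, reinterpreting 4 u32 lanes as 16 bytes, using the current single-argument @bitCast spelling:

const v: @Vector(4, u32) = .{ 0x01020304, 0, 0, 0 };
const bytes: @Vector(16, u8) = @bitCast(v); // same 128 bits, viewed as bytes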

Vector creation

To create a vector with specific elements, an anonymous list literal could be used:

var vec: @Vector(4, i32, .SSE2) = .{0, 1, 2, 3};

However, this way of creating a vector tends to be tedious for large vectors (with AVX512, we can have @Vector(64, i8, .AVX512), which would require 64 elements in the literal).

Two patterns are extremely common:

  • all the elements have the same value (@splat in the current zig)
  • all the elements have the value of their index (equivalent to .{0, 1, 2, 3...})

I think the first one could be covered by implicit conversion (more on that later), but @splat is also fine.
The second one would need a new builtin like @iota(@Vector(N, T, ABI)) (or a type that implicitly coerces to any vector type, creating the right vector).
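
For reference, today's standard library covers the second pattern with std.simd.iota:

const std = @import("std");
const idx = std.simd.iota(i32, 4); // @Vector(4, i32){ 0, 1, 2, 3 }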

Element access

Accessing a specific element could be as simple as vec[i], with vec a vector and i an integer.

Shuffles

Shuffles are extremely important in SIMD. There are many flavors of shuffles, even for the general purpose ones.
The current @shuffle builtin has weird semantics for selecting elements from the second vector.
It is more usual for the indices that select from the second vector to come right after the indices that select from the first one, which makes most usage much simpler.

Also, runtime indices are crucial for some algorithms, as they make it possible to keep a LUT in registers.

I recommend the following semantics for a general purpose shuffle:

fn @shuffle(a: @Vector(a_len, T1), b: @Vector(b_len, T1), idx: @Vector(len, T2)) @Vector(len, T1) {
  const n = a_len + b_len;
  var res: @Vector(len, T1) = undefined;
  var i: usize = 0;
  while (i < len) : (i += 1) {
    const j = idx[i] % n; // out-of-range indices wrap around
    if (j < a_len) {
      res[i] = a[j];
    } else {
      res[i] = b[j - a_len];
    }
  }
  return res;
}

The fact that a_len, b_len, and len can all differ could lead to some interesting patterns, like merging 2 vectors together.
In the case where a, b, and idx have the same power-of-2 length, the semantics of this function would exactly match gcc's __builtin_shuffle.
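
Under the proposed semantics, interleaving the low halves of two vectors would look like this (straw-man syntax):

const a: @Vector(4, i32, .SSE2) = .{ 0, 1, 2, 3 };
const b: @Vector(4, i32, .SSE2) = .{ 4, 5, 6, 7 };
// indices 0..3 select from a, indices 4..7 select from b
const lo = @shuffle(a, b, @Vector(4, i32, .SSE2){ 0, 4, 1, 5 }); // .{ 0, 4, 1, 5 }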

Some special shuffle patterns could have dedicated builtins just to make them easier to use, like concat_odd, interleave_low, interleave_even, ...

Also, a one-input shuffle is often useful, and such a shuffle could have a special syntax: a[idx].

Memory access

This might be a bit controversial, but when writing SIMD code, we often have explicit memory loads and stores. This means that we do not have pointers/slices of vectors; we have pointers/slices of scalars, and we use a load intrinsic to get a vector from such a pointer/slice.
This is simpler to reason about than casting scalar pointers to vector pointers and dereferencing. And it does not require changing the input/output data types.

That is why I think it is good to have a way to do this, without relying on pointer casting.

@load(@Vector(N, T, ABI), p) and @store(v, p) could do it:

fn @load(Vec: type, p: [*]const Vec.child) Vec {
  var res: Vec = undefined;
  var i: usize = 0;
  while (i < Vec.len) : (i += 1) {
    res[i] = p[i];
  }
  return res;
}
fn @store(a: @Vector(N, T, ABI), p: [*]T) void {
  var i: usize = 0;
  while (i < N) : (i += 1) {
    p[i] = a[i];
  }
}
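
For comparison, today's Zig can already express a contiguous load/store through the coercion between fixed-size arrays and vectors (a sketch without the ABI parameter):

fn load4(p: *const [4]i32) @Vector(4, i32) {
    return p.*; // [4]i32 coerces to @Vector(4, i32)
}
fn store4(v: @Vector(4, i32), p: *[4]i32) void {
    p.* = v; // and back
}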

But sometimes it might be useful to load elements that are not contiguous but have a constant stride. Another builtin might be useful for that (like @load_strided):

var a: @Vector(4, i32, .SSE2) = .{ src[0], src[2], src[4], src[6] };

And finally, there are cases where we need to load elements whose indices in an array are given in a vector.
Such a load operation is a gather (a scatter for a store). They would have the following semantics:

fn @gather(Vec: type, p: [*]const Vec.child, idx: @Vector(Vec.len, I, ABI)) Vec {
  // dedicated syntax: p[idx]
  var res: Vec = undefined;
  var i: usize = 0;
  while (i < Vec.len) : (i += 1) {
    res[i] = p[idx[i]];
  }
  return res;
}
fn @scatter(a: @Vector(N, T, ABI), p: [*]T, idx: @Vector(N, I, ABI)) void {
  // dedicated syntax: p[idx] = a
  var i: usize = 0;
  while (i < N) : (i += 1) {
    p[idx[i]] = a[i];
  }
}
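
A usage sketch in the straw-man syntax, assuming a slice src of i32; this gathers the same elements as the strided example above:

const idx: @Vector(4, u32, .SSE2) = .{ 0, 2, 4, 6 };
const a = @gather(@Vector(4, i32, .SSE2), src.ptr, idx); // .{ src[0], src[2], src[4], src[6] }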

If the compiler can detect that idx is in the form .{ i + 0*a, i + 1*a, i + 2*a, i + 3*a... }, then it could use a more specialized instruction. And even more so if a is 0 (plain old load/store).
Vectors of pointers could be another way to get gathers/scatters using the same kind of syntax trick.

Masks

Masks could be manipulated with bitwise operations like standard vectors, but would have neither arithmetic nor shift operations.
They would, however, have shuffle operations.

Masks would primarily be generated by comparing vectors, but they could also be created from a list literal of bools.

A mask cannot be loaded or stored in a meaningful way because of its arch-dependent layout, so no load/store should be provided.
However, conversions between masks and vectors would be necessary.
Converting a vector to a mask would be semantically equivalent to creating the mask from the sign bits of the vector (for arithmetic types).
And converting from a mask would be equivalent to creating a vector whose elements have all bits set where the mask is set, and all bits clear elsewhere.

A mask could be used to select elements from 2 vectors.

fn @select(m: @Mask(len, T, ABI), a: @Vector(len, T, ABI), b: @Vector(len, T, ABI)) @Vector(len, T, ABI) {
  var res: @Vector(len, T, ABI) = undefined;
  var i: usize = 0;
  while (i < len) : (i += 1) {
    if (m[i]) {
      res[i] = a[i];
    } else {
      res[i] = b[i];
    }
  }
  return res;
}

In gcc extensions, such an operation has been given the syntax m ? a : b, but as Zig has no ternary operator, I don't recommend having a special syntax for that.
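
For reference, today's Zig already exposes this operation as @select, with a @Vector(len, bool) predicate in place of the mask type:

const m: @Vector(4, bool) = .{ true, false, true, false };
const a: @Vector(4, i32) = .{ 1, 2, 3, 4 };
const b: @Vector(4, i32) = .{ -1, -2, -3, -4 };
const r = @select(i32, m, a, b); // .{ 1, -2, 3, -4 }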

Also, we could have a syntax for masked values: a[m] or m(a). Those masked values could be assignable.

It would also be useful to have builtins to convert from/to plain integers (bags of bits), as this can be used for LUTs in some algorithms.
Such functions could be the portable way to load/store masks in memory.

Platform specific operations

Many SIMD ISAs have custom operations that handle very specific use-cases which do not fit well into an abstract interface like this proposal. These operations should not be integrated into this proposal in any special way.
They should instead be accessible in a non-generic way, e.g. through intrinsics: #7702.

Implicit conversions

Something that would be really nice is implicit conversion from scalars to vectors.
It makes it so much easier to write SIMD code:

var vec: @Vector(4, i32, .SSE2) = 4;
vec += 6;
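
For reference, current Zig spells this out explicitly with @splat (shown in the two-argument form from the time of this issue; newer versions infer the length from the result type):

var vec: @Vector(4, i32) = @splat(4, @as(i32, 4));
vec += @splat(4, @as(i32, 6));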

In fact, I think we could provide implicit conversion from a smaller vector to a larger vector (by repeating the elements) with the following constraints:

  • the larger vector length should be a multiple of the smaller vector length
  • their element type should match exactly

Sorry for such a long issue, but I have the feeling that it is required.
Maybe I could later split it into multiple issues to better keep track of the discussion, but I'm not sure.

@Snektron (Collaborator)

See also #903

@Vexu Vexu added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Aug 6, 2021
@Vexu Vexu added this to the 0.9.0 milestone Aug 6, 2021
@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 Nov 24, 2021
@tzakharko

I must say I really love the explicit ABI parameter — this is exactly the way to do it and it avoids all the common pitfalls with SIMD types.

Is there a clear design roadmap/RFC for SIMD in Zig beyond the comments in #903? If not, maybe it would be worth starting a document where the existing implementation and ideas like these can be consolidated and massaged into a single vision?

@andrewrk andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022
@matu3ba (Contributor) commented Mar 12, 2023

This proposal is lacking some details; for example, left and right shifts can each be arithmetic or logical.
However, the main problem is that it does not clarify how to handle overflows in a forward-compatible way, covering both signaling and non-signaling operations, with the safety trade-offs that may be necessary per hardware target if portability plus optimal performance is desired.

The other problem is that it sidesteps all the quirks of floating-point behavior in SIMD, which are relevant for estimating whether extended-semantics integer and float SIMD can be unified, how much maintenance it would require, future directions of hardware, etc.
As a simpler example, it remains unclear what to do about non-IEEE754-compliant SIMD implementations like ARM Neon (https://reviews.llvm.org/D18701).
Another is what to do about subnormals, fast-math, etc.

@lemaitre (Author) commented Mar 12, 2023

> This proposal is lacking some details; for example, left and right shifts can each be arithmetic or logical. However, the main problem is that it does not clarify how to handle overflows in a forward-compatible way, covering both signaling and non-signaling operations, with the safety trade-offs that may be necessary per hardware target if portability plus optimal performance is desired.

The goal of this proposal is to use the same operation semantics as for scalars, so left and right shift semantics are basically arithmetic for signed types and logical for unsigned types. The same goes for the other operations regarding overflows: use +% if you want wrap-around, +| for saturating. Maybe we can also enable the builtins @addWithOverflow and @subWithOverflow that return a mask instead of a bool, but I am not sure how to implement them efficiently on every architecture.

I think we should not do anything related to signaling, and just always use the non-signaling behavior, at least for now. Signaling variants could be implemented later with builtins.

> The other problem is that it sidesteps all the quirks of floating-point behavior in SIMD, which are relevant for estimating whether extended-semantics integer and float SIMD can be unified, how much maintenance it would require, future directions of hardware, etc. As a simpler example, it remains unclear what to do about non-IEEE754-compliant SIMD implementations like ARM Neon (https://reviews.llvm.org/D18701). Another is what to do about subnormals, fast-math, etc.

The semantics of floating point operations in this proposal are exactly the same as for scalars, so full IEEE754 (except maybe for signaling NaNs). The goal of this proposal is not to be able to implement any algorithm efficiently on any platform. If some platforms have specific instructions or features that do not fit into this model, we should not try to force them in: #7702. If you want to deal with platforms without IEEE754 (which are increasingly rare), then intrinsics should be the way to go.

EDIT: I have updated the RFC to reflect my position on this.

@coffeebe4code commented Mar 25, 2023

I'm very new to Zig, but how do vectors work now for unclear operations?

    const a = @Vector(4, i32){ 1, 2, 3, 4 };
    const b = @Vector(4, i32){ 5, 6, 7, 8 };
    const c = a + b;

This is a fantastic interface. The required target should support 128-bit instructions. But many things can be unclear.
Here is one of the functions I have which I am looking to bring to Zig.

#include <emmintrin.h>

static inline int cmp(const __m128i *method, __m128i xmm0) {
  register __m128i xmm1, xmm2;
  register unsigned int eax;

  xmm1 = _mm_loadu_si128(method);    /* unaligned 128-bit load */
  xmm2 = _mm_cmpeq_epi8(xmm0, xmm1); /* lane-wise byte equality */
  eax = _mm_movemask_epi8(xmm2);     /* one bit per byte lane */
  return (eax & 0x0F) == 0x0F;       /* first four lanes all equal */
}

What is preventing operations like == and the equivalent here, _mm_cmpeq_epi8 from using epi8 vs epi16?

Continuing with cmp from above: would we also need a builtin function for the operation? @Op(==, a, b, .epi8), or @Op(==, a, b, u8)?

We could have the size of the operation be inferred from the size of the load, but that isn't necessarily what "everyone" would want. I personally would be fine with that.

@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023
@sharpobject (Contributor)

> What is preventing operations like == and the equivalent here, _mm_cmpeq_epi8 from using epi8 vs epi16?
>
> Continuing with cmp from above: would we also need a builtin function for the operation? @Op(==, a, b, .epi8), or @Op(==, a, b, u8)?

You'll be using the types @Vector(4, i32), @Vector(4, u32), @Vector(2, u64), @Vector(2, i64), @Vector(8, u16), @Vector(8, i16), @Vector(16, u8), and @Vector(16, i8) instead of the single type __m128i. The element-wise equality operator will use the appropriate instruction based on the width of your elements.

You can also use operators to get operations that don't correspond to a single instruction on common targets, like srli_epi8 or cmpgt_epu32; you'll ultimately get a sequence of instructions for these (probably !(a == min(a, b)) for a > b, and a 16-wide shift plus mask for the 8-wide shift).
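
For what it's worth, a sketch of the cmp function above in portable vector Zig (current syntax, no ABI parameter); the movemask-and-test step is replaced by a reduction over the first four lanes:

fn cmp(method: *const [16]u8, xmm0: @Vector(16, u8)) bool {
    const xmm1: @Vector(16, u8) = method.*; // unaligned load, like _mm_loadu_si128
    const eq = xmm0 == xmm1; // lane-wise compare -> @Vector(16, bool)
    // keep only the low 4 lanes, i.e. (eax & 0x0F) == 0x0F
    const first4: @Vector(4, bool) = .{ eq[0], eq[1], eq[2], eq[3] };
    return @reduce(.And, first4);
}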

@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Jul 9, 2023
@andrewrk andrewrk modified the milestones: 0.14.0, 0.15.0 Feb 9, 2025