RFC: SIMD spec #9389

Open
lemaitre opened this issue Jul 15, 2021 · 6 comments

lemaitre commented Jul 15, 2021

Hello everyone, I would like to discuss the SIMD aspect of Zig.

There are many aspects to consider with SIMD. To simplify the discussion, I will use the following straw-man syntax to designate a vector: @Vector(N, T, ABI). Because masks will be important later, @Mask(N, T, ABI) designates the mask type associated with a @Vector(N, T, ABI) vector.

Types that can be elements of a SIMD vector

Vectors of arithmetic types should definitely be supported (iN, uN, fN).
I think it would also be a good idea to support vectors of pointers, especially to achieve nice scatter/gather syntax.
We could also support vectors of optionals as a way to represent a masked vector.
It would also be quite interesting to have vectors of vectors as a fancy way to do SIMD-in-SIMD computation, which is especially useful for shuffles.

However, I don't see much use for vectors of (tagged) unions or structs. It would definitely be possible to standardize those if we really wanted, but for now, let's keep them out of the discussion.

Another question is: do we want to consider vectors of non-native arithmetic types, like i30 or f128?
Also, what would be the size of a @Vector(4, i30, ABI)?
I think Zig already has an answer to that question, but it would be interesting to reassess it.

Size of a vector

What should be the allowed vector sizes (the N from @Vector(N, T, ABI))?
Every time I try to write SIMD wrappers in C or C++, I limit myself to powers of 2 because it is much simpler, and there is not much real use for non-powers of 2.

But considering that Zig already has non-power-of-2-wide integers (like i30), maybe we could allow such lengths, even though it might make some operations trickier.

Order of the elements

This is a confusing part for many people.
We tend to write integers in big endian (most significant bits on the left), while we write memory in little endian (lowest addresses, and thus least significant bytes, on the left).
Intel documentation refers to everything in big endian: integers, vector elements, memory.

I think we should keep big endian for integers and little endian for everything else, as that is the most natural, but we should avoid talking about left and right when referring to elements inside vectors. We should prefer the terms low and high instead. The order is then consistent with arrays, and no extra confusion arises.

Masks

This is a big chunk. Recent SIMD architectures have native masks (AVX512, SVE, RISC-V V). However, their interfaces are not compatible with each other: SVE masks have 1 bit per byte, AVX512 (and maybe RISC-V V) masks have 1 bit per element, and emulated masks (for archs without native masks, like SSE or Neon) have 1 bit per bit.
So it is impossible to have a single mask type for all vectors of the same size, or even of the same length; the layout depends on the target arch.

That is why I believe we need a different mask type for every single vector type (hence the T in @Mask(N, T, ABI)). In theory, we might just need to know the number of bits of the elements and not their actual types, but I think Zig has no problem handling a gazillion types.

Also, we really need to bake the ABI into the type, as a @Mask(4, i32, .AVX512) would live in a %k register, while a @Mask(4, i32, .SSE2) would live in an %xmm register, despite both masking a vector of 4 i32.
Moreover, you might want to compile code for different archs into the same binary (for dynamic dispatch, for instance).

The layout of a mask is then opaque. The ABI tag can change the actual layout used.

Semantically, a @Mask(N, T, ABI) is the same as a [N]bool or @Vector(N, bool), but optimized for masking a @Vector(N, T, ABI).
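
For illustration, current Zig (which has no ABI parameter) already models this with @Vector(N, bool), e.g. as the result of a lane-wise comparison, and lets the backend pick the layout:

const a: @Vector(4, i32) = .{ 1, 2, 3, 4 };
const b: @Vector(4, i32) = .{ 2, 2, 2, 2 };
const m = a < b; // @Vector(4, bool): { true, false, false, false }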

Operations

Arithmetic and bit operations

This is the simplest part of SIMD. Arithmetic operations are defined lane-wise (i.e., per element), and are only defined on vectors of the same length. Operations keep their scalar semantics, and the standard type coercion rules for element types apply. A short example in today's syntax follows the list below.

  • negation: -a (with wrap-around: -%a)
  • addition: a + b (with wrap-around: a +% b, saturating: a +| b)
  • subtraction: a - b (with wrap-around: a -% b, saturating: a -| b)
  • multiplication: a * b (with wrap-around: a *% b, saturating: a *| b)
  • division: a / b
  • modulo: a % b
  • binary not: ~a
  • binary and: a & b
  • binary or: a | b
  • binary xor: a ^ b
  • left bit shift: a << b (lane-wise)
  • right bit shift: a >> b (lane-wise, arithmetic for signed types and logical for unsigned types, like scalar shifts)
  • extra builtins: @byteSwap, @bitReverse, @clz, @ctz, @divExact, @divFloor, @divTrunc, @max, @min, @mod, @popCount, @rem, @shlExact, @shrExact, @sqrt, @sin, @cos, @tan, @exp, @exp2, @log, @log2, @log10, @fabs, @floor, @ceil, @trunc, @round
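
All of these operators already work lane-wise on @Vector in today's Zig (minus the ABI parameter); a minimal sketch:

test "lane-wise arithmetic" {
    const a: @Vector(4, i32) = .{ 1, 2, 3, 4 };
    const b: @Vector(4, i32) = .{ 10, 20, 30, 40 };
    const sum = a + b; // .{ 11, 22, 33, 44 }
    const wrapped = a +% b; // same values here, but wraps instead of being UB on overflow
    const masked = a & b; // lane-wise bitwise and
    _ = sum;
    _ = wrapped;
    _ = masked;
}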

It might be nice to have @addWithOverflow, @subWithOverflow, @mulWithOverflow, and @shlWithOverflow variants that return a mask instead of a bool. However, those might be tricky to implement efficiently.

As with scalar operations, all floating point operations are IEEE754 compliant unless configured otherwise with @setFloatMode. With FloatMode.Optimized, the compiler would be allowed to use approximate SIMD instructions for operations an architecture lacks, e.g. division on ARMv7.
With FloatMode.Strict (the default), division would be forced to the IEEE754 behavior even if the arch does not support it. This could be implemented by devectorizing the operation.
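
A sketch of how the mode switch might look at a call site, using today's @setFloatMode (note the enum spelling has varied across Zig versions):

fn fastDiv(a: @Vector(4, f32), b: @Vector(4, f32)) @Vector(4, f32) {
    @setFloatMode(.Optimized); // allow approximate SIMD division on e.g. ARMv7
    return a / b;
}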

Note about signaling and traps

For now, signaling and trapping operations have no special support in Zig. I think we should keep it that way for SIMD operations. This means that no operation is signaling or trapping, including overflows, integer division by zero, or the square root of a negative number. They do, however, produce detectable Undefined Behavior, like their scalar counterparts.

Conversion

Conversion is a simple case of an element-wise operation, but the input and output types can differ in every aspect except their length. A specific builtin could be used, or we could just overload the scalar ones to the same effect.
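
For instance, today's scalar cast builtins already apply lane-wise when given vector operands (current syntax):

const f: @Vector(4, f32) = .{ 1.5, 2.5, 3.5, 4.5 };
const n: @Vector(4, i32) = @intFromFloat(f); // lane-wise f32 -> i32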

Reinterpret cast (or bit cast)

It is extremely common in SIMD to reinterpret the bits of a vector from one type to another.
This is an operation we really want to be available.
Reusing @bitCast would be fine, with the same constraints.
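
For example, reinterpreting 4 u32 lanes as 16 bytes, using the current single-argument @bitCast spelling:

const v: @Vector(4, u32) = .{ 0x01020304, 0, 0, 0 };
const bytes: @Vector(16, u8) = @bitCast(v); // same 128 bits, viewed as bytes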

Vector creation

To create a vector with specific elements, an anonymous list literal could be used:

var vec: @Vector(4, i32, .SSE2) = .{0, 1, 2, 3};

However, this way of creating a vector tends to be tedious for large vectors (with AVX512, we can have @Vector(64, i8, .AVX512), which would require 64 elements in the literal).

Two patterns are extremely common:

  • all the elements have the same value (@splat in the current zig)
  • all the elements have the value of their index (equivalent to .{0, 1, 2, 3...})

I think the first one could be covered by implicit conversion (more on that later), but @splat is also fine.
The second one would need a new builtin like @iota(@Vector(N, T, ABI)) (or a type that implicitly coerces to any vector type, creating the right vector).
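
For reference, today's standard library covers the second pattern with std.simd.iota:

const std = @import("std");
const idx = std.simd.iota(i32, 4); // @Vector(4, i32){ 0, 1, 2, 3 }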

Element access

Accessing a specific element could be as simple as vec[i], with vec a vector and i an integer.

Shuffles

Shuffles are extremely important in SIMD. There are many flavors of shuffles, even for the general purpose ones.
The current @shuffle builtin has weird semantics for selecting elements from the second vector.
It is more usual for the indices that select from the second vector to come right after the indices that select from the first one, which makes most usage much simpler.

Also, runtime indices are crucial for some algorithms, as they make it possible to keep a LUT in registers.

I recommend the following semantics for a general purpose shuffle:

fn @shuffle(a: @Vector(a_len, T1), b: @Vector(b_len, T1), idx: @Vector(len, T2)) @Vector(len, T1) {
  const n = a_len + b_len;
  var res: @Vector(len, T1) = undefined;
  var i: usize = 0;
  while (i < len) : (i += 1) {
    const j = idx[i] % n; // out-of-range indices wrap around
    if (j < a_len) {
      res[i] = a[j];
    } else {
      res[i] = b[j - a_len];
    }
  }
  return res;
}

The fact that a_len, b_len, and len can all differ could lead to some interesting patterns, like merging 2 vectors together.
In the case where a, b, and idx have the same power-of-2 length, the semantics of this function would exactly match gcc's __builtin_shuffle.
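
Under the proposed semantics, interleaving the low halves of two vectors would look like this (straw-man syntax):

const a: @Vector(4, i32, .SSE2) = .{ 0, 1, 2, 3 };
const b: @Vector(4, i32, .SSE2) = .{ 4, 5, 6, 7 };
// indices 0..3 select from a, indices 4..7 select from b
const lo = @shuffle(a, b, @Vector(4, i32, .SSE2){ 0, 4, 1, 5 }); // .{ 0, 4, 1, 5 }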

Some special shuffle patterns could have dedicated builtins just to make them easier to use, like concat_odd, interleave_low, interleave_even, ...

Also, a one-input shuffle is often useful, and such a shuffle could have a special syntax: a[idx].

Memory access

This might be a bit controversial, but when writing SIMD code, we often have explicit memory loads and stores. This means that we do not have pointers/slices of vectors; we have pointers/slices of scalars, and we use a load intrinsic to get a vector from such a pointer/slice.
This is simpler to reason about than casting scalar pointers to vector pointers and dereferencing. And it does not require changing the input/output data types.

That is why I think it is good to have a way to do this, without relying on pointer casting.

@load(@Vector(N, T, ABI), p) and @store(v, p) could do it:

fn @load(Vec: type, p: [*]const Vec.child) Vec {
  var res: Vec = undefined;
  var i: usize = 0;
  while (i < Vec.len) : (i += 1) {
    res[i] = p[i];
  }
  return res;
}
fn @store(a: @Vector(N, T, ABI), p: [*]T) void {
  var i: usize = 0;
  while (i < N) : (i += 1) {
    p[i] = a[i];
  }
}
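
For comparison, today's Zig can already express a contiguous load/store through the coercion between fixed-size arrays and vectors (a sketch without the ABI parameter):

fn load4(p: *const [4]i32) @Vector(4, i32) {
    return p.*; // [4]i32 coerces to @Vector(4, i32)
}
fn store4(v: @Vector(4, i32), p: *[4]i32) void {
    p.* = v; // and back
}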

But sometimes it might be useful to load elements that are not contiguous but have a constant stride. Another builtin might be useful for that (like @load_strided):

var a: @Vector(4, i32, .SSE2) = .{ src[0], src[2], src[4], src[6] };

And finally, there are cases where we need to load elements whose indices in an array are given in a vector.
Such a load operation is a gather (a scatter for a store). They would have the following semantics:

fn @gather(Vec: type, p: [*]const Vec.child, idx: @Vector(Vec.len, I, ABI)) Vec {
  // dedicated syntax: p[idx]
  var res: Vec = undefined;
  var i: usize = 0;
  while (i < Vec.len) : (i += 1) {
    res[i] = p[idx[i]];
  }
  return res;
}
fn @scatter(a: @Vector(N, T, ABI), p: [*]T, idx: @Vector(N, I, ABI)) void {
  // dedicated syntax: p[idx] = a
  var i: usize = 0;
  while (i < N) : (i += 1) {
    p[idx[i]] = a[i];
  }
}
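
A usage sketch in the straw-man syntax, assuming a slice src of i32; this gathers the same elements as the strided example above:

const idx: @Vector(4, u32, .SSE2) = .{ 0, 2, 4, 6 };
const a = @gather(@Vector(4, i32, .SSE2), src.ptr, idx); // .{ src[0], src[2], src[4], src[6] }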

If the compiler can detect that idx is in the form .{ i + 0*a, i + 1*a, i + 2*a, i + 3*a... }, then it could use a more specialized instruction. And even more so if a is 0 (plain old load/store).
Vectors of pointers could be another way to get gathers/scatters using the same kind of syntax trick.

Masks

Masks could be manipulated with bitwise operations like standard vectors, but would have neither arithmetic nor shift operations.
They would, however, have shuffle operations.

Masks would primarily be generated by comparing vectors, but they could also be created from a list literal of bools.

A mask cannot be loaded or stored in a meaningful way because of its arch-dependent layout, so no load/store should be provided.
However, conversions between masks and vectors would be necessary.
Converting a vector to a mask would be semantically equivalent to creating the mask from the sign bits of the vector (for arithmetic types).
And converting from a mask would be equivalent to creating a vector whose elements have all bits set where the mask is set, and all bits clear elsewhere.

A mask could be used to select elements from 2 vectors.

fn @select(m: @Mask(len, T, ABI), a: @Vector(len, T, ABI), b: @Vector(len, T, ABI)) @Vector(len, T, ABI) {
  var res: @Vector(len, T, ABI) = undefined;
  var i: usize = 0;
  while (i < len) : (i += 1) {
    if (m[i]) {
      res[i] = a[i];
    } else {
      res[i] = b[i];
    }
  }
  return res;
}

In gcc extensions, such an operation has been given the syntax m ? a : b, but as Zig has no ternary operator, I don't recommend having a special syntax for that.
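
For reference, today's Zig already exposes this operation as @select, with a @Vector(len, bool) predicate in place of the mask type:

const m: @Vector(4, bool) = .{ true, false, true, false };
const a: @Vector(4, i32) = .{ 1, 2, 3, 4 };
const b: @Vector(4, i32) = .{ -1, -2, -3, -4 };
const r = @select(i32, m, a, b); // .{ 1, -2, 3, -4 }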

Also, we could have a syntax for masked values: a[m] or m(a). Those masked values could be assignable.

It would also be useful to have builtins to convert from/to plain integers (bags of bits), as this can be used for LUTs in some algorithms.
Such functions could be the portable way to load/store masks in memory.

Platform specific operations

Many SIMD ISAs have custom operations that handle very specific use-cases which do not fit well into an abstract interface like this proposal. These operations should not be integrated into this proposal in any special way.
They should instead be accessible in a non-generic way, e.g. through intrinsics: #7702.

Implicit conversions

Something that would be really nice is implicit conversion from scalars to vectors.
It makes it so much easier to write SIMD code:

var vec: @Vector(4, i32, .SSE2) = 4;
vec += 6;
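
For reference, current Zig spells this out explicitly with @splat (shown in the two-argument form from the time of this issue; newer versions infer the length from the result type):

var vec: @Vector(4, i32) = @splat(4, @as(i32, 4));
vec += @splat(4, @as(i32, 6));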

In fact, I think we could provide implicit conversion from a smaller vector to a larger vector (by repeating the elements) with the following constraints:

  • the larger vector length should be a multiple of the smaller vector length
  • their element type should match exactly

Sorry for such a long issue, but I have the feeling that it is required.
Maybe I could later split it into multiple issues to better keep track of the discussion, but I'm not sure.

@Snektron (Collaborator)

See also #903

@Vexu Vexu added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Aug 6, 2021
@Vexu Vexu added this to the 0.9.0 milestone Aug 6, 2021
@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 Nov 24, 2021
@tzakharko

I must say I really love the explicit ABI parameter — this is exactly the way to do it and it avoids all the common pitfalls with SIMD types.

Is there a clear design roadmap/RFC for SIMD in Zig beyond the comments in #903? If not, maybe it would be worth starting a document where the existing implementation and ideas like these can be consolidated and massaged into a single vision?

@andrewrk andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022
@matu3ba (Contributor) commented Mar 12, 2023

This proposal is lacking some details; for example, left and right shifts can each be arithmetic or logical.
However, the main problem is that it does not clarify how to handle overflows in a forward-compatible way, covering both signaling and non-signaling operations, with the safety trade-offs that may be necessary per hardware target if portability plus optimal performance is desired.

The other problem is that it sidesteps all the quirks of floating-point behavior in SIMD, which are relevant for estimating whether extended-semantics integer and float SIMD can be unified, how much maintenance it would require, future directions of hardware, etc.
As a simpler example, it remains unclear what to do about non-IEEE754-compliant SIMD implementations like ARM Neon (https://reviews.llvm.org/D18701).
Another is what to do about subnormals, fast-math, etc.

@lemaitre (Author) commented Mar 12, 2023

> This proposal is lacking some details; for example, left and right shifts can each be arithmetic or logical. However, the main problem is that it does not clarify how to handle overflows in a forward-compatible way, covering both signaling and non-signaling operations, with the safety trade-offs that may be necessary per hardware target if portability plus optimal performance is desired.

The goal of this proposal is to use the same operation semantics as for scalars, so left and right shift semantics are basically arithmetic for signed types and logical for unsigned types. The same goes for the other operations regarding overflows: use +% if you want wrap-around, +| for saturating. Maybe we can also enable the builtins @addWithOverflow and @subWithOverflow that return a mask instead of a bool, but I am not sure how to implement them efficiently on every architecture.

I think we should not do anything related to signaling, and just always use the non-signaling behavior, at least for now. Signaling variants could be implemented later with builtins.

> The other problem is that it sidesteps all the quirks of floating-point behavior in SIMD, which are relevant for estimating whether extended-semantics integer and float SIMD can be unified, how much maintenance it would require, future directions of hardware, etc. As a simpler example, it remains unclear what to do about non-IEEE754-compliant SIMD implementations like ARM Neon (https://reviews.llvm.org/D18701). Another is what to do about subnormals, fast-math, etc.

The semantics of floating point operations in this proposal are exactly the same as for scalars, so full IEEE754 (except maybe for signaling NaNs). The goal of this proposal is not to be able to implement any algorithm efficiently on any platform. If some platforms have specific instructions or features that do not fit into this model, we should not try to force them in: #7702. If you want to deal with platforms without IEEE754 (which are increasingly rare), then intrinsics should be the way to go.

EDIT: I have updated the RFC to reflect my position on this.

@coffeebe4code commented Mar 25, 2023

I'm very new to Zig, but how do vectors work now for unclear operations?

    const a = @Vector(4, i32){ 1, 2, 3, 4 };
    const b = @Vector(4, i32){ 5, 6, 7, 8 };
    const c = a + b;

This is a fantastic interface. The required target should support 128-bit instructions. But many things can be unclear.
Here is one of the functions I have which I am looking to bring to Zig.

#include <emmintrin.h>

static inline int cmp(const __m128i *method, __m128i xmm0) {
  register __m128i xmm1, xmm2;
  register unsigned int eax;

  xmm1 = _mm_loadu_si128(method);    /* unaligned 128-bit load */
  xmm2 = _mm_cmpeq_epi8(xmm0, xmm1); /* lane-wise byte equality */
  eax = _mm_movemask_epi8(xmm2);     /* one bit per byte lane */
  return (eax & 0x0F) == 0x0F;       /* first four lanes all equal */
}

What is preventing operations like == and the equivalent here, _mm_cmpeq_epi8 from using epi8 vs epi16?

Continuing with cmp from above: would we also need a builtin function for the operation? @Op(==, a, b, .epi8), or @Op(==, a, b, u8)?

We could have the size of the operation be inferred from the size of the load, but that isn't necessarily what "everyone" would want. I personally would be fine with that.

@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023
@sharpobject (Contributor)

> What is preventing operations like == and the equivalent here, _mm_cmpeq_epi8 from using epi8 vs epi16?
>
> Continuing with cmp from above: would we also need a builtin function for the operation? @Op(==, a, b, .epi8), or @Op(==, a, b, u8)?

You'll be using the types @Vector(4, i32), @Vector(4, u32), @Vector(2, u64), @Vector(2, i64), @Vector(8, u16), @Vector(8, i16), @Vector(16, u8), and @Vector(16, i8) instead of the single type __m128i. The element-wise equality operator will use the appropriate instruction based on the width of your elements.

You can also use operators to get operations that don't correspond to a single instruction on common targets, like srli_epi8 or cmpgt_epu32; you'll ultimately get a sequence of instructions for these (probably !(a == min(a, b)) for a > b, and a 16-wide shift plus mask for the 8-wide shift).
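
For what it's worth, a sketch of the cmp function above in portable vector Zig (current syntax, no ABI parameter); the movemask-and-test step is replaced by a reduction over the first four lanes:

fn cmp(method: *const [16]u8, xmm0: @Vector(16, u8)) bool {
    const xmm1: @Vector(16, u8) = method.*; // unaligned load, like _mm_loadu_si128
    const eq = xmm0 == xmm1; // lane-wise compare -> @Vector(16, bool)
    // keep only the low 4 lanes, i.e. (eax & 0x0F) == 0x0F
    const first4: @Vector(4, bool) = .{ eq[0], eq[1], eq[2], eq[3] };
    return @reduce(.And, first4);
}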

@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Jul 9, 2023
@andrewrk andrewrk modified the milestones: 0.14.0, 0.15.0 Feb 9, 2025