- Feature Name: `portable_packed_vector_types`
- Start Date: (fill me in with today's date, YYYY-MM-DD)
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary
[summary]: #summary

This RFC adds portable packed SIMD vector types up to 256-bit wide.

Future RFCs will attempt to answer some of the unresolved questions and might
potentially cover extensions as they mature in `stdsimd`, like, for example,
portable memory gather and scatter operations, `m1xN` vector masks, masked
arithmetic/bitwise/shift operations, etc.

# Motivation
[motivation]: #motivation

The `std::arch` module exposes architecture-specific SIMD types like `__m128` - a
128-bit wide SIMD vector type. How these bits are interpreted depends on the intrinsic
being used. For example, let's sum eight `f32` values using the SSE4.1 facilities
in the `std::arch` module. This is one way to do it
([playground](https://play.rust-lang.org/?gist=165e2886b4883ec98d4e8bb4d6a32e22&version=nightly)):

```rust
unsafe fn add_reduce(a: __m128, b: __m128) -> f32 {
    let c = _mm_hadd_ps(a, b);
    let c = _mm_hadd_ps(c, _mm_setzero_ps());
    let c = _mm_hadd_ps(c, _mm_setzero_ps());
    std::mem::transmute(_mm_extract_ps(c, 0))
}

fn main() {
    unsafe {
        let a = _mm_set_ps(1., 2., 3., 4.);
        let b = _mm_set_ps(5., 6., 7., 8.);
        let r = add_reduce(a, b);
        assert_eq!(r, 36.);
    }
}
```

Notice that:

* one has to put some effort into working out, from `add_reduce`'s signature, which
  types of vectors it actually expects: "`add_reduce` takes 128-bit wide vectors and
  returns an `f32`, therefore those 128-bit vectors _probably_ contain 4 packed
  `f32`s, because that's the only combination of `f32`s that fits in 128 bits!"

* it requires a lot of `unsafe` code: the intrinsics are unsafe (which could be
  improved via [RFC 2212](https://github.com/rust-lang/rfcs/pull/2212)), the
  intrinsic API relies on the user performing transmutes, constructing the
  vectors is unsafe because it needs to be done via intrinsic calls, etc.

* it requires a lot of architecture-specific knowledge: how the intrinsics are
  called and how they are used together,

* this solution only works on `x86` or `x86_64` with SSE4.1 enabled, that is, it
  is not portable.

With portable packed vector types, we can do much better
([playground](https://play.rust-lang.org/?gist=7fb4e3b6c711b5feb35533b50315a5fb&version=nightly)):

```rust
fn main() {
    let a = f32x4::new(1., 2., 3., 4.);
    let b = f32x4::new(5., 6., 7., 8.);
    let r = (a + b).sum();
    assert_eq!(r, 36.);
}
```

These types add zero overhead over the architecture-specific types for the
operations that they support - if there is an architecture on which this does
not hold for some operation, the implementation has a bug.

The motivation of this RFC is to provide reasonably high-level, reliable, and
portable access to common SIMD vector types and SIMD operations.

At a higher level, the actual use cases for these instructions are boundless.
SIMD intrinsics are used in graphics, multimedia, linear algebra, scientific
computing, games, cryptography, text search, machine learning, low-latency
applications, and more. There are many crates in the Rust ecosystem using SIMD
intrinsics today, either through `stdsimd`, the `simd` crate, or both.
Some examples include:

* [`encoding_rs`](https://github.com/hsivonen/encoding_rs) which uses the `simd`
  crate to assist with speedy decoding.
* [`bytecount`](https://github.com/llogiq/bytecount) which uses the `simd` crate
  with AVX2 extensions to accelerate counting bytes.
* [`regex`](https://github.com/rust-lang/regex) which uses the `stdsimd` crate
  with SSSE3 extensions to accelerate multiple-substring search via the Teddy
  algorithm.

However, providing portable SIMD algorithms for all application domains is not
the intent of this RFC.

The purpose of this RFC is to provide users with vocabulary types and
fundamental operations that they can build upon in their own crates to
effectively implement SIMD algorithms in their respective application domains.

These types are meant to be extended by users with portable (or non-portable)
SIMD operations in their own crates, for example, via extension traits or
newtypes.

The operations provided in this RFC are thus either:

**fundamental**: that is, they build the foundation required to write
higher-level SIMD algorithms. These include, amongst others, instantiating
vector types, reads/writes from memory, masks and branchless conditional
operations, and type casts and conversions.

**required**: to be part of `std`. These include backend-specific compiler
intrinsics that we might never want to stabilize, as well as implementations of
`std` library traits with which, due to trait coherence, users cannot extend the
vector types themselves.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

This RFC extends Rust with **portable packed SIMD vector types**, a set of types
used to perform **explicit vectorization**:

* **SIMD**: stands for Single Instruction, Multiple Data. This RFC uses this
  term in the context of hardware instruction set architectures (ISAs) to refer
  to:
  * SIMD instructions: instructions that (typically) perform operations on
    multiple values simultaneously, and
  * SIMD registers: the registers that the SIMD instructions take as operands.
    These registers (typically) store multiple values that are
    operated upon simultaneously by SIMD instructions.

* **vector** types: types that abstract over memory stored in SIMD registers,
  allowing memory to be transferred to/from these registers and operations to be
  performed directly on them.

* **packed**: means that these vectors have a compile-time fixed size. It is
  the opposite of **scalable** or "Cray vectors", which are SIMD vector types
  with a dynamic size, that is, whose size is only known at run-time.

* **explicit vectorization**: vectorization is the process of producing programs
  that operate on multiple values simultaneously, (typically) using SIMD
  instructions and registers. Automatic vectorization is the process by which
  the Rust compiler is, in some cases, able to transform scalar Rust code, that
  is, code that does not use SIMD vector types, into machine code that does use
  SIMD registers and instructions automatically (without user intervention).
  Explicit vectorization is the process by which a Rust **user** manually writes
  Rust code that states what kind of SIMD registers are to be used and what SIMD
  instructions are executed on them.

* **portable**: is the opposite of architecture-specific. These types work both
  correctly and efficiently on all architectures.
They are a zero-overhead
  abstraction, that is, for the operations that these types support, one cannot
  write better code by hand (otherwise, it is an implementation bug).

* **masks**: are vector types used to **select** the vector elements on which
  operations are to be performed. This selection is performed by setting or
  clearing all the bits of the mask for a particular lane.

Packed vector types are denoted as follows: `{i,u,f,m}{lane_width}x{#lanes}`, so
that `i64x8` is a 512-bit vector with eight `i64` lanes and `f32x4` a 128-bit
vector with four `f32` lanes. Here:

* **lane**: one of the values of a particular type stored in a vector - the
  vector operations act on all lanes simultaneously.

* **lane width**: the bit width of a vector lane, that is, the bit width of
  the objects stored in the vector. For example, the type `f32` is 32 bits wide.

That is, the `m8x4` type is a 32-bit wide vector mask with 4 lanes, each
containing an 8-bit wide mask. Vector masks are mainly used to select the lanes
on which vector operations are performed. When a lane has all of its bits set to
`true`, that lane is "selected", and when a lane has all of its bits set to
`false`, that lane is "not selected". The following bit pattern is thus a valid
bit-pattern for the `m8x4` mask:

> 00000000_11111111_00000000_11111111

and it selects two eight-bit wide lanes from a 32-bit wide vector type with four
lanes. The following bit-pattern is not, however, a valid value of the same mask
type:

> 00000000_11111111_00000000_11110111

because it does not satisfy the invariant that all bits of a lane must be
either set or cleared.

Operations on vector types can be either:

* **vertical**: that is, lane-wise. For example, `a + b` adds each lane of `a`
  to the corresponding lane of `b`, while `a.lt(b)` returns a boolean mask
  that indicates, for each of the vector lanes, whether the less-than (`<`, `lt`)
  comparison returned `true` or `false`. Most vertical operations are binary
  operations (they take two input vectors). These operations are typically very
  fast on most architectures and they are the most widely used in practice.

* **horizontal**: that is, along a single vector - they are unary operations.
  For example, `a.sum()` adds the elements of a vector together while
  `a.max_element()` returns the largest element in a vector. These operations
  (typically) translate to a sequence of multiple SIMD instructions on most
  architectures and are therefore slower. In many cases, they are, however,
  necessary.

## Example: Average

The first example computes the arithmetic average of the elements in a list.
Sequentially, we could write it using iterators as follows:

```rust
/// Arithmetic average of the elements in `xs`.
fn average_seq(xs: &[f32]) -> f32 {
    if !xs.is_empty() {
        xs.iter().sum::<f32>() / xs.len() as f32
    } else {
        0.
    }
}
```

The following implementation uses the 256-bit SIMD facilities provided by this
RFC. As the name suggests, it will be "slow":

```rust
/// Computes the arithmetic average of the elements in the list.
///
/// # Panics
///
/// If `xs.len()` is not a multiple of `8`.
fn average_slow256(xs: &[f32]) -> f32 {
    // The 256-bit wide floating-point vector type is f32x8. To
    // avoid handling extra elements in this example we just panic.
    assert!(xs.len() % 8 == 0,
            "input length `{}` is not a multiple of 8",
            xs.len());

    let mut result = 0.0_f32; // This is where we store the result

    // We iterate over the input slice with a step of `8` elements:
    for i in (0..xs.len()).step_by(8) {
        // First, we read the next `8` elements into an `f32x8`.
        // Since we haven't checked whether the input slice
        // is aligned to the alignment of `f32x8`, we perform
        // an unaligned memory read.
        let data = f32x8::read_unaligned(&xs[i..]);

        // With the elements in the vector, we perform a horizontal
        // reduction and add them to the result.
        result += data.sum();
    }
    result / xs.len() as f32
}
```

As mentioned, this operation is "slow". Why is that? The main issue is that, on
most architectures, horizontal reductions must perform a sequence of SIMD
operations, while vertical operations typically require only a single
instruction.

We can significantly improve the performance of our algorithm by writing it in
such a way that the number of horizontal reductions performed is reduced:

```rust
fn average_fast256(xs: &[f32]) -> f32 {
    assert!(xs.len() % 8 == 0,
            "input length `{}` is not a multiple of 8",
            xs.len());

    // Our temporary result is now an f32x8 vector:
    let mut result = f32x8::splat(0.);
    for i in (0..xs.len()).step_by(8) {
        let data = f32x8::read_unaligned(&xs[i..]);
        // This adds the data elements to our temporary result using
        // a vertical lane-wise SIMD operation - this is a single SIMD
        // instruction on most architectures.
        result += data;
    }
    // Perform a single horizontal reduction at the end:
    result.sum() / xs.len() as f32
}
```

The performance could be further improved by requiring the input data to be
aligned to a 32-byte boundary (the alignment of `f32x8`), and/or by handling the
elements before the next 32-byte boundary in a special way.

## Example: scalar-vector multiply even

To showcase the mask and `select` API, the following function multiplies the
even elements of a vector with a scalar:

```rust
fn mul_even(a: f32, x: f32x4) -> f32x4 {
    // Create a vector mask for the even elements 0 and 2.
    // The vector mask API uses `bool`s to set or clear
    // all bits of a lane:
    let m = m32x4::new(true, false, true, false);

    // Perform a full multiplication
    let r = f32x4::splat(a) * x;

    // Use the mask to select the even elements from the
    // multiplication result and the odd elements from
    // the input:
    m.select(r, x)
}
```

## Example: 4x4 Matrix multiplication

To showcase the `shuffle!` API, the following function implements 4x4 matrix
multiplication using 128-bit wide vectors:

```rust
fn mul4x4(a: [f32x4; 4], b: [f32x4; 4]) -> [f32x4; 4] {
    let mut r = [f32x4::splat(0.); 4];

    for i in 0..4 {
        r[i] = a[0] * shuffle!(b[i], [0, 0, 0, 0])
             + a[1] * shuffle!(b[i], [1, 1, 1, 1])
             + a[2] * shuffle!(b[i], [2, 2, 2, 2])
             + a[3] * shuffle!(b[i], [3, 3, 3, 3]);
    }
    r
}
```

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

## Vector types

The vector types are named according to the following scheme:

> {element_type}{lane_width}x{number_of_lanes}

where the following element types are introduced by this RFC:

* `i`: signed integer
* `u`: unsigned integer
* `f`: float
* `m`: mask

So that `u16x8` reads "a SIMD vector of eight packed 16-bit wide unsigned
integers". The width of a vector can be computed by multiplying the
`{lane_width}` by the `{number_of_lanes}`. For `u16x8`, 16 x 8 = 128, so
this vector type is 128 bits wide.
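As a quick illustration, this relation can be checked directly against the API
proposed below (a minimal sketch; `lanes()` and the size/layout guarantees used
here are specified in the following sections):

```rust
// u16x8: lane_width (16) * number_of_lanes (8) = 128 bits.
assert_eq!(u16x8::lanes(), 8);
assert_eq!(std::mem::size_of::<u16x8>() * 8, 128);
```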
This RFC proposes adding all vector types with sizes in the range [16, 256] bits
to the `std::simd` module, that is:

* 16-bit wide vectors: `i8x2`, `u8x2`, `m8x2`
* 32-bit wide vectors: `i8x4`, `u8x4`, `m8x4`, `i16x2`, `u16x2`, `m16x2`
* 64-bit wide vectors: `i8x8`, `u8x8`, `m8x8`, `i16x4`, `u16x4`, `m16x4`,
  `i32x2`, `u32x2`, `f32x2`, `m32x2`
* 128-bit wide vectors: `i8x16`, `u8x16`, `m8x16`, `i16x8`, `u16x8`, `m16x8`,
  `i32x4`, `u32x4`, `f32x4`, `m32x4`, `i64x2`, `u64x2`, `f64x2`, `m64x2`
* 256-bit wide vectors: `i8x32`, `u8x32`, `m8x32`, `i16x16`, `u16x16`, `m16x16`,
  `i32x8`, `u32x8`, `f32x8`, `m32x8`, `i64x4`, `u64x4`, `f64x4`, `m64x4`

Note that this list is not comprehensive. In particular:

* half-float `f16xN` vectors: these are supported on many architectures (ARM,
  AArch64, PowerPC64, RISC-V, MIPS, ...) but their support is blocked on Rust
  half-float support.
* AVX-512 vector types: not only 512-bit wide vector types, but also `m1xN`
  vector masks. These are blocked on `std::arch` AVX-512 support.
* other vector types: x86, AArch64, PowerPC and others include types like
  `i64x1`, `u64x1`, `f64x1`, `m64x1`, `i128x1`, `u128x1`, `m128x1`, ... These
  can always be added later as the need for them arises, potentially in
  combination with the stabilization of the `std::arch` intrinsics for those
  architectures.

### Layout of vector types

The portable packed SIMD vector types introduced in this RFC are layout
compatible with the architecture-specific vector types. That is:

```rust
union A {
    port: f32x4,
    arch: __m128,
}
let x: __m128 = _mm_setr_ps(0.0, 1.0, 2.0, 3.0);
let y: f32x4 = unsafe { A { arch: x }.port };
assert_eq!(y.extract(0), 0.0); // OK
assert_eq!(y.extract(1), 1.0); // OK
assert_eq!(y.extract(2), 2.0); // OK
assert_eq!(y.extract(3), 3.0); // OK
```

The portable packed SIMD vector types are also layout compatible with arrays of
equal element type and whose length equals the number of vector lanes. That is:

```rust
union A {
    port: f32x4,
    arr: [f32; 4],
}
let x: [f32; 4] = [0.0, 1.0, 2.0, 3.0];
let y: f32x4 = unsafe { A { arr: x }.port };
assert_eq!(y.extract(0), 0.0); // OK
assert_eq!(y.extract(1), 1.0); // OK
assert_eq!(y.extract(2), 2.0); // OK
assert_eq!(y.extract(3), 3.0); // OK
```

This transitively makes both portable packed and architecture-specific SIMD
vector types layout compatible with all other types that are layout compatible
with these array types.

## API of portable packed SIMD vector types

### Traits overview

All vector types implement the following traits:

* `Copy`
* `Clone`
* `Default`: zero-initializes the vector.
* `Debug`: formats the vector as `({}, {}, ...)`.
* `PartialEq`: performs a lane-wise comparison between two vectors and
  returns `true` if all lanes compare equal. It is equivalent to
  `a.eq(b).all()`.
* `PartialOrd`: compares two vectors lexicographically.
* `From`/`Into`: lossless casts between vectors with the same number of lanes.

All signed integer, unsigned integer, and floating-point vector types implement
the following traits:

* `{Add,Sub,Mul,Div,Rem}`,
  `{Add,Sub,Mul,Div,Rem}Assign`: vertical (lane-wise) arithmetic
  operations.

All signed and unsigned integer vectors and vector masks also implement:

* `Eq`: equivalent to `PartialEq`.
* `Ord`: equivalent to `PartialOrd`.
* `Hash`: equivalent to `Hash` for `[element_type; number_of_lanes]`.
* `fmt::LowerHex`/`fmt::UpperHex`: format the vector as hexadecimal.
* `fmt::Octal`: formats the vector as an octal number.
* `fmt::Binary`: formats the vector as a binary number.
* `Not`: vertical (lane-wise) bitwise negation.
* `Bit{And,Or,Xor}`, `Bit{And,Or,Xor}Assign`:
  vertical (lane-wise) bitwise operations.

All signed and unsigned integer vectors also implement:

* `{Shl,Shr}`, `{Shl,Shr}Assign`: vertical
  (lane-wise) bit-shift operations.

Note: While IEEE 754-2008 provides total ordering predicates for floating-point
numbers, Rust does not implement `Eq` and `Ord` for the `f32` and `f64`
primitive types. This RFC follows suit and does not propose to implement `Eq`
and `Ord` for vectors of floating-point types. Any future RFC that might want to
extend Rust with a total order for floats should extend the portable
floating-point vector types with it as well. See [this internal
thread](https://users.rust-lang.org/t/how-to-sort-a-vec-of-floats/2838/3) for
more information.
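For illustration, a minimal sketch of these trait semantics on a concrete type
(using only the construction, comparison, and element-access APIs specified in
the following sections):

```rust
let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(1, 2, 3, 5);
// PartialEq: lane-wise comparison reduced with `all()`:
assert!(a != b);
assert_eq!(a == a, a.eq(a).all());
// PartialOrd: lexicographic comparison:
assert!(a < b);
// From/Into: lossless widening with the same number of lanes:
let c: i64x4 = a.into();
assert_eq!(c.extract(3), 4);
```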
### Inherent Methods

#### Construction and element access

All portable signed integer, unsigned integer, and floating-point vector types
implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Creates a new instance of the vector from `number_of_lanes`
/// values.
pub const fn new(args...: element_type) -> Self;

/// Returns the number of vector lanes.
pub const fn lanes() -> usize;

/// Constructs a new instance with each element initialized to
/// `value`.
pub const fn splat(value: element_type) -> Self;

/// Extracts the value at `index`.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
pub fn extract(self, index: usize) -> element_type;

/// Extracts the value at `index`.
///
/// If `index >= Self::lanes()` the behavior is undefined.
pub unsafe fn extract_unchecked(self, index: usize) -> element_type;

/// Returns a new vector where the value at `index` is replaced by `new_value`.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
#[must_use = "replace returns a new vector and does not modify the original value"]
pub fn replace(self, index: usize, new_value: element_type) -> Self;

/// Returns a new vector where the value at `index` is replaced by `new_value`.
///
/// If `index >= Self::lanes()` the behavior is undefined.
#[must_use = "replace_unchecked returns a new vector and does not modify the original value"]
pub unsafe fn replace_unchecked(self, index: usize,
                                new_value: element_type) -> Self;
}
```
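For illustration, a short sketch of how these methods compose; note that
`replace` returns a new vector instead of mutating in place:

```rust
let v = f32x4::new(1., 2., 3., 4.);
assert_eq!(f32x4::lanes(), 4);
assert_eq!(v.extract(2), 3.);
// `replace` returns a new vector; `v` itself is unchanged:
let w = v.replace(0, 42.);
assert_eq!(w, f32x4::new(42., 2., 3., 4.));
assert_eq!(v, f32x4::new(1., 2., 3., 4.));
```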
#### Reads and Writes

##### Contiguous reads and writes

All portable vector types implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Writes the values of the vector to the `slice` without
/// reading or dropping the old values.
///
/// # Panics
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not
/// aligned to an `align_of::<Self>()` boundary.
pub fn write_aligned(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without
/// reading or dropping the old values.
///
/// # Panics
///
/// If `slice.len() < Self::lanes()`.
pub fn write_unaligned(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without
/// reading or dropping the old values.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not
/// aligned to an `align_of::<Self>()` boundary, the behavior is
/// undefined.
pub unsafe fn write_aligned_unchecked(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without reading
/// or dropping the old values.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` the behavior is undefined.
pub unsafe fn write_unaligned_unchecked(self, slice: &mut [element_type]);

/// Instantiates a new vector with the values of the `slice` without
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Panics
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
/// to an `align_of::<Self>()` boundary.
pub fn read_aligned(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Panics
///
/// If `slice.len() < Self::lanes()`.
pub fn read_unaligned(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
/// to an `align_of::<Self>()` boundary, the behavior is undefined.
pub unsafe fn read_aligned_unchecked(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` the behavior is undefined.
pub unsafe fn read_unaligned_unchecked(slice: &[element_type]) -> Self;
}
```

##### Discontinuous masked reads and writes (scatter and gather)

Vector masks implement the following methods:

```rust
impl m{lane_width}x{number_of_lanes} {
/// Instantiates a new vector with the values of the `slice` located at
/// the `offset`s for which the mask (`self`) is `true`, without moving
/// them, and with the values of `default` otherwise. The memory of the
/// `slice` at the `offset`s for which the mask is `false` is not read.
///
/// # Precondition
///
/// If `slice.len() <= offset.max_element()` the behavior is undefined.
pub unsafe fn read_scattered_unchecked(self, slice: &[T], offset: O, default: D) -> D
    where
    // for exposition only:
    // number_of_lanes == D::lanes() == O::lanes(),
    // D::element_type == T,
    // O::element_type == usize,
;

/// Writes the elements of the vector `values` for which the mask (`self`)
/// is `true` to the `slice` at the `offset`s without reading or dropping
/// the old values. No memory is written to the `slice` elements at
/// the `offset`s for which the mask is `false`.
///
/// If multiple `offset`s have the same value, that is, if multiple lanes
/// from `values` are to be written to the same memory location, the writes
/// are ordered from least significant to most significant element.
///
/// # Precondition
///
/// If `slice.len() <= offset.max_element()` the behavior is undefined.
pub unsafe fn write_scattered_unchecked(self, slice: &mut [T], offset: O, values: D)
    where
    // for exposition only:
    // number_of_lanes == D::lanes() == O::lanes(),
    // D::element_type == T,
    // O::element_type == usize,
;
}
```

#### Vertical arithmetic operations

Vertical (lane-wise) arithmetic operations are provided by the following trait
implementations:

* All signed integer, unsigned integer, and floating-point vector types implement:

  * `{Add,Sub,Mul,Div,Rem}`
  * `{Add,Sub,Mul,Div,Rem}Assign`

* All signed and unsigned integer vectors also implement:

  * `{Shl,Shr}`, `{Shl,Shr}Assign`: vertical
    (lane-wise) bit-shift operations.

##### Integer vector semantics

The behavior of these operations for integer vectors is the same as that of the
scalar integer types: they `panic!` on overflow if `-C overflow-checks=on` is
enabled (and wrap around otherwise), and they `panic!` on division by zero.
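For illustration, a sketch of these semantics; the comment on the addition
marks the behavior that depends on `-C overflow-checks`:

```rust
let x = i32x4::new(std::i32::MAX, 1, 2, 3);
let y = i32x4::splat(1);
// With -C overflow-checks=on this `panic!`s because lane 0 overflows;
// with overflow checks disabled, lane 0 wraps around to `i32::MIN`:
let _z = x + y;
// Wrapping semantics can always be requested explicitly:
assert_eq!(x.wrapping_add(y).extract(0), std::i32::MIN);
```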
##### Floating-point semantics

The behavior of these operations for floating-point vectors is the same as that
of the scalar floating-point types: `±INFINITY` on overflow and on division of a
non-zero number by zero, `NaN` for `0. / 0.`, etc.

#### Wrapping arithmetic operations

All signed and unsigned integer vector types implement the whole set of `pub fn
wrapping_{add,sub,mul,div,rem}(self, Self) -> Self` methods, which, on overflow,
produce the correct mathematical result modulo `2^n`, where `n` is the lane
width in bits.

The `div` and `rem` methods `panic!` on division by zero.

#### Unsafe wrapping arithmetic operations

All signed and unsigned integer vectors implement
`pub unsafe fn wrapping_{div,rem}_unchecked(self, Self) -> Self`
methods which, on overflow, produce the correct mathematical result modulo `2^n`.

If any of the vector elements is divided by zero the behavior is undefined.

#### Saturating arithmetic operations

All signed and unsigned integer vector types implement the whole set of `pub fn
saturating_{add,sub,mul,div,rem}(self, Self) -> Self` methods, which saturate on
overflow.

The `div` and `rem` methods `panic!` on division by zero.

#### Unsafe saturating arithmetic operations

All signed and unsigned integer vectors implement `pub unsafe fn
saturating_{div,rem}_unchecked(self, Self) -> Self` methods which saturate on
overflow.

If any of the vector elements is divided by zero the behavior is undefined.

#### Binary `min`/`max` vertical operations

All portable signed integer, unsigned integer, and floating-point vectors
implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Lane-wise `min`.
///
/// Returns a vector whose lanes contain the smallest
/// element of the corresponding lanes of `self` and `other`.
pub fn min(self, other: Self) -> Self;

/// Lane-wise `max`.
///
/// Returns a vector whose lanes contain the largest
/// element of the corresponding lanes of `self` and `other`.
pub fn max(self, other: Self) -> Self;
}
```

##### Floating-point semantics

The floating-point semantics follow the semantics of `min` and `max` for the
scalar `f32` and `f64` types.

#### Floating-point vertical math operations

All portable floating-point vector types implement the following methods:

```rust
impl f{lane_width}x{number_of_lanes} {
/// Square root.
fn sqrt(self) -> Self;
/// Reciprocal square-root estimate.
///
/// **FIXME**: an upper bound on the error should
/// be guaranteed before stabilization.
fn rsqrte(self) -> Self;
/// Fused multiply-add: `self * b + c`.
fn fma(self, b: Self, c: Self) -> Self;
}
```
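For illustration, a short sketch of the binary `min`/`max` operations and the
floating-point math methods specified above:

```rust
let a = f32x4::new(1., -2., 3., -4.);
let b = f32x4::splat(0.);
// Lane-wise min/max:
assert_eq!(a.min(b), f32x4::new(0., -2., 0., -4.));
assert_eq!(a.max(b), f32x4::new(1., 0., 3., 0.));
// Vertical square root and fused multiply-add (`self * b + c`):
let x = f32x4::splat(4.);
assert_eq!(x.sqrt(), f32x4::splat(2.));
assert_eq!(x.fma(f32x4::splat(2.), f32x4::splat(1.)), f32x4::splat(9.));
```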
#### Arithmetic reductions

##### Integers

All portable signed and unsigned integer vector types implement the following
methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Horizontal wrapping sum of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for a 4-element vector:
///
/// > (x0.wrapping_add(x1)).wrapping_add(x2.wrapping_add(x3))
///
/// If an operation overflows, it returns the correct mathematical
/// result modulo `2^n`, where `n` is the lane width in bits.
pub fn wrapping_sum(self) -> element_type;

/// Horizontal wrapping product of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for a 4-element vector:
///
/// > (x0.wrapping_mul(x1)).wrapping_mul(x2.wrapping_mul(x3))
///
/// If an operation overflows, it returns the correct mathematical
/// result modulo `2^n`, where `n` is the lane width in bits.
pub fn wrapping_product(self) -> element_type;
}
```

##### Floating-point

All portable floating-point vector types implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Horizontal sum of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for an 8-element vector:
///
/// > ((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7))
///
/// If one of the vector elements is `NaN`, the reduction returns
/// `NaN`. The resulting `NaN` is not required to be equal to any
/// of the `NaN`s in the vector.
pub fn sum(self) -> element_type;

/// Horizontal product of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for an 8-element vector:
///
/// > ((x0 * x1) * (x2 * x3)) * ((x4 * x5) * (x6 * x7))
///
/// If one of the vector elements is `NaN`, the reduction returns
/// `NaN`. The resulting `NaN` is not required to be equal to any
/// of the `NaN`s in the vector.
pub fn product(self) -> element_type;
}
```

#### Bitwise reductions

All signed and unsigned integer vectors implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Horizontal bitwise `and` of the vector elements.
pub fn and(self) -> element_type;

/// Horizontal bitwise `or` of the vector elements.
pub fn or(self) -> element_type;

/// Horizontal bitwise `xor` of the vector elements.
pub fn xor(self) -> element_type;
}
```

#### Min/Max reductions

All portable signed integer, unsigned integer, and floating-point vector types
implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Largest vector element value.
pub fn max_element(self) -> element_type;

/// Smallest vector element value.
pub fn min_element(self) -> element_type;
}
```

Note: the semantics of `{min,max}_element` for floating-point numbers are the
same as those of their `min`/`max` methods.
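For illustration, a sketch of the arithmetic, bitwise, and min/max reductions
on a small integer vector:

```rust
let v = i32x4::new(1, 2, 3, 4);
// Arithmetic reductions:
assert_eq!(v.wrapping_sum(), 10);
assert_eq!(v.wrapping_product(), 24);
// Bitwise reductions (1 & 2 == 0, so the `and` reduction is 0):
assert_eq!(v.and(), 0);
assert_eq!(v.or(), 7);
// Min/max reductions:
assert_eq!(v.min_element(), 1);
assert_eq!(v.max_element(), 4);
```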
#### Mask construction and element access

```rust
impl m{lane_width}x{number_of_lanes} {
/// Creates a new vector mask from `number_of_lanes` boolean
/// values.
///
/// The values `true` and `false` respectively set and clear
/// the mask for a particular lane.
pub const fn new(args...: bool) -> Self;

/// Returns the number of vector lanes.
pub const fn lanes() -> usize;

/// Constructs a new vector mask with all lane-wise
/// masks either set, if `value` equals `true`, or cleared, if
/// `value` equals `false`.
pub const fn splat(value: bool) -> Self;

/// Returns `true` if the mask for the lane `index` is
/// set and `false` otherwise.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
pub fn extract(self, index: usize) -> bool;

/// Returns `true` if the mask for the lane `index` is
/// set and `false` otherwise.
///
/// If `index >= Self::lanes()` the behavior is undefined.
pub unsafe fn extract_unchecked(self, index: usize) -> bool;

/// Returns a new vector mask where the mask of the lane `index` is
/// set if `new_value` is `true` and cleared otherwise.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
#[must_use = "replace returns a new mask and does not modify the original value"]
pub fn replace(self, index: usize, new_value: bool) -> Self;

/// Returns a new vector mask where the mask of the lane `index` is
/// set if `new_value` is `true` and cleared otherwise.
///
/// If `index >= Self::lanes()` the behavior is undefined.
#[must_use = "replace_unchecked returns a new mask and does not modify the original value"]
pub unsafe fn replace_unchecked(self, index: usize, new_value: bool) -> Self;
}
```

#### Mask reductions

All vector masks implement the following methods:

```rust
impl m{lane_width}x{number_of_lanes} {
/// Are "all" lanes `true`?
pub fn all(self) -> bool;

/// Is "any" lane `true`?
pub fn any(self) -> bool;

/// Are "all" lanes `false`?
pub fn none(self) -> bool;
}
```

#### Mask vertical selection

All vector masks implement the following method:

```rust
impl m{lane_width}x{number_of_lanes} {
/// Lane-wise selection.
///
/// The lanes of the result for which the mask is `true` contain
/// the values of `a`, while the remaining lanes contain the values of `b`.
pub fn select(self, a: T, b: T) -> T
    where
    // for exposition only:
    // T::lanes() == number_of_lanes,
;
}
```

Note: how the `where` clause is enforced is an implementation detail. `stdsimd`
implements this using a sealed trait:

```rust
pub fn select<T>(self, a: T, b: T) -> T
    where T: SelectMask;
```

#### Vertical comparisons

All vector types implement the following vertical (lane-wise) comparison
methods, which return a mask expressing the result:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Lane-wise equality comparison.
pub fn eq(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise inequality comparison.
pub fn ne(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise less-than comparison.
pub fn lt(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise less-than-or-equals comparison.
pub fn le(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise greater-than comparison.
pub fn gt(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise greater-than-or-equals comparison.
pub fn ge(self, other: Self) -> m{lane_width}x{number_of_lanes};
}
```

For all vector types proposed in this RFC, the `{lane_width}` of the mask
matches that of the vector type. However, this will not be the case for the
AVX-512 vector types.

##### Semantics for floating-point numbers

The semantics of the lane-wise comparisons for floating-point numbers are the
same as in the scalar case.
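For illustration, a sketch that combines a vertical comparison with mask
selection to clamp negative lanes to zero without branching:

```rust
let x = i32x4::new(-1, 2, -3, 4);
// Lane-wise `x < 0`:
let m: m32x4 = x.lt(i32x4::splat(0));
// Replace the selected (negative) lanes with zero:
let r = m.select(i32x4::splat(0), x);
assert_eq!(r, i32x4::new(0, 2, 0, 4));
```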
### Portable vector shuffles

```rust
/// Shuffles vector elements.
std::simd::shuffle!(...);
```

The `shuffle!` macro returns a new vector that contains a shuffle of the
elements in one or two input vectors. There are two versions:

* `shuffle!(vec, indices)`: one-vector version
* `shuffle!(vec0, vec1, indices)`: two-vector version

with the following preconditions:

* `vec`, `vec0`, and `vec1` must be portable packed SIMD vector types,
* `vec0` and `vec1` must have the same type,
* `indices` must be a `const` array of type `[usize; N]`, where `N` is any
  power of two in the range `[2, 2 * {vec,vec0,vec1}::lanes()]`,
* the values of `indices` must be in the range `[0, vec::lanes())` for the
  one-vector version, and in the range `[0, 2 * {vec0,vec1}::lanes())` for the
  two-vector version.

On precondition violation a type error is produced.

The macro returns a new vector whose:

* element type equals that of the input vectors,
* length equals `N`, that is, the length of the `indices` array.

The `i`-th element of `indices` with value `j` in the range `[0, vec::lanes())`
stores the `j`-th element of the first vector into the `i`-th element of the
result vector.

In the two-vector version, the `i`-th element of `indices` with value `j` in the
range `[vec0::lanes(), 2 * vec0::lanes())` stores the `j - vec0::lanes()`-th
element of the second vector into the `i`-th element of the result vector.

#### Example: shuffles

The `shuffle!` macro allows reordering the elements of a vector:

```rust
let x = i32x4::new(1, 2, 3, 4);
let r = shuffle!(x, [2, 1, 3, 0]);
assert_eq!(r, i32x4::new(3, 2, 4, 1));
```

where the resulting vector can also be smaller:

```rust
let r = shuffle!(x, [1, 3]);
assert_eq!(r, i32x2::new(2, 4));
```

or larger

```rust
let r = shuffle!(x, [1, 3, 2, 2, 1, 3, 2, 2]);
assert_eq!(r, i32x8::new(2, 4, 3, 3, 2, 4, 3, 3));
```

than the input. The length of the result is, however, limited to the range
`[2, 2 * vec::lanes()]`.

It also allows shuffling between two vectors:

```rust
let y = i32x4::new(5, 6, 7, 8);
let r = shuffle!(x, y, [4, 0, 5, 1]);
assert_eq!(r, i32x4::new(5, 1, 6, 2));
```

where the indices of the second vector's elements start at the `vec::lanes()`
offset.

#### Conversions and bitcasts
[casts-and-conversions]: #casts-and-conversions

##### Conversions / bitcasts between vector types

There are three different ways to convert between vector types.

* `From`/`Into`: value-preserving widening conversions between vectors with the
  same number of lanes. That is, `f32x4` can be converted into `f64x4` using
  `From`/`Into`, but the opposite is not true because that conversion is not
  value preserving. The `From`/`Into` implementations mirror those of the
  primitive integer and floating-point types. These conversions can widen the
  size of the element type, and thus the size of the SIMD vector type. Signed
  vector types are sign-extended lane-wise, while unsigned vector types are
  zero-extended lane-wise. The result of these conversions is
  endian-independent.

* `as`: non-value-preserving truncating conversions between vectors with the
  same number of lanes. That is, `f64x4 as f32x4` performs a lane-wise `as`
  cast, truncating the values if they would overflow the destination type. The
  result of these conversions is endian-independent.

* `unsafe mem::transmute`: bit-casts between vectors of the same size; the
  vectors do not need to have the same number of lanes. For example, transmuting
  a `u8x16` into a `u16x8`. Note that while all bit-patterns of the `{i,u,f}`
  vector types represent a valid vector value, there are many vector mask
  bit-patterns that do not represent a valid mask. Note also that the result of
  `unsafe mem::transmute` is **endian-dependent** (see the examples below).
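For illustration, a sketch of the first and third conversion kinds; the widened
lane values are endian-independent, while the transmuted lane values are not
(see the unresolved questions):

```rust
let a = f32x4::new(1., 2., 3., 4.);
// `From`/`Into`: value-preserving widening, same number of lanes:
let b: f64x4 = a.into();
assert_eq!(b.extract(0), 1.0);
// `mem::transmute`: same size (128 bits), different lane count;
// the resulting lane values depend on the target's endianness:
let bits: u16x8 = unsafe { std::mem::transmute(a) };
```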
It is extremely common to perform "transmute" operations between equally-sized
portable vector types when writing SIMD algorithms. Rust currently does not have
any facilities to express that all bit-patterns of one type are also valid
bit-patterns of another type, or to perform these safe transmutes in an
endian-independent way.

This forces users to resort to `unsafe { mem::transmute(x) }` and, very likely,
to write non-portable code.

There is a very interesting discussion about potential ways to attack this
problem in [this internals
thread](https://internals.rust-lang.org/t/pre-rfc-frombits-intobits/7071/23),
and there is also an [open issue in `stdsimd` about endian-dependent
behavior](https://github.com/rust-lang-nursery/stdsimd/issues/393) - if you care
deeply about this, please chime in.

These issues are not specific to portable packed SIMD vector types, and fixing
them is not the purpose of this RFC, but they are critical for writing
efficient and portable SIMD code reliably and ergonomically.

##### Other conversions

The layout of the portable packed vector types is compatible with the layout of
fixed-size arrays of the same element type and the same number of lanes (e.g.
`f32x4` is layout compatible with `[f32; 4]`).

For all signed, unsigned, and floating-point vector types with element type `E`
and number of lanes `N`, the following implementations exist:

```rust
impl From<[E; N]> for ExN;
impl From<ExN> for [E; N];
```

## ABI and `std::simd`

The ABI is first and foremost unspecified and may change at any time.

All `std::simd` types are forbidden in `extern` functions (or warned against) -
basically the same story as for architecture-specific types like `__m128i` in
`extern` functions.

As of today, they will be implemented as pass-via-pointer unconditionally. For
example:

```rust
fn foo(a: u32x4) { /* ... */ }

foo(u32x4::splat(3));
```

This example will pass the variable `a` through memory. The function calling
`foo` will place `a` on the stack and then `foo` will read `a` from the stack
to work with it. Note that if `foo` changes the value of `a` this will not be
visible to the caller; the arguments are semantically pass-by-value but
implemented as pass-via-pointer.

Currently, we aren't aware of any slowdowns or perf hits from this mechanism
(passing through memory instead of by value). If something comes up, leaving the
ABI unspecified allows us to try to address it.

# Drawbacks
[drawbacks]: #drawbacks

## Generic vector type requirement for backends

The `std::arch` module provides architecture-specific vector types, where
backends only need to provide vector types for the architectures that they
support.

This RFC requires backends to provide generic vector types. Most backends
support this in one form or another, but if a future backend does not, this RFC
can be implemented on top of the architecture-specific types.

## Achieving zero-overhead is outside Rust's control

A future architecture might have an instruction that performs multiple
operations exposed by this API in one go, like `(a + b).wrapping_sum()` on an
`i32x4` vector. If that expression does not produce optimal machine code, Rust
has a performance bug.

This is not a performance bug that can be easily worked around in `stdsimd` or
`rustc`, making this, almost certainly, a performance bug in the backend. These
performance bugs can be arbitrarily hard to fix, and fixing them might not
always be worth it.

That is, while these APIs should make it possible for reasonably-designed
optimizing Rust backends to achieve zero-overhead, zero-overhead can only be
provided in practice on a best-effort basis.

## Performance of this API might vary dramatically

The performance of this API can vary dramatically depending on the architecture
being targeted and the target features enabled.

First, this is a consequence of portability, and thus a feature.
However, the fact that
portability can introduce performance bugs is a real concern. In any case, if a
user is able to write faster code for some architecture, they should file a
performance bug.

# Rationale and alternatives
[alternatives]: #alternatives

### Dynamic values result in poor code generation for some operations

Some of the fundamental APIs proposed in this RFC, like `vec::{new, extract,
replace}`, take run-time dynamic parameters. Consider the following example (see
the whole example live at [`rust.godbolt.org`](https://godbolt.org/g/yhiAa2)):

```rust
/// Returns an f32x8 with the values 0., 1., ..., 7.
fn increasing() -> f32x8 {
    let mut x = f32x8::splat(0.);
    for i in 0..f32x8::lanes() {
        x = x.replace(i, i as f32);
    }
    x
}
```

In release mode, `rustc` generates the following assembly for this function:

```asm
.LCPI0_0:
    .long 0
    .long 1065353216
    .long 1073741824
    .long 1077936128
    .long 1082130432
    .long 1084227584
    .long 1086324736
    .long 1088421888
example::increasing:
    pushq %rbp
    movq %rsp, %rbp
    vmovaps .LCPI0_0(%rip), %ymm0
    vmovaps %ymm0, (%rdi)
    movq %rdi, %rax
    popq %rbp
    vzeroupper
    retq
```

which uses two vector moves: one to load the precomputed values into a SIMD
register, and one to store them to the caller-provided return slot - digression:
this pass through memory is due to Rust's SIMD vector types ABI and happens only
in "isolated" examples like this one.

If we change this function to accept run-time bounds for the loop:

```rust
/// Returns an f32x4 whose elements at the indices in `[a, b)` are set
/// to their index value; the remaining elements are `0.`.
fn increasing_rt(a: usize, b: usize) -> f32x4 {
    let mut x = f32x4::splat(0.);
    for i in a..b {
        x = x.replace(i, i as f32);
    }
    x
}
```

then the number of instructions generated explodes:

```asm
example::increasing_rt:
    pushq %rbp
    movq %rsp, %rbp
    andq $-32, %rsp
    subq $320, %rsp
    vxorps %xmm0, %xmm0, %xmm0
    cmpq %rsi, %rdx
    jbe .LBB1_34
    movl %edx, %r9d
    subl %esi, %r9d
    leaq -1(%rdx), %r8
    subq %rsi, %r8
    andq $7, %r9
    je .LBB1_2
    negq %r9
    vxorps %xmm0, %xmm0, %xmm0
    movq %rsi, %rcx
.LBB1_4:
    testq %rcx, %rcx
    js .LBB1_5
    vcvtsi2ssq %rcx, %xmm2, %xmm1
...200 lines more...
```

This code isn't necessarily horrible, but it is definitely harder to reason
about its performance. This has two main causes:

* **ISAs do not support these operations**: most (all?) ISAs support operations
  like `extract` and `replace` with constant indices only. That is, with
  run-time indices these operations do not map to single instructions on most
  ISAs.

* **these operations are slow**: even with constant indices, these operations
  are slow. Often, for each constant index, a different instruction must be
  generated, and occasionally, for a particular constant index, the operation
  requires multiple instructions.

So we have a trade-off to make between providing a comfortable API for programs
that really must extract a single value with a run-time index, and providing an
API with "reliable" performance.

The proposed API accepts run-time indices (and values for `new`):

* **common** SIMD code indexes with compile-time indices: this code gets optimized
  reasonably well with the LLVM backend, but the user needs to deal with both the
  safe-but-checked and the `unsafe`-but-unchecked APIs. If we were to only accept
  constant indices, the unchecked API would not be necessary, since the checked
  API would ensure that the indices are in-bounds at compile time.
* **rare** SIMD code indexes with run-time indices: this is code that one should
  really avoid writing. The current API makes writing this code extremely easy,
  resulting in SIMD code with potentially unexpected performance. Users also
  have to deal with two APIs for this, the checked/unchecked APIs, and also
  with the memory `read`/`write` APIs, which are better suited for this use case.

Whether the current design is the right one should probably be clarified during
the RFC. An important aspect to consider is that Rust support for constants is
very basic: `const fn`s are just getting started, `const` generics are not there
yet, etc. That is, making the API take constant indices might severely limit the
type of code that can be used with these APIs in today's Rust.

### Binary (vector,scalar) and (scalar,vector) operations

This RFC can be extended with binary vector-scalar and scalar-vector operations
by implementing the following traits for signed integer, unsigned integer, and
floating-point vectors:

* `{Add,Sub,Mul,Div,Rem}<element_type>`,
  `{Add,Sub,Mul,Div,Rem}<vector_type> for element_type`,
  `{Add,Sub,Mul,Div,Rem}Assign<element_type>`: binary
  scalar-vector vertical (lane-wise) arithmetic operations.

and the following traits for signed and unsigned integer vectors:

* `Bit{And,Or,Xor}<element_type>`,
  `Bit{And,Or,Xor}<vector_type> for element_type`,
  `Bit{And,Or,Xor}Assign<element_type>`: binary scalar-vector vertical
  (lane-wise) bitwise operations.

* `{Shl,Shr}<I>`, `{Shl,Shr}Assign<I>`: for all integer types `I` in
  {`i8`, `i16`, `i32`, `i64`, `i128`, `isize`, `u8`, `u16`, `u32`, `u64`,
  `u128`, `usize`}. Note: whether only `element_type` or all integer types
  should be allowed is debatable: `stdsimd` currently allows using all integer
  types.

These traits slightly improve the ergonomics of scalar-vector operations:

```rust
let mut x: f32x4;
let y: f32x4;
let a: f32;
let z = a * x + y;
// instead of: let z = f32x4::splat(a) * x + y;
x += a;
// instead of: x += f32x4::splat(a);
```

but they do not enable anything new that can't easily be done without them by
just using `vec::splat`, and initial feedback on the RFC suggested that they
are an abstraction that hides the cost of splatting the scalar into a vector.

These traits are implemented in `stdsimd` (and thus available in nightly Rust),
are trivial to implement (`op(vec_ty::splat(scalar), vec)` and `op(vec,
vec_ty::splat(scalar))`), and cannot be "seamlessly" provided by users due to
coherence.

They are not part of this RFC, but they can easily be added (now or later) if
there is consensus to do so. In the meantime, they can be experimented with on
nightly Rust. If there is consensus to remove them, porting nightly code off
them is also pretty easy.

### Tiny vector types

On most platforms, SIMD registers have a constant width, and they can be used to
operate on vectors with a smaller bit width. However, 16 and 32-bit wide
vectors are "small" by most platforms' standards.

These types are useful for performing SIMD Within A Register (SWAR) operations
on platforms without SIMD registers (see the sketch below). While their
performance has not been extensively investigated in `stdsimd` yet, any
performance issues are performance bugs that should be fixed.
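For illustration, a minimal SWAR-style sketch: `u8x4` is only 32 bits wide, so
on a target without SIMD registers the lane-wise addition below can be lowered
to plain integer arithmetic in a general-purpose register:

```rust
let x = u8x4::new(1, 2, 3, 4);
let y = u8x4::splat(10);
// A single 32-bit wide lane-wise addition:
assert_eq!(x + y, u8x4::new(11, 12, 13, 14));
```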
### Portable shuffles API

Portable shuffles are exposed via the `shuffle!` macro. Generating the sequence
of instructions required to perform a shuffle requires the shuffle indices to be
known at compile time.

In the future, an alternative API based on `const` generics and/or `const`
function arguments could be added in a backwards-compatible way:

```rust
impl {element_type}{element_width}x{number_of_lanes} {
    pub fn shuffle(self, const indices: [usize; N])
        -> <[usize; N] as ShuffleResult>::ShuffleResultType
        where [usize; N]: ShuffleResult;
}
```

Offering this same API today is doable:

```rust
impl {element_type}{element_width}x{number_of_lanes} {
    #[rustc_const_argument(2)] // specifies that `indices` must be a const
    #[rustc_platform_intrinsic(simd_shuffle2)]
    // ^^^ specifies that this method should be treated as the
    // "platform-intrinsic" "simd_shuffle2"
    pub fn shuffle2<I>(self, other: Self, indices: I)
        -> <I as ShuffleResult>::ShuffleResultType
        where I: ShuffleResult;

    #[rustc_const_argument(1)]
    #[rustc_platform_intrinsic(simd_shuffle1)]
    pub fn shuffle<I>(self, indices: I)
        -> <I as ShuffleResult>::ShuffleResultType
        where I: ShuffleResult;
}
```

If there is consensus for it, the RFC can easily be amended.

# Prior art
[prior-art]: #prior-art

Most of this RFC is implemented in `stdsimd` and can be used on nightly today
via the `std::simd` module. The `stdsimd` crate is an effort started by
@burntsushi to put the `rust-lang-nursery/simd` crate into a state suitable for
stabilization. The `rust-lang-nursery/simd` crate was mainly developed by @huonw
and IIRC it is heavily inspired by Dart's SIMD, which is where the `f32x4`
naming scheme comes from. This RFC has been heavily inspired by Dart, and two of
the three examples used in the motivation come from the [Using SIMD in
Dart](https://www.dartlang.org/articles/dart-vm/simd) article written by John
McCutchan. Some of the key ideas of this RFC come from LLVM's design, which was
originally inspired by GCC's vector extensions, which were probably inspired by
something else. Most parts of this RFC are also consistent with the [128-bit
SIMD proposal for
WebAssembly](https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md).

Or in other words: to the author's best knowledge, this RFC does not contain any
really novel ideas. Instead, it only draws inspiration from previous designs
that have withstood the test of time, and it adapts these designs to Rust.

# Unresolved questions
[unresolved]: #unresolved-questions

### Interaction with Cray vectors

The vector types proposed in this RFC are packed, that is, their size is fixed
at compile time.

Many modern architectures support vector operations of run-time size, often
called Cray vectors or scalable vectors. These include, amongst others, the
NEC SX, ARM SVE, and RISC-V's vector extension proposal. These architectures
have traditionally relied on auto-vectorization combined with support for
explicit vectorization annotations, but newer architectures like ARM SVE
introduce explicit vectorization intrinsics.

This is an example adapted from this [ARM SVE
paper](https://developer.arm.com/hpc/arm-scalable-vector-extensions-and-application-to-machine-learning)
to pseudo-Rust:

```rust
/// Adds `c` to every element of the slice `src`, storing the result in `dst`.
fn add_constant(dst: &mut [f64], src: &[f64], c: f64) {
    assert!(dst.len() == src.len());

    // Instantiate a dynamic vector (f64xN) with all lanes set to `c`:
    let vc: f64xN = f64xN::splat(c);

    // The number of lanes that each iteration of the loop can process
    // is unknown at compile time (f64xN::lanes() is evaluated at run-time):
    for i in (0..src.len()).step_by(f64xN::lanes()) {

        // Instantiate a dynamic boolean vector with the
        // result of the predicate `i + lane < src.len()`.
        // This boolean vector acts as a mask: elements
        // "in-bounds" of the slice `src` are set to `true`,
        // while out-of-bounds elements are set to `false`:
        let m: bxN = f64xN::while_lt(i, src.len());

        // Read the elements of the source using the mask:
        let vsrc: f64xN = f64xN::read_unaligned(m, &src[i..]);

        // Add the constant vector using the mask:
        let vdst: f64xN = vsrc.add(m, vc);

        // Write the result back to memory using the mask:
        vdst.write_unaligned(m, &mut dst[i..]);
    }
}
```

The RISC-V vector extension proposal introduces a model similar in spirit to ARM
SVE. These extensions are, however, not official yet, and it is currently
unknown whether GCC and LLVM will expose explicit intrinsics for them. It would
not be surprising if they do, and it would not be surprising if similar Cray
vector extensions were introduced in other architectures in the future.

The main differences between Cray vectors and portable packed vectors are that:

* the number of lanes of a Cray vector is a run-time dynamic value,
* the Cray vector "objects" are like magical compiler token values,
* the loop induction variable must be incremented by the dynamic number of lanes
  of the vector type, and
* most Cray vector operations require a mask indicating which elements of
  the vector the operation applies to.

These differences will probably force the API of Cray vector types to be
slightly different from that of the packed vector types.

The current RFC, therefore, assumes no interaction with Cray vector types.

It does not prevent portable Cray vector types from being added to Rust in
the future via an orthogonal API, nor does it prevent adding a way for both to
interact (e.g. through memory). But at this point in time, whether these things
are possible is an open research problem.

### Half-float support

Many architectures (ARM, AArch64, PowerPC, MIPS, RISC-V) support half-float
(`f16`) vector types. It is unclear what to do with these at this point in time
since Rust currently lacks language support for half-floats.

### AVX-512 and m1xN masks support

Currently, `std::arch` provides very limited AVX-512 support, and the prototype
implementation of the `m1xN` masks like `m1x64` in `stdsimd` implements them as
512-bit wide vectors when they should actually only be 64 bits wide.

Finishing the implementation of these types requires work that just has not been
done yet.

### Fast math

The performance of the portable operations can in some cases be significantly
improved by making assumptions about the kind of arithmetic that is allowed.

For example, some of the horizontal reductions benefit from assuming math to be
finite (no `NaN`s), and others from assuming math to be associative (e.g.
associativity allows tree-like reductions for sums).
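For illustration, a sketch of why associativity matters: `f32` addition is not
associative, so the tree reduction specified for `sum` above can produce a
different result than a strict left-to-right sum.

```rust
let v = f32x4::new(1e30, 1.0, -1e30, 1.0);
// Strict left-to-right sum: ((1e30 + 1.0) + -1e30) + 1.0 == 1.0,
// because `1.0` is absorbed by `1e30` in the first addition.
// Tree reduction: (1e30 + 1.0) + (-1e30 + 1.0) == 0.0,
// because `1.0` is absorbed on both sides.
assert_eq!(v.sum(), 0.0);
```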
A future RFC could add more reduction variants with different requirements and
performance characteristics, for example, `.wrapping_sum_unordered()` or
`.max_element_nanless()`, but these are not considered in this RFC because
their interaction with fast-math is unclear.

A potentially better idea would be to allow users to specify the assumptions
that an optimizing compiler can make about floating-point arithmetic in a
finer-grained way.

For example, we could design an `#[fp_math]` attribute usable at, for example,
crate, module, function, and block scope, so that users can specify exactly
which IEEE 754 restrictions the compiler is allowed to lift where:

```rust
fn foo(x: f32x4, y: f32x4) -> f32 {
    let (w, z) = #[fp_math(assume = "associativity")] {
        // All fp math is associative; reductions can be unordered:
        let w = x.sum();
        let z = y.sum();
        (w, z)
    };

    let m = f32x4::splat(w + z) * (x + y);

    #[fp_math(assume = "finite")] {
        // All fp math is assumed finite; the reduction can assume that
        // NaNs aren't present:
        m.max_element()
    }
}
```

There are obviously many approaches to tackling this problem, but it does make
sense to have a plan for them before workarounds start getting bolted onto RFCs
like this one. There is an [internals
post](https://internals.rust-lang.org/t/pre-pre-rfc-floating-point-math-assumptions-fast-math/7162)
exploring the design space.

### Endian-dependent behavior

The results of the indexed operations (`extract`, `replace`, `write`) and of the
`new` method are endian-independent. That is, the following example is
guaranteed to pass on both little-endian (LE) and big-endian (BE) architectures:

```rust
let v = i32x4::new(0, 1, 2, 3);
assert_eq!(v.extract(0), 0); // OK in LE and BE
assert_eq!(v.extract(3), 3); // OK in LE and BE
```

The result of bit-casting two equally-sized vectors using `mem::transmute` is,
however, endian-dependent:

```rust
let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
let t: i16x8 = unsafe { mem::transmute(x) }; // UNSAFE
if cfg!(target_endian = "little") {
    let t_el = i16x8::new(256, 770, 1284, 1798, 2312, 2826, 3340, 3854);
    assert_eq!(t, t_el); // OK in LE | (would) ERROR in BE
} else if cfg!(target_endian = "big") {
    let t_eb = i16x8::new(1, 515, 1029, 1543, 2057, 2571, 3085, 3599);
    assert_eq!(t, t_eb); // OK in BE | (would) ERROR in LE
}
```

which applies to memory reads and writes as well:

```rust
let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
let mut y: [i16; 8] = [0; 8];
x.write_unaligned(unsafe {
    slice::from_raw_parts_mut(&mut y as *mut _ as *mut i8, 16)
});

if cfg!(target_endian = "little") {
    let e: [i16; 8] = [256, 770, 1284, 1798, 2312, 2826, 3340, 3854];
    assert_eq!(y, e);
} else if cfg!(target_endian = "big") {
    let e: [i16; 8] = [1, 515, 1029, 1543, 2057, 2571, 3085, 3599];
    assert_eq!(y, e);
}

let z = i8x16::read_unaligned(unsafe {
    slice::from_raw_parts(&y as *const _ as *const i8, 16)
});
assert_eq!(z, x);
```