From d817701cb6c65e1d179284dfd32488d40e6c85eb Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Wed, 14 Mar 2018 19:34:10 +0100
Subject: [PATCH 01/17] rfc: portable packed SIMD vector types

---
 text/0000-ppv.md | 1387 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1387 insertions(+)
 create mode 100644 text/0000-ppv.md

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
new file mode 100644
index 00000000000..215d8f68279
--- /dev/null
+++ b/text/0000-ppv.md
@@ -0,0 +1,1387 @@
+- Feature Name: `portable_packed_vector_types`
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: (leave this empty)
+- Rust Issue: (leave this empty)
+
+# Summary
+[summary]: #summary
+
+This RFC adds portable packed SIMD vector types up to 256 bits wide.
+
+Future RFCs will attempt to answer some of the unresolved questions and might
+potentially cover extensions as they mature in `stdsimd`, like, for example,
+portable memory gather and scatter operations, `m1xN` vector masks, masked
+arithmetic/bitwise/shift operations, etc.
+
+# Motivation
+[motivation]: #motivation
+
+The `std::arch` module exposes architecture-specific SIMD types like `__m128` -
+a 128-bit wide SIMD vector type. How these bits are interpreted depends on the
+intrinsic being used. For example, let's sum eight `f32` values using the
+SSE4.1 facilities in the `std::arch` module. This is one way to do it
+([playground](https://play.rust-lang.org/?gist=165e2886b4883ec98d4e8bb4d6a32e22&version=nightly)):
+
+```rust
+unsafe fn add_reduce(a: __m128, b: __m128) -> f32 {
+    let c = _mm_hadd_ps(a, b);
+    let c = _mm_hadd_ps(c, _mm_setzero_ps());
+    let c = _mm_hadd_ps(c, _mm_setzero_ps());
+    std::mem::transmute(_mm_extract_ps(c, 0))
+}
+
+fn main() {
+    unsafe {
+        let a = _mm_set_ps(1., 2., 3., 4.);
+        let b = _mm_set_ps(5., 6., 7., 8.);
+        let r = add_reduce(a, b);
+        assert_eq!(r, 36.);
+    }
+}
+```
+
+Notice that:
+
+* one has to put in some effort to deduce from `add_reduce`'s signature what
+  types of vectors it actually expects: "`add_reduce` takes 128-bit wide
+  vectors and returns an `f32`, therefore those 128-bit vectors _probably_ must
+  contain 4 packed `f32`s because that's the only combination of `f32`s that
+  fits in 128 bits!"
+
+* it requires a lot of `unsafe` code: the intrinsics are unsafe (which could be
+  improved via [RFC 2212](https://github.com/rust-lang/rfcs/pull/2212)), the
+  intrinsic API relies on the user performing transmutes, constructing the
+  vectors is unsafe because it needs to be done via intrinsic calls, etc.
+
+* it requires a lot of architecture-specific knowledge: how the intrinsics are
+  called, how they are used together
+
+* this solution only works on `x86` or `x86_64` with SSE4.1 enabled, that is, it
+  is not portable.
+
+With portable packed vector types, we can do much better
+([playground](https://play.rust-lang.org/?gist=7fb4e3b6c711b5feb35533b50315a5fb&version=nightly)):
+
+```rust
+fn main() {
+    let a = f32x4::new(1., 2., 3., 4.);
+    let b = f32x4::new(5., 6., 7., 8.);
+    let r = (a + b).sum();
+    assert_eq!(r, 36.);
+}
+```
+
+These types add zero overhead over the architecture-specific types for the
+operations that they support - if there is an architecture in which this does
+not hold for some operation, the implementation has a bug.
+
+The motivation of this RFC is to provide reasonably high-level, reliable, and
+portable access to common SIMD vector types and SIMD operations.
+
+At a higher level, the actual use cases for these specialty instructions are
+boundless.
+SIMD intrinsics are used in graphics, multimedia, linear algebra,
+scientific computing, games, cryptography, text search, machine learning,
+low-latency systems, and more. There are many crates in the Rust ecosystem
+using SIMD intrinsics today, either through `stdsimd`, the `simd` crate, or
+both. Some examples include:
+
+* [`encoding_rs`](https://github.com/hsivonen/encoding_rs) which uses the `simd`
+  crate to assist with speedy decoding.
+* [`bytecount`](https://github.com/llogiq/bytecount) which uses the `simd` crate
+  with AVX2 extensions to accelerate counting bytes.
+* [`regex`](https://github.com/rust-lang/regex) which uses the `stdsimd` crate
+  with SSSE3 extensions to accelerate searching for multiple substrings, as
+  does the related `teddy` crate.
+
+However, providing portable SIMD algorithms for all application domains is not
+the intent of this RFC.
+
+The purpose of this RFC is to provide users with vocabulary types and
+fundamental operations that they can build upon in their own crates to
+effectively implement SIMD algorithms in their respective application domains.
+
+These types are meant to be extended by users with portable (or non-portable)
+SIMD operations in their own crates, for example, via extension traits or new
+types.
+
+The operations provided in this RFC are thus either:
+
+**fundamental**: that is, they build the foundation required to write
+higher-level SIMD algorithms. These include, amongst others, instantiating
+vector types, loads/stores from memory, masks and branchless conditional
+operations, and type casts and conversions.
+
+**required**: that is, they must be part of `std`. These include
+backend-specific compiler intrinsics that we might never want to stabilize as
+well as implementations of `std` library traits which, due to trait coherence,
+users cannot provide for the vector types themselves.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+This RFC extends Rust with **portable packed SIMD vector types**, a set of types
+used to perform **explicit vectorization**:
+
+* **SIMD**: stands for Single Instruction, Multiple Data. This RFC uses this
+  term in the context of hardware instruction set architectures (ISAs) to refer
+  to:
+  * SIMD instructions: instructions that (typically) perform operations on
+    multiple values simultaneously, and
+  * SIMD registers: the registers that the SIMD instructions take as operands.
+    These registers (typically) store multiple values that are
+    operated upon simultaneously by SIMD instructions.
+
+* **vector** types: types that abstract over memory stored in SIMD registers,
+  allowing memory to be transferred to/from these registers and operations to
+  be performed directly on them.
+
+* **packed**: means that these vectors have a compile-time fixed size. It is
+  the opposite of **scalable** or "Cray vectors", which are SIMD vector types
+  with a dynamic size, that is, whose size is only known at run-time.
+
+* **explicit vectorization**: vectorization is the process of producing programs
+  that operate on multiple values simultaneously (typically) using SIMD
+  instructions and registers. Automatic vectorization is the process by which
+  the Rust compiler is, in some cases, able to transform scalar Rust code, that
+  is, code that does not use SIMD vector types, into machine code that does use
+  SIMD registers and instructions automatically (without user intervention).
+  Explicit vectorization is the process by which a Rust **user** manually writes
+  Rust code that states what kind of SIMD registers are to be used and what SIMD
+  instructions are executed on them.
+
+* **portable**: is the opposite of architecture-specific. These types work both
+  correctly and efficiently on all architectures. They are a zero-overhead
+  abstraction, that is, for the operations that these types support, one cannot
+  write better code by hand (otherwise, it is an implementation bug).
+
+* **masks**: are vector types used to **select** vector elements on which
+  operations are to be performed. This selection is performed by setting or
+  clearing the bits of the mask for a particular lane.
+
+Packed vector types are denoted as follows: `{i,u,f,m}{lane_width}x{#lanes}`, so
+that `i64x8` is a 512-bit vector with eight `i64` lanes and `f32x4` a 128-bit
+vector with four `f32` lanes. Here:
+
+* **lane**: is one of the values of a particular type stored in a vector -
+  vector operations act on all lanes simultaneously.
+
+* **lane width**: the bit width of a vector lane, that is, the bit width of
+  the objects stored in the vector. For example, the type `f32` is 32-bits wide.
+
+Operations on vector types can be either:
+
+* **vertical**: that is, lane-wise. For example, `a + b` adds each lane of `a`
+  with the corresponding lane of `b`, while `a.lt(b)` returns a boolean mask
+  that indicates whether the less-than (`<`, `lt`) comparison returned `true` or
+  `false` for each of the vector lanes. Most vertical operations are binary operations (they take two input vectors). These operations are typically very fast on most architectures and they are the most widely used in practice.
+
+* **horizontal**: that is, along a single vector - they are unary operations.
+  For example, `a.sum()` adds the elements of a vector together while `a.hmax()`
+  returns the largest element in a vector. These operations (typically)
+  translate to a sequence of multiple SIMD instructions on most architectures
+  and are therefore slower. In many cases, they are, however, necessary. 
+
+## Example: Average
+
+The first example computes the arithmetic average of the elements in a list.
+Sequentially, we would write it using iterators as follows:
+
+```rust
+/// Arithmetic average of the elements in `xs`.
+fn average_seq(xs: &[f32]) -> f32 {
+    if xs.len() > 0 {
+        xs.iter().sum::<f32>() / xs.len() as f32
+    } else {
+        0.
+    }
+}
+```
+
+The following implementation uses the 256-bit SIMD facilities provided by this
+RFC. As the name suggests, it will be "slow":
+
+```rust
+/// Computes the arithmetic average of the elements in the list.
+///
+/// # Panics
+///
+/// If `xs.len()` is not a multiple of `8`.
+fn average_slow256(xs: &[f32]) -> f32 {
+    // The 256-bit wide floating-point vector type is f32x8. To
+    // avoid handling extra elements in this example we just panic.
+    assert!(xs.len() % 8 == 0,
+            "input length `{}` is not a multiple of 8",
+            xs.len());
+
+    let mut result = 0.0_f32;  // This is where we store the result
+
+    // We iterate over the input slice with a step of `8` elements:
+    for i in (0..xs.len()).step_by(8) {
+        // First, we load the next `8` elements into an `f32x8`.
+        // Since we haven't checked whether the input slice
+        // is aligned to the alignment of `f32x8`, we perform
+        // an unaligned memory load.
+        let data = f32x8::load_unaligned(&xs[i..]);
+
+        // With the elements in the vector, we perform a horizontal reduction
+        // and add them to the result.
+        result += data.sum();
+    }
+    result
+}
+```
+
+As mentioned, this operation is "slow". Why is that? The main issue is that, on
+most architectures, horizontal reductions must perform a sequence of SIMD
+operations while vertical operations typically require only a single
+instruction.
+
+We can significantly improve the performance of our algorithm by writing it in
+such a way that the number of horizontal reductions performed is reduced:
+
+```rust
+fn average_fast256(xs: &[f32]) -> f32 {
+    assert!(xs.len() % 8 == 0,
+            "input length `{}` is not a multiple of 8",
+            xs.len());
+
+    // Our temporary result is now a f32x8 vector:
+    let mut result = f32x8::splat(0.);
+    for i in (0..xs.len()).step_by(8) {
+        let data = f32x8::load_unaligned(&xs[i..]);
+        // This adds the data elements to our temporary result using
+        // a vertical lane-wise SIMD operation - this is a single SIMD
+        // instruction on most architectures.
+        result += data;
+    }
+    // Perform a single horizontal reduction at the end:
+    result.sum()
+}
+```
+
+The performance could be further improved by requiring the input data to be
+aligned to a 32-byte boundary (the alignment of `f32x8`), and/or by handling
+the elements before the next 32-byte boundary in a special way.
+
+## Example: scalar-vector multiply even
+
+To showcase the mask and `select` API the following function multiplies the
+even elements of a vector with a scalar:
+
+```rust
+fn mul_even(a: f32, x: f32x4) -> f32x4 {
+    // Create a mask for the even elements 0 and 2:
+    let m = m32x4::new(true, false, true, false);
+
+    // Perform a full multiplication
+    let r = f32x4::splat(a) * x;
+
+    // Use the mask to select the even elements from the
+    // multiplication result and the odd elements from
+    // the input:
+    m.select(r, x)
+}
+```
+
+## Example: 4x4 Matrix multiplication
+
+To showcase the `shuffle` API the following function implements 4x4 Matrix
+multiply using 128-bit wide vectors:
+
+```rust
+fn mul4x4(a: [f32x4; 4], b: [f32x4; 4]) -> [f32x4; 4] {
+    let mut r = [f32x4::splat(0.); 4];
+
+    for i in 0..4 {
+        r[i] =
+            a[0] * shuffle!(b[i], [0,0,0,0]) +
+            a[1] * shuffle!(b[i], [1,1,1,1]) +
+            a[2] * shuffle!(b[i], [2,2,2,2]) +
+            a[3] * shuffle!(b[i], [3,3,3,3]);
+    }
+    r
+}
+```
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Vector types
+
+The vector types are named according to the following scheme:
+
+> {element_type}{lane_width}x{number_of_lanes}
+
+where the following element types are introduced by this RFC:
+
+* `i`: signed integer
+* `u`: unsigned integer
+* `f`: float
+* `m`: mask
+
+So that `u16x8` reads "a SIMD vector of eight packed 16-bit wide unsigned
+integers". The width of a vector can be computed by multiplying the
+`{lane_width}` times the `{number_of_lanes}`. For `u16x8`, 16 x 8 = 128, so
+this vector type is 128 bits wide.
+
+This RFC proposes adding all vector types with sizes in the range [16, 256]
+bits to the `std::simd` module, that is:
+
+* 16-bit wide vectors: `i8x2`, `u8x2`, `m8x2`
+* 32-bit wide vectors: `i8x4`, `u8x4`, `m8x4`, `i16x2`, `u16x2`, `m16x2`
+* 64-bit wide vectors: `i8x8`, `u8x8`, `m8x8`, `i16x4`, `u16x4`, `m16x4`,
+  `i32x2`, `u32x2`, `f32x2`, `m32x2`
+* 128-bit wide vectors: `i8x16`, `u8x16`, `m8x16`, `i16x8`, `u16x8`, `m16x8`,
+  `i32x4`, `u32x4`, `f32x4`, `m32x4`, `i64x2`, `u64x2`, `f64x2`, `m64x2`
+* 256-bit wide vectors: `i8x32`, `u8x32`, `m8x32`, `i16x16`, `u16x16`, `m16x16`,
+  `i32x8`, `u32x8`, `f32x8`, `m32x8`, `i64x4`, `u64x4`, `f64x4`, `m64x4`
+
+Note that this list is not comprehensive. In particular:
+
+* half-float `f16xN`: these vectors are supported on many architectures (ARM,
+  AArch64, PowerPC64, RISC-V, MIPS, ...) but their support is blocked on Rust
+  half-float support.
+* AVX-512 vector types, not only 512-bit wide vector types, but also `m1xN`
+  vector masks. These are blocked on `std::arch` AVX-512 support.
+* other vector types: x86, AArch64, PowerPC and others include types like
+  `i64x1`, `u64x1`, `f64x1`, `m64x1`, `i128x1`, `u128x1`, `m128x1`, ... These
+  can always be added later as the need for them arises, potentially in
+  combination with the stabilization of the `std::arch` intrinsics for those
+  architectures.
+
+## API of portable packed SIMD vector types
+
+### Traits overview
+
+All vector types implement the following traits:
+
+* `Copy`
+* `Clone`
+* `Default`: zero-initializes the vector.
+* `Debug`: formats the vector as `({}, {}, ...)`.
+* `PartialEq`: performs a lane-wise comparison between two vectors and
+  returns `true` if all lanes compare `true`. It is equivalent to
+  `a.eq(b).all()`.
+* `PartialOrd`: compares two vectors lexicographically.
+* `From`/`Into`: lossless casts between vectors with the same number of lanes.
+
+All signed integer, unsigned integer, and floating-point vector types implement
+the following traits:
+
+* `{Add,Sub,Mul,Div,Rem}`,
+  `{Add,Sub,Mul,Div,Rem}Assign`: vertical (lane-wise) arithmetic
+  operations.
+
+All signed and unsigned integer vectors and vector masks also implement:
+
+* `Eq`: equivalent to `PartialEq`.
+* `Ord`: equivalent to `PartialOrd`.
+* `Hash`: equivalent to `Hash` for `[element_type; number_of_lanes]`.
+* `fmt::LowerHex`/`fmt::UpperHex`: formats the vector as hexadecimal.
+* `fmt::Octal`: formats the vector as an octal number.
+* `fmt::Binary`: formats the vector as a binary number.
+* `Not`: vertical (lane-wise) negation.
+* `Bit{And,Or,Xor}`, `Bit{And,Or,Xor}Assign`:
+  vertical (lane-wise) bitwise operations.
+
+All signed and unsigned integer vectors also implement:
+
+* `{Shl,Shr}`, `{Shl,Shr}Assign`: vertical
+  (lane-wise) bit-shift operations.
+
+Note: While IEEE 754-2008 provides total ordering predicates for floating-point
+numbers, Rust does not implement `Eq` and `Ord` for the `f32` and `f64`
+primitive types. This RFC follows suit and does not propose to implement `Eq`
+and `Ord` for vectors of floating-point types. Any future RFC that might want to
+extend Rust with a total order for floats should extend the portable
+floating-point vector types with it as well. See [this internals
+thread](https://users.rust-lang.org/t/how-to-sort-a-vec-of-floats/2838/3) for
+more information.
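+
+For illustration, a short usage sketch of these trait implementations (the
+lane values are illustrative, and the results assume the semantics described
+above):
+
+```rust
+let a = i32x4::new(1, 2, 3, 4);
+let b = i32x4::splat(2);
+
+// `Mul` is vertical (lane-wise) and `PartialEq` requires all lanes to match:
+assert_eq!(a * b, i32x4::new(2, 4, 6, 8));
+
+// `From`/`Into` only perform lossless, lane-count-preserving conversions:
+let wide: i64x4 = a.into();
+assert_eq!(wide, i64x4::new(1, 2, 3, 4));
+```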
+
+### Inherent Methods
+
+#### Construction and element access
+
+All portable signed integer, unsigned integer, and floating-point vector types
+implement the following methods:
+
+```rust
+impl {element_type}{lane_width}x{number_of_lanes} {
+
+/// Creates a new instance of the vector from `number_of_lanes`
+/// values.
+pub const fn new(args...: element_type) -> Self;
+
+/// Returns the number of vector lanes.
+pub const fn lanes() -> usize;
+
+/// Constructs a new instance with each element initialized to
+/// `value`.
+pub const fn splat(value: element_type) -> Self;
+
+/// Extracts the value at `index`.
+///
+/// # Panics
+///
+/// If `index >= Self::lanes()`.
+pub fn extract(self, index: usize) -> element_type;
+
+/// Extracts the value at `index`.
+///
+/// If `index >= Self::lanes()` the behavior is undefined.
+pub unsafe fn extract_unchecked(self, index: usize) -> element_type;
+
+/// Returns a new vector where the value at `index` is replaced by `new_value`.
+///
+/// # Panics
+///
+/// If `index >= Self::lanes()`.
+#[must_use = error-message]
+pub fn replace(self, index: usize, new_value: element_type) -> Self;
+
+/// Returns a new vector where the value at `index` is replaced by `new_value`.
+///
+/// If `index >= Self::lanes()` the behavior is undefined.
+#[must_use = error-message]
+pub unsafe fn replace_unchecked(self, index: usize,
+                                new_value: element_type) -> Self;
+}
+```
+
+#### Loads and Stores
+
+All portable vector types implement the following methods:
+
+```rust
+impl {element_type}{lane_width}x{number_of_lanes} {
+
+/// Writes the values of the vector to the `slice`.
+///
+/// # Panics
+///
+/// If `slice.len() < Self::lanes()` or `&slice[0]` is not
+/// aligned to an `align_of::<Self>()` boundary.
+pub fn store_aligned(self, slice: &mut [element_type]);
+
+/// Writes the values of the vector to the `slice`.
+///
+/// # Panics
+///
+/// If `slice.len() < Self::lanes()`.
+pub fn store_unaligned(self, slice: &mut [element_type]);
+
+/// Writes the values of the vector to the `slice`.
+///
+/// # Precondition
+///
+/// If `slice.len() < Self::lanes()` or `&slice[0]` is not
+/// aligned to an `align_of::<Self>()` boundary, the behavior is
+/// undefined.
+pub unsafe fn store_aligned_unchecked(self, slice: &mut [element_type]);
+
+/// Writes the values of the vector to the `slice`.
+///
+/// # Precondition
+///
+/// If `slice.len() < Self::lanes()` the behavior is undefined.
+pub unsafe fn store_unaligned_unchecked(self, slice: &mut [element_type]);
+
+/// Instantiates a new vector with the values of the `slice`.
+///
+/// # Panics
+///
+/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
+/// to an `align_of::<Self>()` boundary.
+pub fn load_aligned(slice: &[element_type]) -> Self;
+
+/// Instantiates a new vector with the values of the `slice`.
+///
+/// # Panics
+///
+/// If `slice.len() < Self::lanes()`.
+pub fn load_unaligned(slice: &[element_type]) -> Self;
+
+/// Instantiates a new vector with the values of the `slice`.
+///
+/// # Precondition
+///
+/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
+/// to an `align_of::<Self>()` boundary, the behavior is undefined.
+pub unsafe fn load_aligned_unchecked(slice: &[element_type]) -> Self;
+
+/// Instantiates a new vector with the values of the `slice`.
+///
+/// # Precondition
+///
+/// If `slice.len() < Self::lanes()` the behavior is undefined.
+pub unsafe fn load_unaligned_unchecked(slice: &[element_type]) -> Self;
+}
+```
+
+#### Binary minmax vertical operations
+
+All portable signed integer, unsigned integer, and floating-point vectors
+implement the following methods:
+
+```rust
+impl {element_type}{lane_width}x{number_of_lanes} { 
+/// Lane-wise `min`.
+///
+/// Returns a vector whose lanes contain the smallest
+/// element of the corresponding lane of `self` and `other`.
+pub fn min(self, other: Self) -> Self;
+
+/// Lane-wise `max`.
+///
+/// Returns a vector whose lanes contain the largest
+/// element of the corresponding lane of `self` and `other`.
+pub fn max(self, other: Self) -> Self;
+}
+```
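+
+For example, a small sketch of the lane-wise `min`/`max` behavior (assuming
+the `f32x4` API above; each result lane is picked independently of the
+others):
+
+```rust
+let a = f32x4::new(1., 5., 3., 7.);
+let b = f32x4::new(4., 2., 6., 0.);
+
+assert_eq!(a.min(b), f32x4::new(1., 2., 3., 0.));
+assert_eq!(a.max(b), f32x4::new(4., 5., 6., 7.));
+```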
+
+##### Floating-point semantics
+
+The floating-point semantics follow the semantics of `min` and `max` for the
+scalar `f32` and `f64` types. That is:
+
+If either operand is a `NaN`, returns the other non-NaN operand. Returns `NaN`
+only if both operands are `NaN`. If the operands compare equal, returns a value
+that compares equal to both operands. This means that `min(+/-0.0, +/-0.0)`
+could return either `-0.0` or `0.0`. Otherwise, `min` and `max` return the
+smallest and largest operand, respectively.
+
+#### Arithmetic reductions
+
+##### Integers
+
+All portable signed and unsigned integer vector types implement the following
+methods:
+
+```rust
+impl {element_type}{lane_width}x{number_of_lanes} {
+
+/// Horizontal wrapping sum of the vector elements.
+///
+/// The intrinsic performs a tree-reduction of the vector elements.
+/// That is, for a 4 element vector:
+///
+/// > (x0.wrapping_add(x1)).wrapping_add(x2.wrapping_add(x3))
+///
+/// If an operation overflows it returns the mathematical result
+/// modulo `2^n` where `n` is the bit width of the element type.
+pub fn wrapping_sum(self) -> element_type;
+
+/// Horizontal wrapping product of the vector elements.
+///
+/// The intrinsic performs a tree-reduction of the vector elements.
+/// That is, for a 4 element vector:
+///
+/// > (x0.wrapping_mul(x1)).wrapping_mul(x2.wrapping_mul(x3))
+///
+/// If an operation overflows it returns the mathematical result
+/// modulo `2^n` where `n` is the bit width of the element type.
+pub fn wrapping_product(self) -> element_type;
+}
+```
+
+##### Floating-point
+
+All portable floating-point vector types implement the following methods:
+
+```rust
+impl {element_type}{lane_width}x{number_of_lanes} {
+
+/// Horizontal sum of the vector elements.
+///
+/// The intrinsic performs a tree-reduction of the vector elements.
+/// That is, for an 8 element vector:
+///
+/// > ((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7))
+///
+/// If one of the vector elements is `NaN` the reduction returns
+/// `NaN`. The resulting `NaN` is not required to be equal to any
+/// of the `NaN`s in the vector.
+pub fn sum(self) -> element_type;
+
+/// Horizontal product of the vector elements.
+///
+/// The intrinsic performs a tree-reduction of the vector elements.
+/// That is, for an 8 element vector:
+///
+/// > ((x0 * x1) * (x2 * x3)) * ((x4 * x5) * (x6 * x7))
+///
+/// If one of the vector elements is `NaN` the reduction returns
+/// `NaN`. The resulting `NaN` is not required to be equal to any
+/// of the `NaN`s in the vector.
+pub fn product(self) -> element_type;
+}
+```
+
+#### Bitwise reductions
+
+All signed and unsigned integer vectors implement the following methods:
+
+```rust
+impl {element_type}{lane_width}x{number_of_lanes} {
+/// Horizontal bitwise `and` of the vector elements.
+pub fn and(self) -> element_type;
+
+/// Horizontal bitwise `or` of the vector elements.
+pub fn or(self) -> element_type;
+
+/// Horizontal bitwise `xor` of the vector elements.
+pub fn xor(self) -> element_type;
+}
+```
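+
+A short sketch of what these reductions compute (lane values chosen for
+illustration; the results follow directly from the definitions above):
+
+```rust
+let v = i32x4::new(1, 2, 3, 4);
+
+assert_eq!(v.wrapping_sum(), 10);     // 1 + 2 + 3 + 4
+assert_eq!(v.wrapping_product(), 24); // 1 * 2 * 3 * 4
+assert_eq!(v.and(), 0);               // 1 & 2 & 3 & 4
+assert_eq!(v.or(), 7);                // 1 | 2 | 3 | 4
+assert_eq!(v.xor(), 4);               // 1 ^ 2 ^ 3 ^ 4
+```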
+
+#### Min/Max reductions
+
+All portable signed integer, unsigned integer, and floating-point vector types
+implement the following methods:
+
+```rust
+impl {element_type}{lane_width}x{number_of_lanes} {
+/// Value of the largest element in the vector.
+///
+/// # Floating-point
+///
+/// If the vector contains `NaN`s the result is a
+/// `NaN` that is not necessarily equal to any of
+/// the `NaN`s in the vector.
+pub fn hmax(self) -> element_type;
+
+/// Value of the smallest element in the vector.
+///
+/// # Floating-point
+///
+/// If the vector contains `NaN`s the result is a
+/// `NaN` that is not necessarily equal to any of
+/// the `NaN`s in the vector.
+pub fn hmin(self) -> element_type;
+}
+```
+
+#### Mask construction and element access
+
+```rust
+impl m{lane_width}x{number_of_lanes} {
+/// Creates a new vector mask from `number_of_lanes` boolean
+/// values.
+///
+/// The values `true` and `false` respectively set and clear
+/// the mask for a particular lane.
+pub const fn new(args...: bool) -> Self;
+
+/// Returns the number of vector lanes.
+pub const fn lanes() -> usize;
+
+/// Constructs a new vector mask with all lane-wise
+/// masks either set, if `value` equals `true`, or cleared, if
+/// `value` equals `false`.
+pub const fn splat(value: bool) -> Self;
+
+/// Returns `true` if the mask for the lane `index` is
+/// set and `false` otherwise.
+///
+/// # Panics
+///
+/// If `index >= Self::lanes()`.
+pub fn extract(self, index: usize) -> bool;
+
+/// Returns `true` if the mask for the lane `index` is
+/// set and `false` otherwise.
+///
+/// If `index >= Self::lanes()` the behavior is undefined.
+pub unsafe fn extract_unchecked(self, index: usize) -> bool;
+
+/// Returns a new vector mask where the mask of the lane `index` is
+/// set if `new_value` is `true` and cleared otherwise.
+///
+/// # Panics
+///
+/// If `index >= Self::lanes()`.
+#[must_use = error-message]
+pub fn replace(self, index: usize, new_value: bool) -> Self;
+
+/// Returns a new vector mask where the mask of the lane `index` is
+/// set if `new_value` is `true` and cleared otherwise.
+///
+/// If `index >= Self::lanes()` the behavior is undefined.
+#[must_use = error-message]
+pub unsafe fn replace_unchecked(self, index: usize, new_value: bool) -> Self;
+}
+```
+
+#### Mask reductions
+
+All vector masks implement the following methods:
+
+```rust
+impl m{lane_width}x{number_of_lanes} {
+/// Are "all" lanes `true`?
+pub fn all(self) -> bool;
+
+/// Is "any" lane `true`?
+pub fn any(self) -> bool;
+
+/// Are "all" lanes `false`?
+pub fn none(self) -> bool;
+}
+```
+
+#### Mask vertical selection
+
+All vector masks implement the following method:
+
+```rust
+impl m{lane_width}x{number_of_lanes} {
+/// Lane-wise selection.
+///
+/// The lanes of the result for which the mask is `true` contain
+/// the values of `a` while the remaining lanes contain the values of `b`.
+pub fn select<T>(self, a: T, b: T) -> T
+    where T::lanes() == number_of_lanes; // implementation-defined
+}
+```
+
+Note: how the `where` clause is enforced is an implementation detail. `stdsimd`
+implements this using a sealed trait:
+
+```rust
+pub fn select<T>(self, a: T, b: T) -> T
+    where T: SelectMask<Self>;
+```
+
+#### Vertical comparisons
+
+All vector types implement the following vertical (lane-wise) comparison methods
+that return a mask expressing the result:
+
+```rust
+impl {element_type}{lane_width}x{number_of_lanes} {
+/// Lane-wise equality comparison.
+pub fn eq(self, other: Self) -> m{lane_width}x{number_of_lanes};
+
+/// Lane-wise inequality comparison.
+pub fn ne(self, other: Self) -> m{lane_width}x{number_of_lanes};
+
+/// Lane-wise less-than comparison.
+pub fn lt(self, other: Self) -> m{lane_width}x{number_of_lanes};
+
+/// Lane-wise less-than-or-equals comparison.
+pub fn le(self, other: Self) -> m{lane_width}x{number_of_lanes};
+
+/// Lane-wise greater-than comparison.
+pub fn gt(self, other: Self) -> m{lane_width}x{number_of_lanes};
+
+/// Lane-wise greater-than-or-equals comparison.
+pub fn ge(self, other: Self) -> m{lane_width}x{number_of_lanes};
+}
+```
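+
+A sketch of how comparisons, masks, and `select` compose (the lane values are
+illustrative; each step follows the definitions above):
+
+```rust
+let a = i32x4::new(1, 5, 3, 7);
+let b = i32x4::new(4, 2, 6, 0);
+
+// Lane-wise greater-than produces a mask:
+let m: m32x4 = a.gt(b); // (false, true, false, true)
+
+// Branchless lane-wise maximum via select:
+assert_eq!(m.select(a, b), i32x4::new(4, 5, 6, 7));
+```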
+
+For all vector types proposed in this RFC, the `{lane_width}` of the mask
+matches that of the vector type. However, this will not be the case for the
+AVX-512 vector types.
+
+##### Semantics for floating-point numbers
+
+* `eq`: yields `true` if both operands are not a `QNAN` and `self` is equal to
+  `other`, yields `false` otherwise.
+* `gt`: yields `true` if both operands are not a `QNAN` and `self` is greater
+  than `other`, yields `false` otherwise.
+* `ge`: yields `true` if both operands are not a `QNAN` and `self` is greater
+  than or equal to `other`, yields `false` otherwise.
+* `lt`: yields `true` if both operands are not a `QNAN` and `self` is less than
+  `other`, yields `false` otherwise.
+* `le`: yields `true` if both operands are not a `QNAN` and `self` is less than
+  or equal to `other`, yields `false` otherwise.
+* `ne`: yields `true` if either operand is a `QNAN` or `self` is not equal to
+  `other`, yields `false` otherwise.
+
+### Portable vector shuffles
+
+```
+/// Shuffles vector elements.
+std::simd::shuffle!(...);
+```
+
+The `shuffle!` macro returns a new vector that contains a shuffle of the
+elements in one or two input vectors. That is, there are two versions:
+
+ * `shuffle!(vec, [indices...])`: one-vector version
+ * `shuffle!(vec0, vec1, [indices...])`: two-vector version
+
+In the two-vector version, both `vec0` and `vec1` must have the same type.
+The element type of the resulting vector is the element type of the input
+vector.
+
+The number of `indices` must be a power-of-two in the range `[2, 64]` no longer
+than two times the number of lanes in the input vector. The length of the
+resulting vector equals the number of indices provided.
+
+Given a vector with `N` lanes, the indices in range `[0, N)` refer to the `N`
+elements in the vector. In the two-vector version, the indices in range
+`[N, 2*N)` refer to elements in the second vector.
+
+#### Example: shuffles
+
+The `shuffle!` macro allows reordering the elements of a vector:
+
+```rust
+let x = i32x4::new(1, 2, 3, 4);
+let r = shuffle!(x, [2, 1, 3, 0]);
+assert_eq!(r, i32x4::new(3, 2, 4, 1));
+```
+
+where the resulting vector can also be smaller:
+
+```rust
+let r = shuffle!(x, [1, 3]);
+assert_eq!(r, i32x2::new(2, 4));
+```
+
+or larger
+
+```rust
+let r = shuffle!(x, [1, 3, 2, 2, 1, 3, 2, 2]);
+assert_eq!(r, i32x8::new(2, 4, 3, 3, 2, 4, 3, 3));
+```
+
+than the input. The length of the result must be, however, limited to the range
+`[2, 2 * vec::lanes()]`.
+
+It also allows shuffling between two vectors:
+
+```rust
+let y = i32x4::new(5, 6, 7, 8);
+let r = shuffle!(x, y, [4, 0, 5, 1]);
+assert_eq!(r, i32x4::new(5, 1, 6, 2));
+```
+
+where the indices of the second vector's elements start at the `vec::lanes()`
+offset.
+
+#### Conversions and bitcasts
+[casts-and-conversions]: #casts-and-conversions
+
+There are three different ways to convert between vector types:
+
+* `From`/`Into`: value-preserving widening-conversions between vectors with the
+  same number of lanes. That is, `f32x4` can be converted into `f64x4` using
+  `From`/`Into`, but the opposite is not true because that conversion is not
+  value preserving. The `From`/`Into` implementations mirror those of the
+  primitive integer and floating-point types. These conversions can widen the
+  size of the element type, and thus the size of the SIMD vector type. Signed
+  vector types are sign-extended lane-wise, while unsigned vector types are
+  zero-extended lane-wise. The result of these conversions is
+  endian-independent.
+
+* `as`: non-value-preserving truncating-conversions between vectors with the
+  same number of lanes. That is, `f64x4 as f32x4` performs a lane-wise `as`
+  cast, truncating the values if they would overflow the destination type. The
+  result of these conversions is endian-independent.
+
+* `unsafe mem::transmute`: bit-casts between vectors with the same size, that
+  is, the vectors do not need to have the same number of lanes. For example,
+  transmuting a `u8x16` into a `u16x8`. Note that while all bit-patterns of the
+  `{i,u,f}` vector types represent a valid vector value, there are many vector
+  mask bit-patterns that do not represent a valid mask. Note also that the
+  result of `unsafe mem::transmute` is **endian-dependent** (see examples
+  below).
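+
+A sketch of the three conversion kinds on the proposed types (the `transmute`
+result is endian-dependent when the lane width changes; the other two are
+endian-independent):
+
+```rust
+let small = f32x4::new(1.5, 2.5, 3.5, 4.5);
+
+// From/Into: lossless widening, same number of lanes:
+let wide: f64x4 = small.into();
+
+// as: lane-wise truncating cast back to the narrower type:
+let narrow = wide as f32x4;
+
+// transmute: bit-cast to an equally-sized type; lane count may change:
+let bytes: u8x16 = unsafe { std::mem::transmute(small) };
+```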
+
+It is extremely common to perform "transmute" operations between equally-sized
+portable vector types when writing SIMD algorithms. Rust currently does not have
+any facilities to express that all bit-patterns of one type are also valid
+bit-patterns of another type, and to perform these safe transmutes in an
+endian-independent way.
+
+This forces users to resort to `unsafe { mem::transmute(x) }` and, very likely,
+to write non-portable code.
+
+There is a very interesting discussion in [this internals
+thread](https://internals.rust-lang.org/t/pre-rfc-frombits-intobits/7071/23)
+about potential ways to attack this problem, and there is also an [open issue in
+`stdsimd` about endian-dependent
+behavior](https://github.com/rust-lang-nursery/stdsimd/issues/393) - if you care
+deeply about it please chime in.
+
+These issues are not specific to portable packed SIMD vector types and fixing
+them is not the purpose of this RFC, but these issues are critical for writing
+efficient and portable SIMD code reliably and ergonomically.
+
+# ABI and `std::simd`
+
+The ABI is first and foremost unspecified and may change at any time.
+
+All `std::simd` types are forbidden in `extern` functions (or warned against).
+Basically the same story as types like `__m128i` and `extern` functions.
+
+As of today, they will be implemented as pass-via-pointer unconditionally. For
+example:
+
+```rust
+fn foo(a: u32x4) { /* ... */ }
+
+foo(u32x4::splat(3));
+```
+
+This example will pass the variable `a` through memory. The function calling
+`foo` will place `a` on the stack and then `foo` will read `a` from the stack
+to work with it. Note that if `foo` changes the value of `a` this will not be
+visible to the caller: the arguments are semantically pass-by-value but
+implemented as pass-via-pointer.
+
+Currently, we aren't aware of any slowdowns or perf hits from this mechanism
+(pass through memory instead of by value). If something comes up, leaving the
+ABI unspecified allows us to try to address it.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+## Generic vector type requirement for backends
+
+The `std::arch` module provides architecture-specific vector types where
+backends only need to provide vector types for the architectures that they
+support.
+
+This RFC requires backends to provide generic vector types. Most backends
+support this in one form or another, but if a future backend does not, this RFC
+can be
+implemented on top of the architecture-specific types.
+
+## Zero-overhead requirement for backends
+
+A future architecture might have an instruction that performs multiple
+operations exposed by this API in one go, like `(a + b).sum()` on an
+`f32x4` vector. The zero-overhead requirement makes it a bug if Rust does not
+generate optimal code for this situation.
+
+This is not a performance bug that can be easily worked around in `stdsimd` or
+`rustc`, making this almost certainly a performance bug in the backend.
+
+It is reasonable to assume that every optimizing Rust backend will have a
+pattern-matching engine powerful enough to perform these
+transformations, but it is worth it to keep this requirement in mind.
+
+## Performance of this API might vary dramatically
+
+The performance of this API can vary dramatically depending on the architecture
+being targeted and the target features enabled.
+
+First, this is a consequence of portability, and thus a feature. However, that
+portability can introduce performance bugs is a real concern. In any case, if
+the user is able to write faster code for some architecture, they should file a
+performance bug.
+
+# Rationale and alternatives
+[alternatives]: #alternatives
+
+### Dynamic values result in poor code generation for some operations
+
+Some of the fundamental APIs proposed in this RFC, like `vec::{new, extract,
+store, replace}`, take run-time dynamic parameters. Consider the following
+example (see the whole example live at
+[`rust.godbolt.org`](https://godbolt.org/g/yhiAa2)):
+
+```rust
+/// Returns an f32x8 with 0., 1., 2., ..., 7.
+fn increasing() -> f32x8 {
+    let mut x = f32x8::splat(0.);
+    for i in 0..f32x8::lanes() {
+        x = x.replace(i, i as f32);
+    }
+    x
+}
+```
+
+In release mode, `rustc` generates the following assembly for this function:
+
+```asm
+.LCPI0_0:
+        .long 0
+        .long 1065353216
+        .long 1073741824
+        .long 1077936128
+        .long 1082130432
+        .long 1084227584
+        .long 1086324736
+        .long 1088421888
+example::increasing:
+        pushq %rbp
+        movq %rsp, %rbp
+        vmovaps .LCPI0_0(%rip), %ymm0
+        vmovaps %ymm0, (%rdi)
+        movq %rdi, %rax
+        popq %rbp
+        vzeroupper
+        retq
+```
+
+which uses two vector moves to get the values into a SIMD register and write
+them to the return slot - digression: these two moves are due to Rust's SIMD
+vector types ABI and happen only in "isolated" examples.
+
+If we change this function to accept run-time bounds for the loop
+
+```rust
+/// Returns an f32x4::splat(0.) with the elements in [a, b) initialized
+/// with an increasing sequence.
+fn increasing_rt(a: usize, b: usize) -> f32x4 {
+    let mut x = f32x4::splat(0.);
+    for i in a..b {
+        x = x.replace(i, i as f32);
+    }
+    x
+}
+```
+
+then the number of instructions generated explodes:
+
+```asm
+example::increasing_rt:
+        pushq %rbp
+        movq %rsp, %rbp
+        andq $-32, %rsp
+        subq $320, %rsp
+        vxorps %xmm0, %xmm0, %xmm0
+        cmpq %rsi, %rdx
+        jbe .LBB1_34
+        movl %edx, %r9d
+        subl %esi, %r9d
+        leaq -1(%rdx), %r8
+        subq %rsi, %r8
+        andq $7, %r9
+        je .LBB1_2
+        negq %r9
+        vxorps %xmm0, %xmm0, %xmm0
+        movq %rsi, %rcx
+.LBB1_4:
+        testq %rcx, %rcx
+        js .LBB1_5
+        vcvtsi2ssq %rcx, %xmm2, %xmm1
+...200 lines more...
+```
+
+This code isn't necessarily horrible, but it is definitely harder to reason
+about its performance. This has two main causes:
+
+* **ISAs do not support these operations**: most (all?) ISAs support operations
+  like `extract`, `store`, and `replace` with constant indices only. That is,
+  these operations do not map to single instructions on most ISAs.
+
+* **these operations are slow**: even for constant indices, these operations are
+  slow. Often, for each constant index, a different instruction must be
+  generated, and occasionally, for a particular constant index, the operation
+  requires multiple instructions (see the sketch after this list).
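+
+A sketch of the rule of thumb that follows from this (assuming the `f32x4`
+API above): code that needs many lane values with run-time indices should go
+through memory rather than extracting lanes one by one:
+
+```rust
+let v = f32x4::new(1., 2., 3., 4.);
+
+// Potentially slow: one `extract` per lane, each with a run-time index.
+let mut total = 0.0_f32;
+for i in 0..f32x4::lanes() {
+    total += v.extract(i);
+}
+
+// Usually better: a single store, then a scalar loop over the array.
+let mut buf = [0.0_f32; 4];
+v.store_unaligned(&mut buf);
+let total2: f32 = buf.iter().sum();
+```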
+
+So we have a trade-off to make between providing a comfortable API for programs
+that really must extract a single value with a run-time index, and providing an
+API that provides "reliable" performance.
+
+The proposed API accepts run-time indices (and values for `new`):
+
+* **common** SIMD code indexes with compile-time indices: this code gets optimized
+  reasonably well with the LLVM backend, but the user needs to deal with the
+  safe-but-checked and `unsafe`-but-unchecked APIs. If we were to only accept
+  constant indices, the unchecked API would not be necessary, since the checked
+  API would ensure that the indices are in-bounds at compile-time.
+
+* **rare** SIMD code indexes with run-time indices: this is code that one should
+  really avoid writing. The current API makes writing this code extremely easy,
+  resulting in SIMD code with potentially unexpected performance. Users also
+  have to deal with two APIs for this, the checked/unchecked APIs, and
+  also, the memory `load`/`store` APIs that are better suited for this use case.
+
+Whether the current design is the right design should probably be clarified
+during the RFC. An important aspect to consider is that Rust support for
+`const`ants is very basic: `const fn`s are getting started, `const` generics are
+not there yet, etc. That is, making the API take constant indices might severely
+limit the type of code that can be used with these APIs in today's Rust.
+
+### Binary (vector, scalar) and (scalar, vector) operations
+
+This RFC can be extended with binary vector-scalar and scalar-vector operations
+by implementing the following traits for signed integer, unsigned integer, and
+floating-point vectors:
+
+* `{Add,Sub,Mul,Div,Rem}<element_type>`,
+  `{Add,Sub,Mul,Div,Rem}<vector_type> for
+  {element_type}`, `{Add,Sub,Mul,Div,Rem}Assign<element_type>`: binary
+  scalar-vector vertical (lane-wise) arithmetic operations.
+
+and the following traits for signed and unsigned integer vectors:
+
+* `Bit{And,Or,Xor}<element_type>`,
+  `Bit{And,Or,Xor}<vector_type> for {element_type}`,
+  `Bit{And,Or,Xor}Assign<element_type>`: binary scalar-vector vertical
+  (lane-wise) bitwise operations.
+
+* `{Shl,Shr}<I>`, `{Shl,Shr}Assign<I>`: for all integer types `I` in
+  {`i8`, `i16`, `i32`, `i64`, `i128`, `isize`, `u8`, `u16`, `u32`, `u64`,
+  `u128`, `usize`}. Note: whether only `element_type` or all integer types
+  should be allowed is debatable: `stdsimd` currently allows using all integer
+  types.
+
+These traits slightly improve the ergonomics of scalar-vector operations:
+
+```rust
+let x: f32x4;
+let y: f32x4;
+let a: f32;
+let z = a * x + y;
+// instead of: z = f32x4::splat(a) * x + y;
+x += a;
+// instead of: x += f32x4::splat(a);
+```
+
+but they do not enable anything new that can't easily be done without them
+by just using `vec::splat`, and initial feedback on the RFC suggested that they
+are an abstraction that hides the cost of splatting the scalar into the vector.
+
+These traits are implemented in `stdsimd` (and thus available in nightly Rust),
+are trivial to implement (`op(vec_ty::splat(scalar), vec)` and `op(vec,
+vec_ty::splat(scalar))`), and cannot be "seamlessly" provided by users due to
+coherence.
+
+They are not part of this RFC, but they can be easily added (now or later) if
+there is consensus to do so. In the meantime, they can be experimented with on
+nightly Rust. If there is consensus to remove them, porting nightly code off
+these is also pretty easy.
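+
+For reference, a minimal sketch of how one such implementation can be written
+in terms of `splat` (this mirrors the `op(vec, vec_ty::splat(scalar))` pattern
+mentioned above; it is not the literal `stdsimd` source):
+
+```rust
+use std::ops::Add;
+
+// (vector, scalar) addition: splat the scalar, then reuse the
+// vertical (vector, vector) addition.
+impl Add<f32> for f32x4 {
+    type Output = f32x4;
+    fn add(self, rhs: f32) -> f32x4 {
+        self + f32x4::splat(rhs)
+    }
+}
+```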
+
+### Tiny vector types
+
+Most platforms' SIMD registers have a constant width, and they can be used to
+operate on vectors with a smaller bit width. However, 16- and 32-bit wide
+vectors are "small" by most platforms' standards.
+
+These types are useful for performing SIMD Within A Register (SWAR) operations
+on platforms without SIMD registers. While their performance has not been
+extensively investigated in `stdsimd` yet, any performance issues are
+performance bugs that should be fixed.
+
+### Portable shuffles API
+
+The portable shuffles are exposed via the `shuffle!` macro. Generating the
+sequence of instructions required to perform a shuffle requires the shuffle
+indices to be known at compile time.
+
+In the future, an alternative API based on `const`-generics and/or
+`const`-function-arguments could be added in a backwards-compatible way:
+
+```rust
+impl {element_type}{element_width}x{number_of_lanes} {
+    pub fn shuffle<R>(self, const indices: [isize; N])
+        -> <R as ShuffleResult<Self>>::ShuffleResultType
+        where R: ShuffleResult<Self>;
+}
+```
+
+Offering this same API today is doable:
+
+```rust
+impl {element_type}{element_width}x{number_of_lanes} {
+    #[rustc_const_argument(2)] // specifies that indices must be a const
+    #[rustc_platform_intrinsic(simd_shuffle2)]
+    // ^^^ specifies that this method should be treated as the
+    //     "platform-intrinsic" "simd_shuffle2"
+    pub fn shuffle2<R>(self, other: Self, indices: R)
+        -> <R as ShuffleResult<Self>>::ShuffleResultType
+        where R: ShuffleResult<Self>;
+
+    #[rustc_const_argument(1)]
+    #[rustc_platform_intrinsic(simd_shuffle1)]
+    pub fn shuffle<R>(self, indices: R)
+        -> <R as ShuffleResult<Self>>::ShuffleResultType
+        where R: ShuffleResult<Self>;
+}
+```
+
+If there is consensus for it the RFC can be easily amended.
+
+# Prior art
+[prior-art]: #prior-art
+
+All of this is implemented in `stdsimd` and can be used on nightly today via the
+`std::simd` module. The `stdsimd` crate is an effort started by @burntsushi to
+put the `rust-lang-nursery/simd` crate into a state suitable for stabilization.
+The `rust-lang-nursery/simd` crate was mainly developed by @huonw and IIRC it is
+heavily-inspired by Dart's SIMD which is where the `f32x4` naming scheme
+comes from. This RFC has been heavily inspired by Dart, and two of the three
+examples used in the motivation come from the [Using SIMD in
+Dart](https://www.dartlang.org/articles/dart-vm/simd) article written by John
+McCutchan.
+
+# Unresolved questions
+[unresolved]: #unresolved-questions
+
+### Interaction with scalable vectors
+
+The vector types proposed in this RFC are packed, that is, their size is fixed
+at compile-time.
+
+Many modern architectures support vector operations of run-time size, often
+called scalable vectors or "Cray vectors". These include, amongst others, the
+NEC SX, ARM SVE, and RISC-V vector architectures. These architectures have
+traditionally relied on auto-vectorization combined with support for explicit
+vectorization annotations, but newer architectures like ARM SVE and RISC-V
+introduce explicit vectorization intrinsics.
+
+This is an example adapted from this [ARM SVE
+paper](https://developer.arm.com/hpc/arm-scalable-vector-extensions-and-application-to-machine-learning)
+to pseudo-Rust:
+
+```rust
+/// Adds `c` to every element of the slice `src` storing the result in `dst`.
+fn add_constant(dst: &mut [f64], src: &[f64], c: f64) {
+    assert!(dst.len() == src.len());
+
+    // Instantiate a dynamic vector (f64xN) with all lanes set to c:
+    let vc: f64xN = f64xN::splat(c);
+
+    // The number of lanes that each iteration of the loop can process
+    // is unknown at compile-time (f64xN::lanes() is evaluated at run-time):
+    for i in (0..src.len()).step_by_with(f64xN::lanes()) {
+
+        // Instantiate a dynamic boolean vector with the
+        // result of the predicate: `i + lane < src.len()`.
+        // This boolean vector acts as a mask, so that elements
+        // "in-bounds" of the slice `src` are initialized to `true`,
+        // while out-of-bounds elements contain `false`:
+        let m: bxN = f64xN::while_lt(i, src.len());
+
+        // Load the elements of the source using the mask:
+        let vsrc: f64xN = f64xN::load(m, &src[i..]);
+
+        // Add the constant to the vector using the mask:
+        let vdst: f64xN = vsrc.add(m, vc);
+
+        // Store the result back to memory using the mask:
+        vdst.store_unaligned(m, &mut dst[i..]);
+    }
+}
+```
+
+RISC-V proposes a model similar in spirit, but not identical to, the ARM SVE
+one. It would not be surprising if other popular architectures offered similar
+but not necessarily identical explicit vectorization models for scalable
+vectors in the future.
+
+The main differences between scalable and portable vectors are that:
+
+* the number of lanes of scalable vectors is a run-time dynamic value
+* the scalable vector "objects" are like magical compiler token values
+* the induction loop variable must be incremented by the dynamic number of lanes
+  of the vector type
+* most scalable vector operations require a mask indicating which elements of
+  the vector the operation applies to
+
+These differences will probably force the API of scalable vector types to be
+slightly different than that of packed vector types.
+
+The current RFC, therefore, assumes no interaction with scalable vector types.
+
+It does not prevent portable scalable vector types from being added to Rust in
+the future via an orthogonal API, nor does it prevent adding a way to interact
+between both of them (e.g. through memory). But at this point in time whether
+these things are possible is an open research problem.
+
+### Half-float support
+
+Many architectures (ARM, AArch64, PowerPC, MIPS, RISC-V) support half-float
+(`f16`) vector types. It is unclear what to do with these at this point in time
+since Rust currently lacks language support for half-floats.
+
+### AVX-512 and m1xN masks support
+
+Currently, `std::arch` provides very limited AVX-512 support and the prototype
+implementation of the `m1xN` masks like `m1x64` in `stdsimd` implements them as
+512-bit wide vectors when they actually should only be 64-bit wide.
+
+Finishing the implementation of these types requires work that just has not been
+done yet.
+
+### Fast math
+
+The performance of the portable operations can in some cases be significantly
+improved by making assumptions about the kind of arithmetic that is allowed.
+
+For example, some of the horizontal reductions benefit from assuming math to be
+finite (no `NaN`s) and others from assuming math to be associative (e.g. it
+allows tree-like reductions for sums).
+
+A future RFC could add more reduction variants with different requirements and
+performance characteristics, for example, `.sum_unordered()` or
+`.hmax_nanless()`, but these are not considered in this RFC because
+their interaction with fast-math is unclear.
+
+A potentially better idea would be to allow users to specify the assumptions
+that an optimizing compiler can make about floating-point arithmetic in a finer
+grained way.
+
+For example, we could design an `#[fp_math]` attribute usable at, for example,
+crate, module, function, and block scope, so that users can exactly specify
+which IEEE 754 restrictions the compiler is allowed to lift where:
+
+```rust
+fn foo(x: f32x4, y: f32x4) -> f32x4 {
+    let (w, z) =
+        #[fp_math(assume = "associativity")] {
+            // All fp math is associative, reductions can be unordered:
+            let w = x.sum();
+            let z = y.sum();
+            (w, z)
+        };
+
+    let m = (w + z) * (x + y);
+
+    #[fp_math(assume = "finite")] {
+        // All fp math is assumed finite, the reduction can assume NaNs
+        // aren't present:
+        m.hmax()
+    }
+}
+```
+
+There are obviously many approaches to this problem, but it does make sense to
+have a plan for them before workarounds start getting bolted into RFCs like
+this one.
+
+### Endian-dependent behavior
+
+The results of the indexed operations (`extract`, `replace`, `store`), and the
+`new` method are endian-independent. That is, the following example is
+guaranteed to pass on little-endian (LE) and big-endian (BE) architectures:
+
+```rust
+let v = i32x4::new(0, 1, 2, 3);
+assert_eq!(v.extract(0), 0); // OK in LE and BE
+assert_eq!(v.extract(3), 3); // OK in LE and BE
+```
+
+The result of bit-casting two equally-sized vectors using `mem::transmute` is,
+however, endian-dependent:
+
+```rust
+let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+let t: i16x8 = unsafe { mem::transmute(x) }; // UNSAFE
+if cfg!(target_endian = "little") {
+    let t_el = i16x8::new(256, 770, 1284, 1798, 2312, 2826, 3340, 3854);
+    assert_eq!(t, t_el); // OK in LE | (would) ERROR in BE
+} else if cfg!(target_endian = "big") {
+    let t_eb = i16x8::new(1, 515, 1029, 1543, 2057, 2571, 3085, 3599);
+    assert_eq!(t, t_eb); // OK in BE | (would) ERROR in LE
+}
+```
+
+which applies to memory loads and stores as well:
+
+```rust
+let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+let mut y: [i16; 8] = [0; 8];
+x.store_unaligned(unsafe {
+    slice::from_raw_parts_mut(&mut y as *mut _ as *mut i8, 16)
+});
+
+if cfg!(target_endian = "little") {
+    let e: [i16; 8] = [256, 770, 1284, 1798, 2312, 2826, 3340, 3854];
+    assert_eq!(y, e);
+} else if cfg!(target_endian = "big") {
+    let e: [i16; 8] = [1, 515, 1029, 1543, 2057, 2571, 3085, 3599];
+    assert_eq!(y, e);
+}
+
+let z = i8x16::load_unaligned(unsafe {
+    slice::from_raw_parts(&y as *const _ as *const i8, 16)
+});
+assert_eq!(z, x);
+```

From 56d66079f1aed85f13b52576f7f4f4b1951aba59 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Thu, 22 Mar 2018 16:35:04 +0100
Subject: [PATCH 02/17] clarify the naming convention for horizontal
 reductions, or lack thereof

---
 text/0000-ppv.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 215d8f68279..00474fd7f3c 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -177,7 +177,7 @@ Operations on vector types can be either:
   For example, `a.sum()` adds the elements of a vector together while `a.hmax()`
   returns the largest element in a vector. These operations (typically)
   translate to a sequence of multiple SIMD instructions on most architectures
-  and are therefore slower. In many cases, they are, however, necessary. 
+  and are therefore slower. In many cases, they are, however, necessary.
 
 ## Example: Average
 
@@ -655,6 +655,15 @@ pub fn hmin(self) -> element_type;
 }
 ```
 
+Note: In this RFC, horizontal reductions are named according to the operation
+they perform. When this name clashes with that of a vertical operation, the
+horizontal reduction's name gets an `h`-prefix. In this case, the horizontal
+`max` and `min` reductions clash with the vertical `a.max(b)` operation, and
+therefore get an `h`-prefix.
+
+An alternative would be to prefix all horizontal operations with, for example,
+the `h_`-prefix.
+
 #### Mask construction and element access

From 9a8aac413ce98887a9a5b01074d25c00f5bd5209 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Thu, 22 Mar 2018 17:35:26 +0100
Subject: [PATCH 03/17] propose sane arithmetic and shift operation semantics

---
 text/0000-ppv.md | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 00474fd7f3c..9460821035e 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -512,6 +512,48 @@ pub unsafe fn load_unaligned_unchecked(slice: &[element_type]) -> Self;
 }
 ```
 
+#### Vertical arithmetic operations
+
+Vertical (lane-wise) arithmetic operations are provided by the following trait
+implementations:
+
+* All signed integer, unsigned integer, and floating-point vector types implement:
+
+  * `{Add,Sub,Mul,Div,Rem}`
+  * `{Add,Sub,Mul,Div,Rem}Assign`
+
+* All signed and unsigned integer vectors also implement:
+
+  * `{Shl,Shr}`, `{Shl,Shr}Assign`: vertical
+    (lane-wise) bit-shift operations.
+
+##### Integer vector semantics
+
+The behavior of these operations for integer vectors is the same as that of the
+scalar integer types. That is: `panic!` on both overflow and division by zero.
+
+##### Floating-point semantics
+
+The behavior of these operations for floating-point numbers is the same as that
+of the scalar floating-point types, that is, `+-INFINITY` on overflow, `NaN` on
+division by zero, etc.
+
+#### Wrapping arithmetic operations
+
+All signed and unsigned integer vector types implement the whole set of `pub fn
+wrapping_{add,sub,mul,div,rem}(self, Self) -> Self` methods which, on overflow,
+produce the correct mathematical result modulo `2^n`.
+
+The `div` and `rem` methods `panic!` on division by zero.
+
+#### Unsafe wrapping arithmetic operations
+
+All signed and unsigned integer vectors implement
+`pub unsafe fn wrapping_{div,rem}_unchecked(self, Self) -> Self`
+methods which, on overflow, produce the correct mathematical result modulo `2^n`.
+
+If any of the vector elements is divided by zero the behavior is undefined.
+
 #### Binary minmax vertical operations

From 64a3ce62440a32e8a6a21ea00ed9ea4ea1d5cdc8 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Thu, 22 Mar 2018 18:00:09 +0100
Subject: [PATCH 04/17] more prior art, typo fixes,...

---
 text/0000-ppv.md | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 9460821035e..15ca592b88d 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -554,7 +554,7 @@ methods which, on overflow, produce the correct mathematical result modulo `2^n`
 
 If any of the vector elements is divided by zero the behavior is undefined.
 
-#### Binary minmax vertical operations
+#### Binary `min`/`max` vertical operations
 
 All portable signed integer, unsigned integer, and floating-point vectors
 implement the following methods:
@@ -1244,15 +1244,21 @@ If there is consensus for it the RFC can be easily amended.
 # Prior art
 [prior-art]: #prior-art
 
-All of this is implemented in `stdsimd` and can be used on nightly today via the
-`std::simd` module. The `stdsimd` crate is an effort started by @burntsushi to
-put the `rust-lang-nursery/simd` crate into a state suitable for stabilization.
-The `rust-lang-nursery/simd` crate was mainly developed by @huonw and IIRC it is
-heavily-inspired by Dart's SIMD which is where the `f32x4` naming scheme
-comes from. This RFC has been heavily inspired by Dart, and two of the three
-examples used in the motivation come from the [Using SIMD in
-Dart](https://www.dartlang.org/articles/dart-vm/simd) article written by John
-McCutchan.
+Most of this is implemented in `stdsimd` and can be used on nightly today via
+the `std::simd` module. The `stdsimd` crate is an effort started by @burntsushi
+to put the `rust-lang-nursery/simd` crate into a state suitable for
+stabilization. The `rust-lang-nursery/simd` crate was mainly developed by @huonw
+and IIRC it is heavily-inspired by Dart's SIMD, which is where the `f32x4`
+naming scheme comes from. This RFC has been heavily inspired by Dart, and two of
+the three examples used in the motivation come from the [Using SIMD in
+Dart](https://www.dartlang.org/articles/dart-vm/simd) article written by John
+McCutchan. Some of the key ideas of this RFC come from LLVM's design, which was
+originally inspired by GCC's vector extensions, which was probably inspired by
+something else.
+
+Or in other words: to the author's best knowledge, this RFC does not contain any
+really novel ideas. Instead, it only draws inspiration from previous designs
+that have withstood the test of time, and it adapts these designs to Rust.
 
 # Unresolved questions
 [unresolved]: #unresolved-questions

From 8b69e2a946e66c80b46d3c5ab154987a62b49dd0 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Thu, 22 Mar 2018 18:11:54 +0100
Subject: [PATCH 05/17] ocd

---
 text/0000-ppv.md | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 15ca592b88d..daa8f5c2297 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -174,10 +174,11 @@ Operations on vector types can be either:
   `false` for each of the vector lanes. Most vertical operations are binary operations (they take two input vectors). These operations are typically very fast on most architectures and they are the most widely used in practice.
 
 * **horizontal**: that is, along a single vector - they are unary operations.
-  For example, `a.sum()` adds the elements of a vector together while `a.hmax()`
-  returns the largest element in a vector. These operations (typically)
-  translate to a sequence of multiple SIMD instructions on most architectures
-  and are therefore slower. In many cases, they are, however, necessary.
+  For example, `a.sum()` adds the elements of a vector together while
+  `a.hmax()` returns the largest element in a vector. These operations
+  (typically) translate to a sequence of multiple SIMD instructions on most
+  architectures and are therefore slower. In many cases, they are, however,
+  necessary.
 
 ## Example: Average
 
@@ -403,7 +404,6 @@ implement the following methods:
 
 ```rust
 impl {element_type}{lane_width}x{number_of_lanes} {
-
 /// Creates a new instance of the vector from `number_of_lanes`
 /// values.
 pub const fn new(args...: element_type) -> Self;
@@ -448,7 +448,6 @@ All portable vector types implement the following methods:
 
 ```rust
 impl {element_type}{lane_width}x{number_of_lanes} {
-
 /// Writes the values of the vector to the `slice`.
 ///
 /// # Panics
@@ -595,7 +594,6 @@ methods:
 
 ```rust
 impl {element_type}{lane_width}x{number_of_lanes} {
-
 /// Horizontal wrapping sum of the vector elements.
 ///
 /// The intrinsic performs a tree-reduction of the vector elements.
@@ -626,7 +624,6 @@ All portable floating-point vector types implement the following methods:
 
 ```rust
 impl {element_type}{lane_width}x{number_of_lanes} {
-
 /// Horizontal sum of the vector elements.
 ///
 /// The intrinsic performs a tree-reduction of the vector elements.
@@ -996,7 +993,7 @@ implemented on top of the architecture specific types.
 ## Zero-overhead requirement for backends
 
 A future architecture might have an instruction that performs multiple
-operations exposed by this API in one go, like `(a + b).sum()` on an
+operations exposed by this API in one go, like `(a + b).wrapping_sum()` on an
 `f32x4` vector. The zero-overhead requirement makes it a bug if Rust does not
 generate optimal code for this situation.

From 422a4547be6c9537ebd6f1e4ab14c51c554b7e63 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Fri, 23 Mar 2018 10:05:56 +0100
Subject: [PATCH 06/17] incorporate feedback from rkruppe and hsivonen

---
 text/0000-ppv.md | 88 ++++++++++++++++++++++++++----------------------
 1 file changed, 48 insertions(+), 40 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index daa8f5c2297..ab760d352aa 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -529,7 +529,8 @@ implementations:
 ##### Integer vector semantics
 
 The behavior of these operations for integer vectors is the same as that of the
-scalar integer types. That is: `panic!` on both overflow and division by zero.
+scalar integer types. That is: `panic!` on both overflow and division by zero in
+debug mode.
 
 ##### Floating-point semantics
 
@@ -543,7 +544,7 @@ All signed and unsigned integer vector types implement the whole set of `pub fn
 wrapping_{add,sub,mul,div,rem}(self, Self) -> Self` methods which, on overflow,
 produce the correct mathematical result modulo `2^n`.
 
-The `div` and `rem` methods `panic!` on division by zero.
+The `div` and `rem` methods `panic!` on division by zero in debug mode.
 
@@ -559,7 +560,7 @@ All portable signed integer, unsigned integer, and floating-point vectors
 implement the following methods:
 
 ```rust
-impl {element_type}{lane_width}x{number_of_lanes} {
+impl {element_type}{lane_width}x{number_of_lanes} {
 /// Lane-wise `min`.
 ///
 /// Returns a vector whose lanes contain the smallest
@@ -828,19 +829,8 @@ AVX-512 vector types.
 
 ##### Semantics for floating-point numbers
 
-* `eq`: yields `true` if both operands are not a `QNAN` and `self` is equal to
- `other`, yields `false` otherwise.
-* `gt`: yield `true` if both operands are not a `QNAN` and ``self`` is greater
- than `other`, yields `false` otherwise.
-* `ge`: yields `true` if both operands are not a `QNAN` and `self` is greater
- than or equal to `other`, yields `false` otherwise.
-* `lt`: yields `true` if both operands are not a `QNAN` and `self` is less than
- `other`, yields `false` otherwise.
-* `le`: yields `true` if both operands are not a `QNAN` and `self` is less than
- or equal to `other`, yields `false` otherwise.
-* `ne`: yields `true` if either operand is a `QNAN` or `self` is not equal to
- `other`, yields `false` otherwise.
-
+The semantics of the lane-wise comparisons for floating-point numbers are the
+same as in the scalar case.
 
 ### Portable vector shuffles
 
 ```rust
 std::simd::shuffle!(...);
 ```
 
 The `shuffle!` macro returns a new vector that contains a shuffle of the elements in
-one or two input vectors. That is, there are two versions:
+one or two input vectors. There are two versions:
+
+ * `shuffle!(vec, indices)`: one-vector version
+ * `shuffle!(vec0, vec1, indices)`: two-vector version
+
+with the following preconditions:
+
+ * `vec`, `vec0`, and `vec1` must be portable packed SIMD vector types.
+ * `vec0` and `vec1` must have the same type.
+ * `indices` must be a `const` array of type `[usize; N]` where `N` is any
+   power-of-two in range `(0, 2 * {vec,vec0,vec1}::lanes()]`.
+ * the values of `indices` must be in range `[0, N)` for the one-vector version,
+   and in range `[0, 2N)` for the two-vector version.
+
+On precondition violation a type error is produced.
 
- * `shuffle!(vec, [indices...])`: one-vector version
- * `shuffle!(vec0, vec1, [indices...])`: two-vector version
+The macro returns a new vector whose:
 
-In the two-vector version, both `vec0` and `vec1` must have the same type.
-The element type of the resulting vector is the element type of the input
-vector.
+* element type equals that of the input vectors,
+* length equals `N`, that is, the length of the `indices` array.
 
-The number of `indices` must be a power-of-two in range `[0, 64]` no longer
-than two times the number of lanes in the input vector. The length of the
-resulting vector equals the number of indices provided.
+The `i`-th element of `indices` with value `j` in range `[0, N)` stores the
+`j`-th element of the first vector into the `i`-th element of the result vector.
 
-Given a vector with `N` lanes, the indices in range `[0, N)` refer to the `N` elements in the vector. In the two-vector version, the indices in range `[N, 2*N)` refer to elements in the second vector.
+In the two-vector version, the `i`-th element of `indices` with value `j` in
+range `[N, 2N)` stores the `j - N`-th element of the second vector into the
+`i`-th element of the result vector.
 
 #### Example: shuffles
 
@@ -990,19 +993,21 @@ This RFC requires backends to provide generic vector types. Most backends support
 this in one form or another, but if one future backend does not, this RFC can be
 implemented on top of the architecture-specific types.
 
-## Zero-overhead requirement for backends
+## Achieving zero-overhead is outside Rust's control
 
 A future architecture might have an instruction that performs multiple
 operations exposed by this API in one go, like `(a + b).wrapping_sum()` on an
-`f32x4` vector. The zero-overhead requirement makes it a bug if Rust does not
-generate optimal code for this situation.
+`f32x4` vector. If that expression does not produce optimal machine code, Rust
+has a performance bug.
 
 This is not a performance bug that can be easily worked around in `stdsimd` or
-`rustc`, making this almost certainly a performance bug in the backend.
+`rustc`, making this, almost certainly, a performance bug in the backend. These
+performance bugs can be arbitrarily hard to fix, and fixing these might not
+always be worth it.
 
-It is reasonable to assume that every optimizing Rust backed will have a
-pattern-matching engine powerful enough to perform these
-transformations, but it is worth it to keep this requirement in mind.
+That is, while these APIs should make it possible for reasonably-designed
+optimizing Rust backends to achieve zero-overhead, zero-overhead can only be
+provided in practice on a best-effort basis.
 
 ## Performance of this API might vary dramatically

From 3bf5d2c6ef3ef25603c90ab61376421bf80618ab Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Fri, 23 Mar 2018 10:19:41 +0100
Subject: [PATCH 07/17] add more vector masks explanations to the guide-level
 explanation

---
 text/0000-ppv.md | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index ab760d352aa..89faa20a604 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -165,13 +165,33 @@ vector with four `f32` lanes. Here:
 
 * **lane width**: the bit width of a vector lane, that is, the bit width of the
   objects stored in the vector. For example, the type `f32` is 32-bits wide.
+
+That is, the `m8x4` type is a 32-bit wide vector mask with 4 lanes containing an
+8-bit wide mask each. Vector masks are mainly used to select the lanes on which
+vector operations are performed. When a lane has all of its bits set to `true`,
+that lane is "selected", and when a lane has all of its bits set to `false`,
+that lane is "not selected". The following bit pattern is thus a valid
+bit-pattern for the `m8x4` mask:
+
+> 00000000_11111111_00000000_11111111
+
+and it selects two eight-bit wide lanes from a 32-bit wide vector type with four
+lanes. The following bit-pattern is not, however, a valid value of the same mask
+type:
+
+> 00000000_11111111_00000000_11110111
+
+because it does not satisfy the invariant that all bits of a lane must be
+either set or cleared.
+
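+The following sketch shows how such masks are built in code. It assumes the
+`bool`-based constructor and the `extract` accessor that the mask API proposes
+later in this RFC:
+
+```rust
+// Each `bool` sets or clears all of the bits of one lane, so an
+// invalid mask bit-pattern can never be constructed this way:
+let m = m8x4::new(false, true, false, true);
+assert_eq!(m.extract(0), false);
+assert_eq!(m.extract(1), true);
+```
+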
 Operations on vector types can be either:
 
 * **vertical**: that is, lane-wise. For example, `a + b` adds each lane of `a`
   with the corresponding lane of `b`, while `a.lt(b)` returns a boolean mask
   that indicates whether the less-than (`<`, `lt`) comparison returned `true` or
-  `false` for each of the vector lanes. Most vertical operations are binary operations (they take two input vectors). These operations are typically very fast on most architectures and they are the most widely used in practice.
+  `false` for each of the vector lanes. Most vertical operations are binary
+  operations (they take two input vectors). These operations are typically very
+  fast on most architectures and they are the most widely used in practice.
 
 * **horizontal**: that is, along a single vector - they are unary operations.
   For example, `a.sum()` adds the elements of a vector together while
@@ -268,7 +288,9 @@ even elements of a vector with a scalar:
 
 ```rust
 fn mul_even(a: f32, x: f32x4) -> f32x4 {
-  // Create a mask for the even elements 0 and 2:
+  // Create a vector mask for the even elements 0 and 2.
+  // The vector mask API uses `bool`s to set or clear
+  // all bits of a lane:
   let m = m32x4::new(true, false, true, false);
 
   // Perform a full multiplication

From 1762b5806ed96c44cdf4bf673f9f70cbd8a1e7 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Fri, 23 Mar 2018 16:12:18 +0100
Subject: [PATCH 08/17] incorporate latest rkruppe feedback

---
 text/0000-ppv.md | 32 +++++++++++++------------------
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 89faa20a604..8dfb79205a0 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -551,8 +551,8 @@ implementations:
 ##### Integer vector semantics
 
 The behavior of these operations for integer vectors is the same as that of the
-scalar integer types. That is: `panic!` on both overflow and division by zero in
-debug mode.
+scalar integer types. That is: `panic!` on both overflow and division by zero if
+`-C overflow-checks=on`.
 
 ##### Floating-point semantics
 
@@ -600,13 +600,7 @@ methods:
 ##### Floating-point semantics
 
 The floating-point semantics follow the semantics of `min` and `max` for the
-scalar `f32` and `f64` types. That is:
-
-If either operand is a `NaN`, returns the other non-NaN operand. Returns `NaN`
-only if both operands are `NaN`. If the operands compare equal, returns a value
-that compares equal to both operands. This means that `min(+/-0.0, +/-0.0)`
-could return either `-0.0` or `0.0`. Otherwise, `min` and `max` return the
-smallest and largest operand, respectively.
+scalar `f32` and `f64` types.
 
 #### Arithmetic reductions
 
@@ -1287,13 +1281,13 @@ that have withstood the test of time, and it adapts these designs to Rust.
 # Unresolved questions
 [unresolved]: #unresolved-questions
 
-### Interaction with scalable vectors
+### Interaction with Cray vectors
 
 The vector types proposed in this RFC are packed, that is, their size is fixed
 at compile-time.
 
 Many modern architectures support vector operations of run-time size, often
-called scalable Vectors or scalable vectors. These include, amongst others, NecSX,
-ARM SVE, RISC-V Vectors. These architectures have traditionally relied on
-auto-vectorization combined with support for explicit vectorization annotations,
-but newer architectures like ARM SVE and RISC-V introduce explicit vectorization
-intrinsics.
+called Cray vectors or scalable vectors. These include, amongst others,
+NecSX, ARM SVE, and RISC-V's Vector Extension Proposal.
+These architectures have
+traditionally relied on auto-vectorization combined with support for explicit
+vectorization annotations, but newer architectures like ARM SVE introduce
+explicit vectorization intrinsics.
 
 This is an example adapted from this [ARM SVE
 paper](https://developer.arm.com/hpc/arm-scalable-vector-extensions-and-application-to-machine-learning)
 
@@ -1307,8 +1312,11 @@ fn add_constant(dst: &mut [f64], src: &[f64], c: f64) {
 }
 ```
 
-RISC-V proposes a model similar in spirit, but not identical to the ARM SVE one.
-It would not be surprising if other popular architectures offered similar but not necessarily identical explicit vectorization models for scalable vectors in the future.
+The RISC-V vector extension proposal introduces a model similar in spirit to ARM
+SVE. These extensions are, however, not official yet, and it is currently
+unknown whether GCC and LLVM will expose explicit intrinsics for them. It would
+not be surprising if they do, and it would not be surprising if similar Cray
+vector extensions are introduced in other architectures in the future.
 
-The main differences between scalable and portable vectors are that:
+The main differences between Cray vectors and portable vectors are that:
 
-* the number of lanes of scalable vectors is a run-time dynamic value
-* the scalable vector "objects" are like magical compiler token values
+* the number of lanes of Cray vectors is a run-time dynamic value
+* the Cray vector "objects" are like magical compiler token values
 * the induction loop variable must be incremented by the dynamic number of
   lanes of the vector type
-* most scalable vector operations require a mask indicating which elements of
+* most Cray vector operations require a mask indicating which elements of
   the vector the operation applies to
 
-These differences will probably force the API of scalable vector types to be
+These differences will probably force the API of Cray vector types to be
 slightly different than that of packed vector types.
 
-The current RFC, therefore, assumes no interaction with scalable vector types.
+The current RFC, therefore, assumes no interaction with Cray vector types.
 
-It does not prevent for portable scalable vector types to be added to Rust in
+It does not prevent portable Cray vector types from being added to Rust in
 the future via an orthogonal API, nor does it prevent adding a way to interact
 between both of them (e.g. through memory). But at this point in time, whether
 these things are possible is an open research problem.

From ec0810e63310486eb04754a116d5cd16b612a305 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Thu, 29 Mar 2018 09:26:11 +0200
Subject: [PATCH 09/17] formatting

---
 text/0000-ppv.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 8dfb79205a0..94e58944990 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -351,10 +351,10 @@ the `std::simd` module, that is:
 
 * 32-bit wide vectors: `i8x4`, `u8x4`, `m8x4`, `i16x2`, `u16x2`, `m16x2`
 * 64-bit wide vectors: `i8x8`, `u8x8`, `m8x8`, `i16x4`, `u16x4`, `m16x4`,
   `i32x2`, `u32x2`, `f32x2`, `m32x2`
-* 128-bit wide vectors: `i8x16`, u8x16`, `m8x16`, `i16x8`, `u16x8`, `m16x8`,
+* 128-bit wide vectors: `i8x16`, `u8x16`, `m8x16`, `i16x8`, `u16x8`, `m16x8`,
   `i32x4`, `u32x4`, `f32x4`, `m32x4`, `i64x2`, `u64x2`, `f64x2`, `m64x2`
-* 256-bit wide vectors: `i8x32`, u8x32`, m8x32`, i16x16`, u16x16`, m16x16`,
-  i32x8`, u32x8`, f32x8`, m32x8`, i64x4`, u64x4`, f64x4`, `m64x4`
+* 256-bit wide vectors: `i8x32`, `u8x32`, `m8x32`, `i16x16`, `u16x16`, `m16x16`,
+  `i32x8`, `u32x8`, `f32x8`, `m32x8`, `i64x4`, `u64x4`, `f64x4`, `m64x4`
 
 Note that this list is not comprehensive. In particular:
 
@@ -365,7 +365,8 @@ Note that this list is not comprehensive. In particular:
   vector masks. These are blocked on `std::arch` AVX-512 support.
 * other vector types: x86, AArch64, PowerPC and others include types like
   `i64x1`, `u64x1`, `f64x1`, `m64x1`, `i128x1`, `u128x1`, `m128x1`, ... These
-  can always be added later as the need for these arises, potentially in combination with the stabilization of the `std::arch` intrinsics for those
+  can always be added later as the need for these arises, potentially in
+  combination with the stabilization of the `std::arch` intrinsics for those
   architectures.

From 0a5c4385c6bdedfd6db73dd826ae5d9a1a647a72 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Fri, 6 Apr 2018 19:33:38 +0200
Subject: [PATCH 10/17] vector layout; hmin/max->min/max_element; saturating
 arithmetic

---
 text/0000-ppv.md | 93 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 65 insertions(+), 28 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 94e58944990..11309aa84eb 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -195,7 +195,7 @@ Operations on vector types can be either:
 
 * **horizontal**: that is, along a single vector - they are unary operations.
   For example, `a.sum()` adds the elements of a vector together while
-  `a.hmax()` returns the largest element in a vector. These operations
+  `a.max_element()` returns the largest element in a vector. These operations
   (typically) translate to a sequence of multiple SIMD instructions on most
   architectures and are therefore slower. In many cases, they are, however,
   necessary.
 
@@ -369,6 +369,44 @@ Note that this list is not comprehensive. In particular:
   combination with the stabilization of the `std::arch` intrinsics for those
   architectures.
 
+### Layout of vector types
+
+The portable packed SIMD vector types introduced in this RFC are layout
+compatible with the architecture-specific vector types. That is:
+
+```rust
+union A {
+    port: f32x4,
+    arch: __m128,
+}
+let x: __m128 = _mm_setr_ps(0.0, 1.0, 2.0, 3.0);
+// Reading a union field is `unsafe`:
+let y: f32x4 = unsafe { A { arch: x }.port };
+assert_eq!(y.extract(0), 0.0); // OK
+assert_eq!(y.extract(1), 1.0); // OK
+assert_eq!(y.extract(2), 2.0); // OK
+assert_eq!(y.extract(3), 3.0); // OK
+```
+
+The portable packed SIMD vector types are also layout compatible with arrays of
+equal element type and whose length equals the number of vector lanes. That is:
+
+```rust
+union A {
+    port: f32x4,
+    arr: [f32; 4],
+}
+let x: [f32; 4] = [0.0, 1.0, 2.0, 3.0];
+// Reading a union field is `unsafe`:
+let y: f32x4 = unsafe { A { arr: x }.port };
+assert_eq!(y.extract(0), 0.0); // OK
+assert_eq!(y.extract(1), 1.0); // OK
+assert_eq!(y.extract(2), 2.0); // OK
+assert_eq!(y.extract(3), 3.0); // OK
+```
+
+This transitively makes both portable packed and architecture-specific SIMD
+vector types layout compatible with all other types that are also layout
+compatible with these array types.
+
 ## API of portable packed SIMD vector types
 
@@ -577,6 +615,22 @@ methods which, on overflow, produce the correct mathematical result modulo `2^n`
 
 If any of the vector elements is divided by zero, the behavior is undefined.
 
+#### Saturating arithmetic operations
+
+All signed and unsigned integer vector types implement the whole set of `pub fn
+saturating_{add,sub,mul,div,rem}(self, Self) -> Self` methods which saturate on
+overflow.
+
+The `div` and `rem` methods `panic!` on division by zero in debug mode.
+
+#### Unsafe saturating arithmetic operations
+
+All signed and unsigned integer vectors implement `pub unsafe fn
+saturating_{div,rem}_unchecked(self, Self) -> Self` methods which saturate on
+overflow.
+
+If any of the vector elements is divided by zero, the behavior is undefined.
 
 #### Binary `min`/`max` vertical operations
 
 All portable signed integer, unsigned integer, and floating-point vectors
@@ -692,34 +746,16 @@ implement the following methods:
 
 ```rust
 impl {element_type}{lane_width}x{number_of_lanes} {
-/// Value of the largest element in the vector.
-///
-/// # Floating-point
-///
-/// If the result contains `NaN`s the result is a
-/// `NaN` that is not necessarily equal to any of
-/// the `NaN`s in the vector.
-pub fn hmax(self) -> element_type;
+/// Largest vector element value.
+pub fn max_element(self) -> element_type;
 
-/// Value of the smallest element in the vector.
-///
-/// # Floating-point
-///
-/// If the result contains `NaN`s the result is a
-/// `NaN` that is not necessarily equal to any of
-/// the `NaN`s in the vector.
-pub fn hmin(self) -> element_type;
+/// Smallest vector element value.
+pub fn min_element(self) -> element_type;
 }
 ```
 
-Note: In this RFC, horizontal reductions are named according to the operation
-they perform. When this name clashes with that of a vertical operation, the
-horizontal reduction's name gets an `h`-prefix. In this case, the horizontal
-`max` and `min` reductions clash with the vertical `a.max(b)` operation, and
-therefore get an `h`-prefix.
-
-An alternative would be to prefix all horizontal operations with, for example,
-the `h_`-prefix.
+Note: the semantics of `{min,max}_element` for floating-point numbers are the
+same as that of their `min`/`max` methods.
 
 #### Mask construction and element access
 
@@ -1273,7 +1309,8 @@ the three examples used in the motivation come from the [Using SIMD in
 Dart](https://www.dartlang.org/articles/dart-vm/simd) article written by John
 McCutchan. Some of the key ideas of this RFC come from LLVM's design, which was
 originally inspired by GCC's vector extensions, which was probably inspired by
-something else.
+something else. Most parts of this RFC are also consistent with the [128-bit SIMD
+proposal for WebAssembly](https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md).
 
 Or in other words: to the author's best knowledge, this RFC does not contain any
 really novel ideas. Instead, it only draws inspiration from previous designs
@@ -1379,8 +1416,8 @@ finite (no `NaN`s) and others from assuming math to be associative (e.g. it
 allows tree-like reductions from sums).
 
 A future RFC could add more reduction variants with different requirements and
-performance characteristics, for example, `.sum_unordered()` or
-`.hmax_nanless()`, but these are not considered in this RFC because
+performance characteristics, for example, `.wrapping_sum_unordered()` or
+`.max_element_nanless()`, but these are not considered in this RFC because
 their interaction with fast-math is unclear.
 
 A potentially better idea would be to allow users to specify the assumptions

From 7da3e00801266de7f7fb0249569b6800274b1d99 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Thu, 24 May 2018 14:52:36 +0200
Subject: [PATCH 11/17] add fp-vector ops: fma, sqrt, sqrte

---
 text/0000-ppv.md | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 11309aa84eb..09760996cd8 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -657,6 +657,24 @@ pub fn max(self, other: Self) -> Self;
 
 ##### Floating-point semantics
 
 The floating-point semantics follow the semantics of `min` and `max` for the
 scalar `f32` and `f64` types.
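+
+As a minimal sketch of how these vertical operations compose (using only the
+`min`, `max`, `new`, and `splat` APIs proposed in this RFC), each lane of a
+vector can be clamped to a range without branches:
+
+```rust
+let x = f32x4::new(-1.0, 0.5, 2.0, 8.0);
+// Lane-wise clamp of `x` to the range [0.0, 1.0]:
+let clamped = x.max(f32x4::splat(0.0)).min(f32x4::splat(1.0));
+assert_eq!(clamped.extract(0), 0.0);
+assert_eq!(clamped.extract(1), 0.5);
+assert_eq!(clamped.extract(3), 1.0);
+```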
+#### Floating-point vertical math operations
+
+All portable floating-point vector types implement the following methods:
+
+```rust
+impl f{lane_width}x{number_of_lanes} {
+    /// Square-root
+    fn sqrt(self) -> Self;
+    /// Reciprocal square-root estimate
+    ///
+    /// **FIXME**: an upper bound on the error should
+    /// be guaranteed before stabilization.
+    fn rsqrte(self) -> Self;
+    /// Fused multiply add: `self * b + c`
+    fn fma(self, b: Self, c: Self) -> Self;
+}
+```
+
 #### Arithmetic reductions
 
 ##### Integers
@@ -1450,7 +1468,9 @@ fn foo(x: f32x4, y: f32x4) -> f32x4 {
 
 There are obviously many approaches to tackle this problem, but it does make
 sense to have a plan to tackle them before workarounds start getting bolted into
-RFCs like this one.
+RFCs like this one. There is an [internals
+post](https://internals.rust-lang.org/t/pre-pre-rfc-floating-point-math-assumptions-fast-math/7162)
+exploring the design space.
 
 ### Endian-dependent behavior

From a3a8bc6788ec2ea985c7aac090e16e14410a4c51 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Thu, 24 May 2018 14:59:01 +0200
Subject: [PATCH 12/17] add from implementations vector<-->arrays

---
 text/0000-ppv.md | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 09760996cd8..e3b95843de9 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -980,6 +980,8 @@ offset.
 
 #### Conversions and bitcasts
 [casts-and-conversions]: #casts-and-conversions
 
+##### Conversions / bitcasts between vector types
+
 There are three different ways to convert between vector types.
 
 * `From`/`Into`: value-preserving widening-conversion between vectors with the
@@ -1004,7 +1006,7 @@ There are three different ways to convert between vector types.
   mask bit-patterns that do not represent a valid mask. Note also that the
   result of `unsafe mem::transmute` is **endian-dependent** (see examples
   below).
- 
+
 It is extremely common to perform "transmute" operations between equally-sized
 portable vector types when writing SIMD algorithms. Rust currently does not have
 any facilities to express that all bit-patterns of one type are also valid
@@ -1025,6 +1027,20 @@ These issues are not specific to portable packed SIMD vector types and fixing
 them is not the purpose of this RFC, but these issues are critical for writing
 efficient and portable SIMD code reliably and ergonomically.
 
+##### Other conversions
+
+The layout of the portable packed vector types is compatible with the layout of
+fixed-size arrays of the same element type and the same number of lanes (e.g.
+`f32x4` is layout compatible with `[f32; 4]`).
+
+For all signed, unsigned, and floating-point vector types with element type `E`
+and number of lanes `N`, the following implementations exist:
+
+```rust
+impl From<[E; N]> for ExN;
+impl From<ExN> for [E; N];
+```
+
 # ABI and `std::simd`
 
 The ABI is first and foremost unspecified and may change at any time.

From 28105c6357e5f65fc54cc0fdb7c60b8b48654643 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Thu, 24 May 2018 15:00:08 +0200
Subject: [PATCH 13/17] fix bug in the average examples

---
 text/0000-ppv.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index e3b95843de9..1b4cd7c8a99 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -246,7 +246,7 @@ fn average_slow256(xs: &[f32]) -> f32 {
 
     // and add them to the result.
     result += data.sum();
   }
-  result
+  result / xs.len() as f32
 }
 ```
 
@@ -274,7 +274,7 @@ fn average_fast256(xs: &[f32]) -> f32 {
     result += data;
   }
   // Perform a single horizontal reduction at the end:
-  result.sum()
+  result.sum() / xs.len() as f32
 }
 ```

From 0d3ca37ce8589cdf5db9dbfd0a60a83a5dc04b3f Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Wed, 13 Jun 2018 13:46:12 +0200
Subject: [PATCH 14/17] rename load/store to read/write

---
 text/0000-ppv.md | 56 ++++++++++++++++++++++++------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 1b4cd7c8a99..41f2ed8633e 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -106,7 +106,7 @@ The operations provided in this RFC are thus either:
 
 **fundamental**: that is, they build the foundation required to write
 higher-level SIMD algorithms. These include, amongst others, instantiating
-vector types, load/stores from memory, masks and branchless conditional
+vector types, read/writes from memory, masks and branchless conditional
 operations, and type casts and conversions.
 
@@ -236,11 +236,11 @@ fn average_slow256(xs: &[f32]) -> f32 {
 
   // We iterate over the input slice with a step of `8` elements:
   for i in (0..xs.len()).step_by(8) {
-    // First, we load the next `8` elements into an `f32x8`.
+    // First, we read the next `8` elements into an `f32x8`.
     // Since we haven't checked whether the input slice
     // is aligned to the alignment of `f32x8`, we perform
-    // an unaligned memory load.
-    let data = f32x8::load_unaligned(&xs[i..]);
+    // an unaligned memory read.
+    let data = f32x8::read_unaligned(&xs[i..]);
 
     // With the elements in the vector, we perform a horizontal reduction
     // and add them to the result.
@@ -267,7 +267,7 @@ fn average_fast256(xs: &[f32]) -> f32 {
   // Our temporary result is now a f32x8 vector:
   let mut result = f32x8::splat(0.);
   for i in (0..xs.len()).step_by(8) {
-    let data = f32x8::load_unaligned(&xs[i..]);
+    let data = f32x8::read_unaligned(&xs[i..]);
     // This adds the data elements to our temporary result using
     // a vertical lane-wise simd operation - this is a single SIMD
     // instruction on most architectures.
@@ -503,7 +503,7 @@ pub unsafe fn replace_unchecked(self, index: usize,
 }
 ```
 
-#### Loads and Stores
+#### Reads and Writes
 
 All portable vector types implement the following methods:
 
@@ -515,14 +515,14 @@ impl {element_type}{lane_width}x{number_of_lanes} {
 /// Writes the values of the vector to the `slice`.
 ///
 /// # Panics
 ///
 /// If `slice.len() < Self::lanes()` or `&slice[0]` is not
 /// aligned to an `align_of::<Self>()` boundary.
-pub fn store_aligned(self, slice: &mut [element_type]);
+pub fn write_aligned(self, slice: &mut [element_type]);
 
 /// Writes the values of the vector to the `slice`.
 ///
 /// # Panics
 ///
 /// If `slice.len() < Self::lanes()`.
-pub fn store_unaligned(self, slice: &mut [element_type]);
+pub fn write_unaligned(self, slice: &mut [element_type]);
 
 /// Writes the values of the vector to the `slice`.
 ///
 /// # Precondition
 ///
 /// If `slice.len() < Self::lanes()` or `&slice[0]` is not
 /// aligned to an `align_of::<Self>()` boundary, the behavior is
 /// undefined.
-pub unsafe fn store_aligned_unchecked(self, slice: &mut [element_type]);
+pub unsafe fn write_aligned_unchecked(self, slice: &mut [element_type]);
 
 /// Writes the values of the vector to the `slice`.
 ///
 /// # Precondition
 ///
 /// If `slice.len() < Self::lanes()` the behavior is undefined.
-pub unsafe fn store_unaligned_unchecked(self, slice: &mut [element_type]);
+pub unsafe fn write_unaligned_unchecked(self, slice: &mut [element_type]);
 
 /// Instantiates a new vector with the values of the `slice`.
 ///
 /// # Panics
 ///
 /// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
 /// to an `align_of::<Self>()` boundary.
-pub fn load_aligned(slice: &[element_type]) -> Self;
+pub fn read_aligned(slice: &[element_type]) -> Self;
 
 /// Instantiates a new vector with the values of the `slice`.
 ///
 /// # Panics
 ///
 /// If `slice.len() < Self::lanes()`.
-pub fn load_unaligned(slice: &[element_type]) -> Self;
+pub fn read_unaligned(slice: &[element_type]) -> Self;
 
 /// Instantiates a new vector with the values of the `slice`.
 ///
 /// # Precondition
 ///
 /// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
 /// to an `align_of::<Self>()` boundary, the behavior is undefined.
-pub unsafe fn load_aligned_unchecked(slice: &[element_type]) -> Self;
+pub unsafe fn read_aligned_unchecked(slice: &[element_type]) -> Self;
 
 /// Instantiates a new vector with the values of the `slice`.
 ///
 /// # Precondition
 ///
 /// If `slice.len() < Self::lanes()` the behavior is undefined.
-pub unsafe fn load_unaligned_unchecked(slice: &[element_type]) -> Self;
+pub unsafe fn read_unaligned_unchecked(slice: &[element_type]) -> Self;
 }
 ```
 
@@ -1112,8 +1112,8 @@ performance bug.
 
 ### Dynamic values result in poor code generation for some operations
 
 Some of the fundamental APIs proposed in this RFC, like `vec::{new, extract,
-store, replace}` take run-time dynamic parameters. Consider the following
-example (see the whole example live at [`rust.godbolt.org`](https://godbolt.org/g/yhiAa2):
+replace}` take run-time dynamic parameters. Consider the following example (see
+the whole example live at [`rust.godbolt.org`](https://godbolt.org/g/yhiAa2)):
 
 ```rust
 /// Returns a f32x8 with 0.,1.,2.,3.
@@ -1149,8 +1149,8 @@ example::increasing:
 retq
 ```
 
-which uses two vector loads to load the values into a SIMD register -
-digression: this two loads are due to Rust's SIMD vector types ABI and happen
-only "isolated" examples.
+which uses two vector reads to read the values into a SIMD register -
+digression: these two reads are due to Rust's SIMD vector types ABI and happen
+only in "isolated" examples.
 
 If we change this function to accept run-time bounds for the loop
@@ -1198,7 +1198,7 @@ This code isn't necessarily horrible, but it is definitely harder to reason abou
 performance. This has two main causes:
 
 * **ISAs do not support these operations**: most (all?) ISAs support operations
-  like `extract`, `store`, and `replace` with constant indices only. That is,
+  like `extract`, `write`, and `replace` with constant indices only. That is,
   these operations do not map to single instructions on most ISAs.
 
 * **these operations are slow**: even for constant indices, these operations are
@@ -1222,7 +1222,7 @@ The proposed API accepts run-time indices (and values for `new`):
   really avoid writing. The current API makes writing this code extremely easy,
   resulting in SIMD code with potentially unexpected performance. Users also
   have to deal with two APIs for this, the checked/unchecked APIs, and
-  also, the memory `load`/`store` APIs that are better suited for this use case.
+  also, the memory `read`/`write` APIs that are better suited for this use case
+  (see the sketch below).
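+
+A minimal sketch of that contrast, assuming a vector `v: f32x4` and an output
+slice `out: &mut [f32]` with at least `f32x4::lanes()` elements:
+
+```rust
+// Potentially slow: one `extract` per lane, with run-time indices.
+for i in 0..f32x4::lanes() {
+    out[i] = v.extract(i);
+}
+// Preferred: a single vector write to memory.
+v.write_unaligned(&mut out[..f32x4::lanes()]);
+```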
 Whether the current design is the right design should probably be clarified
 during the RFC. An important aspect to consider is that Rust support for
@@ -1388,14 +1388,14 @@ fn add_constant(dst: &mut [f64], src: &[f64], c: f64) {
     // while out-of-bounds elements contain `false`:
     let m: bxN = f64xN::while_lt(i, src.len());
 
-    // Load the elements of the source using the mask:
-    let vsrc: f64xN = f64xN::load(m, &src[i..]);
+    // Read the elements of the source using the mask:
+    let vsrc: f64xN = f64xN::read_unaligned(m, &src[i..]);
 
     // Add the vector with the constant using the mask:
     let vdst: f64xN = vsrc.add(m, vc);
 
-    // Store the result back to memory using the mask:
-    vdst.store_unaligned(m, &mut dst[i..]);
+    // Write the result back to memory using the mask:
+    vdst.write_unaligned(m, &mut dst[i..]);
   }
 }
 ```
@@ -1490,7 +1490,7 @@ exploring the design space.
 
 ### Endian-dependent behavior
 
-The results of the indexed operations (`extract`, `replace`, `store`), and the
+The results of the indexed operations (`extract`, `replace`, `write`), and the
 `new` method are endian independent. That is, the following example is
 guaranteed to pass on little-endian (LE) and big-endian (BE) architectures:
@@ -1515,12 +1515,12 @@ if cfg!(target_endian = "little") {
 }
 ```
 
-which applies to memory load and stores as well:
+which applies to memory reads and writes as well:
 
 ```rust
 let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
 let mut y: [i16; 8] = [0; 8];
-x.store_unaligned( unsafe {
+x.write_unaligned(unsafe {
   slice::from_raw_parts_mut(&mut y as *mut _ as *mut i8, 16)
 });
 
@@ -1532,7 +1532,7 @@ if cfg!(target_endian = "little") {
   assert_eq!(y, e);
 }
 
-let z = i8x16::load_unaligned(unsafe {
+let z = i8x16::read_unaligned(unsafe {
   slice::from_raw_parts(&y as *const _ as *const i8, 16)
 });
 assert_eq!(z, x);

From aaab749e703ecd3a030d8a8d3650b96f13805e82 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Wed, 13 Jun 2018 13:50:13 +0200
Subject: [PATCH 15/17] be more specific with respect to vector reads/writes

---
 text/0000-ppv.md | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 41f2ed8633e..586d7d808c8 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -509,22 +509,25 @@ All portable vector types implement the following methods:
 
 ```rust
 impl {element_type}{lane_width}x{number_of_lanes} {
-/// Writes the values of the vector to the `slice`.
+/// Writes the values of the vector to the `slice` without
+/// reading or dropping the old value.
 ///
 /// # Panics
 ///
-/// If `slice.len() < Self::lanes()` or `&slice[0]` is not
+/// If `slice.len() != Self::lanes()` or `&slice[0]` is not
 /// aligned to an `align_of::<Self>()` boundary.
 pub fn write_aligned(self, slice: &mut [element_type]);
 
-/// Writes the values of the vector to the `slice`.
+/// Writes the values of the vector to the `slice` without
+/// reading or dropping the old value.
 ///
 /// # Panics
 ///
-/// If `slice.len() < Self::lanes()`.
+/// If `slice.len() != Self::lanes()`.
 pub fn write_unaligned(self, slice: &mut [element_type]);
 
-/// Writes the values of the vector to the `slice`.
+/// Writes the values of the vector to the `slice` without reading
+/// or dropping the old value.
 ///
 /// # Precondition
 ///
 /// If `slice.len() < Self::lanes()` or `&slice[0]` is not
 /// aligned to an `align_of::<Self>()` boundary, the behavior is
 /// undefined.
 pub unsafe fn write_aligned_unchecked(self, slice: &mut [element_type]);
 
-/// Writes the values of the vector to the `slice`.
+/// Writes the values of the vector to the `slice` without reading
+/// or dropping the old value.
 ///
 /// # Precondition
 ///
 /// If `slice.len() < Self::lanes()` the behavior is undefined.
 pub unsafe fn write_unaligned_unchecked(self, slice: &mut [element_type]);
 
-/// Instantiates a new vector with the values of the `slice`.
+/// Instantiates a new vector with the values of the `slice` without
+/// moving them, leaving the memory in `slice` unchanged.
 ///
 /// # Panics
 ///
-/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
+/// If `slice.len() != Self::lanes()` or `&slice[0]` is not aligned
 /// to an `align_of::<Self>()` boundary.
 pub fn read_aligned(slice: &[element_type]) -> Self;
 
-/// Instantiates a new vector with the values of the `slice`.
+/// Instantiates a new vector with the values of the `slice` without
+/// moving them, leaving the memory in `slice` unchanged.
 ///
 /// # Panics
 ///
-/// If `slice.len() < Self::lanes()`.
+/// If `slice.len() != Self::lanes()`.
 pub fn read_unaligned(slice: &[element_type]) -> Self;
 
-/// Instantiates a new vector with the values of the `slice`.
+/// Instantiates a new vector with the values of the `slice` without
+/// moving them, leaving the memory in `slice` unchanged.
 ///
 /// # Precondition
 ///
 /// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
 /// to an `align_of::<Self>()` boundary, the behavior is undefined.
 pub unsafe fn read_aligned_unchecked(slice: &[element_type]) -> Self;
 
-/// Instantiates a new vector with the values of the `slice`.
+/// Instantiates a new vector with the values of the `slice` without
+/// moving them, leaving the memory in `slice` unchanged.
 ///
 /// # Precondition
 ///
 /// If `slice.len() < Self::lanes()` the behavior is undefined.
 pub unsafe fn read_unaligned_unchecked(slice: &[element_type]) -> Self;
 }
 ```

From 20746611308cff9d190842d567e72424656c9f91 Mon Sep 17 00:00:00 2001
From: gnzlbg
Date: Wed, 13 Jun 2018 14:18:14 +0200
Subject: [PATCH 16/17] gather/scatters

---
 text/0000-ppv.md | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/text/0000-ppv.md b/text/0000-ppv.md
index 586d7d808c8..d0884449963 100644
--- a/text/0000-ppv.md
+++ b/text/0000-ppv.md
@@ -505,6 +505,8 @@ pub unsafe fn replace_unchecked(self, index: usize,
 
 #### Reads and Writes
 
+##### Contiguous reads and writes
+
 All portable vector types implement the following methods:
 
 ```rust
@@ -580,6 +582,46 @@ pub unsafe fn read_unaligned_unchecked(slice: &[element_type]) -> Self;
 }
 ```
 
+##### Discontinuous masked reads and writes (gather and scatter)
+
+Vector masks implement the following methods:
+
+```rust
+impl m{lane_width}x{number_of_lanes} {
+/// Instantiates a new vector with the values of the `slice` located at
+/// the `offset`s for which the mask (`self`) is `true`, without moving
+/// them, and with the values of `default` otherwise. The memory of the
+/// `slice` at the `offset`s for which the mask is `false` is not read.
+///
+/// # Precondition
+///
+/// If `slice.len() <= offset.max_element()` the behavior is undefined.
+pub unsafe fn read_scattered_unchecked<T, O, D>(self, slice: &[T], offset: O, default: D) -> D
+    where
+    // for exposition only:
+    // number_of_lanes == D::lanes() == O::lanes(),
+    // D::element_type == T,
+    // O::element_type == usize,
+;
+
+/// Writes the elements of the vector `values` for which the mask (`self`)
+/// is `true` to the `slice` at `offset`s without reading or dropping
+/// +/// # Precondition +/// +/// If `slice.len() < offset.max_element()` the behavior is undefined. +pub unsafe fn write_scattered_unchecked(self, slice: &mut [T], offset: O, values: D) + where + // for exposition only: + // number_of_lanes == D::lanes() == O::lanes(), + // D::element_type == T, + // O::element_type == usize, +; +} +``` + #### Vertical arithmetic operations Vertical (lane-wise) arithmetic operations are provided by the following trait @@ -862,7 +904,10 @@ impl m{lane_width}x{number_of_lanes} { /// The lanes of the result for which the mask is `true` contain /// the values of `a` while the remaining lanes contain the values of `b`. pub fn select(self, a: T, b: T) -> T - where T::lanes() == number_of_lanes; // implementation-defined + where + // for exposition only: + // T::lanes() == number_of_lanes, +; } ``` From 7011673b2dc5b493cb8be937bf11dd8449f58135 Mon Sep 17 00:00:00 2001 From: gnzlbg Date: Wed, 13 Jun 2018 19:54:17 +0200 Subject: [PATCH 17/17] semantics of overlapping scatters --- text/0000-ppv.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/text/0000-ppv.md b/text/0000-ppv.md index d0884449963..20cbe0b4679 100644 --- a/text/0000-ppv.md +++ b/text/0000-ppv.md @@ -609,6 +609,10 @@ pub unsafe fn read_scattered_unchecked(self, slice: &[T], offset: O, de /// the old values. No memory is written to the `slice` elements at /// the `offset`s for which the mask is `false`. /// +/// If multiple `offset`s have the same value, that is, if multiple lanes +/// from `values` are to be written to the same memory location, the writes +/// are ordered from least significant to most significant element. +/// /// # Precondition /// /// If `slice.len() < offset.max_element()` the behavior is undefined.