- Feature Name: `portable_packed_vector_types`
- Start Date: (fill me in with today's date, YYYY-MM-DD)
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary
[summary]: #summary

This RFC adds portable packed SIMD vector types up to 256-bit wide.

Future RFCs will attempt to answer some of the unresolved questions and might
potentially cover extensions as they mature in `stdsimd`, like, for example,
portable memory gather and scatter operations, `m1xN` vector masks, masked
arithmetic/bitwise/shift operations, etc.

# Motivation
[motivation]: #motivation

The `std::arch` module exposes architecture-specific SIMD types like `__m128` - a
128-bit wide SIMD vector type. How these bits are interpreted depends on the intrinsic
being used. For example, let's sum eight `f32` values using the SSE4.1 facilities
in the `std::arch` module. This is one way to do it
([playground](https://play.rust-lang.org/?gist=165e2886b4883ec98d4e8bb4d6a32e22&version=nightly)):

```rust
unsafe fn add_reduce(a: __m128, b: __m128) -> f32 {
    let c = _mm_hadd_ps(a, b);
    let c = _mm_hadd_ps(c, _mm_setzero_ps());
    let c = _mm_hadd_ps(c, _mm_setzero_ps());
    std::mem::transmute(_mm_extract_ps(c, 0))
}

fn main() {
    unsafe {
        let a = _mm_set_ps(1., 2., 3., 4.);
        let b = _mm_set_ps(5., 6., 7., 8.);
        let r = add_reduce(a, b);
        assert_eq!(r, 36.);
    }
}
```

Notice that:

* one has to put some effort into working out, from `add_reduce`'s signature, which
  types of vectors it actually expects: "`add_reduce` takes 128-bit wide vectors and
  returns an `f32`, therefore those 128-bit vectors _probably_ contain 4 packed
  `f32`s, because that's the only combination of `f32`s that fits in 128 bits!"

* it requires a lot of `unsafe` code: the intrinsics are unsafe (which could be
  improved via [RFC 2212](https://github.com/rust-lang/rfcs/pull/2212)), the
  intrinsic API relies on the user performing transmutes, constructing the
  vectors is unsafe because it needs to be done via intrinsic calls, etc.

* it requires a lot of architecture-specific knowledge: how the intrinsics are
  called and how they are used together,

* this solution only works on `x86` or `x86_64` with SSE4.1 enabled, that is, it
  is not portable.

With portable packed vector types, we can do much better
([playground](https://play.rust-lang.org/?gist=7fb4e3b6c711b5feb35533b50315a5fb&version=nightly)):

```rust
fn main() {
    let a = f32x4::new(1., 2., 3., 4.);
    let b = f32x4::new(5., 6., 7., 8.);
    let r = (a + b).sum();
    assert_eq!(r, 36.);
}
```

These types add zero overhead over the architecture-specific types for the
operations that they support - if there is an architecture on which this does
not hold for some operation, the implementation has a bug.

The motivation of this RFC is to provide reasonably high-level, reliable, and
portable access to common SIMD vector types and SIMD operations.

At a higher level, the actual use cases for these instructions are boundless.
SIMD intrinsics are used in graphics, multimedia, linear algebra, scientific
computing, games, cryptography, text search, machine learning, low-latency
applications, and more. There are many crates in the Rust ecosystem using SIMD
intrinsics today, either through `stdsimd`, the `simd` crate, or both.
Some examples include:

* [`encoding_rs`](https://github.com/hsivonen/encoding_rs) which uses the `simd`
  crate to assist with speedy decoding.
* [`bytecount`](https://github.com/llogiq/bytecount) which uses the `simd` crate
  with AVX2 extensions to accelerate counting bytes.
* [`regex`](https://github.com/rust-lang/regex) which uses the `stdsimd` crate
  with SSSE3 extensions to accelerate multiple-substring search via the Teddy
  algorithm.

However, providing portable SIMD algorithms for all application domains is not
the intent of this RFC.

The purpose of this RFC is to provide users with vocabulary types and
fundamental operations that they can build upon in their own crates to
effectively implement SIMD algorithms in their respective application domains.

These types are meant to be extended by users with portable (or non-portable)
SIMD operations in their own crates, for example, via extension traits or
newtypes.

The operations provided in this RFC are thus either:

**fundamental**: that is, they build the foundation required to write
higher-level SIMD algorithms. These include, amongst others, instantiating
vector types, reads/writes from memory, masks and branchless conditional
operations, and type casts and conversions.

**required**: to be part of `std`. These include backend-specific compiler
intrinsics that we might never want to stabilize, as well as implementations of
`std` library traits with which, due to trait coherence, users cannot extend the
vector types themselves.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

This RFC extends Rust with **portable packed SIMD vector types**, a set of types
used to perform **explicit vectorization**:

* **SIMD**: stands for Single Instruction, Multiple Data. This RFC uses this
  term in the context of hardware instruction set architectures (ISAs) to refer
  to:
  * SIMD instructions: instructions that (typically) perform operations on
    multiple values simultaneously, and
  * SIMD registers: the registers that the SIMD instructions take as operands.
    These registers (typically) store multiple values that are
    operated upon simultaneously by SIMD instructions.

* **vector** types: types that abstract over memory stored in SIMD registers,
  allowing memory to be transferred to/from these registers and operations to be
  performed directly on them.

* **packed**: means that these vectors have a compile-time fixed size. It is
  the opposite of **scalable** or "Cray vectors", which are SIMD vector types
  with a dynamic size, that is, whose size is only known at run-time.

* **explicit vectorization**: vectorization is the process of producing programs
  that operate on multiple values simultaneously, (typically) using SIMD
  instructions and registers. Automatic vectorization is the process by which
  the Rust compiler is, in some cases, able to transform scalar Rust code, that
  is, code that does not use SIMD vector types, into machine code that does use
  SIMD registers and instructions automatically (without user intervention).
  Explicit vectorization is the process by which a Rust **user** manually writes
  Rust code that states what kind of SIMD registers are to be used and what SIMD
  instructions are executed on them.

* **portable**: is the opposite of architecture-specific. These types work both
  correctly and efficiently on all architectures.
They are a zero-overhead
  abstraction, that is, for the operations that these types support, one cannot
  write better code by hand (otherwise, it is an implementation bug).

* **masks**: are vector types used to **select** the vector elements on which
  operations are to be performed. This selection is performed by setting or
  clearing all the bits of the mask for a particular lane.

Packed vector types are denoted as follows: `{i,u,f,m}{lane_width}x{#lanes}`, so
that `i64x8` is a 512-bit vector with eight `i64` lanes and `f32x4` a 128-bit
vector with four `f32` lanes. Here:

* **lane**: one of the values of a particular type stored in a vector - the
  vector operations act on all lanes simultaneously.

* **lane width**: the bit width of a vector lane, that is, the bit width of
  the objects stored in the vector. For example, the type `f32` is 32 bits wide.

That is, the `m8x4` type is a 32-bit wide vector mask with 4 lanes, each
containing an 8-bit wide mask. Vector masks are mainly used to select the lanes
on which vector operations are performed. When a lane has all of its bits set to
`true`, that lane is "selected", and when a lane has all of its bits set to
`false`, that lane is "not selected". The following bit pattern is thus a valid
bit-pattern for the `m8x4` mask:

> 00000000_11111111_00000000_11111111

and it selects two eight-bit wide lanes from a 32-bit wide vector type with four
lanes. The following bit-pattern is not, however, a valid value of the same mask
type:

> 00000000_11111111_00000000_11110111

because it does not satisfy the invariant that all bits of a lane must be
either set or cleared.

Operations on vector types can be either:

* **vertical**: that is, lane-wise. For example, `a + b` adds each lane of `a`
  to the corresponding lane of `b`, while `a.lt(b)` returns a boolean mask
  that indicates, for each of the vector lanes, whether the less-than (`<`, `lt`)
  comparison returned `true` or `false`. Most vertical operations are binary
  operations (they take two input vectors). These operations are typically very
  fast on most architectures and they are the most widely used in practice.

* **horizontal**: that is, along a single vector - they are unary operations.
  For example, `a.sum()` adds the elements of a vector together while
  `a.max_element()` returns the largest element in a vector. These operations
  (typically) translate to a sequence of multiple SIMD instructions on most
  architectures and are therefore slower. In many cases, they are, however,
  necessary.

## Example: Average

The first example computes the arithmetic average of the elements in a list.
Sequentially, we could write it using iterators as follows:

```rust
/// Arithmetic average of the elements in `xs`.
fn average_seq(xs: &[f32]) -> f32 {
    if !xs.is_empty() {
        xs.iter().sum::<f32>() / xs.len() as f32
    } else {
        0.
    }
}
```

The following implementation uses the 256-bit SIMD facilities provided by this
RFC. As the name suggests, it will be "slow":

```rust
/// Computes the arithmetic average of the elements in the list.
///
/// # Panics
///
/// If `xs.len()` is not a multiple of `8`.
fn average_slow256(xs: &[f32]) -> f32 {
    // The 256-bit wide floating-point vector type is f32x8. To
    // avoid handling extra elements in this example we just panic.
    assert!(xs.len() % 8 == 0,
            "input length `{}` is not a multiple of 8",
            xs.len());

    let mut result = 0.0_f32; // This is where we store the result

    // We iterate over the input slice with a step of `8` elements:
    for i in (0..xs.len()).step_by(8) {
        // First, we read the next `8` elements into an `f32x8`.
        // Since we haven't checked whether the input slice
        // is aligned to the alignment of `f32x8`, we perform
        // an unaligned memory read.
        let data = f32x8::read_unaligned(&xs[i..]);

        // With the elements in the vector, we perform a horizontal
        // reduction and add them to the result.
        result += data.sum();
    }
    result / xs.len() as f32
}
```

As mentioned, this operation is "slow". Why is that? The main issue is that, on
most architectures, horizontal reductions must perform a sequence of SIMD
operations, while vertical operations typically require only a single
instruction.

We can significantly improve the performance of our algorithm by writing it in
such a way that the number of horizontal reductions performed is reduced:

```rust
fn average_fast256(xs: &[f32]) -> f32 {
    assert!(xs.len() % 8 == 0,
            "input length `{}` is not a multiple of 8",
            xs.len());

    // Our temporary result is now an f32x8 vector:
    let mut result = f32x8::splat(0.);
    for i in (0..xs.len()).step_by(8) {
        let data = f32x8::read_unaligned(&xs[i..]);
        // This adds the data elements to our temporary result using
        // a vertical lane-wise SIMD operation - this is a single SIMD
        // instruction on most architectures.
        result += data;
    }
    // Perform a single horizontal reduction at the end:
    result.sum() / xs.len() as f32
}
```

The performance could be further improved by requiring the input data to be
aligned to a 32-byte boundary (the alignment of `f32x8`), and/or by handling the
elements before the next 32-byte boundary in a special way.

## Example: scalar-vector multiply even

To showcase the mask and `select` API, the following function multiplies the
even elements of a vector with a scalar:

```rust
fn mul_even(a: f32, x: f32x4) -> f32x4 {
    // Create a vector mask for the even elements 0 and 2.
    // The vector mask API uses `bool`s to set or clear
    // all bits of a lane:
    let m = m32x4::new(true, false, true, false);

    // Perform a full multiplication
    let r = f32x4::splat(a) * x;

    // Use the mask to select the even elements from the
    // multiplication result and the odd elements from
    // the input:
    m.select(r, x)
}
```

## Example: 4x4 Matrix multiplication

To showcase the `shuffle!` API, the following function implements 4x4 matrix
multiplication using 128-bit wide vectors:

```rust
fn mul4x4(a: [f32x4; 4], b: [f32x4; 4]) -> [f32x4; 4] {
    let mut r = [f32x4::splat(0.); 4];

    for i in 0..4 {
        r[i] = a[0] * shuffle!(b[i], [0, 0, 0, 0])
             + a[1] * shuffle!(b[i], [1, 1, 1, 1])
             + a[2] * shuffle!(b[i], [2, 2, 2, 2])
             + a[3] * shuffle!(b[i], [3, 3, 3, 3]);
    }
    r
}
```

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

## Vector types

The vector types are named according to the following scheme:

> {element_type}{lane_width}x{number_of_lanes}

where the following element types are introduced by this RFC:

* `i`: signed integer
* `u`: unsigned integer
* `f`: float
* `m`: mask

So that `u16x8` reads "a SIMD vector of eight packed 16-bit wide unsigned
integers". The width of a vector can be computed by multiplying the
`{lane_width}` by the `{number_of_lanes}`. For `u16x8`, 16 x 8 = 128, so
this vector type is 128 bits wide.
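As a quick illustration, this relation can be checked directly against the API
proposed below (a minimal sketch; `lanes()` and the size/layout guarantees used
here are specified in the following sections):

```rust
// u16x8: lane_width (16) * number_of_lanes (8) = 128 bits.
assert_eq!(u16x8::lanes(), 8);
assert_eq!(std::mem::size_of::<u16x8>() * 8, 128);
```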
This RFC proposes adding all vector types with sizes in the range [16, 256] bits
to the `std::simd` module, that is:

* 16-bit wide vectors: `i8x2`, `u8x2`, `m8x2`
* 32-bit wide vectors: `i8x4`, `u8x4`, `m8x4`, `i16x2`, `u16x2`, `m16x2`
* 64-bit wide vectors: `i8x8`, `u8x8`, `m8x8`, `i16x4`, `u16x4`, `m16x4`,
  `i32x2`, `u32x2`, `f32x2`, `m32x2`
* 128-bit wide vectors: `i8x16`, `u8x16`, `m8x16`, `i16x8`, `u16x8`, `m16x8`,
  `i32x4`, `u32x4`, `f32x4`, `m32x4`, `i64x2`, `u64x2`, `f64x2`, `m64x2`
* 256-bit wide vectors: `i8x32`, `u8x32`, `m8x32`, `i16x16`, `u16x16`, `m16x16`,
  `i32x8`, `u32x8`, `f32x8`, `m32x8`, `i64x4`, `u64x4`, `f64x4`, `m64x4`

Note that this list is not comprehensive. In particular:

* half-float `f16xN` vectors: these are supported on many architectures (ARM,
  AArch64, PowerPC64, RISC-V, MIPS, ...) but their support is blocked on Rust
  half-float support.
* AVX-512 vector types: not only 512-bit wide vector types, but also `m1xN`
  vector masks. These are blocked on `std::arch` AVX-512 support.
* other vector types: x86, AArch64, PowerPC and others include types like
  `i64x1`, `u64x1`, `f64x1`, `m64x1`, `i128x1`, `u128x1`, `m128x1`, ... These
  can always be added later as the need for them arises, potentially in
  combination with the stabilization of the `std::arch` intrinsics for those
  architectures.

### Layout of vector types

The portable packed SIMD vector types introduced in this RFC are layout
compatible with the architecture-specific vector types. That is:

```rust
union A {
    port: f32x4,
    arch: __m128,
}
let x: __m128 = _mm_setr_ps(0.0, 1.0, 2.0, 3.0);
let y: f32x4 = unsafe { A { arch: x }.port };
assert_eq!(y.extract(0), 0.0); // OK
assert_eq!(y.extract(1), 1.0); // OK
assert_eq!(y.extract(2), 2.0); // OK
assert_eq!(y.extract(3), 3.0); // OK
```

The portable packed SIMD vector types are also layout compatible with arrays of
equal element type and whose length equals the number of vector lanes. That is:

```rust
union A {
    port: f32x4,
    arr: [f32; 4],
}
let x: [f32; 4] = [0.0, 1.0, 2.0, 3.0];
let y: f32x4 = unsafe { A { arr: x }.port };
assert_eq!(y.extract(0), 0.0); // OK
assert_eq!(y.extract(1), 1.0); // OK
assert_eq!(y.extract(2), 2.0); // OK
assert_eq!(y.extract(3), 3.0); // OK
```

This transitively makes both portable packed and architecture-specific SIMD
vector types layout compatible with all other types that are layout compatible
with these array types.

## API of portable packed SIMD vector types

### Traits overview

All vector types implement the following traits:

* `Copy`
* `Clone`
* `Default`: zero-initializes the vector.
* `Debug`: formats the vector as `({}, {}, ...)`.
* `PartialEq`: performs a lane-wise comparison between two vectors and
  returns `true` if all lanes compare equal. It is equivalent to
  `a.eq(b).all()`.
* `PartialOrd`: compares two vectors lexicographically.
* `From`/`Into`: lossless casts between vectors with the same number of lanes.

All signed integer, unsigned integer, and floating-point vector types implement
the following traits:

* `{Add,Sub,Mul,Div,Rem}`,
  `{Add,Sub,Mul,Div,Rem}Assign`: vertical (lane-wise) arithmetic
  operations.

All signed and unsigned integer vectors and vector masks also implement:

* `Eq`: equivalent to `PartialEq`.
* `Ord`: equivalent to `PartialOrd`.
* `Hash`: equivalent to `Hash` for `[element_type; number_of_lanes]`.
* `fmt::LowerHex`/`fmt::UpperHex`: format the vector as hexadecimal.
* `fmt::Octal`: formats the vector as an octal number.
* `fmt::Binary`: formats the vector as a binary number.
* `Not`: vertical (lane-wise) bitwise negation.
* `Bit{And,Or,Xor}`, `Bit{And,Or,Xor}Assign`:
  vertical (lane-wise) bitwise operations.

All signed and unsigned integer vectors also implement:

* `{Shl,Shr}`, `{Shl,Shr}Assign`: vertical
  (lane-wise) bit-shift operations.

Note: While IEEE 754-2008 provides total ordering predicates for floating-point
numbers, Rust does not implement `Eq` and `Ord` for the `f32` and `f64`
primitive types. This RFC follows suit and does not propose to implement `Eq`
and `Ord` for vectors of floating-point types. Any future RFC that might want to
extend Rust with a total order for floats should extend the portable
floating-point vector types with it as well. See [this internal
thread](https://users.rust-lang.org/t/how-to-sort-a-vec-of-floats/2838/3) for
more information.
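For illustration, a minimal sketch of these trait semantics on a concrete type
(using only the construction, comparison, and element-access APIs specified in
the following sections):

```rust
let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(1, 2, 3, 5);
// PartialEq: lane-wise comparison reduced with `all()`:
assert!(a != b);
assert_eq!(a == a, a.eq(a).all());
// PartialOrd: lexicographic comparison:
assert!(a < b);
// From/Into: lossless widening with the same number of lanes:
let c: i64x4 = a.into();
assert_eq!(c.extract(3), 4);
```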
### Inherent Methods

#### Construction and element access

All portable signed integer, unsigned integer, and floating-point vector types
implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Creates a new instance of the vector from `number_of_lanes`
/// values.
pub const fn new(args...: element_type) -> Self;

/// Returns the number of vector lanes.
pub const fn lanes() -> usize;

/// Constructs a new instance with each element initialized to
/// `value`.
pub const fn splat(value: element_type) -> Self;

/// Extracts the value at `index`.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
pub fn extract(self, index: usize) -> element_type;

/// Extracts the value at `index`.
///
/// If `index >= Self::lanes()` the behavior is undefined.
pub unsafe fn extract_unchecked(self, index: usize) -> element_type;

/// Returns a new vector where the value at `index` is replaced by `new_value`.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
#[must_use = "replace returns a new vector and does not modify the original value"]
pub fn replace(self, index: usize, new_value: element_type) -> Self;

/// Returns a new vector where the value at `index` is replaced by `new_value`.
///
/// If `index >= Self::lanes()` the behavior is undefined.
#[must_use = "replace_unchecked returns a new vector and does not modify the original value"]
pub unsafe fn replace_unchecked(self, index: usize,
                                new_value: element_type) -> Self;
}
```
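For illustration, a short sketch of how these methods compose; note that
`replace` returns a new vector instead of mutating in place:

```rust
let v = f32x4::new(1., 2., 3., 4.);
assert_eq!(f32x4::lanes(), 4);
assert_eq!(v.extract(2), 3.);
// `replace` returns a new vector; `v` itself is unchanged:
let w = v.replace(0, 42.);
assert_eq!(w, f32x4::new(42., 2., 3., 4.));
assert_eq!(v, f32x4::new(1., 2., 3., 4.));
```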
#### Reads and Writes

##### Contiguous reads and writes

All portable vector types implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Writes the values of the vector to the `slice` without
/// reading or dropping the old values.
///
/// # Panics
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not
/// aligned to an `align_of::<Self>()` boundary.
pub fn write_aligned(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without
/// reading or dropping the old values.
///
/// # Panics
///
/// If `slice.len() < Self::lanes()`.
pub fn write_unaligned(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without
/// reading or dropping the old values.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not
/// aligned to an `align_of::<Self>()` boundary, the behavior is
/// undefined.
pub unsafe fn write_aligned_unchecked(self, slice: &mut [element_type]);

/// Writes the values of the vector to the `slice` without reading
/// or dropping the old values.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` the behavior is undefined.
pub unsafe fn write_unaligned_unchecked(self, slice: &mut [element_type]);

/// Instantiates a new vector with the values of the `slice` without
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Panics
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
/// to an `align_of::<Self>()` boundary.
pub fn read_aligned(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Panics
///
/// If `slice.len() < Self::lanes()`.
pub fn read_unaligned(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` or `&slice[0]` is not aligned
/// to an `align_of::<Self>()` boundary, the behavior is undefined.
pub unsafe fn read_aligned_unchecked(slice: &[element_type]) -> Self;

/// Instantiates a new vector with the values of the `slice` without
/// moving them, leaving the memory in `slice` unchanged.
///
/// # Precondition
///
/// If `slice.len() < Self::lanes()` the behavior is undefined.
pub unsafe fn read_unaligned_unchecked(slice: &[element_type]) -> Self;
}
```

##### Discontinuous masked reads and writes (scatter and gather)

Vector masks implement the following methods:

```rust
impl m{lane_width}x{number_of_lanes} {
/// Instantiates a new vector with the values of the `slice` located at
/// the `offset`s for which the mask (`self`) is `true`, without moving
/// them, and with the values of `default` otherwise. The memory of the
/// `slice` at the `offset`s for which the mask is `false` is not read.
///
/// # Precondition
///
/// If `slice.len() <= offset.max_element()` the behavior is undefined.
pub unsafe fn read_scattered_unchecked(self, slice: &[T], offset: O, default: D) -> D
    where
    // for exposition only:
    // number_of_lanes == D::lanes() == O::lanes(),
    // D::element_type == T,
    // O::element_type == usize,
;

/// Writes the elements of the vector `values` for which the mask (`self`)
/// is `true` to the `slice` at the `offset`s without reading or dropping
/// the old values. No memory is written to the `slice` elements at
/// the `offset`s for which the mask is `false`.
///
/// If multiple `offset`s have the same value, that is, if multiple lanes
/// from `values` are to be written to the same memory location, the writes
/// are ordered from least significant to most significant element.
///
/// # Precondition
///
/// If `slice.len() <= offset.max_element()` the behavior is undefined.
pub unsafe fn write_scattered_unchecked(self, slice: &mut [T], offset: O, values: D)
    where
    // for exposition only:
    // number_of_lanes == D::lanes() == O::lanes(),
    // D::element_type == T,
    // O::element_type == usize,
;
}
```

#### Vertical arithmetic operations

Vertical (lane-wise) arithmetic operations are provided by the following trait
implementations:

* All signed integer, unsigned integer, and floating-point vector types implement:

  * `{Add,Sub,Mul,Div,Rem}`
  * `{Add,Sub,Mul,Div,Rem}Assign`

* All signed and unsigned integer vectors also implement:

  * `{Shl,Shr}`, `{Shl,Shr}Assign`: vertical
    (lane-wise) bit-shift operations.

##### Integer vector semantics

The behavior of these operations for integer vectors is the same as that of the
scalar integer types: they `panic!` on overflow if `-C overflow-checks=on` is
enabled (and wrap around otherwise), and they `panic!` on division by zero.
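For illustration, a sketch of these semantics; the comment on the addition
marks the behavior that depends on `-C overflow-checks`:

```rust
let x = i32x4::new(std::i32::MAX, 1, 2, 3);
let y = i32x4::splat(1);
// With -C overflow-checks=on this `panic!`s because lane 0 overflows;
// with overflow checks disabled, lane 0 wraps around to `i32::MIN`:
let _z = x + y;
// Wrapping semantics can always be requested explicitly:
assert_eq!(x.wrapping_add(y).extract(0), std::i32::MIN);
```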
##### Floating-point semantics

The behavior of these operations for floating-point vectors is the same as that
of the scalar floating-point types: `±INFINITY` on overflow and on division of a
non-zero number by zero, `NaN` for `0. / 0.`, etc.

#### Wrapping arithmetic operations

All signed and unsigned integer vector types implement the whole set of `pub fn
wrapping_{add,sub,mul,div,rem}(self, Self) -> Self` methods, which, on overflow,
produce the correct mathematical result modulo `2^n`, where `n` is the lane
width in bits.

The `div` and `rem` methods `panic!` on division by zero.

#### Unsafe wrapping arithmetic operations

All signed and unsigned integer vectors implement
`pub unsafe fn wrapping_{div,rem}_unchecked(self, Self) -> Self`
methods which, on overflow, produce the correct mathematical result modulo `2^n`.

If any of the vector elements is divided by zero the behavior is undefined.

#### Saturating arithmetic operations

All signed and unsigned integer vector types implement the whole set of `pub fn
saturating_{add,sub,mul,div,rem}(self, Self) -> Self` methods, which saturate on
overflow.

The `div` and `rem` methods `panic!` on division by zero.

#### Unsafe saturating arithmetic operations

All signed and unsigned integer vectors implement `pub unsafe fn
saturating_{div,rem}_unchecked(self, Self) -> Self` methods which saturate on
overflow.

If any of the vector elements is divided by zero the behavior is undefined.

#### Binary `min`/`max` vertical operations

All portable signed integer, unsigned integer, and floating-point vectors
implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Lane-wise `min`.
///
/// Returns a vector whose lanes contain the smallest
/// element of the corresponding lanes of `self` and `other`.
pub fn min(self, other: Self) -> Self;

/// Lane-wise `max`.
///
/// Returns a vector whose lanes contain the largest
/// element of the corresponding lanes of `self` and `other`.
pub fn max(self, other: Self) -> Self;
}
```

##### Floating-point semantics

The floating-point semantics follow the semantics of `min` and `max` for the
scalar `f32` and `f64` types.

#### Floating-point vertical math operations

All portable floating-point vector types implement the following methods:

```rust
impl f{lane_width}x{number_of_lanes} {
/// Square root.
fn sqrt(self) -> Self;
/// Reciprocal square-root estimate.
///
/// **FIXME**: an upper bound on the error should
/// be guaranteed before stabilization.
fn rsqrte(self) -> Self;
/// Fused multiply-add: `self * b + c`.
fn fma(self, b: Self, c: Self) -> Self;
}
```
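For illustration, a short sketch of the binary `min`/`max` operations and the
floating-point math methods specified above:

```rust
let a = f32x4::new(1., -2., 3., -4.);
let b = f32x4::splat(0.);
// Lane-wise min/max:
assert_eq!(a.min(b), f32x4::new(0., -2., 0., -4.));
assert_eq!(a.max(b), f32x4::new(1., 0., 3., 0.));
// Vertical square root and fused multiply-add (`self * b + c`):
let x = f32x4::splat(4.);
assert_eq!(x.sqrt(), f32x4::splat(2.));
assert_eq!(x.fma(f32x4::splat(2.), f32x4::splat(1.)), f32x4::splat(9.));
```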
#### Arithmetic reductions

##### Integers

All portable signed and unsigned integer vector types implement the following
methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Horizontal wrapping sum of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for a 4-element vector:
///
/// > (x0.wrapping_add(x1)).wrapping_add(x2.wrapping_add(x3))
///
/// If an operation overflows, it returns the correct mathematical
/// result modulo `2^n`, where `n` is the lane width in bits.
pub fn wrapping_sum(self) -> element_type;

/// Horizontal wrapping product of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for a 4-element vector:
///
/// > (x0.wrapping_mul(x1)).wrapping_mul(x2.wrapping_mul(x3))
///
/// If an operation overflows, it returns the correct mathematical
/// result modulo `2^n`, where `n` is the lane width in bits.
pub fn wrapping_product(self) -> element_type;
}
```

##### Floating-point

All portable floating-point vector types implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Horizontal sum of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for an 8-element vector:
///
/// > ((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7))
///
/// If one of the vector elements is `NaN`, the reduction returns
/// `NaN`. The resulting `NaN` is not required to be equal to any
/// of the `NaN`s in the vector.
pub fn sum(self) -> element_type;

/// Horizontal product of the vector elements.
///
/// The intrinsic performs a tree-reduction of the vector elements.
/// That is, for an 8-element vector:
///
/// > ((x0 * x1) * (x2 * x3)) * ((x4 * x5) * (x6 * x7))
///
/// If one of the vector elements is `NaN`, the reduction returns
/// `NaN`. The resulting `NaN` is not required to be equal to any
/// of the `NaN`s in the vector.
pub fn product(self) -> element_type;
}
```

#### Bitwise reductions

All signed and unsigned integer vectors implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Horizontal bitwise `and` of the vector elements.
pub fn and(self) -> element_type;

/// Horizontal bitwise `or` of the vector elements.
pub fn or(self) -> element_type;

/// Horizontal bitwise `xor` of the vector elements.
pub fn xor(self) -> element_type;
}
```

#### Min/Max reductions

All portable signed integer, unsigned integer, and floating-point vector types
implement the following methods:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Largest vector element value.
pub fn max_element(self) -> element_type;

/// Smallest vector element value.
pub fn min_element(self) -> element_type;
}
```

Note: the semantics of `{min,max}_element` for floating-point numbers are the
same as those of their `min`/`max` methods.
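For illustration, a sketch of the arithmetic, bitwise, and min/max reductions
on a small integer vector:

```rust
let v = i32x4::new(1, 2, 3, 4);
// Arithmetic reductions:
assert_eq!(v.wrapping_sum(), 10);
assert_eq!(v.wrapping_product(), 24);
// Bitwise reductions (1 & 2 == 0, so the `and` reduction is 0):
assert_eq!(v.and(), 0);
assert_eq!(v.or(), 7);
// Min/max reductions:
assert_eq!(v.min_element(), 1);
assert_eq!(v.max_element(), 4);
```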
#### Mask construction and element access

```rust
impl m{lane_width}x{number_of_lanes} {
/// Creates a new vector mask from `number_of_lanes` boolean
/// values.
///
/// The values `true` and `false` respectively set and clear
/// the mask for a particular lane.
pub const fn new(args...: bool) -> Self;

/// Returns the number of vector lanes.
pub const fn lanes() -> usize;

/// Constructs a new vector mask with all lane-wise
/// masks either set, if `value` equals `true`, or cleared, if
/// `value` equals `false`.
pub const fn splat(value: bool) -> Self;

/// Returns `true` if the mask for the lane `index` is
/// set and `false` otherwise.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
pub fn extract(self, index: usize) -> bool;

/// Returns `true` if the mask for the lane `index` is
/// set and `false` otherwise.
///
/// If `index >= Self::lanes()` the behavior is undefined.
pub unsafe fn extract_unchecked(self, index: usize) -> bool;

/// Returns a new vector mask where the mask of the lane `index` is
/// set if `new_value` is `true` and cleared otherwise.
///
/// # Panics
///
/// If `index >= Self::lanes()`.
#[must_use = "replace returns a new mask and does not modify the original value"]
pub fn replace(self, index: usize, new_value: bool) -> Self;

/// Returns a new vector mask where the mask of the lane `index` is
/// set if `new_value` is `true` and cleared otherwise.
///
/// If `index >= Self::lanes()` the behavior is undefined.
#[must_use = "replace_unchecked returns a new mask and does not modify the original value"]
pub unsafe fn replace_unchecked(self, index: usize, new_value: bool) -> Self;
}
```

#### Mask reductions

All vector masks implement the following methods:

```rust
impl m{lane_width}x{number_of_lanes} {
/// Are "all" lanes `true`?
pub fn all(self) -> bool;

/// Is "any" lane `true`?
pub fn any(self) -> bool;

/// Are "all" lanes `false`?
pub fn none(self) -> bool;
}
```

#### Mask vertical selection

All vector masks implement the following method:

```rust
impl m{lane_width}x{number_of_lanes} {
/// Lane-wise selection.
///
/// The lanes of the result for which the mask is `true` contain
/// the values of `a`, while the remaining lanes contain the values of `b`.
pub fn select(self, a: T, b: T) -> T
    where
    // for exposition only:
    // T::lanes() == number_of_lanes,
;
}
```

Note: how the `where` clause is enforced is an implementation detail. `stdsimd`
implements this using a sealed trait:

```rust
pub fn select<T>(self, a: T, b: T) -> T
    where T: SelectMask;
```

#### Vertical comparisons

All vector types implement the following vertical (lane-wise) comparison
methods, which return a mask expressing the result:

```rust
impl {element_type}{lane_width}x{number_of_lanes} {
/// Lane-wise equality comparison.
pub fn eq(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise inequality comparison.
pub fn ne(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise less-than comparison.
pub fn lt(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise less-than-or-equals comparison.
pub fn le(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise greater-than comparison.
pub fn gt(self, other: Self) -> m{lane_width}x{number_of_lanes};

/// Lane-wise greater-than-or-equals comparison.
pub fn ge(self, other: Self) -> m{lane_width}x{number_of_lanes};
}
```

For all vector types proposed in this RFC, the `{lane_width}` of the mask
matches that of the vector type. However, this will not be the case for the
AVX-512 vector types.

##### Semantics for floating-point numbers

The semantics of the lane-wise comparisons for floating-point numbers are the
same as in the scalar case.
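For illustration, a sketch that combines a vertical comparison with mask
selection to clamp negative lanes to zero without branching:

```rust
let x = i32x4::new(-1, 2, -3, 4);
// Lane-wise `x < 0`:
let m: m32x4 = x.lt(i32x4::splat(0));
// Replace the selected (negative) lanes with zero:
let r = m.select(i32x4::splat(0), x);
assert_eq!(r, i32x4::new(0, 2, 0, 4));
```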
### Portable vector shuffles

```rust
/// Shuffles vector elements.
std::simd::shuffle!(...);
```

The `shuffle!` macro returns a new vector that contains a shuffle of the
elements in one or two input vectors. There are two versions:

* `shuffle!(vec, indices)`: one-vector version
* `shuffle!(vec0, vec1, indices)`: two-vector version

with the following preconditions:

* `vec`, `vec0`, and `vec1` must be portable packed SIMD vector types,
* `vec0` and `vec1` must have the same type,
* `indices` must be a `const` array of type `[usize; N]`, where `N` is any
  power of two in the range `[2, 2 * {vec,vec0,vec1}::lanes()]`,
* the values of `indices` must be in the range `[0, vec::lanes())` for the
  one-vector version, and in the range `[0, 2 * {vec0,vec1}::lanes())` for the
  two-vector version.

On precondition violation a type error is produced.

The macro returns a new vector whose:

* element type equals that of the input vectors,
* length equals `N`, that is, the length of the `indices` array.

The `i`-th element of `indices` with value `j` in the range `[0, vec::lanes())`
stores the `j`-th element of the first vector into the `i`-th element of the
result vector.

In the two-vector version, the `i`-th element of `indices` with value `j` in the
range `[vec0::lanes(), 2 * vec0::lanes())` stores the `j - vec0::lanes()`-th
element of the second vector into the `i`-th element of the result vector.

#### Example: shuffles

The `shuffle!` macro allows reordering the elements of a vector:

```rust
let x = i32x4::new(1, 2, 3, 4);
let r = shuffle!(x, [2, 1, 3, 0]);
assert_eq!(r, i32x4::new(3, 2, 4, 1));
```

where the resulting vector can also be smaller:

```rust
let r = shuffle!(x, [1, 3]);
assert_eq!(r, i32x2::new(2, 4));
```

or larger

```rust
let r = shuffle!(x, [1, 3, 2, 2, 1, 3, 2, 2]);
assert_eq!(r, i32x8::new(2, 4, 3, 3, 2, 4, 3, 3));
```

than the input. The length of the result is, however, limited to the range
`[2, 2 * vec::lanes()]`.

It also allows shuffling between two vectors:

```rust
let y = i32x4::new(5, 6, 7, 8);
let r = shuffle!(x, y, [4, 0, 5, 1]);
assert_eq!(r, i32x4::new(5, 1, 6, 2));
```

where the indices of the second vector's elements start at the `vec::lanes()`
offset.

#### Conversions and bitcasts
[casts-and-conversions]: #casts-and-conversions

##### Conversions / bitcasts between vector types

There are three different ways to convert between vector types.

* `From`/`Into`: value-preserving widening conversions between vectors with the
  same number of lanes. That is, `f32x4` can be converted into `f64x4` using
  `From`/`Into`, but the opposite is not true because that conversion is not
  value preserving. The `From`/`Into` implementations mirror those of the
  primitive integer and floating-point types. These conversions can widen the
  size of the element type, and thus the size of the SIMD vector type. Signed
  vector types are sign-extended lane-wise, while unsigned vector types are
  zero-extended lane-wise. The result of these conversions is
  endian-independent.

* `as`: non-value-preserving truncating conversions between vectors with the
  same number of lanes. That is, `f64x4 as f32x4` performs a lane-wise `as`
  cast, truncating the values if they would overflow the destination type. The
  result of these conversions is endian-independent.

* `unsafe mem::transmute`: bit-casts between vectors of the same size; the
  vectors do not need to have the same number of lanes. For example, transmuting
  a `u8x16` into a `u16x8`. Note that while all bit-patterns of the `{i,u,f}`
  vector types represent a valid vector value, there are many vector mask
  bit-patterns that do not represent a valid mask. Note also that the result of
  `unsafe mem::transmute` is **endian-dependent** (see the examples below).
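For illustration, a sketch of the first and third conversion kinds; the widened
lane values are endian-independent, while the transmuted lane values are not
(see the unresolved questions):

```rust
let a = f32x4::new(1., 2., 3., 4.);
// `From`/`Into`: value-preserving widening, same number of lanes:
let b: f64x4 = a.into();
assert_eq!(b.extract(0), 1.0);
// `mem::transmute`: same size (128 bits), different lane count;
// the resulting lane values depend on the target's endianness:
let bits: u16x8 = unsafe { std::mem::transmute(a) };
```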
It is extremely common to perform "transmute" operations between equally-sized
portable vector types when writing SIMD algorithms. Rust currently does not have
any facilities to express that all bit-patterns of one type are also valid
bit-patterns of another type, or to perform these safe transmutes in an
endian-independent way.

This forces users to resort to `unsafe { mem::transmute(x) }` and, very likely,
to write non-portable code.

There is a very interesting discussion about potential ways to attack this
problem in [this internals
thread](https://internals.rust-lang.org/t/pre-rfc-frombits-intobits/7071/23),
and there is also an [open issue in `stdsimd` about endian-dependent
behavior](https://github.com/rust-lang-nursery/stdsimd/issues/393) - if you care
deeply about this, please chime in.

These issues are not specific to portable packed SIMD vector types, and fixing
them is not the purpose of this RFC, but they are critical for writing
efficient and portable SIMD code reliably and ergonomically.

##### Other conversions

The layout of the portable packed vector types is compatible with the layout of
fixed-size arrays of the same element type and the same number of lanes (e.g.
`f32x4` is layout compatible with `[f32; 4]`).

For all signed, unsigned, and floating-point vector types with element type `E`
and number of lanes `N`, the following implementations exist:

```rust
impl From<[E; N]> for ExN;
impl From<ExN> for [E; N];
```

## ABI and `std::simd`

The ABI is first and foremost unspecified and may change at any time.

All `std::simd` types are forbidden in `extern` functions (or warned against) -
basically the same story as for architecture-specific types like `__m128i` in
`extern` functions.

As of today, they will be implemented as pass-via-pointer unconditionally. For
example:

```rust
fn foo(a: u32x4) { /* ... */ }

foo(u32x4::splat(3));
```

This example will pass the variable `a` through memory. The function calling
`foo` will place `a` on the stack and then `foo` will read `a` from the stack
to work with it. Note that if `foo` changes the value of `a` this will not be
visible to the caller; the arguments are semantically pass-by-value but
implemented as pass-via-pointer.

Currently, we aren't aware of any slowdowns or perf hits from this mechanism
(passing through memory instead of by value). If something comes up, leaving the
ABI unspecified allows us to try to address it.

# Drawbacks
[drawbacks]: #drawbacks

## Generic vector type requirement for backends

The `std::arch` module provides architecture-specific vector types, where
backends only need to provide vector types for the architectures that they
support.

This RFC requires backends to provide generic vector types. Most backends
support this in one form or another, but if a future backend does not, this RFC
can be implemented on top of the architecture-specific types.

## Achieving zero-overhead is outside Rust's control

A future architecture might have an instruction that performs multiple
operations exposed by this API in one go, like `(a + b).wrapping_sum()` on an
`i32x4` vector. If that expression does not produce optimal machine code, Rust
has a performance bug.

This is not a performance bug that can be easily worked around in `stdsimd` or
`rustc`, making this, almost certainly, a performance bug in the backend. These
performance bugs can be arbitrarily hard to fix, and fixing them might not
always be worth it.

That is, while these APIs should make it possible for reasonably-designed
optimizing Rust backends to achieve zero-overhead, zero-overhead can only be
provided in practice on a best-effort basis.

## Performance of this API might vary dramatically

The performance of this API can vary dramatically depending on the architecture
being targeted and the target features enabled.

First, this is a consequence of portability, and thus a feature.
However, the fact that
portability can introduce performance bugs is a real concern. In any case, if a
user is able to write faster code for some architecture, they should file a
performance bug.

# Rationale and alternatives
[alternatives]: #alternatives

### Dynamic values result in poor code generation for some operations

Some of the fundamental APIs proposed in this RFC, like `vec::{new, extract,
replace}`, take run-time dynamic parameters. Consider the following example (see
the whole example live at [`rust.godbolt.org`](https://godbolt.org/g/yhiAa2)):

```rust
/// Returns an f32x8 with the values 0., 1., ..., 7.
fn increasing() -> f32x8 {
    let mut x = f32x8::splat(0.);
    for i in 0..f32x8::lanes() {
        x = x.replace(i, i as f32);
    }
    x
}
```

In release mode, `rustc` generates the following assembly for this function:

```asm
.LCPI0_0:
    .long 0
    .long 1065353216
    .long 1073741824
    .long 1077936128
    .long 1082130432
    .long 1084227584
    .long 1086324736
    .long 1088421888
example::increasing:
    pushq %rbp
    movq %rsp, %rbp
    vmovaps .LCPI0_0(%rip), %ymm0
    vmovaps %ymm0, (%rdi)
    movq %rdi, %rax
    popq %rbp
    vzeroupper
    retq
```

which uses two vector moves: one to load the precomputed values into a SIMD
register, and one to store them to the caller-provided return slot - digression:
this pass through memory is due to Rust's SIMD vector types ABI and happens only
in "isolated" examples like this one.

If we change this function to accept run-time bounds for the loop:

```rust
/// Returns an f32x4 whose elements at the indices in `[a, b)` are set
/// to their index value; the remaining elements are `0.`.
fn increasing_rt(a: usize, b: usize) -> f32x4 {
    let mut x = f32x4::splat(0.);
    for i in a..b {
        x = x.replace(i, i as f32);
    }
    x
}
```

then the number of instructions generated explodes:

```asm
example::increasing_rt:
    pushq %rbp
    movq %rsp, %rbp
    andq $-32, %rsp
    subq $320, %rsp
    vxorps %xmm0, %xmm0, %xmm0
    cmpq %rsi, %rdx
    jbe .LBB1_34
    movl %edx, %r9d
    subl %esi, %r9d
    leaq -1(%rdx), %r8
    subq %rsi, %r8
    andq $7, %r9
    je .LBB1_2
    negq %r9
    vxorps %xmm0, %xmm0, %xmm0
    movq %rsi, %rcx
.LBB1_4:
    testq %rcx, %rcx
    js .LBB1_5
    vcvtsi2ssq %rcx, %xmm2, %xmm1
...200 lines more...
```

This code isn't necessarily horrible, but it is definitely harder to reason
about its performance. This has two main causes:

* **ISAs do not support these operations**: most (all?) ISAs support operations
  like `extract` and `replace` with constant indices only. That is, with
  run-time indices these operations do not map to single instructions on most
  ISAs.

* **these operations are slow**: even with constant indices, these operations
  are slow. Often, for each constant index, a different instruction must be
  generated, and occasionally, for a particular constant index, the operation
  requires multiple instructions.

So we have a trade-off to make between providing a comfortable API for programs
that really must extract a single value with a run-time index, and providing an
API with "reliable" performance.

The proposed API accepts run-time indices (and values for `new`):

* **common** SIMD code indexes with compile-time indices: this code gets optimized
  reasonably well with the LLVM backend, but the user needs to deal with both the
  safe-but-checked and the `unsafe`-but-unchecked APIs. If we were to only accept
  constant indices, the unchecked API would not be necessary, since the checked
  API would ensure that the indices are in-bounds at compile time.
* **rare** SIMD code indexes with run-time indices: this is code that one should
  really avoid writing. The current API makes writing this code extremely easy,
  resulting in SIMD code with potentially unexpected performance. Users also
  have to deal with two APIs for this, the checked/unchecked APIs, and also
  with the memory `read`/`write` APIs, which are better suited for this use case.

Whether the current design is the right one should probably be clarified during
the RFC. An important aspect to consider is that Rust support for constants is
very basic: `const fn`s are just getting started, `const` generics are not there
yet, etc. That is, making the API take constant indices might severely limit the
type of code that can be used with these APIs in today's Rust.

### Binary (vector,scalar) and (scalar,vector) operations

This RFC can be extended with binary vector-scalar and scalar-vector operations
by implementing the following traits for signed integer, unsigned integer, and
floating-point vectors:

* `{Add,Sub,Mul,Div,Rem}<element_type>`,
  `{Add,Sub,Mul,Div,Rem}<vector_type> for element_type`,
  `{Add,Sub,Mul,Div,Rem}Assign<element_type>`: binary
  scalar-vector vertical (lane-wise) arithmetic operations.

and the following traits for signed and unsigned integer vectors:

* `Bit{And,Or,Xor}<element_type>`,
  `Bit{And,Or,Xor}<vector_type> for element_type`,
  `Bit{And,Or,Xor}Assign<element_type>`: binary scalar-vector vertical
  (lane-wise) bitwise operations.

* `{Shl,Shr}<I>`, `{Shl,Shr}Assign<I>`: for all integer types `I` in
  {`i8`, `i16`, `i32`, `i64`, `i128`, `isize`, `u8`, `u16`, `u32`, `u64`,
  `u128`, `usize`}. Note: whether only `element_type` or all integer types
  should be allowed is debatable: `stdsimd` currently allows using all integer
  types.

These traits slightly improve the ergonomics of scalar-vector operations:

```rust
let mut x: f32x4;
let y: f32x4;
let a: f32;
let z = a * x + y;
// instead of: let z = f32x4::splat(a) * x + y;
x += a;
// instead of: x += f32x4::splat(a);
```

but they do not enable anything new that can't easily be done without them by
just using `vec::splat`, and initial feedback on the RFC suggested that they
are an abstraction that hides the cost of splatting the scalar into a vector.

These traits are implemented in `stdsimd` (and thus available in nightly Rust),
are trivial to implement (`op(vec_ty::splat(scalar), vec)` and `op(vec,
vec_ty::splat(scalar))`), and cannot be "seamlessly" provided by users due to
coherence.

They are not part of this RFC, but they can easily be added (now or later) if
there is consensus to do so. In the meantime, they can be experimented with on
nightly Rust. If there is consensus to remove them, porting nightly code off
them is also pretty easy.

### Tiny vector types

On most platforms, SIMD registers have a constant width, and they can be used to
operate on vectors with a smaller bit width. However, 16 and 32-bit wide
vectors are "small" by most platforms' standards.

These types are useful for performing SIMD Within A Register (SWAR) operations
on platforms without SIMD registers (see the sketch below). While their
performance has not been extensively investigated in `stdsimd` yet, any
performance issues are performance bugs that should be fixed.
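For illustration, a minimal SWAR-style sketch: `u8x4` is only 32 bits wide, so
on a target without SIMD registers the lane-wise addition below can be lowered
to plain integer arithmetic in a general-purpose register:

```rust
let x = u8x4::new(1, 2, 3, 4);
let y = u8x4::splat(10);
// A single 32-bit wide lane-wise addition:
assert_eq!(x + y, u8x4::new(11, 12, 13, 14));
```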
### Portable shuffles API

Portable shuffles are exposed via the `shuffle!` macro. Generating the sequence
of instructions required to perform a shuffle requires the shuffle indices to be
known at compile time.

In the future, an alternative API based on `const` generics and/or `const`
function arguments could be added in a backwards-compatible way:

```rust
impl {element_type}{element_width}x{number_of_lanes} {
    pub fn shuffle(self, const indices: [usize; N])
        -> <[usize; N] as ShuffleResult>::ShuffleResultType
        where [usize; N]: ShuffleResult;
}
```

Offering this same API today is doable:

```rust
impl {element_type}{element_width}x{number_of_lanes} {
    #[rustc_const_argument(2)] // specifies that `indices` must be a const
    #[rustc_platform_intrinsic(simd_shuffle2)]
    // ^^^ specifies that this method should be treated as the
    // "platform-intrinsic" "simd_shuffle2"
    pub fn shuffle2<I>(self, other: Self, indices: I)
        -> <I as ShuffleResult>::ShuffleResultType
        where I: ShuffleResult;

    #[rustc_const_argument(1)]
    #[rustc_platform_intrinsic(simd_shuffle1)]
    pub fn shuffle<I>(self, indices: I)
        -> <I as ShuffleResult>::ShuffleResultType
        where I: ShuffleResult;
}
```

If there is consensus for it, the RFC can easily be amended.

# Prior art
[prior-art]: #prior-art

Most of this RFC is implemented in `stdsimd` and can be used on nightly today
via the `std::simd` module. The `stdsimd` crate is an effort started by
@burntsushi to put the `rust-lang-nursery/simd` crate into a state suitable for
stabilization. The `rust-lang-nursery/simd` crate was mainly developed by @huonw
and IIRC it is heavily inspired by Dart's SIMD, which is where the `f32x4`
naming scheme comes from. This RFC has been heavily inspired by Dart, and two of
the three examples used in the motivation come from the [Using SIMD in
Dart](https://www.dartlang.org/articles/dart-vm/simd) article written by John
McCutchan. Some of the key ideas of this RFC come from LLVM's design, which was
originally inspired by GCC's vector extensions, which were probably inspired by
something else. Most parts of this RFC are also consistent with the [128-bit
SIMD proposal for
WebAssembly](https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md).

Or in other words: to the author's best knowledge, this RFC does not contain any
really novel ideas. Instead, it only draws inspiration from previous designs
that have withstood the test of time, and it adapts these designs to Rust.

# Unresolved questions
[unresolved]: #unresolved-questions

### Interaction with Cray vectors

The vector types proposed in this RFC are packed, that is, their size is fixed
at compile time.

Many modern architectures support vector operations of run-time size, often
called Cray vectors or scalable vectors. These include, amongst others, the
NEC SX, ARM SVE, and RISC-V's vector extension proposal. These architectures
have traditionally relied on auto-vectorization combined with support for
explicit vectorization annotations, but newer architectures like ARM SVE
introduce explicit vectorization intrinsics.

This is an example adapted from this [ARM SVE
paper](https://developer.arm.com/hpc/arm-scalable-vector-extensions-and-application-to-machine-learning)
to pseudo-Rust:

```rust
/// Adds `c` to every element of the slice `src`, storing the result in `dst`.
fn add_constant(dst: &mut [f64], src: &[f64], c: f64) {
    assert!(dst.len() == src.len());

    // Instantiate a dynamic vector (f64xN) with all lanes set to `c`:
    let vc: f64xN = f64xN::splat(c);

    // The number of lanes that each iteration of the loop can process
    // is unknown at compile time (f64xN::lanes() is evaluated at run-time):
    for i in (0..src.len()).step_by(f64xN::lanes()) {

        // Instantiate a dynamic boolean vector with the
        // result of the predicate `i + lane < src.len()`.
        // This boolean vector acts as a mask: elements
        // "in-bounds" of the slice `src` are set to `true`,
        // while out-of-bounds elements are set to `false`:
        let m: bxN = f64xN::while_lt(i, src.len());

        // Read the elements of the source using the mask:
        let vsrc: f64xN = f64xN::read_unaligned(m, &src[i..]);

        // Add the constant vector using the mask:
        let vdst: f64xN = vsrc.add(m, vc);

        // Write the result back to memory using the mask:
        vdst.write_unaligned(m, &mut dst[i..]);
    }
}
```

The RISC-V vector extension proposal introduces a model similar in spirit to ARM
SVE. These extensions are, however, not official yet, and it is currently
unknown whether GCC and LLVM will expose explicit intrinsics for them. It would
not be surprising if they do, and it would not be surprising if similar Cray
vector extensions were introduced in other architectures in the future.

The main differences between Cray vectors and portable packed vectors are that:

* the number of lanes of a Cray vector is a run-time dynamic value,
* the Cray vector "objects" are like magical compiler token values,
* the loop induction variable must be incremented by the dynamic number of lanes
  of the vector type, and
* most Cray vector operations require a mask indicating which elements of
  the vector the operation applies to.

These differences will probably force the API of Cray vector types to be
slightly different from that of the packed vector types.

The current RFC, therefore, assumes no interaction with Cray vector types.

It does not prevent portable Cray vector types from being added to Rust in
the future via an orthogonal API, nor does it prevent adding a way for both to
interact (e.g. through memory). But at this point in time, whether these things
are possible is an open research problem.

### Half-float support

Many architectures (ARM, AArch64, PowerPC, MIPS, RISC-V) support half-float
(`f16`) vector types. It is unclear what to do with these at this point in time
since Rust currently lacks language support for half-floats.

### AVX-512 and m1xN masks support

Currently, `std::arch` provides very limited AVX-512 support, and the prototype
implementation of the `m1xN` masks like `m1x64` in `stdsimd` implements them as
512-bit wide vectors when they should actually only be 64 bits wide.

Finishing the implementation of these types requires work that just has not been
done yet.

### Fast math

The performance of the portable operations can in some cases be significantly
improved by making assumptions about the kind of arithmetic that is allowed.

For example, some of the horizontal reductions benefit from assuming math to be
finite (no `NaN`s), and others from assuming math to be associative (e.g.
associativity allows tree-like reductions for sums).
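For illustration, a sketch of why associativity matters: `f32` addition is not
associative, so the tree reduction specified for `sum` above can produce a
different result than a strict left-to-right sum.

```rust
let v = f32x4::new(1e30, 1.0, -1e30, 1.0);
// Strict left-to-right sum: ((1e30 + 1.0) + -1e30) + 1.0 == 1.0,
// because `1.0` is absorbed by `1e30` in the first addition.
// Tree reduction: (1e30 + 1.0) + (-1e30 + 1.0) == 0.0,
// because `1.0` is absorbed on both sides.
assert_eq!(v.sum(), 0.0);
```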
A future RFC could add more reduction variants with different requirements and
performance characteristics, for example, `.wrapping_sum_unordered()` or
`.max_element_nanless()`, but these are not considered in this RFC because
their interaction with fast-math is unclear.

A potentially better idea would be to allow users to specify the assumptions
that an optimizing compiler can make about floating-point arithmetic in a
finer-grained way.

For example, we could design an `#[fp_math]` attribute usable at, for example,
crate, module, function, and block scope, so that users can specify exactly
which IEEE 754 restrictions the compiler is allowed to lift where:

```rust
fn foo(x: f32x4, y: f32x4) -> f32 {
    let (w, z) = #[fp_math(assume = "associativity")] {
        // All fp math is associative; reductions can be unordered:
        let w = x.sum();
        let z = y.sum();
        (w, z)
    };

    let m = f32x4::splat(w + z) * (x + y);

    #[fp_math(assume = "finite")] {
        // All fp math is assumed finite; the reduction can assume that
        // NaNs aren't present:
        m.max_element()
    }
}
```

There are obviously many approaches to tackling this problem, but it does make
sense to have a plan for them before workarounds start getting bolted onto RFCs
like this one. There is an [internals
post](https://internals.rust-lang.org/t/pre-pre-rfc-floating-point-math-assumptions-fast-math/7162)
exploring the design space.

### Endian-dependent behavior

The results of the indexed operations (`extract`, `replace`, `write`) and of the
`new` method are endian-independent. That is, the following example is
guaranteed to pass on both little-endian (LE) and big-endian (BE) architectures:

```rust
let v = i32x4::new(0, 1, 2, 3);
assert_eq!(v.extract(0), 0); // OK in LE and BE
assert_eq!(v.extract(3), 3); // OK in LE and BE
```

The result of bit-casting two equally-sized vectors using `mem::transmute` is,
however, endian-dependent:

```rust
let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
let t: i16x8 = unsafe { mem::transmute(x) }; // UNSAFE
if cfg!(target_endian = "little") {
    let t_el = i16x8::new(256, 770, 1284, 1798, 2312, 2826, 3340, 3854);
    assert_eq!(t, t_el); // OK in LE | (would) ERROR in BE
} else if cfg!(target_endian = "big") {
    let t_eb = i16x8::new(1, 515, 1029, 1543, 2057, 2571, 3085, 3599);
    assert_eq!(t, t_eb); // OK in BE | (would) ERROR in LE
}
```

which applies to memory reads and writes as well:

```rust
let x = i8x16::new(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
let mut y: [i16; 8] = [0; 8];
x.write_unaligned(unsafe {
    slice::from_raw_parts_mut(&mut y as *mut _ as *mut i8, 16)
});

if cfg!(target_endian = "little") {
    let e: [i16; 8] = [256, 770, 1284, 1798, 2312, 2826, 3340, 3854];
    assert_eq!(y, e);
} else if cfg!(target_endian = "big") {
    let e: [i16; 8] = [1, 515, 1029, 1543, 2057, 2571, 3085, 3599];
    assert_eq!(y, e);
}

let z = i8x16::read_unaligned(unsafe {
    slice::from_raw_parts(&y as *const _ as *const i8, 16)
});
assert_eq!(z, x);
```