# Vector lane ordering

Or: How to implement raw\_bitcast on a big-endian architecture

- Part 1: Clarifying raw\_bitcast semantics
- Part 2: Efficient implementation on big-endian machines

## Rust / C / C++ SIMD vectors – little-endian architecture



raw\_bitcast: lane 0 of the smaller type is the **least**-significant part of lane 0 of the larger type $\rightarrow$ LE lane order

## Rust / C / C++ SIMD vectors – big-endian architecture



raw\_bitcast: lane 0 of the smaller type is the **most**-significant part of lane 0 of the larger type $\rightarrow$ BE lane order

## WebAssembly vector types – defined as values



raw\_bitcast: lane 0 of the smaller type is the **least**-significant part of lane 0 of the larger type $\rightarrow$ LE lane order
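The two lane orders can be illustrated with a minimal sketch that treats a 128-bit vector purely as a value (a `u128`), in the spirit of the WebAssembly definition. The helper names `lane_le` and `lane_be` are hypothetical and exist only for this illustration; they are not part of Cranelift or any other library.

```rust
/// Extract lane `idx` of width `lane_bits` from a 128-bit vector value,
/// numbering lanes from the least-significant end (LE lane order).
/// Assumes `lane_bits` is 8, 16, 32, or 64.
fn lane_le(v: u128, lane_bits: u32, idx: u32) -> u128 {
    let mask = (1u128 << lane_bits) - 1;
    (v >> (idx * lane_bits)) & mask
}

/// Extract lane `idx`, numbering lanes from the most-significant end
/// (BE lane order). Assumes `lane_bits` is 8, 16, 32, or 64.
fn lane_be(v: u128, lane_bits: u32, idx: u32) -> u128 {
    let lanes = 128 / lane_bits;
    lane_le(v, lane_bits, lanes - 1 - idx)
}
```

Under LE lane order, 8-bit lane 0 is the least-significant byte of 32-bit lane 0; under BE lane order, it is the most-significant byte of 32-bit lane 0. The same 128 bits therefore yield different lane contents depending on which numbering is assumed.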

## Summary of raw\_bitcast semantics

- The raw\_bitcast operator is underspecified; it actually exists in two flavors:
  - "Little-endian lane order": lane 0 of the smaller type is the **least**-significant part of lane 0 of the larger type
  - "Big-endian lane order": lane 0 of the smaller type is the **most**-significant part of lane 0 of the larger type
- Current usage of raw\_bitcast
  - On a little-endian machine, all uses of raw\_bitcast use LE lane order
  - On all machines, raw\_bitcast emitted from wasmtime uses LE lane order
  - On a big-endian machine, raw\_bitcast emitted from cg\_clif uses BE lane order
- Options to define semantics
  - Implicit (back-end treats functions emitted from wasmtime differently)
  - Treat as memory operation (add MemFlags including endianness options)
  - Separate opcodes, e.g. raw\_bitcast\_le and raw\_bitcast\_be
    - Should we have an implicit "native" variant?
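As a rough illustration of what explicit LE and BE variants would mean, the following sketch models an i32x4 to i8x16 raw\_bitcast on plain arrays indexed by lane number. The `LaneOrder` enum and the function name are assumptions made for this example; they do not correspond to existing CLIF opcodes or Cranelift APIs.

```rust
/// Hypothetical lane-order selector, standing in for separate
/// raw_bitcast_le / raw_bitcast_be opcodes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LaneOrder {
    Little,
    Big,
}

/// Model of raw_bitcast from i32x4 to i8x16. Input and output arrays are
/// indexed by lane number; `order` decides which byte of each 32-bit lane
/// becomes the lowest-numbered 8-bit lane.
fn raw_bitcast_i32x4_to_i8x16(lanes: [u32; 4], order: LaneOrder) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, lane) in lanes.iter().enumerate() {
        let bytes = match order {
            // LE lane order: least-significant byte gets the lowest lane number.
            LaneOrder::Little => lane.to_le_bytes(),
            // BE lane order: most-significant byte gets the lowest lane number.
            LaneOrder::Big => lane.to_be_bytes(),
        };
        out[i * 4..i * 4 + 4].copy_from_slice(&bytes);
    }
    out
}
```

A "native" variant, if one were added, would presumably pick one of the two arms based on the target's endianness; that choice is left open here, as it is in the list above.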

## Implementation on little-endian machines



## Implementation on big-endian machines – default (e.g. cg\_clif)



## Implementation on big-endian machines – little-endian memory




## Implementation on big-endian machines – LE memory, inverted



## Implementation on big-endian machines – BE memory, inverted




## Summary of implementation options

- The back-end has the choice between two options
  - In-register lane order matches HW order (BE lane order)
  - In-register lane order inverted from HW order (LE lane order)
- Impact of lane order choice on visible semantics
  - Either implementation option can fully implement CLIF semantics
  - In particular, either option can implement both LE and BE raw\_bitcast
  - Implementation option potentially visible in the ABI (vector argument/return regs)
    - SystemV ABI requires BE lane order, Wasmtime ABI free to define
    - Can be made transparent via lane swaps at ABI boundaries
    - More efficient to choose one defined lane order per ABI
- Impact of lane order choice on implementation
  - Either option can be fully implemented, including both LE/BE memory load/store
  - Affected instructions: memory ops, explicit lane number ops, raw\_bitcast
  - Efficiency: raw\_bitcast is a no-op if and only if the requested lane order matches the implementation lane order; otherwise it requires a permute (element swap)
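The efficiency point can be sketched on a `u128` register model. When the requested lane order differs from the in-register lane order, one way to obtain the result is to reverse the source-type lanes (moving to the requested numbering), reinterpret, and then reverse the destination-type lanes (moving back). All names below are hypothetical, and a real back-end would presumably emit a single permute instruction rather than two reversals.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LaneOrder {
    Little,
    Big,
}

/// Reverse the order of `lane_bits`-wide lanes in a 128-bit register value.
/// Assumes `lane_bits` is 8, 16, 32, or 64.
fn reverse_lanes(v: u128, lane_bits: u32) -> u128 {
    let n = 128 / lane_bits;
    let mask = (1u128 << lane_bits) - 1;
    let mut out = 0u128;
    for i in 0..n {
        let lane = (v >> (i * lane_bits)) & mask;
        out |= lane << ((n - 1 - i) * lane_bits);
    }
    out
}

/// Model of raw_bitcast on a register whose lanes are stored in
/// `implemented` order, when the IR asks for `requested` order.
fn raw_bitcast(
    reg: u128,
    requested: LaneOrder,
    implemented: LaneOrder,
    from_lane_bits: u32,
    to_lane_bits: u32,
) -> u128 {
    if requested == implemented {
        // Lane orders agree: the bits are reinterpreted unchanged (no-op).
        reg
    } else {
        // Lane orders differ: net effect is an element swap within each
        // larger lane, expressed here as two lane reversals.
        reverse_lanes(reverse_lanes(reg, from_lane_bits), to_lane_bits)
    }
}
```

For example, with BE in-register order and an i32x4 whose lane 0 is `0x03020100`, an LE raw\_bitcast to i8x16 in this model byte-swaps each 32-bit lane, so that 8-bit lane 0 holds `0x00`.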

## Proposed solution

- Choose in-register lane order based on the current function's ABI
  - LE lane order if the function uses the Wasmtime ABI
  - BE lane order otherwise
- Effect
  - Fully implemented CLIF semantics, including LE and BE raw\_bitcast
  - No element swaps needed at ABI boundaries
  - Efficient implementation of all variants of memory ops & explicit lane order ops
  - raw\_bitcast implementation:
    - LE raw\_bitcast in Wasmtime ABI functions is a no-op
    - BE raw\_bitcast in functions using other ABIs is a no-op
    - Result: every raw\_bitcast used by all current front ends is a no-op!
- Staged implementation
  - Phase 1: Back-end only, assuming every raw\_bitcast is a no-op
  - Phase 2: Implement explicit raw\_bitcast semantics via CLIF extension
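The selection rule in the proposal can be written down as a small sketch. The `Abi` and `LaneOrder` enums and both function names are hypothetical stand-ins chosen for this example; they are not the actual Cranelift calling-convention types, and `SystemV` here simply represents "any ABI other than Wasmtime".

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LaneOrder {
    Little,
    Big,
}

/// Hypothetical stand-in for the ABI of the function being compiled.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Abi {
    Wasmtime,
    SystemV,
}

/// In-register lane order chosen per function, following the proposal:
/// LE for the Wasmtime ABI, BE otherwise.
fn in_register_lane_order(abi: Abi) -> LaneOrder {
    match abi {
        Abi::Wasmtime => LaneOrder::Little,
        Abi::SystemV => LaneOrder::Big,
    }
}

/// A raw_bitcast is a no-op exactly when the requested lane order matches
/// the in-register lane order of the enclosing function.
fn raw_bitcast_is_noop(abi: Abi, requested: LaneOrder) -> bool {
    in_register_lane_order(abi) == requested
}
```

Under this rule, the combinations produced by current front ends (LE in Wasmtime ABI functions, BE elsewhere on a big-endian machine) all fall into the no-op case, which is what Phase 1 relies on.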

