This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Initial 128-bit SIMD proposal #1

Merged: 5 commits, Apr 25, 2017

Conversation

Opened by stoklund (Contributor).

@jfbastien (Member) left a comment:

Overall I like the way this is presented, and the opcodes and types seem like what I would expect.

There are things that I'd drop for now as noted in the comments.

I'd like to reiterate that after we agree on the general form, we expect significant performance work to show that the proposed primitives can:

  • reach performance gains close to native
  • in real world code, both integer and floating-point
  • on a set of "tier 1" ISAs, without one suffering while another wins

I don't think performance work should be required for this PR itself, as that would go against the purpose of having this repo. I do want to state our expectation up-front to avoid surprises in the future.


Referencing a common specification of the portable SIMD semantics reduces the
work required to support both SIMD.js and WebAssembly SIMD in the same
implementation.
Member:

Is that a goal which any implementation cares about? I don't think WebAssembly should constrain itself in that direction.

Reply:

I agree that this is not too relevant to be included as a goal.

out-of-range lane index is a validation error.

* `RoundingMode`: Rounding modes are encoded as `varuint7` immediate operands.
An out-of-range rounding mode is a validation error.
Member:

Is this something that we need in a first version? I get that it's useful, but scalar operations don't support rounding mode.

IIRC SIMD.js didn't either. Why was it not added then, and what makes it more desirable now?

Contributor Author:

I don't think we need rounding mode support in the first version. I was anticipating their inclusion in the scalar spec.

We could represent this as an immediate field that must be RoundTiesToEven in the first version, or we can add duplicate opcodes with rounding mode support in the future. It's only 14 opcodes that take a rounding mode, so I guess duplicating them is not too bad.

Member:

I would suggest starting this PR without rounding at all, mimicking scalar, and filing a separate issue proposing how to encode it for both SIMD and scalar.

| `i8x16.addSaturate_s(a: v128, b: v128) -> v128` | [s8x16.addSaturate](portable-simd.md#saturating-integer-addition) |
| `i8x16.addSaturate_u(a: v128, b: v128) -> v128` | [u8x16.addSaturate](portable-simd.md#saturating-integer-addition) |
| `i8x16.subSaturate_s(a: v128, b: v128) -> v128` | [s8x16.subSaturate](portable-simd.md#saturating-integer-subtraction) |
| `i8x16.subSaturate_u(a: v128, b: v128) -> v128` | [u8x16.subSaturate](portable-simd.md#saturating-integer-subtraction) |
Member:

Do we need saturate operations in the initial version?

Contributor Author:

These are very useful operations both for fixed-point DSP and image processing. They all exist as individual SIMD instructions in ARM, Intel, and MIPS ISAs.

Scalar versions of these instructions are not commonly available in ISAs. In scalar code you're usually computing in 32-bit registers anyway, so saturating results to 8 or 16 bits is easier. In SIMD code you only get 8 or 16 bits per lane, so these instructions matter more there.

LLVM's auto-vectorizer can pattern-match these operations in scalar code and generate the vector instructions.
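As an illustration of that last point (a sketch, not part of the proposal): the widen-add-clamp idiom below is the kind of scalar code a vectorizer can collapse into a single saturating SIMD instruction.

```c
#include <stdint.h>

/* Scalar unsigned saturating add on 8-bit values: widen, add, clamp.
   An auto-vectorizer can recognize this pattern and emit one SIMD
   instruction covering 16 lanes (i.e. i8x16.addSaturate_u). */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    uint16_t sum = (uint16_t)a + (uint16_t)b; /* widen so the sum cannot wrap */
    return sum > UINT8_MAX ? UINT8_MAX : (uint8_t)sum;
}
```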

Member:

I agree saturation is useful. I'm asking if the first version needs this though. Not necessarily the first ship of SIMD, simply the first proposal.

I ask because by adding saturation here you're making the portable performance case that needs to be built harder, in my opinion.

Member:

Could you elaborate on what exactly makes portable performance harder here? I would argue that since all the saturate operations map directly to individual instructions on ARM, Intel, and MIPS ISAs, this is an easier performance win than some of the other operations.

Member:

You'll need to justify this SIMD proposal on the merits of one more kind of benchmark, across all those ISAs. SIMD can stand on its own with just a few FP and integer benchmarks, but if you want to throw in saturation, have at it. I'm just stating where I think the bar is.

| `f32x4.div(a: v128, b: v128, rmode: RoundingMode) -> v128` | [f32x4.div](portable-simd.md#division) |
| `f32x4.sqrt(a: v128, rmode: RoundingMode) -> v128` | [f32x4.sqrt](portable-simd.md#square-root) |
| `f32x4.reciprocalApproximation(a: v128) -> v128` | [f32x4.reciprocalApproximation](portable-simd.md#reciprocal-approximation) |
| `f32x4.reciprocalSqrtApproximation(a: v128) -> v128` | [f32x4.reciprocalSqrtApproximation](portable-simd.md#reciprocal-square-root-approximation) |
Member:

These would be the first implementation-dependent functions. I would like to discuss these details separately from this PR.

Contributor Author:

Filed #3.

* `TowardNegative`: Round towards negative infinity.
* `TowardZero`: Round towards zero.

## Default NaN value
Member:

To confirm: the intent here is to precisely match the latitude which WebAssembly provides w.r.t. NaNs, correct? I find it odd to repeat it here, with different wording.

Contributor Author:

Yes, that is correct.


Integer equality is independent of the signed/unsigned interpretation. Floating-point equality follows IEEE semantics, so a NaN lane compares not equal to anything, including itself:
Member:

If mentioning NaN, it's worth also mentioning -0.0 and 0.0.
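To make the scalar analogue concrete (illustrative only, not proposal text): under IEEE semantics NaN is unordered, while -0.0 and 0.0 compare equal despite having different bit patterns.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* NaN is unordered: x != x holds only for NaN. */
int is_nan_by_compare(double x) { return x != x; }

/* Bit pattern of a double, showing -0.0 and 0.0 differ in representation. */
uint64_t bits_of(double x) {
    uint64_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}
```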

This specification does not address bounds checking and trap handling for memory operations. It is assumed that the offsets `addr .. addr+15` are valid offsets into the buffer, and that computing `addr+15` does not overflow the `ByteOffset` type. Bounds checking should be handled by the embedding specification.
Member:

WebAssembly simply states: "Out of bounds accesses trap. If the access is a store, if any of the accessed bytes are out of bounds, none of the bytes are modified."

We should do the same here.

return S.lanewise_binary(max, a, b)
```

### NaN-suppressing minimum
Member:

I would rather introduce minNum and maxNum for scalar at the same time or earlier than for SIMD.

Contributor Author:

I agree, and you won't find these instructions in the table of proposed opcodes. I'll make a note that I omitted them.

Lane-wise conversion from floating point to integer using the IEEE
`convertToIntegerTowardZero` function. If any lane is a NaN or the rounded
integer value is outside the range of the destination type, return `fail = true`
and an unspecified `result`.
Member:

WebAssembly doesn't currently support multi-return values. This should trap to match scalars.

Also, rename to truncate so it matches scalar.

Contributor Author:

Agreed. I did just that in the Floating point conversions section of the overview document.

return result
```

## Integer arithmetic
Member:

This section is missing integer division.

Contributor Author:

That's an intentional omission. As it turns out, no ISAs provide vectorized integer division instructions.

Member:

Ah, apologies, I thought I'd seen it in one of the files in your PR and that it was missing here. This is good then!

Jakob Stoklund Olesen added 2 commits April 17, 2017 09:04
- Remove rounding modes from the portable specification and from the
  proposed WebAssembly opcodes.
- Add a few minor comments.
- Omit padding in Markdown table columns to avoid excessive diffs in the
  future.
To be debated in #3.

Add a complete list of omitted operations to Overview.md.
@rossberg (Member):

Just a quick question: @jfbastien recently suggested distinguishing an int and a float SIMD type. Do you guys still intend to explore that option in follow-ups?

@jfbastien (Member):

> Just a quick question: @jfbastien recently suggested distinguishing an int and a float SIMD type. Do you guys still intend to explore that option in follow-ups?

I'd be interested in seeing whether doing this makes a difference for producers or consumers. I'm not convinced it's the right thing! @pizlonator thinks it'll cause way too many coercions (bloating the binary) and won't be helpful to the consumer anyway, because the consumer already needs to reason about the SIMD types.

My thinking was that having a compiler like LLVM express things like lane crossing would mean browsers can be lazy about it and just trust the input blindly: no need to think about PD versus PS mov / bitwise ops, or shuffle type.

@stoklund (Contributor Author) commented Apr 19, 2017:

@rossberg-chromium, in my opinion we don't need separate types, but possibly a few extra instructions.

To give some background, Intel's SIMD instruction sets have multiple versions of a few instructions. For example, pxor, xorps, and xorpd are architecturally identical implementations of v128.xor, but current micro-architectures will issue pxor to the integer execution stack and the two others to the floating point stack. There has never been a micro-architectural difference between the float-flavored xorps and the double-flavored xorpd as far as I know.

There is a 1-cycle bypass delay when an instruction in the integer stack depends on a result computed in the floating point stack or vice versa. This additional latency only matters if the dependency is on the critical path. As soon as there's a few cycles between the instructions, operands are read from the register file and not from pipeline bypasses. Then there is no longer any difference.

Other ISAs don't make this distinction, it is an Intel-only thing.

If we want to let WebAssembly producers give hints to the code generator about these micro-architectural details, we could do it by providing float-flavored versions of the logical, load/store, and shuffle/swizzle operations. These new operations would be identical to the existing ones except for the hint they give the code generator. I don't think a new type is required.

I would prefer that we don't do this in the initial proposal, and only add the new instructions if they have demonstrable performance benefits. I am expecting improvements to real-world benchmarks to be in the noise.

Without the hints, a simple SSE code generator would apply a basic heuristic like looking at the instructions that computed the inputs and choose xorps if they are floating-point instructions, pxor otherwise. LLVM's algorithm is a bit more involved, but a little goes a long way.
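The basic heuristic described above could be sketched like this (a toy model with invented names, not LLVM's actual algorithm):

```c
/* Toy model of the SSE encoding choice for v128.xor described above:
   pick the float-stack encoding only when both operands were produced
   by floating-point instructions; otherwise use the integer-stack pxor.
   All names here are illustrative, not part of any real API. */
typedef enum { PROD_INT, PROD_FLOAT } ProducerKind;
typedef enum { ENC_PXOR, ENC_XORPS } XorEncoding;

XorEncoding choose_xor_encoding(ProducerKind a, ProducerKind b) {
    return (a == PROD_FLOAT && b == PROD_FLOAT) ? ENC_XORPS : ENC_PXOR;
}
```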

@stoklund (Contributor Author) commented Apr 19, 2017 via email.

@AndrewScheidecker (Contributor):

There are a few operators remaining that use mixed case to denote word boundaries (addSaturate, subSaturate, extractLanes, and replaceLanes). IMO those should use _ to match the rest of the wasm names.

Will globals of type v128 and b* be allowed? If so, those types should have a const operator that can be used as an initializer expression.

What is the natural alignment of the v128.load(1|2|3) operators?

Can we get vector versions of the float ceil, floor, trunc, nearest operators?

How about the integer rotl, rotr, clz, and ctz operators? Don't know how good the hardware support is for those, though.

@stoklund (Contributor Author):

Thanks, @AndrewScheidecker

I don't have strong opinions on the operator names. I translated names for the operations with existing WebAssembly counterparts and left the rest identical to the SIMD.js spec. I imagined we would have a bikeshedding fest after we've decided which operations go in. Simply changing to snake_case makes sense.

> Will globals of type v128 and b* be allowed? If so, those types should have a const operator that can be used as an initializer expression.

I don't think there's a reason to disallow SIMD globals, so we should include them for completeness. I could imagine a use for global constants at least.

> What is the natural alignment of the v128.load(1|2|3) operators?

I think 32 bits makes sense as the natural alignment for these operators.

> Can we get vector versions of the float ceil, floor, trunc, nearest operators?

I am not sure why they didn't make it into the SIMD.js spec. Let me check if they are universally available.

> How about the integer rotl, rotr, clz, and ctz operators? Don't know how good the hardware support is for those, though.

Rotates, probably. I'm not sure if clz and ctz have wide hardware support, and they could be expensive to emulate, particularly the 16-bit and 8-bit variants. I'll look into it.

@sunfishcode (Member) left a comment:

Looks good! Here are a few comments I had from an initial reading.


* `f32`: A floating-point number in the [IEEE][ieee754] *binary32* interchange
format.
* `f64`: A floating-point number in the [IEEE][ieee754] *binary64* interchange
Member:

IEEE 754 calls these "basic" formats, which indicates that they are suitable for both "interchange" and "arithmetic".

if unspecified_choice():
return canonicalize_nan(x)
else:
return canonicalize_nan(y)
Member:

The WebAssembly scalar spec has loosened the NaN bitpattern rules and no longer requires propagation. NaN propagation is only a "should" in IEEE 754, and architectures such as ARM in "default NaN" configuration and RISC-V don't implement it.

Contributor Author:

That's what you get for specifying things twice. I think the next step for this proposal is to roll up this portable spec and the wasm specifics into a single wasm-only document. Then I can simply refer to existing semantics instead of trying to write a stand-alone spec.

for j in range(S.Lanes):
result[j] = a[j]
result[i] = x
return result
Member:

I suggest an explicit mention that lane indices for insert and extract are dynamic operands, not required to be constant, somewhere.

Contributor Author:

My intention was for these indices to be immediate operands. Do you think we need dynamic lane indices for these operations? It's hard to generate good code for that.

Member:

I would rather have immediate operands, unless we can show performance advantages in dynamic ones.
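A scalar reference for the quoted replace-lane pseudocode, assuming the lane index is an immediate validated at decode time (the function name is illustrative):

```c
#include <stdint.h>

/* Scalar reference for i32x4 replace_lane: copy all lanes of a,
   then overwrite lane i with x. The index i models an immediate
   operand already validated to be in 0..3 at decode time. */
void replace_lane_i32x4(const int32_t a[4], int i, int32_t x, int32_t out[4]) {
    for (int j = 0; j < 4; j++)
        out[j] = a[j];
    out[i] = x;
}
```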

* `v32x4.select(s: b32x4, t: v128, f: v128) -> v128`
* `v64x2.select(s: b64x2, t: v128, f: v128) -> v128`

Use a boolean vector to select lanes from two numerical vectors.
Member:

I suggest mentioning that wasm's plain select instruction also supports vectors somewhere.

else:
result[i] = b[s[i] - S.lanes]
return result
```
Member:

In LLVM, shuffle indices are required to be constant. I think it'd be reasonable for wasm to have the same restriction and make them immediate fields rather than plain operands. Either way, I suggest mentioning the choice explicitly somewhere.

Contributor Author:

I agree. Lane indices are supposed to be immediate operands everywhere.
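A scalar reference for `v32x4.shuffle` with immediate indices, following the quoted pseudocode (the function name is illustrative):

```c
#include <stdint.h>

/* Scalar reference for v32x4.shuffle: each immediate index s[i] is in
   0..7, where values below 4 select a lane of a and the rest select
   a lane of b. */
void shuffle_i32x4(const int32_t a[4], const int32_t b[4],
                   const uint8_t s[4], int32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (s[i] < 4) ? a[s[i]] : b[s[i] - 4];
}
```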

```

Note that this function behaves differently than the IEEE 754 `minNum` function
when one of the operands is a signaling NaN.
Member:

FWIW, it's not finalized yet, but IEEE 754-2018 is likely to deprecate minNum and maxNum and introduce new functions which are essentially the semantics specified here.

Reply:

Seems it was agreed in SIMD.js to remove minNum and maxNum due to the compliance issue and a lack of use cases (tc39/ecmascript_simd#341). Does it make sense to remove them from portable_simd as well? The replacement from IEEE 754-2018 can be introduced to wasm first if it seems fit once it lands.

## `f32x4` operations
| WebAssembly | Portable SIMD |
|:------------|:--------------|
| `f32x4.build(x: f32[4]) -> v128` | [f32x4.build](portable-simd.md#build-vector-from-individual-lanes) |
Member:

I assume there should also be a v128.const instruction somewhere that takes a 128-bit immediate.

Then, it's worth asking whether these build opcodes are needed in wasm. LLVM doesn't have corresponding operators, for example, and they don't tend to map to single instructions in implementations.

Contributor Author:

We talked about this for asm.js too and ended up keeping them. I'll file an issue.

## `b16x8` operations
| WebAssembly | Portable SIMD |
|:------------|:--------------|
| `b16x8.build(x: i32[8]) -> b16x8` | [b16x8.build](portable-simd.md#build-vector-from-individual-lanes) |
Member:

Does a b16x8.const also make sense?

Contributor Author:

Yes, I think we should add const instructions for all the new types.

Reply:

Do bools have uses beyond holding and manipulating the results of logical and relational operations? Wondering whether we need support for building bools, since we intend to have bool consts.

| `v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128` | [v32x4.shuffle](portable-simd.md#shuffle-lanes) |
| `v32x4.load1(addr, offset) -> v128` | [v32x4.load1](portable-simd.md#partial-load) |
| `v32x4.load2(addr, offset) -> v128` | [v32x4.load2](portable-simd.md#partial-load) |
| `v32x4.load3(addr, offset) -> v128` | [v32x4.load3](portable-simd.md#partial-load) |
Member:

Partial loads and stores were removed from asm.js since the non-power-of-2 size is awkward in some implementations.

AndrewScheidecker added a commit to WebAssembly/wasm-jit-prototype that referenced this pull request Apr 22, 2017
It's enabled by default, but if you wish to disable it you can with the CMake ENABLE_SIMD_PROTOTYPE option.
This also includes an implementation of the Blake2b hash function with the proposed SIMD operators in Test/Blake2b/blake2b.wast
@AndrewScheidecker (Contributor):

FYI I have a prototype for this proposal in WAVM, including a port of the Blake2b hash function.

Fold the portable-simd.md specification and the WebAssembly mapping into
a single SIMD.md document that is specific to WebAssembly. This is
easier to read and avoids a lot of confusion.

- Don't attempt to specify floating-point semantics again. Just refer to
  existing WebAssembly behavior.
- Remove the minNum and maxNum operations which were never intended to
  be included in WebAssembly.
- Clarify the trapping behavior of the float-to-int conversions.
- Remove the descriptions of reciprocal [sqrt] approximations. See #3.
- Rename all operations to use snake_case.
- Add *.const instructions for all the new types.
- Remove the partial load and store operators as suggested by
  @sunfishcode. They were removed from asm.js too.
@stoklund (Contributor Author):

@jfbastien, would you object to my merging this PR in its current state? I think I've addressed everything brought up in reviews either by changing the text or by filing issues for individual topics that need more discussion.

@jfbastien (Member) commented Apr 25, 2017:

Yeah, I think as-is you've got a good starting point, received good feedback, and shown definite interest from multiple folks. Go for it.

7 participants