This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Initial 128-bit SIMD proposal #1

Merged: 5 commits, Apr 25, 2017

Conversation

Opened by stoklund (Contributor).

@jfbastien (Member) left a comment:

Overall I like the way this is presented, and the opcodes and types seem like what I would expect.

There are things that I'd drop for now as noted in the comments.

I'd like to reiterate that after we agree on the general form, we expect significant performance work to show that the proposed primitives can:

  • reach performance gains close to native
  • in real world code, both integer and floating-point
  • on a set of "tier 1" ISAs, without one suffering while another wins

I don't think performance work should be required for this PR itself, as that would go against the purpose of having this repo. I do want to state our expectation up-front to avoid surprises in the future.


Referencing a common specification of the portable SIMD semantics reduces the
work required to support both SIMD.js and WebAssembly SIMD in the same
implementation.
Member:

Is that a goal which any implementation cares about? I don't think WebAssembly should constrain itself in that direction.

Reply:

I agree that this is not too relevant to be included as a goal.

out-of-range lane index is a validation error.

* `RoundingMode`: Rounding modes are encoded as `varuint7` immediate operands.
An out-of-range rounding mode is a validation error.
Member:

Is this something that we need in a first version? I get that it's useful, but scalar operations don't support rounding mode.

IIRC SIMD.js didn't either. Why was it not added then, and what makes it more desirable now?

Contributor Author:

I don't think we need rounding mode support in the first version. I was anticipating their inclusion in the scalar spec.

We could represent this as an immediate field that must be RoundTiesToEven in the first version, or we can add duplicate opcodes with rounding mode support in the future. It's only 14 opcodes that take a rounding mode, so I guess duplicating them is not too bad.

Member:

I would suggest starting this PR without rounding at all, mimicking scalar, and filing a separate issue proposing how to encode it for both SIMD and scalar.

| `i8x16.addSaturate_s(a: v128, b: v128) -> v128` | [s8x16.addSaturate](portable-simd.md#saturating-integer-addition) |
| `i8x16.addSaturate_u(a: v128, b: v128) -> v128` | [u8x16.addSaturate](portable-simd.md#saturating-integer-addition) |
| `i8x16.subSaturate_s(a: v128, b: v128) -> v128` | [s8x16.subSaturate](portable-simd.md#saturating-integer-subtraction) |
| `i8x16.subSaturate_u(a: v128, b: v128) -> v128` | [u8x16.subSaturate](portable-simd.md#saturating-integer-subtraction) |
Member:

Do we need saturate operations in the initial version?

Contributor Author:

These are very useful operations both for fixed-point DSP and image processing. They all exist as individual SIMD instructions in ARM, Intel, and MIPS ISAs.

Scalar versions of these instructions are not commonly available in ISAs. In scalar code you're usually computing in 32-bit registers anyway, so saturating results to 8 or 16 bits is easier. In SIMD code you only get 8 or 16 bits per lane, so these instructions matter more there.

LLVM's auto-vectorizer can pattern-match these operations in scalar code and generate the vector instructions.
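As an illustration of that last point (a sketch, not part of the proposal): the widen-add-clamp idiom below is the kind of scalar code a vectorizer can collapse into a single saturating SIMD instruction.

```c
#include <stdint.h>

/* Scalar unsigned saturating add on 8-bit values: widen, add, clamp.
   An auto-vectorizer can recognize this pattern and emit one SIMD
   instruction covering 16 lanes (i.e. i8x16.addSaturate_u). */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    uint16_t sum = (uint16_t)a + (uint16_t)b; /* widen so the sum cannot wrap */
    return sum > UINT8_MAX ? UINT8_MAX : (uint8_t)sum;
}
```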

Member:

I agree saturation is useful. I'm asking if the first version needs this though. Not necessarily the first ship of SIMD, simply the first proposal.

I ask because by adding saturation here you're making the portable performance case that needs to be built harder, in my opinion.

Member:

Could you elaborate on what exactly makes portable performance harder here? I would argue that since all the saturate operations map directly to individual instructions on ARM, Intel, and MIPS ISAs, this is an easier performance win than some of the other operations.

Member:

You'll need to justify this SIMD proposal on the merits of one more kind of benchmark, across all those ISAs. SIMD can stand on its own with just a few FP and integer benchmarks, but if you want to throw in saturation, have at it. I'm just stating where I think the bar is.

| `f32x4.div(a: v128, b: v128, rmode: RoundingMode) -> v128` | [f32x4.div](portable-simd.md#division) |
| `f32x4.sqrt(a: v128, rmode: RoundingMode) -> v128` | [f32x4.sqrt](portable-simd.md#square-root) |
| `f32x4.reciprocalApproximation(a: v128) -> v128` | [f32x4.reciprocalApproximation](portable-simd.md#reciprocal-approximation) |
| `f32x4.reciprocalSqrtApproximation(a: v128) -> v128` | [f32x4.reciprocalSqrtApproximation](portable-simd.md#reciprocal-square-root-approximation) |
Member:

These would be the first implementation-dependent functions. I would like to discuss these details separately from this PR.

Contributor Author:

Filed #3.

* `TowardNegative`: Round towards negative infinity.
* `TowardZero`: Round towards zero.

## Default NaN value
Member:

To confirm: the intent here is to precisely match the latitude which WebAssembly provides w.r.t. NaNs, correct? I find it odd to repeat it here, with different wording.

Contributor Author:

Yes, that is correct.


Integer equality is independent of the signed/unsigned interpretation. Floating-point equality follows IEEE semantics, so a NaN lane compares not equal to anything, including itself:
Member:

If mentioning NaN, it's worth also mentioning -0.0 and 0.0.
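To make the scalar analogue concrete (illustrative only, not proposal text): under IEEE semantics NaN is unordered, while -0.0 and 0.0 compare equal despite having different bit patterns.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* NaN is unordered: x != x holds only for NaN. */
int is_nan_by_compare(double x) { return x != x; }

/* Bit pattern of a double, showing -0.0 and 0.0 differ in representation. */
uint64_t bits_of(double x) {
    uint64_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}
```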

This specification does not address bounds checking and trap handling for memory operations. It is assumed that the offsets `addr .. addr+15` are valid offsets into the buffer, and that computing `addr+15` does not overflow the `ByteOffset` type. Bounds checking should be handled by the embedding specification.
Member:

WebAssembly simply states: "Out of bounds accesses trap. If the access is a store, if any of the accessed bytes are out of bounds, none of the bytes are modified."

We should do the same here.

return S.lanewise_binary(max, a, b)
```

### NaN-suppressing minimum
Member:

I would rather introduce minNum and maxNum for scalar at the same time or earlier than for SIMD.

Contributor Author:

I agree, and you won't find these instructions in the table of proposed opcodes. I'll make a note that I omitted them.

Lane-wise conversion from floating point to integer using the IEEE
`convertToIntegerTowardZero` function. If any lane is a NaN or the rounded
integer value is outside the range of the destination type, return `fail = true`
and an unspecified `result`.
Member:

WebAssembly doesn't currently support multi-return values. This should trap to match scalars.

Also, rename to truncate so it matches scalar.

Contributor Author:

Agreed. I did just that in the Floating point conversions section of the overview document.

return result
```

## Integer arithmetic
Member:

This section is missing integer division.

Contributor Author:

That's an intentional omission. As it turns out, no ISAs provide vectorized integer division instructions.

Member:

Ah, apologies, I thought I'd seen it in one of the files in your PR and that it was missing here. This is good then!

Jakob Stoklund Olesen added 2 commits April 17, 2017 09:04
- Remove rounding modes from the portable specification and from the
  proposed WebAssembly opcodes.
- Add a few minor comments.
- Omit padding in Markdown table columns to avoid excessive diffs in the
  future.
To be debated in #3.

Add a complete list of omitted operations to Overview.md.
@rossberg (Member):

Just a quick question: @jfbastien recently suggested distinguishing an int and a float SIMD type. Do you guys still intend to explore that option in follow-ups?

@jfbastien (Member):

> Just a quick question: @jfbastien recently suggested distinguishing an int and a float SIMD type. Do you guys still intend to explore that option in follow-ups?

I'd be interested in seeing whether doing this makes a difference for producers or consumers. I'm not convinced it's the right thing! @pizlonator thinks it'll cause way too many coercions (bloating the binary) and won't be helpful to the consumer anyway, because the consumer already needs to reason about the SIMD types.

My thinking was that having a compiler like LLVM express things like lane crossing would mean browsers can be lazy about it and just trust the input blindly: no need to think about PD versus PS mov / bitwise ops, or shuffle type.

@stoklund (Contributor Author) commented Apr 19, 2017:

@rossberg-chromium, in my opinion we don't need separate types, but possibly a few extra instructions.

To give some background, Intel's SIMD instruction sets have multiple versions of a few instructions. For example, pxor, xorps, and xorpd are architecturally identical implementations of v128.xor, but current micro-architectures will issue pxor to the integer execution stack and the two others to the floating point stack. There has never been a micro-architectural difference between the float-flavored xorps and the double-flavored xorpd as far as I know.

There is a 1-cycle bypass delay when an instruction in the integer stack depends on a result computed in the floating point stack or vice versa. This additional latency only matters if the dependency is on the critical path. As soon as there's a few cycles between the instructions, operands are read from the register file and not from pipeline bypasses. Then there is no longer any difference.

Other ISAs don't make this distinction, it is an Intel-only thing.

If we want to let WebAssembly producers give hints to the code generator about these micro-architectural details, we could do it by providing float-flavored versions of the logical, load/store, and shuffle/swizzle operations. These new operations would be identical to the existing ones except for the hint they give the code generator. I don't think a new type is required.

I would prefer that we don't do this in the initial proposal, and only add the new instructions if they have demonstrable performance benefits. I am expecting improvements to real-world benchmarks to be in the noise.

Without the hints, a simple SSE code generator would apply a basic heuristic like looking at the instructions that computed the inputs and choose xorps if they are floating-point instructions, pxor otherwise. LLVM's algorithm is a bit more involved, but a little goes a long way.
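The basic heuristic described above could be sketched like this (a toy model with invented names, not LLVM's actual algorithm):

```c
/* Toy model of the SSE encoding choice for v128.xor described above:
   pick the float-stack encoding only when both operands were produced
   by floating-point instructions; otherwise use the integer-stack pxor.
   All names here are illustrative, not part of any real API. */
typedef enum { PROD_INT, PROD_FLOAT } ProducerKind;
typedef enum { ENC_PXOR, ENC_XORPS } XorEncoding;

XorEncoding choose_xor_encoding(ProducerKind a, ProducerKind b) {
    return (a == PROD_FLOAT && b == PROD_FLOAT) ? ENC_XORPS : ENC_PXOR;
}
```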

@stoklund (Contributor Author) commented Apr 19, 2017 via email.

@AndrewScheidecker (Contributor):

There are a few operators remaining that use mixed case to denote word boundaries (addSaturate, subSaturate, extractLanes, and replaceLanes). IMO those should use _ to match the rest of the wasm names.

Will globals of type v128 and b* be allowed? If so, those types should have a const operator that can be used as an initializer expression.

What is the natural alignment of the v128.load(1|2|3) operators?

Can we get vector versions of the float ceil, floor, trunc, nearest operators?

How about the integer rotl, rotr, clz, and ctz operators? Don't know how good the hardware support is for those, though.

@stoklund (Contributor Author):

Thanks, @AndrewScheidecker

I don't have strong opinions on the operator names. I translated names for the operations with existing WebAssembly counterparts and left the rest identical to the SIMD.js spec. I imagined we would have a bikeshedding fest after we've decided which operations go in. Simply changing to snake_case makes sense.

> Will globals of type v128 and b* be allowed? If so, those types should have a const operator that can be used as an initializer expression.

I don't think there's a reason to disallow SIMD globals, so we should include them for completeness. I could imagine a use for global constants at least.

> What is the natural alignment of the v128.load(1|2|3) operators?

I think 32 bits makes sense as the natural alignment for these operators.

> Can we get vector versions of the float ceil, floor, trunc, nearest operators?

I am not sure why they didn't make it into the SIMD.js spec. Let me check if they are universally available.

> How about the integer rotl, rotr, clz, and ctz operators? Don't know how good the hardware support is for those, though.

Rotates, probably. I'm not sure if clz and ctz have wide hardware support, and they could be expensive to emulate, particularly the 16-bit and 8-bit variants. I'll look into it.

@sunfishcode (Member) left a comment:

Looks good! Here are a few comments I had from an initial reading.


* `f32`: A floating-point number in the [IEEE][ieee754] *binary32* interchange
format.
* `f64`: A floating-point number in the [IEEE][ieee754] *binary64* interchange
Member:

IEEE 754 calls these "basic" formats, which indicates that they are suitable for both "interchange" and "arithmetic".

if unspecified_choice():
return canonicalize_nan(x)
else:
return canonicalize_nan(y)
Member:

The WebAssembly scalar spec has loosened the NaN bitpattern rules and no longer requires propagation. NaN propagation is only a "should" in IEEE 754, and architectures such as ARM in "default NaN" configuration and RISC-V don't implement it.

Contributor Author:

That's what you get for specifying things twice. I think the next step for this proposal is to roll up this portable spec and the wasm specifics into a single wasm-only document. Then I can simply refer to existing semantics instead of trying to write a stand-alone spec.

for j in range(S.Lanes):
result[j] = a[j]
result[i] = x
return result
Member:

I suggest an explicit mention that lane indices for insert and extract are dynamic operands, not required to be constant, somewhere.

Contributor Author:

My intention was for these indices to be immediate operands. Do you think we need dynamic lane indices for these operations? It's hard to generate good code for that.

Member:

I would rather have immediate operands, unless we can show performance advantages in dynamic ones.
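A scalar reference for the quoted replace-lane pseudocode, assuming the lane index is an immediate validated at decode time (the function name is illustrative):

```c
#include <stdint.h>

/* Scalar reference for i32x4 replace_lane: copy all lanes of a,
   then overwrite lane i with x. The index i models an immediate
   operand already validated to be in 0..3 at decode time. */
void replace_lane_i32x4(const int32_t a[4], int i, int32_t x, int32_t out[4]) {
    for (int j = 0; j < 4; j++)
        out[j] = a[j];
    out[i] = x;
}
```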

* `v32x4.select(s: b32x4, t: v128, f: v128) -> v128`
* `v64x2.select(s: b64x2, t: v128, f: v128) -> v128`

Use a boolean vector to select lanes from two numerical vectors.
Member:

I suggest mentioning that wasm's plain select instruction also supports vectors somewhere.

else:
result[i] = b[s[i] - S.lanes]
return result
```
Member:

In LLVM, shuffle indices are required to be constant. I think it'd be reasonable for wasm to have the same restriction and make them immediate fields rather than plain operands. Either way, I suggest mentioning the choice explicitly somewhere.

Contributor Author:

I agree. Lane indices are supposed to be immediate operands everywhere.
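A scalar reference for `v32x4.shuffle` with immediate indices, following the quoted pseudocode (the function name is illustrative):

```c
#include <stdint.h>

/* Scalar reference for v32x4.shuffle: each immediate index s[i] is in
   0..7, where values below 4 select a lane of a and the rest select
   a lane of b. */
void shuffle_i32x4(const int32_t a[4], const int32_t b[4],
                   const uint8_t s[4], int32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (s[i] < 4) ? a[s[i]] : b[s[i] - 4];
}
```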

```

Note that this function behaves differently than the IEEE 754 `minNum` function
when one of the operands is a signaling NaN.
Member:

FWIW, it's not finalized yet, but IEEE 754-2018 is likely to deprecate minNum and maxNum and introduce new functions which are essentially the semantics specified here.

Reply:

Seems it was agreed in SIMD.js to remove minNum and maxNum due to the compliance issue and a lack of use cases (tc39/ecmascript_simd#341). Does it make sense to remove them from portable_simd as well? The replacement from IEEE 754-2018 can be introduced to wasm first if it seems fit once it lands.

## `f32x4` operations
| WebAssembly | Portable SIMD |
|:------------|:--------------|
| `f32x4.build(x: f32[4]) -> v128` | [f32x4.build](portable-simd.md#build-vector-from-individual-lanes) |
Member:

I assume there should also be a v128.const instruction somewhere that takes a 128-bit immediate.

Then, it's worth asking whether these build opcodes are needed in wasm. LLVM doesn't have corresponding operators, for example, and they don't tend to map to single instructions in implementations.

Contributor Author:

We talked about this for asm.js too and ended up keeping them. I'll file an issue.

## `b16x8` operations
| WebAssembly | Portable SIMD |
|:------------|:--------------|
| `b16x8.build(x: i32[8]) -> b16x8` | [b16x8.build](portable-simd.md#build-vector-from-individual-lanes) |
Member:

Does a b16x8.const also make sense?

Contributor Author:

Yes, I think we should add const instructions for all the new types.

Reply:

Do bools have uses beyond holding and manipulating the results of logical and relational operations? Wondering whether we need support for building bools, since we intend to have bool consts.

| `v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128` | [v32x4.shuffle](portable-simd.md#shuffle-lanes) |
| `v32x4.load1(addr, offset) -> v128` | [v32x4.load1](portable-simd.md#partial-load) |
| `v32x4.load2(addr, offset) -> v128` | [v32x4.load2](portable-simd.md#partial-load) |
| `v32x4.load3(addr, offset) -> v128` | [v32x4.load3](portable-simd.md#partial-load) |
Member:

Partial loads and stores were removed from asm.js since the non-power-of-2 size is awkward in some implementations.

AndrewScheidecker added a commit to WebAssembly/wasm-jit-prototype that referenced this pull request Apr 22, 2017
It's enabled by default, but if you wish to disable it you can with the CMake ENABLE_SIMD_PROTOTYPE option.
This also includes an implementation of the Blake2b hash function with the proposed SIMD operators in Test/Blake2b/blake2b.wast
@AndrewScheidecker (Contributor):

FYI I have a prototype for this proposal in WAVM, including a port of the Blake2b hash function.

Fold the portable-simd.md specification and the WebAssembly mapping into
a single SIMD.md document that is specific to WebAssembly. This is
easier to read and avoids a lot of confusion.

- Don't attempt to specify floating-point semantics again. Just refer to
  existing WebAssembly behavior.
- Remove the minNum and maxNum operations which were never intended to
  be included in WebAssembly.
- Clarify the trapping behavior of the float-to-int conversions.
- Remove the descriptions of reciprocal [sqrt] approximations. See #3.
- Rename all operations to use snake_case.
- Add *.const instructions for all the new types.
- Remove the partial load and store operators as suggested by
  @sunfishcode. They were removed from asm.js too.
@stoklund (Contributor Author):

@jfbastien, would you object to my merging this PR in its current state? I think I've addressed everything brought up in reviews either by changing the text or by filing issues for individual topics that need more discussion.

@jfbastien (Member) commented Apr 25, 2017:

Yeah, I think as-is you've got a good starting point, received good feedback, and shown definite interest from multiple folks. Go for it.

7 participants