Overall I like the way this is presented, and the opcodes and types seem like what I would expect.
There are things that I'd drop for now as noted in the comments.
I'd like to reiterate that after we agree on general form we expect significant performance work to show that the proposed primitives can:
- reach performance gains close to native
- in real-world code, both integer and floating-point
- on a set of "tier 1" ISAs, without one suffering while another wins
I don't think performance work is required for this PR as it would go against the purpose of having this repo. I do want to state our expectation up-front to avoid surprises in the future.
proposals/simd/Overview.md
Outdated
Referencing a common specification of the portable SIMD semantics reduces the work required to support both SIMD.js and WebAssembly SIMD in the same implementation.
Is that a goal which any implementation cares about? I don't think WebAssembly should constrain itself in that direction.
I agree that this is not too relevant to be included as a goal.
proposals/simd/Overview.md
Outdated
out-of-range lane index is a validation error.

* `RoundingMode`: Rounding modes are encoded as `varuint7` immediate operands. An out-of-range rounding mode is a validation error.
Is this something that we need in a first version? I get that it's useful, but scalar operations don't support rounding mode.
IIRC SIMD.js didn't either. Why was it not added then, and what makes it more desirable now?
I don't think we need rounding mode support in the first version. I was anticipating that rounding modes would be included in the scalar spec.
We could represent this as an immediate field that must be `RoundTiesToEven` in the first version, or we can add duplicate opcodes with rounding mode support in the future. It's only 14 opcodes that take a rounding mode, so I guess duplicating them is not too bad.
I would suggest starting this PR without rounding at all, mimicking scalar, and filing a separate issue proposing how to encode it for both SIMD and scalar.
| `i8x16.addSaturate_s(a: v128, b: v128) -> v128` | [s8x16.addSaturate](portable-simd.md#saturating-integer-addition) |
| `i8x16.addSaturate_u(a: v128, b: v128) -> v128` | [u8x16.addSaturate](portable-simd.md#saturating-integer-addition) |
| `i8x16.subSaturate_s(a: v128, b: v128) -> v128` | [s8x16.subSaturate](portable-simd.md#saturating-integer-subtraction) |
| `i8x16.subSaturate_u(a: v128, b: v128) -> v128` | [u8x16.subSaturate](portable-simd.md#saturating-integer-subtraction) |
Do we need saturate operations in the initial version?
These are very useful operations both for fixed-point DSP and image processing. They all exist as individual SIMD instructions in ARM, Intel, and MIPS ISAs.
ISAs do not commonly provide scalar versions of these instructions. In scalar code you're usually computing with 32-bit registers anyway, so saturating results to 8 or 16 bits is easier. In SIMD code, you only get 8 or 16 bits per lane, so these instructions are more important there.
LLVM's auto-vectorizer can pattern-match these operations in scalar code and generate the vector instructions.
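For readers skimming the thread, here is a minimal lane-wise sketch of what `i8x16.addSaturate_s` computes, written in the same Python-style pseudocode the portable spec uses (the helper name is illustrative, not from the spec):

```python
def add_saturate_s8x16(a, b):
    # Lane-wise signed saturating add: each 8-bit result is clamped to
    # [-128, 127] instead of wrapping around.
    return [max(-128, min(127, a[i] + b[i])) for i in range(16)]

# 100 + 100 saturates to 127 instead of wrapping to -56.
print(add_saturate_s8x16([100] * 16, [100] * 16)[0])  # 127
```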
I agree saturation is useful. I'm asking if the first version needs this though. Not necessarily the first ship of SIMD, simply the first proposal.
I ask because, in my opinion, adding saturation here makes the portable performance case that needs to be built harder.
Could you elaborate on what exactly makes portable performance harder here? I would argue that since all the saturating operations map directly to individual instructions on Arm, Intel, and MIPS ISAs, this is an easier performance win than some of the other operations.
You'll need to justify this SIMD proposal on the merits of one more kind of benchmark, across all those ISAs. SIMD can stand on its own with just a few FP and int benchmarks, but if you want to throw in saturation, have at it. I'm just stating where I think the bar is.
| `f32x4.div(a: v128, b: v128, rmode: RoundingMode) -> v128` | [f32x4.div](portable-simd.md#division) |
| `f32x4.sqrt(a: v128, rmode: RoundingMode) -> v128` | [f32x4.sqrt](portable-simd.md#square-root) |
| `f32x4.reciprocalApproximation(a: v128) -> v128` | [f32x4.reciprocalApproximation](portable-simd.md#reciprocal-approximation) |
| `f32x4.reciprocalSqrtApproximation(a: v128) -> v128` | [f32x4.reciprocalSqrtApproximation](portable-simd.md#reciprocal-square-root-approximation) |
These would be the first implementation-dependent functions. I would like to discuss these details separately from this PR.
Filed #3.
proposals/simd/portable-simd.md
Outdated
* `TowardNegative`: Round towards negative infinity.
* `TowardZero`: Round towards zero.

## Default NaN value
To confirm: the intent here is to precisely match the latitude which WebAssembly provides w.r.t. NaNs, correct? I find it odd to repeat it here, with different wording.
Yes, that is correct.
proposals/simd/portable-simd.md
Outdated
Integer equality is independent of the signed/unsigned interpretation. Floating point equality follows IEEE semantics, so a NaN lane compares not equal with anything, including itself:
If mentioning NaN, it's worth also mentioning `-0.0` and `0.0`.
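To illustrate why both cases are worth spelling out (an informal sketch, not spec text): IEEE equality treats `-0.0` and `0.0` as equal, while a NaN lane is unequal even to itself, so a lane-wise `f32x4.eq` behaves like this:

```python
def eq_f32x4(a, b):
    # Lane-wise IEEE equality; each lane produces a boolean.
    return [x == y for x, y in zip(a, b)]

nan = float("nan")
print(eq_f32x4([-0.0, nan, 1.0, 2.0],
               [ 0.0, nan, 1.0, 3.0]))  # [True, False, True, False]
```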
proposals/simd/portable-simd.md
Outdated
This specification does not address bounds checking and trap handling for memory operations. It is assumed that the range `addr .. addr+15` are valid offsets in the buffer, and that computing `addr+15` does not overflow the `ByteOffset` type. Bounds checking should be handled by the embedding specification.
WebAssembly simply states: "Out of bounds accesses trap. If the access is a store, if any of the accessed bytes are out of bounds, none of the bytes are modified."
We should do the same here.
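A minimal sketch of that rule for a 16-byte `v128` load, assuming a flat byte array as linear memory (the helper name is made up for illustration):

```python
def load_v128(memory, addr):
    # Trap if any of the 16 accessed bytes is out of bounds; memory is
    # never partially read or modified.
    if addr < 0 or addr + 16 > len(memory):
        raise RuntimeError("trap: out-of-bounds memory access")
    return bytes(memory[addr:addr + 16])

mem = bytearray(64)
print(load_v128(mem, 48))   # ok: bytes 48..63
# load_v128(mem, 49) would trap: byte 64 is out of bounds.
```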
proposals/simd/portable-simd.md
Outdated
```
return S.lanewise_binary(max, a, b)
```

### NaN-suppressing minimum
I would rather introduce `minNum` and `maxNum` for scalar at the same time or earlier than for SIMD.
I agree, and you won't find these instructions in the table of proposed opcodes. I'll make a note that I omitted them.
proposals/simd/portable-simd.md
Outdated
Lane-wise conversion from floating point to integer using the IEEE `convertToIntegerTowardZero` function. If any lane is a NaN or the rounded integer value is outside the range of the destination type, return `fail = true` and an unspecified `result`.
WebAssembly doesn't currently support multi-return values. This should trap to match scalars.
Also, rename to truncate so it matches scalar.
Agreed. I did just that in the Floating point conversions section of the overview document.
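As a rough illustration of the trapping behavior being agreed on here, a sketch of a signed `f32x4`-to-`i32x4` truncation (assumed semantics and a made-up helper name, not the spec's wording):

```python
import math

def trunc_s_i32x4(a):
    # Lane-wise truncation toward zero; traps on NaN or out-of-range
    # values, mirroring scalar i32.trunc_s behavior.
    result = []
    for x in a:
        if math.isnan(x):
            raise RuntimeError("trap: invalid conversion to integer")
        t = int(math.trunc(x))
        if t < -2**31 or t > 2**31 - 1:
            raise RuntimeError("trap: integer overflow in conversion")
        result.append(t)
    return result

print(trunc_s_i32x4([1.9, -2.5, 0.0, 100.75]))  # [1, -2, 0, 100]
```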
proposals/simd/portable-simd.md
Outdated
```
return result
```

## Integer arithmetic
This section is missing integer division.
That's an intentional omission. As it turns out, no ISAs provide vectorized integer division instructions.
Ah, apologies, I thought I'd seen it in one of the files in your PR, and it was missing here. This is good then!
- Remove rounding modes from the portable specification and from the proposed WebAssembly opcodes.
- Add a few minor comments.
- Omit padding in Markdown table columns to avoid excessive diffs in the future.
To be debated in #3. Add a complete list of omitted operations to Overview.md.
Just a quick question: @jfbastien recently suggested distinguishing an int and a float SIMD type. Do you guys still intend to explore that option in follow-ups?
I'd be interested in seeing if it makes a difference for producers or consumers to do this. I'm not convinced it's the right thing! @pizlonator thinks it'll cause way too many coercions (bloating the binary) and won't be helpful to the consumer anyways because the consumer already needs to reason about the SIMD types. My thinking was that having a compiler like LLVM express things like lane crossing would mean browsers can be lazy about it and just trust the input. No need to think about PD versus PS mov / bitwise ops, or shuffle type, just trust the input blindly.
@rossberg-chromium, in my opinion we don't need separate types, but possibly a few extra instructions.

To give some background, Intel's SIMD instruction sets have multiple versions of a few instructions, and there is a 1-cycle bypass delay when an instruction in the integer stack depends on a result computed in the floating point stack or vice versa. This additional latency only matters if the dependency is on the critical path. As soon as there are a few cycles between the instructions, operands are read from the register file and not from pipeline bypasses, and then there is no longer any difference. Other ISAs don't make this distinction; it is an Intel-only thing.

If we want to let WebAssembly producers give hints for the code generator's fiddling with these micro-architectural details, we could do it by providing float-flavored versions of the logical, load/store, and shuffle/swizzle operations. These new operations would be identical to the existing ones except for giving this hint to the code generator. I don't think a new type is required.

I would prefer that we don't do this in the initial proposal, and only add the new instructions if they have demonstrable performance benefits. I am expecting improvements to real-world benchmarks to be in the noise. Without the hints, a simple SSE code generator would apply a basic heuristic, like looking at the instructions that computed the inputs, and choose accordingly.
> My thinking was that having a compiler like LLVM express things like lane crossing would mean browsers can be lazy about it and just trust the input. No need to think about PD versus PS mov / bitwise ops, or shuffle type, just trust the input blindly.
Annoyingly, the Intel shuffle instructions don’t have the simple symmetry of the logical and load/store variants. For example, the integer-flavored `pshufd` implements `v32x4.swizzle` but doesn’t have a floating-point counterpart. The float-flavored `shufps` implements some of the `v32x4.shuffle` patterns but doesn’t have an integer counterpart.
The specific SSE version you’re targeting can also affect the results. With SSE 4.1 support, `i32x4.mul` becomes a `pmulld` (int-like) instruction, but on earlier SSE versions it expands into an instruction sequence that ends in `shufps` (float-like), at least in the current Firefox implementation.
There are a few operators remaining that use mixed case to denote word boundaries (…).

Will globals of type … be allowed?

What is the natural alignment of the … ?

Can we get vector versions of the float … operations?

How about the integer … operations?
Thanks, @AndrewScheidecker. I don't have strong opinions on the operator names. I translated names for the operations with existing WebAssembly counterparts and left the rest identical to the SIMD.js spec. I imagined we would have a bikeshedding fest after we've decided which operations go in. Simply changing to snake_case makes sense.
I don't think there's a reason to disallow SIMD globals, so we should include them for completeness. I could imagine a use for global constants at least.
I think 32 bits makes sense as the natural alignment for these operators.
I am not sure why they didn't make it into the SIMD.js spec. Let me check if they are universally available.
Rotates, probably. I'm not sure if …
Looks good! Here are a few comments I had from an initial reading.
proposals/simd/portable-simd.md
Outdated
* `f32`: A floating-point number in the [IEEE][ieee754] *binary32* interchange format.
* `f64`: A floating-point number in the [IEEE][ieee754] *binary64* interchange format.
IEEE 754 calls these "basic" formats, which indicates that they are suitable for both "interchange" and "arithmetic".
proposals/simd/portable-simd.md
Outdated
```
if unspecified_choice():
    return canonicalize_nan(x)
else:
    return canonicalize_nan(y)
```
The WebAssembly scalar spec has loosened the NaN bitpattern rules and no longer requires propagation. NaN propagation is only a "should" in IEEE 754, and architectures such as ARM in "default NaN" configuration and RISC-V don't implement it.
That's what you get for specifying things twice. I think the next step for this proposal is to roll up this portable spec and the wasm specifics into a single wasm-only document. Then I can simply refer to existing semantics instead of trying to do a stand-alone spec.
proposals/simd/portable-simd.md
Outdated
```
for j in range(S.Lanes):
    result[j] = a[j]
result[i] = x
return result
```
I suggest an explicit mention that lane indices for insert and extract are dynamic operands, not required to be constant, somewhere.
My intention was for these indices to be immediate operands. Do you think we need dynamic lane indices for these operations? It's hard to generate good code for that.
I would rather have immediate operands, unless we can show performance advantages in dynamic ones.
proposals/simd/portable-simd.md
Outdated
* `v32x4.select(s: b32x4, t: v128, f: v128) -> v128`
* `v64x2.select(s: b64x2, t: v128, f: v128) -> v128`

Use a boolean vector to select lanes from two numerical vectors.
I suggest mentioning somewhere that wasm's plain `select` instruction also supports vectors.
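For reference, a lane-wise sketch of the select semantics under discussion (illustrative pseudocode only):

```python
def select_v32x4(s, t, f):
    # For each lane, take the lane from t when the boolean lane is true,
    # otherwise take it from f.
    return [t[i] if s[i] else f[i] for i in range(4)]

print(select_v32x4([True, False, True, False],
                   [1, 2, 3, 4],
                   [10, 20, 30, 40]))  # [1, 20, 3, 40]
```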
proposals/simd/portable-simd.md
Outdated
```
else:
    result[i] = b[s[i] - S.lanes]
return result
```
In LLVM, shuffle indices are required to be constant. I think it'd be reasonable for wasm to have the same restriction and make them immediate fields rather than plain operands. Either way, I suggest mentioning the choice explicitly somewhere.
I agree. Lane indices are supposed to be immediate operands everywhere.
proposals/simd/portable-simd.md
Outdated
Note that this function behaves differently than the IEEE 754 `minNum` function when one of the operands is a signaling NaN.
FWIW, it's not finalized yet, but IEEE 754-2018 is likely to deprecate minNum and maxNum and introduce new functions with essentially the semantics specified here.
It seems it was agreed in SIMD.js to remove minNum and maxNum due to the compliance issue and not enough use cases: tc39/ecmascript_simd#341
Does it make sense to remove them from portable_simd as well?
The replacement from IEEE 754-2018 could then be introduced to wasm first if it seems appropriate when it lands.
## `f32x4` operations

| WebAssembly | Portable SIMD |
|:------------|:--------------|
| `f32x4.build(x: f32[4]) -> v128` | [f32x4.build](portable-simd.md#build-vector-from-individual-lanes) |
I assume there should also be a `v128.const` instruction somewhere which takes a 128-bit immediate.
Then, it's worth asking whether these `build` opcodes are needed in wasm. LLVM doesn't have corresponding operators, for example, and they don't tend to map to single instructions in implementations.
We talked about this for asm.js too and ended up keeping them. I'll file an issue.
## `b16x8` operations

| WebAssembly | Portable SIMD |
|:------------|:--------------|
| `b16x8.build(x: i32[8]) -> b16x8` | [b16x8.build](portable-simd.md#build-vector-from-individual-lanes) |
Does a `b16x8.const` also make sense?
Yes, I think we should add const instructions for all the new types.
Do bools have uses beyond holding/manipulating results of logical and relational operations?
Wondering if there is a need to have support for building bools, since we intend to have bool consts.
| `v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128` | [v32x4.shuffle](portable-simd.md#shuffle-lanes) |
| `v32x4.load1(addr, offset) -> v128` | [v32x4.load1](portable-simd.md#partial-load) |
| `v32x4.load2(addr, offset) -> v128` | [v32x4.load2](portable-simd.md#partial-load) |
| `v32x4.load3(addr, offset) -> v128` | [v32x4.load3](portable-simd.md#partial-load) |
Partial loads and stores were removed from asm.js since the non-power-of-2 size is awkward in some implementations.
FYI, I have a prototype for this proposal in WAVM, including a port of the Blake2b hash function. It's enabled by default, but if you wish to disable it you can with the CMake ENABLE_SIMD_PROTOTYPE option. The Blake2b implementation using the proposed SIMD operators is in Test/Blake2b/blake2b.wast.
Fold the portable-simd.md specification and the WebAssembly mapping into a single SIMD.md document that is specific to WebAssembly. This is easier to read and avoids a lot of confusion.
- Don't attempt to specify floating-point semantics again. Just refer to existing WebAssembly behavior.
- Remove the minNum and maxNum operations which were never intended to be included in WebAssembly.
- Clarify the trapping behavior of the float-to-int conversions.
- Remove the descriptions of reciprocal [sqrt] approximations. See #3.
- Rename all operations to use snake_case.
- Add *.const instructions for all the new types.
- Remove the partial load and store operators as suggested by @sunfishcode. They were removed from asm.js too.
@jfbastien, would you object to my merging this PR in its current state? I think I've addressed everything brought up in reviews either by changing the text or by filing issues for individual topics that need more discussion.
Yeah, I think as is you've got a good starting point, received good feedback, and definite interest from multiple folks. Go for it.
Add extended load definitions to SIMD.md
Rendered.