Floating-Point to Nearest Integer Conversions #247
Conversation
These instructions are widely useful, and I agree that these are hard to emulate without the operations being explicitly exposed. The suboptimal mapping on Intel hardware has been contentious in the past for this conversion and other operations, but we have included them as there's no good way to emulate them - explicitly asking @arunetm and other Intel folks for opinions here.
Just noting that in all my high-performance kernels that needed f32->i32 conversion, I had to stay away from the "native" Wasm instructions due to the large overhead. I have three kernels that need this instruction; they run at 2 GB/s, 3.6 GB/s and 2.6 GB/s when using a fast emulation. When using the "native" instruction on latest v8, I get 1.25 GB/s, 2.4 GB/s and 2.1 GB/s - a very significant and noticeable penalty, and that's given that the "native" code doesn't perform the rounding that's required, which I get for free in the emulated version, so the real perf delta is larger. As usual, these aren't microbenchmarks; the conversion is merely part of the computational chain. The instructions proposed here would perhaps help a bit in that at least my emulation can be tested against a native rounding instruction, and I'd expect these to perform similarly to the existing variants, but I'm expecting them to be similarly not useful for performance-sensitive code unless it's impossible to implement the algorithm without them. That's not really an objection to adding these, as these instructions aren't worse than what we already have - merely an observation. In the examples linked, I believe the expectation is that the lowering is much more optimal than the one proposed (because of the differences in handling saturation/NaNs).
(On a less pessimistic note, if we decide to go ahead with these, I'd be happy to contribute the kernels above as benchmarks for perf evaluation; we could compare "manual" rounding (adding 0.5 with the proper sign and using truncate), "assisted" rounding (using the new fp32 rounding + truncate), the proposed direct rounding, and the fast emulation.)
I think the current state of the spec makes it too risky to include these. The mapping on x86 looks concerning for these instructions, which have the largest gap w.r.t. instruction count (16 & 7). We already have a significant asymmetry in the spec considering the costs of op implementations on x86. I am afraid that including these significantly increases the risk of hiding higher perf penalties/regressions on one popular platform vs. others, limiting their usability for developers and moving away from the spec goals.
@arunetm please note that x86 lowering differs only by one instruction ( Without
Did you mean i32x4.trunc_sat_f32x4_s & i32x4.trunc_sat_f32x4_u here? Unfortunately these ops are highly expensive to implement on x86, and we have open issues discussing their tradeoffs (#173). We need to clearly understand their real-world implications regarding perf cliffs before including trunc instructions. We may be compounding the problem by adding new ops that rely on these. Given that we cannot assume broad availability of AVX-512 instructions, the extra cost of handling out-of-bounds inputs will make it even worse. Agreed that not including these will force developers to choose workarounds that may not be ideal on certain platforms. IMO, it's a better tradeoff than letting them be vulnerable to hidden perf cliffs on certain common platforms when they rely on SIMD anticipating consistent performance gains. Also, runtimes always have the option of adding platform-specific optimizations in these cases, where developer expectations will not be broken by the spec and only enhanced by implementers.
@Maratyszcza can you assign these proposed instructions new opcodes? The current opcodes conflict with pmin/pmax. |
Adding a preliminary vote for the inclusion of i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u to the SIMD proposal below. Please vote with:
👍 For including i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u operations
Added double-precision variants similar to #383. These instructions introduce floating-point-to-integer conversions that round to nearest-even rather than truncate. The lowering of these instructions on x86 and ARM64 differs only by one instruction from the truncation variants, and is more efficient than simulation. Lowering on x86 is somewhat inefficient due to special handling of out-of-bounds inputs (albeit not more inefficient than the existing
@Maratyszcza Please update the PR text so that it's clear that this is now proposing 4 instructions |
Closing as per #436. |
Introduction
This PR adds four forms of floating-point-to-integer conversion with rounding to nearest (ties to even), in addition to the existing instructions with the round-towards-zero mode. This operation is natively supported in SSE2 and ARMv8 NEON, and can be efficiently simulated with native instructions on ARMv7 NEON.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86 processors with SSE2 instruction set
y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to:
MOVAPS xmm_y, xmm_x
MOVAPS xmm_tmp, [wasm_f32x4_splat(0x1.0p+31f)]
CMPUNORDPS xmm_y, xmm_y
CMPLEPS xmm_tmp, xmm_x
ANDNPS xmm_y, xmm_x
CVTPS2DQ xmm_y, xmm_y
PXOR xmm_y, xmm_tmp
y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to:
MOVAPS xmm_tmp0, [wasm_f32x4_splat(0x1.0p+31f)]
MOVAPS xmm_tmp1, xmm_x
CMPNLTPS xmm_tmp1, xmm_tmp0
MOVAPS xmm_y, xmm_x
MOVAPS xmm_tmp2, xmm_tmp0
ANDPS xmm_tmp0, xmm_tmp1
SUBPS xmm_y, xmm_tmp0
PSLLD xmm_tmp1, 31
CMPLEPS xmm_tmp2, xmm_y
CVTPS2DQ xmm_y, xmm_y
PXOR xmm_tmp1, xmm_y
PXOR xmm_y, xmm_y
PCMPGTD xmm_y, xmm_x
POR xmm_tmp1, xmm_tmp2
PANDN xmm_y, xmm_tmp1
y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
MOVAPS xmm_tmp, xmm_x
CMPEQPD xmm_tmp, xmm_x
MOVAPS xmm_y, xmm_x
ANDPS xmm_tmp, [wasm_f64x2_splat(2147483647.0)]
MINPD xmm_y, xmm_tmp
CVTPD2DQ xmm_y, xmm_y
y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
MOVAPD xmm_y, xmm_x
XORPD xmm_tmp, xmm_tmp
MAXPD xmm_y, xmm_tmp
MINPD xmm_y, [wasm_f64x2_splat(4294967295.0)]
ADDPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
SHUFPS xmm_y, xmm_tmp, 0x88
ARM64 processors
y = i32x4.nearest_sat_f32x4_s(x) is lowered to:
FCVTNS Vy.4S, Vx.4S
y = i32x4.nearest_sat_f32x4_u(x) is lowered to:
FCVTNU Vy.4S, Vx.4S
y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
FCVTNS Vy.2D, Vx.2D
SQXTN Vy.2S, Vy.2D
y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
FCVTNU Vy.2D, Vx.2D
UQXTN Vy.2S, Vy.2D
ARM processors with ARMv8 (32-bit) instruction set
y = i32x4.nearest_sat_f32x4_s(x) is lowered to:
VCVTN.S32.F32 Qy, Qx
y = i32x4.nearest_sat_f32x4_u(x) is lowered to:
VCVTN.U32.F32 Qy, Qx
y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
FCVTNS Vy.2D, Vx.2D
SQXTN Vy.2S, Vy.2D
y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
FCVTNU Vy.2D, Vx.2D
UQXTN Vy.2S, Vy.2D
ARM processors with ARMv7 (32-bit) instruction set
y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to:
VMOV.I32 Qtmp, 0x80000000
VMOV.F32 Qy, 0x4B000000
VBSL Qtmp, Qx, Qy
VADD.F32 Qy, Qx, Qtmp
VSUB.F32 Qy, Qy, Qtmp
VACLT.F32 Qtmp, Qx, Qtmp
VBSL Qtmp, Qy, Qx
VCVT.S32.F32 Qy, Qtmp
y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to:
VMOV.I32 Qtmp, 0x4B000000
VADD.F32 Qy, Qx, Qtmp
VSUB.F32 Qy, Qy, Qtmp
VCLT.U32 Qtmp, Qx, Qtmp
VBSL Qtmp, Qy, Qx
VCVT.U32.F32 Qy, Qtmp