Floating-Point to Nearest Integer Conversions #247
Conversation
These instructions are widely useful, and I agree that these are hard to emulate without the operations being explicitly exposed. The suboptimal mapping on Intel hardware has been contentious in the past for this conversion and other operations, but we have included them as there's no good way to emulate them - explicitly asking @arunetm and other Intel folks for opinions here.
Just noting that in all my high-performance kernels that needed f32->i32 conversion, I had to stay away from the "native" Wasm instructions due to the large overhead. I have three kernels that need this instruction; they run at 2 GB/s, 3.6 GB/s and 2.6 GB/s when using a fast emulation. When using the "native" instruction on latest v8, I get 1.25 GB/s, 2.4 GB/s and 2.1 GB/s - a very significant and noticeable penalty, and that's given that the "native" code doesn't perform the rounding that's required, which I get for free in the emulated version, so the real perf delta is larger. As usual, these aren't microbenchmarks; the conversion is merely part of the computational chain. The instructions proposed here would perhaps help a bit in that at least my emulation can be tested against a native rounding instruction, and I'd expect these to perform similarly to the existing variants, but I'm expecting them to be similarly not useful for performance-sensitive code unless it's impossible to implement the algorithm without them. That's not really an objection to adding these, as these instructions aren't worse than what we already have - merely an observation. In the examples linked, I believe the expectation is that the lowering is much more optimal than the one proposed (because of the differences in handling saturation/NaNs).
(On a less pessimistic note, if we decide to go ahead with these, I'd be happy to contribute the kernels above as benchmarks for perf evaluation; we could compare "manual" rounding (adding 0.5 with the proper sign and using truncate), "assisted" rounding (using the new fp32 rounding + truncate), the proposed direct rounding, and the fast emulation.)
I think the current state of the spec makes it too risky to include these. The mapping on x86 looks concerning for these instructions, which have the largest gap w.r.t. instruction count (16 & 7). We already have a significant asymmetry in the spec considering the costs of op implementations on x86. I am afraid that including these significantly increases the risk of hiding higher perf penalties/regressions on one popular platform vs. others, limiting their usability for developers and moving away from the spec goals.
@arunetm please note that x86 lowering differs only by one instruction ( Without
Did you mean i32x4.trunc_sat_f32x4_s & i32x4.trunc_sat_f32x4_u here? Unfortunately these ops are highly expensive to implement on x86, and we have open issues discussing their tradeoffs (#173). We need to clearly understand their real-world implications regarding perf cliffs before including trunc instructions. We may be compounding the problem by adding new ops that rely on these. Given that we cannot assume broad availability of AVX-512 instructions, the extra cost of handling out-of-bounds inputs will make it even worse. Agreed that not including these will force developers to choose workarounds that may not be ideal on certain platforms. IMO, it's a better tradeoff than letting them be vulnerable to hidden perf cliffs on certain common platforms when they rely on SIMD anticipating consistent performance gains. Also, runtimes always have the option of adding platform-specific optimizations in these cases, where developer expectations will not be broken by the spec and only enhanced by implementers.
@Maratyszcza can you assign these proposed instructions new opcodes? The current opcodes conflict with pmin/pmax. |
Adding a preliminary vote for the inclusion of i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u to the SIMD proposal below. Please vote with:
👍 For including i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u operations
Added double-precision variants similar to #383. These instructions introduce floating-point-to-integer conversions that round to nearest-even rather than truncate. The lowering of these instructions on x86 and ARM64 differs only by one instruction from the truncation variants, and is more efficient than simulation. Lowering on x86 is somewhat inefficient due to special handling of out-of-bounds inputs (albeit not more inefficient than the existing
@Maratyszcza Please update the PR text so that it's clear that this is now proposing 4 instructions |
Closing as per #436. |
Introduction
This PR adds four forms of floating-point-to-integer conversion with rounding to nearest (ties to even), in addition to the existing instructions with the round-towards-zero mode. This operation is natively supported in SSE2 and ARMv8 NEON, and can be efficiently simulated with native instructions on ARMv7 NEON.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86 processors with SSE2 instruction set
y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to:
MOVAPS xmm_y, xmm_x
MOVAPS xmm_tmp, [wasm_f32x4_splat(0x1.0p+31f)]
CMPUNORDPS xmm_y, xmm_y
CMPLEPS xmm_tmp, xmm_x
ANDNPS xmm_y, xmm_x
CVTPS2DQ xmm_y, xmm_y
PXOR xmm_y, xmm_tmp
y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to:
MOVAPS xmm_tmp0, [wasm_f32x4_splat(0x1.0p+31f)]
MOVAPS xmm_tmp1, xmm_x
CMPNLTPS xmm_tmp1, xmm_tmp0
MOVAPS xmm_y, xmm_x
MOVAPS xmm_tmp2, xmm_tmp0
ANDPS xmm_tmp0, xmm_tmp1
SUBPS xmm_y, xmm_tmp0
PSLLD xmm_tmp1, 31
CMPLEPS xmm_tmp2, xmm_y
CVTPS2DQ xmm_y, xmm_y
PXOR xmm_tmp1, xmm_y
PXOR xmm_y, xmm_y
PCMPGTD xmm_y, xmm_x
POR xmm_tmp1, xmm_tmp2
PANDN xmm_y, xmm_tmp1
y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
MOVAPS xmm_tmp, xmm_x
CMPEQPD xmm_tmp, xmm_x
MOVAPS xmm_y, xmm_x
ANDPS xmm_tmp, [wasm_f64x2_splat(2147483647.0)]
MINPD xmm_y, xmm_tmp
CVTPD2DQ xmm_y, xmm_y
y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
MOVAPD xmm_y, xmm_x
XORPD xmm_tmp, xmm_tmp
MAXPD xmm_y, xmm_tmp
MINPD xmm_y, [wasm_f64x2_splat(4294967295.0)]
ADDPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
SHUFPS xmm_y, xmm_tmp, 0x88
ARM64 processors
y = i32x4.nearest_sat_f32x4_s(x) is lowered to:
FCVTNS Vy.4S, Vx.4S
y = i32x4.nearest_sat_f32x4_u(x) is lowered to:
FCVTNU Vy.4S, Vx.4S
y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
FCVTNS Vy.2D, Vx.2D
SQXTN Vy.2S, Vy.2D
y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
FCVTNU Vy.2D, Vx.2D
UQXTN Vy.2S, Vy.2D
ARM processors with ARMv8 (32-bit) instruction set
y = i32x4.nearest_sat_f32x4_s(x) is lowered to:
VCVTN.S32.F32 Qy, Qx
y = i32x4.nearest_sat_f32x4_u(x) is lowered to:
VCVTN.U32.F32 Qy, Qx
y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
FCVTNS Vy.2D, Vx.2D
SQXTN Vy.2S, Vy.2D
y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
FCVTNU Vy.2D, Vx.2D
UQXTN Vy.2S, Vy.2D
ARM processors with ARMv7 (32-bit) instruction set
y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to:
VMOV.I32 Qtmp, 0x80000000
VMOV.F32 Qy, 0x4B000000
VBSL Qtmp, Qx, Qy
VADD.F32 Qy, Qx, Qtmp
VSUB.F32 Qy, Qy, Qtmp
VACLT.F32 Qtmp, Qx, Qtmp
VBSL Qtmp, Qy, Qx
VCVT.S32.F32 Qy, Qtmp
y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to:
VMOV.I32 Qtmp, 0x4B000000
VADD.F32 Qy, Qx, Qtmp
VSUB.F32 Qy, Qy, Qtmp
VCLT.U32 Qtmp, Qx, Qtmp
VBSL Qtmp, Qy, Qx
VCVT.U32.F32 Qy, Qtmp