Double-precision conversion instructions #383
Conversation
What does the ARM codegen look like? Similar to ARM64? Or complicated enough that we will likely do runtime calls?
Similar to ARM64, but with additional constraints on register allocation (only the first 8 NEON registers are accessible as S FPU registers).
Do you have more specific pointers to code that will benefit from this? And also plans for benchmarking if this is prototyped?
These instructions are commonly used in scientific computing, but it is hard to provide code pointers because explicit vectorization in scientific computing codes is rare; they usually rely on compiler auto-vectorization of C/C++ and FORTRAN code. My plan for evaluating these is to implement a XorShift RNG with mapping of output to ...
Force-pushed from 4d052b3 to 78ac3e2
FYI there is code for the integer portion in https://gitlab.com/wg1/jpeg-xl/-/blob/master/lib/jxl/xorshift128plus-inl.h :)
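For reference, a minimal scalar sketch of xorshift128+ with one common mapping of the output to a double in [0, 1); the seed values and the mapping are illustrative assumptions, and this is not the jpeg-xl implementation linked above:

```c
#include <stdint.h>

/* Illustrative seed; any nonzero state works. */
static uint64_t s[2] = {0x123456789ABCDEF0ull, 0x0FEDCBA987654321ull};

static double xorshift128plus_double(void) {
  uint64_t s1 = s[0];
  const uint64_t s0 = s[1];
  const uint64_t result = s0 + s1;
  s[0] = s0;
  s1 ^= s1 << 23;
  s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
  /* One common mapping of the 64-bit output to [0, 1):
     use the top 53 bits as the significand. */
  return (double)(result >> 11) * 0x1.0p-53;
}
```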
Compiler auto-vectorization performance is somewhat nebulous, especially for cross-platform performance of i64x2 operations. The last time we tried this, IIRC we didn't see any benefit from autovectorizing. Perhaps one way to make a decision on many of these i64x2 operations is to evaluate a couple of sample scientific applications that might use these operations and see if the generated code would actually map to these operations. @Maratyszcza (or anyone else) Are there any open source applications that we can evaluate in addition to the jpeg-xl link above?
I'd like to express support for this. JPEG XL promotes floats to double, and the Highway math library requires f64 -> i32 conversions.
Promote and demote map to native nicely.
Can you help me understand your concern? Math libraries need floatN -> intM to do bitwise ops (shifting the exponent into place for LoadExp). For N=64 (double), x86 only provides M=32. This works out OK for ARM as well. double->int64 would be much more expensive on x86. Do you see a better alternative?
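To make the LoadExp idea above concrete, here is a scalar sketch (an illustration of the exponent-shifting pattern, not Highway's actual code); `load_exp` is a hypothetical helper that builds 2^k by shifting the biased exponent into place:

```c
#include <stdint.h>
#include <string.h>

static double load_exp(double x, int32_t k) {
  /* Place the biased exponent (k + 1023) into bits 62..52 of an IEEE-754
     double to form 2^k, then scale. Assumes k + 1023 stays in (0, 2047). */
  const uint64_t bits = (uint64_t)(uint32_t)(k + 1023) << 52;
  double two_k;
  memcpy(&two_k, &bits, sizeof(two_k)); /* bit-cast integer to double */
  return x * two_k;                     /* x * 2^k */
}
```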
I'm only pointing out that the codegen for ...
Force-pushed from 78ac3e2 to d6d4030
As proposed in WebAssembly/simd#383, with opcodes coordinated with the WIP V8 prototype.
As proposed in WebAssembly/simd#383. Differential Revision: https://reviews.llvm.org/D95012
For ...: this is a compare with 0, should it be ...?
Any suggested lowering for SSE2 ...?
Adding a preliminary vote for the inclusion of double-precision conversion operations to the SIMD proposal below. Please vote with:
- 👍 For including double-precision conversion operations
I don't think using a poor SIMD sequence instead of a poor scalar sequence is really a solution, plus we have no performance data on this. I don't think this instruction is ready to be included.
@penzn I'm working on performance data. Expect it later today.
Prototype these 6 instructions on x64:
- f64x2.convert_low_i32x4_s
- f64x2.convert_low_i32x4_u
- i32x4.trunc_sat_f64x2_s_zero
- i32x4.trunc_sat_f64x2_u_zero
- f32x4.demote_f64x2_zero
- f64x2.promote_low_f32x4

Some of these code sequences make use of special masks, we keep them in external references. Code sequence based on suggestions at: WebAssembly/simd#383

Bug: v8:11265
Change-Id: Ied67d7b5b6beaaccac7c179ec13504482cb9c915
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2643562
Reviewed-by: Deepti Gandluri <gdeepti@chromium.org>
Reviewed-by: Georg Neis <neis@chromium.org>
Commit-Queue: Zhi An Ng <zhin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#72297}
To evaluate this proposal I implemented SIMD and semi-SIMD (mostly SIMD, but using scalar conversions) versions of a mixed-precision dot product where the inputs are single-precision floats, but the accumulators and outputs are double-precision. Here's the SIMD version:

```c
double DotProd(size_t n, const float* x, const float* y) {
assert(n != 0);
assert(n % 4 == 0);
v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
do {
v128_t vx = wasm_v128_load(x);
x += 4;
v128_t vy = wasm_v128_load(y);
y += 4;
vacc_lo = wasm_f64x2_add(vacc_lo,
wasm_f64x2_mul(
__builtin_wasm_promote_low_f32x4_f64x2(vx),
__builtin_wasm_promote_low_f32x4_f64x2(vy)));
vx = wasm_v32x4_shuffle(vx, vx, 2, 3, 0, 1);
vy = wasm_v32x4_shuffle(vy, vy, 2, 3, 0, 1);
vacc_hi = wasm_f64x2_add(vacc_hi,
wasm_f64x2_mul(
__builtin_wasm_promote_low_f32x4_f64x2(vx),
__builtin_wasm_promote_low_f32x4_f64x2(vy)));
n -= 4;
} while (n != 0);
const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}
```

And here's the semi-SIMD version:

```c
double DotProd(size_t n, const float* x, const float* y) {
assert(n != 0);
assert(n % 4 == 0);
v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
do {
const v128_t vx = wasm_v128_load(x);
x += 4;
const v128_t vy = wasm_v128_load(y);
y += 4;
vacc_lo = wasm_f64x2_add(vacc_lo,
wasm_f64x2_mul(
wasm_f64x2_make(
(double) wasm_f32x4_extract_lane(vx, 0),
(double) wasm_f32x4_extract_lane(vx, 1)),
wasm_f64x2_make(
(double) wasm_f32x4_extract_lane(vy, 0),
(double) wasm_f32x4_extract_lane(vy, 1))));
vacc_hi = wasm_f64x2_add(vacc_hi,
wasm_f64x2_mul(
wasm_f64x2_make(
(double) wasm_f32x4_extract_lane(vx, 2),
(double) wasm_f32x4_extract_lane(vx, 3)),
wasm_f64x2_make(
(double) wasm_f32x4_extract_lane(vy, 2),
(double) wasm_f32x4_extract_lane(vy, 3))));
n -= 4;
} while (n != 0);
const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}
```

So far V8 implements the double-precision conversion instructions only on x86-64 (the ARM64 implementation is in review), and the results on x86-64 systems are presented below:
The version using conversion looks great, but for the semi-SIMD version, wouldn't it be faster to construct ...
@penzn I tried a version with scalar loads:

```c
double DotProd(size_t n, const float* x, const float* y) {
assert(n != 0);
assert(n % 4 == 0);
v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
do {
const float vx0 = x[0];
const float vx1 = x[1];
const float vy0 = y[0];
const float vy1 = y[1];
vacc_lo = wasm_f64x2_add(vacc_lo,
wasm_f64x2_mul(
wasm_f64x2_make((double) vx0, (double) vx1),
wasm_f64x2_make((double) vy0, (double) vy1)));
const float vx2 = x[2];
const float vx3 = x[3];
x += 4;
const float vy2 = y[2];
const float vy3 = y[3];
y += 4;
vacc_hi = wasm_f64x2_add(vacc_hi,
wasm_f64x2_mul(
wasm_f64x2_make((double) vx2, (double) vx3),
wasm_f64x2_make((double) vy2, (double) vy3)));
n -= 4;
} while (n != 0);
const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}
```

Performance is presented below:
Some good speedups with f64x2.promote_low_f32x4. We also have a use case in OpenCV. The lowering is fantastic as well, so I am supportive of this instruction.
For later relaxed semantics of floating-point to integer, I think it would suffice to have, on x86: ...
(We're fixing up 0x80..00 to 0x7F..FF if the sign changed)
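A scalar model of the fixup described above (the instruction sequence itself is elided, so this only illustrates the stated intent): x86 truncating conversions return 0x80000000 for NaN and out-of-range lanes, and the fixup patches that to 0x7FFFFFFF when the input was positive. `fixup_lane` is a hypothetical helper name.

```c
#include <stdint.h>

static int32_t fixup_lane(int32_t raw, double input) {
  /* `raw` stands in for a lane produced by CVTTPD2DQ, which yields
     INT32_MIN (0x80000000) for NaN and out-of-range inputs. */
  if (raw == INT32_MIN && input > 0.0) {
    return INT32_MAX; /* positive input came back negative: clamp to 0x7F..FF */
  }
  return raw;
}
```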
For relaxed semantics my preference would be to directly map to ...
At the risk of making you do more work @Maratyszcza, what do you think about splitting up promote/demote from this PR? From the meeting earlier, I got the sense that most of us feel pretty comfortable about promote/demote. We can probably get that in without much discussion. f64x2.convert_i32x4_{s,u} maybe should be split out as well; these require less discussion than trunc_sat (better lowering), and also already have use cases and benchmarks (at least the signed ones). Having trunc_sat out on its own will let us have a more focused discussion. The risk here I guess is orthogonality. But grouping all these instructions together risks losing some useful instructions because of objections to poor lowering of others.
@ngzhian These instructions were voted on together, so I'd rather have them merged together. Besides, I evaluated performance on this code for evaluation of log2(x), which is reused across several GitHub projects. Here's the full-SIMD version (with SIMD conversions):

```c
void remez5_0_log2_simd(const double *inputs, double* outputs, size_t num) {
assert(num != 0);
assert(num % 4 == 0);
const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);
do {
v128_t k1, k2;
v128_t p1, p2;
v128_t a1, a2;
v128_t xmm0, xmm1;
v128_t x1, x2;
/* Load four double values. */
xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
x1 = wasm_v128_load(inputs);
x2 = wasm_v128_load(inputs+2);
inputs += 4;
x1 = wasm_f64x2_pmin(xmm0, x1);
x2 = wasm_f64x2_pmin(xmm0, x2);
x1 = wasm_f64x2_pmax(xmm1, x1);
x2 = wasm_f64x2_pmax(xmm1, x2);
/* a = x / log2; */
xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
xmm1 = wasm_f64x2_const(0.0, 0.0);
a1 = wasm_f64x2_mul(x1, xmm0);
a2 = wasm_f64x2_mul(x2, xmm0);
/* k = (int)floor(a); p = (float)k; */
p1 = wasm_f64x2_lt(a1, xmm1);
p2 = wasm_f64x2_lt(a2, xmm1);
xmm0 = wasm_f64x2_const(1.0, 1.0);
p1 = wasm_v128_and(p1, xmm0);
p2 = wasm_v128_and(p2, xmm0);
a1 = wasm_f64x2_sub(a1, p1);
a2 = wasm_f64x2_sub(a2, p2);
k1 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a1);
k2 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a2);
p1 = __builtin_wasm_convert_low_s_i32x4_f64x2(k1);
p2 = __builtin_wasm_convert_low_s_i32x4_f64x2(k2);
/* x -= p * log2; */
xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
a1 = wasm_f64x2_mul(p1, xmm0);
a2 = wasm_f64x2_mul(p2, xmm0);
x1 = wasm_f64x2_sub(x1, a1);
x2 = wasm_f64x2_sub(x2, a2);
a1 = wasm_f64x2_mul(p1, xmm1);
a2 = wasm_f64x2_mul(p2, xmm1);
x1 = wasm_f64x2_sub(x1, a1);
x2 = wasm_f64x2_sub(x2, a2);
/* Compute e^x using a polynomial approximation. */
xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
a1 = wasm_f64x2_mul(x1, xmm0);
a2 = wasm_f64x2_mul(x2, xmm0);
a1 = wasm_f64x2_add(a1, xmm1);
a2 = wasm_f64x2_add(a2, xmm1);
xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
a1 = wasm_f64x2_mul(a1, x1);
a2 = wasm_f64x2_mul(a2, x2);
a1 = wasm_f64x2_add(a1, xmm0);
a2 = wasm_f64x2_add(a2, xmm0);
a1 = wasm_f64x2_mul(a1, x1);
a2 = wasm_f64x2_mul(a2, x2);
a1 = wasm_f64x2_add(a1, xmm1);
a2 = wasm_f64x2_add(a2, xmm1);
xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
a1 = wasm_f64x2_mul(a1, x1);
a2 = wasm_f64x2_mul(a2, x2);
a1 = wasm_f64x2_add(a1, xmm0);
a2 = wasm_f64x2_add(a2, xmm0);
a1 = wasm_f64x2_mul(a1, x1);
a2 = wasm_f64x2_mul(a2, x2);
a1 = wasm_f64x2_add(a1, xmm1);
a2 = wasm_f64x2_add(a2, xmm1);
/* p = 2^k; */
k1 = wasm_i32x4_add(k1, offset);
k2 = wasm_i32x4_add(k2, offset);
k1 = wasm_i32x4_shl(k1, 20);
k2 = wasm_i32x4_shl(k2, 20);
k1 = wasm_v32x4_shuffle(k1, k1, 1, 3, 0, 2);
k2 = wasm_v32x4_shuffle(k2, k2, 1, 3, 0, 2);
p1 = k1;
p2 = k2;
/* a *= 2^k. */
a1 = wasm_f64x2_mul(a1, p1);
a2 = wasm_f64x2_mul(a2, p2);
/* Store the results. */
wasm_v128_store(outputs, a1);
wasm_v128_store(outputs+2, a2);
outputs += 4;
num -= 4;
} while (num != 0);
}
```

Here's the semi-SIMD version (scalar conversions, rest is SIMD):

```c
void remez5_0_log2_semisimd(const double *inputs, double* outputs, size_t num) {
assert(num != 0);
assert(num % 4 == 0);
const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);
do {
v128_t k1, k2;
v128_t p1, p2;
v128_t a1, a2;
v128_t xmm0, xmm1;
v128_t x1, x2;
/* Load four double values. */
xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
x1 = wasm_v128_load(inputs);
x2 = wasm_v128_load(inputs+2);
inputs += 4;
x1 = wasm_f64x2_pmin(xmm0, x1);
x2 = wasm_f64x2_pmin(xmm0, x2);
x1 = wasm_f64x2_pmax(xmm1, x1);
x2 = wasm_f64x2_pmax(xmm1, x2);
/* a = x / log2; */
xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
xmm1 = wasm_f64x2_const(0.0, 0.0);
a1 = wasm_f64x2_mul(x1, xmm0);
a2 = wasm_f64x2_mul(x2, xmm0);
/* k = (int)floor(a); p = (float)k; */
p1 = wasm_f64x2_lt(a1, xmm1);
p2 = wasm_f64x2_lt(a2, xmm1);
xmm0 = wasm_f64x2_const(1.0, 1.0);
p1 = wasm_v128_and(p1, xmm0);
p2 = wasm_v128_and(p2, xmm0);
a1 = wasm_f64x2_sub(a1, p1);
a2 = wasm_f64x2_sub(a2, p2);
const int k1_lo = (int) wasm_f64x2_extract_lane(a1, 0);
const int k1_hi = (int) wasm_f64x2_extract_lane(a1, 1);
const int k2_lo = (int) wasm_f64x2_extract_lane(a2, 0);
const int k2_hi = (int) wasm_f64x2_extract_lane(a2, 1);
k1 = wasm_i32x4_make(k1_lo, k1_hi, 0, 0);
k2 = wasm_i32x4_make(k2_lo, k2_hi, 0, 0);
p1 = wasm_f64x2_make((double) k1_lo, (double) k1_hi);
p2 = wasm_f64x2_make((double) k2_lo, (double) k2_hi);
/* x -= p * log2; */
xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
a1 = wasm_f64x2_mul(p1, xmm0);
a2 = wasm_f64x2_mul(p2, xmm0);
x1 = wasm_f64x2_sub(x1, a1);
x2 = wasm_f64x2_sub(x2, a2);
a1 = wasm_f64x2_mul(p1, xmm1);
a2 = wasm_f64x2_mul(p2, xmm1);
x1 = wasm_f64x2_sub(x1, a1);
x2 = wasm_f64x2_sub(x2, a2);
/* Compute e^x using a polynomial approximation. */
xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
a1 = wasm_f64x2_mul(x1, xmm0);
a2 = wasm_f64x2_mul(x2, xmm0);
a1 = wasm_f64x2_add(a1, xmm1);
a2 = wasm_f64x2_add(a2, xmm1);
xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
a1 = wasm_f64x2_mul(a1, x1);
a2 = wasm_f64x2_mul(a2, x2);
a1 = wasm_f64x2_add(a1, xmm0);
a2 = wasm_f64x2_add(a2, xmm0);
a1 = wasm_f64x2_mul(a1, x1);
a2 = wasm_f64x2_mul(a2, x2);
a1 = wasm_f64x2_add(a1, xmm1);
a2 = wasm_f64x2_add(a2, xmm1);
xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
a1 = wasm_f64x2_mul(a1, x1);
a2 = wasm_f64x2_mul(a2, x2);
a1 = wasm_f64x2_add(a1, xmm0);
a2 = wasm_f64x2_add(a2, xmm0);
a1 = wasm_f64x2_mul(a1, x1);
a2 = wasm_f64x2_mul(a2, x2);
a1 = wasm_f64x2_add(a1, xmm1);
a2 = wasm_f64x2_add(a2, xmm1);
/* p = 2^k; */
k1 = wasm_i32x4_add(k1, offset);
k2 = wasm_i32x4_add(k2, offset);
k1 = wasm_i32x4_shl(k1, 20);
k2 = wasm_i32x4_shl(k2, 20);
k1 = wasm_v32x4_shuffle(k1, k1, 1, 3, 0, 2);
k2 = wasm_v32x4_shuffle(k2, k2, 1, 3, 0, 2);
p1 = k1;
p2 = k2;
/* a *= 2^k. */
a1 = wasm_f64x2_mul(a1, p1);
a2 = wasm_f64x2_mul(a2, p2);
/* Store the results. */
wasm_v128_store(outputs, a1);
wasm_v128_store(outputs+2, a2);
outputs += 4;
num -= 4;
} while (num != 0);
}
```

Here are the performance results:
Force-pushed from d6d4030 to ffba47f
Merging as per #429 and discussion above.
[interpreter] Implement i32x4.trunc_sat_f64x2_{s,u}_zero

This converts 2 f64 to 2 i32, then zeroes the top 2 lanes. These 2 instructions were merged as part of #383. This change also refactors some test cases from simd_conversions out into a script that generates both i32x4.trunc_sat_f32x4_{s,u} and i32x4.trunc_sat_f64x2_{s,u}_zero.

* Add encode/decode
* Update interpreter/exec/simd.ml

Co-authored-by: Andreas Rossberg <rossberg@mpi-sws.org>
These 2 instructions were merged as part of #383. The test cases are added manually to simd_conversions.wast, using constants from test/core/conversions.wast.
These 2 instructions were merged as part of #383.
Instructions were added in #383. Consolidate conversion operations (vcvtop) more, merging int-int widening operations. Drive-by fix extmul definition in syntax (shouldn't include the shape).
@Maratyszcza @ngzhian Hi, I am lowering f64x2.convert_low_i32x4_u for SSE2 in a backend. I see this conversion here: ...
I am trying to understand the magic here. Do you have a reference with more detail, or an explanation? I checked out V8's implementation here: https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/x64/macro-assembler-x64.cc;l=2406;drc=92a46414765f4873251cf9ecefbef130648e2af8 where it is mentioned that ...
But I am not really sure what this implies as far as how this conversion works. Any comments?
@jbl ...
@Maratyszcza Hi, thanks so much. I got it now! The "magic" I was missing was in the subtract.
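For anyone else puzzling over the same sequence: interleaving each 32-bit integer with the constant 0x43300000 forms the double 2^52 + u (the integer lands in the low mantissa bits), and subtracting 2^52 recovers u exactly. A scalar sketch of the same trick (illustrative only, not V8's code):

```c
#include <stdint.h>
#include <string.h>

static double u32_to_f64(uint32_t u) {
  /* Pairing the 32-bit integer with the high word 0x43300000 forms the
     double 2^52 + u; subtracting 2^52 then recovers u exactly. */
  const uint64_t bits = 0x4330000000000000ull | (uint64_t)u;
  double d;
  memcpy(&d, &bits, sizeof(d)); /* bit-cast integer to double */
  return d - 0x1.0p+52;         /* exact for every 32-bit u */
}
```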
Introduction
As discussed in #348, the WebAssembly SIMD specification is functionally incomplete w.r.t. double-precision computations, as it doesn't cover many basic C/C++ operations, particularly conversions between the `double` type and other types. Currently any code that uses these operations would be de-vectorized (see examples in the `emmintrin.h` header in Emscripten). Double-precision computations are extensively used in scientific computing, and providing a complete set of double-precision operations is a must for bringing scientific computing applications to the Web. In the foreseeable future WebAssembly SIMD will be the only avenue for high-performance scientific computing on the Web, as neither WebGL nor WebGPU supports double-precision computations. This PR introduces the minimal set of conversion operations to fill this gap.
Applications
- Use-cases in OpenCV:
  - `f32x4.demote_f64x2_zero`
  - `f64x2.promote_low_f32x4`
  - `f64x2.convert_low_i32x4_s`
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512F + AVX512VL instruction sets
- `y = f64x2.convert_low_i32x4_u(x)` is lowered to `VCVTUDQ2PD xmm_y, xmm_x`
- `y = i32x4.trunc_sat_f64x2_u_zero(x)` (`y` is NOT `x`) is lowered to:
  - `VXORPD xmm_y, xmm_y, xmm_y`
  - `VMAXPD xmm_y, xmm_x, xmm_y`
  - `VCVTTPD2UDQ xmm_y, xmm_y`
x86/x86-64 processors with AVX instruction set
- `y = f64x2.convert_low_i32x4_s(x)` is lowered to `VCVTDQ2PD xmm_y, xmm_x`
- `y = f64x2.convert_low_i32x4_u(x)` is lowered to:
  - `VUNPCKLPS xmm_y, xmm_x, [wasm_i32x4_splat(0x43300000)]`
  - `VSUBPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]`
- `y = i32x4.trunc_sat_f64x2_s_zero(x)` (`y` is NOT `x`) is lowered to:
  - `VCMPEQPD xmm_y, xmm_x, xmm_x`
  - `VANDPD xmm_y, xmm_y, [wasm_f64x2_splat(2147483647.0)]`
  - `VMINPD xmm_y, xmm_x, xmm_y`
  - `VCVTTPD2DQ xmm_y, xmm_y`
- `y = i32x4.trunc_sat_f64x2_u_zero(x)` is lowered to:
  - `VXORPD xmm_tmp, xmm_tmp, xmm_tmp`
  - `VMAXPD xmm_y, xmm_x, xmm_tmp`
  - `VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]`
  - `VROUNDPD xmm_y, xmm_y, 0x0B`
  - `VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]`
  - `VSHUFPS xmm_y, xmm_y, xmm_tmp, 0x88`
- `y = f32x4.demote_f64x2_zero(x)` is lowered to `VCVTPD2PS xmm_y, xmm_x`
- `y = f64x2.promote_low_f32x4(x)` is lowered to `VCVTPS2PD xmm_y, xmm_x`
x86/x86-64 processors with SSE4.1 instruction set
- `y = i32x4.trunc_sat_f64x2_u_zero(x)` is lowered to:
  - `MOVAPD xmm_y, xmm_x`
  - `XORPD xmm_tmp, xmm_tmp`
  - `MAXPD xmm_y, xmm_tmp`
  - `MINPD xmm_y, [wasm_f64x2_splat(4294967295.0)]`
  - `ROUNDPD xmm_y, xmm_y, 0x0B`
  - `ADDPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]`
  - `SHUFPS xmm_y, xmm_tmp, 0x88`
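The ADDPD-with-2^52 tail in the unsigned truncation sequences above is the same bias trick in reverse; a scalar sketch of the idea (illustrative only, not a required lowering):

```c
#include <stdint.h>
#include <string.h>

static uint32_t low_bits_after_bias(double y) {
  /* Assumes y has already been clamped to [0.0, 4294967295.0] and truncated
     toward zero (the MAXPD/MINPD/ROUNDPD steps). Adding 2^52 places the
     integer value into the low 32 bits of the double's representation. */
  const double biased = y + 0x1.0p+52;
  uint64_t bits;
  memcpy(&bits, &biased, sizeof(bits));
  return (uint32_t)bits; /* low 32 bits hold the truncated unsigned value */
}
```

The SHUFPS with immediate 0x88 then gathers those low 32-bit words into the two result lanes and fills the upper lanes from the zeroed temporary.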
x86/x86-64 processors with SSE2 instruction set
- `y = f64x2.convert_low_i32x4_s(x)` is lowered to `CVTDQ2PD xmm_y, xmm_x`
- `y = f64x2.convert_low_i32x4_u(x)` is lowered to:
  - `MOVAPS xmm_y, xmm_x`
  - `UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]`
  - `SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]`
- `y = i32x4.trunc_sat_f64x2_s_zero(x)` is lowered to:
  - `MOVAPS xmm_tmp, xmm_x`
  - `CMPEQPD xmm_tmp, xmm_x`
  - `MOVAPS xmm_y, xmm_x`
  - `ANDPS xmm_tmp, [wasm_f64x2_splat(2147483647.0)]`
  - `MINPD xmm_y, xmm_tmp`
  - `CVTTPD2DQ xmm_y, xmm_y`
- `y = f32x4.demote_f64x2_zero(x)` is lowered to `CVTPD2PS xmm_y, xmm_x`
- `y = f64x2.promote_low_f32x4(x)` is lowered to `CVTPS2PD xmm_y, xmm_x`
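For reference, a scalar sketch of the per-lane behavior the signed truncation sequences above arrange for (illustrative, not a normative definition):

```c
#include <stdint.h>
#include <math.h>

static int32_t trunc_sat_s_lane(double x) {
  /* NaN becomes 0, out-of-range values saturate, and in-range values
     truncate toward zero. */
  if (isnan(x)) return 0;
  if (x <= (double)INT32_MIN) return INT32_MIN;
  if (x >= (double)INT32_MAX) return INT32_MAX;
  return (int32_t)trunc(x);
}
```

CVTTPD2DQ already returns 0x80000000 for NaN and out-of-range lanes, which is the desired saturation for large negative inputs; the compare/and/min prefix only has to redirect the NaN case to 0.0 and the large-positive case to 2147483647.0 before the conversion.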
ARM64 processors
- `y = f64x2.convert_low_i32x4_s(x)` is lowered to:
  - `SSHLL Vy.2D, Vx.2S, #0`
  - `SCVTF Vy.2D, Vy.2D`
- `y = f64x2.convert_low_i32x4_u(x)` is lowered to:
  - `USHLL Vy.2D, Vx.2S, #0`
  - `UCVTF Vy.2D, Vy.2D`
- `y = i32x4.trunc_sat_f64x2_s_zero(x)` is lowered to:
  - `FCVTZS Vy.2D, Vx.2D`
  - `SQXTN Vy.2S, Vy.2D`
- `y = i32x4.trunc_sat_f64x2_u_zero(x)` is lowered to:
  - `FCVTZU Vy.2D, Vx.2D`
  - `UQXTN Vy.2S, Vy.2D`
- `y = f32x4.demote_f64x2_zero(x)` is lowered to `FCVTN Vy.2S, Vx.2D`
- `y = f64x2.promote_low_f32x4(x)` is lowered to `FCVTL Vy.2D, Vx.2S`
ARMv7 processors with NEON
- `y = f64x2.convert_low_i32x4_s(x)` (`x` in `q0`-`q7`) is lowered to:
  - `VCVT.F64.S32 Dy_lo, Sx_ll`
  - `VCVT.F64.S32 Dy_hi, Sx_lh`
- `y = f64x2.convert_low_i32x4_u(x)` (`x` in `q0`-`q7`) is lowered to:
  - `VCVT.F64.U32 Dy_lo, Sx_ll`
  - `VCVT.F64.U32 Dy_hi, Sx_lh`
- `y = i32x4.trunc_sat_f64x2_s_zero(x)` (`y` in `q0`-`q7`) is lowered to:
  - `VCVT.S32.F64 Sy_ll, Dx_lo`
  - `VCVT.S32.F64 Sy_lh, Dx_hi`
  - `VMOV.I32 Dy_hi, #0`
- `y = i32x4.trunc_sat_f64x2_u_zero(x)` (`y` in `q0`-`q7`) is lowered to:
  - `VCVT.U32.F64 Sy_ll, Dx_lo`
  - `VCVT.U32.F64 Sy_lh, Dx_hi`
  - `VMOV.I32 Dy_hi, #0`
- `y = f32x4.demote_f64x2_zero(x)` is lowered to:
  - `VCVT.F32.F64 Sy_ll, Dx_lo`
  - `VCVT.F32.F64 Sy_lh, Dx_hi`
  - `VMOV.I32 Dy_hi, #0`
- `y = f64x2.promote_low_f32x4(x)` is lowered to:
  - `VCVT.F64.F32 Dy_lo, Sx_ll`
  - `VCVT.F64.F32 Dy_hi, Sx_lh`