Extended pairwise addition instructions #380
Conversation
@Maratyszcza You beat me to it! :-) I had just about finished my own proposal on this. ;-) |
@Maratyszcza Unsigned:
movdqa xmm_x, xmm0
movdqa xmm_tmp, [wasm_i32x4_splat(0x0000ffff)]
pand xmm_tmp, xmm_x
psrld xmm_x, 16
paddd xmm_x, xmm_tmp
Signed:
movdqa xmm_tmp, xmm0
movdqa xmm_out, xmm0
pslld xmm_tmp, 16
psrad xmm_tmp, 16
psrad xmm_out, 16
paddd xmm_out, xmm_tmp |
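The two sequences above both split each 32-bit lane into its two 16-bit halves and add them. A scalar Python sketch of the same decomposition, per 32-bit lane (helper names are mine, not from the thread):

```python
def s16(v):
    """Interpret a 16-bit pattern as a signed value."""
    return v - 0x10000 if v & 0x8000 else v

def extadd_pairwise_i16x8_u(lane32):
    lo = lane32 & 0x0000FFFF   # pand with wasm_i32x4_splat(0x0000ffff)
    hi = lane32 >> 16          # psrld by 16
    return lo + hi             # paddd

def extadd_pairwise_i16x8_s(lane32):
    lo = s16(lane32 & 0xFFFF)  # pslld 16 then psrad 16 sign-extends the low half
    hi = s16(lane32 >> 16)     # psrad 16 sign-extends the high half
    return lo + hi             # paddd

print(extadd_pairwise_i16x8_u(0xFFFFFFFF))  # 131070, i.e. 65535 + 65535
print(extadd_pairwise_i16x8_s(0xFFFF0001))  # 0, i.e. (-1) + 1
```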
@omnisip Your snippets destroy the input |
Got it. I corrected the examples above and added a movdqa to prevent clobbering an input register. |
https://crrev.com/c/2513872 prototypes this for arm64. |
Add new macro-assembler instructions that can handle both AVX and SSE. In the SSE case it checks that dst == src1. (This is different from what the AvxHelper does, which passes dst as the first operand to AVX instructions.) Sorted SSSE3_INSTRUCTION_LIST by instruction code. Header additions are added by clangd; we were already using something from those headers via transitive includes, and adding them explicitly gets us closer to IWYU. Codegen sequences are from WebAssembly/simd#380 and also WebAssembly/simd#380 (comment). Bug: v8:11086 Change-Id: I4c04f836e471ed8b00f9ff1a1b2e6348a593d4de Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2578797 Commit-Queue: Zhi An Ng <zhin@chromium.org> Reviewed-by: Bill Budge <bbudge@chromium.org> Cr-Commit-Position: refs/heads/master@{#71688}
As proposed in WebAssembly/simd#380. This commit makes the new instructions available only via clang builtins and LLVM intrinsics to make their use opt-in while they are still being evaluated for inclusion in the SIMD proposal. Depends on D93771. Differential Revision: https://reviews.llvm.org/D93775
As proposed in WebAssembly/simd#380, using the opcodes used in LLVM and V8. Since these opcodes overlap with the opcodes of i64x2.all_true and i64x2.any_true, which have long since been removed from the SIMD proposal, this PR also removes those instructions.
I evaluated this proposal on the O(N^3) part of the 3-Point Correlation computation, described here. The use of extended pairwise addition instructions is guarded by
As evidenced by the above results, even though the algorithm is not particularly rich on pairwise summations, extended pairwise addition instructions bring noticeable speedup. |
Results on x86-64 systems are presented below:
Unlike ARM64, x86-64 results demonstrate mostly unchanged performance. However, the x86-64 performance is hindered by suboptimal loading of constants in V8, and could be expected to improve in the future. |
I have concerns with both the use cases and the speedup. From my point of view, needing to implement code from an unrelated paper raises some questions about the feasibility of the use cases listed above. Secondly, in addition to being flat on one platform, this shows low-double-digit millisecond gains on the other while the whole run is well under half a second on both; I don't think this would make a noticeable difference in production (also, with short runs like this I am not sure we can make accurate measurements). I think we ought to have somewhat higher standards for inclusion of instructions this late in the game. |
@penzn I disagree that N-point correlation is not a suitable workload for evaluation, but nonetheless did another experiment within XNNPACK to evaluate these instructions. In the experiment I added two new variants of 8-bit fixed-point GEMM/IGEMM microkernels: one using Extended Multiplication instructions in combination with the proposed Extended Pairwise Addition instructions, and another using the Extended Multiplication instructions in combination with the widening and addition instructions (not equivalent to extended pairwise addition in general, but it works in this particular case and is cheaper than emulation of the extended pairwise addition operation). In these microkernels the extended pairwise addition instructions are in the hot loop, and their effect on end-to-end performance is more pronounced than in the N-point correlation kernels. Results for end-to-end inference performance are presented in the table below:
The source code changes for the last experiment can be seen here |
I don't think I ever said that - it is a fine test, what I meant is that we were using something different from use cases listed in PR description. XNNPACK is a much better example - thank you for that! |
Adding a preliminary vote for the inclusion of extended pairwise addition operations to the SIMD proposal below. Please vote with - 👍 For including extended pairwise addition operations |
In native ARM code I've used paddl and padal a lot. They're very fast and useful. |
I am not 100% convinced - there are drops in performance on some platforms, plus this is, again, short microkernels. |
The XNNPACK results are end-to-end |
@Maratyszcza sorry, "microkernels" threw me off, you did mention they were end to end. |
Updated end-to-end results in XNNPACK on x86-64 to reflect the recent optimizations in V8. |
The XNNPACK benchmark only uses Side note: the naming of this intrinsic seems off? Should it be |
The 3-Point Correlation experiment used
cc @tlively |
Don't worry about that name. The |
These 4 instructions: - i32x4.extadd_pairwise_i16x8_s - i32x4.extadd_pairwise_i16x8_u - i16x8.extadd_pairwise_i8x16_s - i16x8.extadd_pairwise_i8x16_u were merged in WebAssembly#380. Drive-by cleanup to meta/README.md to list all generated files.
This was merged in WebAssembly#380.
Introduction
This PR introduces extended pairwise addition operations, which compute extended sums within adjacent pairs of lanes of a SIMD vector and produce a SIMD vector with half as many lanes, each twice as wide. This operation naturally arises when doing partial reductions in fixed-point algorithms, maps to a single instruction on ARM, and can be simulated with just 1-4 instructions on SSE4+ x86.
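A scalar reference for the semantics described above might look like this (a sketch based on the description in this PR, not official spec code):

```python
def extadd_pairwise(lanes, bits, signed):
    """Extended pairwise addition: sum adjacent pairs of `bits`-wide lanes,
    producing half as many lanes of twice the width (so no overflow is
    possible in the result)."""
    def as_signed(v):
        return v - (1 << bits) if v & (1 << (bits - 1)) else v
    conv = as_signed if signed else (lambda v: v)
    return [conv(a) + conv(b) for a, b in zip(lanes[0::2], lanes[1::2])]

# i32x4.extadd_pairwise_i16x8_u: 8 u16 lanes -> 4 u32 lanes
print(extadd_pairwise([1, 2, 3, 4, 5, 6, 7, 8], 16, signed=False))  # [3, 7, 11, 15]
# i16x8.extadd_pairwise_i8x16_s: byte 0xFF is -1 when signed
print(extadd_pairwise([0xFF, 1] * 8, 8, signed=True))  # [0, 0, 0, 0, 0, 0, 0, 0]
```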
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with XOP instruction set
y = i16x8.extadd_pairwise_i8x16_s(x) is lowered to VPHADDBW xmm_y, xmm_x
y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to VPHADDUBW xmm_y, xmm_x
y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to VPHADDWD xmm_y, xmm_x
y = i32x4.extadd_pairwise_i16x8_u(x) is lowered to VPHADDUWD xmm_y, xmm_x
x86/x86-64 processors with AVX instruction set
y = i16x8.extadd_pairwise_i8x16_s(x) (y is NOT x) is lowered to:
VMOVDQA xmm_y, [wasm_i8x16_splat(1)]
VPMADDUBSW xmm_y, xmm_y, xmm_x
y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to VPMADDUBSW xmm_y, xmm_x, [wasm_i8x16_splat(1)]
y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to VPMADDWD xmm_y, xmm_x, [wasm_i16x8_splat(1)]
y = i32x4.extadd_pairwise_i16x8_u(x) (y is NOT x) is lowered to:
VPSRLD xmm_y, xmm_x, 16
VPBLENDW xmm_tmp, xmm_x, xmm_y, 0xAA
VPADDD xmm_y, xmm_y, xmm_tmp
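The three-instruction unsigned sequence works because the shifted copy has zero high words, so the word blend with mask 0xAA isolates the low 16-bit half of each 32-bit lane. A scalar model per 32-bit lane (a sketch, not V8's actual code):

```python
def avx_u16_pairwise(x):
    """Model of VPSRLD/VPBLENDW/VPADDD on one 32-bit lane holding two u16s."""
    y = x >> 16                        # VPSRLD: high u16; high word is now zero
    # VPBLENDW 0xAA takes the even (low) word from x and the odd (high)
    # word from y; since y's high word is zero, this isolates x's low u16.
    tmp = (x & 0xFFFF) | (y & 0xFFFF0000)
    return y + tmp                     # VPADDD: high u16 + low u16

print(avx_u16_pairwise(0xFFFFFFFF))  # 131070, i.e. 65535 + 65535
```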
x86/x86-64 processors with SSSE3 instruction set
y = i16x8.extadd_pairwise_i8x16_s(x) (y is NOT x) is lowered to:
MOVDQA xmm_y, [wasm_i8x16_splat(1)]
PMADDUBSW xmm_y, xmm_x
y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to:
MOVDQA xmm_y, xmm_x
PMADDUBSW xmm_y, [wasm_i8x16_splat(1)]
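Both SSSE3 sequences exploit PMADDUBSW, which multiplies unsigned bytes from one operand by signed bytes from the other and adds adjacent products with signed saturation to 16 bits. With an all-ones operand, each product is just the other operand's byte, and the sums stay well inside the int16 range, so saturation can never trigger. A scalar model of one product pair (a sketch; the helper name is mine):

```python
def pmaddubsw_pair(u0, s0, u1, s1):
    """One output lane of PMADDUBSW: u* are unsigned bytes, s* signed bytes."""
    p = u0 * s0 + u1 * s1
    return max(-0x8000, min(0x7FFF, p))  # signed saturation to int16

# Signed extadd: ones in the unsigned slot, input bytes in the signed slot.
print(pmaddubsw_pair(1, -128, 1, 127))  # -1
# Unsigned extadd: input bytes in the unsigned slot, ones in the signed slot.
print(pmaddubsw_pair(255, 1, 255, 1))   # 510
```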
x86/x86-64 processors with SSE2 instruction set
y = i16x8.extadd_pairwise_i8x16_s(x) (y is NOT x) is lowered to:
MOVDQA xmm_tmp, xmm_x
MOVDQA xmm_y, xmm_x
PSLLW xmm_tmp, 8
PSRAW xmm_y, 8
PSRAW xmm_tmp, 8
PADDW xmm_y, xmm_tmp
y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to:
MOVDQA xmm_tmp, [wasm_i16x8_splat(0x00FF)]
MOVDQA xmm_y, xmm_x
PAND xmm_tmp, xmm_x
PSRLW xmm_y, 8
PADDW xmm_y, xmm_tmp
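The two SSE2 byte sequences mirror the 16-bit tricks at byte granularity: the signed form sign-extends each byte with a shift pair, while the unsigned form masks and shifts. A scalar model per 16-bit lane (a sketch; function names are mine):

```python
def s8(v):
    """Interpret an 8-bit pattern as a signed value."""
    return v - 0x100 if v & 0x80 else v

def sse2_s8_pairwise(lane16):
    lo = s8(lane16 & 0xFF)   # PSLLW 8 then PSRAW 8: sign-extended low byte
    hi = s8(lane16 >> 8)     # PSRAW 8: sign-extended high byte
    return lo + hi           # PADDW

def sse2_u8_pairwise(lane16):
    return (lane16 & 0x00FF) + (lane16 >> 8)  # PAND / PSRLW / PADDW

print(sse2_s8_pairwise(0xFF01))  # 0, i.e. (-1) + 1
print(sse2_u8_pairwise(0xFFFF))  # 510, i.e. 255 + 255
```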
y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to:
MOVDQA xmm_y, xmm_x
PMADDWD xmm_y, [wasm_i16x8_splat(1)]
y = i32x4.extadd_pairwise_i16x8_u(x) is lowered to:
MOVDQA xmm_y, xmm_x
PXOR xmm_y, [wasm_i16x8_splat(0x8000)]
PMADDWD xmm_y, [wasm_i16x8_splat(1)]
PADDD xmm_y, [wasm_i32x4_splat(0x00010000)]
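The unsigned SSE2 sequence relies on a bias trick: XORing a u16 value u with 0x8000 yields the signed value u - 0x8000, so PMADDWD with ones produces u0 + u1 - 0x10000 per 32-bit lane, and the final PADDD of 0x00010000 restores the unsigned pairwise sum. A scalar check of that identity (a sketch; function names are mine):

```python
def s16(v):
    """Interpret a 16-bit pattern as a signed value."""
    return v - 0x10000 if v & 0x8000 else v

def sse2_u16_pairwise(u0, u1):
    """Model of PXOR(0x8000) + PMADDWD(ones) + PADDD(0x00010000) per lane."""
    biased = s16(u0 ^ 0x8000) + s16(u1 ^ 0x8000)  # u0 + u1 - 0x10000
    return biased + 0x00010000                    # correction restores the sum

print(sse2_u16_pairwise(0xFFFF, 0xFFFF))  # 131070, i.e. 65535 + 65535
```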
ARM64 processors
y = i16x8.extadd_pairwise_i8x16_s(x) is lowered to SADDLP Vy.8H, Vx.16B
y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to UADDLP Vy.8H, Vx.16B
y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to SADDLP Vy.4S, Vx.8H
y = i32x4.extadd_pairwise_i16x8_u(x) is lowered to UADDLP Vy.4S, Vx.8H
ARMv7 processors with NEON instruction set
y = i16x8.extadd_pairwise_i8x16_s(x) is lowered to VPADDL.S8 Qy, Qx
y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to VPADDL.U8 Qy, Qx
y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to VPADDL.S16 Qy, Qx
y = i32x4.extadd_pairwise_i16x8_u(x) is lowered to VPADDL.U16 Qy, Qx