
Extended pairwise addition instructions #380

Merged
merged 2 commits on Feb 4, 2021

Conversation

Maratyszcza
Contributor

Maratyszcza commented Oct 9, 2020

Introduction

This PR introduces extended pairwise addition operations that compute extended sums within adjacent pairs of lanes of a SIMD vector, producing a SIMD vector with half as many lanes, each twice as wide. This operation arises naturally when doing partial reductions in fixed-point algorithms, maps nicely to a single instruction on ARM, and can be simulated with just 1-4 instructions on SSE4+ x86.
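As a reference for the semantics, here is a minimal scalar sketch of the signed 8-to-16-bit form (the function name and loop are illustrative only, not spec text):

```c
#include <stdint.h>

/* i16x8.extadd_pairwise_i8x16_s: each adjacent pair of 8-bit lanes is
   sign-extended to 16 bits and summed, turning 16 narrow lanes into
   8 lanes of twice the width. The sum of two 8-bit values always fits
   in 16 bits, so the operation cannot overflow. */
void extadd_pairwise_i8x16_s(const int8_t x[16], int16_t y[8]) {
  for (int i = 0; i < 8; i++) {
    y[i] = (int16_t) x[2 * i] + (int16_t) x[2 * i + 1];
  }
}
```

The unsigned forms zero-extend instead, and the 16-to-32-bit forms are analogous with wider lanes; in every case the widened result cannot overflow.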

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with XOP instruction set

  • i16x8.extadd_pairwise_i8x16_s
    • y = i16x8.extadd_pairwise_i8x16_s(x) is lowered to VPHADDBW xmm_y, xmm_x
  • i16x8.extadd_pairwise_i8x16_u
    • y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to VPHADDUBW xmm_y, xmm_x
  • i32x4.extadd_pairwise_i16x8_s
    • y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to VPHADDWD xmm_y, xmm_x
  • i32x4.extadd_pairwise_i16x8_u
    • y = i32x4.extadd_pairwise_i16x8_u(x) is lowered to VPHADDUWD xmm_y, xmm_x

x86/x86-64 processors with AVX instruction set

  • i16x8.extadd_pairwise_i8x16_s
    • y = i16x8.extadd_pairwise_i8x16_s(x) (y is NOT x) is lowered to:
      • VMOVDQA xmm_y, [wasm_i8x16_splat(1)]
      • VPMADDUBSW xmm_y, xmm_y, xmm_x
  • i16x8.extadd_pairwise_i8x16_u
    • y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to VPMADDUBSW xmm_y, xmm_x, [wasm_i8x16_splat(1)]
  • i32x4.extadd_pairwise_i16x8_s
    • y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to VPMADDWD xmm_y, xmm_x, [wasm_i16x8_splat(1)]
  • i32x4.extadd_pairwise_i16x8_u
    • y = i32x4.extadd_pairwise_i16x8_u(x) (y is NOT x) is lowered to:
      • VPSRLD xmm_y, xmm_x, 16
      • VPBLENDW xmm_tmp, xmm_x, xmm_y, 0xAA
      • VPADDD xmm_y, xmm_y, xmm_tmp
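Both the AVX and SSSE3 lowerings of the first three forms rest on the same observation: PMADDUBSW and PMADDWD multiply adjacent lanes and sum the products, so multiplying by a splat of 1 leaves just the widening pairwise add. A hedged sketch in SSE intrinsics (non-VEX forms; the AVX encodings behave identically), offered as an illustration rather than prescribed codegen:

```c
#include <emmintrin.h>   /* SSE2: _mm_madd_epi16, _mm_set1_epi8/16 */
#include <tmmintrin.h>   /* SSSE3: _mm_maddubs_epi16 */

/* PMADDUBSW treats its first operand as unsigned bytes and its second
   as signed bytes, so the splat of 1 goes into whichever operand the
   input must NOT occupy. */
__m128i extadd_pairwise_i8x16_s(__m128i x) {
  return _mm_maddubs_epi16(_mm_set1_epi8(1), x);  /* x as signed bytes */
}

__m128i extadd_pairwise_i8x16_u(__m128i x) {
  return _mm_maddubs_epi16(x, _mm_set1_epi8(1));  /* x as unsigned bytes */
}

/* PMADDWD is a signed 16x16->32 multiply with a pairwise sum. */
__m128i extadd_pairwise_i16x8_s(__m128i x) {
  return _mm_madd_epi16(x, _mm_set1_epi16(1));
}
```

The unsigned 16-to-32-bit form has no unsigned PMADDWD counterpart, which is why the AVX lowering above falls back to shift, blend, and add.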

x86/x86-64 processors with SSSE3 instruction set

  • i16x8.extadd_pairwise_i8x16_s
    • y = i16x8.extadd_pairwise_i8x16_s(x) (y is NOT x) is lowered to:
      • MOVDQA xmm_y, [wasm_i8x16_splat(1)]
      • PMADDUBSW xmm_y, xmm_x
  • i16x8.extadd_pairwise_i8x16_u
    • y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to:
      • MOVDQA xmm_y, xmm_x
      • PMADDUBSW xmm_y, [wasm_i8x16_splat(1)]

x86/x86-64 processors with SSE2 instruction set

  • i16x8.extadd_pairwise_i8x16_s
    • y = i16x8.extadd_pairwise_i8x16_s(x) (y is NOT x) is lowered to:
      • MOVDQA xmm_tmp, xmm_x
      • MOVDQA xmm_y, xmm_x
      • PSLLW xmm_tmp, 8
      • PSRAW xmm_y, 8
      • PSRAW xmm_tmp, 8
      • PADDW xmm_y, xmm_tmp
  • i16x8.extadd_pairwise_i8x16_u
    • y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to:
      • MOVDQA xmm_tmp, [wasm_i16x8_splat(0x00FF)]
      • MOVDQA xmm_y, xmm_x
      • PAND xmm_tmp, xmm_x
      • PSRLW xmm_y, 8
      • PADDW xmm_y, xmm_tmp
  • i32x4.extadd_pairwise_i16x8_s
    • y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to:
      • MOVDQA xmm_y, xmm_x
      • PMADDWD xmm_y, [wasm_i16x8_splat(1)]
  • i32x4.extadd_pairwise_i16x8_u
    • y = i32x4.extadd_pairwise_i16x8_u(x) is lowered to:
      • MOVDQA xmm_y, xmm_x
      • PXOR xmm_y, [wasm_i16x8_splat(0x8000)]
      • PMADDWD xmm_y, [wasm_i16x8_splat(1)]
      • PADDD xmm_y, [wasm_i32x4_splat(0x00010000)]
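The last SSE2 sequence is a sign-bias trick: XOR with 0x8000 maps each unsigned lane a to the signed value a - 32768, the signed PMADDWD then yields a + b - 65536 for each pair, and the final PADDD adds back 2 * 32768. A hedged intrinsics sketch of the same sequence:

```c
#include <emmintrin.h>

/* SSE2-only i32x4.extadd_pairwise_i16x8_u, following the
   PXOR/PMADDWD/PADDD sequence above. */
__m128i extadd_pairwise_i16x8_u(__m128i x) {
  /* a ^ 0x8000 reinterpreted as signed equals a - 32768 */
  __m128i biased = _mm_xor_si128(x, _mm_set1_epi16((short) 0x8000));
  /* per pair: (a - 32768) + (b - 32768) = a + b - 65536 */
  __m128i sums = _mm_madd_epi16(biased, _mm_set1_epi16(1));
  /* add back 2 * 32768 = 0x00010000 per 32-bit lane */
  return _mm_add_epi32(sums, _mm_set1_epi32(0x00010000));
}
```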

ARM64 processors

  • i16x8.extadd_pairwise_i8x16_s
    • y = i16x8.extadd_pairwise_i8x16_s(x) is lowered to SADDLP Vy.8H, Vx.16B
  • i16x8.extadd_pairwise_i8x16_u
    • y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to UADDLP Vy.8H, Vx.16B
  • i32x4.extadd_pairwise_i16x8_s
    • y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to SADDLP Vy.4S, Vx.8H
  • i32x4.extadd_pairwise_i16x8_u
    • y = i32x4.extadd_pairwise_i16x8_u(x) is lowered to UADDLP Vy.4S, Vx.8H

ARMv7 processors with NEON instruction set

  • i16x8.extadd_pairwise_i8x16_s
    • y = i16x8.extadd_pairwise_i8x16_s(x) is lowered to VPADDL.S8 Qy, Qx
  • i16x8.extadd_pairwise_i8x16_u
    • y = i16x8.extadd_pairwise_i8x16_u(x) is lowered to VPADDL.U8 Qy, Qx
  • i32x4.extadd_pairwise_i16x8_s
    • y = i32x4.extadd_pairwise_i16x8_s(x) is lowered to VPADDL.S16 Qy, Qx
  • i32x4.extadd_pairwise_i16x8_u
    • y = i32x4.extadd_pairwise_i16x8_u(x) is lowered to VPADDL.U16 Qy, Qx
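On ARM every form is a single instruction, and NEON exposes the same operations directly as intrinsics; a sketch for completeness (the wrapper names are illustrative):

```c
#include <arm_neon.h>

/* Each wrapper compiles to one SADDLP/UADDLP (ARM64) or VPADDL (ARMv7). */
int16x8_t  extadd_pairwise_i8x16_s(int8x16_t x)  { return vpaddlq_s8(x);  }
uint16x8_t extadd_pairwise_i8x16_u(uint8x16_t x) { return vpaddlq_u8(x);  }
int32x4_t  extadd_pairwise_i16x8_s(int16x8_t x)  { return vpaddlq_s16(x); }
uint32x4_t extadd_pairwise_i16x8_u(uint16x8_t x) { return vpaddlq_u16(x); }
```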

@omnisip

omnisip commented Oct 13, 2020

@Maratyszcza You beat me to it! :-) I was just about finished with my proposal on this. ;-)

@omnisip

omnisip commented Oct 13, 2020

@Maratyszcza
This might be a little bit more performant for the i16x8->i32x4 variant. It's also going to work on SSE2.

Unsigned:

        movdqa  xmm_x, xmm0                              ; copy the input (xmm0 is preserved)
        movdqa  xmm_tmp, [wasm_i32x4_splat(0x0000ffff)]  ; mask for the low 16-bit halves
        pand    xmm_tmp, xmm_x                           ; zero-extended low halves
        psrld   xmm_x, 16                                ; zero-extended high halves
        paddd   xmm_x, xmm_tmp                           ; pairwise sums

Signed:

        movdqa  xmm_tmp, xmm0       ; copy the input (xmm0 is preserved)
        movdqa  xmm_out, xmm0       ; second copy for the high halves
        pslld   xmm_tmp, 16         ; move low halves into the high position
        psrad   xmm_tmp, 16         ; sign-extended low halves
        psrad   xmm_out, 16         ; sign-extended high halves
        paddd   xmm_out, xmm_tmp    ; pairwise sums

@Maratyszcza
Contributor Author

@omnisip Your snippets destroy the input x value; WAsm engines are not allowed to do this in general.

@omnisip

omnisip commented Oct 13, 2020

@omnisip Your snippets destroy the input x value; WAsm engines are not allowed to do this in general.

Got it. I corrected the examples above and added a movdqa to prevent clobbering an input register.

@ngzhian
Member

ngzhian commented Nov 10, 2020

https://crrev.com/c/2513872 prototypes this for arm64.

pull bot pushed a commit to wenyuzhao/v8 that referenced this pull request Dec 10, 2020
Add new macro-assembler instructions that can handle both AVX and SSE.
In the SSE case it checks that dst == src1. (This is different from what
the AvxHelper does, which passes dst as the first operand to AVX
instructions.)

Sorted SSSE3_INSTRUCTION_LIST by instruction code.

Header additions were added by clangd; we were already using something
from those headers via transitive includes, and adding them explicitly
gets us closer to IWYU.

Codegen sequences are from WebAssembly/simd#380
and also
WebAssembly/simd#380 (comment).

Bug: v8:11086
Change-Id: I4c04f836e471ed8b00f9ff1a1b2e6348a593d4de
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2578797
Commit-Queue: Zhi An Ng <zhin@chromium.org>
Reviewed-by: Bill Budge <bbudge@chromium.org>
Cr-Commit-Position: refs/heads/master@{#71688}
tlively added a commit to llvm/llvm-project that referenced this pull request Dec 28, 2020
As proposed in WebAssembly/simd#380. This commit makes
the new instructions available only via clang builtins and LLVM intrinsics to
make their use opt-in while they are still being evaluated for inclusion in the
SIMD proposal.

Depends on D93771.

Differential Revision: https://reviews.llvm.org/D93775
tlively added a commit to tlively/binaryen that referenced this pull request Jan 5, 2021
As proposed in WebAssembly/simd#380, using the opcodes
used in LLVM and V8. Since these opcodes overlap with the opcodes of
i64x2.all_true and i64x2.any_true, which have long since been removed from the
SIMD proposal, this PR also removes those instructions.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Jan 6, 2021
As proposed in WebAssembly/simd#380, using the opcodes
used in LLVM and V8. Since these opcodes overlap with the opcodes of
i64x2.all_true and i64x2.any_true, which have long since been removed from the
SIMD proposal, this PR also removes those instructions.
@Maratyszcza
Contributor Author

I evaluated this proposal on the O(N^3) part of the 3-Point Correlation computation, described here. The use of extended pairwise addition instructions is guarded by the USE_EXTPADD macro. Results on ARM64 systems are presented below:

| Processor (Device) | USE_EXTPADD=0 | USE_EXTPADD=1 | Speedup |
|---|---|---|---|
| Snapdragon 670 (Pixel 3a) | 188 us | 164 us | 15% |
| Exynos 8895 (Galaxy S8) | 125 us | 114 us | 10% |

As evidenced by the above results, even though the algorithm is not particularly rich in pairwise summations, the extended pairwise addition instructions bring a noticeable speedup.

@Maratyszcza
Contributor Author

Results on x86-64 systems are presented below:

| Processor | USE_EXTPADD=0 | USE_EXTPADD=1 | Speedup |
|---|---|---|---|
| AMD Pro A10-8700B | 93 us | 93 us | 0% |
| AMD A4-7210 | 303 us | 300 us | 1% |
| Intel Xeon W-2135 | 66 us | 66 us | 0% |
| Intel Celeron N3060 | 396 us | 396 us | 0% |

Unlike ARM64, x86-64 results demonstrate mostly unchanged performance. However, the x86-64 performance is hindered by suboptimal loading of constants in V8, and could be expected to improve in the future.

@penzn
Contributor

penzn commented Jan 20, 2021

I have concerns regarding both the use cases and the speedup. From my point of view, needing to implement code from an unrelated paper raises some questions about the feasibility of the use cases listed above. Secondly, in addition to being flat on one platform, this shows low double-digit millisecond gains on the other, while the whole run is well under half a second on both; I don't think this would make a noticeable difference in production (also, with short runs like this, I am not sure we can make accurate measurements). I think we ought to have somewhat higher standards for inclusion of instructions this late in the game.

@Maratyszcza
Contributor Author

Maratyszcza commented Jan 21, 2021

@penzn I disagree that N-point correlation is not a suitable workload for evaluation, but nonetheless I did another experiment within XNNPACK to evaluate these instructions. In the experiment I added two new variants of 8-bit fixed-point GEMM/IGEMM microkernels: one using Extended Multiplication instructions in combination with the proposed Extended Pairwise Addition instructions, and another using Extended Multiplication instructions in combination with the widening and addition instructions (not equivalent to extended pairwise addition in general, but it works in this particular case and is cheaper than emulating the extended pairwise addition operation). In these microkernels the extended pairwise addition instructions are in the hot loop, and their effect on end-to-end performance is more pronounced than in the N-point correlation kernels. Results for end-to-end inference performance are presented in the table below; a sketch of the accumulation pattern follows it.

| Processor (Device) | NN Model | Time with widen+add | Time with extended pairwise addition | Speedup |
|---|---|---|---|---|
| AMD PRO A10-8700B | MN v1 | 130 ms | 84 ms | 55% |
| AMD PRO A10-8700B | MN v2 | 80 ms | 56 ms | 43% |
| AMD A4-7210 | MN v1 | 323 ms | 232 ms | 39% |
| AMD A4-7210 | MN v2 | 203 ms | 155 ms | 31% |
| Intel Xeon W-2135 | MN v1 | 91 ms | 46 ms | 98% |
| Intel Xeon W-2135 | MN v2 | 55 ms | 31 ms | 77% |
| Intel Celeron N3060 | MN v1 | 420 ms | 431 ms | -3% |
| Intel Celeron N3060 | MN v2 | 266 ms | 267 ms | 0% |
| Qualcomm Snapdragon 670 (Pixel 3a) | MN v1 | 203 ms | 137 ms | 48% |
| Qualcomm Snapdragon 670 (Pixel 3a) | MN v2 | 135 ms | 101 ms | 34% |
| Samsung Exynos 8895 (Galaxy S8) | MN v1 | 224 ms | 116 ms | 93% |
| Samsung Exynos 8895 (Galaxy S8) | MN v2 | 145 ms | 89 ms | 63% |
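For illustration, the inner accumulation of such a microkernel looks roughly like the sketch below. This is not XNNPACK's actual code, and it assumes the wasm_simd128.h intrinsic names that clang later shipped for these instructions:

```c
#include <wasm_simd128.h>

/* One step of an 8-bit GEMM dot-product accumulation: widen-multiply
   8-bit lanes to 16-bit products, then fold adjacent products into the
   32-bit accumulator with extended pairwise adds. */
v128_t accumulate_i8_dot(v128_t acc, v128_t a, v128_t b) {
  v128_t prod_lo = wasm_i16x8_extmul_low_i8x16(a, b);
  v128_t prod_hi = wasm_i16x8_extmul_high_i8x16(a, b);
  acc = wasm_i32x4_add(acc, wasm_i32x4_extadd_pairwise_i16x8(prod_lo));
  acc = wasm_i32x4_add(acc, wasm_i32x4_extadd_pairwise_i16x8(prod_hi));
  return acc;
}
```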

@Maratyszcza
Contributor Author

The source code changes for the last experiment can be seen here.

@penzn
Contributor

penzn commented Jan 22, 2021

@penzn I disagree that N-point correlation is not a suitable workload for evaluation

I don't think I ever said that - it is a fine test; what I meant is that we were using something different from the use cases listed in the PR description. XNNPACK is a much better example - thank you for that!

@dtig
Member

dtig commented Jan 25, 2021

Adding a preliminary vote for the inclusion of extended pairwise addition operations to the SIMD proposal below. Please vote with -

👍 For including extended pairwise addition operations
👎 Against including extended pairwise addition operations

@fbarchard

In native ARM I've used paddl and padal a lot. They're very fast and useful.
On Intel I've used the trick of vpmaddubsw with constants of 1 to achieve the same. The instruction is very fast, even with the redundant multiply.
Ideally we'd have the accumulate version, but paddl is still useful, followed by an add.
An example of using it for an image is a 2x2 downsample: paddl to add values horizontally in pairs, then add the results and rounding-shift by 2 to get the average of the 4 pixels (see the sketch below).
It can also be used to accumulate values, such as popcnts.
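A hedged sketch of that 2x2 box filter, assuming the wasm_simd128.h intrinsic names clang later shipped (narrowing the 16-bit averages back to 8 bits is omitted):

```c
#include <wasm_simd128.h>

/* Average 2x2 pixel blocks: pairwise-add each row horizontally, add the
   two rows, then divide by 4 with rounding. row0/row1 hold 16 u8 pixels
   each; the result holds 8 u16 averages. */
v128_t average_2x2(v128_t row0, v128_t row1) {
  v128_t sum0 = wasm_u16x8_extadd_pairwise_u8x16(row0);
  v128_t sum1 = wasm_u16x8_extadd_pairwise_u8x16(row1);
  v128_t sum  = wasm_i16x8_add(sum0, sum1);        /* max 4*255, fits u16 */
  sum = wasm_i16x8_add(sum, wasm_i16x8_splat(2));  /* rounding bias */
  return wasm_u16x8_shr(sum, 2);                   /* >> 2 == /4 */
}
```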

@penzn
Contributor

penzn commented Jan 25, 2021

I am not 100% convinced - there are drops in performance on some platforms, plus these are, again, short microkernels.

@Maratyszcza
Contributor Author

plus these are, again, short microkernels.

The XNNPACK results are end-to-end

ngzhian added the 2021-01-29 Agenda for sync meeting 1/29/21 label Jan 26, 2021
@penzn
Contributor

penzn commented Jan 26, 2021

The XNNPACK results are end-to-end

@Maratyszcza sorry, "microkernels" threw me off; you did mention they were end-to-end.

@Maratyszcza
Contributor Author

Maratyszcza commented Jan 27, 2021

Updated end-to-end results in XNNPACK on x86-64 to reflect the recent optimizations in V8.

@ngzhian
Member

ngzhian commented Jan 27, 2021

The XNNPACK benchmark only uses __builtin_wasm_extadd_pairwise_i16x8_s_i32x4; I think we can probably expect similar (good) speedups for the other 3 instructions. I see examples of the other 3 instructions in the listed use cases too.

Side note: the naming of this intrinsic seems off? Should it be __builtin_wasm_extadd_pairwise_i32x4_i16x8_s instead? (dst shape first).

@Maratyszcza
Contributor Author

Maratyszcza commented Jan 27, 2021

The XNNPACK benchmark only uses __builtin_wasm_extadd_pairwise_i16x8_s_i32x4; I think we can probably expect similar (good) speedups for the other 3 instructions. I see examples of the other 3 instructions in the listed use cases too.

The 3-Point Correlation experiment used __builtin_wasm_extadd_pairwise_i8x16_u_i16x8 and __builtin_wasm_extadd_pairwise_i16x8_u_i32x4.

Side note: the naming of this intrinsic seems off? Should it be __builtin_wasm_extadd_pairwise_i32x4_i16x8_s instead? (dst shape first).

cc @tlively

@tlively
Member

tlively commented Jan 27, 2021

Don't worry about that name. The __builtin_wasm_* functions are not intended to be user-facing and follow their own weird naming conventions. We will add properly named user-facing intrinsics eventually if we approve these instructions.

dtig removed the 2021-01-29 Agenda for sync meeting 1/29/21 label Feb 2, 2021
tlively merged commit b6ca6b2 into WebAssembly:master Feb 4, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 18, 2021
These 4 instructions:

- i32x4.extadd_pairwise_i16x8_s
- i32x4.extadd_pairwise_i16x8_u
- i16x8.extadd_pairwise_i8x16_s
- i16x8.extadd_pairwise_i8x16_u

were merged in WebAssembly#380.

Drive-by cleanup to meta/README.md to list all generated files.
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 19, 2021
ngzhian added a commit that referenced this pull request Feb 23, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 23, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 24, 2021
ngzhian added a commit that referenced this pull request Feb 24, 2021