
Cranelift aarch64 backend: implement 8/16/32-bit popcnt more efficiently #1537

Closed
cfallin opened this issue Apr 17, 2020 · 5 comments · Fixed by #2589
Labels: bug, cranelift:area:aarch64, cranelift

Comments


cfallin commented Apr 17, 2020

We currently use an instruction sequence intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32-bit operands (and for 8/16-bit operands extended to 32 bits), but has a slight issue. We should rework the lowering to fix this issue and remove the need to zero-extend 32-bit operands.
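
For reference, here is a rough Rust model of what the current lowering computes: a SWAR ("SIMD within a register") shift/mask/add sequence over 64 bits, with narrower operands zero-extended first. This is only a sketch of the algorithm, mirroring the shift/add form shown later in this thread; `popcnt64_swar` and `popcnt32_via_zext` are illustrative names, not the actual backend code.

  // Sketch of the SWAR popcount that the 64-bit lowering implements.
  fn popcnt64_swar(x: u64) -> u64 {
      let x = x - ((x >> 1) & 0x5555_5555_5555_5555); // per-2-bit counts
      let x = (x & 0x3333_3333_3333_3333)
          + ((x >> 2) & 0x3333_3333_3333_3333);       // per-4-bit counts
      let x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f; // per-byte counts
      let x = x + (x << 8);                           // accumulate bytes...
      let x = x + (x << 16);
      let x = x + (x << 32);
      x >> 56                                         // ...total in the top byte
  }

  // The quick fix: zero-extend a narrow operand, then reuse the 64-bit sequence.
  fn popcnt32_via_zext(x: u32) -> u64 {
      popcnt64_swar(x as u64)
  }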

cfallin added a commit to cfallin/wasmtime that referenced this issue Apr 17, 2020
…cnt instructions.

Includes a temporary bugfix for popcnt with 32-bit operand. The popcnt
issue was initially identified by Benjamin Bouvier <public@benj.me>, and
the root cause was debugged by Joey Gouly <joey.gouly@arm.com>. This
patch is simply a quick fix that zero-extends the operand to 64 bits;
Joey plans to contribute a more permanent fix shortly (tracked in bytecodealliance#1537).
@bnjbvr added the bug and cranelift labels Apr 17, 2020

@cfallin added the cranelift:area:aarch64 label Apr 18, 2020

bnjbvr commented Apr 21, 2020

For what it's worth, the x86 legalization used a slightly different sequence that ought to be reusable here. Would it make sense to make it a target-independent legalization and use it in simple_legalize? (Doing it at the CLIF level would allow applying optimizations to the constants; in particular, GVN could common out the constants instead of regenerating them each time the popcnt instruction is used.)
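
Sketched as plain Rust rather than actual Cranelift legalization code, the 32-bit expansion would look roughly like this; the point is that the masks become ordinary constants in the IR, so GVN can materialize each one once per function instead of once per popcnt:

  // Hypothetical shape of a target-independent popcnt expansion
  // (illustrative only; not Cranelift's legalization API).
  fn popcnt32_expanded(x: u32) -> u32 {
      const M1: u32 = 0x5555_5555; // with GVN, each mask would become one
      const M2: u32 = 0x3333_3333; // shared iconst rather than being
      const M4: u32 = 0x0f0f_0f0f; // rematerialized at every popcnt site
      let x = x - ((x >> 1) & M1);
      let x = (x & M2) + ((x >> 2) & M2);
      let x = (x + (x >> 4)) & M4;
      x.wrapping_mul(0x0101_0101) >> 24
  }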

@akirilov-arm
Contributor

Note that the proper implementation for 32- and 64-bit operands may use the Neon CNT instruction (for example, this is what GCC and LLVM do), so a target-independent legalization may be counterproductive. However, I am not sure about 8- and 16-bit operands.
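
For illustration, the CNT-based sequence can be expressed with Rust's AArch64 Neon intrinsics; this is a sketch of the GCC/LLVM-style lowering, not Cranelift code:

  #[cfg(target_arch = "aarch64")]
  fn popcnt64_neon(x: u64) -> u32 {
      use core::arch::aarch64::{vaddv_u8, vcnt_u8, vcreate_u8};
      // SAFETY: these Neon intrinsics are available on all AArch64 targets.
      unsafe {
          let counts = vcnt_u8(vcreate_u8(x)); // cnt v.8b, v.8b: per-byte popcount
          vaddv_u8(counts) as u32              // addv b, v.8b: horizontal byte sum
      }
  }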


akirilov-arm commented Jan 19, 2021

I decided to compare the approach using the Neon CNT instruction (which I'll call "vector") against the one that uses only operations on general-purpose registers, i.e. the existing implementation in Cranelift (which I'll call "scalar"), so I wrote several microbenchmarks.
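
The harness itself isn't included here; as a rough sketch of the methodology, a latency test chains each result into the next input, while a throughput test keeps iterations independent so the core can overlap them. The code below is hypothetical and uses Rust's built-in count_ones as a stand-in for the measured sequences.

  use std::hint::black_box;
  use std::time::{Duration, Instant};

  // Latency: every popcount depends on the previous result.
  fn bench_latency(iters: u64) -> Duration {
      let mut x = black_box(0xdead_beef_u64);
      let start = Instant::now();
      for _ in 0..iters {
          x ^= x.count_ones() as u64; // dependency chain through x
      }
      black_box(x);
      start.elapsed()
  }

  // Throughput: the popcounts are independent of one another.
  fn bench_throughput(iters: u64) -> Duration {
      let x = black_box(0xdead_beef_u64);
      let mut acc = 0u64;
      let start = Instant::now();
      for i in 0..iters {
          acc = acc.wrapping_add((x ^ i).count_ones() as u64);
      }
      black_box(acc);
      start.elapsed()
  }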

Here's the scalar implementation for 8-bit operands:

  lsr wd, ws, #1                // x >> 1
  and wd, wd, #0x55555555       // (x >> 1) & 0x5555...
  sub wd, ws, wd                // per-2-bit counts
  and wt, wd, #0x33333333       // even 2-bit groups
  lsr wd, wd, #2
  and wd, wd, #0x33333333       // odd 2-bit groups
  add wd, wd, wt                // per-4-bit counts
  add wd, wd, wd, lsr #4        // per-byte counts
  and wd, wd, #0xf              // result in the low nibble

And the vector one:

  fmov st, ws                   // move the operand into a SIMD register
  cnt vt.8b, vt.8b              // per-byte popcount
  umov wd, vt.b[0]              // extract the count from lane 0

16-bit scalar:

  lsr wd, ws, #1
  and wd, wd, #0x55555555
  sub wd, ws, wd
  and wt, wd, #0x33333333
  lsr wd, wd, #2
  and wd, wd, #0x33333333
  add wd, wd, wt
  add wd, wd, wd, lsr #4
  and wd, wd, #0x0f0f0f0f
  add wd, wd, wd, lsr #8
  and wd, wd, #0xff

16-bit vector:

  fmov st, ws
  cnt vt.8b, vt.8b
  addp vt.8b, vt.8b, vt.8b
  umov wd, vt.b[0]

32-bit scalar:

  lsr wd, ws, #1
  and wd, wd, #0x55555555
  sub wd, ws, wd
  and wt, wd, #0x33333333
  lsr wd, wd, #2
  and wd, wd, #0x33333333
  add wd, wd, wt
  add wd, wd, wd, lsr #4
  and wd, wd, #0x0f0f0f0f
  mov wt, #0x01010101
  mul wd, wd, wt
  lsr wd, wd, #24

32-bit vector (almost the same as the 64-bit one - only the first instruction differs):

  fmov st, ws
  cnt vt.8b, vt.8b
  addv bt, vt.8b
  umov wd, vt.b[0]

For the 64-bit case I tried two scalar variants, one that doesn't use multiplication and one that does; here's the first one:

  lsr xd, xs, #1
  and xd, xd, #0x5555555555555555
  sub xd, xs, xd
  and xt, xd, #0x3333333333333333
  lsr xd, xd, #2
  and xd, xd, #0x3333333333333333
  add xt, xd, xt
  add xt, xt, xt, lsr #4
  and xt, xt, #0x0f0f0f0f0f0f0f0f
  add xt, xt, xt, lsl #8
  add xt, xt, xt, lsl #16
  add xt, xt, xt, lsl #32
  lsr xd, xt, #56

And the second one:

  lsr xd, xs, #1
  and xd, xd, #0x5555555555555555
  sub xd, xs, xd
  and xt, xd, #0x3333333333333333
  lsr xd, xd, #2
  and xd, xd, #0x3333333333333333
  add xt, xd, xt
  add xt, xt, xt, lsr #4
  and xt, xt, #0x0f0f0f0f0f0f0f0f
  mov xd, #0x0101010101010101
  mul xd, xd, xt
  lsr xd, xd, #56

64-bit vector:

  fmov dt, xs
  cnt vt.8b, vt.8b
  addv bt, vt.8b
  umov wd, vt.b[0]

Here are the results for various microarchitectures in terms of the speedup achieved by the vector implementation; values above 100% mean the vector version is faster:

| CPU core | 8-bit latency | 8-bit throughput | 16-bit latency | 16-bit throughput | 32-bit latency | 32-bit throughput | 64-bit latency | 64-bit throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |---|
| Arm Cortex-A55 | 252.84% | 276.55% | 224.39% | 241.52% | 256.10% | 282.28% | 293.79% | 321.77% |
| Arm Neoverse N1 | 132.56% | 348.84% | 135.26% | 243.04% | 96.96% | 197.12% | 123.08% | 213.46% |

In addition, here's how the alternative 64-bit scalar implementation (using multiplication) performs relative to the existing scalar one:

| CPU core | Latency speedup | Throughput speedup |
| --- | --- | --- |
| Arm Cortex-A55 | 111.81% | 109.10% |
| Arm Neoverse N1 | 112.60% | 110.45% |

The results are based on median runtimes from 20 runs; standard errors were 0.02% or less.

The data demonstrates quite clearly that the vector variant is better than the scalar one; the only regression is the 32-bit latency result on the Neoverse N1 (96.96%). While the alternative 64-bit scalar implementation is a definite improvement over the existing one, it still lags behind the vector version.


cfallin commented Jan 19, 2021

Thanks very much for this very detailed benchmarking!
