
Cranelift aarch64 backend: implement 8/16/32-bit popcnt more efficiently #1537

Closed
cfallin opened this issue Apr 17, 2020 · 5 comments · Fixed by #2589
Labels: bug, cranelift:area:aarch64, cranelift

Comments


cfallin commented Apr 17, 2020

We currently use an instruction sequence intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32-bit operands (and for 8/16-bit operands extended to 32 bits), but has a slight issue. We should rework the lowering to fix this issue and remove the need to zero-extend 32-bit operands.
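
For reference, here is a rough Rust model of what the current lowering computes: a SWAR ("SIMD within a register") shift/mask/add sequence over 64 bits, with narrower operands zero-extended first. This is only a sketch of the algorithm, mirroring the shift/add form shown later in this thread; `popcnt64_swar` and `popcnt32_via_zext` are illustrative names, not the actual backend code.

  // Sketch of the SWAR popcount that the 64-bit lowering implements.
  fn popcnt64_swar(x: u64) -> u64 {
      let x = x - ((x >> 1) & 0x5555_5555_5555_5555); // per-2-bit counts
      let x = (x & 0x3333_3333_3333_3333)
          + ((x >> 2) & 0x3333_3333_3333_3333);       // per-4-bit counts
      let x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f; // per-byte counts
      let x = x + (x << 8);                           // accumulate bytes...
      let x = x + (x << 16);
      let x = x + (x << 32);
      x >> 56                                         // ...total in the top byte
  }

  // The quick fix: zero-extend a narrow operand, then reuse the 64-bit sequence.
  fn popcnt32_via_zext(x: u32) -> u64 {
      popcnt64_swar(x as u64)
  }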

cfallin added a commit to cfallin/wasmtime that referenced this issue Apr 17, 2020
…cnt instructions.

Includes a temporary bugfix for popcnt with 32-bit operand. The popcnt
issue was initially identified by Benjamin Bouvier <public@benj.me>, and
the root cause was debugged by Joey Gouly <joey.gouly@arm.com>. This
patch is simply a quick fix that zero-extends the operand to 64 bits;
Joey plans to contribute a more permanent fix shortly (tracked in bytecodealliance#1537).
@bnjbvr added the bug and cranelift labels Apr 17, 2020

@cfallin added the cranelift:area:aarch64 label Apr 18, 2020

bnjbvr commented Apr 21, 2020

For what it's worth, the x86 legalization used a slightly different sequence that ought to be reusable here. Would it make sense to make it a target-independent legalization and use it in simple_legalize? (Doing it at the CLIF level would allow applying optimizations to the constants; in particular, GVN could common out the constants instead of regenerating them each time the popcnt instruction is used.)
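
Sketched as plain Rust rather than actual Cranelift legalization code, the 32-bit expansion would look roughly like this; the point is that the masks become ordinary constants in the IR, so GVN can materialize each one once per function instead of once per popcnt:

  // Hypothetical shape of a target-independent popcnt expansion
  // (illustrative only; not Cranelift's legalization API).
  fn popcnt32_expanded(x: u32) -> u32 {
      const M1: u32 = 0x5555_5555; // with GVN, each mask would become one
      const M2: u32 = 0x3333_3333; // shared iconst rather than being
      const M4: u32 = 0x0f0f_0f0f; // rematerialized at every popcnt site
      let x = x - ((x >> 1) & M1);
      let x = (x & M2) + ((x >> 2) & M2);
      let x = (x + (x >> 4)) & M4;
      x.wrapping_mul(0x0101_0101) >> 24
  }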

@akirilov-arm
Contributor

Note that the proper implementation for 32- and 64-bit operands may use the Neon CNT instruction (for example, this is what GCC and LLVM do), so a target-independent legalization may be counterproductive. However, I am not sure about 8- and 16-bit operands.
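
For illustration, the CNT-based sequence can be expressed with Rust's AArch64 Neon intrinsics; this is a sketch of the GCC/LLVM-style lowering, not Cranelift code:

  #[cfg(target_arch = "aarch64")]
  fn popcnt64_neon(x: u64) -> u32 {
      use core::arch::aarch64::{vaddv_u8, vcnt_u8, vcreate_u8};
      // SAFETY: these Neon intrinsics are available on all AArch64 targets.
      unsafe {
          let counts = vcnt_u8(vcreate_u8(x)); // cnt v.8b, v.8b: per-byte popcount
          vaddv_u8(counts) as u32              // addv b, v.8b: horizontal byte sum
      }
  }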


akirilov-arm commented Jan 19, 2021

I decided to compare the approach using the Neon CNT instruction (which I'll call "vector") against the one that uses only operations on general-purpose registers, i.e. the existing implementation in Cranelift (which I'll call "scalar"), so I wrote several microbenchmarks.
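
The harness itself isn't included here; as a rough sketch of the methodology, a latency test chains each result into the next input, while a throughput test keeps iterations independent so the core can overlap them. The code below is hypothetical and uses Rust's built-in count_ones as a stand-in for the measured sequences.

  use std::hint::black_box;
  use std::time::{Duration, Instant};

  // Latency: every popcount depends on the previous result.
  fn bench_latency(iters: u64) -> Duration {
      let mut x = black_box(0xdead_beef_u64);
      let start = Instant::now();
      for _ in 0..iters {
          x ^= x.count_ones() as u64; // dependency chain through x
      }
      black_box(x);
      start.elapsed()
  }

  // Throughput: the popcounts are independent of one another.
  fn bench_throughput(iters: u64) -> Duration {
      let x = black_box(0xdead_beef_u64);
      let mut acc = 0u64;
      let start = Instant::now();
      for i in 0..iters {
          acc = acc.wrapping_add((x ^ i).count_ones() as u64);
      }
      black_box(acc);
      start.elapsed()
  }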

Here's the scalar implementation for 8-bit operands:

  lsr wd, ws, #1                // x >> 1
  and wd, wd, #0x55555555       // (x >> 1) & 0x5555...
  sub wd, ws, wd                // per-2-bit counts
  and wt, wd, #0x33333333       // even 2-bit groups
  lsr wd, wd, #2
  and wd, wd, #0x33333333       // odd 2-bit groups
  add wd, wd, wt                // per-4-bit counts
  add wd, wd, wd, lsr #4        // per-byte counts
  and wd, wd, #0xf              // result in the low nibble

And the vector one:

  fmov st, ws                   // move the operand into a SIMD register
  cnt vt.8b, vt.8b              // per-byte popcount
  umov wd, vt.b[0]              // extract the count from lane 0

16-bit scalar:

  lsr wd, ws, #1
  and wd, wd, #0x55555555
  sub wd, ws, wd
  and wt, wd, #0x33333333
  lsr wd, wd, #2
  and wd, wd, #0x33333333
  add wd, wd, wt
  add wd, wd, wd, lsr #4
  and wd, wd, #0x0f0f0f0f
  add wd, wd, wd, lsr #8
  and wd, wd, #0xff

16-bit vector:

  fmov st, ws
  cnt vt.8b, vt.8b
  addp vt.8b, vt.8b, vt.8b
  umov wd, vt.b[0]

32-bit scalar:

  lsr wd, ws, #1
  and wd, wd, #0x55555555
  sub wd, ws, wd
  and wt, wd, #0x33333333
  lsr wd, wd, #2
  and wd, wd, #0x33333333
  add wd, wd, wt
  add wd, wd, wd, lsr #4
  and wd, wd, #0x0f0f0f0f
  mov wt, #0x01010101
  mul wd, wd, wt
  lsr wd, wd, #24

32-bit vector (almost the same as the 64-bit one - only the first instruction differs):

  fmov st, ws
  cnt vt.8b, vt.8b
  addv bt, vt.8b
  umov wd, vt.b[0]

For the 64-bit case I tried two scalar variants, one that doesn't use multiplication and one that does; here's the first one:

  lsr xd, xs, #1
  and xd, xd, #0x5555555555555555
  sub xd, xs, xd
  and xt, xd, #0x3333333333333333
  lsr xd, xd, #2
  and xd, xd, #0x3333333333333333
  add xt, xd, xt
  add xt, xt, xt, lsr #4
  and xt, xt, #0x0f0f0f0f0f0f0f0f
  add xt, xt, xt, lsl #8
  add xt, xt, xt, lsl #16
  add xt, xt, xt, lsl #32
  lsr xd, xt, #56

And the second one:

  lsr xd, xs, #1
  and xd, xd, #0x5555555555555555
  sub xd, xs, xd
  and xt, xd, #0x3333333333333333
  lsr xd, xd, #2
  and xd, xd, #0x3333333333333333
  add xt, xd, xt
  add xt, xt, xt, lsr #4
  and xt, xt, #0x0f0f0f0f0f0f0f0f
  mov xd, #0x0101010101010101
  mul xd, xd, xt
  lsr xd, xd, #56

64-bit vector:

  fmov dt, xs
  cnt vt.8b, vt.8b
  addv bt, vt.8b
  umov wd, vt.b[0]

Here are the results for various microarchitectures in terms of the speedup achieved by the vector implementation; values above 100% mean the vector version is faster:

| CPU core | 8-bit latency | 8-bit throughput | 16-bit latency | 16-bit throughput | 32-bit latency | 32-bit throughput | 64-bit latency | 64-bit throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |---|
| Arm Cortex-A55 | 252.84% | 276.55% | 224.39% | 241.52% | 256.10% | 282.28% | 293.79% | 321.77% |
| Arm Neoverse N1 | 132.56% | 348.84% | 135.26% | 243.04% | 96.96% | 197.12% | 123.08% | 213.46% |

In addition, here's how the alternative 64-bit scalar implementation (using multiplication) performs relative to the existing scalar one:

| CPU core | Latency speedup | Throughput speedup |
| --- | --- | --- |
| Arm Cortex-A55 | 111.81% | 109.10% |
| Arm Neoverse N1 | 112.60% | 110.45% |

The results are based on median runtimes from 20 runs; standard errors were 0.02% or less.

The data demonstrates quite clearly that the vector variant is better than the scalar one; the only regression is the 32-bit latency result on the Neoverse N1 (96.96%). While the alternative 64-bit scalar implementation is a definite improvement over the existing one, it still lags behind the vector version.


cfallin commented Jan 19, 2021

Thanks very much for this very detailed benchmarking!
