Cranelift aarch64 backend: implement 8/16/32-bit popcnt more efficiently #1537
…cnt instructions. Includes a temporary bugfix for popcnt with 32-bit operand. The popcnt issue was initially identified by Benjamin Bouvier <public@benj.me>, and the root cause was debugged by Joey Gouly <joey.gouly@arm.com>. This patch is simply a quick fix that zero-extends the operand to 64 bits; Joey plans to contribute a more permanent fix shortly (tracked in bytecodealliance#1537).
For what it's worth, the |
Note that the proper implementation for 32- and 64-bit operands may use the Neon |
I decided to compare the approach using the Neon […]
Here's the scalar implementation for 8-bit operands:
And the vector one:
16-bit scalar:
16-bit vector:
32-bit scalar:
32-bit vector (almost the same as the 64-bit one - only the first instruction differs):
For the 64-bit case I tried 2 scalar variants - one that doesn't use multiplication and another one that does; here's the first one:
And the second one:
64-bit vector:
Here are the results for various microarchitectures in terms of speedup achieved by the vector implementation, i.e. higher numbers mean that the latter is better:
In addition, here's how the alternative 64-bit scalar implementation (using multiplication) performs:
The results are based on median runtimes from 20 runs; standard errors were 0.02% or less. The data demonstrates quite clearly that the vector variant is better than the scalar one; the only regression is in the 32-bit latency value. While the alternative 64-bit scalar implementation is a definite improvement, it still lags behind the vector one.
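For readers unfamiliar with the two scalar strategies compared above, here is a minimal sketch in Rust of a bit-parallel 64-bit population count, once with a shift-and-add final fold and once with a multiplication. This is only an illustration of the general technique, not the AArch64 instruction sequences that were benchmarked; the function names are hypothetical and do not come from Cranelift.

```rust
/// Shared first stage: after this, every byte of the result holds the
/// number of set bits (0..=8) of the corresponding input byte.
/// (Hypothetical helper, for illustration only.)
fn per_byte_counts(mut x: u64) -> u64 {
    // Count bits in each 2-bit field.
    x = x - ((x >> 1) & 0x5555_5555_5555_5555);
    // Sum adjacent 2-bit fields into 4-bit fields.
    x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    // Sum adjacent 4-bit fields into bytes.
    (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f
}

/// Variant without multiplication: fold the byte counts together
/// using shifts and adds.
fn popcnt64_shift_add(x: u64) -> u32 {
    let mut x = per_byte_counts(x);
    x += x >> 8;
    x += x >> 16;
    x += x >> 32;
    // The total (at most 64) is now in the lowest byte.
    (x & 0x7f) as u32
}

/// Variant with multiplication: multiplying by 0x0101...01 accumulates
/// the sum of all eight bytes in the top byte.
fn popcnt64_mul(x: u64) -> u32 {
    (per_byte_counts(x).wrapping_mul(0x0101_0101_0101_0101) >> 56) as u32
}
```

Both variants should agree with `u64::count_ones` for every input; the multiply version trades three dependent shift/add pairs for a single multiply and shift.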
Thanks very much for this very detailed benchmarking!
We currently use a sequence of instructions intended for 64-bit operation, and zero-extend narrower inputs. The original implementation by @jgouly almost works for 32 bits (and 8/16 bits extended to 32 bits), with a slight issue. We should rework the lowering to fix this issue and remove the need for zero-extending 32-bit operands.
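To illustrate why narrower inputs are zero-extended before a 64-bit sequence is applied, here is a small sketch in Rust. It assumes that the upper bits of a register holding a narrow value may contain arbitrary data; `popcnt_narrow` is a hypothetical helper written for this example, not Cranelift's lowering code.

```rust
/// Population count of the low `bits` bits of `reg`, computed with a
/// 64-bit count after zero-extension. (Hypothetical helper.)
fn popcnt_narrow(reg: u64, bits: u32) -> u32 {
    debug_assert!(matches!(bits, 8 | 16 | 32 | 64));
    // Zero-extend: clear everything above the operand width so stale
    // upper bits do not contribute to the count.
    let mask = if bits == 64 { u64::MAX } else { (1u64 << bits) - 1 };
    (reg & mask).count_ones()
}
```

For example, a register containing `0xffff_ffff_0000_00ff` has 8 set bits when treated as a 32-bit operand, but a 64-bit count without the masking step would report 40.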