Description
std::bit_ceil emits quite a bit of assembly on x64. This seems to be mostly due to the branch in _Checked_x86_x64_countl_zero and, to a lesser extent, the branch in bit_ceil itself.
I've written a variant which produces a more compact result: https://godbolt.org/z/q4EEz83aW
(It also removes the extra branch on ARM64 by using conditional assignments.)
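For readers who can't open the link, here is a minimal sketch of what a branch-free bit_ceil can look like for 32-bit values. It is not the variant linked above: it swaps in plain bit smearing for bsr/lzcnt so that it stays portable, and the name bit_ceil_sketch is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only; not part of the STL and not
// the code from the godbolt link. Rounds x up to the next power of two.
// Like std::bit_ceil, the result is only meaningful while it fits in 32
// bits: for x > 0x80000000 this sketch wraps around to 0.
inline std::uint32_t bit_ceil_sketch(std::uint32_t x) noexcept {
    // Compute x - 1 without a branch, leaving 0 as 0 so that both
    // x == 0 and x == 1 end up returning 1.
    std::uint32_t v = x - static_cast<std::uint32_t>(x != 0);

    // Copy the highest set bit into every lower bit position.
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    // v is now 2^k - 1, so adding 1 yields the power of two.
    return v + 1;
}
```

The comparison x != 0 typically compiles to a flag-setting instruction rather than a jump, so the function needs no entry in the branch target buffer. The trade-off is a longer dependency chain than a single lzcnt.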
I've checked the PR that introduced the code (#795) and it appears as if the cost of this if condition wasn't discussed. The if condition generally makes sense though: bsr is costly on AMD CPUs (up to Zen3, 4 cycles/op latency), whereas lzcnt is very fast on any architecture (<= 1 cycle).
However, the check takes up 3 slots in the CPU's branch target buffer (contemporary hardware has ~4096 slots, and newly added branches incur an extra 5-20 cycles of latency) and generates larger binaries after inlining. Unfortunately, the added instructions also seem to add about 5 cycles of latency themselves, which cancels out the savings from avoiding bsr.
This makes me wonder: Should we drop the __isa_available check?