<bit>: Is the __isa_available check for lzcnt worth the cost? #2849

@lhecker

std::bit_ceil emits quite a bit of assembly on x64. This seems to be caused mostly by the branch in _Checked_x86_x64_countl_zero and, to a lesser extent, by the branch in bit_ceil itself.

I've written a variant which produces a more compact result: https://godbolt.org/z/q4EEz83aW
(It also removes the extra branch on ARM64 by using conditional assignments.)

I've checked the PR that introduced the code (#795), and the cost of this if condition doesn't appear to have been discussed there. The condition generally makes sense, though: bsr is costly on AMD CPUs (up to Zen3, 4 cycles/op latency), whereas lzcnt is very fast on any architecture (<= 1 cycle).
However, the check takes up 3 slots in the CPU's branch target buffer (contemporary hardware has ~4096 slots, and newly added branches incur an extra 5-20 cycles of latency), generates larger binaries after inlining, and unfortunately the added instructions themselves seem to add about 5 cycles of latency, which cancels out the savings from avoiding bsr.
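
To make the trade-off concrete, the following sketch shows the general shape of a runtime-dispatched count-leading-zeros. It is a portable illustration, not the STL's source: `g_has_lzcnt_sketch` stands in for a CPU feature flag such as __isa_available, and `lzcnt_emulated` and `bsr_emulated` are hypothetical software stand-ins for the two instructions.

```cpp
#include <cstdint>

// Hypothetical stand-in for a CPU feature flag that is set once at startup.
inline bool g_has_lzcnt_sketch = false;

// Emulates bsr: index of the highest set bit. The result is only
// meaningful for v != 0, mirroring the real instruction.
constexpr unsigned bsr_emulated(std::uint32_t v) noexcept {
    unsigned index = 0;
    while (v >>= 1) {
        ++index;
    }
    return index;
}

// Emulates lzcnt: number of leading zero bits, defined as 32 for v == 0.
constexpr unsigned lzcnt_emulated(std::uint32_t v) noexcept {
    unsigned count = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (v & mask) == 0; mask >>= 1) {
        ++count;
    }
    return count;
}

// Dispatching wrapper: every inlined call site carries these branches.
inline unsigned countl_zero_dispatch_sketch(std::uint32_t v) noexcept {
    if (g_has_lzcnt_sketch) {
        return lzcnt_emulated(v); // fast path, no zero check needed
    }
    if (v == 0) {
        return 32; // bsr has no defined result for zero
    }
    return 31 - bsr_emulated(v); // convert bit index to leading-zero count
}
```

Both paths return the same values, so the feature check only buys speed on CPUs where bsr is slow, while its branches are paid for at every call site.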

This makes me wonder: Should we drop the __isa_available check?
