Skip to content

Destination register for x86 bsf and bsr should always be initialized #129659

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
CatsAreFluffy opened this issue Mar 4, 2025 · 6 comments
Open

Comments

@CatsAreFluffy
Copy link

The x86 instructions bsf and bsr (which are used to implement countl_zero and countr_zero respectively on targets not supporting BMI1) have a dependency on their output register. Currently, LLVM does not do anything to break this dependency, which can lead to strange slowdowns depending on the whims of register allocation. For example, in this Rust program, on the baseline target "New isqrt" is much slower than "Stdlib isqrt", while on a modern target "New isqrt" is much faster than "Stdlib isqrt", since on the baseline target "New isqrt" has a damaging bsr-induced dependency. This dependency should be broken by initializing the output register before bsr or bsf. (LLVM already does this on 64-bit targets when the argument is not known to be nonzero as of #123623, but the dependency still exists even in other cases.)

@llvmbot
Copy link
Member

llvmbot commented Mar 4, 2025

@llvm/issue-subscribers-backend-x86

Author: None (CatsAreFluffy)

The x86 instructions `bsf` and `bsr` (which are used to implement `countl_zero` and `countr_zero` respectively on targets not supporting BMI1) have a dependency on their output register. Currently, LLVM does not do anything to break this dependency, which can lead to strange slowdowns depending on the whims of register allocation. For example, in [this Rust program](https://godbolt.org/z/45axenMjd), on the baseline target "New isqrt" is much slower than "Stdlib isqrt", while on a modern target "New isqrt" is much faster than "Stdlib isqrt", since on the baseline target "New isqrt" has a damaging `bsr`-induced dependency. This dependency should be broken by initializing the output register before `bsr` or `bsf`. (LLVM already does this on 64-bit targets when the argument is not known to be nonzero as of #123623, but the dependency still exists even in other cases.)

@RKSimon
Copy link
Collaborator

RKSimon commented Mar 4, 2025

Are there CPUs that are affected worse by this?

@tgross35
Copy link
Contributor

tgross35 commented Mar 5, 2025

Are there CPUs that are affected worse by this?

rust-lang/rust#137786 (comment) has one such example. The missed optimization described here may not account for the entire difference between the two benchmarks though.

@nikic
Copy link
Contributor

nikic commented Mar 20, 2025

We generate rep bsf (aka tzcnt) instead of bsf for trailing zeroes to avoid this dependency. Is there any reason why we don't use rep bsr (aka lzcnt) for the leading zeroes case? I see GCC doesn't do it in the bsr case either, so maybe there's some difference between these instructions I'm missing.

@phoebewang
Copy link
Contributor

Yes, bsr and lzcnt are totally different. When SRC is 0, BSR set ZF, while LZCNT set CF. When SRC is non-0, BSR return the index of MSB, which LZCNT return OperandSize - Index.

@nikic
Copy link
Contributor

nikic commented Mar 20, 2025

Thanks, I didn't realize that the return value between bsr and lzcnt is inverted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants