-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Optimize integer pow
by removing the exit branch
#122884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@bors r+ |
…Amanieu Optimize integer `pow` by removing the exit branch The branch at the end of the `pow` implementations is redundant with multiplication code already present in the loop. By rotating the exit check, this branch can be largely removed, improving code size and reducing instruction cache misses. Testing on my machine (`x86_64`, 11th Gen Intel Core i5-1135G7 @ 2.40GHz), the `num::int_pow` benchmarks improve by some 40% for the unchecked operations and show some slight improvement for the checked operations as well.
This comment has been minimized.
This comment has been minimized.
💔 Test failed - checks-actions |
I'm not familiar with this check, so I don't understand what's failing here and what should the fix be. |
It seems that your PR has introduced a regression: LLVM is no longer able to optimize |
Does this mean the modified code performs worse in this specific case? |
Yes, it will perform much worse in that specific case since LLVM is unable to optimize the loop away. See https://godbolt.org/z/nMY79Gn8r for the current code that is being generated. It might be possible to re-arrange the code so that you still get the performance benefit of this PR while still letting LLVM optimize the loop, but I'm not sure. |
You can see a comparison of the old and new versions here: https://godbolt.org/z/dx3WKxhad |
@bors r- |
@mzabaluev Any updates on this? Thanks! |
The lack of optimization in case of a small const argument value is unfortunate. I briefly tried to salvage it by giving the optimizer an easier time without re-introducing redundancy in the dynamic case, but didn't come up with any good ideas. Maybe an unrolled fast path for argument values in the 0..=6 range? This would feel like an exercise in tricking the optimizer and placating the benchmarks. |
The branch at the end of the `pow` implementations is redundant with multiplication code already present in the loop. By rotating the exit check, this branch can be largely removed, improving code size and instruction cache coherence.
If I understand correctly you don't know how to fix the regression in a satisfactory manner, and you're not going to make the argument that the regression is tolerable? If I'm right you should probably close this. You can always reopen if you get some new inspiration or can find guidance. |
If might be worth trying something with |
The newly optimized loop has introduced a regression in the case when pow is called with a small constant exponent. LLVM is no longer able to unroll the loop and the generated code is larger and slower than what's expected in tests. Match and handle small exponent values separately by branching out to an explicit multiplication sequence for that exponent. Powers larger than 6 need more than three multiplications, so these cases are less likely to benefit from this optimization, also such constant exponents are less likely to be used in practice. For uses with a non-constant exponent, this might also provide a performance benefit if the exponent is small and does not vary between successive calls, so the same match arm tends to be taken as a predicted branch.
I will combine this with my suggestion for the statically known case, thanks for the tip! @oskgo it looks like we've found a way to resolve the regression, don't close this yet. |
I get this error when trying to use
|
It's already enabled in the library. |
That's strange the feature seems to work fine here: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=761301ba758e4ba8d60381021abf908a |
It is what it says on the tin: |
Right, in that case maybe it's best to go back to the version with the unroll loop. |
Oh, I get it: |
In the dynamic exponent case, it's preferred to not increase code size, so use solely the loop-based implementation there. This shows about 4% penalty in the variable exponent benchmarks on x86_64.
010c332
to
2f23534
Compare
pinging @rust-lang/wg-const-eval due to new usage of |
|
library/core/src/num/int_macros.rs
Outdated
// This gives the optimizer a way to efficiently inline call sites | ||
// for the most common use cases with constant exponents. | ||
// Currently, LLVM is unable to unroll the loop below. | ||
match exp { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than this special casing could we instead just have the original loop (which LLVM knows how to unroll) for the is_val_statically_known
case and your new loop for the non-constant case?
And do the same for all the other pow
functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated pow
and wrapped_pow
as suggested.
I'm not sure the extra complication is justified for the checked operations, but I guess the optimizer will have better opportunities with the original loop there as well. I will try to make a macro so that uniform code is used everywhere without repetition.
This comment has been minimized.
This comment has been minimized.
Give LLVM the for original, optimizable loop in pow and wrapped_pow functions in the case when the exponent is statically known.
10db28f
to
ac88b33
Compare
@bors r+ |
…iaskrgr Rollup of 7 pull requests Successful merges: - rust-lang#122884 (Optimize integer `pow` by removing the exit branch) - rust-lang#127857 (Allow to customize `// TODO:` comment for deprecated safe autofix) - rust-lang#129034 (Add `#[must_use]` attribute to `Coroutine` trait) - rust-lang#129049 (compiletest: Don't panic on unknown JSON-like output lines) - rust-lang#129050 (Emit a warning instead of an error if `--generate-link-to-definition` is used with other output formats than HTML) - rust-lang#129056 (Fix one usage of target triple in bootstrap) - rust-lang#129058 (Add mw back to review rotation) r? `@ghost` `@rustbot` modify labels: rollup
…iaskrgr Rollup of 7 pull requests Successful merges: - rust-lang#122884 (Optimize integer `pow` by removing the exit branch) - rust-lang#127857 (Allow to customize `// TODO:` comment for deprecated safe autofix) - rust-lang#129034 (Add `#[must_use]` attribute to `Coroutine` trait) - rust-lang#129049 (compiletest: Don't panic on unknown JSON-like output lines) - rust-lang#129050 (Emit a warning instead of an error if `--generate-link-to-definition` is used with other output formats than HTML) - rust-lang#129056 (Fix one usage of target triple in bootstrap) - rust-lang#129058 (Add mw back to review rotation) r? `@ghost` `@rustbot` modify labels: rollup
…iaskrgr Rollup of 7 pull requests Successful merges: - rust-lang#122884 (Optimize integer `pow` by removing the exit branch) - rust-lang#127857 (Allow to customize `// TODO:` comment for deprecated safe autofix) - rust-lang#129034 (Add `#[must_use]` attribute to `Coroutine` trait) - rust-lang#129049 (compiletest: Don't panic on unknown JSON-like output lines) - rust-lang#129050 (Emit a warning instead of an error if `--generate-link-to-definition` is used with other output formats than HTML) - rust-lang#129056 (Fix one usage of target triple in bootstrap) - rust-lang#129058 (Add mw back to review rotation) r? `@ghost` `@rustbot` modify labels: rollup
Rollup merge of rust-lang#122884 - mzabaluev:pow-remove-exit-branch, r=Amanieu Optimize integer `pow` by removing the exit branch The branch at the end of the `pow` implementations is redundant with multiplication code already present in the loop. By rotating the exit check, this branch can be largely removed, improving code size and reducing instruction cache misses. Testing on my machine (`x86_64`, 11th Gen Intel Core i5-1135G7 @ 2.40GHz), the `num::int_pow` benchmarks improve by some 40% for the unchecked operations and show some slight improvement for the checked operations as well.
The branch at the end of the
pow
implementations is redundant with multiplication code already present in the loop. By rotating the exit check, this branch can be largely removed, improving code size and reducing instruction cache misses.Testing on my machine (
x86_64
, 11th Gen Intel Core i5-1135G7 @ 2.40GHz), thenum::int_pow
benchmarks improve by some 40% for the unchecked operations and show some slight improvement for the checked operations as well.