Skip to content

Conversation

@AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Oct 23, 2021

@AlexGuteniev AlexGuteniev marked this pull request as ready for review October 23, 2021 14:49
@AlexGuteniev AlexGuteniev requested a review from a team as a code owner October 23, 2021 14:49
@CaseyCarter CaseyCarter added the performance Must go faster label Oct 23, 2021
Fix "assmuption" typo.

Simplify _Non_zero_input to _Nonzero, consistent with function name.

Avoid negation in _No_assumption; _Possibly_zero focuses on
what we're concerned with.
@StephanTLavavej
Copy link
Member

@AlexGuteniev I have pushed a merge with main, containing major changes to <limits>. I believe that this preserves the intent of your changes, while applying them to the post-#2333 codebase, refactoring the logic to be cleaner (particularly around the widen-8/16-bits logic), and further enhances the nonzero optimization (in particular, we can avoid inspecting BSF's return value).

Please let me know if anything looks weird/wrong!

@AlexGuteniev
Copy link
Contributor Author

Please let me know if anything looks weird/wrong!

I think it is correct, but looks like without attempting to use tzcnt as bsf the intent may need reconsideration.

As the change was originally written, it avoided branches. The branches were there only for zero input.
Now there is a branch between tzcnt and bsf, regardless of zero/nonzero. And bsf itself is branchless (it uses cmov).
And bsf is a fallback.

So currently the change appears to be near-zero improvement, that doesn't even worth complicating code.

Need to benchmark and inspect codegen to figure out what should be done. Maybe implementing the same thing as for vector<👻>. Maybe just implementing #2133 will do (then again, this change may be useful, but may be not).

@AlexGuteniev AlexGuteniev marked this pull request as draft November 17, 2021 07:24
@AlexGuteniev AlexGuteniev marked this pull request as ready for review November 17, 2021 19:48
@AlexGuteniev
Copy link
Contributor Author

This optimization makes gcd 1 to 2 percent faster.
I've tried alternative approach (similar to _Select_popcount_impl), it turned out to be slightly faster than this one, still between 1 and 2 percent faster than the current.
Do we want to proceed?

@StephanTLavavej
Copy link
Member

1-2% is worth it, I think (it's pretty small on the scale of libs performance improvements, but it's easily something that the backend team would celebrate as their codegen is already highly optimized, and improving program perf without user-programmer intervention is always nice). @barcharcraz agrees - let's proceed!

@AlexGuteniev
Copy link
Contributor Author

Please consider submitting #2343 instead of this one.
It:

  • Gives slightly better results
  • Easier to explain the optimization
  • Consistent with similar optimization for _Popcount

@StephanTLavavej
Copy link
Member

Looking at #2343, I agree that it appears to be superior. Closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

<numeric>: _Countr_zero used by std::gcd has unnecessary zero checks

5 participants