Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b)) #62790
Comments
Might be worth looking at
A bug has been filed against GCC regarding this high-level optimization opportunity, at the advice of the Gentoo toolchain team:
#39374 appears to be related.
No amount of inserting __builtin_unpredictable makes a difference.
Yeah, that metadata doesn't make it past IR. @davidbolvansky was looking at fixing that, I think.
Would I be correct to think that line 6 of the assembly output serves no purpose due to line 3?
CC @RKSimon
Yes, I need to rebase https://reviews.llvm.org/D118118 and check every callsite.
A pragma is not a bad idea (many __builtin_unpredictable calls could be a bit inconvenient in user code), but for now, once D118118 lands, __builtin_unpredictable becomes stronger and should solve your issues. A new pragma is a good follow-up, I think.
This was going to be a write-up explaining how not having this optimization slows us down. To my pleasant surprise, LLVM/Clang trunk appears to perform this optimization: https://gcc.godbolt.org/z/P6Pbxxovv That being said, there is room to do better. Here is the assembly generated by LLVM trunk for the first case statement of the code that I linked:
Here is the assembly generated by GCC 12.3:
GCC's output is about 10% faster in micro-benchmarks, although micro-benchmarking this code is complicated by #62948. That issue must be worked around by manually patching the assembly output before any sort of benchmarking can be done:
In any case, I modified the microbenchmark from scandum/binary_search to better suit a workload that I am micro-optimizing: https://github.com/scandum/binary_search Specifically, I wanted to test on small arrays while avoiding cache effects contaminating the results. One could easily add the search functions from my modified version into the original to get numbers for bigger array sizes. The source code for the modified micro-benchmark is attached to the following GCC issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901 Should I file another issue for this, or is leaving this note here fine? Does anyone know offhand which commit to trunk made this difference?
Disregard my remarks that this is fixed. I just looked at the original example and that is not fixed in trunk. |
Landed. Now you can use __builtin_unpredictable to get CMOVs. godbolt: |
Awesome. Thanks. |
When trying to make a binary search function used in ZFS run faster, I noticed that Clang misses a high-level optimization opportunity, which is:

((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))

Other variations of this include:

((((a) > (b)) - ((a) < (b))) <= 0) -> ((a) <= (b))
((((a) > (b)) - ((a) < (b))) == -1) -> ((a) < (b))
((((a) > (b)) - ((a) < (b))) == 1) -> ((a) > (b))
((((a) > (b)) - ((a) < (b))) == 0) -> ((a) == (b))
((((a) > (b)) - ((a) < (b))) > 0) -> ((a) > (b))
((((a) > (b)) - ((a) < (b))) >= 0) -> ((a) >= (b))
((((a) >= (b)) - ((a) <= (b))) < 0) -> ((a) < (b))
((((a) >= (b)) - ((a) <= (b))) <= 0) -> ((a) <= (b))
((((a) >= (b)) - ((a) <= (b))) == -1) -> ((a) < (b))
((((a) >= (b)) - ((a) <= (b))) == 1) -> ((a) > (b))
((((a) >= (b)) - ((a) <= (b))) == 0) -> ((a) == (b))
((((a) >= (b)) - ((a) <= (b))) > 0) -> ((a) > (b))
((((a) >= (b)) - ((a) <= (b))) >= 0) -> ((a) >= (b))

(((a) > (b)) - ((a) < (b))) is a trick to generate -1, 0, or 1 when comparing two integers (signed or unsigned). When comparators using this trick are inlined into the caller, the above transformations become applicable.

Getting back to binary search: the idea for making binary search faster was to eliminate a control hazard from the binary search function by making the loop body branchless. The following site explains the control hazard and shows how to avoid it:
https://en.algorithmica.org/hpc/data-structures/binary-search/#the-bottleneck
This site provides benchmarks that confirm that avoiding the control hazard makes binary search faster:
https://github.com/scandum/binary_search/tree/master
The comparators used in ZFS are expected to return -1, 0, or 1, so (((a) > (b)) - ((a) < (b))) is often used in them. Since I wanted to avoid control hazards in binary search, I wrote a macro that can be used to do C++-style inlining of comparators into the binary search function by making a different version for each. Unfortunately, LLVM/Clang reintroduced the control hazard when all of this was put together. Here is a minimal version that avoids the macro so that it is easy to see which lines correspond to which instructions: https://gcc.godbolt.org/z/KbKc6zqjT
On amd64, if I comment out line 35 and uncomment line 36, the control hazard vanishes, but the code generation is less than ideal. If I uncomment line 37 instead of line 36, Clang generates what I wanted in the first place, but I cannot use it because I need the compiler to inline the existing comparison function.
If I compile for rv64gc, LLVM/Clang does not generate a control hazard for any of those versions, although it still generates different code: line 35 gives a 13-instruction loop, line 36 a 16-instruction loop, and line 37 a 12-instruction loop.
For comparison, GCC 13.1 generates a 15-instruction loop for amd64 without any control hazards for both the line 35 and line 36 versions, although it also misses this optimization opportunity.
If LLVM had an optimization pass that recognized the equivalence of the above expressions, it would generate code matching what it generated before I added the abstraction needed to make this function useful in production.
I will switch my code to use line 36 at the expense of code generation on RISC-V, but I would be happier if LLVM/Clang implemented this optimization.