Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b)) #62790
Comments
Might be worth looking at
A bug has been filed against GCC regarding this high-level optimization opportunity, at the advice of the Gentoo toolchain team:
#39374 appears to be related.
No amount of inserting __builtin_unpredictable makes a difference.
Yeah, that metadata doesn't make it past IR. @davidbolvansky was looking at fixing that, I think.
Would I be correct to think that line 6 of the assembly output serves no purpose due to line 3?
CC @RKSimon
Yes, I need to rebase https://reviews.llvm.org/D118118 and check every callsite.
A pragma is not a bad idea (many __builtin_unpredictable calls could be a bit inconvenient in user code), but for now, once D118118 lands, __builtin_unpredictable becomes stronger and should solve your issues. A new pragma is a good follow-up, I think.
This was going to be a write-up explaining how not having this optimization slows us down. To my pleasant surprise, LLVM/Clang trunk appears to perform this optimization: https://gcc.godbolt.org/z/P6Pbxxovv That being said, there is room to do better. Here is the assembly generated by LLVM trunk for the first case statement of the code that I linked:
Here is the assembly generated by GCC 12.3:
GCC's output is about 10% faster in micro-benchmarks, although micro-benchmarking this code is complicated by #62948. That issue must be worked around by manually patching the assembly output before any sort of benchmarking can be done:
In any case, I modified the microbenchmark from scandum/binary_search to better suit a workload that I am micro-optimizing: https://github.com/scandum/binary_search Specifically, I wanted to test on small arrays while avoiding cache effects contaminating the results. One could easily add the search functions from my modified version into the original to get numbers for bigger array sizes. The source code for the modified micro-benchmark is attached to the following GCC issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901 Should I file another issue for this, or is leaving this note here fine? Does anyone know offhand which commit to trunk made this difference?
Disregard my remarks that this is fixed. I just looked at the original example and that is not fixed in trunk. |
Landed. Now you can use __builtin_unpredictable to get CMOVs. godbolt: |
Awesome. Thanks. |
When trying to make a binary search function used in ZFS run faster, I noticed that Clang misses a high-level optimization opportunity, which is:

((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))

Other variations of this include:

((((a) > (b)) - ((a) < (b))) <= 0) -> ((a) <= (b))
((((a) > (b)) - ((a) < (b))) == -1) -> ((a) < (b))
((((a) > (b)) - ((a) < (b))) == 1) -> ((a) > (b))
((((a) > (b)) - ((a) < (b))) == 0) -> ((a) == (b))
((((a) > (b)) - ((a) < (b))) > 0) -> ((a) > (b))
((((a) > (b)) - ((a) < (b))) >= 0) -> ((a) >= (b))
((((a) >= (b)) - ((a) <= (b))) < 0) -> ((a) < (b))
((((a) >= (b)) - ((a) <= (b))) <= 0) -> ((a) <= (b))
((((a) >= (b)) - ((a) <= (b))) == -1) -> ((a) < (b))
((((a) >= (b)) - ((a) <= (b))) == 1) -> ((a) > (b))
((((a) >= (b)) - ((a) <= (b))) == 0) -> ((a) == (b))
((((a) >= (b)) - ((a) <= (b))) > 0) -> ((a) > (b))
((((a) >= (b)) - ((a) <= (b))) >= 0) -> ((a) >= (b))

(((a) > (b)) - ((a) < (b))) is a trick to generate -1, 0, or 1 when comparing two integers (signed or unsigned). When comparators using this trick are inlined into the caller, the above transformations become applicable.

Getting back to binary search: the idea for making binary search faster was to eliminate a control hazard from the binary search function by making the loop body branchless. The following site explains the control hazard and shows how to avoid it:
https://en.algorithmica.org/hpc/data-structures/binary-search/#the-bottleneck
This site provides benchmarks that confirm that avoiding the control hazard makes binary search faster:
https://github.com/scandum/binary_search/tree/master
The comparators used in ZFS are expected to return -1, 0, or 1, so (((a) > (b)) - ((a) < (b))) is often used in them. Since I wanted to avoid control hazards in binary search, I wrote a macro that can be used to do C++-style inlining of comparators into the binary search function by making a different version for each. Unfortunately, LLVM/Clang reintroduced the control hazard when all of this was put together. Here is a minimal version that avoids the macro so that it is easy to see which lines correspond to which instructions: https://gcc.godbolt.org/z/KbKc6zqjT
On amd64, if I comment out line 35 and uncomment line 36, the control hazard vanishes, but the code generation is less than ideal. If I uncomment line 37 instead of line 36, Clang generates what I wanted in the first place, but I cannot use it because I need the compiler to inline the existing comparison function.
If I compile for rv64gc, LLVM/Clang does not generate a control hazard for any of those versions, although it still generates different code: line 35 gives a 13-instruction loop, line 36 a 16-instruction loop, and line 37 a 12-instruction loop.
For comparison, GCC 13.1 generates a 15-instruction loop for amd64 without any control hazards for both the line 35 and line 36 versions, although it also misses this optimization opportunity.
If LLVM had an optimization pass that recognized the equivalence of the above expressions, it would generate code matching what it generated before I added the abstraction needed to make this function useful in production.
I will switch my code to use line 36 at the expense of code generation on RISC-V, but I would be happier if LLVM/Clang implemented this optimization.