Even more optimal Ord implementation for integers #64082

scottmcm · 2019-09-02T02:20:39Z

When I saw #63767 it made me think of #61635, and I noticed I could use the same trick from the latter of those for cmp: https://godbolt.org/z/XCDDt7

Before:

  mov eax, dword ptr [rdi]
  xor ecx, ecx
  cmp eax, dword ptr [rsi]
  seta cl
  mov eax, 255
  cmovae eax, ecx

After:

  mov eax, dword ptr [rdi]
  cmp eax, dword ptr [rsi]
  seta al
  sbb al, 0

Old llvm-mca statistics:

Iterations:        100
Instructions:      700
Total Cycles:      217
Total uOps:        1100

Dispatch Width:    6
uOps Per Cycle:    5.07
IPC:               3.23
Block RThroughput: 1.8

New llvm-mca statistics:

Iterations:        100
Instructions:      500
Total Cycles:      214
Total uOps:        1000

Dispatch Width:    6
uOps Per Cycle:    4.67
IPC:               2.34
Block RThroughput: 1.7

r? @nagisa

nagisa · 2019-09-02T02:38:57Z

@bors r+ rollup

bors · 2019-09-02T02:38:59Z

📌 Commit b5e8f35ced2bd7fe8c3eb5029e0139f0c78316ce has been approved by nagisa

nagisa · 2019-09-02T02:40:04Z

@bors r-

Could we just call signum to implement this?

nagisa · 2019-09-02T02:43:57Z

@bors r+

eh, whatever

bors · 2019-09-02T02:44:14Z

📌 Commit b5e8f35ced2bd7fe8c3eb5029e0139f0c78316ce has been approved by nagisa

ollie27 · 2019-09-02T02:54:32Z

This seems to make some code worse: godbolt.

nagisa · 2019-09-02T03:09:50Z

A fair point, although I feel like comparing Ordering to a specific value (as opposed to directly comparing) is a very unlikely to occur in practice. And it is pretty much a worst-case scenario.

@bors r- to give it some time for comments.

scottmcm · 2019-09-02T03:18:31Z

Could we just call signum to implement this?

I assumed (a-b).signum() wouldn't work because it would overflow.

This seems to make some code worse

Hmm, that's an interesting failure case: https://godbolt.org/z/xwIxKw

  %0 = icmp ugt i32 %a.val, %b.val, !dbg !9
  %1 = icmp ult i32 %a.val, %b.val, !dbg !12
  %2 = zext i1 %1 to i64, !dbg !9
  %3 = xor i64 %2, -1, !dbg !9
  %.off = sext i1 %0 to i64, !dbg !13
  %switch = icmp eq i64 %3, %.off, !dbg !13

Looks like the bitwise-not (xor -1) is confusing it enough that it can't remove the [sz]exts to just compare the i1s? That pattern looks like it ought to be safe to fold...

nagisa · 2019-09-02T03:21:07Z

I have also found that replacing the match with

-1 => Less,
0 => Equal
_ => Greater

instead of having the unreachable_unchecked will improve the codegen slightly, which is entirely an LLVM issue.

scottmcm · 2019-09-02T03:27:31Z

I have also found that [...] will improve the codegen slightly

Can you elaborate, @nagisa? I seem to be missing something: https://godbolt.org/z/P0rVcN

nagisa · 2019-09-02T03:32:05Z

It helps in the particular case where the return value is compared to specific value that was posted above. https://godbolt.org/z/WD-xY4

scottmcm · 2019-09-02T04:02:13Z

Hmm, apparently the match doesn't really disappear the way it should. Time to go change it to a transmute apparently: https://godbolt.org/z/LwhuZx

scottmcm · 2019-09-02T06:00:38Z

Kicking the PR:

 Downloading crates ...
  Downloaded foreign-types-shared v0.1.1
error: failed to download from `https://crates.io/api/v1/crates/aho-corasick/0.7.3/download`

Caused by:
  [35] SSL connect error (error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure)
thread 'main' panicked at 'command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "metadata" "--format-version" "1" "--features" "panic-unwind backtrace" "--manifest-path" "/checkout/src/libstd/Cargo.toml"
expected success, got: exit code: 101', src/build_helper/lib.rs:133:9

parched · 2019-09-02T07:39:21Z

Also consider checking other architectures or perhaps conditionally gate this change on x86_64
https://godbolt.org/z/tqOBnE

parched · 2019-09-02T07:53:38Z

I'm also curious how the change performs once inlined, like in some sorting algorithm or similar. The compiler will probably change the generated code a lot once it knows how the return value is used. E.g., if the caller is doing cmp(x,y) == Less, will it be better or worse?

leonardo-m · 2019-09-02T17:22:55Z

I agree with parched, probably a more real benchmark is in order, like sorting an array of integers.

scottmcm · 2019-09-02T22:28:44Z

Looks like I can't coerce LLVM into optimizing this correctly, so I'll file a bug there and abandon this.

Micro-optimize Ord::cmp for primitives I originally started looking into this because in MIR, `PartialOrd::cmp` is _huge_ and even for trivial types like `u32` which are theoretically a single statement to compare, the `PartialOrd::cmp` impl doesn't inline. A significant contributor to the size of the implementation is that it has two comparisons. And this actually follows through to the final x86_64 codegen too, which is... strange. We don't need two `cmp` instructions in order to do a single Rust-level comparison. So I started tweaking the implementation, and came up with the same thing as rust-lang#64082 (which I didn't know about at the time), I ran `llvm-mca` on it per the issue which was linked in the code to establish that it looked better, and submitted it for a benchmark run. The initial benchmark run regresses basically everything. By looking through the cachegrind diffs in the perf report then the `perf annotate` for regressed functions, I was able to identify one source of the regression: `Ord::min` and `Ord::max` no longer optimize well. Tweaking them to bypass `Ord::cmp` removed some regressions, but not much. Diving back into the cachegrind diffs and disassembly, I found one huge widespread issue was that the codegen for `Span`'s `hash_stable` regressed because `span_data_to_lines_and_cols` no longer inlined into it, because that function does a lot of `Range<BytePos>::contains`. The implementation of `Range::contains` uses `PartialOrd` multiple times, and we had massively regressed the codegen of `Range::contains`. The root problem here seems to be that `PartialOrd` is derived on `BytePos`, which is a simple wrapper around a `u32`. So for `BytePos`, `PartialOrd::{le, lt, ge, gt}` use the default impls, which go through `PartialOrd::cmp`, and LLVM fails to optimize these combinations of methods with the new `Ord::cmp` implementation. At a guess, the new implementation makes LLVM totally loses track of the fact that `<Ord for u32>::cmp` is an elaborate way to compare two integers. So I have low hopes for this overall, because my strategy (which is working) to recover the regressions is to avoid the "faster" implementation that this PR is based around. If we have to settle for an implementation of `Ord::cmp` which is on its own sub-optimal but is optimized better in combination with functions that use its return value in specific ways, so be it. However, one of the runs had an improvement in `coercions`. I don't know if that is jitter or relevant. But I'm still finding threads to pull here, so I'm going to keep at it. For the moment I am hacking up the implementations on `BytePos` instead of modifying the code that `derive(PartialOrd, Ord)` expands to because that would be hard, and it would also mean that we would just expand to more code, perhaps regressing compile time for that reason, even if the generated assembly is more efficient. --- Hacking up the remainder of the `PartialOrd`/`Ord` methods on `BytePos` took us down to 3 regressions and 6 improvements, which is interesting. All the improvements are in `coercions`, so I'm sure this improved _something_ but whether it matters... hard to say. Based on the findings of `@joboet,` I'm going to cherry-pick rust-lang#106065 onto this branch, because that strategy seems to improve `PartialOrd::lt` and `PartialOrd::ge` back to the original codegen, even when they are using our new `Ord::cmp` impl. If the remaining perf regressions are due to de-optimizing a `PartialOrd::lt` not on `BytePos`, this might be a further improvement. --- Okay, that cherry-pick brought us down to 2 regressions but that might be noise. We still have the same 6 improvements, all on `coercions`. I think the next thing to try here is modifying the implementation of `derive(PartialOrd)` to automatically emit the modifications that I made to `BytePos` (directly implementing all the methods for newtypes). But even if that works, I think the effect of this change is so mixed that it's probably not worth merging with current LLVM. What I'm afraid of is that this change currently pessimizes matching on `Ordering`, and that is the most natural thing to do with an enum. So I'm not closing this yet, but I think without a change from LLVM, I have other priorities at the moment. r? `@ghost`

rust-highfive assigned nagisa Sep 2, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 2, 2019

scottmcm force-pushed the ord-again branch from fe203b0 to b5e8f35 Compare September 2, 2019 02:22

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 2, 2019

bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Sep 2, 2019

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Sep 2, 2019

bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Sep 2, 2019

Even more optimal Ord implementation for integers

6fcc3dc

scottmcm force-pushed the ord-again branch from b5e8f35 to 6fcc3dc Compare September 2, 2019 05:52

scottmcm closed this Sep 2, 2019

scottmcm reopened this Sep 2, 2019

scottmcm closed this Sep 2, 2019

moshg mentioned this pull request Nov 30, 2019

Optimize Ord trait implementation for bool #66881

Merged

scottmcm mentioned this pull request Dec 18, 2022

Micro-optimize Ord::cmp for primitives #105840

Closed

Even more optimal Ord implementation for integers #64082

Even more optimal Ord implementation for integers #64082

Uh oh!

Conversation

scottmcm commented Sep 2, 2019

Uh oh!

nagisa commented Sep 2, 2019

Uh oh!

bors commented Sep 2, 2019

Uh oh!

nagisa commented Sep 2, 2019

Uh oh!

nagisa commented Sep 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bors commented Sep 2, 2019

Uh oh!

ollie27 commented Sep 2, 2019

Uh oh!

nagisa commented Sep 2, 2019

Uh oh!

scottmcm commented Sep 2, 2019

Uh oh!

nagisa commented Sep 2, 2019

Uh oh!

scottmcm commented Sep 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nagisa commented Sep 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scottmcm commented Sep 2, 2019

Uh oh!

scottmcm commented Sep 2, 2019

Uh oh!

parched commented Sep 2, 2019

Uh oh!

parched commented Sep 2, 2019

Uh oh!

leonardo-m commented Sep 2, 2019

Uh oh!

scottmcm commented Sep 2, 2019

Uh oh!

Uh oh!

nagisa commented Sep 2, 2019 •

edited

Loading

scottmcm commented Sep 2, 2019 •

edited

Loading

nagisa commented Sep 2, 2019 •

edited

Loading