Tweak the default PartialOrd::{lt,le,gt,ge}
#106065
@bors try @rust-timer queue
⌛ Trying commit 6e1c3f0 with merge b6f32e9a3b254c2d1a3431d90ed5169aca532ea6...
use std::cmp::Ordering;

#[derive(PartialOrd, PartialEq)]
pub struct Foo(u16);
Hopefully this test will ensure that the problem you saw with `BytePos` won't happen again, and will be easier to catch if accidentally regressed.
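As a minimal sketch of what a test like this could exercise (the names `check_lt` and `check_partial_cmp` are hypothetical, and this is not the contents of the actual test file): a derived `PartialOrd` on a trivial newtype, plus a function whose body is a single `<`. The hope is that such a function compiles down to one integer comparison, though that depends on the compiler and LLVM version.

```rust
use std::cmp::Ordering;

#[derive(PartialOrd, PartialEq)]
pub struct Foo(u16);

// Hypothetical function under test: `<` here goes through the default
// `PartialOrd::lt`, which is built on the derived `partial_cmp`.
pub fn check_lt(a: Foo, b: Foo) -> bool {
    a < b
}

// Hypothetical helper exposing the derived three-way comparison directly.
pub fn check_partial_cmp(a: &Foo, b: &Foo) -> Option<Ordering> {
    a.partial_cmp(b)
}
```

Since `u16` is totally ordered, `check_partial_cmp` never returns `None` for this type, which is the property the optimizer would need to notice.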
library/core/src/cmp.rs
Outdated
@@ -1161,7 +1175,11 @@ pub trait PartialOrd<Rhs: ?Sized = Self>: PartialEq<Rhs> {
#[must_use]
#[stable(feature = "rust1", since = "1.0.0")]
fn gt(&self, other: &Rhs) -> bool {
matches!(self.partial_cmp(other), Some(Greater))
if let Some(ordering) = self.partial_cmp(other) {
ordering.is_gt()
This is now conceptually two checks, rather than just one, so it's possible it's not always better. `None` is currently 2 here, so the old code was hypothetically just `c == 1`, and now it's `c != 2 && c > 0`. (Of course `lt` ends up being `c != 2 && c < 0`, which obviously folds to `c < 0`, so that one's probably not impacted.)
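To make the two shapes concrete, here is a rough sketch as free functions (the names `gt_by_match` and `gt_by_is_gt` are made up for illustration, and the `else { false }` branch is an assumption, since the diff above is cut off before it). Both should return the same answer for every input; the open question is only which one optimizes better.

```rust
use std::cmp::Ordering;

// Old shape: one match against the exact `Some(Greater)` value.
pub fn gt_by_match<T: PartialOrd>(a: &T, b: &T) -> bool {
    matches!(a.partial_cmp(b), Some(Ordering::Greater))
}

// New shape: first rule out `None`, then ask the `Ordering` itself.
// The `else` branch is assumed here; it is not visible in the diff.
pub fn gt_by_is_gt<T: PartialOrd>(a: &T, b: &T) -> bool {
    if let Some(ordering) = a.partial_cmp(b) {
        ordering.is_gt()
    } else {
        false
    }
}
```

Floats are a handy way to check the `None` path, since any comparison involving NaN yields `None` from `partial_cmp`.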
My hope is that this is still better in practice:
- I would bet that most `partial_cmp`s are actually `cmp`s, and thus the optimizer will easily notice that the result is never `None` -- like happens in the codegen test.
- For things that can actually return `None`, hopefully jump-threading will usually notice that the `None` becomes `false` and will again bypass actually running this check at runtime.
I'll see if I can prove that out in a codegen test...
Well, I didn't manage to make a great codegen test for this, but I did in passing find two other things:
- We should start putting more `noundef` on parameters, https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/We.20will.20want.20a.20lot.20of.20noundefs/near/317472833
- LLVM doesn't optimize everything as well as it should: Implementing `<=` via 3-way comparison doesn't optimize down llvm/llvm-project#59666
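For readers who want the Rust-level picture of the second point, a small sketch (function names are hypothetical, and this is not the reproduction from the LLVM issue): `<=` written directly, next to `<=` routed through a three-way comparison. The two are semantically identical; whether they produce the same machine code depends on the LLVM version in use.

```rust
use std::cmp::Ordering;

// Direct comparison: a single `<=` on the integers.
pub fn le_direct(a: u32, b: u32) -> bool {
    a <= b
}

// Same predicate, but computed from the three-way result:
// "less or equal" is the same as "not greater".
pub fn le_via_cmp(a: u32, b: u32) -> bool {
    a.cmp(&b) != Ordering::Greater
}
```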
☀️ Try build successful - checks-actions
Finished benchmarking commit (b6f32e9a3b254c2d1a3431d90ed5169aca532ea6): comparison URL.
Overall result: no relevant changes - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: This benchmark run did not return any relevant results for this metric.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This benchmark run did not return any relevant results for this metric.
Well that's a whole lot of nothing in perf 😅 I saw your thumb, I could also cut this back to just the codegen test, since it already passes, if that's useful but we don't want the (Apparently highfive didn't like my previously-proposed reviewer.)
Yeah, I'm only t-miri, I can't approve anything in this repo. I like this work, but I'm extremely wary of checking in subtle changes like this that aren't backed up by any kind of test. I'm very curious to know what an LLVM expert thinks of that issue. If this is another "oh we're missing a fold for that" situation, that would be awesome. But I kind of doubt it.
@saethlin I went to try making an assembly test that I'm hoping that the answer really is that there's just some fold or range logic missing. Alive2 proves that it's allowed to do it, so it's a matter of how/where to recognize it.
Wow that's very minimized. You're getting my hopes up...
That new 59668 one is too minimized to help the original, though -- it's about the backend, and in IR (where the optimizations that #105840 cares about would happen) they're all just
I'm going to close this since the I've submitted the codegen test as #106100.
Sorry, didn't comment on this before it was closed, but I agree that given the lack of improvement these changes are not worth it.
@compiler-errors No worries! Thanks for commenting. Given that it's the end of year I have no expectations that people would be looking at things for a while. |
…=compiler-errors
Codegen test for derived `<` on trivial newtype [TEST ONLY]
I originally wrote this for rust-lang#106065, but the libcore changes there aren't necessarily a win. So I pulled out this test to be its own PR since it's important (see rust-lang#105840 (comment)) and well-intentioned changes to core or the derive could accidentally break it without that being obvious (other than by massive unexplained perf changes).
Micro-optimize Ord::cmp for primitives
I originally started looking into this because in MIR, `Ord::cmp` is _huge_ and even for trivial types like `u32` which are theoretically a single statement to compare, the `Ord::cmp` impl doesn't inline. A significant contributor to the size of the implementation is that it has two comparisons. And this actually follows through to the final x86_64 codegen too, which is... strange. We don't need two `cmp` instructions in order to do a single Rust-level comparison. So I started tweaking the implementation, and came up with the same thing as rust-lang#64082 (which I didn't know about at the time), I ran `llvm-mca` on it per the issue which was linked in the code to establish that it looked better, and submitted it for a benchmark run.
The initial benchmark run regresses basically everything. By looking through the cachegrind diffs in the perf report then the `perf annotate` for regressed functions, I was able to identify one source of the regression: `Ord::min` and `Ord::max` no longer optimize well. Tweaking them to bypass `Ord::cmp` removed some regressions, but not much.
Diving back into the cachegrind diffs and disassembly, I found one huge widespread issue was that the codegen for `Span`'s `hash_stable` regressed because `span_data_to_lines_and_cols` no longer inlined into it, because that function does a lot of `Range<BytePos>::contains`. The implementation of `Range::contains` uses `PartialOrd` multiple times, and we had massively regressed the codegen of `Range::contains`. The root problem here seems to be that `PartialOrd` is derived on `BytePos`, which is a simple wrapper around a `u32`. So for `BytePos`, `PartialOrd::{le, lt, ge, gt}` use the default impls, which go through `PartialOrd::partial_cmp`, and LLVM fails to optimize these combinations of methods with the new `Ord::cmp` implementation. At a guess, the new implementation makes LLVM totally lose track of the fact that `<Ord for u32>::cmp` is an elaborate way to compare two integers.
So I have low hopes for this overall, because my strategy (which is working) to recover the regressions is to avoid the "faster" implementation that this PR is based around. If we have to settle for an implementation of `Ord::cmp` which is on its own sub-optimal but is optimized better in combination with functions that use its return value in specific ways, so be it. However, one of the runs had an improvement in `coercions`. I don't know if that is jitter or relevant. But I'm still finding threads to pull here, so I'm going to keep at it.
For the moment I am hacking up the implementations on `BytePos` instead of modifying the code that `derive(PartialOrd, Ord)` expands to because that would be hard, and it would also mean that we would just expand to more code, perhaps regressing compile time for that reason, even if the generated assembly is more efficient.
---
Hacking up the remainder of the `PartialOrd`/`Ord` methods on `BytePos` took us down to 3 regressions and 6 improvements, which is interesting. All the improvements are in `coercions`, so I'm sure this improved _something_ but whether it matters... hard to say.
Based on the findings of `@joboet`, I'm going to cherry-pick rust-lang#106065 onto this branch, because that strategy seems to improve `PartialOrd::lt` and `PartialOrd::ge` back to the original codegen, even when they are using our new `Ord::cmp` impl. If the remaining perf regressions are due to de-optimizing a `PartialOrd::lt` not on `BytePos`, this might be a further improvement.
---
Okay, that cherry-pick brought us down to 2 regressions but that might be noise. We still have the same 6 improvements, all on `coercions`.
I think the next thing to try here is modifying the implementation of `derive(PartialOrd)` to automatically emit the modifications that I made to `BytePos` (directly implementing all the methods for newtypes). But even if that works, I think the effect of this change is so mixed that it's probably not worth merging with current LLVM. What I'm afraid of is that this change currently pessimizes matching on `Ordering`, and that is the most natural thing to do with an enum. So I'm not closing this yet, but I think without a change from LLVM, I have other priorities at the moment.
r? `@ghost`
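A rough sketch of what "directly implementing all the methods for newtypes" could look like, assuming a stand-in type `Pos` (hypothetical, not the real `BytePos` definition) and a hypothetical `in_range` helper to show the `Range::contains` path. Each comparison method forwards to the inner `u32` instead of going through the default impls built on `partial_cmp`.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for a `u32` newtype such as `BytePos`.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Pos(pub u32);

impl PartialOrd for Pos {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
    // Override the defaults so each operator is a plain integer comparison.
    fn lt(&self, other: &Self) -> bool { self.0 < other.0 }
    fn le(&self, other: &Self) -> bool { self.0 <= other.0 }
    fn gt(&self, other: &Self) -> bool { self.0 > other.0 }
    fn ge(&self, other: &Self) -> bool { self.0 >= other.0 }
}

impl Ord for Pos {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.cmp(&other.0)
    }
}

// `Range::contains` compares through `PartialOrd` on both bounds, so it
// picks up the overridden methods above.
pub fn in_range(lo: Pos, hi: Pos, p: Pos) -> bool {
    (lo..hi).contains(&p)
}
```

The trade-off mentioned above still applies: having a derive emit this for every newtype means more expanded code per type, which may cost compile time even if the generated assembly is better.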
r? @saethlin
who noticed that #105840 was having trouble because of these default implementations.
That got me inspired to give this a shot, to see whether tweaking those defaults might actually improve things -- and hopefully make that PR easier to land. (And maybe even test, since this adds a codegen test that it would not want to regress.)
Specifically, I noticed in https://rust.godbolt.org/z/3fbve7eW7 that
did optimize as desired, whereas
didn't. So this PR bases all the `Ordering` methods around comparisons against `0`, rather than trying to match specific variants.
Let's see what perf says 🤞
EDIT: Also, credit to @joboet in #105840 (comment) who first pointed out that matching the variants directly isn't necessarily better.
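As an illustration of "comparisons against `0`" versus matching variants, here is a small sketch with hypothetical function names (it is not the libcore source). It relies on `Ordering` being a fieldless enum with discriminants -1, 0 and 1, so casting to `i8` gives a value that can be compared against zero.

```rust
use std::cmp::Ordering;

// Variant-matching style.
pub fn is_gt_by_match(o: Ordering) -> bool {
    matches!(o, Ordering::Greater)
}

pub fn is_le_by_match(o: Ordering) -> bool {
    matches!(o, Ordering::Less | Ordering::Equal)
}

// Compare-against-zero style: Less = -1, Equal = 0, Greater = 1.
pub fn is_gt_by_int(o: Ordering) -> bool {
    (o as i8) > 0
}

pub fn is_le_by_int(o: Ordering) -> bool {
    (o as i8) <= 0
}
```

Both styles agree on all three variants; which one the optimizer handles better is the empirical question here.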