Improve SliceExt::binary_search performance #45333
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
```diff
@@ -20,6 +20,6 @@ extern crate test;
 mod any;
 mod hash;
 mod iter;
-mod mem;
```
Why is this removed?
It doesn't exist.
@alexcrichton How was this able to get past CI? Do we not run benches as part of libcore's tests?
I don't think we run benches in CI. The file was moved in #44943.
src/libcore/benches/slice.rs (outdated)
```rust
#[bench]
fn binary_search(b: &mut Bencher) {
    let mut v = Vec::new();
```
`(0..999).collect::<Vec<_>>()` will be more efficient and somewhat easier to read (at least to me).
Done.
src/libcore/benches/slice.rs (outdated)
```rust
}
let mut i = 0;
b.iter(move || {
    i += 1299827;
```
Could we get a comment here as to what this number is intended to mean?
It's a large (compared to 999) prime to form a poor man's LCG. Maybe I should use librand instead?
This seems fine to me -- and better than random numbers when dealing with benchmarks -- but I would like a comment.
I made it into a proper LCG and linked to where I got the constants.
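(For reference, a minimal sketch of what an LCG-driven benchmark of this shape could look like. The constants 1664525 and 1013904223 are the Numerical Recipes LCG parameters; the array size and benchmark name are illustrative assumptions, not the exact code that landed.)

```rust
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn binary_search_l1(b: &mut Bencher) {
    // Sorted input sized (illustratively) to fit in L1 cache.
    let v: Vec<usize> = (0..1000).collect();
    // LCG state; constants from Numerical Recipes. Deterministic,
    // but the resulting lookup sequence defeats the branch predictor.
    let mut r = 0usize;
    b.iter(|| {
        r = r.wrapping_mul(1664525).wrapping_add(1013904223);
        let target = r % v.len();
        test::black_box(v.binary_search(&target));
    });
}
```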
I like it; trading an unpredictable branch for a predictable one (the bounds check) is neat. It doesn't exit when it finds the first equal element, so does this PR change the result for some inputs?
The biggest benefit comes from trading all the unpredictable branches inside the loop with conditional moves. This is otherwise known as "branchless binary search" and has been shown to be faster than "traditional binary search" and as fast as linear search for small inputs. There is a recent paper that covers different layouts for comparison based searching: https://arxiv.org/abs/1509.05053. It covers a lot more than "branchless binary search", nevertheless it is a good read. Answering your question: the PR should not change the results for any input. I added some extra test cases to increase confidence.
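(For the curious, here is a minimal sketch of the branchless shape being described. It follows the final-comparison snippet quoted later in this thread, but the function name is mine and this is an illustration of the technique, not necessarily the exact committed code.)

```rust
use std::cmp::Ordering::{self, Equal, Greater, Less};

// Branchless binary search sketch: the loop runs a fixed ~log2(n)
// iterations and updates `base` with a select that typically lowers
// to a conditional move, not an unpredictable branch.
fn branchless_search<T, F>(s: &[T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&T) -> Ordering,
{
    let mut size = s.len();
    if size == 0 {
        return Err(0);
    }
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // Keep the left half if s[mid] is past the target,
        // otherwise advance base to mid.
        base = if f(&s[mid]) == Greater { base } else { mid };
        size -= half;
    }
    let cmp = f(&s[base]);
    if cmp == Equal {
        Ok(base)
    } else {
        Err(base + (cmp == Less) as usize)
    }
}

fn main() {
    let v = [1, 3, 5, 7];
    assert_eq!(branchless_search(&v, |e| e.cmp(&5)), Ok(2));
    assert_eq!(branchless_search(&v, |e| e.cmp(&4)), Err(2));
}
```

Each iteration halves `size` unconditionally, so the trip count depends only on the slice length; the data-dependent decision is confined to the select on `base`.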
Cool! Although I think this needs to be tested further with other payloads, like `(u64, u64)`.
@arthurprs I don't think anything will change if `(u64, u64)` is used (or even strings for that matter). The …
Good point. I also ran a few benches just to confirm and it checks out.
@alkis there are no test cases that cover this (duplicate elements). I've had time to check this at my computer now, and the new algorithm does produce different results for some inputs. The example (playground link) searches for 0 in a vector of zeros.
(Less important, but still interesting: the old algorithm finishes as quickly as possible on this input, while for the new algorithm it's a worst case.)
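(A reconstruction of the kind of example being described; the exact playground code isn't preserved here. With every element equal, any in-bounds `Ok` index satisfies the documented contract, so the two implementations can legitimately disagree on which one they return.)

```rust
fn main() {
    let v = vec![0u8; 16];
    // Any Ok(i) with v[i] == 0 is a valid answer, but the old and
    // new implementations pick different indices for this input.
    match v.binary_search(&0) {
        Ok(i) => println!("found at index {}", i),
        Err(_) => unreachable!("0 is certainly present"),
    }
}
```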
@bluss: thanks for testing this. I don't think these changes are material, and they are very unlikely to result in breakages. Let me explain why: …

In addition, the two implementations will have different behavior when the list is not sorted according to the partial order defined by the predicate. I also think that this is not important. For performance, it is true that the new algorithm always does … I can add a few more benchmarks if you think they add value: …

About possible breakages, if you have suggestions on how to investigate them before merging I would be glad to take a look.
I improved the benchmarks a bit. Please take another look.
I think we should definitely get a crater run on this to at least note any test failures that would be caused by landing this PR. Is it possible to keep the old behavior without removing the performance gains this PR makes? I'm somewhat hesitant to change the behavior of this, and I'm not sure I entirely agree with all of your points about this being unlikely to hurt anyone in practice. I agree that the new behavior on equal data is perhaps more elegant, but the old behavior (as I understand) is stable, if defined only by the algorithm used. Since it's stable, changing it now would lead me to believe that someone could depend on it -- even if they shouldn't be -- and we should be very hesitant to change it.

Perhaps a survey of Rust code on GH that uses binary search could help here: how much, if any, of it will change behavior given stretches of equal data? If we determine that probably none, then I'd be more inclined to make this change. If at least a couple different crates, then I'm pretty much against this.

With regards to unsorted data, I don't think there's any problem in changing behavior there; a binary search on unsorted data isn't going to be reliable, at all, and this is something we explicitly document.

So, to summarize: I think that we should be very hesitant to land this without more data that this doesn't break people in practice (crater run and GH survey). It's "silent" breakage in that code continues to compile but changes in behavior, which is nearly the worst kind of breakage we can introduce.

r? @BurntSushi

cc @rust-lang/libs -- how do we feel about the potential breakage here? Author's thoughts and the breakage are outlined in #45333 (comment), mine are above.
I'd be interested in seeing what a crater run turned up, but I'm not particularly worried about this change in behavior. We've never documented which of multiple equal values is arrived at, and it seems like you're in a bit of a weird place if you have that kind of data anyway.
Honestly, if it was going to pick a value from among equals, then I'd expect it to pick the lower bound. Because the current implementation picks at effectively random (even if it is deterministic), there's absolutely no way I'd be able to rely on which value it picked. The lower bound is still deterministic, so it won't break any code that wants deterministic behavior, but I have a really hard time imagining what sort of code can manage to rely on which specific element the current algorithm picks at random. Of course we should still do a crater run to be sure, but if we can't find any legitimate cases, then we should absolutely do this change.
I found myself wanting the lower bound of the equal subsequence recently. On the other hand I'd argue against guaranteeing this sort of behavior. Also, this discussion resembles the stable/unstable sort thing.
The documentation of `binary_search` already says: "If there are multiple matches, then any one of the matches could be returned."
@arthurprs I think not having lower_bound and upper_bound is a hole in the std library. We can definitely add those. I can send a separate PR if there is consensus.
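(Until such methods exist, a lower/upper bound can be emulated on stable with `binary_search_by` by never returning `Equal`, so the search always "fails" at the boundary index. A sketch; the helper names `lower_bound`/`upper_bound` are hypothetical, not std APIs:)

```rust
use std::cmp::Ordering;

// Index of the first element >= x (hypothetical helper).
fn lower_bound<T: Ord>(v: &[T], x: &T) -> usize {
    v.binary_search_by(|e| {
        // Never return Equal: the search always ends in Err at
        // the first position where an element is not less than x.
        if e < x { Ordering::Less } else { Ordering::Greater }
    })
    .unwrap_err()
}

// Index of the first element > x (hypothetical helper).
fn upper_bound<T: Ord>(v: &[T], x: &T) -> usize {
    v.binary_search_by(|e| {
        if e <= x { Ordering::Less } else { Ordering::Greater }
    })
    .unwrap_err()
}

fn main() {
    let v = [1, 2, 2, 2, 3];
    assert_eq!(lower_bound(&v, &2), 1);
    assert_eq!(upper_bound(&v, &2), 4);
}
```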
ping @frankmcsherry, you might be interested in this
This appeals to me greatly. I can't recall if I complained to @bluss and that is why I got the ping, but (i) better performance is great, and (ii) being able to get to the lower bound is important for lots of applications. While this PR doesn't commit to that, it does mean that I could in principle start using ….

My understanding is that this could be slower if the comparison is very expensive, much like quicksort can be slower than mergesort if the comparison is expensive. Benchmarking on large randomly permuted … Also, if I understand the linked article, there is the potential downside that most architectures will not prefetch through a computed address, as produced by ….

Edit: Also, my understanding is that a lot of the "branchless" benefits go away if your comparison function has a branch in it, e.g. for ….

@Mark-Simulacrum Rust has changed behavior a few times (for me wrt …).

@kennytm The counter-point that I made recently (even though I'd love to have this land asap) is that no matter what the docs say, if the change breaks code it breaks code. Not knowing anything about this stuff, a crater run seems like a great thing to do to grok whether people unknowingly over-bound to semantics that weren't documented. If they did, breaking their stuff and saying "sorry, but" still hurts (them, and the perception of stability).
```rust
if cmp == Equal { Ok(base) } else { Err(base + (cmp == Less) as usize) }
```

I don't know either way, but is …
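(For context on that snippet: in Rust a `bool as usize` cast is defined to yield exactly 0 or 1, so `Err(base + (cmp == Less) as usize)` selects between `base` and `base + 1` with no branch in the source; whether the backend actually emits branch-free code is up to LLVM. A tiny illustration:)

```rust
use std::cmp::Ordering::{self, Less};

fn main() {
    // bool-to-usize casts are guaranteed to be 0 or 1.
    assert_eq!((Ordering::Greater == Less) as usize, 0);
    assert_eq!((Less == Less) as usize, 1);
}
```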
In talking out the issue of intended semantics, what are the use cases where a collection may have multiple matching keys and returning "an arbitrary element" is acceptable? I'm familiar with "collection has distinct keys" in which things are unambiguous, and "collection has multiple keys, return the range please". I admit to never having seen an application that calls for what Rust currently offers; if someone could clue me in I would appreciate it!
I think this implementation is really neat, but just using it in a new set of methods (for lower and upper bounds), leaving binary search unchanged, sounds like the best solution. It makes the faster implementation available, it makes the useful lower bound algorithm available, and it avoids doing more steps than needed in `binary_search`.
@bluss it would be extremely interesting to see cases that consistently get a regression out of this only because the old implementation bails out early. Also I don't think having a slow …
It's clear to me that lower_bound and upper_bound are desirable, whether or not binary_search is deprecated. I feel that the name (self-documenting/discoverable) is reason alone not to deprecate it, though. @alkis If you have "duplicates" and/or an expensive cmp function the new code might be slower.
Edit: btw, I much prefer your code to what exists at the moment, which I don't use because it doesn't do what I need. I'm just trying to call out the things folks should worry about and be sure they are ok with.
@alkis To be brief, you can (at least I could) reproduce the slower case using …
@frankmcsherry and @bluss see updated benchmarks:

Before:

```
test slice::binary_search_l1           ... bench:  48 ns/iter (+/- 1)
test slice::binary_search_l2           ... bench:  63 ns/iter (+/- 0)
test slice::binary_search_l3           ... bench: 152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups ... bench:  36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups ... bench:  64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups ... bench: 153 ns/iter (+/- 6)
```

After:

```
test slice::binary_search_l1           ... bench:  15 ns/iter (+/- 0)
test slice::binary_search_l2           ... bench:  23 ns/iter (+/- 0)
test slice::binary_search_l3           ... bench: 100 ns/iter (+/- 17)
test slice::binary_search_l1_with_dups ... bench:  15 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups ... bench:  23 ns/iter (+/- 0)
test slice::binary_search_l3_with_dups ... bench:  98 ns/iter (+/- 14)
```
I don't see the regression. Furthermore, let's step back a bit. Do we expect no regressions? I do not think this is realistic. If we accept the fact that there are going to be regressions, we have to use Amdahl's Law to assess the tradeoff. The performance increase on arrays that fit in L1 or L2 cache is about 2x. This is not trivial. So unless you think the regressions in the cases you mention represent the majority of cases, I find it unwise to block this PR. Think of it in reverse: if the code in this PR were the current code, would we approve a change back to the current code because it is faster on the contrived cases you mention?
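(To spell out the Amdahl's Law reasoning: if binary search accounts for a fraction p of a workload's runtime and is sped up by a factor s, the overall speedup S is given below; the p = 0.3 figure is purely illustrative.)

```latex
S = \frac{1}{(1 - p) + \frac{p}{s}},
\qquad
p = 0.3,\; s = 2 \;\Rightarrow\; S = \frac{1}{0.7 + 0.15} \approx 1.18
```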
FWIW: I think the biggest risk of this change is that unit tests will break if they depend on the element/position returned by the current implementation of `binary_search`.
@alkis I'd like to point out that with this change using a binary search in the BTreeMap nodes could be a win. Maybe you want to take a stab at that.
How? Binary trees are not contiguous arrays, and the optimization techniques should be different. EDIT: I see your point. BTree nodes use partially contiguous arrays.
They're BTrees, not binary trees.
@ishitatsuyuki perhaps @arthurprs is talking about the array in each BTree node itself. Not sure what is happening there but it might be linear search instead of binary search.
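(For context, the keys inside a single B-tree node form a small sorted array, so "search in a node" means searching that array, either linearly or by binary search. A hypothetical sketch of the linear variant; this is not BTreeMap's actual code:)

```rust
use std::cmp::Ordering;

// Hypothetical linear search over a node's small sorted key array.
// For a handful of keys this is often competitive with binary search
// because every branch is easy to predict.
fn node_search<T: Ord>(keys: &[T], x: &T) -> Result<usize, usize> {
    for (i, k) in keys.iter().enumerate() {
        match k.cmp(x) {
            Ordering::Less => continue,         // keep scanning
            Ordering::Equal => return Ok(i),    // exact match
            Ordering::Greater => return Err(i), // insertion point
        }
    }
    Err(keys.len())
}

fn main() {
    let keys = [2, 4, 6, 8];
    assert_eq!(node_search(&keys, &6), Ok(2));
    assert_eq!(node_search(&keys, &5), Err(2));
}
```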
src/libcore/tests/slice.rs (outdated)

```rust
}

#[test]
// When this test changes a crater run is highly advisable.
```
The test is named pretty obviously, but please leave a comment stating that this is testing implementation-specific behavior of what to do in the case of equivalent elements, and that it is OK to break this but (as you've already mentioned) breaking it should be accompanied by a crater run.
Done.
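(Separately from the implementation-specific test, there is a property that any conforming implementation must satisfy on runs of equal elements; a sketch of such a test, mine rather than the PR's:)

```rust
#[test]
fn binary_search_equal_run_property() {
    // With all elements equal, any in-bounds Ok index is acceptable;
    // this holds for both the old and the new implementation.
    let b = [1u8; 6];
    let i = b.binary_search(&1).unwrap();
    assert!(i < b.len());
    assert_eq!(b[i], 1);
}
```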
📌 Commit 2ca111b has been approved by …
Improve SliceExt::binary_search performance

Improve the performance of binary_search by reducing the number of unpredictable conditional branches in the loop. In addition improve the benchmarks to test performance in l1, l2 and l3 caches on sorted arrays with or without dups.

Before:

```
test slice::binary_search_l1           ... bench:  48 ns/iter (+/- 1)
test slice::binary_search_l2           ... bench:  63 ns/iter (+/- 0)
test slice::binary_search_l3           ... bench: 152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups ... bench:  36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups ... bench:  64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups ... bench: 153 ns/iter (+/- 6)
```

After:

```
test slice::binary_search_l1           ... bench:  15 ns/iter (+/- 0)
test slice::binary_search_l2           ... bench:  23 ns/iter (+/- 0)
test slice::binary_search_l3           ... bench: 100 ns/iter (+/- 17)
test slice::binary_search_l1_with_dups ... bench:  15 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups ... bench:  23 ns/iter (+/- 0)
test slice::binary_search_l3_with_dups ... bench:  98 ns/iter (+/- 14)
```
Thanks for the thorough reviews!
☀️ Test successful - status-appveyor, status-travis
Rewrite binary search implementation

This PR builds on top of rust-lang#128250, which should be merged first.

This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor. Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
Fixes rust-lang#115271