Improve performance of stable sort #100856
Conversation
Hey! It looks like you've submitted a new PR for the library teams!

(rust-highfive has picked a reviewer for you, use r? to override)
That looks really nice. There is another PR, #90545, improving std::sort; did you compare the performance of these two?
@EdorianDark I'm running some benchmarks tonight; preliminary results suggest it is slightly faster than the existing sort, but significantly slower than new_stable for Copy types. From what I can tell, the discussion in that PR seems to be stuck at verification of the algorithm. One significant difference here is that this PR does not change the core algorithm. It speeds up providing a minimum sorted slice, so it should be much easier to prove to a reasonable degree that this is as correct as the previous version. And thank you for pointing out that other PR, I was not aware of it.
Plus I just discovered UB by running my test suite with Miri against the proposed changes.
Regarding the failing test, I'm not sure what the expected behaviour should be. A simplified example:

```rust
use rand::seq::SliceRandom;
use std::cmp::Ordering::{Equal, Greater, Less};

let mut rng = rand::thread_rng();

// Sort using a completely random comparison function.
// This will reorder the elements *somehow*, but won't panic.
let mut v = [0; 500];
for i in 0..v.len() {
    v[i] = i as i32;
}
v.sort_by(|_, _| *[Less, Equal, Greater].choose(&mut rng).unwrap());
v.sort();
for i in 0..v.len() {
    assert_eq!(v[i], i as i32);
}
```
The implementation of
I don’t really like those relative speedup graphs, since they paint a skewed/biased picture of what’s happening: e.g., if the new implementation is twice as fast, it’ll say +100%, but if the old implementation is twice as fast, it’ll only say -50%. Unless I understand something wrong here. Anyways, I would prefer some non-biased statistics where both algorithms are treated the same, rather than one being arbitrarily chosen to always be the basis for determining what “100%” means.

Edit: Actually… I just noticed numbers below -100% on this page, so perhaps the situation already is different from what I assumed, and below zero the roles switch?
@frank I fully agree about 'directional' speedup graphs. This one is deliberately symmetric: +100% means the new one is twice as fast as the old one, and -100% means the old one is twice as fast as the new one. IIRC I mentioned this property in the PR description.
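For illustration, here is one way such a symmetric metric can be computed. This is a sketch of the property described above, not necessarily the exact formula used by the linked analysis scripts:

```rust
/// Symmetric relative speedup in percent: +100.0 means `new` is twice
/// as fast as `old`, and -100.0 means `old` is twice as fast as `new`.
/// Both directions are treated the same, unlike a plain ratio.
fn symmetric_speedup_percent(old_ns: f64, new_ns: f64) -> f64 {
    if new_ns <= old_ns {
        (old_ns / new_ns - 1.0) * 100.0
    } else {
        -(new_ns / old_ns - 1.0) * 100.0
    }
}
```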
@Voultapher Oh… right, you did mention there that it’s symmetric. Maybe it’d be easier to find if it was mentioned inside of the graph, too 😇 On a related note, it isn’t clear to me whether the linked results about the number of comparisons use a symmetrical or an asymmetrical percentage. (I would probably assume those are asymmetrical though, looking at the numbers so far.)
[ @Voultapher by the way, you had tagged the wrong person; careful with using the @-sign in e-mail ;-) – also, since you’ve responded by e-mail you might have missed my edit on my first comment, where I had already noticed some values below -100% on some linked graphs ]
Indeed, I currently have no access to my computer and will take a closer look on Monday. Good point about tagging; I didn't realize it turns into an actual tag, my apologies to whoever I've tagged now. Regarding the comparison diff, it's not symmetric. Back when I wrote that analysis code I hadn't yet figured out a good way to visualize the difference symmetrically. If deemed important I can look at it again.
@Mark-Simulacrum I just noticed you changed the label from waiting-on-review to waiting-on-author. I'm waiting on review; regarding the failing CI, I commented about it here: #100856 (comment). I could of course disable the failing test, but as I noted earlier, I'm not sure what the expected behaviour ought to be.
The code sample in #100856 (comment) has both a sort_by (which is random) followed by a regular sort() call. I think any changes in this PR should preserve behavior: even an incorrect Ord/PartialOrd impl shouldn't cause us to end up with a different set of elements in the array, so the subsequent sort should bring it back in order. Changing the specific results of the sort_by seems OK though; we don't need to guarantee the same set of cmp(..) calls or the exact same results, so long as we still end up with a stable ordering.

As an aside, it looks like this PR's diff currently replaces the main body of the implementation too, even if it doesn't actually change it (hard to tell). As-is, it would require basically a fresh review of the whole sort algorithm, which won't happen soon.
Good point, it seems that items are being duplicated and others are silently leaked. With types that impl Drop this might lead to a double free.
Please read the comments and implementation. This would only happen for types that implement Copy.
I would assume that it’s a sane use-case of
I wonder if the faster sorting could be enabled for a set of trusted types like
Yeah, just because a type is Copy doesn't mean we can ignore such a basic property of sort (the exit data is the same as the entry data, just reordered). Raw pointers are an example of how that can go wrong in practice, but even sorting &T would run into this. My guess is that many users calling sort_by may actually have an incorrect ordering function, and we shouldn't break entirely on that. If we panicked, that would be a plausible discussion, but randomly duplicating elements isn't acceptable.
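To make the concern concrete, here is a small example (mine, not from the thread) of the kind of "incorrect but sane" comparator being discussed, on a Copy type: sorting f64 with a comparator that treats NaN as equal to everything is not a total order, yet the output must still be a permutation of the input:

```rust
use std::cmp::Ordering;

fn main() {
    // f64 is Copy, but this comparator is not a strict total order:
    // NaN compares as Equal to every other value here.
    let mut v = vec![3.0, f64::NAN, 1.0, 2.0, f64::NAN];
    let len_before = v.len();

    v.sort_by(|a, b| a.partial_cmp(b).unwrap_or(Ordering::Equal));

    // Whatever order results, the slice must still hold the same five
    // elements: no duplication, no loss.
    assert_eq!(v.len(), len_before);
    assert_eq!(v.iter().filter(|x| x.is_nan()).count(), 2);
}
```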
It’s hard to draw a line.
☔ The latest upstream changes (presumably #101816) made this pull request unmergeable. Please resolve the merge conflicts.
FYI, I have found a viable alternative that completely avoids the problem by using sorting networks. I'm still researching potential avenues for further improvement and will post the results here once these are explored.
Force-pushed 98cc003 to a3cabd3.
I'm back with an improved version that is completely free of any element duplication issues. It achieves this mainly by using a stable sorting-network to speed up
Achieving them all simultaneously is hard.

Before we dive into the benchmark results: I wrote a test to explicitly verify that element duplication won't happen anymore, even in the presence of invalid Ord implementations.

Performance improvements. Initially I experimented with bitonic merge networks to replace the parity_merge functions; however, these turned out to require a substantial number of additional comparisons. A collection of best known sorting-networks can be found here: https://bertdobbelaere.github.io/sorting_networks.html. These describe ways to sort data without changing control flow depending on the result of each comparison, which can be implemented without jump instructions. Sorting-networks are great at leveraging Instruction-Level Parallelism (ILP): they expose multiple comparisons in straight-line code, with built-in data-dependency parallelism and ordering per layer. Up to a certain point they complete a sort on random data in fewer comparisons than an insertion sort does; e.g. sorting 20 elements takes the network 91 comparisons vs. 111 for insertion sort. For some patterns this is not ideal, however, such as already-sorted inputs: the sorting-network will always finish after the same number of comparisons. None of these optimal sorting networks are applicable for a stable sort, as they require swapping non-adjacent wires, but there are stable sorting-networks. I experimented with hybrid approaches, but ultimately these don't seem to be worth it. The hundred or so hours I've spent since the last iteration cannot be adequately summed up in a couple of words, so I'll leave it at that. Here are some of the core changes:

Implementation:

Analysis:
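To make the sorting-network idea above concrete, here is a minimal safe sketch of a stable 3-element network built only from adjacent compare-exchanges. It is illustrative only: the function names are mine, and the PR's real code uses unsafe, branchless (cmov-style) swaps rather than this branching version:

```rust
// Compare-exchange of adjacent elements. Swapping only when the right
// element is strictly less keeps equal elements in order, i.e. stable.
fn swap_adjacent_if_less<T, F: FnMut(&T, &T) -> bool>(v: &mut [T], i: usize, is_less: &mut F) {
    if is_less(&v[i + 1], &v[i]) {
        v.swap(i, i + 1);
    }
}

// A stable sorting network for 3 elements using only adjacent wires:
// layers (0,1), (1,2), (0,1). It always performs exactly 3 comparisons
// regardless of the input, which is what enables straight-line code.
fn sort3_stable<T, F: FnMut(&T, &T) -> bool>(v: &mut [T], is_less: &mut F) {
    assert!(v.len() == 3);
    swap_adjacent_if_less(v, 0, is_less);
    swap_adjacent_if_less(v, 1, is_less);
    swap_adjacent_if_less(v, 0, is_less);
}
```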
A little note on the outstandingly good results previously achieved by Firestorm: later work revealed they were better than Zen3 mostly by emitting

Here are the interactive results: https://voultapher.github.io/sort-research-rs/results/30a063d/

(Charts: Zen3 hot-u64, Zen3 hot-f128, Firestorm hot-u64.)

I did not run the benchmarks on the Broadwell machine because it's kind of a pain to do, but if these are deemed interesting I can run them.

For hot-u64 we can see that the results yield speedups across most pattern-size combinations, tapering off at 20-30% speedup for truly random inputs, and surprisingly even better for random_dense inputs, which have good prediction, as seen in the f128 chart. Fully ascending and descending inputs see ~50% speedup, by virtue of not requiring allocation anymore. The largest speedups are found in descending inputs with len <= 20; these see up to 6x improvements.

A short excursion on ILP: this is what a single branchless swap looks like in x86-64 assembly: https://godbolt.org/z/vv48jMdcs. The

Binary size. I compiled a simple program instantiating 6 Copy type sorts, with unknown slice length, and got these results when running

And for completeness' sake, a debug build
For non-Copy types that don't

Comparison count. This metric is very important for a general-purpose sort: the user provides the comparison function and can make it arbitrarily expensive, at which point the cost of the comparison function quickly becomes the only factor that matters for how fast the sort is. For
For fully random and very dense random inputs of types that

Cold results and a simpler version. I'm still not too sure about the methodology behind the cold tests, which aim to simulate the function being called after a lot of other code has run, instead of as part of a hot benchmark loop. Many of the results don't reproduce. Maybe in total they surface some signal by sheer weight of statistics? For Zen3 we see performance degradation for size < 20; Firestorm in contrast shows performance improvements. Also, we are talking about 10-50ns slower. If the code is by necessity called a lot, will a user care about the additional 30ns? Will those who, for example, sort small slices of indices in hot code benefit from the customized implementation? I initially had a version that dispatched into more specialized functions, and while they look good in hot benchmarks, I'm not convinced that something like sort3 is worth it if the len is 3, because the additional overhead of getting to that special function is likely just not worth it in real-world usage. Especially the Zen3 results look pretty bad here. I'm looking for feedback and ideas on how we could test the effectiveness of sort_small on real-world use cases. There is always the option to revert back to just a plain insertion sort, but that can't handle reversed inputs. Alternatively, we could look into a smaller version that can detect reversed inputs, which hopefully should perform better for them. Or maybe even drop the whole pre-check insertion sort, leverage the existing pattern/streak analysis, which won't allocate now, and augment that with a small sort.

Documentation changes. I'm not sure we should mention that some types may see different behavior. What do you think?

Future work. Here are some other related topics I'm currently working on, but these will be better suited to their own PRs.
I pushed a new version; you can find the results here: https://voultapher.github.io/sort-research-rs/results/9af9473/analysis/zen3/ The major change is re-using the existing TimSort streak analysis for ascending and descending patterns. With this new version, descending inputs get sorted with minimal comparisons for any size. In addition, I implemented a new small-sort function that does an allocation-free merge sort for sizes up to 32. It was still significantly faster up to 40 elements, even for f128, but I left it at 32 to limit the growth in worst-case comparisons. Cold performance is not awesome but at least better than with the previous version. Only tested Zen3 for now. The analysis now benefits all types, including Strings. On a sidenote, I've now run into it twice that I forgot to call
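As a rough illustration of the "allocation-free merge sort for small sizes" idea: merge two already-sorted halves through a fixed-size stack buffer instead of a heap allocation. This is a safe sketch under my own naming (`merge_small`, the `N` bound); the real implementation is unsafe and branchless:

```rust
// Merge the sorted halves v[..mid] and v[mid..] using a stack buffer.
// `N` caps the supported slice length, so no heap allocation is needed.
fn merge_small<T: Copy + Ord, const N: usize>(v: &mut [T], mid: usize) {
    assert!(!v.is_empty() && v.len() <= N && mid <= v.len());
    let mut buf = [v[0]; N]; // stack scratch space, dummy-initialized
    let (mut i, mut j, mut k) = (0, mid, 0);
    while i < mid && j < v.len() {
        // `<=` keeps the merge stable: ties take the left element first.
        if v[i] <= v[j] {
            buf[k] = v[i];
            i += 1;
        } else {
            buf[k] = v[j];
            j += 1;
        }
        k += 1;
    }
    while i < mid {
        buf[k] = v[i];
        i += 1;
        k += 1;
    }
    while j < v.len() {
        buf[k] = v[j];
        j += 1;
        k += 1;
    }
    v.copy_from_slice(&buf[..v.len()]);
}
```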
@Mark-Simulacrum I think it's in a reviewable state now. As far as I can tell, the CI is unhappy because the cfg'd imports are not visible and I didn't slap
This reworks the internals of slice::sort. Mainly:

- Introduce branchless swap_next_if and optimized sortX functions
- Speed up core batch extension with sort16
- Many small tweaks to reduce the amount of branches/jumps

This commit is incomplete and MUST NOT be merged as is. It is missing Copy detection and would break uniqueness preservation of values that are being sorted.
Add new loop-based mini-merge sort for small sizes. This extends allocation-free sorting for random inputs of types that qualify, up to 32.
Force-pushed 59ac873 to 061d4e8.
@Mark-Simulacrum friendly ping. The PR is marked as waiting-on-author; however, I'm waiting for review. Is there something missing before this can be reviewed?
@rustbot ready
https://rustc-dev-guide.rust-lang.org/conventions.html#formatting-and-the-tidy-script has some documentation on this; you may be able to adjust what rust-analyzer runs via its configuration.
The no_global_oom_handling cfg removes any functions which may invoke the global OOM handling (typically this means they will panic on allocation failure). If it excluded everything then moving it up might make sense, but I think in practice it doesn't quite catch everything, so it makes sense to leave some API surface visible for usage even when it is set.
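For readers unfamiliar with that cfg, this is a sketch of how allocating helpers are typically gated; the attribute is the real one used inside the standard library's alloc crate, while the function around it is made up for illustration:

```rust
// Compiled out entirely when the standard library is built with
// `no_global_oom_handling`, because `vec!` may abort the process on
// allocation failure.
#[cfg(not(no_global_oom_handling))]
fn scratch_buffer(len: usize) -> Vec<u8> {
    vec![0; len]
}
```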
It looks like the diff on GitHub is still showing that the changes essentially fully move the merge sort. I unfortunately don't have the bandwidth to figure out a separate tool and work through moving this PR over into such a tool. If you can refactor the PR or separate it out such that the diff is smaller and focuses on changes rather than a full rewrite, that would help facilitate the review.

@rustbot author
Unify stable and unstable sort implementations in same core module

This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort, such as `insert_head`.

This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code; it *doesn't* change any of the sort-related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less`, `merge` and more.

Tagging @Mark-Simulacrum. I hope this allows progress on rust-lang#100856; by moving `merge_sort` here I hope future changes will be easier to review.
☔ The latest upstream changes (presumably #107143) made this pull request unmergeable. Please resolve the merge conflicts.
Closing for now; work will be resumed in other PRs. Most of the information in here is obsolete now, and I've gotten a lot further in optimising.
This initially started as an exploration to port fluxsort to Rust. Together with ideas from the recent libcxx sort improvement, namely optimal sorting networks, this PR aims to improve the performance of `slice::sort`.

Before submitting this PR, I wanted good answers for these two questions:

How can I know that it works correctly?

Maybe I've used it wrong, but I did not find a comprehensive set of tests for sort in the std test suite. Even simple bugs, manually added to the existing code, still passed the library/std suite. So I embarked on creating my own test and benchmark suite https://github.com/Voultapher/sort-research-rs, with a variety of test types, patterns and sizes.

How can I know it is actually faster?

Benchmarking is notoriously tricky. Along the way I've made all kinds of mistakes, ranging from false randomness, to not being representative enough, to botching type distributions, and more. In its current form the benchmark suite tests along 4 dimensions:
Input type

I chose 5 types meant to represent some of the types users will call this generic sort implementation with:

- `i32` basic type often used to test sorting algorithms.
- `u64` common type for usize on 64-bit machines. Sorting indices is very common.
- `String` larger type that is not Copy and does heap access.
- `1k` very large stack value, 1kb, not Copy.
- `f128` 16-byte stack value that is Copy but has a relatively expensive cmp implementation (sketched below).
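A sketch of what such an `f128`-like test type could look like, based purely on the description in the last bullet (16 bytes, Copy, two f64 divisions per comparison); the actual definition in sort-research-rs may differ:

```rust
// 16 bytes on the stack and Copy, but deliberately expensive to
// compare: computing each side's key costs one f64 division, so every
// comparison performs two divisions.
#[derive(Copy, Clone)]
struct F128 {
    x: f64,
    y: f64,
}

impl F128 {
    fn as_key(&self) -> f64 {
        self.x / self.y // the expensive part
    }
}

fn sort_f128(v: &mut [F128]) {
    // total_cmp provides a total order even if a key is NaN.
    v.sort_by(|a, b| a.as_key().total_cmp(&b.as_key()));
}
```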
Input pattern

I chose 11 input patterns. Sorting algorithms behave wildly differently based on the input pattern, and most high-performance implementations try to be adaptive in some way.

For more details on the input patterns, take a look at their implementation; one plausible example is sketched below.
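The `pipe_organ` pattern referenced later in this description rises and then falls; a plausible generator (my reconstruction, the suite's exact code may differ):

```rust
// pipe_organ: ascending first half, descending second half, e.g.
// [0, 1, 2, 2, 1, 0]. Streak-detecting merge sorts handle it very well.
fn pipe_organ(len: usize) -> Vec<i32> {
    let mid = len / 2;
    (0..mid as i32)
        .chain((0..(len - mid) as i32).rev())
        .collect()
}
```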
Input size
While in general sorting random data should be O(n log n), the specifics might be quite different, so it's important to get a representative set of test sizes. I chose these:
Hot/cold prediction state

Modern CPUs are highly out-of-order and speculative, to extract as much Instruction Level Parallelism (ILP) as possible. This is highly dependent on speculation. And while the common way to do benchmarks is good for getting reliable standalone numbers, their magnitude might be misleading in programs that do more than just calling sort in a tight loop. So I run every benchmark twice: once via criterion `iter_batched` (hot), and once with `PerIteration` batching plus a function in-between that attempts to flush the Branch Target Buffer (BTB) and does a syscall for good measure. The cold benchmarks are not necessarily representative of the absolute gains in a particular application, nor are the hot results; but together they give a decent range of possible speedups.
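A minimal sketch of that hot/cold split using criterion's `iter_batched` API; the BTB-flushing helper from the suite is omitted, and `BatchSize::PerIteration` simply forces one setup per measured call:

```rust
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
use rand::seq::SliceRandom;

fn bench_sort(c: &mut Criterion) {
    let setup = || {
        let mut v: Vec<u64> = (0..1000).collect();
        v.shuffle(&mut rand::thread_rng());
        v
    };

    // Hot: sorts run back-to-back, so predictors stay trained on sort.
    c.bench_function("sort_hot", |b| {
        b.iter_batched(setup, |mut v| v.sort(), BatchSize::SmallInput)
    });

    // Cold-ish: one setup per iteration; the real suite additionally
    // perturbs branch predictor state between calls.
    c.bench_function("sort_cold", |b| {
        b.iter_batched(setup, |mut v| v.sort(), BatchSize::PerIteration)
    });
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);
```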
What makes a sort algorithm fast

Talking in depth about all kinds of sort algorithms would go too far here, especially given that this PR does not change the underlying algorithm, but merely augments it with branchless code that can extract more ILP. Broadly speaking, a generic sorting algorithm's runtime will depend on 3 factors: how expensive the type is to compare, how expensive it is to move, and how cheap and predictable its memory access is.

For example, a `u64` is very cheap to compare, very cheap to move, and many of them fit into a cache line; and only that cache line is needed to compare them. A type like `u64` will give you pretty close to the maximum performance of the sorting algorithm. In contrast, `String` is potentially expensive to compare, relatively cheap to move, and access is hard to predict. And a type like `f128` (just a name used for testing, not an official literal) is rather expensive to compare (it does two f64 divisions for each comparison), is cheap to move, and access is easily predicted. Given the vast freedom Rust users have, the sort algorithm can only do a best effort for common use cases and be reasonably good in others. If you decompress files with each comparison, that will be your bottleneck. Rust users can implement arbitrarily expensive `is_less` functions, and the key metric for such cases is how many comparisons are performed by the implementation. As I was developing the optimisations, I took great care to ensure that the result was not only faster, but also never significantly slower, for example by measuring the total comparisons done for each type + pattern + size combination (a simple way to do this is sketched below).

For non-Copy types the total comparisons done are roughly equal. For Copy types and random input, it's 6% more on average. While at first it might seem intuitive that more comparisons mean higher runtime, that's not true, for various reasons, mostly cache access. For example, `sort_unstable` does on average 14% more comparisons for random input and input size > 20. There are even some pathological inputs such as `pipe_organ` where the merge sort with streak analysis is vastly superior: for `u64-pipe_organ-10000` the stable sort performs 20k comparisons while unstable sort does 130k. If your type is expensive to compare, unstable sort will quickly lose its advantage.
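A simple way to reproduce such comparison counts is to wrap the comparator and count its invocations. This is a hypothetical helper of mine, not the suite's actual harness:

```rust
use std::cell::Cell;

// Count how often the comparator runs for a given input.
fn count_comparisons<T: Ord + Clone>(input: &[T]) -> u64 {
    let count = Cell::new(0u64);
    let mut v = input.to_vec();
    v.sort_by(|a, b| {
        count.set(count.get() + 1);
        a.cmp(b)
    });
    count.get()
}

fn main() {
    // A pipe_organ-like input: ascending, then descending.
    let data: Vec<u64> = (0..5_000).chain((0..5_000).rev()).collect();
    println!("stable sort comparisons: {}", count_comparisons(&data));
}
```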
Benchmark results

With the 4 input dimensions, we get 2.5k individual benchmarks. I ran them on 3 different microarchitectures. Comparing new_stable with std_stable gives us 15k data points. That's a lot, and requires some kind of analysis. The raw results are here; feel free to crunch the numbers yourself, I probably made some mistakes.
TLDR:
Test setup:
Compiler: rustc 1.65.0-nightly (29e4a9e 2022-08-10)
Machines:
hot-u64 (Zen3)
This plots the `hot-u64` results. Everything above 0 means the new implementation was faster, and everything below means the current version was faster. Check out an interactive version here; you can hover over all the values to see the relative speedup. Note, the speedup is calculated symmetrically, so it doesn't matter which 'side' you view it from.

hot-u64-random <= 20 (Firestorm)
Interactive version here; you have to scroll down a bit to get to random, but the other ones are interesting too, of course.
hot-u64 (Firestorm)
Here you can see what I suspect is the really wide core flexing its ILP capabilities, speeding up sorting a slice of 16 random `u64` by 2.4x.

hot-u64 (Broadwell)
Even on older hardware the results look promising.
You can explore all results here. Note: I accidentally forgot to add 16 as a test size when running the benchmarks on the Zen3 machine; the other two have it included. But from what I can see, the difference is not too big.
Outstanding work

- Getting the `is_copy` check to work within the standard library. I used specialisation in my repo, but I'm not sure what the right path forward here would be. On the surface it seems like a relatively simple property to check.
- `is_less` panics inside `sort16` during the final parity merge. With the current double copy, it retains the current behaviour of leaving `v` in a valid state and preserving all its original elements. A faster version is possible by omitting the two copy calls and directly writing the result of `parity_merge` into `arr_ptr`; however, this changes the current behaviour: should `is_less` panic, `v` will be left in a valid state, but there might be duplicate elements, losing the original set. The question is: how many people rely on such behaviour? How big of a footgun is that? Is the price worth having everyone pay it when they don't need it? Would documentation be enough?

A note on 'uniqueness preservation'. Maybe this concept has a name, but I don't know it. I did experiments allowing the `sort16` approach for non-Copy types; however, I saw slowdowns for such types, even when limiting it to relatively cheap-to-move types. I suspect Copy is a pretty good heuristic for types that are cheap to compare, move and access. But what crucially makes them the only ones applicable is not panic safety, as I initially thought. With the two extra copies `sort16` does, which could be done only for non-Copy types, that part is solved. However, which memory location is dereferenced and used to call `is_less` is the tricky part. By creating a shallow copy of the type and using that address to call `is_less`, if this value is then not copied back but used along the chain as the 'unique' object as the code progresses, you could observe that, for all intents and purposes, the type that is in your mutable slice is the one you put in there. But if you self-modify inside `is_less`, this self-modification can be lost. That's exactly what `parity_merge8` and `parity_merge` do: the way they sweep along the slice, overlapping between 'iterations', they might compare something from src, copy it, and then use src again for a comparison; but later `dest` is copied back into `v`, which has only seen one of the comparisons. And crucially, `Cell` is not Copy, which means such logical footguns are not possible with Copy types, or not to my knowledge. I wrote a test for it here.

I would love to see if this improvement has any noticeable effect on compile times. Coarse analysis showed some places where the compiler uses stable sort with suitable types.
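A small self-contained illustration of the footgun described above (my example; the PR's actual test lives in sort-research-rs): a comparator that mutates interior state through `Cell`. If an implementation compared shallow copies that are never written back, part of this side effect would silently disappear:

```rust
use std::cell::Cell;

// Non-Copy element whose comparison self-modifies via interior
// mutability. A correct sort must observe every such modification in
// the elements that end up back in the slice.
#[derive(Clone)]
struct Tracked {
    key: i32,
    cmp_count: Cell<u32>,
}

fn main() {
    let mut v: Vec<Tracked> = (0..8)
        .rev()
        .map(|key| Tracked { key, cmp_count: Cell::new(0) })
        .collect();

    v.sort_by(|a, b| {
        // Self-modification inside the comparator.
        a.cmp_count.set(a.cmp_count.get() + 1);
        b.cmp_count.set(b.cmp_count.get() + 1);
        a.key.cmp(&b.key)
    });

    // Every comparison touched two elements, so the total must be
    // exactly twice the number of comparisons performed. If shallow
    // copies were compared and discarded, counts would be lost.
    let total: u32 = v.iter().map(|t| t.cmp_count.get()).sum();
    println!("observed comparator side effects: {total}");
}
```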
Future work
This is my first time contributing to the standard library; my apologies if I missed something important or got it wrong.