Improve performance of stable sort #100856
Conversation
Hey! It looks like you've submitted a new PR for the library teams!

(rust-highfive has picked a reviewer for you, use r? to override)
That looks really nice. There is another PR, #90545, improving std::sort; did you compare the performance of these two?
@EdorianDark I'm running some benchmarks tonight; preliminary results suggest it is slightly faster than the existing sort, but significantly slower than new_stable for Copy types. From what I can tell, the discussion in that PR seems to be stuck at verification of the algorithm. One significant difference here is that this PR does not change the core algorithm. It speeds up providing a minimum sorted slice, so it should be much easier to prove to a reasonable degree that this is as correct as the previous version. And thank you for pointing out that other PR, I was not aware of it.
Plus I just discovered UB by running my test suite with Miri against the proposed changes.
Regarding the failing test, I'm not sure what the expected behaviour should be. A simplified example:

```rust
use rand::seq::SliceRandom;
use std::cmp::Ordering::{Equal, Greater, Less};

let mut rng = rand::thread_rng();

// Sort using a completely random comparison function.
// This will reorder the elements *somehow*, but won't panic.
let mut v = [0; 500];
for i in 0..v.len() {
    v[i] = i as i32;
}
v.sort_by(|_, _| *[Less, Equal, Greater].choose(&mut rng).unwrap());
v.sort();
for i in 0..v.len() {
    assert_eq!(v[i], i as i32);
}
```
The implementation of
I don’t really like those relative speedup graphs, since they paint a skewed/biased picture of what’s happening: e.g., if the new implementation is twice as fast, it’ll say +100%, but if the old implementation is twice as fast, it’ll only say -50%. Unless I understand something wrong here. Anyways, I would prefer some non-biased statistics where both algorithms are treated the same, rather than one being arbitrarily chosen to always be the basis for determining what “100%” means.

Edit: Actually… I just noticed numbers below -100% on this page, so perhaps the situation already is different from what I assumed, and below zero the roles switch?
@frank I fully agree about 'directional' speedup graphs. This one is deliberately symmetric: +100% means the new one is twice as fast as the old one, and -100% means the old one is twice as fast as the new one. IIRC I mentioned this property in the PR description.
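For illustration, here is one way such a symmetric metric can be computed. This is a sketch of the property described above, not necessarily the exact formula used by the linked analysis scripts:

```rust
/// Symmetric relative speedup in percent: +100.0 means `new` is twice
/// as fast as `old`, and -100.0 means `old` is twice as fast as `new`.
/// Both directions are treated the same, unlike a plain ratio.
fn symmetric_speedup_percent(old_ns: f64, new_ns: f64) -> f64 {
    if new_ns <= old_ns {
        (old_ns / new_ns - 1.0) * 100.0
    } else {
        -(new_ns / old_ns - 1.0) * 100.0
    }
}
```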
@Voultapher Oh… right, you did mention there that it’s symmetric. Maybe it’d be easier to find if it was mentioned inside of the graph, too 😇 On a related note, it isn’t clear to me whether the linked results about the number of comparisons use a symmetrical or an asymmetrical percentage. (I would probably assume those are asymmetrical though, looking at the numbers so far.)
[ @Voultapher by the way, you had tagged the wrong person; careful with using the @-sign in e-mail ;-) – also, since you’ve responded by e-mail you might have missed my edit on my first comment, where I had already noticed some values below -100% on some linked graphs ]
Indeed, I currently have no access to my computer and will take a closer look on Monday. Good point about tagging; I didn't realize it turns into an actual tag, my apologies to whoever I've tagged now. Regarding the comparison diff, it's not symmetric. Back when I wrote that analysis code I hadn't yet figured out a good way to visualize the difference symmetrically. If deemed important I can look at it again.
@Mark-Simulacrum I just noticed you changed the label from waiting-on-review to waiting-on-author. I'm waiting on review; regarding the failing CI, I commented about it here: #100856 (comment). I could of course disable the failing test, but as I noted earlier, I'm not sure what the expected behaviour ought to be.
The code sample in #100856 (comment) has both a sort_by (which is random) followed by a regular sort() call. I think any changes in this PR should preserve behavior: even an incorrect Ord/PartialOrd impl shouldn't cause us to end up with a different set of elements in the array, so the subsequent sort should bring it back in order. Changing the specific results of the sort_by seems OK though; we don't need to guarantee the same set of cmp(..) calls or the exact same results, so long as we still end up with a stable ordering.

As an aside, it looks like this PR's diff currently replaces the main body of the implementation too, even if it doesn't actually change it (hard to tell). As-is, it would require basically a fresh review of the whole sort algorithm, which won't happen soon.
Good point, it seems that items are being duplicated and others are silently leaked. With types that impl Drop this might lead to a double free.
Please read the comments and implementation. This would only happen for types that implement Copy.
I would assume that it’s a sane use-case of
I wonder if the faster sorting could be enabled for a set of trusted types like
Yeah, just because a type is Copy doesn't mean we can ignore such a basic property of sort (the exit data is the same as the entry data, just reordered). Raw pointers are an example of how that can go wrong in practice, but even sorting &T would run into this. My guess is that many users calling sort_by may actually have an incorrect ordering function, and we shouldn't break entirely on that. If we panicked, that would be a plausible discussion, but randomly duplicating elements isn't acceptable.
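To make the concern concrete, here is a small example (mine, not from the thread) of the kind of "incorrect but sane" comparator being discussed, on a Copy type: sorting f64 with a comparator that treats NaN as equal to everything is not a total order, yet the output must still be a permutation of the input:

```rust
use std::cmp::Ordering;

fn main() {
    // f64 is Copy, but this comparator is not a strict total order:
    // NaN compares as Equal to every other value here.
    let mut v = vec![3.0, f64::NAN, 1.0, 2.0, f64::NAN];
    let len_before = v.len();

    v.sort_by(|a, b| a.partial_cmp(b).unwrap_or(Ordering::Equal));

    // Whatever order results, the slice must still hold the same five
    // elements: no duplication, no loss.
    assert_eq!(v.len(), len_before);
    assert_eq!(v.iter().filter(|x| x.is_nan()).count(), 2);
}
```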
It’s hard to draw a line.
☔ The latest upstream changes (presumably #101816) made this pull request unmergeable. Please resolve the merge conflicts.
FYI, I have found a viable alternative that completely avoids the problem by using sorting networks. I'm still researching potential avenues for further improvement and will post the results here once these are explored.
Force-pushed 98cc003 to a3cabd3.
I'm back with an improved version that is completely free of any element duplication issues. It achieves this mainly by using a stable sorting-network to speed up
Achieving them all simultaneously is hard.

Before we dive into the benchmark results: I wrote a test to explicitly verify that element duplication won't happen anymore, even in the presence of invalid Ord implementations.

Performance improvements. Initially I experimented with bitonic merge networks to replace the parity_merge functions; however, these turned out to require a substantial number of additional comparisons. A collection of best known sorting-networks can be found here: https://bertdobbelaere.github.io/sorting_networks.html. These describe ways to sort data without changing control flow depending on the result of each comparison, which can be implemented without jump instructions. Sorting-networks are great at leveraging Instruction-Level Parallelism (ILP): they expose multiple comparisons in straight-line code, with built-in data-dependency parallelism and ordering per layer. Up to a certain point they complete a sort on random data in fewer comparisons than an insertion sort does; e.g. sorting 20 elements takes the network 91 comparisons vs. 111 for insertion sort. For some patterns this is not ideal, however, such as already-sorted inputs: the sorting-network will always finish after the same number of comparisons. None of these optimal sorting networks are applicable for a stable sort, as they require swapping non-adjacent wires, but there are stable sorting-networks. I experimented with hybrid approaches, but ultimately these don't seem to be worth it. The hundred or so hours I've spent since the last iteration cannot be adequately summed up in a couple of words, so I'll leave it at that. Here are some of the core changes:

Implementation:

Analysis:
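To make the sorting-network idea above concrete, here is a minimal safe sketch of a stable 3-element network built only from adjacent compare-exchanges. It is illustrative only: the function names are mine, and the PR's real code uses unsafe, branchless (cmov-style) swaps rather than this branching version:

```rust
// Compare-exchange of adjacent elements. Swapping only when the right
// element is strictly less keeps equal elements in order, i.e. stable.
fn swap_adjacent_if_less<T, F: FnMut(&T, &T) -> bool>(v: &mut [T], i: usize, is_less: &mut F) {
    if is_less(&v[i + 1], &v[i]) {
        v.swap(i, i + 1);
    }
}

// A stable sorting network for 3 elements using only adjacent wires:
// layers (0,1), (1,2), (0,1). It always performs exactly 3 comparisons
// regardless of the input, which is what enables straight-line code.
fn sort3_stable<T, F: FnMut(&T, &T) -> bool>(v: &mut [T], is_less: &mut F) {
    assert!(v.len() == 3);
    swap_adjacent_if_less(v, 0, is_less);
    swap_adjacent_if_less(v, 1, is_less);
    swap_adjacent_if_less(v, 0, is_less);
}
```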
A little note on the outstandingly good results previously achieved by Firestorm: later work revealed they were better than Zen3 mostly by emitting

Here are the interactive results: https://voultapher.github.io/sort-research-rs/results/30a063d/

(Charts: Zen3 hot-u64, Zen3 hot-f128, Firestorm hot-u64.)

I did not run the benchmarks on the Broadwell machine because it's kind of a pain to do, but if these are deemed interesting I can run them.

For hot-u64 we can see that the results yield speedups across most pattern-size combinations, tapering off at 20-30% speedup for truly random inputs, and surprisingly even better for random_dense inputs, which have good prediction, as seen in the f128 chart. Fully ascending and descending inputs see ~50% speedup, by virtue of not requiring allocation anymore. The largest speedups are found in descending inputs with len <= 20; these see up to 6x improvements.

A short excursion on ILP: this is what a single branchless swap looks like in x86-64 assembly: https://godbolt.org/z/vv48jMdcs. The

Binary size. I compiled a simple program instantiating 6 Copy type sorts, with unknown slice length, and got these results when running

And for completeness' sake, a debug build
For non-Copy types that don't

Comparison count. This metric is very important for a general-purpose sort: the user provides the comparison function and can make it arbitrarily expensive, at which point the cost of the comparison function quickly becomes the only factor that matters for how fast the sort is. For
For fully random and very dense random inputs of types that

Cold results and a simpler version. I'm still not too sure about the methodology behind the cold tests, which aim to simulate the function being called after a lot of other code has run, instead of as part of a hot benchmark loop. Many of the results don't reproduce. Maybe in total they surface some signal by sheer weight of statistics? For Zen3 we see performance degradation for size < 20; Firestorm in contrast shows performance improvements. Also, we are talking about 10-50ns slower. If the code is by necessity called a lot, will a user care about the additional 30ns? Will those who, for example, sort small slices of indices in hot code benefit from the customized implementation? I initially had a version that dispatched into more specialized functions, and while they look good in hot benchmarks, I'm not convinced that something like sort3 is worth it if the len is 3, because the additional overhead of getting to that special function is likely just not worth it in real-world usage. Especially the Zen3 results look pretty bad here. I'm looking for feedback and ideas on how we could test the effectiveness of sort_small on real-world use cases. There is always the option to revert back to just a plain insertion sort, but that can't handle reversed inputs. Alternatively, we could look into a smaller version that can detect reversed inputs, which hopefully should perform better for them. Or maybe even drop the whole pre-check insertion sort, leverage the existing pattern/streak analysis, which won't allocate now, and augment that with a small sort.

Documentation changes. I'm not sure we should mention that some types may see different behavior. What do you think?

Future work. Here are some other related topics I'm currently working on, but these will be better suited to their own PRs.
I pushed a new version; you can find the results here: https://voultapher.github.io/sort-research-rs/results/9af9473/analysis/zen3/ The major change is re-using the existing TimSort streak analysis for ascending and descending patterns. With this new version, descending inputs get sorted with minimal comparisons for any size. In addition, I implemented a new small-sort function that does an allocation-free merge sort for sizes up to 32. It was still significantly faster up to 40 elements, even for f128, but I left it at 32 to limit the growth in worst-case comparisons. Cold performance is not awesome but at least better than with the previous version. Only tested Zen3 for now. The analysis now benefits all types, including Strings. On a sidenote, I've now run into it twice that I forgot to call
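As a rough illustration of the "allocation-free merge sort for small sizes" idea: merge two already-sorted halves through a fixed-size stack buffer instead of a heap allocation. This is a safe sketch under my own naming (`merge_small`, the `N` bound); the real implementation is unsafe and branchless:

```rust
// Merge the sorted halves v[..mid] and v[mid..] using a stack buffer.
// `N` caps the supported slice length, so no heap allocation is needed.
fn merge_small<T: Copy + Ord, const N: usize>(v: &mut [T], mid: usize) {
    assert!(!v.is_empty() && v.len() <= N && mid <= v.len());
    let mut buf = [v[0]; N]; // stack scratch space, dummy-initialized
    let (mut i, mut j, mut k) = (0, mid, 0);
    while i < mid && j < v.len() {
        // `<=` keeps the merge stable: ties take the left element first.
        if v[i] <= v[j] {
            buf[k] = v[i];
            i += 1;
        } else {
            buf[k] = v[j];
            j += 1;
        }
        k += 1;
    }
    while i < mid {
        buf[k] = v[i];
        i += 1;
        k += 1;
    }
    while j < v.len() {
        buf[k] = v[j];
        j += 1;
        k += 1;
    }
    v.copy_from_slice(&buf[..v.len()]);
}
```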
@Mark-Simulacrum I think it's in a reviewable state now. As far as I can tell, the CI is unhappy because the cfg'd imports are not visible and I didn't slap
This reworks the internals of slice::sort. Mainly:

- Introduce branchless swap_next_if and optimized sortX functions
- Speed up core batch extension with sort16
- Many small tweaks to reduce the amount of branches/jumps

This commit is incomplete and MUST NOT be merged as is. It is missing Copy detection and would break uniqueness preservation of values that are being sorted.
Add new loop-based mini-merge sort for small sizes. This extends allocation-free sorting for random inputs of types that qualify, up to 32.
Force-pushed 59ac873 to 061d4e8.
@Mark-Simulacrum friendly ping. The PR is marked as waiting-on-author; however, I'm waiting for review. Is there something missing before this can be reviewed?
@rustbot ready
https://rustc-dev-guide.rust-lang.org/conventions.html#formatting-and-the-tidy-script has some documentation on this; you may be able to adjust what rust-analyzer runs via its configuration.
The no_global_oom_handling cfg removes any functions which may invoke the global OOM handling (typically this means they will panic on allocation failure). If it excluded everything then moving it up might make sense, but I think in practice it doesn't quite catch everything, so it makes sense to leave some API surface visible for usage even when it is set.
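For readers unfamiliar with that cfg, this is a sketch of how allocating helpers are typically gated; the attribute is the real one used inside the standard library's alloc crate, while the function around it is made up for illustration:

```rust
// Compiled out entirely when the standard library is built with
// `no_global_oom_handling`, because `vec!` may abort the process on
// allocation failure.
#[cfg(not(no_global_oom_handling))]
fn scratch_buffer(len: usize) -> Vec<u8> {
    vec![0; len]
}
```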
It looks like the diff on GitHub is still showing that the changes essentially fully move the merge sort. I unfortunately don't have the bandwidth to figure out a separate tool and work through moving this PR over into such a tool. If you can refactor the PR or separate it out such that the diff is smaller and focuses on changes rather than a full rewrite, that would help facilitate the review.

@rustbot author
Unify stable and unstable sort implementations in same core module

This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort, such as `insert_head`.

This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code; it *doesn't* change any of the sort-related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less`, `merge` and more.

Tagging @Mark-Simulacrum. I hope this allows progress on rust-lang#100856; by moving `merge_sort` here I hope future changes will be easier to review.
☔ The latest upstream changes (presumably #107143) made this pull request unmergeable. Please resolve the merge conflicts.
Closing for now; work will be resumed in other PRs. Most of the information in here is obsolete now, and I've gotten a lot further in optimising.
This initially started as an exploration to port fluxsort to Rust. Together with ideas from the recent libcxx sort improvement, namely optimal sorting networks, this PR aims to improve the performance of `slice::sort`.

Before submitting this PR, I wanted good answers for these two questions:

How can I know that it works correctly?

Maybe I've used it wrong, but I did not find a comprehensive set of tests for sort in the std test suite. Even simple bugs, manually added to the existing code, still passed the library/std suite. So I embarked on creating my own test and benchmark suite https://github.com/Voultapher/sort-research-rs, with a variety of test types, patterns and sizes.

How can I know it is actually faster?

Benchmarking is notoriously tricky. Along the way I've made all kinds of mistakes, ranging from false randomness, to not being representative enough, to botching type distributions, and more. In its current form the benchmark suite tests along 4 dimensions:
Input type

I chose 5 types meant to represent some of the types users will call this generic sort implementation with:

- `i32` basic type often used to test sorting algorithms.
- `u64` common type for usize on 64-bit machines. Sorting indices is very common.
- `String` larger type that is not Copy and does heap access.
- `1k` very large stack value, 1kb, not Copy.
- `f128` 16-byte stack value that is Copy but has a relatively expensive cmp implementation (sketched below).
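A sketch of what such an `f128`-like test type could look like, based purely on the description in the last bullet (16 bytes, Copy, two f64 divisions per comparison); the actual definition in sort-research-rs may differ:

```rust
// 16 bytes on the stack and Copy, but deliberately expensive to
// compare: computing each side's key costs one f64 division, so every
// comparison performs two divisions.
#[derive(Copy, Clone)]
struct F128 {
    x: f64,
    y: f64,
}

impl F128 {
    fn as_key(&self) -> f64 {
        self.x / self.y // the expensive part
    }
}

fn sort_f128(v: &mut [F128]) {
    // total_cmp provides a total order even if a key is NaN.
    v.sort_by(|a, b| a.as_key().total_cmp(&b.as_key()));
}
```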
Input pattern

I chose 11 input patterns. Sorting algorithms behave wildly differently based on the input pattern, and most high-performance implementations try to be adaptive in some way.

For more details on the input patterns, take a look at their implementation; one plausible example is sketched below.
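The `pipe_organ` pattern referenced later in this description rises and then falls; a plausible generator (my reconstruction, the suite's exact code may differ):

```rust
// pipe_organ: ascending first half, descending second half, e.g.
// [0, 1, 2, 2, 1, 0]. Streak-detecting merge sorts handle it very well.
fn pipe_organ(len: usize) -> Vec<i32> {
    let mid = len / 2;
    (0..mid as i32)
        .chain((0..(len - mid) as i32).rev())
        .collect()
}
```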
Input size
While in general sorting random data should be O(n log n), the specifics might be quite different, so it's important to get a representative set of test sizes. I chose these:
Hot/cold prediction state

Modern CPUs are highly out-of-order and speculative, to extract as much Instruction Level Parallelism (ILP) as possible. This is highly dependent on speculation. And while the common way to do benchmarks is good for getting reliable standalone numbers, their magnitude might be misleading in programs that do more than just calling sort in a tight loop. So I run every benchmark twice: once via criterion `iter_batched` (hot), and once with `PerIteration` batching plus a function in-between that attempts to flush the Branch Target Buffer (BTB) and does a syscall for good measure. The cold benchmarks are not necessarily representative of the absolute gains in a particular application, nor are the hot results; but together they give a decent range of possible speedups.
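A minimal sketch of that hot/cold split using criterion's `iter_batched` API; the BTB-flushing helper from the suite is omitted, and `BatchSize::PerIteration` simply forces one setup per measured call:

```rust
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
use rand::seq::SliceRandom;

fn bench_sort(c: &mut Criterion) {
    let setup = || {
        let mut v: Vec<u64> = (0..1000).collect();
        v.shuffle(&mut rand::thread_rng());
        v
    };

    // Hot: sorts run back-to-back, so predictors stay trained on sort.
    c.bench_function("sort_hot", |b| {
        b.iter_batched(setup, |mut v| v.sort(), BatchSize::SmallInput)
    });

    // Cold-ish: one setup per iteration; the real suite additionally
    // perturbs branch predictor state between calls.
    c.bench_function("sort_cold", |b| {
        b.iter_batched(setup, |mut v| v.sort(), BatchSize::PerIteration)
    });
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);
```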
What makes a sort algorithm fast

Talking in depth about all kinds of sort algorithms would go too far here, especially given that this PR does not change the underlying algorithm, but merely augments it with branchless code that can extract more ILP. Broadly speaking, a generic sorting algorithm's runtime will depend on 3 factors: how expensive the type is to compare, how expensive it is to move, and how cheap and predictable its memory access is.

For example, a `u64` is very cheap to compare, very cheap to move, and many of them fit into a cache line; and only that cache line is needed to compare them. A type like `u64` will give you pretty close to the maximum performance of the sorting algorithm. In contrast, `String` is potentially expensive to compare, relatively cheap to move, and access is hard to predict. And a type like `f128` (just a name used for testing, not an official literal) is rather expensive to compare (it does two f64 divisions for each comparison), is cheap to move, and access is easily predicted. Given the vast freedom Rust users have, the sort algorithm can only do a best effort for common use cases and be reasonably good in others. If you decompress files with each comparison, that will be your bottleneck. Rust users can implement arbitrarily expensive `is_less` functions, and the key metric for such cases is how many comparisons are performed by the implementation. As I was developing the optimisations, I took great care to ensure that the result was not only faster, but also never significantly slower, for example by measuring the total comparisons done for each type + pattern + size combination (a simple way to do this is sketched below).

For non-Copy types the total comparisons done are roughly equal. For Copy types and random input, it's 6% more on average. While at first it might seem intuitive that more comparisons mean higher runtime, that's not true, for various reasons, mostly cache access. For example, `sort_unstable` does on average 14% more comparisons for random input and input size > 20. There are even some pathological inputs such as `pipe_organ` where the merge sort with streak analysis is vastly superior: for `u64-pipe_organ-10000` the stable sort performs 20k comparisons while unstable sort does 130k. If your type is expensive to compare, unstable sort will quickly lose its advantage.
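A simple way to reproduce such comparison counts is to wrap the comparator and count its invocations. This is a hypothetical helper of mine, not the suite's actual harness:

```rust
use std::cell::Cell;

// Count how often the comparator runs for a given input.
fn count_comparisons<T: Ord + Clone>(input: &[T]) -> u64 {
    let count = Cell::new(0u64);
    let mut v = input.to_vec();
    v.sort_by(|a, b| {
        count.set(count.get() + 1);
        a.cmp(b)
    });
    count.get()
}

fn main() {
    // A pipe_organ-like input: ascending, then descending.
    let data: Vec<u64> = (0..5_000).chain((0..5_000).rev()).collect();
    println!("stable sort comparisons: {}", count_comparisons(&data));
}
```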
Benchmark results

With the 4 input dimensions, we get 2.5k individual benchmarks. I ran them on 3 different microarchitectures. Comparing new_stable with std_stable gives us 15k data points. That's a lot, and requires some kind of analysis. The raw results are here; feel free to crunch the numbers yourself, I probably made some mistakes.
TLDR:
Test setup:
Compiler: rustc 1.65.0-nightly (29e4a9e 2022-08-10)
Machines:
hot-u64 (Zen3)
This plots the `hot-u64` results. Everything above 0 means the new implementation was faster, and everything below means the current version was faster. Check out an interactive version here; you can hover over all the values to see the relative speedup. Note, the speedup is calculated symmetrically, so it doesn't matter which 'side' you view it from.

hot-u64-random <= 20 (Firestorm)
Interactive version here; you have to scroll down a bit to get to random, but the other ones are interesting too, of course.
hot-u64 (Firestorm)
Here you can see what I suspect is the really wide core flexing its ILP capabilities, speeding up sorting a slice of 16 random `u64` by 2.4x.

hot-u64 (Broadwell)
Even on older hardware the results look promising.
You can explore all results here. Note: I accidentally forgot to add 16 as a test size when running the benchmarks on the Zen3 machine; the other two have it included. But from what I can see, the difference is not too big.
Outstanding work

- Getting the `is_copy` check to work within the standard library. I used specialisation in my repo, but I'm not sure what the right path forward here would be. On the surface it seems like a relatively simple property to check.
- `is_less` panics inside `sort16` during the final parity merge. With the current double copy, it retains the current behaviour of leaving `v` in a valid state and preserving all its original elements. A faster version is possible by omitting the two copy calls and directly writing the result of `parity_merge` into `arr_ptr`; however, this changes the current behaviour: should `is_less` panic, `v` will be left in a valid state, but there might be duplicate elements, losing the original set. The question is: how many people rely on such behaviour? How big of a footgun is that? Is the price worth having everyone pay it when they don't need it? Would documentation be enough?

A note on 'uniqueness preservation'. Maybe this concept has a name, but I don't know it. I did experiments allowing the `sort16` approach for non-Copy types; however, I saw slowdowns for such types, even when limiting it to relatively cheap-to-move types. I suspect Copy is a pretty good heuristic for types that are cheap to compare, move and access. But what crucially makes them the only ones applicable is not panic safety, as I initially thought. With the two extra copies `sort16` does, which could be done only for non-Copy types, that part is solved. However, which memory location is dereferenced and used to call `is_less` is the tricky part. By creating a shallow copy of the type and using that address to call `is_less`, if this value is then not copied back but used along the chain as the 'unique' object as the code progresses, you could observe that, for all intents and purposes, the type that is in your mutable slice is the one you put in there. But if you self-modify inside `is_less`, this self-modification can be lost. That's exactly what `parity_merge8` and `parity_merge` do: the way they sweep along the slice, overlapping between 'iterations', they might compare something from src, copy it, and then use src again for a comparison; but later `dest` is copied back into `v`, which has only seen one of the comparisons. And crucially, `Cell` is not Copy, which means such logical footguns are not possible with Copy types, or not to my knowledge. I wrote a test for it here.

I would love to see if this improvement has any noticeable effect on compile times. Coarse analysis showed some places where the compiler uses stable sort with suitable types.
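A small self-contained illustration of the footgun described above (my example; the PR's actual test lives in sort-research-rs): a comparator that mutates interior state through `Cell`. If an implementation compared shallow copies that are never written back, part of this side effect would silently disappear:

```rust
use std::cell::Cell;

// Non-Copy element whose comparison self-modifies via interior
// mutability. A correct sort must observe every such modification in
// the elements that end up back in the slice.
#[derive(Clone)]
struct Tracked {
    key: i32,
    cmp_count: Cell<u32>,
}

fn main() {
    let mut v: Vec<Tracked> = (0..8)
        .rev()
        .map(|key| Tracked { key, cmp_count: Cell::new(0) })
        .collect();

    v.sort_by(|a, b| {
        // Self-modification inside the comparator.
        a.cmp_count.set(a.cmp_count.get() + 1);
        b.cmp_count.set(b.cmp_count.get() + 1);
        a.key.cmp(&b.key)
    });

    // Every comparison touched two elements, so the total must be
    // exactly twice the number of comparisons performed. If shallow
    // copies were compared and discarded, counts would be lost.
    let total: u32 = v.iter().map(|t| t.cmp_count.get()).sum();
    println!("observed comparator side effects: {total}");
}
```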
Future work
This is my first time contributing to the standard library; my apologies if I missed something important or got it wrong.