
Improve performance of stable sort #100856

Closed
wants to merge 5 commits

Conversation

@Voultapher (Contributor)

This reworks the internals of slice::sort. Mainly:

  • Introduce branchless swap_next_if and optimized sortX functions
  • Speedup core batch extension with sort16
  • Many small tweaks to reduce the amount of branches/jumps

This commit is incomplete and MUST NOT be merged as is. It is missing Copy
detection and would break uniqueness preservation of values that are being
sorted.


This initially started as an exploration to port fluxsort to Rust. Together with ideas from the recent libcxx sort improvement, namely optimal sorting networks, this PR aims to improve the performance of slice::sort.

Before submitting this PR, I wanted good answers for these two questions:

  • How can I know that it works correctly?
  • How can I know it is actually faster?

Maybe I've been using it wrong, but I did not find a comprehensive set of tests for sort in the std test suite. Even simple bugs, manually added to the existing code, still passed the library/std suite. So I embarked on creating my own test and benchmark suite, https://github.com/Voultapher/sort-research-rs, with a variety of test types, patterns and sizes.

How can I know that it works correctly?

  • The new implementation is tested with a variety of tests, varying types, input sizes and test logic
  • miri does not complain when running the tests
  • Repeatedly running the tests with random inputs has not yielded failures (in its current form; while developing I had many bugs)
  • Careful analysis of and reasoning about the code, augmented with comments
  • debug_asserts for invariants where doing so is reasonably cheap
  • Review process via PR
  • Potential nightly adoption and feedback

How can I know it is actually faster?

Benchmarking is notoriously tricky. Along the way I've made all kinds of mistakes, ranging from flawed randomness, to not being representative enough, to botched type distributions and more. In its current form the benchmark suite tests along 4 dimensions:

  • Input type
  • Input pattern
  • Input size
  • hot/cold prediction state

Input type

I chose 5 types that are meant to represent some of the types users will call this generic sort implementation with:

  • i32: a basic type often used to test sorting algorithms.
  • u64: the common type behind usize on 64-bit machines; sorting indices is very common.
  • String: a larger type that is not Copy and does heap access.
  • 1k: a very large (1 KiB) stack value, not Copy.
  • f128: a 16-byte stack value that is Copy but has a relatively expensive cmp implementation.

Input pattern

I chose 11 input patterns. Sorting algorithms behave wildly differently based on the input pattern, and most high-performance implementations try to be adaptive in some way.

  • random
  • random_uniform
  • random_random_size
  • all_equal
  • ascending
  • descending
  • ascending_saw_5
  • ascending_saw_20
  • descending_saw_5
  • descending_saw_20
  • pipe_organ

For more details on the input patterns, take a look at their implementation.

Input size

While sorting random data should in general be O(n log(n)), the specifics might be quite different. So it's important to get a representative set of test sizes. I chose these:

let test_sizes = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 16, 17, 19, 20, 24, 36, 50, 101, 200, 500, 1_000,
    2_048, 10_000, 100_000, 1_000_000,
];

hot/cold prediction state

Modern CPUs are highly out-of-order and speculative to extract as much Instruction-Level Parallelism (ILP) as possible, and this depends heavily on speculation state. While the common way to do benchmarks is good for getting reliable standalone numbers, their magnitude might be misleading in programs that do more than call sort in a tight loop. So I run every benchmark twice: once via criterion's iter_batched (hot), and once via iter_batched with BatchSize::PerIteration plus a function in between that attempts to flush the Branch Target Buffer (BTB) and does a syscall for good measure (cold). The cold benchmarks are not necessarily representative of the absolute gains in a particular application, nor are the hot results. But together they give a decent range of possible speedups.
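As a sketch of that setup (criterion's iter_batched and BatchSize are the real APIs; flush_branch_prediction_state is a hypothetical stand-in for the suite's flushing routine):

use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
use rand::{seq::SliceRandom, thread_rng};

// Hypothetical stand-in: execute a large amount of unpredictable branches
// and make a syscall, to disturb the branch predictor state.
fn flush_branch_prediction_state() { /* ... */ }

fn bench_sort(c: &mut Criterion) {
    let base: Vec<u64> = (0..1_000).collect();

    // Hot: the predictors stay trained on sort across iterations.
    c.bench_function("sort-hot", |b| {
        b.iter_batched(
            || {
                let mut v = base.clone();
                v.shuffle(&mut thread_rng());
                v
            },
            |mut v| v.sort(),
            BatchSize::SmallInput,
        )
    });

    // Cold: one batch per iteration, with predictor state disturbed in the setup.
    c.bench_function("sort-cold", |b| {
        b.iter_batched(
            || {
                flush_branch_prediction_state();
                let mut v = base.clone();
                v.shuffle(&mut thread_rng());
                v
            },
            |mut v| v.sort(),
            BatchSize::PerIteration,
        )
    });
}

criterion_group!(benches, bench_sort);
criterion_main!(benches);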

What makes a sort algorithm fast

Talking in depth about all kinds of sort algorithms would go too far here, especially given that this does not change the underlying algorithm but merely augments it with branchless code that can extract more ILP. Broadly speaking, a generic sorting algorithm's runtime will depend on 3 factors:

  • How cheap is your type to compare?
  • How expensive is your type to move?
  • How well can access to your type be predicted?

For example, a u64 is very cheap to compare, very cheap to move, and many of them fit into a cache line, with only that cache line needed to compare them. A type like u64 will give you pretty close to the maximum performance of the sorting algorithm. In contrast, String is potentially expensive to compare, relatively cheap to move, and access is hard to predict. A type like f128 (just a name used for testing, not an official literal) is rather expensive to compare (it does two f64 divisions for each comparison), cheap to move, and access is easily predicted. Given the vast freedom Rust users have, the sort algorithm can only do a best effort for common use cases and be reasonably good in others. If you decompress files with each comparison, that will be your bottleneck. Rust users can implement arbitrarily expensive is_less functions, and the key metric for such cases is how many comparisons are performed by the implementation. As I was developing the optimisations, I took great care to ensure that the new version was not only faster, but also never significantly slower, for example by measuring the total comparisons done for each input type + pattern + size combination.
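For reference, here is a minimal sketch of what such an f128 test type could look like (an assumed definition for illustration, not necessarily the one in my suite):

use std::cmp::Ordering;

// 16 bytes on the stack, Copy, with a deliberately expensive comparison.
#[derive(Copy, Clone)]
struct F128 {
    x: f64,
    y: f64,
}

impl F128 {
    fn key(&self) -> f64 {
        // One division per operand, i.e. two f64 divisions per comparison.
        self.x / self.y
    }
}

impl PartialEq for F128 {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for F128 {}

impl PartialOrd for F128 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for F128 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.key().total_cmp(&other.key())
    }
}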

For non-Copy types the total comparisons done are roughly equal. For Copy types and random input, it's 6% more on average. While at first it might seem intuitive that more comparisons mean higher runtime, that's not true for various reasons, mostly cache access. For comparison, for random input and input size > 20, sort_unstable does on average 14% more comparisons. There are even some pathological inputs such as pipe_organ where the merge sort with streak analysis is vastly superior: for u64-pipe_organ-10000 the stable sort performs 20k comparisons while the unstable sort does 130k. If your type is expensive to compare, unstable sort will quickly lose its advantage.
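Counting comparisons is simple to reproduce: wrap the comparator and count invocations. A minimal sketch:

use std::cell::Cell;

// Returns how many comparisons a stable sort performs on this input.
fn count_comparisons<T: Ord + Clone>(input: &[T]) -> u64 {
    let count = Cell::new(0u64);
    let mut v = input.to_vec();
    v.sort_by(|a, b| {
        count.set(count.get() + 1);
        a.cmp(b)
    });
    count.get()
}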

Benchmark results

With the 4 input dimensions, we get 2.5k individual benchmarks. I ran them on 3 different microarchitectures. Comparing new_stable with std_stable gives us 15k data points. That's a lot and requires some kind of analysis. The raw results are here; feel free to crunch the numbers yourself, I probably made some mistakes.

TLDR:

  • The biggest winners are Copy types in random order. Such scenarios see 30-100% speedup.
  • Non-Copy types remain largely the same, with some outliers that I guess are dependent on memory layout, alignment and other factors.
  • Some scenarios see slowdowns, but they are pretty contained.
  • Wider and deeper microarchitectures see relatively larger wins (i.e. current and future designs).

Test setup:

Compiler: rustc 1.65.0-nightly (29e4a9e 2022-08-10)

Machines:

  • AMD 5900X (Zen3)
  • Intel i7-5500U (Broadwell)
  • Apple M1 Pro (Firestorm)

hot-u64 (Zen3)

[screenshot: benchmark plot]

This plots the hot-u64 results. Everything above 0 means the new implementation was faster and everything below means the current version was faster. Check out an interactive version here, you can hover over all the values to see the relative speedup. Note, the speedup is calculated symmetrically, so it doesn't matter which 'side' you view it from.
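Spelled out, the symmetric metric as I use it looks like this (a sketch of the formula, not the exact plotting code):

// +100% means the new version is 2x faster, -100% means it is 2x slower,
// so the magnitude is comparable no matter which side is ahead.
fn symmetric_speedup_percent(time_std: f64, time_new: f64) -> f64 {
    if time_new <= time_std {
        (time_std / time_new - 1.0) * 100.0
    } else {
        -((time_new / time_std - 1.0) * 100.0)
    }
}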

hot-u64-random <= 20 (Firestorm)

[screenshot: benchmark plot]

Interactive version here; you have to scroll down a bit to get to random, but the other ones are interesting too, of course.

hot-u64 (Firestorm)

[screenshot: benchmark plot]

Here you can see what I suspect is the really wide core flexing its ILP capabilities, speeding up sorting a slice of 16 random u64 by 2.4x.

hot-u64 (Broadwell)

[screenshot: benchmark plot]

Even on older hardware the results look promising.

You can explore all results here. Note, I accidentally forgot to add 16 as a test size when running the benchmarks on the Zen3 machine; the other two have it included. But from what I can see the difference is not too big.


Outstanding work

  • Get the is_copy check to work within the standard library. I used specialisation in my repo, but I'm not sure what the right path forward here would be. On the surface it seems like a relatively simple property to check (see the sketch after this list).
  • Talk about the expected behaviour when is_less panics inside sort16 during the final parity merge. With the current double copy, it retains the current behaviour of leaving v in a valid state and preserving all its original elements. A faster version is possible, by omitting the two copy calls, and directly writing the result of parity_merge into arr_ptr, however this changes the current behaviour. Should is_less panic, v will be left in a valid state but there might be duplicate elements, losing the original set. The question is, how many people rely on such behaviour? How big of a footgun is that? Is the price worth having everyone pay it when they don't need it? Would documentation be enough?
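For the is_copy point, here is a hedged sketch of the specialisation approach I used in my repo (nightly-only; whether something like this is acceptable inside std is exactly the open question):

#![feature(specialization)]
#![allow(incomplete_features)]

trait IsCopyMarker {
    fn is_copy() -> bool;
}

impl<T> IsCopyMarker for T {
    default fn is_copy() -> bool {
        false
    }
}

impl<T: Copy> IsCopyMarker for T {
    fn is_copy() -> bool {
        true
    }
}

// Only shallow-copyable types may be sorted via temporary copies.
fn is_copy<T>() -> bool {
    <T as IsCopyMarker>::is_copy()
}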

A note on 'uniqueness preservation'. Maybe this concept has a name, but I don't know it. I did experiments allowing the sort16 approach for non-Copy types, however I saw slowdowns for such types, even when limiting it to relatively cheap-to-move types. I suspect Copy is a pretty good heuristic for types that are cheap to compare, move and access. But what crucially makes Copy types the only applicable ones is not panic safety, as I initially thought; with the two extra copies in sort16, that part would be solved for non-Copy types too. The tricky part is which memory location is dereferenced and used to call is_less. The sort creates a shallow copy of a value and uses that copy's address to call is_less. If that copy is then not copied back, yet served as the 'unique' object at that point in the code, everything still looks as if the value in your mutable slice is the one you put in there. But if the value modifies itself inside is_less, this self-modification can be lost. That's exactly what parity_merge8 and parity_merge do: the way they sweep along the slice, overlapping between 'iterations', might compare something from src, copy it, and then use src again for a comparison, but later dest is copied back into v, which has only seen one of the comparisons. Crucially, Cell is not Copy, which means such logical footguns are not possible with Copy types, at least not to my knowledge. I wrote a test for it here.
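The test is roughly in this spirit (an illustrative sketch with invented names, not the exact test): every element records through a Cell how often it participated in a comparison, and the per-element records must add up to the observed total; a self-modification lost on a discarded shallow copy makes the sum come up short.

use std::cell::Cell;

struct ObservedValue {
    val: i32,
    comps_seen: Cell<u64>,
}

fn self_modification_is_preserved() {
    let total_comps = Cell::new(0u64);
    let mut v: Vec<ObservedValue> = (0..100)
        .rev()
        .map(|val| ObservedValue { val, comps_seen: Cell::new(0) })
        .collect();

    v.sort_by(|a, b| {
        total_comps.set(total_comps.get() + 1);
        a.comps_seen.set(a.comps_seen.get() + 1);
        b.comps_seen.set(b.comps_seen.get() + 1);
        a.val.cmp(&b.val)
    });

    // Each comparison touches two elements; lost self-modifications
    // would make the left side smaller than the right.
    let seen: u64 = v.iter().map(|x| x.comps_seen.get()).sum();
    assert_eq!(seen, total_comps.get() * 2);
}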

I would love to see if this improvement has any noticeable effect on compile times. Coarse analysis showed some places where the compiler uses stable sort with suitable types.

Future work

  • For input_size <20 stable and unstable sort both use insertion sort, albeit slightly different implementations. The same speedups here could be applied to unstable sort.
  • If I find the time I want to investigate porting the merge part of fluxsort and see if that speeds things up. My main worry is correctness. It's quite liberal in its use of pointers and I've already discovered memory and logic bugs in the parts that I ported. For <16 elements it's somewhat possible to reason about in my head. But for general situations, I fear it might require formal verification to attain the expected level of confidence.
  • Use sort16 to speedup unstable sort, or in general look into improving unstable sort.

This is my first time contributing to the standard library, my apologies if I missed or got something important wrong.

@rustbot rustbot added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Aug 21, 2022
@rustbot (Collaborator) commented Aug 21, 2022

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@rust-highfive (Collaborator)

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 21, 2022

@EdorianDark (Contributor)

That looks really nice. There is another PR #90545 improving std::sort, did you compare the performance of these two?

@Voultapher (Contributor, Author) commented Aug 23, 2022

@EdorianDark I'm running some benchmarks tonight; preliminary results suggest it is slightly faster than the existing sort, but significantly slower than new_stable for Copy types. From what I can tell, the discussion in that PR seems to be stuck at verification of the algorithm. One significant difference here is that this PR does not change the core algorithm; it speeds up providing a minimum sorted slice, so it should be much easier to prove to a reasonable degree that this is as correct as the previous version. And thank you for pointing out that other PR, I was not aware of it.

@Voultapher (Contributor, Author)

Plus I just discovered UB by running my test suite with miri and the proposed wpwoodjr_stable_sort.


@Voultapher (Contributor, Author) commented Aug 24, 2022

Regarding the failing test, I'm not sure what the expected behaviour should be.

use rand::seq::SliceRandom;
use std::cmp::Ordering::{Equal, Greater, Less};

// Sort using a completely random comparison function.
// This will reorder the elements *somehow*, but won't panic.
let mut v = [0; 500];
for i in 0..v.len() {
    v[i] = i as i32;
}
v.sort_by(|_, _| *[Less, Equal, Greater].choose(&mut rng).unwrap());
v.sort();
for i in 0..v.len() {
    assert_eq!(v[i], i as i32);
}

A simplified example:

input:
[2, 1, 8, 4, 11, 5, 5, 0, 5, 11, 0, 6, 7, 10, 9, 15, 16, 17, 18, 19]
new_stable_sort::sort_by produces:
[0, 0, 1, 2, 4, 5, 5, 5, 6, 7, 8, 9, 10, 11, 11, 15, 16, 17, 18, 19]

The implementation of sort16 assumes the user correctly implemented Ord for their type. Given that sort16 should only be called for Copy types, the result is memory safe, but it changes the existing behaviour. The question is how far we want to go to give people who violate the Ord API contract 'sensible' results. Looking at the comment above, the intent of the test seems to be to ensure that this won't panic and/or violate memory safety. If users mess up Ord, their results will most likely not be what they expected anyway; the result cannot logically be considered sorted after sort completes. There is even an argument to be made for surfacing such API violations more prominently, as for example integer overflow does in debug builds.

@steffahn (Member) commented Aug 27, 2022

I don’t really like those relative speedup graphs, since they paint a skewed/biased picture of what’s happening: e.g., if the new implementation is twice as fast, it’ll say +100%, but if the old implementation is twice as fast, it’ll only say -50%. Unless I’m misunderstanding something here. Anyways, I would prefer some non-biased statistics where both algorithms are treated the same, rather than one arbitrarily chosen to always be the basis for determining what “100%” means.


Edit: Actually… I just noticed numbers below -100% on this page, so perhaps the situation already is different than what I assumed and below zero, the roles switch?

@Voultapher (Contributor, Author) commented Aug 27, 2022 via email

@steffahn (Member) commented Aug 27, 2022

@Voultapher Oh… right, you did mention there that it’s symmetric. Maybe it’d be easier to find if it was mentioned inside of the graph, too 😇
E.g. where it currently says “100% = 2x”, it could clarify e.g. with something like “100% = 2×, −100% = 0.5×”. Perhaps even additional data points like “50% = 1.5×, −50% = 0.667×”.

On a related note, it isn’t clear to me whether the linked results about the number of comparisons use a symmetrical or an asymmetrical percentage. (I would probably assume those are asymmetrical, looking at the numbers so far.)

@steffahn (Member)

[ @Voultapher by the way, you had tagged the wrong person; careful with using @-sign in e-mail ;-) – also, since you’ve responded by e-mail you might have missed my edit on my first comment where I had already noticed some values below -100% on some linked graphs ]

@Voultapher (Contributor, Author) commented Aug 27, 2022 via email

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 27, 2022
@Voultapher (Contributor, Author) commented Aug 31, 2022

@Mark-Simulacrum I just noticed you changed the label from waiting on review to waiting on author. I'm waiting on review, regarding the failing CI, I commented about it here #100856 (comment). I could of course disable the failing test, but as I noted earlier I'm not sure what the expected behaviour ought to be.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 31, 2022
@Mark-Simulacrum (Member)

The code sample in #100856 (comment) seems to have both a sort_by (which is random) but then follows that with a regular sort() call. I think any changes in this PR should preserve behavior; even an incorrect Ord/PartialOrd impl shouldn't cause us to end up with a different set of elements in the array, so the subsequent sort should bring it back in order. Changing the specific results of the sort_by seems OK though; we don't need to guarantee the same set of cmp(..) calls or exact same results, so long as we still end up with a stable ordering.


As an aside, it looks like this PR's diff currently is such that it replaces the main body of the implementation too, even if it doesn't actually change it (hard to tell). As-is, it would require basically a fresh review of the whole sort algorithm which won't happen soon.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 4, 2022
@est31 (Member) commented Sep 4, 2022

even an incorrect Ord/PartialOrd impl shouldn't cause us to end up with a different set of elements in the array

Good point, it seems that items are being duplicated and others are silently leaked. With types that impl Drop this might lead to a double free.

@Voultapher (Contributor, Author)

@est31

Good point, it seems that items are being duplicated and others are silently leaked. With types that impl Drop this might lead to a double free.

Please read the comments and implementation. This would only happen for types that implement Copy, these types cannot have custom Drop implementations.

@steffahn (Member) commented Sep 5, 2022

@est31

Good point, it seems that items are being duplicated and others are silently leaked. With types that impl Drop this might lead to a double free.

Please read the comments and implementation. This would only happen for types that implement Copy, these types cannot have custom Drop implementations.

I would assume that it’s a sane use-case of sort_by to run it on a Vec<*const T> with some user-defined (and hence untrusted) (&T, &T) -> Ordering callback, in the context of a data structure that manually manages the memory for the T’s which are semantically assumed to be owned by the *const T. *const T is a Copy type.

@est31 (Member) commented Sep 5, 2022

I wonder if the faster sorting could be enabled for a set of trusted types like &str, integers, etc. where we provide the Ord impl so we trust it. And only enabling it if you are not doing sort_by. I guess it wouldn't even need to be Copy at that point, although that might be too dangerous as future extensions might cause scenarios where ordering can be customized in safe user code, idk.

@Mark-Simulacrum (Member)

Yeah, just because a type is Copy doesn't mean we can ignore such a basic property of sort (entry data is the same as exit data, just reordered). Raw pointers are an example of how that can go wrong in practice, but even sorting &T would run into this. My guess is that many users calling sort_by may actually have a not-correct ordering function, and we shouldn't break entirely on that. If we panicked that would be a plausible discussion but randomly duplicating elements isn't acceptable.

@steffahn (Member) commented Sep 5, 2022

I wonder if the faster sorting could be enabled for a set of trusted types like &str, integers, etc. where we provide the Ord impl so we trust it.

It’s hard to draw a line. usize or u32 could be a memory offset, too, indicating an owned memory location. Especially in a vector where u32 or even u16 might be chosen for memory efficiency, and the Vec<u16> is stored alongside some base pointer.

@bors (Contributor) commented Sep 18, 2022

☔ The latest upstream changes (presumably #101816) made this pull request unmergeable. Please resolve the merge conflicts.

@Voultapher (Contributor, Author)

FYI, I have found a viable alternative that completely avoids the problem by using sorting networks. I'm still researching potential avenues for further improvements and will post the results here once these are explored.

@Voultapher (Contributor, Author)

@Mark-Simulacrum

I'm back with an improved version that is completely free of any element duplication issues.

It achieves this mainly by using a stable sorting-network to speedup
provide_sorted_batch. In addition it uses a variety of strategies to optimize
for:

  • Good performance across all patterns, sizes and types
  • Minimal increase in total comparison count across patterns and sizes
  • Limited binary growth

Achieving them all simultaneously is hard.


Before we dive into the benchmark results, I wrote a test to explicitly verify that element duplication won't happen anymore, even in the presence of invalid Ord implementations.


Performance improvements.

Initially I experimented with bitonic merge networks to replace the parity_merge functions, however these turned out to require a substantial amount of additional comparisons. A collection of best known sorting-networks can be found here: https://bertdobbelaere.github.io/sorting_networks.html. These describe ways to sort data without changing control flow based on the result of each comparison, which can be implemented without jump instructions. Sorting-networks are great at leveraging Instruction-Level Parallelism (ILP): they expose multiple comparisons in straight-line code, with built-in data-dependency parallelism and ordering per layer. Up to a certain point they complete a sort on random data in fewer comparisons than an insertion sort does, e.g. sorting 20 elements takes the network 91 comparisons vs. 111 for insertion sort. For some patterns this is not ideal, such as already sorted inputs, since the sorting-network always finishes after the same number of comparisons. None of these optimal sorting-networks is applicable to a stable sort, as they require swapping non-adjacent wires; there are, however, stable sorting-networks. I experimented with hybrid approaches, but ultimately these don't seem to be worth it. The hundred or so hours I've spent since the last iteration cannot be adequately summed up in a couple of words, so I'll leave it at that. Here are some of the core changes:

Implementation:

  • Use cmov instead of setl for the branchless swap
  • Use sort8_stable network as core of provide_sorted_batch
  • Novel approach for integrating already sorted elements
  • The ability to sort reversed inputs with minimal comparisons starting at size 7 instead of 21.
  • Avoid allocation if the input is already fully sorted or reversed.

Analysis:

  • Remove all_equal pattern, practically the same as ascending
  • random_uniform pattern -> random_dense and random_binary (these actually stress test the algorithm)
  • Use symmetric diff in analyze comp count
  • Use mean instead of median in analyze comp count
  • Increase sample count for comparison count measurements
  • Added benchmark size 15

A little note on the outstandingly good results previously achieved by Firestorm: later work revealed they were better than Zen3 mostly because the AArch64 backend emits csel instructions instead of the setl bit-hacking emitted by the x86-64 backend. The new swap_if_less function emits cmov on x86-64 and csel on AArch64, which gives the win back to Zen3 when it comes to executing sort20_optimal (as found in graveyard.rs or the ongoing work on new_unstable_sort).
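To illustrate the idea in safe code (a sketch only; the actual implementation works on raw pointers):

// Compare-and-swap of the adjacent pair at i, i+1. Selecting the read
// positions via arithmetic instead of branching on the comparison lets
// LLVM lower this to cmov on x86-64 and csel on AArch64.
fn swap_if_less<T: Copy, F: FnMut(&T, &T) -> bool>(v: &mut [T], i: usize, is_less: &mut F) {
    let do_swap = is_less(&v[i + 1], &v[i]) as usize;
    let a = v[i + do_swap];
    let b = v[i + 1 - do_swap];
    v[i] = a;
    v[i + 1] = b;
}

// A stable 3-element network built from it: only adjacent wires are
// compared, so equal elements never overtake each other.
fn sort3_stable<T: Copy, F: FnMut(&T, &T) -> bool>(v: &mut [T], is_less: &mut F) {
    swap_if_less(v, 0, is_less);
    swap_if_less(v, 1, is_less);
    swap_if_less(v, 0, is_less);
}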

Here are the interactive results https://voultapher.github.io/sort-research-rs/results/30a063d/

hot-u64-zen3

Zen3 hot-u64

hot-f128-zen3

Zen3 hot-f128

hot-u64-firestorm

Firestorm hot-u64

I did not run the benchmarks on the Broadwell machine because it's kind of a pain to do, but if they are deemed interesting I can run them.

For hot-u64 we can see that the results yield speedups across most pattern-size combinations, tapering off at 20-30% speedup for truly random inputs, and surprisingly even better for random_dense inputs, which have good prediction, as seen in the f128 chart. Fully ascending and descending inputs see ~50% speedup, by virtue of not requiring allocation anymore. The largest speedups are found in descending inputs of len <= 20; these see up to 6x improvements.

A short excursion on ILP: this is what a single branchless swap looks like in x86-64 assembly: https://godbolt.org/z/vv48jMdcs. The sort20_optimal function completes in ~32.5ns on Zen3, versus ~165ns when using the insertion sort with insert_head. Running at 4.9GHz, that means the optimal sorting-network completes 91 comparisons and swaps in 159 cycles, one every 1.74 cycles. In contrast, the insertion sort completes 111 comparisons in 809 cycles, one only every 7.3 cycles. This difference is largest for inputs that are hard to predict, such as random, where 50% of those 111 comparison branches/jumps will have been mispredicted. sort8_stable is not as extreme a difference, but in essence the same story. That is the core of the speedup. The majority of the time is now spent in merge, which remains unchanged.


Binary size.

I compiled a simple program instantiating 6 Copy type sorts, with unknown slice length, and got these results when running cargo build --release && strip release-binary && ll release-binary:

- nothing:       300k
- std_stable:    343k
- std_unstable:  359k
- new_stable:    355k

And for completeness' sake, a debug build with cargo build && ll debug-binary:

- nothing:       5.4M
- std_stable:    6.4M
- std_unstable:  6.6M
- new_stable:    6.6M

For non-Copy types that don't satisfy qualifies_for_branchless_sort, the size remains at 343k, so this binary size increase only affects certain instantiations.


Comparison count.

This metric is very important for a general-purpose sort: the user provides the comparison function and can make it arbitrarily expensive. Very quickly the only important factor in how fast the sort is will be the cost of the comparison function. For String it takes up 80+% of the time, so you want a sort that completes in as few comparisons as possible, trumping ILP and other factors. The existing Timsort is already exceptionally good at doing the fewest possible comparisons; the heuristic here is that for the types that satisfy qualifies_for_branchless_sort, the additional ~10% for random inputs will be worth it. If the user sorts a distribution of only two distinct values, takes a mutex in the comparison function, does IO, and hits a certain pattern and some specific size, they may see up to a 50% slowdown. But sniping an algorithm is always possible. Overall I'm confident this new implementation strikes a good balance for most use cases, and improves the runtime for many users. For sizes <= 20 the new implementation can detect a descending pattern, which is why the largest f128 gains are the descending results. Comparison statistics, now symmetric:

[i32-ascending-20-plus]:              mean: 0%      min: 0%    max: 0%
[i32-ascending-20-sub]:               mean: 0%      min: 0%    max: 0%
[i32-ascending_saw_20-20-plus]:       mean: 2%      min: 0%    max: 18%
[i32-ascending_saw_20-20-sub]:        mean: 0%      min: 0%    max: 0%
[i32-ascending_saw_5-20-plus]:        mean: 18%     min: 0%    max: 29%
[i32-ascending_saw_5-20-sub]:         mean: 0%      min: 0%    max: 3%
[i32-descending-20-plus]:             mean: 0%      min: 0%    max: 0%
[i32-descending-20-sub]:              mean: -411%   min: -900% max: 0%
[i32-descending_saw_20-20-plus]:      mean: 1%      min: 0%    max: 11%
[i32-descending_saw_20-20-sub]:       mean: -411%   min: -900% max: 0%
[i32-descending_saw_5-20-plus]:       mean: 7%      min: 0%    max: 12%
[i32-descending_saw_5-20-sub]:        mean: -70%    min: -350% max: 0%
[i32-pipe_organ-20-plus]:             mean: 0%      min: 0%    max: 2%
[i32-pipe_organ-20-sub]:              mean: 22%     min: 8%    max: 50%
[i32-random-20-plus]:                 mean: 11%     min: 0%    max: 17%
[i32-random-20-sub]:                  mean: 10%     min: 0%    max: 25%
[i32-random_binary-20-plus]:          mean: 20%     min: 0%    max: 34%
[i32-random_binary-20-sub]:           mean: 20%     min: 0%    max: 45%
[i32-random_dense-20-plus]:           mean: 11%     min: 0%    max: 16%
[i32-random_dense-20-sub]:            mean: 11%     min: 0%    max: 25%

For fully random and very dense random inputs of types that satisfy qualifies_for_branchless_sort we see ~10% more comparisons done, with an extreme of 20% when sorting something with only two distinct values (random_binary). The largest change is, as expected, descending for len <= 20 (descending-20-sub).


Cold results and a simpler version.

I'm still not too sure about the methodology behind the cold tests, which aim to simulate the function being called after a lot of other code has run, instead of as part of a hot benchmark loop. Many of the results don't reproduce. Maybe in total they surface some signal by sheer weight of statistics? For Zen3 we see performance degradation for size < 20; Firestorm in contrast shows performance improvements. Also, we are talking about being 10-50ns slower; if the code is called so rarely that it runs cold, will a user care about an additional 30ns? Will those who, for example, sort small slices of indices in hot code benefit from the customized implementation? I initially had a version that dispatched into more specialized functions, and while they look good in hot benchmarks, I'm not convinced that something like sort3 is worth it if the len is 3, because the additional overhead of getting to that special function is likely just not worth it in real-world usage. Especially the Zen3 results look pretty bad here. I'm looking for feedback and ideas on how we could test the effectiveness of sort_small on real-world use cases. There is always the option to revert back to just a plain insertion sort, but that can't handle reversed inputs. Alternatively we could look into a smaller version that can detect reversed inputs, which hopefully should perform better for them. Or maybe even drop the whole pre-check insertion sort, leverage the existing pattern/streak analysis, which won't allocate now, and augment that with a small sort.


Documentation changes.

I'm not sure we should mention that some types may see different behavior. What do you think?


Future work.

Here are some other related topics I'm currently working on. But these will be better suited in their own PRs.

  • Trying to speedup merge
  • Unified sort module and improvements to slice::sort_unstable
  • Ord violation detection for debug builds
  • Comparison to libcxx sort implementations
  • Integrating some of my tests into the std library test suite


@Voultapher (Contributor, Author) commented Nov 1, 2022

I pushed a new version; you can find the results here: https://voultapher.github.io/sort-research-rs/results/9af9473/analysis/zen3/

The major change is re-using the existing TimSort streak analysis for ascending and descending patterns. With this new version descending gets sorted with minimal comparisons for any size. In addition I implemented a new small-sort function that does an allocation-free merge sort for sizes up to 32. It was still significantly faster up to 40 elements even for f128, but I left it at 32 to limit the growth in worst-case comparisons. Cold performance is not awesome but at least better than with the previous version.
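A sketch of what such an allocation-free small-sort can look like (illustrative only, using a fixed stack buffer; the real implementation is shaped differently):

use std::mem::MaybeUninit;

// Allocation-free merge sort for Copy slices with len <= 32: sort both
// halves recursively, then merge them through a stack buffer.
fn small_merge_sort<T: Copy + Ord>(v: &mut [T]) {
    const MAX_LEN: usize = 32;
    let len = v.len();
    assert!(len <= MAX_LEN);
    if len < 2 {
        return;
    }

    let mid = len / 2;
    small_merge_sort(&mut v[..mid]);
    small_merge_sort(&mut v[mid..]);

    let mut buf = [MaybeUninit::<T>::uninit(); MAX_LEN];
    let (mut left, mut right) = (0, mid);
    for slot in buf.iter_mut().take(len) {
        // Prefer the left run on ties to keep the merge stable.
        if right == len || (left < mid && v[left] <= v[right]) {
            slot.write(v[left]);
            left += 1;
        } else {
            slot.write(v[right]);
            right += 1;
        }
    }
    for (dst, src) in v.iter_mut().zip(&buf[..len]) {
        // SAFETY: the first `len` buffer slots were initialized above.
        *dst = unsafe { src.assume_init() };
    }
}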

Only tested Zen3 for now:

[screenshots: benchmark plots]

The analysis now benefits all types including Strings.

On a side note, I've now run into it twice that I forgot to call x.py fmt before pushing, because rust-analyzer produces the wrong formatting by default. Can I configure it to do the right thing?

@Voultapher (Contributor, Author) commented Nov 1, 2022

@Mark-Simulacrum I think it's in a reviewable state now. As far as I can tell the CI is unhappy because the cfg'd imports are not visible and I didn't slap no_global_oom_handling on all the functions. Before I do so, it would be great to understand what that cfg is for. If it excludes nearly everything in the file, why not apply it at a higher level?


Add new loop based mini-merge sort for small sizes. This extends
allocation free sorting for random inputs of types that qualify up to
32.
@Voultapher (Contributor, Author)

@Mark-Simulacrum friendly ping. The PR is marked as waiting on author, however I'm waiting for review. Is there something missing before this can be reviewed?

@Mark-Simulacrum (Member)

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Nov 17, 2022
@Mark-Simulacrum (Member)

On a sidenote, I've now run into it twice that I forgot to call x.py fmt before pushing, because rust-analyzer produces by default the wrong formatting. Can I configure it do the right thing?

https://rustc-dev-guide.rust-lang.org/conventions.html#formatting-and-the-tidy-script has some documentation here, you may be able to adjust what rust-analyzer runs via its configuration.

As far as I can tell the CI is unhappy because the cfg'd imports are not visible and I didn't slap no_global_oom_handling on all the functions. Before I do so, it would be great to understand what that cfg is for. If it excludes nearly everything in the file why not do it on a higher level?

The no_global_oom_handling cfg removes any functions which may invoke the global OOM handling (typically this means they panic on allocation failure). If it excluded everything then moving it up a level would make sense, but in practice it sounds like it doesn't quite cover everything, so it makes sense to leave some API surface visible even when it is set.
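For context, the gating pattern looks like this (the cfg name is the real one used in alloc; the function is an illustrative stand-in):

// Functions that may invoke the global OOM handling, i.e. abort or panic
// on allocation failure, are compiled out when the cfg is set.
#[cfg(not(no_global_oom_handling))]
fn stable_sort<T, F>(v: &mut [T], mut is_less: F)
where
    F: FnMut(&T, &T) -> bool,
{
    // Allocating merge-sort body elided; only the cfg pattern matters here.
    let _ = (v, &mut is_less);
}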

@Mark-Simulacrum (Member)

Both of them are still called merge_sort, and putting them in a diff viewer yields:

It looks like the diff on GitHub is still showing that the changes essentially fully move the merge sort. I unfortunately don't have the bandwidth to figure out a separate tool and work through moving this PR over into it. If you can refactor the PR or separate it out such that the diff is smaller and focuses on changes rather than a full rewrite, that would help facilitate the review.

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 20, 2022
compiler-errors added a commit to compiler-errors/rust that referenced this pull request Jan 20, 2023
…homcc

Unify stable and unstable sort implementations in same core module

This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort. Such as `insert_head`.

This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code, it *doesn't* change any of the sort related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less` `merge` and more.

Tagging ``@Mark-Simulacrum`` I hope this allows progress on rust-lang#100856, by moving `merge_sort` here I hope future changes will be easier to review.
@bors (Contributor) commented Jan 21, 2023

☔ The latest upstream changes (presumably #107143) made this pull request unmergeable. Please resolve the merge conflicts.

@Voultapher (Contributor, Author)

Closing for now, work will be resumed in other PRs. Most of the information in here is obsolete now, and I've gotten a lot further in optimising.

@Voultapher Voultapher closed this Jan 26, 2023