Skip to content

Conversation

@Voultapher
Copy link
Contributor

This is a followup to #100856, this time speeding up slice::unstable_sort. Fundamentally it uses optimal sorting networks to speedup sorting small slices. Before going into too much detail on the speedup, the most important thing is correctness. It passes my test suite in https://github.com/Voultapher/sort-research-rs both with normal Rust and Miri. That includes tests not found or not found with the same rigor in the standard library tests, observable_is_less, panic_retain_original_set and violate_ord_retain_original_set. And from a code structure point, it copies several elements from slice::sort. They live in separate modules, and I don't know enough about the structure of the standard library to unify them. In essence I could image them both living in core and stable sorting requiring a passed in function that does the allocation. But even then I'm not sure how that affects LTO and inlining, which are critical for performance. In my repository I have all the implementations copied into individual modules, and I'm not sure how to test that in the standard library.


Speedups:

  • 1.2-1.5x faster than existing implementation for random, saw and other patterns (integers)
  • Same performance for very densely distributed random inputs and ascending and descending inputs, len >= 40.
  • On average fewer comparisons for types that are judged cheap to move
  • Near perfect sorting of descending inputs starting from size 8.

To understand this PR, it's advisable to read #100856. There I go into more detail on the test methodology and graphs.

The full benchmark results can be found here: https://voultapher.github.io/sort-research-rs/results/e746538/

Here are the speedups and slowdowns for Zen3:

image

For hot-u64 we can see that the speedups are the most extreme for smaller sizes and level out at 30%.

image

For hot-string which is relatively expensive to access we see it level out at 4% while seeing 10-15% speedup for smaller sizes, and 3x speedup as the most extreme case hot-string-descending-20. These results are in-line with the average reduced comparison counts and the extreme speedups are explained by the ability to detect fully or mostly decreasing inputs even for small inputs. For smaller sizes there are also a noticeable amount of slowdowns. I'd argue that the overall speedup is worth it here, but this can be tuned with the qualifies_for_branchless_sort heuristic.

image

The 1k type generally produces the most noisy results, so I'm not sure this is a real signal. But it might be thanks to the tuned insert right and left functions.

image

The cheap to access but expensive to compare type f128 shows a modest speedup of 10-20%, with descending inputs again showing the largest outliers at a 5x speedup for descending-20, which switches from insertion sort to the more sophisticated sort_small.

image

The cold results don't look too hot for small sizes <= 20, these could be addressed by adding extra logic. But even then it's a tricky balance, where something like pattern analysis is really hard if not impossible to do without some slowdown for cold code. Also note, I'm not too sure about the my methodology for cold code here.

The speedups and slowdowns for Firestorm are relatively similar:

image


Comparison statistics.

Percent comparisons done more by rust_new_unstable than rust_std_unstable
[i32-ascending-20-plus]:              mean: -2%     min: -13%  max: 0%
[i32-ascending-20-sub]:               mean: 14%     min: 0%    max: 67%
[i32-ascending_saw_20-20-plus]:       mean: -7%     min: -19%  max: -2%
[i32-ascending_saw_20-20-sub]:        mean: 14%     min: 0%    max: 67%
[i32-ascending_saw_5-20-plus]:        mean: -8%     min: -24%  max: -3%
[i32-ascending_saw_5-20-sub]:         mean: 13%     min: -11%  max: 67%
[i32-descending-20-plus]:             mean: -47%    min: -240% max: 0%
[i32-descending-20-sub]:              mean: -396%   min: -900% max: 0%
[i32-descending_saw_20-20-plus]:      mean: -54%    min: -240% max: -3%
[i32-descending_saw_20-20-sub]:       mean: -396%   min: -900% max: 0%
[i32-descending_saw_5-20-plus]:       mean: -6%     min: -22%  max: -1%
[i32-descending_saw_5-20-sub]:        mean: -74%    min: -350% max: 0%
[i32-pipe_organ-20-plus]:             mean: -27%    min: -64%  max: -6%
[i32-pipe_organ-20-sub]:              mean: -54%    min: -143% max: 18%
[i32-random-20-plus]:                 mean: -6%     min: -14%  max: -3%
[i32-random-20-sub]:                  mean: -8%     min: -34%  max: 25%
[i32-random_binary-20-plus]:          mean: 13%     min: -13%  max: 85%
[i32-random_binary-20-sub]:           mean: 22%     min: 0%    max: 62%
[i32-random_dense-20-plus]:           mean: 7%      min: 0%    max: 35%
[i32-random_dense-20-sub]:            mean: 4%      min: -15%  max: 31%

With some exceptions, overall we see a nice average reduction of comparisons. That's the reasons why I tuned qualifies_for_branchless_sort to allow all types except those that are deemed expensive to move.


Binary bloat.

Compiling a simple hello world and sorting 6 different integer types and 2 different String containers, in release mode and stripping the binary yields:

Baseline:            300k
rust_std_unstable:   379k
rust_new_unstable:   425k

Why is this faster?

Moder hardware is so complex that all I can do is guess, but in essence if you look at the assembly call graph of insertion_sort_shift_left:

insertion-sort-assembly
mbly

And compare that to sort8_optimal:

sort8-assembly

It might seem clear, why the sort8_optimal and it's siblings are able to extract more Instruction-Level-Parallelism (ILP) than insertion sort which is currently being used.

Another reason is, that pdqsort spends a lot of time in insertion sort:

image

So it's worthwhile to speed up this area. The kind of optimizations pdqsort does to avoid choosing a bad pivot or even switching to heapsort make a lot of sense at larger sizes, but are overkill for smaller sizes. Which are better addressed with something tailormade like sort_small.

At it's core it leverages branchless compare and swap operations employed in
optimal sorting networks. A variety of strategies is used to optimize for hot
and cold runtime, binary size, and max comparisons done. Many patterns see a
reduction in average comparisons performed. So this improvement is applied to
all types that are deemed cheap to move. It copies parts of the stable sort.

Before this get's merged they ought to be unified.
@rustbot
Copy link
Collaborator

rustbot commented Nov 7, 2022

r? @m-ou-se

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 7, 2022
@rustbot
Copy link
Collaborator

rustbot commented Nov 7, 2022

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@rust-log-analyzer
Copy link
Collaborator

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
configure: rust.debug-assertions := True
configure: rust.overflow-checks := True
configure: llvm.assertions      := True
configure: dist.missing-tools   := True
configure: build.configure-args := ['--enable-sccache', '--disable-manage-submodu ...
configure: writing `config.toml` in current directory
configure: 
configure: run `python /checkout/x.py --help`
Attempting with retry: make prepare
---
tidy: Skipping binary file check, read-only filesystem
Checking which error codes lack tests...
* 632 error codes
* highest error code: E0790
tidy error: /checkout/library/core/src/slice/sort.rs:773: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:880: TODO is deprecated; use FIXME
tidy error: /checkout/library/core/src/slice/sort.rs:890: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:957: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1056: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1080: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1103: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1116: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1127: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1140: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1155: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1160: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1174: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1492: TODO is deprecated; use FIXME
Found 0 error(s) in error codes
Done!
* 389 features
some tidy checks failed

@the8472
Copy link
Member

the8472 commented Nov 7, 2022

They live in separate modules, and I don't know enough about the structure of the standard library to unify them.

If this is about the core/alloc split then you can expose them as an unstable, hidden feature without tracking issue and use that feature in the other crate.

But even then I'm not sure how that affects LTO and inlining

Generic methods are always eligible for inlining because they are only instantiated once the concrete type is known.

Fundamentally it uses optimal sorting networks to speedup sorting small slices. Before going into too much detail on the speedup

Looking at the linked page it looks like sort networks optimize for short dependency chains, which should be good for ILP. But is there room for optimizing for cache effects too? I guess that's more of a research question than something that can be done in a PR.

@Voultapher
Copy link
Contributor Author

FYI the CI fails because of a TODO comment in the code. That TODO is a question for the authors here, it would be relatively easy to re-use small_sort for slice::select_nth_unstable but I'm not too familiar with the code and the implications that would have and I don't have benchmarks for that. In addition I'd like to wait with this PR until the other one was reviewed and hopefully merged.

compiler-errors added a commit to compiler-errors/rust that referenced this pull request Jan 20, 2023
…homcc

Unify stable and unstable sort implementations in same core module

This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort. Such as `insert_head`.

This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code, it *doesn't* change any of the sort related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less` `merge` and more.

Tagging ``@Mark-Simulacrum`` I hope this allows progress on rust-lang#100856, by moving `merge_sort` here I hope future changes will be easier to review.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jan 20, 2023
…homcc

Unify stable and unstable sort implementations in same core module

This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort. Such as `insert_head`.

This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code, it *doesn't* change any of the sort related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less` `merge` and more.

Tagging ```@Mark-Simulacrum``` I hope this allows progress on rust-lang#100856, by moving `merge_sort` here I hope future changes will be easier to review.
compiler-errors added a commit to compiler-errors/rust that referenced this pull request Jan 21, 2023
…homcc

Unify stable and unstable sort implementations in same core module

This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort. Such as `insert_head`.

This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code, it *doesn't* change any of the sort related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less` `merge` and more.

Tagging ````@Mark-Simulacrum```` I hope this allows progress on rust-lang#100856, by moving `merge_sort` here I hope future changes will be easier to review.
@Voultapher
Copy link
Contributor Author

Closing for now, work will be resumed in other PRs. Most of the information in here is obsolete now, and I've gotten a lot further in optimising.

@Voultapher Voultapher closed this Jan 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants