Add a `sort` method to &mut [T] #11064

huonw · 2013-12-19T06:02:03Z

This uses quite a bit of unsafe code for speed and failure safety, and allocates 2*n temporary storage.

n	new	priority_queue	quick3
5	200	155	106
100	6490	8750	5810
10000	1300000	1790000	1060000
100000	16700000	23600000	12700000
sorted	520000	1380000	53900000
trend	1310000	1690000	1100000

(The times are in nanoseconds, having subtracted the set-up time (i.e. the just_generate bench target).)

I imagine that there is still significant room for improvement, particularly because both priority_queue and quick3 are doing a static call via Ord or TotalOrd for the comparisons, while this is using a (boxed) closure.

Also, this code does not clone, unlike quick_sort3; and is stable, unlike both of the others.

huonw · 2013-12-19T06:18:06Z

This uses a lot of unsafe code, but I don't see a way around it to maintain efficiency.

The code there is essentially an optimised version of the following (which is several times slower):

fn merge_sort_safe<T>(v: &mut [T], less_eq: |&T, &T| -> bool) {
    if v.is_empty() { return }
    let len = v.len();
    let mut idx: ~[uint] = vec::with_capacity(len);

    for start in iter::range_step(0, len, INSERTION) {
        for i in range(start, cmp::min(start + INSERTION, len)) {
            let mut j = i as int;
            while j > start as int && !less_eq(&v[idx[j - 1]], &v[i]) {
                j -= 1;
            }
            idx.insert(j as uint, i);
        }
    }

    let mut width = INSERTION;
    let mut next_idx = vec::with_capacity(len);
    while width < len {
        for start in iter::range_step(0, len, 2 * width) {
            // the end of the first run/start of the second
            let left_end = cmp::min(start + width, len);
            // end of the second
            let right_end = cmp::min(start + 2 * width, len);
            let mut left = start;
            let mut right = left_end;
            for _ in range(start, right_end) {
                let choose_left = left < left_end && (right >= right_end ||
                                                      less_eq(&v[idx[left]], &v[idx[right]]));
                next_idx.push(if choose_left {
                    let l = idx[left];
                    left += 1;
                    l
                } else {
                    let r = idx[right];
                    right += 1;
                    r
                })
            }
        }
        swap(&mut idx, &mut next_idx);
        next_idx.truncate(0);
        width *= 2;
    }

    // trickery to put all the elements in the correct place.
    let mut inverse_idx = next_idx;
    inverse_idx.push_all(idx); // get the appropriate length; we overwrite it later.
    for (i, &j) in idx.iter().enumerate() {
        inverse_idx[j] = i
    }

    for i in range(0, len - 1) {
        let idx_i = idx[i];
        if idx_i == i { continue }
        v.swap(i, idx_i);
        idx.swap(i, inverse_idx[i]);
        inverse_idx.swap(i,idx_i);
    }
}

alexcrichton · 2013-12-19T07:05:27Z

A few high-level questions before I dive too deep into the code:

Have you considered other algorithms? The one that comes to mind for me is quicksort which I don't think needs the temporary storage. I also haven't implemented it in a serious setting, so I can't really speak to how reasonable it is to expect it.
At least in ruby the sort() method takes no arguments, but sort_by() takes a closure for comparing to elements. I think there's also a flavor that takes a closure which maps an element to a sortable value. I don't think that we want to get too fancy with this, but it may be worth considering having the sort() method be the convenience method and have an extra one for the sort-by-closure method.

Other than that, awesome work! At a quick glance this looks pretty awesome and I'm loving that diff stat.

thestinger · 2013-12-19T07:13:33Z

@alexcrichton: quicksort has an O(n^2) worst-case without slowing it down by making it use median-of-median pivot selection and isn't a stable sort

huonw · 2013-12-19T07:20:34Z

Have you considered other algorithms?

I was just going off @thestinger's suggestion of merge sort + insertion sort on #9819.

At least in ruby the sort() method takes no arguments, but sort_by() takes a closure for comparing to elements

I'm happy to do this, I'm also currently experimenting with:

pub trait LessThanEqualComparator<T> {
    fn less_eq(&self, a: &T, b: &T) -> bool;
}

pub struct SortForward;
impl<T: Ord> LessThanEqualComparator<T> for SortForward {
    fn less_eq(&self, a: &T, b: &T) -> bool { *a <= *b }
}

pub struct SortReverse;
impl<T: Ord> LessThanEqualComparator<T> for SortReverse {
    fn less_eq(&self, a: &T, b: &T) -> bool { *b <= *a }
}

impl<'a, T> LessThanEqualComparator<T> for 'a |&T, &T| -> bool {
    fn less_eq(&self, a: &T, b: &T) -> bool {
        (*self)(a, b)
    }
}

[...]

    fn sort<Sort: LessThanEqualComparator>(self, less_eq: Sort);

which would avoid the boxed closure problem for a plain a <= b sort, but still be flexible, and avoid writing the error prone |a,b| a <= b repeatedly. I have a feeling that this will require type annotations on the closure, which may be too ugly (and it would require importing std::vec::SortForward etc, so the sugary .sort() method might still be nice).

huonw · 2013-12-19T11:16:59Z

I've added a commit that does the above suggestion, it does require putting type signatures on the closures, which is fairly annoying. :( (I'm happy to drop it if necessary.)

However, this adjusts the times to

n	original	trait
5	200	155
100	6490	5900
10000	1300000	1280000
100000	16700000	16400000
sorted	520000	310000
trend	1310000	1250000

If it were split into .sort() and .sort_by(<comparator>) this might not be so ugly in practice.

huonw · 2013-12-19T12:05:17Z

And added one for .sort() and .sort_by(..). Unfortunately that requires adding a whole new trait and adding it to the prelude. :(

alexcrichton · 2013-12-19T22:03:36Z

@brson and I were talking about this this morning, and here's some thoughts that we had (plus a few of my own with these updates).

There are 3 flavors of sorting that I know of. sort() which requires T: Ord, sort(|a| { ... }) which maps elements to a orderable-type, and sort(|a, b| { ... }) which compares two elements directly. I think that sort() should definitely be available (as is today), and sort(|a, b| { ... }) is also quite useful (as you have right now), and I'm not entirely sure if we want sort(|a| { ... }) for now. I think we can get by without it.
The standard sort should be super fast. It appears that other programming languages don't necessarily put a huge focus on having the default sort being a stable sort. I personally have never cared about my sort being stable, but there are very real use cases which require this. Perhaps a good solution for this would be to have extra::sort with a fast stable sort (and maybe some other flavors), but the std::sort variant would be a super-fast explicitly not guaranteed to be stable sort. @brson mentions that the "hot sorting algorithm" today is timsort, which we apparently have an implementation of in libextra (but may need to be optimized).
I think that the api should be a little different. I always forget what order I need to do things in the closure. Should I return less than, less than or equal, or the inverse of these? You solve this well with a less_eq method (so it's clear), but this seems like yet another trait which isn't that necessary. It seems to be that a closure should be perfectly sufficient, but the return value of the closure should be Ordering. I still personally don't understand the difference between Ord and TotalOrd, but the basic idea that I would like to see is just returning the relative comparison between the two elements, and then the array is always sorted smallest to largest. We could have convenient ordering.invert() methods for inverting how the array is sorted.

How does that sound?

huonw · 2013-12-20T00:23:32Z

@brson mentions that the "hot sorting algorithm" today is timsort, which we apparently have an implementation of in libextra (but may need to be optimized).

It definitely does; this algorithm here is much faster in general that the timsort in extra (although timsort is amazing on pre-sorted lists):

n	original	trait	timsort
5	200	155	236
100	6490	5900	24000
10000	1300000	1280000	2810000
100000	16700000	16400000	37000000
sorted	520000	310000	15000
trend	1310000	1250000	2770000

this seems like yet another trait which isn't that necessary. It seems to be that a closure should be perfectly sufficient

The point of the trait is to get static dispatch for a simple a <= b comparison since this is faster (see the table above); (I agree that it would be unnecessary if we had unboxed stack closures).

the return value of the closure should be Ordering

Sounds good.

Re stability and the exact choice of algorithm, you'll have to talk to @thestinger, since I'm just following his suggestion on #9819.

huonw · 2013-12-20T00:54:31Z

I'm reading over the tim_sort code in extra now... there is a lot of it (560 LOC for the implementation, not including any tests or anything) and it looks fairly complicated; I imagine that doing it with unsafe code would be hard to verify as correct, and will be hard to get to be as efficient as this (especially when avoiding .clone).

brson · 2013-12-20T01:29:54Z

This looks very nice, @huonw. Thanks.

brson · 2013-12-20T01:30:29Z

Well commented.

very small runs. This uses a lot of unsafe code for speed, otherwise we would be having to sort by sorting lists of indices and then do a pile of swaps to put everything in the correct place. Fixes rust-lang#9819.

Fixes rust-lang#9676.

huonw · 2013-12-20T03:44:31Z

Pushed the less_eq -> Ordering change. It makes some things ugly since it then requires TotalOrd which makes handling (for example) floats less nice (cc #5585 and #10320).

alexcrichton · 2013-12-20T17:17:16Z

@brson/@pcwalton and I were talking about this, and we're a little wary of the Sort: SortComparator<T> bound. It should always be the case that arr.sort_by(|a, b| a.foo.cmp(&b.foo)) works, but in the future this will likely be done with an unboxed closure rather than a trait bound. For forward compatibility, we think that for now the argument should just be a closure. If we have a trait bound, code will eventually break because it relied on using the trait rather than a function-like thing which we're a little worried about.

I realize that it's a little slower, but I would be more worried about forwards compatibility. What do you think?

pnkfelix · 2013-12-20T20:17:48Z

@alexcrichton Your comment, "There are 3 flavors of sorting that I know of ..." (which then listed three flavors, at least two of which were comparison-based sorts, not sure about the third) reminded me that I have been meaning to go back over this paper on Generic Discrimination based Sorting/Partitioning [1] [2], which I saw presented some years ago but never got a chance to really play with.

(The idea, if I recall correctly, is a generalization of the ideas that drive a radix-sort to more general data structures, yielding "linear time" sorting, where I've quoted linear time because I'm sure there are certain assumptions that must be met.)

[1] http://www.diku.dk/hjemmesider/ansatte/henglein/papers/henglein2007b.pdf

[2] http://www.icfpconference.org/icfp2008/accepted/41.html

This moves the custom sorting to `.sort_by`.

huonw · 2013-12-22T02:19:54Z

Rebased out the trait addition (I agree with the backward-compatibility concerns).

(I'm "away" at the moment, so iteration of this PR will be slow.)

(implicitly) less_eq.

This uses quite a bit of unsafe code for speed and failure safety, and allocates `2*n` temporary storage. [Performance](https://gist.github.com/huonw/5547f2478380288a28c2): | n | new | priority_queue | quick3 | |-------:|---------:|---------------:|---------:| | 5 | 200 | 155 | 106 | | 100 | 6490 | 8750 | 5810 | | 10000 | 1300000 | 1790000 | 1060000 | | 100000 | 16700000 | 23600000 | 12700000 | | sorted | 520000 | 1380000 | 53900000 | | trend | 1310000 | 1690000 | 1100000 | (The times are in nanoseconds, having subtracted the set-up time (i.e. the `just_generate` bench target).) I imagine that there is still significant room for improvement, particularly because both priority_queue and quick3 are doing a static call via `Ord` or `TotalOrd` for the comparisons, while this is using a (boxed) closure. Also, this code does not `clone`, unlike `quick_sort3`; and is stable, unlike both of the others.

@huonw

Implement a faster sort algorithm Hi everyone, this is my first PR. I've made some changes to the standard sort algorithm, starting out with a few tweaks here and there, but in the end this endeavour became a complete rewrite of it. #### Summary Changes: * Improved performance, especially on partially sorted inputs. * Performs less comparisons on both random and partially sorted inputs. * Decreased the size of temporary memory: the new sort allocates 4x less. Benchmark: ``` name out1 ns/iter out2 ns/iter diff ns/iter diff % slice::bench::sort_large_ascending 85,323 (937 MB/s) 8,970 (8918 MB/s) -76,353 -89.49% slice::bench::sort_large_big_ascending 2,135,297 (599 MB/s) 355,955 (3595 MB/s) -1,779,342 -83.33% slice::bench::sort_large_big_descending 2,266,402 (564 MB/s) 416,479 (3073 MB/s) -1,849,923 -81.62% slice::bench::sort_large_big_random 3,053,031 (419 MB/s) 1,921,389 (666 MB/s) -1,131,642 -37.07% slice::bench::sort_large_descending 313,181 (255 MB/s) 14,725 (5432 MB/s) -298,456 -95.30% slice::bench::sort_large_mostly_ascending 287,706 (278 MB/s) 243,204 (328 MB/s) -44,502 -15.47% slice::bench::sort_large_mostly_descending 415,078 (192 MB/s) 271,028 (295 MB/s) -144,050 -34.70% slice::bench::sort_large_random 545,872 (146 MB/s) 521,559 (153 MB/s) -24,313 -4.45% slice::bench::sort_large_random_expensive 30,321,770 (2 MB/s) 23,533,735 (3 MB/s) -6,788,035 -22.39% slice::bench::sort_medium_ascending 616 (1298 MB/s) 155 (5161 MB/s) -461 -74.84% slice::bench::sort_medium_descending 1,952 (409 MB/s) 202 (3960 MB/s) -1,750 -89.65% slice::bench::sort_medium_random 3,646 (219 MB/s) 3,421 (233 MB/s) -225 -6.17% slice::bench::sort_small_ascending 39 (2051 MB/s) 34 (2352 MB/s) -5 -12.82% slice::bench::sort_small_big_ascending 96 (13333 MB/s) 96 (13333 MB/s) 0 0.00% slice::bench::sort_small_big_descending 248 (5161 MB/s) 243 (5267 MB/s) -5 -2.02% slice::bench::sort_small_big_random 501 (2554 MB/s) 490 (2612 MB/s) -11 -2.20% slice::bench::sort_small_descending 95 (842 MB/s) 63 (1269 MB/s) -32 -33.68% slice::bench::sort_small_random 372 (215 MB/s) 354 (225 MB/s) -18 -4.84% ``` #### Background First, let me just do a quick brain dump to discuss what I learned along the way. The official documentation says that the standard sort in Rust is a stable sort. This constraint is thus set in stone and immediately rules out many popular sorting algorithms. Essentially, the only algorithms we might even take into consideration are: 1. [Merge sort](https://en.wikipedia.org/wiki/Merge_sort) 2. [Block sort](https://en.wikipedia.org/wiki/Block_sort) (famous implementations are [WikiSort](https://github.com/BonzaiThePenguin/WikiSort) and [GrailSort](https://github.com/Mrrl/GrailSort)) 3. [TimSort](https://en.wikipedia.org/wiki/Timsort) Actually, all of those are just merge sort flavors. :) The current standard sort in Rust is a simple iterative merge sort. It has three problems. First, it's slow on partially sorted inputs (even though #29675 helped quite a bit). Second, it always makes around `log(n)` iterations copying the entire array between buffers, no matter what. Third, it allocates huge amounts of temporary memory (a buffer of size `2*n`, where `n` is the size of input). The problem of auxilliary memory allocation is a tough one. Ideally, it would be best for our sort to allocate `O(1)` additional memory. This is what block sort (and it's variants) does. However, it's often very complicated (look at [this](https://github.com/BonzaiThePenguin/WikiSort/blob/master/WikiSort.cpp)) and even then performs rather poorly. The author of WikiSort claims good performance, but that must be taken with a grain of salt. It performs well in comparison to `std::stable_sort` in C++. It can even beat `std::sort` on partially sorted inputs, but on random inputs it's always far worse. My rule of thumb is: high performance, low memory overhead, stability - choose two. TimSort is another option. It allocates a buffer of size `n/2`, which is not great, but acceptable. Performs extremelly well on partially sorted inputs. However, it seems pretty much all implementations suck on random inputs. I benchmarked implementations in [Rust](https://github.com/notriddle/rust-timsort), [C++](https://github.com/gfx/cpp-TimSort), and [D](https://github.com/dlang/phobos/blob/fd518eb310a9494cccf28c54892542b052c49669/std/algorithm/sorting.d#L2062). The results were a bit disappointing. It seems bad performance is due to complex galloping procedures in hot loops. Galloping noticeably improves performance on partially sorted inputs, but worsens it on random ones. #### The new algorithm Choosing the best algorithm is not easy. Plain merge sort is bad on partially sorted inputs. TimSort is bad on random inputs and block sort is even worse. However, if we take the main ideas from TimSort (intelligent merging strategy of sorted runs) and drop galloping, then we'll have great performance on random inputs and it won't be bad on partially sorted inputs either. That is exactly what this new algorithm does. I can't call it TimSort, since it steals just a few of it's ideas. Complete TimSort would be a much more complex and elaborate implementation. In case we in the future figure out how to incorporate more of it's ideas into this implementation without crippling performance on random inputs, it's going to be very easy to extend. I also did several other minor improvements, like reworked insertion sort to make it faster. There are also new, more thorough benchmarks and panic safety tests. The final code is not terribly complex and has less unsafe code than I anticipated, but there's still plenty of it that should be carefully reviewed. I did my best at documenting non-obvious code. I'd like to notify several people of this PR, since they might be interested and have useful insights: 1. @huonw because he wrote the [original merge sort](#11064). 2. @alexcrichton because he was involved in multiple discussions of it. 3. @veddan because he wrote [introsort](https://github.com/veddan/rust-introsort) in Rust. 4. @notriddle because he wrote [TimSort](https://github.com/notriddle/rust-timsort) in Rust. 5. @bluss because he had an attempt at writing WikiSort in Rust. 6. @gnzlbg, @rkruppe, and @mark-i-m because they were involved in discussion #36318. **P.S.** [quickersort](https://github.com/notriddle/quickersort) describes itself as being universally [faster](https://github.com/notriddle/quickersort/blob/master/perf.txt) than the standard sort, which is true. However, if this PR gets merged, things might [change](https://gist.github.com/stjepang/b9f0c3eaa0e1f1280b61b963dae19a30) a bit. ;)

[`unnecessary_literal_unwrap`]: Fix ICE on None.unwrap_or_default() Fixes rust-lang#11099 Fixes rust-lang#11064 I'm running into rust-lang#11099 (cc `@y21)` on my Rust codebase. Clippy ICEs on this code when evaluating the `unnecessary_literal_unwrap` lint: ```rust fn main() { let val1: u8 = None.unwrap_or_default(); } ``` This fixes that ICE and adds an message specifically for that case: ``` error: used `unwrap_or_default()` on `None` value --> $DIR/unnecessary_literal_unwrap.rs:26:5 | LL | None::<String>.unwrap_or_default(); | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: remove the `None` and `unwrap_or_default()`: `String::default()` ``` This PR also fixes the same ICE with `None.unwrap_or_else` (by giving the generic error message for the lint in that case). changelog: Fix ICE in `unnecessary_literal_unwrap` on `None.unwrap_or_default()`

huonw added 2 commits December 20, 2013 12:38

std::vec: implement a stable merge sort, deferring to insertion sort for

721609e

very small runs. This uses a lot of unsafe code for speed, otherwise we would be having to sort by sorting lists of indices and then do a pile of swaps to put everything in the correct place. Fixes rust-lang#9819.

extra: remove sort in favour of the std method.

48fedcb

Fixes rust-lang#9676.

std::vec: add a sugary .sort() method for plain Ord sorting.

1b1e4ca

This moves the custom sorting to `.sort_by`.

std::vec: make the sorting closure use Ordering rather than just being

2e8c522

(implicitly) less_eq.

fix check-fast tests.

645fff4

bors closed this Dec 22, 2013

bors merged commit 645fff4 into rust-lang:master Dec 22, 2013

huonw deleted the vec-sort branch December 31, 2013 15:34

Gankra mentioned this pull request Oct 31, 2015

VecDeque has no sort_by method (should possibly have one to match Vec). #27322

Closed

ghost mentioned this pull request Dec 6, 2016

Implement a faster sort algorithm #38192

Merged

Uh oh!

Add a sort method to &mut [T] #11064

Add a sort method to &mut [T] #11064

Uh oh!

Conversation

huonw commented Dec 19, 2013

Uh oh!

huonw commented Dec 19, 2013

Uh oh!

alexcrichton commented Dec 19, 2013

Uh oh!

thestinger commented Dec 19, 2013

Uh oh!

huonw commented Dec 19, 2013

Uh oh!

huonw commented Dec 19, 2013

Uh oh!

huonw commented Dec 19, 2013

Uh oh!

alexcrichton commented Dec 19, 2013

Uh oh!

huonw commented Dec 20, 2013

Uh oh!

huonw commented Dec 20, 2013

Uh oh!

brson commented Dec 20, 2013

Uh oh!

brson commented Dec 20, 2013

Uh oh!

huonw commented Dec 20, 2013

Uh oh!

alexcrichton commented Dec 20, 2013

Uh oh!

pnkfelix commented Dec 20, 2013

Uh oh!

huonw commented Dec 22, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Add a `sort` method to &mut [T] #11064

Add a `sort` method to &mut [T] #11064