Improve shuffle performance #297

pitdicker · 2018-03-12T16:17:49Z

Using two methods: not using perfectly unbiased ranges (see rationale in the comment), and making better use of the RNG (like not requesting 64 bits on an usize when there are less than 2^32 elements).

Before:

test misc_shuffle_1000 (x86_64)      ... bench:      10,370 ns/iter (+/- 91)
test misc_shuffle_1000 (x86)         ... bench:       7,565 ns/iter (+/- 45)

After:

test misc_shuffle_1000 (x86_64)      ... bench:       1,559 ns/iter (+/- 9)
test misc_shuffle_1000 (x86)         ... bench:       2,540 ns/iter (+/- 112)

Note that I didn't do endless tries optimizing this, I only implemented the basic ideas. It is a little surprising that x86 does not improve in a way similar to x86_64.

pitdicker · 2018-03-12T17:31:48Z

Nice to compare against Xorshift.

test gen_u32_xorshift      ... bench:       1,374 ns/iter (+/- 6) = 2911 MB/s

dhardy · 2018-03-12T18:08:49Z

Wait, your code can shuffle a 1000-element list in 13% more time than it takes to generate a u32 with Xorshift? I don't believe that.

Also, I stand by my previous comment that algorithms don't belong in the Rng trait 😉 Maybe it's time to investigate @vks's idea for an algorithms module.

TODO: investigate why the benchmark is so fast. This doesn't need to land in the 0.5 release.

pitdicker · 2018-03-12T18:51:57Z

I believe the benchmark is right. For a slice with only 1000 elements two values can be swapped with every round of the RNG. Also the only things it does extra are keeping a loop counter, a multiply and the swap itself. If you increase the benchmark to use slices of for example 40mb it takes about twice as long, as expected because it mostly uses the slower loop.

dhardy · 2018-03-12T19:32:39Z

Am I missing something? b.iter(...) should be measuring the time taken to shuffle 1000 elements, which with your code requires 250 u64 values, which is 500 calls to Xorshift's next_u32 fn — so the iteration time should be at least 500 times longer than next_u32 — oh, but gen_u32_xorshift calls that 1000 times per iteration; okay.

pitdicker · 2018-03-12T19:38:21Z

Ah, I suppose I have played too much with the RNG benchmarks. I just assumed the factor 1000 and didn't even think to mention it 😄.

dhardy

Other than the details mentioned about i, I'm happy with the rough shape of the algorithm.

I'm not really happy about putting a non-trivial algorithm within Rng though. Hopefully later we can find another place for this; in the mean time perhaps a private function would do.

I'm satisfied the biases would not be significant for most uses, but e.g. std has sort and sort_unstable, i.e. the default version is the safest version, so perhaps we should have shuffle and shuffle_fast (or shuffle_biased)?

dhardy · 2018-03-14T12:45:10Z

src/lib.rs

+        }
+
+        let mut i = i as u16;
+        while i >= 2 {


Shouldn't this be while i >= 5 with special cases for the last choices? This algorithm could wrap i. Also implies the guards on above while loops may need changing.

That is also possible. My thought was to handle the special cases before starting a loop. Maybe doesn't matter anything.

pitdicker · 2018-03-14T16:26:03Z

You are getting creative with the labels 😄.

I'm not really happy about putting a non-trivial algorithm within Rng though. Hopefully later we can find another place for this; in the mean time perhaps a private function would do.

I agree. At the moment I am trying out a single trait that implements functions like dhardy#82 (comment). Curious to see if it works at all...

Also can we ignore this PR for a while? I am getting less convinced by my own arguments that the bias is small 😄. The range is getting 1 step smaller every time. Is there another way to determine the acceptable zone to sample from without division, making use of this pattern?

dhardy · 2018-03-14T16:34:31Z

Yes, sure, it doesn't really matter if this is left until after 0.5.

Huh, I was trying to make labels actually convey useful information at a glance. Slightly useful maybe.

pitdicker · 2018-03-14T19:55:25Z

Huh, I was trying to make labels actually convey useful information at a glance. Slightly useful maybe.

Seems like a success to me

pitdicker · 2018-03-21T19:40:42Z

I have done several tries with a ShrinkingRange distribution. If we can calculate the acceptable zone for the ranges without divisions, would shuffle be faster? It is possible with only additions and subtractions after the initial set-up. But the extra instructions and branches turned out to not be really faster, but much more complex.

So I have only kept the idea here to use smaller integers than usize if possible, and used the normal gen_range.

Benchmark results:

test misc_shuffle_1000               ... bench:       4,026 ns/iter (+/- 276)
test misc_shuffle_1000 (x86)         ... bench:       4,630 ns/iter (+/- 42)

pitdicker · 2018-03-21T20:20:08Z

Maybe moving the function to the seq module was not the best idea. The entire module is not available without std. I would like to make it available, because of the idea's that are forming in dhardy#82, that are mostly no_std compatible.

What do you think?

pitdicker · 2018-03-21T20:22:47Z

Also, if you want to ignore this until after 0.5, feel free to do so.
I just didn't want to leave it a half-finished idea.

dhardy · 2018-03-21T20:56:57Z

I have some catching up to do, sorry. I'd like to get a 0.5 pre-release cut very soon; I guess non-ABI breaking changes like this could potentially still make the final 0.5 anyway.

dhardy

Not sure if you've got your range off by 1. Other than that looks okay. It's complicated and like most things random difficult to test conclusively, unfortunately.

dhardy · 2018-03-22T08:38:58Z

src/seq.rs

+    let mut i = values.len() as u64;
+    while i > (1 << 31) {
+        i -= 1;
+        values.swap(i as usize, rng.gen_range(0, i + 1) as usize);


We allow sampling i here (no swap)

If i is the length of the slice, i - 1 is the latest index. We should generate the range over (i - 1) + 1, because it is exclusive, and to allow a case where i does not get swapped. So I believe this code is correct.

In a single line you have "swap i and gen_range(0, i+1)", i.e. the latter sample can be i, which is I believe correct.

Just confirming 😄

dhardy · 2018-03-22T08:39:46Z

src/seq.rs

+            let val = value as u16;
+            value = value >> 16;
+            let (hi, lo) = val.wmul(i);
+            let zone = ::core::u16::MAX - (::core::u16::MAX - i + 1) % i;


But this code looks like it excludes i, comparing to the range code. I think replace i by range = i + 1 in this line and the one above (wmul(range))?

dhardy · 2018-03-22T08:45:11Z

Not sure about the organisation. You could no-std enable seq or just put the private function in lib.rs for now; I guess dhardy#82 will result in much of this changing again in the future anyway.

pitdicker · 2018-03-22T12:26:39Z

Updated. (just in case)

pitdicker · 2018-03-23T08:16:42Z

A nice property of the changes here: previously shuffle was platform-dependent, because it depended on the size of usize for the number of bits for every swap. Now it will use 32 bits or less if the slice contains less than 2^31 elements on both 32- and 64-bit platforms. As I believe that is the limit for allocations on 32-bit, that means the shuffle operation is now repeatable on all architectures.

pitdicker · 2018-03-31T17:19:21Z

I am going to close this PR for now. I have kept working in it, and some more of the things from dhardy#82, in https://github.com/pitdicker/rand/commits/slice_random.

TheIronBorn · 2018-06-08T20:17:09Z

I found another easy performance win for shuffles: unchecked swaps:

test misc_shuffle_1000               ... bench:      14,025 ns/iter (+/- 9,005)
test misc_shuffle_unchecked_1000     ... bench:      13,618 ns/iter (+/- 10,550)

(I used only gen_range::<u32> to reduce overhead)

Perhaps there is some way we could convince the compiler bound checks are not needed. Otherwise this will do:

#[inline]
unsafe fn swap_unchecked<T>(values: &mut [T], a: usize, b: usize) {
    let pa: *mut T = values.get_unchecked_mut(a);
    let pb: *mut T = values.get_unchecked_mut(b);
    ptr::swap(pa, pb);
}

(we could even only use get_unchecked_mut for the random index)

dhardy added the P-medium label Mar 12, 2018

dhardy reviewed Mar 14, 2018

View reviewed changes

dhardy added C-optimisation B-value Breakage: changes output values labels Mar 14, 2018

dhardy added P-low Priority: Low and removed P-medium labels Mar 14, 2018

pitdicker added 3 commits March 21, 2018 19:47

Increase shuffle bench slice length

8cad1fb

Improve shuffle performance

16b74d8

Use unbiased ranges in shuffle, move to separate function

88c88e0

pitdicker force-pushed the shuffle_perf branch from 122cad0 to 88c88e0 Compare March 21, 2018 19:30

dhardy mentioned this pull request Mar 22, 2018

Tracker: planned changes for 0.5 #232

Closed

33 tasks

Make seq module available in no_std

0d05746

dhardy reviewed Mar 22, 2018

View reviewed changes

Fix off by 1 error

1914cb4

pitdicker closed this Mar 31, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve shuffle performance #297

Improve shuffle performance #297

pitdicker commented Mar 12, 2018

pitdicker commented Mar 12, 2018

dhardy commented Mar 12, 2018

pitdicker commented Mar 12, 2018

dhardy commented Mar 12, 2018

pitdicker commented Mar 12, 2018

dhardy left a comment

dhardy Mar 14, 2018

pitdicker Mar 14, 2018

pitdicker commented Mar 14, 2018

dhardy commented Mar 14, 2018

pitdicker commented Mar 14, 2018

pitdicker commented Mar 21, 2018

pitdicker commented Mar 21, 2018

pitdicker commented Mar 21, 2018

dhardy commented Mar 21, 2018

dhardy left a comment

dhardy Mar 22, 2018

pitdicker Mar 22, 2018

dhardy Mar 22, 2018

pitdicker Mar 22, 2018

dhardy Mar 22, 2018

dhardy commented Mar 22, 2018

pitdicker commented Mar 22, 2018

pitdicker commented Mar 23, 2018

pitdicker commented Mar 31, 2018

TheIronBorn commented Jun 8, 2018 •

edited

Loading

Improve shuffle performance #297

Improve shuffle performance #297

Conversation

pitdicker commented Mar 12, 2018

pitdicker commented Mar 12, 2018

dhardy commented Mar 12, 2018

pitdicker commented Mar 12, 2018

dhardy commented Mar 12, 2018

pitdicker commented Mar 12, 2018

dhardy left a comment

Choose a reason for hiding this comment

dhardy Mar 14, 2018

Choose a reason for hiding this comment

pitdicker Mar 14, 2018

Choose a reason for hiding this comment

pitdicker commented Mar 14, 2018

dhardy commented Mar 14, 2018

pitdicker commented Mar 14, 2018

pitdicker commented Mar 21, 2018

pitdicker commented Mar 21, 2018

pitdicker commented Mar 21, 2018

dhardy commented Mar 21, 2018

dhardy left a comment

Choose a reason for hiding this comment

dhardy Mar 22, 2018

Choose a reason for hiding this comment

pitdicker Mar 22, 2018

Choose a reason for hiding this comment

dhardy Mar 22, 2018

Choose a reason for hiding this comment

pitdicker Mar 22, 2018

Choose a reason for hiding this comment

dhardy Mar 22, 2018

Choose a reason for hiding this comment

dhardy commented Mar 22, 2018

pitdicker commented Mar 22, 2018

pitdicker commented Mar 23, 2018

pitdicker commented Mar 31, 2018

TheIronBorn commented Jun 8, 2018 • edited Loading

TheIronBorn commented Jun 8, 2018 •

edited

Loading