Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve shuffle performance #297

Closed
wants to merge 5 commits into from

Conversation

pitdicker
Copy link
Contributor

Using two methods: not using perfectly unbiased ranges (see rationale in the comment), and making better use of the RNG (like not requesting 64 bits on an usize when there are less than 2^32 elements).

Before:

test misc_shuffle_1000 (x86_64)      ... bench:      10,370 ns/iter (+/- 91)
test misc_shuffle_1000 (x86)         ... bench:       7,565 ns/iter (+/- 45)

After:

test misc_shuffle_1000 (x86_64)      ... bench:       1,559 ns/iter (+/- 9)
test misc_shuffle_1000 (x86)         ... bench:       2,540 ns/iter (+/- 112)

Note that I didn't do endless tries optimizing this, I only implemented the basic ideas. It is a little surprising that x86 does not improve in a way similar to x86_64.

@pitdicker
Copy link
Contributor Author

Nice to compare against Xorshift.

test gen_u32_xorshift      ... bench:       1,374 ns/iter (+/- 6) = 2911 MB/s

@dhardy
Copy link
Member

dhardy commented Mar 12, 2018

Wait, your code can shuffle a 1000-element list in 13% more time than it takes to generate a u32 with Xorshift? I don't believe that.

Also, I stand by my previous comment that algorithms don't belong in the Rng trait 😉 Maybe it's time to investigate @vks's idea for an algorithms module.

TODO: investigate why the benchmark is so fast. This doesn't need to land in the 0.5 release.

@pitdicker
Copy link
Contributor Author

I believe the benchmark is right. For a slice with only 1000 elements two values can be swapped with every round of the RNG. Also the only things it does extra are keeping a loop counter, a multiply and the swap itself. If you increase the benchmark to use slices of for example 40mb it takes about twice as long, as expected because it mostly uses the slower loop.

@dhardy
Copy link
Member

dhardy commented Mar 12, 2018

Am I missing something? b.iter(...) should be measuring the time taken to shuffle 1000 elements, which with your code requires 250 u64 values, which is 500 calls to Xorshift's next_u32 fn — so the iteration time should be at least 500 times longer than next_u32 — oh, but gen_u32_xorshift calls that 1000 times per iteration; okay.

@pitdicker
Copy link
Contributor Author

Ah, I suppose I have played too much with the RNG benchmarks. I just assumed the factor 1000 and didn't even think to mention it 😄.

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the details mentioned about i, I'm happy with the rough shape of the algorithm.

I'm not really happy about putting a non-trivial algorithm within Rng though. Hopefully later we can find another place for this; in the mean time perhaps a private function would do.

I'm satisfied the biases would not be significant for most uses, but e.g. std has sort and sort_unstable, i.e. the default version is the safest version, so perhaps we should have shuffle and shuffle_fast (or shuffle_biased)?

src/lib.rs Outdated
}

let mut i = i as u16;
while i >= 2 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be while i >= 5 with special cases for the last choices? This algorithm could wrap i. Also implies the guards on above while loops may need changing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is also possible. My thought was to handle the special cases before starting a loop. Maybe doesn't matter anything.

@dhardy dhardy added C-optimisation B-value Breakage: changes output values labels Mar 14, 2018
@pitdicker
Copy link
Contributor Author

You are getting creative with the labels 😄.

I'm not really happy about putting a non-trivial algorithm within Rng though. Hopefully later we can find another place for this; in the mean time perhaps a private function would do.

I agree. At the moment I am trying out a single trait that implements functions like dhardy#82 (comment). Curious to see if it works at all...

Also can we ignore this PR for a while? I am getting less convinced by my own arguments that the bias is small 😄. The range is getting 1 step smaller every time. Is there another way to determine the acceptable zone to sample from without division, making use of this pattern?

@dhardy
Copy link
Member

dhardy commented Mar 14, 2018

Yes, sure, it doesn't really matter if this is left until after 0.5.

Huh, I was trying to make labels actually convey useful information at a glance. Slightly useful maybe.

@dhardy dhardy added P-low Priority: Low and removed P-medium labels Mar 14, 2018
@pitdicker
Copy link
Contributor Author

Huh, I was trying to make labels actually convey useful information at a glance. Slightly useful maybe.

Seems like a success to me

@pitdicker
Copy link
Contributor Author

I have done several tries with a ShrinkingRange distribution. If we can calculate the acceptable zone for the ranges without divisions, would shuffle be faster? It is possible with only additions and subtractions after the initial set-up. But the extra instructions and branches turned out to not be really faster, but much more complex.

So I have only kept the idea here to use smaller integers than usize if possible, and used the normal gen_range.

Benchmark results:

test misc_shuffle_1000               ... bench:       4,026 ns/iter (+/- 276)
test misc_shuffle_1000 (x86)         ... bench:       4,630 ns/iter (+/- 42)

@pitdicker
Copy link
Contributor Author

Maybe moving the function to the seq module was not the best idea. The entire module is not available without std. I would like to make it available, because of the idea's that are forming in dhardy#82, that are mostly no_std compatible.

What do you think?

@pitdicker
Copy link
Contributor Author

Also, if you want to ignore this until after 0.5, feel free to do so.
I just didn't want to leave it a half-finished idea.

@dhardy
Copy link
Member

dhardy commented Mar 21, 2018

I have some catching up to do, sorry. I'd like to get a 0.5 pre-release cut very soon; I guess non-ABI breaking changes like this could potentially still make the final 0.5 anyway.

@dhardy dhardy mentioned this pull request Mar 22, 2018
33 tasks
Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you've got your range off by 1. Other than that looks okay. It's complicated and like most things random difficult to test conclusively, unfortunately.

let mut i = values.len() as u64;
while i > (1 << 31) {
i -= 1;
values.swap(i as usize, rng.gen_range(0, i + 1) as usize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We allow sampling i here (no swap)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i is the length of the slice, i - 1 is the latest index. We should generate the range over (i - 1) + 1, because it is exclusive, and to allow a case where i does not get swapped. So I believe this code is correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a single line you have "swap i and gen_range(0, i+1)", i.e. the latter sample can be i, which is I believe correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming 😄

src/seq.rs Outdated
let val = value as u16;
value = value >> 16;
let (hi, lo) = val.wmul(i);
let zone = ::core::u16::MAX - (::core::u16::MAX - i + 1) % i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this code looks like it excludes i, comparing to the range code. I think replace i by range = i + 1 in this line and the one above (wmul(range))?

@dhardy
Copy link
Member

dhardy commented Mar 22, 2018

Not sure about the organisation. You could no-std enable seq or just put the private function in lib.rs for now; I guess dhardy#82 will result in much of this changing again in the future anyway.

@pitdicker
Copy link
Contributor Author

Updated. (just in case)

@pitdicker
Copy link
Contributor Author

A nice property of the changes here: previously shuffle was platform-dependent, because it depended on the size of usize for the number of bits for every swap. Now it will use 32 bits or less if the slice contains less than 2^31 elements on both 32- and 64-bit platforms. As I believe that is the limit for allocations on 32-bit, that means the shuffle operation is now repeatable on all architectures.

@pitdicker
Copy link
Contributor Author

I am going to close this PR for now. I have kept working in it, and some more of the things from dhardy#82, in https://github.com/pitdicker/rand/commits/slice_random.

@pitdicker pitdicker closed this Mar 31, 2018
@TheIronBorn
Copy link
Contributor

TheIronBorn commented Jun 8, 2018

I found another easy performance win for shuffles: unchecked swaps:

test misc_shuffle_1000               ... bench:      14,025 ns/iter (+/- 9,005)
test misc_shuffle_unchecked_1000     ... bench:      13,618 ns/iter (+/- 10,550)

(I used only gen_range::<u32> to reduce overhead)

Perhaps there is some way we could convince the compiler bound checks are not needed. Otherwise this will do:

#[inline]
unsafe fn swap_unchecked<T>(values: &mut [T], a: usize, b: usize) {
    let pa: *mut T = values.get_unchecked_mut(a);
    let pb: *mut T = values.get_unchecked_mut(b);
    ptr::swap(pa, pb);
}

(we could even only use get_unchecked_mut for the random index)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-value Breakage: changes output values C-optimisation P-low Priority: Low
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants