Allow sampling from a closed integer range #2

Merged
merged 4 commits into from
Oct 14, 2017

pitdicker

This does not compile, but I wanted to get your opinion.

The last few days I tried just about every trick to make sampling from a range of integers faster, like making the zone a power of 2 (so mod becomes &), and multiplication with Mersenne numbers. In the end, you are just trading the modulus for more calls to the PRNG (18~25% more). Because PRNGs can be much slower than the Xorshift used in the benchmarks, this is not a promising route to take...

It also took me a lot of time to understand your method to calculate the zone. So I tried to come up with something myself, and it is exactly the same. I guess that proves there are no more off-by-one errors :-).

What I could do was add some extra comments to explain what is happening here. The modulus could also be optimised slightly thanks to this trick: skipping it in the rare case that v already falls in the target range, which makes it a few percent faster.

What I would like your opinion on: is it useful to expose an inclusive range? As per this comment. It fit a little better with my mental model when writing the code.
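
As a rough illustration of the zone calculation discussed above, here is a minimal sketch of sampling from a closed range of u32 values. It is not the code from this PR: `XorShift32` and `sample_inclusive` are hypothetical names, and the tiny xorshift generator only stands in for a real PRNG. The sketch handles the full range of the type by letting the range size wrap to 0 and skipping the modulus in that case.

```rust
/// Hypothetical stand-in for a real PRNG: one 32-bit xorshift step per call.
/// The seed must be non-zero.
struct XorShift32(u32);

impl XorShift32 {
    fn next_u32(&mut self) -> u32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.0 = x;
        x
    }
}

/// Sample from the closed range [low, high] by rejection (illustrative sketch).
fn sample_inclusive(rng: &mut XorShift32, low: u32, high: u32) -> u32 {
    assert!(low <= high, "low must not exceed high");
    // Number of values in the range. For [0, u32::MAX] this would be
    // `u32::MAX + 1`, which wraps to 0.
    let range = high.wrapping_sub(low).wrapping_add(1);
    if range == 0 {
        // Full range of the type: every value is acceptable, no modulus needed.
        return rng.next_u32();
    }
    // 2^32 mod range, computed without overflowing u32.
    let ints_to_reject = (u32::MAX - range + 1) % range;
    // Accept v in 0..=zone; the number of accepted values is a multiple of range.
    let zone = u32::MAX - ints_to_reject;
    loop {
        let v = rng.next_u32();
        if v <= zone {
            return low + v % range;
        }
    }
}
```

A call such as `sample_inclusive(&mut XorShift32(1), 10, 20)` returns a value between 10 and 20, both ends included.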

Owner

@dhardy dhardy left a comment

Looks like most of the changes are to handle the case where unsigned_max is within the range, and yet it's still not possible to construct such a range.

I like the code but won't merge yet.

What do you think? We could add a Range::closed or new_range_inclusive constructor maybe?

// the type, it has to store `unsigned_max + 1`, which can't be
// represented. But a range of size 0 can't exist, and a
// modulus op `unsigned_max + 1` is a no-op. So we treat this as
// a special case. Wrapping arithmetic makes representing
Owner

Grammar: "even" before "makes", not "simple"

// `unsigned_max + 1` as 0 even simple.
//
// We don't calculate zone directly, but first calculate the
// number of integers to reject first. With a wrikle to handle
Owner

What's a "wrikle"?

@pitdicker
Author

I fixed the spelling in the comments, thank you.

About how or where to add a constructor: there are so many designs and questions in the RFC thread, I don't know where to start. And I know almost nothing about API design...
Probably best to leave some comments on the RFC thread first.

@dhardy
Owner

dhardy commented Aug 30, 2017

I'd rather address other stuff (the traits and crate split) on the RFC thread first; there's too much going on there at the moment.

Using inclusive ranges seems sensible to me but would be a breaking change, so maybe a second constructor/function.

@pitdicker
Author

Sorry for derailing the discussion on the RFC thread, I saw your comment too late.

I sure don't think changing the default from open to closed ranges is a good idea! It would be very easy to silently break the code of others. Even with good documentation.

@pitdicker
Author

I had two ideas that should help make the range code a bit faster in some cases:

  • If the type of the range is u8 or u16, we should pick a zone that fits in a u32. This greatly reduces the number of random integers that have to be rejected. The zone can still be stored in the RangeInt struct thanks to a trick: the bits of the u32 that get truncated happen to all be 1's.
  • There are two good techniques for this range code. The current one creates a zone as large as possible and reduces it with a modulus. The other first reduces the random integers to the power of two that is greater than the range, and then rejects everything greater than the range. Which is optimal depends on how fast generating random numbers is compared to the modulus operation, and on how close a range is to its next power of two. We can just implement both techniques, and then pick the one that is probably best based on the size of the range. I think a good cutoff is when the range is more than 3/4 of the next power of two. Depending on the range, 0~25% of the numbers have to be rejected, 12.5% on average. But maybe 7/8 is better.
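
The second technique and the proposed cutoff can be sketched roughly as follows. This is an illustration under assumptions, not the code that was benchmarked: `xorshift32`, `sample_mask` and `prefer_mask` are hypothetical names, the random source is passed in as a closure, and `range` here means the number of values to choose from (results are in `0..range`).

```rust
/// Hypothetical stand-in for a PRNG: one xorshift step on a non-zero state.
fn xorshift32(state: &mut u32) -> u32 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    x
}

/// All-ones mask covering `range - 1`, so `mask + 1` is the smallest
/// power of two that is at least `range`.
fn mask_for(range: u32) -> u32 {
    // A shift by 32 (when range == 1) is not allowed, so fall back to 0.
    u32::MAX
        .checked_shr((range - 1).leading_zeros())
        .unwrap_or(0)
}

/// Reduce each random integer with a bit mask, then reject values that
/// are still outside `0..range`. No modulus is needed.
fn sample_mask(mut next_u32: impl FnMut() -> u32, range: u32) -> u32 {
    assert!(range > 0, "range must contain at least one value");
    let mask = mask_for(range);
    loop {
        let v = next_u32() & mask;
        if v < range {
            return v;
        }
    }
}

/// Cutoff from the discussion: use the mask technique only when `range`
/// is more than 3/4 of its power of two, so at most 25% is rejected.
fn prefer_mask(range: u32) -> bool {
    let pow = u64::from(mask_for(range)) + 1;
    u64::from(range) * 4 > pow * 3
}
```

With this cutoff a range of 100 values (next power of two 128) would use the mask, while a range of 70 values would fall back to the zone and modulus approach.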

I'll report back when I have something.

@pitdicker pitdicker force-pushed the range_int branch 3 times, most recently from d8facb2 to fc0a6e5 Compare September 2, 2017 17:32
@pitdicker
Author

I spent a few more hours trying to figure out why the code runs slower than it should. Without much success...

It turns out the modulus operation is slow, but not the real problem. Apparently Rust adds a check for division by zero before the operation, and possibly panics. This slowed down the function by about 5%. If I leave out the loop from sample, it is 40% faster, even though the loop almost always (99+%) runs just once.
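
To make the division-by-zero check concrete, here is a small sketch. It is an assumption-laden illustration and not something this PR does: `reduce_checked` and `reduce_nonzero` are hypothetical helpers, and the `NonZeroU32` variant relies on an operator implementation from newer Rust versions than the one used in 2017. Whether it recovers the roughly 5% mentioned above has not been measured.

```rust
use std::num::NonZeroU32;

/// Plain remainder. `v % 0` panics, so unless the compiler can prove that
/// `range` is non-zero it has to keep a zero check before the division.
fn reduce_checked(v: u32, range: u32) -> u32 {
    v % range
}

/// Remainder with a divisor that is non-zero by construction.
/// This operation cannot panic, so no runtime zero check is required.
fn reduce_nonzero(v: u32, range: NonZeroU32) -> u32 {
    v % range
}
```

The zero case then has to be handled once, when the range is constructed: `NonZeroU32::new(0)` returns `None`, which in a range sampler could correspond to the special case where the range covers the whole type.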

The idea to pick a larger zone for u8 and u16 worked out nicely. But it got a little complicated with the rules for casting. The idea of providing two different techniques produced some nice and complicated code, but was not any faster. Benchmarks:
Before:

test distr_range2_i16        ... bench:       2,513 ns/iter (+/- 51) = 397 MB/s
test distr_range2_i8         ... bench:       2,941 ns/iter (+/- 30) = 340 MB/s
test distr_range2_int        ... bench:       3,104 ns/iter (+/- 37) = 2577 MB/s
test distr_range_int         ... bench:       3,102 ns/iter (+/- 29) = 2578 MB/s

After:

test distr_range2_i16        ... bench:       1,238 ns/iter (+/- 24) = 807 MB/s
test distr_range2_i8         ... bench:       1,231 ns/iter (+/- 34) = 812 MB/s
test distr_range2_int        ... bench:       2,961 ns/iter (+/- 55) = 2701 MB/s
test distr_range_int         ... bench:       3,120 ns/iter (+/- 43) = 2564 MB/s

I have added a function new_inclusive to RangeImpl, but not exposed the closed range methods further.

@pitdicker pitdicker force-pushed the range_int branch 2 times, most recently from 57b5c5b to 272b385 Compare September 7, 2017 09:32
@pitdicker
Author

Rebased.
You found the strange performance difference I have been searching for for hours: the compiler became too smart with the benchmarks :-). Now it all looks a lot less nice...

Before:

test distr_range2_i8         ... bench:       5,706 ns/iter (+/- 52) = 175 MB/s
test distr_range2_i16        ... bench:       4,699 ns/iter (+/- 33) = 425 MB/s
test distr_range2_i32        ... bench:       5,325 ns/iter (+/- 46) = 751 MB/s
test distr_range2_i64        ... bench:      11,080 ns/iter (+/- 26) = 722 MB/s

After:

test distr_range2_i8         ... bench:       4,863 ns/iter (+/- 38) = 205 MB/s
test distr_range2_i16        ... bench:       4,861 ns/iter (+/- 38) = 411 MB/s
test distr_range2_i32        ... bench:       5,426 ns/iter (+/- 50) = 737 MB/s
test distr_range2_i64        ... bench:      10,925 ns/iter (+/- 24) = 732 MB/s

Some improvements, some losses. But it all depends very much on the size of the range, at least for i32 and i64.

I still think this PR is useful, as it adds support for closed ranges (e.g. handling ranges that can cover the entire range of the type), plus the optimisation for small integers and the extra comments.
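
The small-integer optimisation can be pictured with the following sketch. It is a simplified assumption of the idea, not the merged implementation: `sample_u8_inclusive`, `rejected_with_u8_zone` and `rejected_with_u32_zone` are hypothetical names, and the trick of storing the zone in a u8 is left out.

```rust
/// Values rejected per 256 draws when the zone is computed in u8.
fn rejected_with_u8_zone(range: u32) -> u32 {
    256 % range
}

/// Values rejected per 2^32 draws when the zone is computed in u32
/// (2^32 mod range, written so that it does not overflow).
fn rejected_with_u32_zone(range: u32) -> u32 {
    (u32::MAX - range + 1) % range
}

/// Sample a u8 from [low, high] using a zone that spans the whole u32.
/// `next_u32` is a stand-in for the PRNG.
fn sample_u8_inclusive(mut next_u32: impl FnMut() -> u32, low: u8, high: u8) -> u8 {
    assert!(low <= high, "low must not exceed high");
    // Between 1 and 256, so it always fits in a u32 without wrapping.
    let range = u32::from(high - low) + 1;
    let zone = u32::MAX - rejected_with_u32_zone(range);
    loop {
        let v = next_u32();
        if v <= zone {
            // v % range is at most high - low, so the cast and sum are safe.
            return low + (v % range) as u8;
        }
    }
}
```

For a range of 200 values, a zone in u8 rejects 56 of every 256 draws (about 22%), while a zone in u32 rejects only 96 of every 2^32 draws.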

@dhardy
Owner

dhardy commented Sep 7, 2017

Yes, I was scratching my head over why moving benchmarks from one module to another made a big difference, until I realised one used blackbox. Micro-benchmarks are tricky.

Can you add a Range::new_inclusive constructor?
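
The benchmark pitfall mentioned above can be illustrated with a short sketch. It assumes that "blackbox" refers to a `black_box` function; the benchmarks in this PR presumably used the nightly `test::black_box`, whereas this example uses `std::hint::black_box` from current stable Rust. `bench_modulus` is a hypothetical helper, not part of the crate.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Time a loop of modulus operations. `black_box` hides the inputs and the
/// result from the optimiser, so the loop is less likely to be folded into
/// a constant or removed entirely.
fn bench_modulus(iters: u32) -> (u32, Duration) {
    let start = Instant::now();
    let mut acc = 0u32;
    for i in 0..iters {
        // Without black_box the compiler may see that 7 is a constant and
        // replace the division with cheaper arithmetic.
        acc = acc.wrapping_add(black_box(i) % black_box(7));
    }
    (black_box(acc), start.elapsed())
}
```

Comparing a version with and without the `black_box` calls is one way to check whether a benchmark measures the operation or only the optimiser.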

@pitdicker
Author

Finally finished this. Sorry for taking so long.

Would it be okay if I make a PR that removes Range and replaces it with Range2?

These changes make it possible to sample from closed ranges, not only from open.

Included is a small optimisation for the modulus operator, and an optimisation
for the types i8/u8 and i16/u16.
@dhardy
Owner

dhardy commented Sep 29, 2017

No problem.

Yes, I was planning on removing the original range; actually this PR is the reason I didn't yet. Go ahead.

@pitdicker pitdicker force-pushed the range_int branch 3 times, most recently from b15a99d to fdf5141 Compare September 30, 2017 17:41
@pitdicker
Author

Thank you. Removed the original range.

@@ -263,6 +264,7 @@ pub use thread_local::{ThreadRng, thread_rng, set_thread_rng, set_new_thread_rng
random, random_with};

use prng::IsaacWordRng;
use distributions::range::Range;
Author

Note: there is a second import around line 472: use distributions::range::SampleRange;.
I left it there to avoid a rebase.

@pitdicker
Author

Hi @dhardy. Is there a chance that you could merge this and the other PRs, or that we can move them along?

@dhardy
Owner

dhardy commented Oct 14, 2017

Yeah, I guess. Sorry, I've been busy with some other work and travel the last couple of weeks, should have more time now.

@dhardy dhardy merged commit 97ab178 into dhardy:master Oct 14, 2017
@pitdicker
Author

Thank you!
