
Add a benchmark for 32-bit UniformInt sample_single #985

Merged (2 commits, Jul 19, 2020)

Conversation

@jongiddy (Contributor) commented Jun 9, 2020

This PR adds a benchmark for sample_single for UniformInt<u32>.

This demonstrates the problem I described in #951.

I get these results, where dist is sampling from an existing distribution and single is using sample_single:

test sample_u32_3_dist        ... bench:          11 ns/iter (+/- 1)
test sample_u32_3_single      ... bench:          18 ns/iter (+/- 0)
test sample_u32_4_dist        ... bench:          11 ns/iter (+/- 0)
test sample_u32_4_single      ... bench:          35 ns/iter (+/- 1)
test sample_u32_half_dist     ... bench:          11 ns/iter (+/- 0)
test sample_u32_half_single   ... bench:          35 ns/iter (+/- 5)
test sample_u32_halfm1_dist   ... bench:          11 ns/iter (+/- 2)
test sample_u32_halfm1_single ... bench:           8 ns/iter (+/- 0)
test sample_u32_halfp1_dist   ... bench:          39 ns/iter (+/- 1)
test sample_u32_halfp1_single ... bench:          39 ns/iter (+/- 5)
test sample_u32_max_dist      ... bench:          11 ns/iter (+/- 1)
test sample_u32_max_single    ... bench:           7 ns/iter (+/- 0)

This shows that, while sample_single is faster for some ranges, it can be significantly slower for others.

Before making any changes to the sample_single code, I'd like to get these results verified so I don't make changes and then find there's a bug in the benchmark.
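To clarify what the two benchmark variants measure, here is a self-contained sketch of the pattern under test. This is not rand's actual source: the stub RNG, the type name, and the helper are invented for illustration, but the dist path computes the rejection zone once and reuses it, while the single path pays the full setup cost on every call.

```rust
// Toy deterministic RNG (xorshift32) standing in for rand's generators.
struct StubRng(u32);

impl StubRng {
    fn next_u32(&mut self) -> u32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.0 = x;
        x
    }
}

/// 64-bit widening multiply: returns (high word, low word) of a * b.
fn widening_mul(a: u32, b: u32) -> (u32, u32) {
    let wide = (a as u64) * (b as u64);
    ((wide >> 32) as u32, wide as u32)
}

/// "dist" path: the exact rejection zone is computed once, with a
/// (relatively expensive) modulo, then amortized over many samples.
struct UniformU32 {
    low: u32,
    range: u32,
    zone: u32,
}

impl UniformU32 {
    fn new(low: u32, high: u32) -> Self {
        let range = high - low; // half-open range [low, high)
        let ints_to_reject = (u32::MAX - range + 1) % range; // == 2^32 % range
        UniformU32 { low, range, zone: u32::MAX - ints_to_reject }
    }

    fn sample(&self, rng: &mut StubRng) -> u32 {
        loop {
            let (hi, lo) = widening_mul(rng.next_u32(), self.range);
            if lo <= self.zone {
                return self.low + hi; // hi is uniform in [0, range)
            }
        }
    }
}

/// "single" path: all setup, including the zone, is paid on every call.
fn sample_single(low: u32, high: u32, rng: &mut StubRng) -> u32 {
    UniformU32::new(low, high).sample(rng)
}

fn main() {
    let mut rng = StubRng(0x2545_F491);
    let die = UniformU32::new(0, 6);
    for _ in 0..1_000 {
        assert!(die.sample(&mut rng) < 6);
        assert!(sample_single(0, 6, &mut rng) < 6);
    }
    println!("ok");
}
```

The per-call zone computation is exactly where sample_single can either win (by approximating the zone cheaply) or lose (by paying for a modulo on every call), which is what these benchmarks probe.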

@jongiddy (Contributor, Author) commented Jun 9, 2020

Simply removing the heuristic gives the following results:

test sample_u32_3_dist        ... bench:          11 ns/iter (+/- 1)
test sample_u32_3_single      ... bench:           8 ns/iter (+/- 1)
test sample_u32_4_dist        ... bench:          11 ns/iter (+/- 1)
test sample_u32_4_single      ... bench:           7 ns/iter (+/- 4)
test sample_u32_half_dist     ... bench:          11 ns/iter (+/- 2)
test sample_u32_half_single   ... bench:           7 ns/iter (+/- 1)
test sample_u32_halfm1_dist   ... bench:          11 ns/iter (+/- 1)
test sample_u32_halfm1_single ... bench:           8 ns/iter (+/- 0)
test sample_u32_halfp1_dist   ... bench:          40 ns/iter (+/- 2)
test sample_u32_halfp1_single ... bench:          37 ns/iter (+/- 6)
test sample_u32_max_dist      ... bench:          11 ns/iter (+/- 7)
test sample_u32_max_single    ... bench:           8 ns/iter (+/- 0)

So, at least in my environment with rustc 1.44.0, sample_single is then faster for all cases.
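For context, the heuristic in question is sample_single's cheap approximation of the acceptance zone, which avoids the modulo by rounding the zone down to the largest power-of-two multiple of the range. A sketch of the two computations (paraphrased from uniform.rs, not verbatim) shows why the 4 and half ranges are pathological while halfm1 is ideal:

```rust
/// Exact zone: reject only the 2^32 % range values that cannot be
/// spread evenly across the range -- costs a modulo to compute.
fn exact_zone(range: u32) -> u32 {
    let ints_to_reject = (u32::MAX - range + 1) % range;
    u32::MAX - ints_to_reject
}

/// Heuristic zone: round down to the largest power-of-two multiple of
/// `range`, needing only a count-leading-zeros and a shift, but it can
/// reject up to nearly half of all candidate values.
fn approximate_zone(range: u32) -> u32 {
    (range << range.leading_zeros()).wrapping_sub(1)
}

fn main() {
    let half = 1u32 << 31;

    // range = 2^31 ("half"): the exact zone accepts everything, but the
    // heuristic rejects half of all values -- hence roughly 2x the time.
    assert_eq!(exact_zone(half), u32::MAX);
    assert_eq!(approximate_zone(half), half - 1);

    // range = 4: same story, which is why sample_u32_4_single is slow.
    assert_eq!(exact_zone(4), u32::MAX);
    assert_eq!(approximate_zone(4), half - 1);

    // range = 2^31 - 1 ("halfm1"): the heuristic matches the exact zone,
    // which is why that benchmark is fast either way.
    assert_eq!(approximate_zone(half - 1), exact_zone(half - 1));
    println!("ok");
}
```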

@jongiddy (Contributor, Author) commented Jun 9, 2020

Putting the low and high values into their own black_box, to prevent constant propagation from simplifying the modulo operation, brings us back to dist running faster (still with the heuristic removed):

test sample_u32_3_dist        ... bench:          11 ns/iter (+/- 0)
test sample_u32_3_single      ... bench:          14 ns/iter (+/- 0)
test sample_u32_4_dist        ... bench:          11 ns/iter (+/- 1)
test sample_u32_4_single      ... bench:          14 ns/iter (+/- 0)
test sample_u32_half_dist     ... bench:          11 ns/iter (+/- 0)
test sample_u32_half_single   ... bench:          15 ns/iter (+/- 8)
test sample_u32_halfm1_dist   ... bench:          11 ns/iter (+/- 0)
test sample_u32_halfm1_single ... bench:          14 ns/iter (+/- 1)
test sample_u32_halfp1_dist   ... bench:          39 ns/iter (+/- 0)
test sample_u32_halfp1_single ... bench:          47 ns/iter (+/- 0)
test sample_u32_max_dist      ... bench:          11 ns/iter (+/- 0)
test sample_u32_max_single    ... bench:          14 ns/iter (+/- 0)
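The black_box trick above can be sketched as follows, using std::hint::black_box, the std equivalent (stable since Rust 1.66) of the unstable test::black_box the benchmark would have used:

```rust
use std::hint::black_box;

fn main() {
    // Without black_box, `high - low` is the compile-time constant 6 and
    // LLVM strength-reduces the modulo into shifts and multiplies.
    // Hiding each bound forces the code path that a caller supplying
    // runtime bounds would actually execute.
    let low: u32 = black_box(0);
    let high: u32 = black_box(6);
    let range = high - low;
    let ints_to_reject = (u32::MAX - range + 1) % range; // 2^32 % 6
    assert_eq!(ints_to_reject, 4);
    println!("{}", ints_to_reject);
}
```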

@jongiddy (Contributor, Author) commented Jun 9, 2020

With the heuristic added back in:

test sample_u32_3_dist        ... bench:          11 ns/iter (+/- 0)
test sample_u32_3_single      ... bench:          20 ns/iter (+/- 1)
test sample_u32_4_dist        ... bench:          11 ns/iter (+/- 0)
test sample_u32_4_single      ... bench:          39 ns/iter (+/- 5)
test sample_u32_half_dist     ... bench:          11 ns/iter (+/- 0)
test sample_u32_half_single   ... bench:          40 ns/iter (+/- 7)
test sample_u32_halfm1_dist   ... bench:          11 ns/iter (+/- 0)
test sample_u32_halfm1_single ... bench:           9 ns/iter (+/- 0)
test sample_u32_halfp1_dist   ... bench:          38 ns/iter (+/- 0)
test sample_u32_halfp1_single ... bench:          40 ns/iter (+/- 2)
test sample_u32_max_dist      ... bench:          11 ns/iter (+/- 0)
test sample_u32_max_single    ... bench:          10 ns/iter (+/- 0)

Some conclusions:

  • Using the heuristic approximation gives a small speedup to some ranges, but causes large slowdowns in other ranges.
  • Using the modulo calculation in sample_single is slower than using an existing distribution, but only by a small amount. On this basis, it would seem sensible to remove sample_single and use the default.
  • The modulo calculation benefits if the range limits are known constants, to the point that it becomes faster than using an existing distribution. This might suggest keeping sample_single with the heuristic removed.

@jongiddy (Contributor, Author) commented:

I've made some changes to the benchmark to ensure that it is representative. Most importantly, I moved distribution creation into the timed part of the benchmark to better reflect the cost of generating a single random value.
Also:

  • iterate more times per benchmark, so the measured times are larger and the resolution is better
  • change the range from 3 to 6: it has much the same properties but represents a common use of random numbers, dice throws
  • dice throws would typically use constant bounds, so benchmark that case as well
  • benchmark the smaller integer sizes
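The structural change can be sketched without the unstable test::Bencher by timing a loop that performs construction and sampling together. All names here are illustrative, and the arithmetic mirrors the exact-zone path rather than rand's full implementation:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Per-call cost model: build the distribution state AND draw one value.
/// `v` stands in for the RNG output; a real implementation would
/// resample on rejection instead of returning a placeholder.
fn construct_and_sample(low: u32, high: u32, v: u32) -> u32 {
    let range = high - low;
    let zone = u32::MAX - (u32::MAX - range + 1) % range; // paid per call
    let wide = (v as u64) * (range as u64);
    if (wide as u32) <= zone {
        low + (wide >> 32) as u32
    } else {
        low // placeholder: real code loops with a fresh v
    }
}

fn main() {
    let iters = 100_000u32;
    let start = Instant::now();
    let mut acc = 0u32;
    for i in 0..iters {
        // The bounds pass through black_box so the zone really is
        // recomputed each iteration, as with runtime-supplied bounds.
        acc ^= construct_and_sample(black_box(0), black_box(6), i);
    }
    black_box(acc);
    println!("{} iterations in {:?}", iters, start.elapsed());
}
```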

The u32 values with no code changes to UniformInt are now:

test sample_u32_4_dist            ... bench:          84 ns/iter (+/- 4)
test sample_u32_4_single          ... bench:         204 ns/iter (+/- 10)
test sample_u32_6_dist            ... bench:          84 ns/iter (+/- 7)
test sample_u32_6_single          ... bench:         112 ns/iter (+/- 1)
test sample_u32_6_single_constant ... bench:         110 ns/iter (+/- 3)
test sample_u32_half_dist         ... bench:          84 ns/iter (+/- 16)
test sample_u32_half_single       ... bench:         205 ns/iter (+/- 9)
test sample_u32_halfm1_dist       ... bench:          83 ns/iter (+/- 5)
test sample_u32_halfm1_single     ... bench:          63 ns/iter (+/- 1)
test sample_u32_halfp1_dist       ... bench:         260 ns/iter (+/- 9)
test sample_u32_halfp1_single     ... bench:         205 ns/iter (+/- 9)
test sample_u32_max_dist          ... bench:          84 ns/iter (+/- 55)
test sample_u32_max_single        ... bench:          62 ns/iter (+/- 3)

This generally shows the same pattern as the original benchmark, although halfp1 (n/2+1) is now faster using sample_single.

With the approximation removed:

test sample_u32_4_dist            ... bench:          87 ns/iter (+/- 31)
test sample_u32_4_single          ... bench:          73 ns/iter (+/- 29)
test sample_u32_6_dist            ... bench:          87 ns/iter (+/- 2)
test sample_u32_6_single          ... bench:          72 ns/iter (+/- 7)
test sample_u32_6_single_constant ... bench:          68 ns/iter (+/- 20)
test sample_u32_half_dist         ... bench:          93 ns/iter (+/- 16)
test sample_u32_half_single       ... bench:          72 ns/iter (+/- 4)
test sample_u32_halfm1_dist       ... bench:          87 ns/iter (+/- 21)
test sample_u32_halfm1_single     ... bench:          72 ns/iter (+/- 4)
test sample_u32_halfp1_dist       ... bench:         261 ns/iter (+/- 20)
test sample_u32_halfp1_single     ... bench:         226 ns/iter (+/- 12)
test sample_u32_max_dist          ... bench:          87 ns/iter (+/- 19)
test sample_u32_max_single        ... bench:          72 ns/iter (+/- 1)

The best cases (max and halfm1) and the worst case (halfp1) are slower than before, but sample_single is now consistently faster than sampling from an existing distribution in every case.

@jongiddy (Contributor, Author) commented:

I'm trying to square these results with the benchmark in rand_distr which, with the existing code, gives:

test gen_range_i128                        ... bench:      11,377 ns/iter (+/- 2,543) = 1406 MB/s
test gen_range_i16                         ... bench:       4,422 ns/iter (+/- 2,681) = 452 MB/s
test gen_range_i32                         ... bench:       3,461 ns/iter (+/- 695) = 1155 MB/s
test gen_range_i64                         ... bench:       3,877 ns/iter (+/- 570) = 2063 MB/s
test gen_range_i8                          ... bench:       3,737 ns/iter (+/- 408) = 267 MB/s

but with the removal of the approximation, gives:

test gen_range_i128                        ... bench:     240,309 ns/iter (+/- 23,155) = 66 MB/s
test gen_range_i16                         ... bench:       3,734 ns/iter (+/- 41) = 535 MB/s
test gen_range_i32                         ... bench:       5,137 ns/iter (+/- 1,233) = 778 MB/s
test gen_range_i64                         ... bench:      10,349 ns/iter (+/- 752) = 773 MB/s
test gen_range_i8                          ... bench:       3,740 ns/iter (+/- 31) = 267 MB/s

The current modulo implementation is much more expensive for larger integers, but it's also disappointing that the 32-bit benchmark increases so much. I expect this is the source of the claims in uniform.rs that 8- and 16-bit types don't benefit from the approximation.

@dhardy (Member) commented Jun 16, 2020

For comparison, on my Haswell desktop CPU without changing sampling code:

test gen_range_f32                         ... bench:       3,872 ns/iter (+/- 162) = 1033 MB/s
test gen_range_f64                         ... bench:       4,265 ns/iter (+/- 239) = 1875 MB/s
test gen_range_i128                        ... bench:      13,621 ns/iter (+/- 678) = 1174 MB/s
test gen_range_i16                         ... bench:       5,278 ns/iter (+/- 102) = 378 MB/s
test gen_range_i32                         ... bench:       3,215 ns/iter (+/- 134) = 1244 MB/s
test gen_range_i64                         ... bench:       3,642 ns/iter (+/- 75) = 2196 MB/s
test gen_range_i8                          ... bench:       5,038 ns/iter (+/- 294) = 198 MB/s

test sample_u16_4_dist            ... bench:          96 ns/iter (+/- 4)
test sample_u16_4_single          ... bench:          79 ns/iter (+/- 3)
test sample_u16_6_dist            ... bench:          97 ns/iter (+/- 7)
test sample_u16_6_single          ... bench:          79 ns/iter (+/- 3)
test sample_u16_6_single_constant ... bench:          79 ns/iter (+/- 4)
test sample_u16_half_dist         ... bench:          96 ns/iter (+/- 2)
test sample_u16_half_single       ... bench:          79 ns/iter (+/- 3)
test sample_u16_halfm1_dist       ... bench:          96 ns/iter (+/- 8)
test sample_u16_halfm1_single     ... bench:          79 ns/iter (+/- 4)
test sample_u16_halfp1_dist       ... bench:          96 ns/iter (+/- 3)
test sample_u16_halfp1_single     ... bench:          79 ns/iter (+/- 4)
test sample_u16_max_dist          ... bench:          95 ns/iter (+/- 3)
test sample_u16_max_single        ... bench:          80 ns/iter (+/- 3)
test sample_u32_4_dist            ... bench:          95 ns/iter (+/- 3)
test sample_u32_4_single          ... bench:         179 ns/iter (+/- 12)
test sample_u32_6_dist            ... bench:          96 ns/iter (+/- 5)
test sample_u32_6_single          ... bench:         103 ns/iter (+/- 4)
test sample_u32_6_single_constant ... bench:          99 ns/iter (+/- 2)
test sample_u32_half_dist         ... bench:          96 ns/iter (+/- 7)
test sample_u32_half_single       ... bench:         178 ns/iter (+/- 7)
test sample_u32_halfm1_dist       ... bench:          95 ns/iter (+/- 4)
test sample_u32_halfm1_single     ... bench:          62 ns/iter (+/- 5)
test sample_u32_halfp1_dist       ... bench:         231 ns/iter (+/- 6)
test sample_u32_halfp1_single     ... bench:         180 ns/iter (+/- 10)
test sample_u32_max_dist          ... bench:          96 ns/iter (+/- 9)
test sample_u32_max_single        ... bench:          62 ns/iter (+/- 2)
test sample_u8_4_dist             ... bench:          96 ns/iter (+/- 9)
test sample_u8_4_single           ... bench:          79 ns/iter (+/- 5)
test sample_u8_6_dist             ... bench:          96 ns/iter (+/- 3)
test sample_u8_6_single           ... bench:          79 ns/iter (+/- 3)
test sample_u8_6_single_constant  ... bench:          77 ns/iter (+/- 4)
test sample_u8_half_dist          ... bench:          96 ns/iter (+/- 3)
test sample_u8_half_single        ... bench:          79 ns/iter (+/- 5)
test sample_u8_halfm1_dist        ... bench:          95 ns/iter (+/- 5)
test sample_u8_halfm1_single      ... bench:          79 ns/iter (+/- 3)
test sample_u8_halfp1_dist        ... bench:          96 ns/iter (+/- 3)
test sample_u8_halfp1_single      ... bench:          79 ns/iter (+/- 5)
test sample_u8_max_dist           ... bench:          96 ns/iter (+/- 5)
test sample_u8_max_single         ... bench:          79 ns/iter (+/- 2)

With the left-shift approximation removed:

test gen_range_f32                         ... bench:       3,913 ns/iter (+/- 342) = 1022 MB/s
test gen_range_f64                         ... bench:       4,309 ns/iter (+/- 181) = 1856 MB/s
test gen_range_i128                        ... bench:     232,733 ns/iter (+/- 5,645) = 68 MB/s
test gen_range_i16                         ... bench:       5,314 ns/iter (+/- 177) = 376 MB/s
test gen_range_i32                         ... bench:       5,433 ns/iter (+/- 274) = 736 MB/s
test gen_range_i64                         ... bench:      10,041 ns/iter (+/- 1,082) = 796 MB/s
test gen_range_i8                          ... bench:       5,274 ns/iter (+/- 488) = 189 MB/s

test sample_u16_4_dist            ... bench:          98 ns/iter (+/- 12)
test sample_u16_4_single          ... bench:          81 ns/iter (+/- 9)
test sample_u16_6_dist            ... bench:          96 ns/iter (+/- 5)
test sample_u16_6_single          ... bench:          79 ns/iter (+/- 6)
test sample_u16_6_single_constant ... bench:          81 ns/iter (+/- 9)
test sample_u16_half_dist         ... bench:          98 ns/iter (+/- 15)
test sample_u16_half_single       ... bench:          79 ns/iter (+/- 4)
test sample_u16_halfm1_dist       ... bench:          95 ns/iter (+/- 5)
test sample_u16_halfm1_single     ... bench:          79 ns/iter (+/- 9)
test sample_u16_halfp1_dist       ... bench:          96 ns/iter (+/- 5)
test sample_u16_halfp1_single     ... bench:          79 ns/iter (+/- 4)
test sample_u16_max_dist          ... bench:          96 ns/iter (+/- 5)
test sample_u16_max_single        ... bench:          79 ns/iter (+/- 4)
test sample_u32_4_dist            ... bench:          96 ns/iter (+/- 2)
test sample_u32_4_single          ... bench:          80 ns/iter (+/- 8)
test sample_u32_6_dist            ... bench:          96 ns/iter (+/- 1)
test sample_u32_6_single          ... bench:          79 ns/iter (+/- 3)
test sample_u32_6_single_constant ... bench:          77 ns/iter (+/- 2)
test sample_u32_half_dist         ... bench:          96 ns/iter (+/- 3)
test sample_u32_half_single       ... bench:          80 ns/iter (+/- 8)
test sample_u32_halfm1_dist       ... bench:          96 ns/iter (+/- 2)
test sample_u32_halfm1_single     ... bench:          79 ns/iter (+/- 4)
test sample_u32_halfp1_dist       ... bench:         231 ns/iter (+/- 7)
test sample_u32_halfp1_single     ... bench:         201 ns/iter (+/- 7)
test sample_u32_max_dist          ... bench:          96 ns/iter (+/- 2)
test sample_u32_max_single        ... bench:          79 ns/iter (+/- 3)
test sample_u8_4_dist             ... bench:          95 ns/iter (+/- 4)
test sample_u8_4_single           ... bench:          79 ns/iter (+/- 4)
test sample_u8_6_dist             ... bench:          95 ns/iter (+/- 3)
test sample_u8_6_single           ... bench:          79 ns/iter (+/- 3)
test sample_u8_6_single_constant  ... bench:          78 ns/iter (+/- 2)
test sample_u8_half_dist          ... bench:          95 ns/iter (+/- 6)
test sample_u8_half_single        ... bench:          79 ns/iter (+/- 4)
test sample_u8_halfm1_dist        ... bench:          95 ns/iter (+/- 4)
test sample_u8_halfm1_single      ... bench:          79 ns/iter (+/- 3)
test sample_u8_halfp1_dist        ... bench:          95 ns/iter (+/- 4)
test sample_u8_halfp1_single      ... bench:          79 ns/iter (+/- 6)
test sample_u8_max_dist           ... bench:          95 ns/iter (+/- 5)
test sample_u8_max_single         ... bench:          79 ns/iter (+/- 4)

@dhardy (Member) commented Jun 16, 2020

Looks like I'm seeing most of the costs and fewer of the gains of your suggested simplification. What's your CPU? And your change is simply commenting out the else branch of the UniformInt impl of sample_single?

At any rate, the gains are mostly less than a factor of 2, while the cost for gen_range_i64 is roughly 3x, so at most one might consider using the modulus for 32-bit values (going by these numbers).

@jongiddy (Contributor, Author) commented:

Thanks for taking a look.

> Looks like I'm seeing most of the costs and fewer of the gains of your suggested simplification.

For the new benchmarks, removing the approximation only affects sample_single for 32 bits. Focusing on your results for those tests:

no change:

test sample_u32_4_single          ... bench:         179 ns/iter (+/- 12)
test sample_u32_6_single          ... bench:         103 ns/iter (+/- 4)
test sample_u32_6_single_constant ... bench:          99 ns/iter (+/- 2)
test sample_u32_half_single       ... bench:         178 ns/iter (+/- 7)
test sample_u32_halfm1_single     ... bench:          62 ns/iter (+/- 5)
test sample_u32_halfp1_single     ... bench:         180 ns/iter (+/- 10)
test sample_u32_max_single        ... bench:          62 ns/iter (+/- 2)

with change:

test sample_u32_4_single          ... bench:          80 ns/iter (+/- 8)
test sample_u32_6_single          ... bench:          79 ns/iter (+/- 3)
test sample_u32_6_single_constant ... bench:          77 ns/iter (+/- 2)
test sample_u32_half_single       ... bench:          80 ns/iter (+/- 8)
test sample_u32_halfm1_single     ... bench:          79 ns/iter (+/- 4)
test sample_u32_halfp1_single     ... bench:         201 ns/iter (+/- 7)
test sample_u32_max_single        ... bench:          79 ns/iter (+/- 3)

I'd say you're seeing more gains than costs:

  • the summed time per iteration across these seven benchmarks drops from 863 ns to 675 ns.
  • the increasing times only go up 15%. The decreasing times go down by 43%.
  • the largest increase is 27%, the largest decrease is 55%.

Comparing the dist and single samples, there's another gain: consistency. To be able to say "sample_single is always better when you only need one sample and new + sample is always better when you need multiple samples" will make life a lot easier for users of the library.

Admittedly the gen_range tests don't look so good. That's a combination of two things:

  1. They test only one range per type, and for 32 bits that range happens to be just under 25% of the space, which most closely resembles the halfm1 behavior, in that the approximation expands the zone to something close to the optimal value.
  2. Modulo on larger types is painfully slow in Rust, though it need not be (see the benchmarks at https://github.com/AaronKutch/specialized-div-rem/blob/8101edc/README.md).
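On point 2: for 128-bit types the exact zone requires a 128-bit modulo, which rustc lowers to a software division routine (e.g. compiler-rt's __umodti3 on typical 64-bit targets) rather than a single instruction, and a per-call sample_single pays that on every draw. A minimal illustration:

```rust
/// Exact rejection zone for a 128-bit range. The `%` here compiles to a
/// call into a software division routine, not a single instruction.
fn exact_zone_u128(range: u128) -> u128 {
    let ints_to_reject = (u128::MAX - range + 1) % range; // == 2^128 % range
    u128::MAX - ints_to_reject
}

fn main() {
    // 2^128 % 7 == 4: only 4 of the 2^128 raw values must be rejected,
    // but computing that fact per call is what dominates gen_range_i128.
    assert_eq!(u128::MAX - exact_zone_u128(7), 4);
    println!("ok");
}
```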

The aim in this PR is to introduce more representative benchmarks. I'm trying to get that right before submitting any changes to the code itself.

@jongiddy (Contributor, Author) commented:

I changed the gen_range benchmarks to use (-1, 0) as initial values. Small ranges increase the number of samples that are rejected, and this decreasing-range pattern is what, for example, a Fisher-Yates shuffle produces. (I used (-1, 0) rather than (0, 1) because of the way wrapping works in the benchmark.)
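To make the Fisher-Yates connection concrete, here is a self-contained sketch (a toy xorshift RNG and a bounded draw that recomputes the exact rejection zone per call, as a modulo-based sample_single would; none of this is rand's code) in which the range shrinks by one per element, just as in the pattern this benchmark is meant to represent:

```rust
/// Toy xorshift32 RNG as a stand-in for rand's generators.
fn xorshift32(state: &mut u32) -> u32 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    x
}

/// Bounded draw in [0, range) that recomputes the exact rejection zone
/// on every call, as a per-call sample_single must.
fn gen_range_u32(state: &mut u32, range: u32) -> u32 {
    let zone = u32::MAX - (u32::MAX - range + 1) % range;
    loop {
        let wide = (xorshift32(state) as u64) * (range as u64);
        if (wide as u32) <= zone {
            return (wide >> 32) as u32;
        }
    }
}

/// Fisher-Yates: one bounded draw per element, with the range shrinking
/// from n down to 2 -- mostly small, non-power-of-two ranges.
fn shuffle(data: &mut [u32], state: &mut u32) {
    for i in (1..data.len()).rev() {
        let j = gen_range_u32(state, (i + 1) as u32) as usize;
        data.swap(i, j);
    }
}

fn main() {
    let mut state = 0xDEAD_BEEF_u32;
    let mut v: Vec<u32> = (0..10).collect();
    shuffle(&mut v, &mut state);
    let mut sorted = v.clone();
    sorted.sort();
    assert_eq!(sorted, (0..10).collect::<Vec<u32>>());
    println!("{:?}", v);
}
```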

no change:

test gen_range_i128                        ... bench:      14,096 ns/iter (+/- 942) = 1135 MB/s
test gen_range_i16                         ... bench:       3,510 ns/iter (+/- 397) = 569 MB/s
test gen_range_i32                         ... bench:       6,712 ns/iter (+/- 229) = 595 MB/s
test gen_range_i64                         ... bench:       7,373 ns/iter (+/- 565) = 1085 MB/s
test gen_range_i8                          ... bench:       3,479 ns/iter (+/- 329) = 287 MB/s

with change:

test gen_range_i128                        ... bench:     354,799 ns/iter (+/- 33,606) = 45 MB/s
test gen_range_i16                         ... bench:       3,801 ns/iter (+/- 917) = 526 MB/s
test gen_range_i32                         ... bench:       3,475 ns/iter (+/- 360) = 1151 MB/s
test gen_range_i64                         ... bench:      10,630 ns/iter (+/- 406) = 752 MB/s
test gen_range_i8                          ... bench:       3,482 ns/iter (+/- 441) = 287 MB/s

As you suggest, while this supports the case that 32-bit sometimes benefits from removing the approximation, larger bit sizes are overwhelmed by the cost of the modulo.

Interesting that removing the approximation gives a 48% increase for the original 32-bit benchmark and a 48% decrease for this benchmark.

@dhardy (Member) commented Jun 17, 2020

> Interesting that removing the approximation gives a 48% increase for the original 32-bit benchmark and a 48% decrease for this benchmark.

Yes it is. So you're proposing to replace the gen_range benchmarks with the new benchmarks? Makes sense, but at risk of spelling out the obvious: it needs at least something for 64- and 128-bit types, and the names are not very good (sample_dist could refer to any distribution; perhaps rename reject.rs to uniform_int.rs).

Probably the gen_range and distr_uniform benchmarks should be extracted from the rand_distr crate since they do not depend on code there. (That they ended up here is mostly an accident: we extracted distributions from rand core and didn't pay much attention to the benchmarks when doing so. Also they share some macros with other distribution benchmarks, though this isn't a good excuse.)

And thanks for giving this stuff some attention.

@jongiddy (Contributor, Author) commented:

Latest changes move the uniform and standard benchmarks out of rand_distr and then add my additional benchmarks on top.

I have some code changes lined up, but I will introduce them as a separate PR, since this benchmark PR could be merged now, but the code changes will affect the value stability of the outputs.

@vks (Collaborator) left a review:

Looks good to me!

@dhardy (Member) commented Jul 19, 2020

Thank you. Then I shall go ahead and merge.

@dhardy merged commit 39a37f0 into rust-random:master on Jul 19, 2020.
@vks mentioned this pull request on May 7, 2021.