
Improve Random (performance, APIs, ...) #47085

Merged — 5 commits merged into dotnet:master from betterrandom on Jan 21, 2021

Conversation

Member

@stephentoub commented Jan 17, 2021

  • Changes the algorithm used by new Random() to be one that's smaller and faster
  • Refactors the implementation to make it internally pluggable, and lets us more easily iterate on the algorithm used by new Random() in the future
  • Moves the existing implementation to be used when it's necessary for back compat
  • Adds NextInt64 and NextSingle methods

Fixes #26741
Fixes #23198
Fixes #41457
cc: @tannergooding, @danmosemsft, @colgreen

Based on looking at lots of Random usage, new Random() is by far the most common way Random is used in real apps. A seed is generally only specified when either a) it's test code and a repeatable sequence is desired, or b) someone is trying to work around the poor default seed quality we had in .NET Framework, where the seed was based on Environment.TickCount and thus lots of Randoms created in the same quantum would end up with the same seed. There's also a smattering of derived types, often added for thread safety, with each override locking around a call to the base or some such thing. As such, I focused on optimizing new Random() (where no guarantees are made about the sequence or what methods may or may not delegate to other methods) and moved the existing implementation to be used only if a seed is provided or if a derived type is being instantiated.
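
To make that split concrete, here is a minimal sketch of what such an internally pluggable design can look like. All names here (PluggableRandom, FastImpl, LegacyImpl) are illustrative, not the PR's actual internal types:

public class PluggableRandom
{
    private readonly ImplBase _impl;

    // Unseeded, non-derived instances get the new, faster algorithm; derived
    // types keep the legacy sequence (and its delegation behavior) for back compat.
    public PluggableRandom() =>
        _impl = GetType() == typeof(PluggableRandom)
            ? new FastImpl()
            : (ImplBase)new LegacyImpl();

    // A user-supplied seed also routes to the legacy, repeatable sequence.
    public PluggableRandom(int seed) => _impl = new LegacyImpl(seed);

    public virtual int Next() => _impl.Next();
    // ... NextInt64, NextSingle, NextDouble, NextBytes, etc. forward the same way.

    private abstract class ImplBase
    {
        public abstract int Next();
    }

    private sealed class FastImpl : ImplBase // e.g. the new algorithm
    {
        public override int Next() => 0; // placeholder
    }

    private sealed class LegacyImpl : ImplBase // the existing generator
    {
        public LegacyImpl() { }
        public LegacyImpl(int seed) { /* seed the legacy state */ }
        public override int Next() => 0; // placeholder
    }
}

The new NextInt64 and NextSingle APIs then slot into the same forwarding pattern, so each implementation can supply the most efficient version available to it.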

| Method | Toolchain | Mean | Error | StdDev | Ratio | Allocated |
|---------------|-----------|-------------:|-----------:|-----------:|------:|----------:|
| Ctor | \master | 1,475.427 ns | 13.3713 ns | 25.7619 ns | 1.00 | 280 B |
| Ctor | \pr | 141.151 ns | 1.5279 ns | 1.4292 ns | 0.09 | 72 B |
| Next | \master | 7.604 ns | 0.0361 ns | 0.0302 ns | 1.00 | - |
| Next | \pr | 3.252 ns | 0.0405 ns | 0.0339 ns | 0.43 | - |
| NextInt | \master | 9.643 ns | 0.0452 ns | 0.0377 ns | 1.00 | - |
| NextInt | \pr | 10.746 ns | 0.0511 ns | 0.0478 ns | 1.11 | - |
| NextIntInt | \master | 9.910 ns | 0.1180 ns | 0.1046 ns | 1.00 | - |
| NextIntInt | \pr | 10.298 ns | 0.2229 ns | 0.2085 ns | 1.04 | - |
| NextBytes | \master | 7,212.814 ns | 17.1527 ns | 15.2054 ns | 1.00 | - |
| NextBytes | \pr | 292.819 ns | 0.9567 ns | 0.7989 ns | 0.04 | - |
| NextBytesSpan | \master | 7,784.761 ns | 28.5982 ns | 25.3515 ns | 1.00 | - |
| NextBytesSpan | \pr | 296.508 ns | 2.8035 ns | 2.4852 ns | 0.04 | - |
| NextDouble | \master | 8.260 ns | 0.0251 ns | 0.0235 ns | 1.00 | - |
| NextDouble | \pr | 3.487 ns | 0.0244 ns | 0.0216 ns | 0.42 | - |

A few open issues:

  • I want to add more tests; I added some, but our coverage of Random today isn't great.
  • I need to run perf tests on Linux (the above is Windows) and on 32-bit (the above is 64-bit).
  • There's probably a better way to implement NextInt64 in the legacy generator, but I don't know what: suggestions welcome.
  • @colgreen, after working on this I looked at your repo and see you've invested a lot of time and thought into a lot of this stuff. I'd welcome either a) your thoughts on improvements to make in this PR, or b) a follow-up PR from you adding improvements on top of this. Also, I ended up using a similar strategy to the one you used with Log2, but I expect that's contributing to the slight regression in those overloads; it'd be good to think of possible alternatives, though given the wins elsewhere, I'm not terribly concerned.

Related to #46890
cc: @tarekgh, @noahfalk

@dotnet-issue-labeler

Note regarding the new-api-needs-documentation label:

This serves as a reminder that when your PR modifies a ref *.cs file and adds/modifies public APIs, please make sure the API implementation in the src *.cs file is documented with triple-slash comments, so the PR reviewers can sign off on that change.

@danmoseley
Member

Would it be clearer to break out the two impls into their own files?

@danmoseley
Member

With respect to the algorithm, it looks like we didn't really do a "survey" of the PRNG options (as we did, e.g., for hashing algorithms a while ago). Since we aren't implementing a seeded API here, we can always change the algorithm later, so perhaps that doesn't matter very much, as long as we're reasonably confident it's not "worse"; folks seem to mostly want more performance and better randomness.

Looking around at other stacks and utility libraries like Boost, there doesn't seem to be convergence on anything, and where there is, it's not clear whether it's particularly fast. (Or their limitations are accepted; e.g., the java.util.Random.nextLong documentation says "this algorithm will not return all possible long values".) I was a bit perturbed by the link in the original issue to a criticism of xoshiro256** by another PRNG purveyor, but on her site the discussion of it seems to conclude its flaws are unlikely to be easily discoverable. So I guess that algorithm seems good enough for me.

@colgreen
Contributor

With respect to algorithm, it looks like we didn't really do a "survey" of the PRNG options (as we did eg for hashing algorithms a while ago).

There are some standard PRNG test frameworks that run their own 'battery of tests' against lots of PRNGs, and which publish results, e.g.

https://github.com/lemire/testingRNG
https://rurban.github.io/dieharder/QUALITY.html

and possibly (related to hash functions, but a PRNG is essentially serial application of a hash function):

https://github.com/rurban/smhasher

I've been suggesting folks use xoshiro256** for a while now, but based on those results I do wonder if wyrand would be a better choice. I don't think there's a problem with using xoshiro256**, to be honest (the tests are quite exhaustive!), but since wyrand is available, appears to be faster, and passes more of those tests, maybe it's worth a little effort to look into it further.

@danmoseley
Member

xoshiro256** implementation is super simple so unless we discover a significant flaw I think we should go ahead and use it and lock in the sweet perf gain. Then we can always change later.

@colgreen
Contributor

xoshiro256** implementation is super simple so unless we discover a significant flaw I think we should go ahead and use it and lock in the sweet perf gain. Then we can always change later.

I am cool with this.
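
For reference, the xoshiro256** core really is tiny. Here's a C# transliteration of the public-domain reference algorithm by Blackman and Vigna, shown purely as an illustration (the PR's exact code may differ):

using System.Numerics;

public sealed class Xoshiro256StarStar
{
    // Four 64-bit words of state; must be seeded to anything but all zeros.
    private ulong _s0, _s1, _s2, _s3;

    public Xoshiro256StarStar(ulong s0, ulong s1, ulong s2, ulong s3)
        => (_s0, _s1, _s2, _s3) = (s0, s1, s2, s3);

    public ulong NextUInt64()
    {
        ulong result = BitOperations.RotateLeft(_s1 * 5, 7) * 9;
        ulong t = _s1 << 17;

        _s2 ^= _s0;
        _s3 ^= _s1;
        _s1 ^= _s2;
        _s0 ^= _s3;
        _s2 ^= t;
        _s3 = BitOperations.RotateLeft(_s3, 45);

        return result;
    }
}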

@stephentoub force-pushed the betterrandom branch 2 times, most recently from 76d70ae to 5b69e72, on January 18, 2021 03:51
@adamsitnik added the tenet-performance (Performance related issue) label Jan 18, 2021
@colgreen
Contributor

colgreen commented Jan 18, 2021

I believe the rejection sampling loops aren't quite right. Or rather, they can be tightened up to run fewer iterations.

The issue relates to the calculation of bits:

int bits = BitOperations.Log2((uint)maxValue) + 1;

As an example, let's take maxValue = 8. We want to sample from the set {0, 1, 2, 3, 4, 5, 6, 7}, noting that maxValue itself is an exclusive bound, so we have maxValue possible discrete samples.

As 8 is a power of two (2^3 == 8), we ought to be generating 3 bits and just returning without having to loop a second time. However, the bits expression above gives a result of 4. I.e. Log2(8) = 3, and then we add one.

I believe the correct/better way is to use a method such as:

/// <summary>
/// Evaluate the binary logarithm of a non-zero UInt32, with rounding up of fractional results.
/// I.e. returns the exponent of the smallest power of two that is greater than or equal to the specified number.
/// </summary>
/// <param name="x">The input value.</param>
/// <returns>The exponent of the smallest integral power of two that is greater than or equal to x.</returns>
public static int Log2Ceiling(uint x)
{
    // Log2(x) gives the required power of two, however this is integer Log2() therefore any fractional
    // part in the result is truncated, i.e., the result may be 1 too low. To compensate we add 1 if x
    // is not an exact power of two.
    int exp = BitOperations.Log2(x);

    // Return (exp + 1) if x is non-zero, and not an exact power of two.
    return BitOperations.PopCount(x) > 1 ? exp + 1 : exp;
}

(from https://github.com/colgreen/Redzen/blob/master/Redzen/MathUtils.cs ... and recently fixed, so just be aware if you have an out-of-date local copy)

Now you can do:

int bits = Log2Ceiling((uint)maxValue);

which more than halves the average number of rejection-sampling iterations when maxValue is a power of two.
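
Putting the two pieces together, here's a hypothetical sketch of the tightened rejection-sampling loop; nextUInt32 is a stand-in delegate for the generator's raw 32-bit output, not an actual API:

using System;

static uint SampleBounded(Func<uint> nextUInt32, uint maxValue) // maxValue > 0, exclusive
{
    int bits = Log2Ceiling(maxValue);
    uint mask = bits == 32 ? uint.MaxValue : (1u << bits) - 1;

    while (true)
    {
        // Uniform over [0, 2^bits); when maxValue is a power of two the
        // candidate is always in range, so the loop runs exactly once.
        uint candidate = nextUInt32() & mask;
        if (candidate < maxValue)
            return candidate;
    }
}

With the old bits = Log2(maxValue) + 1, a power-of-two maxValue accepted only half of the candidates, for an expected two iterations per sample.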

@stephentoub
Member Author

I believe the rejection sampling loops aren't quite right. Or rather, they can be tightened up to run fewer iterations.

Specifically for powers of two, yes, the bound could be tighter.

@GrabYourPitchforks
Member

@stephentoub @tannergooding How does this PR interact with #23198 and #41457? Does it resolve those issues?

@stephentoub
Member Author

From my perspective it resolves both. (With the caveat that those issues still exist for a type derived from Random or constructing a Random with a specific seed.)

@tarekgh mentioned this pull request Jan 19, 2021
@stephentoub
Member Author

Tweaked things a bit further and added a 32-bit implementation. It’s now looking pretty reasonable on Windows. Still need to measure on Linux.

64-bit:

| Method | Toolchain | Mean | Ratio | Allocated |
|---------------|---------------------|-------------:|------:|----------:|
| Ctor | master\corerun.exe | 1,454.062 ns | 1.00 | 280 B |
| Ctor | pr\corerun.exe | 112.203 ns | 0.08 | 72 B |
| Next | master\corerun.exe | 7.554 ns | 1.00 | - |
| Next | pr\corerun.exe | 2.336 ns | 0.31 | - |
| NextInt | master\corerun.exe | 9.503 ns | 1.00 | - |
| NextInt | pr\corerun.exe | 8.895 ns | 0.94 | - |
| NextIntInt | master\corerun.exe | 9.924 ns | 1.00 | - |
| NextIntInt | pr\corerun.exe | 9.197 ns | 0.93 | - |
| NextBytes | master\corerun.exe | 7,181.847 ns | 1.00 | - |
| NextBytes | pr\corerun.exe | 259.274 ns | 0.04 | - |
| NextBytesSpan | master\corerun.exe | 7,792.237 ns | 1.00 | - |
| NextBytesSpan | pr\corerun.exe | 257.490 ns | 0.03 | - |
| NextDouble | master\corerun.exe | 8.372 ns | 1.00 | - |
| NextDouble | pr\corerun.exe | 2.664 ns | 0.32 | - |

32-bit:

| Method | Toolchain | Mean | Ratio | Allocated |
|---------------|-------------------------|-------------:|------:|----------:|
| Ctor | master_x86\corerun.exe | 1,680.583 ns | 1.00 | 256 B |
| Ctor | pr_x86\corerun.exe | 259.524 ns | 0.15 | 36 B |
| Next | master_x86\corerun.exe | 8.609 ns | 1.00 | - |
| Next | pr_x86\corerun.exe | 4.692 ns | 0.55 | - |
| NextInt | master_x86\corerun.exe | 11.472 ns | 1.00 | - |
| NextInt | pr_x86\corerun.exe | 10.833 ns | 0.94 | - |
| NextIntInt | master_x86\corerun.exe | 14.647 ns | 1.00 | - |
| NextIntInt | pr_x86\corerun.exe | 13.646 ns | 0.93 | - |
| NextBytes | master_x86\corerun.exe | 7,952.158 ns | 1.00 | - |
| NextBytes | pr_x86\corerun.exe | 739.291 ns | 0.09 | - |
| NextBytesSpan | master_x86\corerun.exe | 8,522.058 ns | 1.00 | - |
| NextBytesSpan | pr_x86\corerun.exe | 731.178 ns | 0.09 | - |
| NextDouble | master_x86\corerun.exe | 9.083 ns | 1.00 | - |
| NextDouble | pr_x86\corerun.exe | 7.972 ns | 0.88 | - |
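
For the 32-bit case, a natural counterpart is xoshiro128**, which mirrors the 64-bit core with 32-bit state words. (That the PR's 32-bit implementation is xoshiro128**-based is my assumption; the sketch below is the public-domain reference algorithm, not necessarily the PR's code.)

using System.Numerics;

public sealed class Xoshiro128StarStar
{
    // Four 32-bit words of state; must be seeded to anything but all zeros.
    private uint _s0, _s1, _s2, _s3;

    public Xoshiro128StarStar(uint s0, uint s1, uint s2, uint s3)
        => (_s0, _s1, _s2, _s3) = (s0, s1, s2, s3);

    public uint NextUInt32()
    {
        uint result = BitOperations.RotateLeft(_s1 * 5, 7) * 9;
        uint t = _s1 << 9;

        _s2 ^= _s0;
        _s3 ^= _s1;
        _s1 ^= _s2;
        _s0 ^= _s3;
        _s2 ^= t;
        _s3 = BitOperations.RotateLeft(_s3, 11);

        return result;
    }
}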

- Changes the algorithm used by `new Random()` to be one that's smaller and faster and produces better results
- Refactors the implementation to make it internally pluggable
- Moves the existing implementation to be used when it's necessary for back compat
- Adds NextInt64 and NextSingle methods
@stephentoub
Member Author

Linux 64-bit:

| Method | Toolchain | Mean | Ratio | Allocated |
|---------------|------------------|-------------:|------:|----------:|
| Ctor | /master/corerun | 1,510.009 ns | 1.00 | 280 B |
| Ctor | /pr/corerun | 783.730 ns | 0.52 | 72 B |
| Next | /master/corerun | 7.142 ns | 1.00 | - |
| Next | /pr/corerun | 2.912 ns | 0.41 | - |
| NextInt | /master/corerun | 9.890 ns | 1.00 | - |
| NextInt | /pr/corerun | 9.859 ns | 1.00 | - |
| NextIntInt | /master/corerun | 10.476 ns | 1.00 | - |
| NextIntInt | /pr/corerun | 9.471 ns | 0.90 | - |
| NextBytes | /master/corerun | 7,045.031 ns | 1.00 | - |
| NextBytes | /pr/corerun | 260.487 ns | 0.04 | - |
| NextBytesSpan | /master/corerun | 7,288.855 ns | 1.00 | - |
| NextBytesSpan | /pr/corerun | 263.415 ns | 0.04 | - |
| NextDouble | /master/corerun | 7.806 ns | 1.00 | - |
| NextDouble | /pr/corerun | 2.791 ns | 0.36 | - |

@danmoseley
Member

Just curious: why are the array-filler overloads so much faster on 32-bit vs. the old impl, when the Next()-like methods are slower? I don't see it.

| Method | Toolchain | Mean | Ratio | Allocated |
|---------------|-------------------------|-------------:|------:|----------:|
| Next | master_x86\corerun.exe | 8.975 ns | 1.00 | - |
| Next | pr_x86\corerun.exe | 13.833 ns | 1.54 | - |
| NextBytes | master_x86\corerun.exe | 7,942.055 ns | 1.00 | - |
| NextBytes | pr_x86\corerun.exe | 1,599.183 ns | 0.20 | - |
| NextBytesSpan | master_x86\corerun.exe | 8,523.159 ns | 1.00 | - |
| NextBytesSpan | pr_x86\corerun.exe | 1,604.615 ns | 0.19 | - |

@stephentoub
Member Author

Just curious: why are the array-filler overloads so much faster on 32-bit vs. the old impl, when the Next()-like methods are slower? I don't see it.

Because they previously were calling Next per byte and now aren't.
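
To illustrate the difference, here's a hedged sketch (with a stand-in nextUInt64 delegate, not the PR's actual code) of filling eight bytes per 64-bit sample rather than paying a virtual Next() call per byte:

using System;
using System.Buffers.Binary;

static void FillBytes(Func<ulong> nextUInt64, Span<byte> buffer)
{
    // Write 8 random bytes per sample while a full ulong still fits.
    while (buffer.Length >= sizeof(ulong))
    {
        BinaryPrimitives.WriteUInt64LittleEndian(buffer, nextUInt64());
        buffer = buffer.Slice(sizeof(ulong));
    }

    // Spread one final sample across the remaining 1-7 bytes, if any.
    if (!buffer.IsEmpty)
    {
        ulong remaining = nextUInt64();
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (byte)remaining;
            remaining >>= 8;
        }
    }
}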

@danmoseley
Member

Because they previously were calling Next per byte and now aren't.

https://github.com/dotnet/runtime/pull/47085/files#diff-92b19116b63bd5daf93516a0908297653d54d08263b754c8a1bbb2930e14db1bR167
I was looking at this - I must be spacing out.

@danmoseley
Member

I have no more feedback; LGTM, but I haven't gone through everything so I can't sign off at this moment.

@stephentoub merged commit 5ee25aa into dotnet:master on Jan 21, 2021
@stephentoub deleted the betterrandom branch on January 21, 2021 01:27
@GSPP

GSPP commented Jan 26, 2021

An XorShift-style algorithm for Random. Awesome!

For some reason, it is not widely known that really good algorithms for random numbers exist. This family of algorithms is of high quality and super fast. Great that this is finding its way into .NET.

As far as I understand, things like the Mersenne twister and other popular algorithms are strongly inferior.

For reference, I'm linking my thoughts on this: #6203 (comment)

@colgreen
Contributor

colgreen commented Jan 26, 2021

As far as I understand, things like the Mersenne twister and other popular algorithms are strongly inferior.

I think so, yes. Also, the original xor-shift PRNG as devised by George Marsaglia was a good fast PRNG for its time (early 2000s?), but these new 'xoshiro' derivatives from it pass more statistical tests while maintaining good performance.

There might be an argument in the future for switching again, to a wyrand PRNG (or a derivative, if one emerges) or something else, but I think using xoshiro is the right decision today; it gives a nice performance boost and much better quality of randomness, i.e., it fixes all of the raised issues with the old System.Random, to the best of my knowledge.

Ultimately, if you want really good random noise you flip to the slower crypto random generators, of course, so there's a kind of spectrum of PRNGs and trade-offs, and this work sits at a sweet spot, I think, between performance and RNG quality.
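
For completeness, the strong-but-slower end of that spectrum already exists in the BCL; for example:

using System;
using System.Security.Cryptography;

// Cryptographically strong randomness, for when quality trumps speed.
Span<byte> key = stackalloc byte[32];
RandomNumberGenerator.Fill(key);

// A uniform integer in [1, 7), i.e. a fair die roll.
int roll = RandomNumberGenerator.GetInt32(1, 7);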

@ghost ghost locked as resolved and limited conversation to collaborators Feb 25, 2021