[RFC] changing the Random PRNG to a splittable algorithm from PRINGO #28
I don't know if that's the proper way to comment on a RFC, but here we go.
- If we keep a single global mutable generator state, it needs to be protected by a lock, which makes the PRNG a concurrency bottleneck.
Lock-free implementations are possible for some PRNGs. For instance, SplitMix needs only a 64-bit fetch-and-add, because the mutable internal state is just a 64-bit counter.
This is a good point, but (1) it constrains the algorithm design, and (2) atomic operations would still be measurably slower than purely local computations on a PRNG-bound program.
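A minimal sketch of the fetch-and-add idea from this thread, assuming a 64-bit platform and OCaml 4.12 or later for `Atomic`. The names `gamma`, `counter`, `mix64` and `next_int64` are hypothetical. The counter is OCaml's 63-bit `int` rather than a true 64-bit word, and `gamma` is an arbitrary odd constant chosen to fit in it, so this is not Pringo's SplitMix:

```ocaml
(* Sketch only: a shared SplitMix-style generator whose whole mutable state
   is one atomic counter. Assumption: 63-bit OCaml ints (64-bit platform). *)

(* Hypothetical odd increment that fits in a 63-bit OCaml int. *)
let gamma = 0x1E3779B97F4A7C15

let counter : int Atomic.t = Atomic.make 0

(* Output mixing in the style of SplitMix64's finalizer. *)
let mix64 (z : int64) : int64 =
  let open Int64 in
  let z = mul (logxor z (shift_right_logical z 30)) 0xBF58476D1CE4E5B9L in
  let z = mul (logxor z (shift_right_logical z 27)) 0x94D049BB133111EBL in
  logxor z (shift_right_logical z 31)

(* One draw is one fetch-and-add plus purely local mixing; no lock is taken. *)
let next_int64 () : int64 =
  let old = Atomic.fetch_and_add counter gamma in
  mix64 (Int64.of_int (old + gamma))
```

Every draw still goes through one atomic operation on shared memory, which is the cost the reply above refers to.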
| SplitMix | int64 | slightly faster | no (C stubs) | no | after 2^30 draws |
| ChaCha | int32 | slightly slower | no (C stubs) | yes (weak crypto) | after 2^64 draws |

To give a sense of the performance difference, on my machine, on a micro-benchmark drawing numbers in a tight loop (`make benchmarks` from the pringo directory):
The relative performance varies a lot depending on (1) 64-bit vs. 32-bit processor, and (2) the size of the numbers drawn (schematically: 64-bit integers benefit SplitMix but bytes benefit ChaCha).
- drawing `int64` takes 55% of the Random time with SplitMix and 93% of the Random time with ChaCha, and
- drawing `float` values in [0;1] takes 66% of the Random time with SplitMix, and 120% of the Random time with ChaCha.

Those differences are unlikely to be noticeable, as drawing random numbers is a negligible part of most programs. Some very specific PRNG-bound algorithms (Monte-Carlo simulation, etc.) probably use custom PRNGs anyway.
I wouldn't be so sure. QuickCheck-style random property testing can use a lot of pseudo-random numbers. I'd like to hear from Monte-Carlo specialists re: custom PRNGs.
Excellent point! I can try to benchmark QCheck with an alternative PRNG. My guess is that the testing time is dominated by the actual check in user code (running the test), so this would not be noticeable in practice.
I don't know if we have people resembling Monte-Carlo specialists, but my bet would be on @Octachron, with @jhjourdan as the informed hobbyist.
I can try to benchmark QCheck with an alternative PRNG
Or just profile to know how much time is spent in the stdlib PRNG. Then we can extrapolate from there.
As one datapoint, Stan (whose one job is Hamiltonian MCMC[0]) uses boost::ecuyer1988 ([1], also known as CombLec88 in e.g. [2]), using its ability to skip ahead fast to generate several independent streams ([3] explains why they chose this approach). Unfortunately I have no measure of how much of a typical Stan run the RNG takes (though I suspect little, as Stan spends a lot of time computing likelihoods and derivatives).
[0] sorry if this is not the kind of Monte-Carlo you're thinking of
[1] https://www.boost.org/doc/libs/1_77_0/doc/html/boost_random/reference.html#boost_random.reference.generators
[2] http://videos.rennes.inria.fr/seminaire_Irisa/lecuyer/rngx.pdf
[3] stan-dev/stan#3028 (comment)
I would expect Markov Chain Monte-Carlo simulations to tune their PRNG, since they need to generate a lot of samples from a simple distribution to achieve thermalization. I will try to dig up some of my sampling experience from my PhD to do more performance analysis (but from memory, using an efficient algorithm for sampling non-uniform distributions had a larger impact than the PRNG itself).
Random currently has a pure-OCaml implementation (except for the system-specific auto-seeding logic). Moving to C stubs may be an issue, at least for Mirage users, and require extra work from alternative implementations (js_of_ocaml, BuckleScript, etc.).

SplitMix is a very simple algorithm, it is easy to provide a pure-OCaml version that should perform well. Porting Chacha, which is more elaborate, is more work, but it's also harder to predict performance for the pure-OCaml version.
ChaCha reimplemented in OCaml will be awful. The C implementation of parts of SplitMix is motivated by having decent performance with ocamlopt on 32-bit machines and with ocamlc (bytecode interpretation). I agree that an OCaml implementation of SplitMix should be quite fast with ocamlopt on a 64-bit machine.
I thought that you didn't care about 32-bit CPUs anymore? (If we keep x86-32 for cross-compilation we don't care about PRNG performance; riscv32 remains, and in general embedded targets.)
We could consider trying to have both a pure-OCaml and a C-stubs implementation, and choose between the two at configure-time. (Or by matching on Sys constants, but that would be less comfortable to Mirage people and other pure-OCaml constraints.)
@xavierleroy as a non-expert, I'm curious as to how you can tell that pure-OCaml ChaCha performance would be bad. The current implementation uses 15 mutable (stack-allocated) `uint32_t` variables; using `Int32.t ref` would introduce a lot of boxing.
But couldn't we use another approach, such as:
- unrolling the loop and using only (immutable) temporaries, and letting the register allocator sort it out,
- or using an `int32` bigarray stored in the generator state, which hopefully would let us read/write without boxing?
I think an OCaml implementation of ChaCha would look pretty much like the OCaml implementation of MD5 that we use as a test (test/lib-digest/md5.ml) and that I used as a benchmark back in the days, and performance was not great compared with a pure C implementation. The boxing of int32's in the state doesn't help. The fact that ocamlopt doesn't recognize "rotate" instructions doesn't either.
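To make the "immutable temporaries" and "rotate" points in this thread concrete, here is a hedged sketch of the ChaCha quarter round (as specified in RFC 7539) in pure OCaml. The names `rotl32` and `quarter_round` are hypothetical and this is not Pringo's code; it only illustrates the shape such an implementation might take, with every intermediate value a boxed-by-default `int32`:

```ocaml
(* 32-bit left rotation built from two shifts and an or. This is the pattern
   the discussion says ocamlopt does not compile to a single rotate. *)
let rotl32 (x : int32) (n : int) : int32 =
  Int32.logor (Int32.shift_left x n) (Int32.shift_right_logical x (32 - n))

(* ChaCha quarter round on four 32-bit words, written with immutable
   temporaries (shadowing) instead of [Int32.t ref] cells. *)
let quarter_round (a, b, c, d) =
  let a = Int32.add a b in
  let d = rotl32 (Int32.logxor d a) 16 in
  let c = Int32.add c d in
  let b = rotl32 (Int32.logxor b c) 12 in
  let a = Int32.add a b in
  let d = rotl32 (Int32.logxor d a) 8 in
  let c = Int32.add c d in
  let b = rotl32 (Int32.logxor b c) 7 in
  (a, b, c, d)
```

Whether the compiler keeps these temporaries unboxed in registers is exactly the open question raised above; the sketch makes no claim about the resulting performance.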
rfcs/splittable-prng.md
Outdated
- Having a crypto-secure PRNG is not part of our requirement -- Random is not secure in that sense.

- Having to reseed SplitMix every 2^30 draws may be a problem in practice -- loops running 2^30 iterations are perfectly reasonable these days. All other things being equal, ChaCha should be preferred.
Again it's 2^32. "Having to reseed" may be too strong: it is recommended to reseed after 2^32 draws. Basically, with an N-bit counter as internal state, some statistical deviations are to be expected after sqrt(2^N) numbers have been generated, because of the birthday paradox and all that. But again I don't have a reference handy.
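As a rough worked instance of the birthday-bound heuristic in this comment (an order-of-magnitude estimate, not a precise statement about SplitMix), taking the counter width to be $N = 64$ bits:

$$
\sqrt{2^{N}} = 2^{N/2}, \qquad N = 64 \;\Longrightarrow\; 2^{32} \approx 4.3 \times 10^{9} \text{ draws}.
$$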
(spotted by Xavier Leroy)
Thanks @xavierleroy! I wonder if you have an "overall opinion" of the RFC, independently of the details: is it worth trying to change Random in this way? |
This is slightly tangential to the RFC but I'd just like to mention this in case it could also influence a final design. I'm just wondering about whether it is really a good idea to "save" the global The current situation is not that great and I think it would be better to make people move away from it. Especially for libraries where you might want to control their randomness globally and make sure these don't In general when I need random numbers in the function of an API I use: val f : ?r:Random.State.t -> ... and default to There are two problems here:
Basically what I would like here is something like: Random.State.default : ensure_seeded:bool -> Random.State.t That has a global state value (this could return the splitted version of the domain on multicore) and |
Yes, but I don't know when. The main reason I developed the Pringo library was to pave the way to a reimplementation of the Random module in the standard library. Splittable PRNGs are good for concurrency but also for generating random functions and random streams for Quickcheck-style property testing, if I remember correctly. Now, I would like to read more about the state of the art in PRNGs, and perhaps to experiment with a third algorithm |
Actually, I did test xoshiro in Pringo: https://github.com/xavierleroy/pringo/tree/xoshiro It has a very convincing approach to splitting, and a large state space making reseeding unnecessary. On the negative side, it's a bit slower than both SplitMix and ChaCha for drawing integers, and much slower for splitting. |
I had a discussion with @xavierleroy today and he proposed the following three requirements:
Here is some data about the PRNG profile with QCheck, the main Quickcheck-style library for OCaml. I wrote a dumb micro-benchmark:

    let test =
      QCheck.Test.make ~count:100_000 ~name:"trivial test"
        QCheck.(list small_nat)
        (fun n -> true)

    let () =
      ignore (QCheck_runner.run_tests [test]);;

which generates lists of integers ( Then running
My understanding is that, summing the "Self" column, this means that about 35% of the program runtime is spent in the PRNG. Note: this result is obviously sensitive to the complexity of the test (with our trivial case, we are checking the worst case) but also to the complexity of the random generator: some generators may draw many more random inputs per test, and therefore spend much less time in the QCheck scaffolding. I have some "real-world" generators lying around, and may give them a try.
In my real-world use-case of QCheck, the time spent in the PRNG is negligible in normal operation (0.03% according to
So about 4.5% in total. My understanding is that a generator for a more complex data structure spends more time checking validity of the structure, and thus less time in the PRNG. |
For such a trivial test, this is a reasonable figure. I note that QCheck uses |
And by the way: nice demangling of ocamlopt-generated symbols in the output of |
QCheck uses

    (* natural number generator *)
    let nat st =
      let p = RS.float st 1. in
      if p < 0.5 then RS.int st 10
      else if p < 0.75 then RS.int st 100
      else if p < 0.95 then RS.int st 1_000
      else RS.int st 10_000

(This could easily be rewritten without floats if performance mattered.)
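One possible float-free rewrite, as a sketch only: it assumes `RS` is an alias for `Random.State` (as the QCheck snippet suggests) and replaces the float draw by one integer draw in [0, 100) with thresholds 50, 75 and 95. This is an illustration, not QCheck's actual code:

```ocaml
module RS = Random.State

(* Same four size classes as the float-based generator, selected with a
   single integer draw in [0, 100) instead of a float in [0, 1]. *)
let nat (st : RS.t) : int =
  let p = RS.int st 100 in
  if p < 50 then RS.int st 10
  else if p < 75 then RS.int st 100
  else if p < 95 then RS.int st 1_000
  else RS.int st 10_000
```

The class probabilities (50%, 25%, 20%, 5%) match the float version up to the granularity of the integer draw.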
Oh, no! It's a perfectly legitimate use of a PRNG (to draw probabilities between 0 and 1). The PRNG should fit its uses, not the other way around. Splitmix and the Xoshiro family have an advantage here, since their primitive operation is to draw random 64-bit integers, while Chacha and other PRNGs based on cryptographic ciphers are oriented towards drawing random bytes. With a bit of exaggeration, PRNG uses tend to gravitate towards one of two uses: (1) draw probabilities between 0 and 1, and (2) fill N bytes with noise. Since use (2) often requires cryptographic quality (think nonces and session keys), it makes sense to favor (1). |
That just occurred to me. Can you witness the effect of the non-uniform representation of floating point numbers when you do this? |
The short answer is "no, things were done carefully". Pringo's float generator (on 64 bits) draws a random 53-bit integer, converts it to floating-point and then multiplies it by 2^{-53} to get a float in [0;1]. (53 bits is the precision of a double's significand/mantissa, so that conversion is lossless.) You really get the (double) floating-point approximation of a uniform [0;1] real. |
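The method described in this comment can be sketched as follows. The function names `unit_float_of_bits` and `uniform` are hypothetical, and using the top 53 bits of a 64-bit draw (via `Random.State.bits64`, available since OCaml 4.14) is an assumption about how the 53-bit integer is obtained, not a description of Pringo's source:

```ocaml
(* Map 64 random bits to a float: keep the top 53 bits, convert them exactly
   to a double, and scale by 2^-53. *)
let unit_float_of_bits (bits : int64) : float =
  let top53 = Int64.shift_right_logical bits 11 in
  Int64.to_float top53 *. 0x1p-53

(* Example source of bits: the stdlib generator. *)
let uniform (st : Random.State.t) : float =
  unit_float_of_bits (Random.State.bits64 st)
```

With this construction the largest possible result is 1 - 2^{-53}, so 1.0 itself is never produced, while 0.0 is.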
Actually, I'm not sure the Pringo method is the best possible. It gives you 2^53 possible results, evenly spaced between 0.0 and 1.0, and evenly distributed. But many FP values between 0.0 and 1.0 are never returned, such as 1.0 itself and a bunch of values close to 0.0. If we take a 64-bit random integer, convert it to FP and multiply by 2^{-64}, we would get some of these other FP values (but still not all), including 1.0 thanks to rounding, but with a bit of a bias because of rounding. (1.0 would occur half as often as its FP predecessor, if I'm not mistaken.) Off the top of my head, I don't know how to write a PRNG that samples ALL the FP values between 0.0 and 1.0 with the correct frequencies. But I'm not sure we care. I'll do a literature search soon, but pointers to the literature are welcome. |
If you open a more specific issue on Pringo, we can ping the usual float experts. I'm sure at least half of them also have a knack for PRNGs, somehow these vices are correlated. |
As promised, here are some references on generating pseudo-random FP numbers:
The bottom line is that 1) the algorithm used in PRINGO is not bad and is used in many standard libraries, 2) it might be worth exposing the core function that returns a pseudo-random FP number between 0.0 and 1.0 and document how it works and that it never returns 1.0. |
FTR, when I use a PRNG to sample values between 0 and 1, I would expect it to never return 0 nor 1. The reason is that we usually transform these sampled values to get other distributions, often in ]-oo, +oo[ or ]0, +oo[, and 0 and 1 usually correspond to infinite values in the transformed space. For example, the typical method to get an exponentially distributed real is by taking the opposite of the logarithm of a uniform [0,1] variable. If such a uniform variable is allowed to be 0, then the exponentially distributed value can be +oo, which can lead to problems. This is for example the case in statmemprof. |
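A small sketch of the transformation mentioned in this comment, assuming the stdlib `Random.State.float` as the uniform source and using rejection to exclude the endpoints. The names `uniform_open` and `exponential` are hypothetical, and statmemprof may well do this differently:

```ocaml
(* Draw a uniform float strictly inside (0, 1) by rejecting the endpoints. *)
let rec uniform_open (st : Random.State.t) : float =
  let u = Random.State.float st 1.0 in
  if u > 0.0 && u < 1.0 then u else uniform_open st

(* Inverse-transform sampling of an exponential distribution with the given
   rate: the opposite of the logarithm of a uniform variable, scaled. *)
let exponential (st : Random.State.t) ~(rate : float) : float =
  -. (log (uniform_open st)) /. rate
```

Without the rejection step, a uniform draw of exactly 0.0 would give `-. log 0.0`, which is `infinity`.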
The bias induced by forbidding 0 and 1 is in the order of 2^{-53}. Hence one would need about 2^100 samples to observe it. Seems ok to me. |
After more thoughts and experiments, I suggest to go with a Xoshiro-based PRNG. See proposed implementation here: ocaml/ocaml#10701 |
I'm happy to go with Xoshiro as long as you can build a consensus (with yourself :-) and we can move to the decision-making stage. I will try to update my RFC document with an explanation for the choice of Xoshiro. Here is what I understand for now:
The consensus is being challenged at ocaml/ocaml#10701, so don't hurry. Note that Xoshiro can be implemented in OCaml without too much pain. (Less pain than Chacha but more pain than Splitmix.) But performance will suffer. |
The discussion at ocaml/ocaml#10701 points out that a

There is a new paper on splittable generators, LXM: better splittable pseudorandom number generators (and almost as fast) by Steele and Vigna, OOPSLA 2021. This is a combination-based generator using some of our new or old friends as building blocks (Xoshiro, Murmur3). The paper is dense, with dozens of combinations analyzed, and a very thorough experimental analysis of the resulting sequences which suggests that most combinations are statistically very robust. Basically I understand them as "improved SplitMix", in particular with noticeably larger periods (so a lower chance of accidental overlap on splitting).

Note: the Related Work section mentions a "splittable generator survey" from Schaathun in 2015, which compares a large number of splittable-RNG constructions (but not SplitMix, which was not known to the author at the time), and concludes that the only robust approach is the "cryptographic" approach. My understanding is that Steele and Vigna agree that the "safest" approach to splitting (but not the fastest) is the cryptographic one.

I looked at how QuickCheck uses splitting for function generation. The way they implement

Taking a step back: basically the QuickCheck approach to generate a random function graph of type |
I'm not sure I understand the discussion in question. But there is no argument that Will look at the LXM paper. |
LXM is a very interesting design, indeed! It has everything we've been looking for: large state space, well-understood splitting, and efficient 64-bit implementation. I implemented the L64X128 variant in Pringo and I confirm the claim of the paper: it is barely any slower than SplitMix, despite the increased state space and additional operations. One thing that worries me is that LXM might be covered by this rather general patent: Generating pseudorandom number sequences by nonlinear mixing of multiple subsidiary pseudorandom number generators . |
Actually it's not that general because of the nonlinear mixing. LXM combines its L and X generators with a linear operation. The patent describes TwoLCG, an earlier design of Steele. So, I don't think this patent applies to LXM, but I wouldn't be surprised if a patent that covers LXM was in preparation. |
New proposal based on an LXM generator: ocaml/ocaml#10742 |
We did end up merging ocaml/ocaml#10742, so we have now replaced the Random algorithm by a splittable RNG, and this RFC can be closed. (Or merged? Doesn't matter, it served its purpose to facilitate and accelerate decision-making on the issue.) |
[RFC text copied below]
RFC: Change the stdlib Random implementation to a splittable PRNG from PRINGO (SplitMix or ChaCha)
This RFC proposes to replace the current implementation of the standard library Random module by one of the "splittable" PRNGs proposed by @xavierleroy's pringo library. The motivation is the move to Multicore, where splittable generators let us provide better behavior for the current global Random interface.
Note: This RFC is in a Draft state: it needs input from other people (in particular @xavierleroy, js-of-OCaml and Mirage people) before it can be considered a final proposal.
Motivation: Multicore
Random and Multicore
`Random` provides a "global" interface that is not explicitly parametrized over the generator state: `Random.bool : unit -> bool`, etc. (the `Random.State` module provides a parametrized version). This "global" interface creates correctness or performance problems in a Multicore world:

- If we keep a single global mutable generator state, it needs to be protected by a lock, which makes the PRNG a concurrency bottleneck.
- If we give an independent random generator to each domain, it is unclear how to initialize the state of a new domain. Reusing the state of the parent domain is wrong (the two domains will generate the same random numbers), and inflexible "seeding" policies will be incompatible with some user choices of seed.
- Other approaches tend to have dubious randomness properties, and often require tight coupling between the Random library and the multicore runtime, which would increase implementation complexity.
Splitting saves the day
Some PRNG algorithms provide a "split" operation that returns a pair of random-generator states, designed to generate independent-looking sequences.
With such "Splittable PRNGs", creating a new domain simply requires splitting the generator state of the parent domain. This approach is efficient (no synchronization of the generator state), has good randomness properties, and it is simple to implement inside the Random module (using runtime-provided thread-local-storage primitives). The decisions of the user in terms of seeding (OS-random or deterministic) are naturally respected, etc. In short: you want a splittable generator for Multicore.
(For a previous discussion of this problem where the suggestion to move to splittable PRNGs emerged, see ocaml-multicore/ocaml-multicore#582 (review) )
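The split-on-domain-creation idea can be sketched with the API that eventually shipped, assuming OCaml 5 (`Domain.spawn`, `Random.State.split`). The helper names `draws` and `parent_and_child_draws` are hypothetical, and the split is done by hand here rather than inside the Random module as the RFC proposes:

```ocaml
(* Draw n 64-bit values from an explicit generator state. *)
let draws (st : Random.State.t) (n : int) : int64 list =
  List.init n (fun _ -> Random.State.bits64 st)

(* Split the parent's state before spawning, so that the child domain owns
   its generator and neither side synchronizes with the other. *)
let parent_and_child_draws (seed : int) : int64 list * int64 list =
  let parent = Random.State.make [| seed |] in
  let child_state = Random.State.split parent in
  let child = Domain.spawn (fun () -> draws child_state 4) in
  let parent_draws = draws parent 4 in
  (parent_draws, Domain.join child)
```

Because the seed is fixed, both sequences are reproducible from run to run, which illustrates how the user's seeding choice carries over to the child.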
PRINGO generators
The pringo library provides two splittable PRNG implementations.
To give a sense of the performance difference, on one machine, on a micro-benchmark drawing numbers in a tight loop (`make benchmarks` from the pringo directory):

- drawing `int` has the same speed for all generators,
- drawing `int64` takes 55% of the Random time with SplitMix and 93% of the Random time with ChaCha, and
- drawing `float` values in [0;1] takes 66% of the Random time with SplitMix, and 120% of the Random time with ChaCha.

Those differences are unlikely to be noticeable, as drawing random numbers is a negligible part of most programs. Some very specific PRNG-bound algorithms (Monte-Carlo simulation, etc.) probably use custom PRNGs anyway.
Requirements for a standard-library PRNG
Here is the list of potential requirements we considered to decide on including a PRNG implementation in the standard library.
Non-concern: 32-bit hosts
SplitMix, using 64-bit integers internally, is going to have lower performance on 32-bit systems. We have more or less decided that we care less about the performance of 32-bit CPU architectures these days (we specifically discuss Javascript below), so we propose to not consider this in the choice process.
Concern: good randomness
We want PRNGs that pass statistical tests for good randomness. All algorithms discussed here perform well under standard PRNG-quality tests (diehard, etc.).
Possible concern: lesser portability of C stubs
Random currently has a pure-OCaml implementation (except for the system-specific auto-seeding logic). Moving to C stubs may be an issue, at least for Mirage users, and require extra work from alternative implementations (js_of_ocaml, BuckleScript, etc.).
SplitMix is a very simple algorithm: it is easy to provide a pure-OCaml version that should perform well. Porting Chacha, which is more elaborate, is more work, and it's also harder to predict performance for the pure-OCaml version.
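To illustrate how little code the drawing part of SplitMix needs, here is a minimal pure-OCaml sketch of the SplitMix64 step on boxed `int64` values. The type and function names are hypothetical, splitting is omitted, and this is neither Pringo's code nor a tuned implementation:

```ocaml
(* Generator state: a single 64-bit counter. *)
type splitmix = { mutable seed : int64 }

(* The "golden gamma" increment used by SplitMix64. *)
let golden_gamma = 0x9E3779B97F4A7C15L

let make (seed : int64) : splitmix = { seed }

(* Advance the counter by the increment, then scramble it with two
   multiply-xorshift rounds and a final xorshift. *)
let next_int64 (g : splitmix) : int64 =
  let open Int64 in
  g.seed <- add g.seed golden_gamma;
  let z = g.seed in
  let z = mul (logxor z (shift_right_logical z 30)) 0xBF58476D1CE4E5B9L in
  let z = mul (logxor z (shift_right_logical z 27)) 0x94D049BB133111EBL in
  logxor z (shift_right_logical z 31)
```

How well this performs under bytecode, on 32-bit targets, or under js_of_ocaml's `int64` emulation is the open question discussed in this RFC; the sketch does not answer it.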
Possible concern: Javascript performance
js_of_ocaml and Bucklescript/ReScript are important for our users, and it would be nice to ensure that the performance is not disappointing on these alternative implementations.
js_of_ocaml implements int64 with an emulation, suggesting that SplitMix could take a performance hit. (There is no native int64 type under Javascript anyway, "mainstream" approaches such as long.js emulate them with two numbers.) It may be that JS engines optimize the emulation layer efficiently, so we should evaluate each choice.
(Using C stubs in fact gives more flexibility for alternative implementations to provide their own native implementation of these functions.)
Summary
What are the requirements for the Random module?
The requirements discussed in the current version of the RFC are:
Here is a summary of what we understand to be the consensus so far (I'll edit this part of the RFC as discussion progresses):
pure-OCaml: yes.
A pure-OCaml implementation makes people's lives simpler and would be strongly preferable; we should port SplitMix and Chacha and re-run benchmarks.
performance: not important, but check that js_of_ocaml does okay.
Small performance hits are not very important for the standard Random PRNG, so a reasonable decrease (for everyone or just js_of_ocaml) would be perfectly acceptable.
We should still benchmark under js_of_ocaml to have an idea of the performance (once we have pure-OCaml versions). It's bad if some parts of OCaml programs suddenly become much slower there.
Having a crypto-secure PRNG is not part of our requirement -- Random is not secure in that sense.
Having to reseed SplitMix every 2^32 draws may be a problem in practice -- loops running 2^32 iterations are perfectly reasonable these days. All other things being equal, ChaCha should be preferred.