[RFC] changing the Random PRNG to a splittable algorithm from PRINGO #28

Closed
wants to merge 2 commits into from
102 changes: 102 additions & 0 deletions rfcs/splittable-prng.md
@@ -0,0 +1,102 @@
# RFC: Change the stdlib Random implementation to a splittable PRNG from PRINGO (SplitMix or ChaCha)

This RFC proposes to replace the current implementation of the standard library [Random](https://ocaml.org/api/Random.html) module with one of the "splittable" PRNGs provided by @xavierleroy's [pringo](https://github.com/xavierleroy/pringo/) library. The motivation is the move to Multicore, where splittable generators let us provide better behavior for the current global Random interface.

*Note: This RFC is in a Draft state: it needs input from other people (in particular @xavierleroy, js_of_ocaml and Mirage people) before it can be considered a final proposal.*


## Motivation: Multicore

### Random and Multicore

`Random` provides a "global" interface that is not explicitly parametrized over the generator state (`Random.bool : unit -> bool`, etc.); the `Random.State` module provides the parametrized version. This "global" interface creates correctness or performance problems in a Multicore world:

- If we keep a single global mutable generator state, it needs to be protected by a lock, which makes the PRNG a concurrency bottleneck.

Comment on lines +14 to +15

Lock-free implementations are possible for some PRNGs. For instance, SplitMix needs only a 64-bit fetch-and-add, because the mutable internal state is just a 64-bit counter.

Member Author

@gasche gasche Sep 7, 2021

This is a good point, but (1) it constrains the algorithm design, and (2) atomic operations would still be measurably slower than purely local computations on a PRNG-bound program.

- If we give an independent random generator to each domain, it is unclear how to initialize the state of a new domain. Reusing the state of the parent domain is wrong (the two domains will generate the same random numbers), and inflexible "seeding" policies will be incompatible with some user choices of seed.

Other approaches tend to have dubious randomness properties, and often require tight coupling between the Random library and the multicore runtime, which would increase implementation complexity.
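
The second problem can be reproduced with the existing `Random.State` API. This is a minimal sketch, using `Random.State.copy` as a stand-in for a child domain that inherits the parent's state; the helper `draw` is made up for this illustration.

```ocaml
(* Draw [n] raw outputs from an explicit generator state. *)
let draw st n = List.init n (fun _ -> Random.State.bits st)

let () =
  let parent = Random.State.make [| 42 |] in
  (* Stand-in for a new domain that simply reuses the parent's state. *)
  let child = Random.State.copy parent in
  let from_parent = draw parent 5 in
  let from_child = draw child 5 in
  (* Both "domains" observe exactly the same sequence. *)
  assert (from_parent = from_child);
  print_endline "parent and child produced identical draws"
```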

### Splitting saves the day

Some PRNG algorithms provide a "split" operation that turns one random-generator state into two states, designed to generate independent-looking sequences.

With such "Splittable PRNGs", creating a new domain simply requires splitting the generator state of the parent domain. This approach is efficient (no synchronization of the generator state), has good randomness properties, and it is simple to implement inside the Random module (using runtime-provided thread-local-storage primitives). The decisions of the user in terms of seeding (OS-random or deterministic) are naturally respected, etc. In short: you want a splittable generator for Multicore.
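
To make the idea concrete, here is a rough pure-OCaml sketch of a SplitMix-style generator with a `split` operation. It is not pringo's implementation: the names (`state`, `make`, `next_int64`, `split`) are chosen for this illustration, the mixing constants are the widely used SplitMix64 finalizer constants, and `mix_gamma` is simplified (the published algorithm additionally rejects increments with too few bit transitions).

```ocaml
(* A generator is a 64-bit counter plus an odd increment ("gamma"). *)
type state = { mutable seed : int64; gamma : int64 }

let golden_gamma = 0x9e3779b97f4a7c15L

(* 64-bit finalizer applied to the counter to produce an output. *)
let mix64 z =
  let open Int64 in
  let z = mul (logxor z (shift_right_logical z 30)) 0xbf58476d1ce4e5b9L in
  let z = mul (logxor z (shift_right_logical z 27)) 0x94d049bb133111ebL in
  logxor z (shift_right_logical z 31)

(* Derive an odd increment for a child state (simplified, see lead-in). *)
let mix_gamma z =
  let open Int64 in
  let z = mul (logxor z (shift_right_logical z 33)) 0xff51afd7ed558ccdL in
  let z = mul (logxor z (shift_right_logical z 33)) 0xc4ceb9fe1a85ec53L in
  logor (logxor z (shift_right_logical z 33)) 1L

let make seed = { seed; gamma = golden_gamma }

(* Advance the counter by gamma and return its new value. *)
let next_seed s =
  s.seed <- Int64.add s.seed s.gamma;
  s.seed

let next_int64 s = mix64 (next_seed s)

(* Advance the parent twice and build a fresh child state from the results. *)
let split s =
  let seed = mix64 (next_seed s) in
  let gamma = mix_gamma (next_seed s) in
  { seed; gamma }

let () =
  let parent = make 42L in
  (* What a new domain would receive at creation time. *)
  let child = split parent in
  Printf.printf "parent: %Lx, child: %Lx\n"
    (next_int64 parent) (next_int64 child)
```

Splitting only touches the parent's own state, so no synchronization with other domains is needed, and a deterministic seed still yields a deterministic tree of generators.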

(For a previous discussion of this problem where the suggestion to move to splittable PRNGs emerged, see https://github.com/ocaml-multicore/ocaml-multicore/pull/582#pullrequestreview-683676282 )


## PRINGO generators

The [pringo](https://github.com/xavierleroy/pringo/) library provides two splittable PRNG implementations.

| Name | numeric type | speed wrt. Random | pure OCaml | secure | when to reseed? |
|----------|--------------|-------------------|--------------|-------------------|------------------|
| Random | int | = | yes | no | ? |
| SplitMix | int64 | slightly faster | no (C stubs) | no | after 2^32 draws |
| ChaCha | int32 | slightly slower | no (C stubs) | yes (weak crypto) | after 2^64 draws |

To give a sense of the performance difference, here are results on my machine for a micro-benchmark drawing numbers in a tight loop (`make benchmarks` from the pringo directory):

The relative performance varies a lot depending on (1) 64-bit vs. 32-bit processor, and (2) the size of the numbers drawn (schematically: 64-bit integers benefit SplitMix but bytes benefit ChaCha).


- drawing `int` has the same speed for all generators,
- drawing `int64` takes 55% of the Random time with SplitMix and 93% of the Random time with ChaCha, and
- drawing `float` values in [0;1] takes 66% of the Random time with SplitMix, and 120% of the Random time with ChaCha.
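
For readers who want to reproduce this kind of measurement without pringo, a tight-loop timing harness can be sketched as follows. This is not the pringo benchmark: `time_draws` is a made-up helper, it only measures the current stdlib `Random`, and CPU time from `Sys.time` is a coarse measure.

```ocaml
(* Time [n] calls to [draw] and return the elapsed CPU time in seconds. *)
let time_draws name draw n =
  let t0 = Sys.time () in
  for _i = 1 to n do
    (* opaque_identity keeps the compiler from discarding the draw. *)
    ignore (Sys.opaque_identity (draw ()))
  done;
  let dt = Sys.time () -. t0 in
  Printf.printf "%-6s %d draws in %.3fs\n" name n dt;
  dt

let () =
  let n = 1_000_000 in
  ignore (time_draws "int64" (fun () -> Random.int64 Int64.max_int) n);
  ignore (time_draws "float" (fun () -> Random.float 1.0) n)
```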

Those differences are unlikely to be noticeable, as drawing random numbers is a negligible part of most programs. Some very specific PRNG-bound algorithms (Monte-Carlo simulation, etc.) probably use custom PRNGs anyway.

I wouldn't be so sure. QuickCheck-style random property testing can use a lot of pseudo-random numbers. I'd like to hear from Monte-Carlo specialists re: custom PRNGs.

Member Author

Excellent point! I can try to benchmark QCheck with an alternative PRNG. My guess is that the testing time is dominated by the actual check in user code (running the test), so this would not be noticeable in practice.

I don't know if we have people resembling Monte-Carlo specialists, but my bet would be on @Octachron, with @jhjourdan as the informed hobbyist.

> I can try to benchmark QCheck with an alternative PRNG

Or just profile to know how much time is spent in the stdlib PRNG. Then we can extrapolate from there.

As one data point, Stan (whose one job is Hamiltonian MCMC[0]) uses boost::ecuyer1988 ([1], also known as CombLec88 in e.g. [2]), using its ability to skip ahead fast to generate several independent streams ([3] explains why they chose this approach). Unfortunately I have no measure of how much of a typical Stan run the RNG takes (though I suspect little, as Stan spends a lot of time computing likelihoods and derivatives).

[0] sorry if this is not the kind of Monte-Carlo you're thinking of
[1] https://www.boost.org/doc/libs/1_77_0/doc/html/boost_random/reference.html#boost_random.reference.generators
[2] http://videos.rennes.inria.fr/seminaire_Irisa/lecuyer/rngx.pdf
[3] stan-dev/stan#3028 (comment)

Member

@Octachron Octachron Sep 13, 2021

I would expect Markov Chain Monte-Carlo simulations to tune their PRNG, since they need to generate a lot of samples from a simple distribution to achieve thermalization. I will try to dig up some of the sampling experience from my PhD to do more performance analysis (but from memory, using efficient algorithms for sampling non-uniform distributions had a larger impact than the PRNG itself).



## Requirements for a standard-library PRNG

Here is the list of potential requirements we considered to decide on including a PRNG implementation in the standard library.

### Non-concern: 32-bit hosts

SplitMix uses 64-bit integers internally, so it will be slower on 32-bit systems. We have more or less decided that we care less about the performance of 32-bit CPU architectures these days (JavaScript is discussed separately below), so we propose not to consider this in the choice process.

### Concern: good randomness

We want PRNGs that pass statistical tests for good randomness. All algorithms discussed here perform well under standard PRNG-quality tests (diehard, etc.).


### Possible concern: lesser portability of C stubs

Random currently has a pure-OCaml implementation (except for the system-specific auto-seeding logic). Moving to C stubs may be an issue, at least for Mirage users, and require extra work from alternative implementations (js_of_ocaml, BuckleScript, etc.).

SplitMix is a very simple algorithm; it is easy to provide a pure-OCaml version that should perform well. Porting ChaCha, which is more elaborate, is more work, and it is also harder to predict the performance of a pure-OCaml version.
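
To give an idea of what a pure-OCaml ChaCha port would involve, here is a sketch of its core building block, the quarter-round, written with boxed `Int32` values. It is only an illustration (the names `rotl` and `quarter_round` are chosen here, and a full generator would apply this to a 16-word state over several rounds); it is checked against the quarter-round test vector from RFC 8439, section 2.1.1.

```ocaml
(* Rotate a 32-bit word left by [n] bits (0 < n < 32). *)
let rotl x n =
  Int32.logor (Int32.shift_left x n) (Int32.shift_right_logical x (32 - n))

(* One ChaCha quarter-round on four 32-bit words. *)
let quarter_round (a, b, c, d) =
  let a = Int32.add a b in
  let d = rotl (Int32.logxor d a) 16 in
  let c = Int32.add c d in
  let b = rotl (Int32.logxor b c) 12 in
  let a = Int32.add a b in
  let d = rotl (Int32.logxor d a) 8 in
  let c = Int32.add c d in
  let b = rotl (Int32.logxor b c) 7 in
  (a, b, c, d)

let () =
  (* Test vector from RFC 8439, section 2.1.1. *)
  let result =
    quarter_round (0x11111111l, 0x01020304l, 0x9b8d6f43l, 0x01234567l) in
  assert (result = (0xea2a92f4l, 0xcb1cf8cel, 0x4581472el, 0x5881c4bbl))
```

Every rotation is spelled as two shifts and an `or`, and each intermediate `Int32` is potentially boxed, which is where the performance uncertainty comes from.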

ChaCha reimplemented in OCaml will be awful. The C implementation of parts of SplitMix is motivated by having decent performance with ocamlopt on 32-bit machines and with ocamlc (bytecode interpretation). I agree that an OCaml implementation of SplitMix should be quite fast with ocamlopt on a 64-bit machine.

Member Author

I thought that you didn't care about 32-bit CPUs anymore? (If we keep x86-32 for cross-compilation we don't care about PRNG performance; riscv32 remains, and in general embedded targets.)

We could consider having both a pure-OCaml and a C-stubs implementation, and choosing between the two at configure-time. (Or by matching on Sys constants, but that would be less comfortable for Mirage people and others with pure-OCaml constraints.)

Member Author

@xavierleroy as a non-expert, I'm curious as to how you can tell that pure-OCaml ChaCha performance would be bad. The current implementation uses 15 mutable (stack-allocated) uint32_t variables; using Int32.t ref would introduce a lot of boxing.

But couldn't we use another approach, such as:

  • unrolling the loop and using only (immutable) temporaries, and letting the register allocator sort it out,
  • or using an int32 bigarray stored in the generator state, which hopefully would let us read/write without boxing?

I think an OCaml implementation of ChaCha would look pretty much like the OCaml implementation of MD5 that we use as a test (test/lib-digest/md5.ml) and that I used as a benchmark back in the day, and performance was not great compared with a pure C implementation. The boxing of int32's in the state doesn't help. The fact that ocamlopt doesn't recognize "rotate" instructions doesn't help either.



### Possible concern: Javascript performance

js_of_ocaml and BuckleScript/ReScript are important for our users, and it would be nice to ensure that the performance is not disappointing on these alternative implementations.

js_of_ocaml implements int64 with an emulation, suggesting that SplitMix could take a performance hit. (There is no native int64 type under JavaScript anyway; "mainstream" approaches such as [long.js](https://github.com/dcodeIO/long.js) emulate them with two numbers.) It may be that JS engines optimize the emulation layer efficiently, so we should evaluate each choice.
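
To illustrate where the emulation overhead comes from, here is a small sketch of 64-bit addition carried out on two 32-bit halves, in the spirit of the two-number scheme mentioned above. It is not js_of_ocaml's actual runtime representation: the type `emu64` and its functions are hypothetical, and the code assumes a 64-bit OCaml host so that each half fits in a native `int`.

```ocaml
(* A 64-bit value split into two unsigned 32-bit halves. *)
type emu64 = { hi : int; lo : int }

let mask32 = 0xFFFF_FFFF

let of_int64 x =
  { hi = Int64.to_int (Int64.shift_right_logical x 32) land mask32;
    lo = Int64.to_int x land mask32 }

let to_int64 e =
  Int64.logor (Int64.shift_left (Int64.of_int e.hi) 32) (Int64.of_int e.lo)

(* Addition modulo 2^64: add the low halves, then propagate the carry. *)
let add a b =
  let lo = a.lo + b.lo in
  let carry = lo lsr 32 in
  { hi = (a.hi + b.hi + carry) land mask32; lo = lo land mask32 }

let () =
  let x = 0x9e3779b97f4a7c15L and y = 0xFFFFFFFFL in
  (* The emulated sum agrees with native Int64 arithmetic. *)
  assert (to_int64 (add (of_int64 x) (of_int64 y)) = Int64.add x y)
```

A single native addition becomes several operations plus masking, and multiplication (which SplitMix's mixing function relies on) is more expensive still.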

(Using C stubs in fact gives more flexibility for alternative implementations to provide their own native implementation of these functions.)


## Summary

What are the requirements for the Random module?

The requirements discussed in the current version of the RFC are:
- implementation being pure OCaml
- performance
- "security" of the PRNG
- reseeding recommendations


Here is a summary of what we understand to be the consensus so far (I'll edit this part of the RFC as discussion progresses):

- pure-OCaml: yes.

A pure-OCaml implementation makes people's lives simpler and would be strongly preferable; we should port SplitMix and ChaCha and re-run benchmarks.

- performance: not important, but check that js_of_ocaml does okay.

Small performance hits are not very important for the standard Random PRNG, so a reasonable decrease (for everyone or just js_of_ocaml) would be perfectly acceptable.

We should still benchmark under js_of_ocaml to have an idea of the performance (once we have pure-OCaml versions). It's bad if some parts of OCaml programs suddenly become much slower there.

- Having a crypto-secure PRNG is not part of our requirements: Random is not secure in that sense.

- Having to reseed SplitMix every 2^32 draws may be a problem in practice -- loops running 2^32 iterations are perfectly reasonable these days. All other things being equal, ChaCha should be preferred.