Investigate replacements for SmallRng algorithm #910
Do you need multiple streams? Constant-time jumps for parallel execution? Are 64-bit operations fine or would you prefer 32-bit operations?
We expect the key size (seed + stream) to be at least 100 bits. We require an algorithm for 32-bit CPUs and for 64-bit CPUs (currently we have a different RNG for each; they could also be the same). For parallel usage we generally recommend seeding each PRNG independently from a strong master (CSPRNG or system source) instead of using jumps.
Does this affect the current implementation of …?
For these two cases, on my page I suggest xoroshiro128++ or xoshiro128++. The second generator is 32-bit, though, so you'll need to call it twice for a 64-bit integer or a double. Note that in general generators writing to multiple locations (like those above) might perform better in artificial benchmarks (e.g., loops) than generators with a stream parameter, but in real life the cost of storing two words instead of one might be detectable even at the application level in some circumstances (e.g., a lot of generation in a situation in which the compiler cannot keep the state of the PRNG in registers).

SplitMix fills many of your requirements. The problem with it is that the parameter (the additive constant of the underlying Weyl generator) must have a balance of zeros and ones, so it might be tricky to let the user specify an arbitrary parameter and avoid collisions when you "fix" it. An alternative you might explore is any kind of 64-bit engine (e.g., a xorshift generator or an LCG with modulus 2^64) with a strong parameterized scrambling function: something like a MurmurHash mix, but inserting the constant defining the stream in the middle (e.g., when you multiply). Of course it is not guaranteed that all constants will work equally well, etc. Streams are objectively difficult unless you are using a stream cipher.
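A sketch of that parameterized-scrambling idea (my own illustration, not code from any library: a Weyl engine with a MurmurHash3/SplitMix64-style finalizer, inserting the stream constant between the two multiplies; nothing here validates the stream constant, per the caveat above):

```rust
// Illustrative only: 64-bit Weyl engine + parameterized scrambler.
// `stream` is the hypothetical constant defining the stream.
struct StreamRng {
    state: u64,
    stream: u64,
}

impl StreamRng {
    fn new(seed: u64, stream: u64) -> Self {
        // Force the stream constant odd so nothing degenerates to zero;
        // this does NOT guarantee a good balance of zeros and ones.
        StreamRng { state: seed, stream: stream | 1 }
    }

    fn next_u64(&mut self) -> u64 {
        // Weyl generator: advance state by a fixed odd increment.
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        // MurmurHash3 fmix64-style scrambling, with the stream constant
        // mixed in between the two multiplies.
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z ^= self.stream;
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}
```

Since every step after the stream constant is XORed in is bijective, two instances with the same seed but different (odd) stream constants are guaranteed to produce different outputs at each step.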
I don't believe so. In this case we care more about reproducibility and less about supporting many independent instances (the function only accepts a 64-bit key and uses a fixed stream). I did some quick benchmarks of our existing Rust PRNGs below. These are not fully representative (only one CPU arch, and may be a little imprecise), but they show roughly how our current code performs.
(benchmark table omitted)
It's nice to see that chacha8 is faster than steprng thanks to vectorization! It looks like xoshiro128++ and xoshiro256++ are good choices for i686 and x86_64, respectively.
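For concreteness, the xoshiro256++ step, following the published reference implementation (the struct and seeding here are placeholders: the reference recommends filling the state with SplitMix64 output, and the state must not be all zeros):

```rust
// xoshiro256++ next() as in Blackman & Vigna's reference C code.
struct Xoshiro256PlusPlus {
    s: [u64; 4], // must not be all zeros
}

impl Xoshiro256PlusPlus {
    fn next_u64(&mut self) -> u64 {
        // Scrambled output: rotl(s0 + s3, 23) + s0.
        let result = self.s[0]
            .wrapping_add(self.s[3])
            .rotate_left(23)
            .wrapping_add(self.s[0]);
        // Linear state transition.
        let t = self.s[1] << 17;
        self.s[2] ^= self.s[0];
        self.s[3] ^= self.s[1];
        self.s[1] ^= self.s[2];
        self.s[0] ^= self.s[3];
        self.s[2] ^= t;
        self.s[3] = self.s[3].rotate_left(45);
        result
    }
}
```

xoshiro128++ is the same design on four 32-bit words with different rotation and shift constants.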
Not on i686. I guess this has more to do with which CPU features can be assumed than with the word size.
Some results are improved a little when using …
Yes, this is expected, because …
Are you keeping the compiler from unrolling loops? My experience, in particular with clang, is that the compiler will unroll different code a different number of times, strongly biasing the results. That might not be the case here, but it doesn't hurt to check. The other parameter that might heavily influence the results is whether you let the compiler extract loop invariants. In that case, anything using large constants might appear much faster than it is in reality, because due to the small loop size all constants are loaded into registers before entering the loop. When you're mixing generation with other activities this might or might not happen (it depends a lot on the CPU architecture).
Most of the results (the MB/s value) are robust against doubling each of …. This doesn't rule out loop unrolling, but I would expect to see larger changes if it were a significant factor.
I don't really know what we can do about this, except to recommend that users benchmark their particular applications (which we already do).
I didn't say you had to do anything—that was just a general comment. My benchmarks use 10^10 iterations. In C, -march=native can have disruptive effects—for example, on my benchmark hardware SplitMix gets unrolled/vectorized by GCC and takes 0.50ns/word. |
Based on re-reading this and this, sfc64 and sfc32 also appear to be good candidates. I have updated @pitdicker's …
Simulated i686 benchmarks imply we should prefer SFC32 over SFC64 for 32-bit platforms:
(benchmark table omitted)
@vigna, can you please comment on the repeats that O'Neill found in Xoshiro? I realise there is a certain amount of hyperbole involved, since this apparently affects only approximately 2^64 of the 2^256 outputs of Xoshiro256. I understand that David Blackman did some further analysis? The SFC generator by Doty-Humphrey also looks interesting, but appears to lack third-party review, a recurring problem in the space of fast non-crypto PRNGs!
I guess the alternative is that we just retire …
SFC is very fast and, empirically, very good (it might be different in applications, because writing four words of state takes time; xoshiro256 has the same problem). I don't think anybody has seriously tried to find flaws in it, though. I have made a point of not considering in my work generators for which you cannot prove that the period is close to the size of the state space (the "bang for the buck" criterion), and SFC has a guaranteed period of at least 2^64 against 2^256 points of the state space.

This is also a problem if you want to do random seeding for multiple streams, because the assumption of random seeding is that it is like jumping at random in an orbit, and you estimate overlap based on the size of the orbit. The size is not known for SFC; we have just a lower bound that is too small for any practical random-seeding practice. One can, of course, engage in imaginative, non-mathematical considerations about that size, but I was trained otherwise, so it's not my cup of tea.

The occurrence of a too-frequent pattern every 2^192 elements: yes, it is true. Basically every PRNG with structure, if you work it enough, will show you something like that. Harase, for example, shows that if you pick the output of the Mersenne Twister with certain specific lags (like, take one output, skip 15, take one, skip 37, etc.) periodically, the result fails the Birthday Spacings test (!). So subsequences of the Mersenne Twister are not random. Analogously, Matsumoto shows that if you take triples of outputs from xorshift128+, multiply them by 2^22 and use the result to generate points in the unit cube, the points you obtain are not very random (multiplying by 2^22 moves the middle bits to the top, and so it carefully "magnifies" a slight bias in the middle bits that affects triples of consecutive outputs and that would be very difficult to detect directly).
All these results (there are many more; I just wanted to share some examples) are very common: if your generator has structure and you bang on the structure enough, you'll find something. I consider more relevant, for example, the fact that linear generators map states with few ones to states with few ones (the "escape from zeroland" problem), and my solution is to make good generators with a small state space (the Mersenne Twister needs millions of iterations to get back to normality starting from a state with all bits zero except one).

The point is that there is statistically no way your computation can be influenced by a slight bias in probability happening after 2^192 values. If you believed that it could, all generators with fewer than 192 bits of state would be problematic, because if you drew 2^192 values from them you would repeat the period several times, and every imaginable statistical feature you expect from a random stream would be wrong.

It is a very different situation when structure kills the alleged properties of the generator. For example, all sequences produced by an LCG with power-of-two modulus (and multiplier with maximum potency, as usually happens) with varying constant are the same, modulo a sign change and an additive constant (in a couple of days there will be an arXiv paper explaining this in detail). It took some time to realize this: there are papers from the late '80s suggesting the use of different constants to get different, independent streams. People even proved complex properties of these "independent streams", properties that were hinting at, but not proving, independence. Then Durst showed that all such sequences are basically the same, and everything fell like a house of cards.
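To make the random-seeding concern concrete, the usual back-of-envelope union bound can be computed directly (my illustration; the numbers below are assumptions, not measurements from the thread):

```rust
// Union-bound estimate for random seeding: P independently seeded
// instances, each drawing L outputs from a cycle of length N, overlap
// somewhere with probability roughly P^2 * L / N. All arguments are
// base-2 logarithms to avoid overflow; the result is capped at 1.
fn overlap_bound(log2_p: f64, log2_l: f64, log2_n: f64) -> f64 {
    let log2_prob = 2.0 * log2_p + log2_l - log2_n;
    log2_prob.exp2().min(1.0)
}
```

With, say, 2^16 workers drawing 2^48 outputs each, a full 2^256 cycle gives `overlap_bound(16.0, 48.0, 256.0)` = 2^-176, which is negligible; against only the guaranteed 2^64 lower bound, the estimate saturates at 1, so the guarantee alone cannot support a random-seeding argument.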
Assuming all orbits have size 2^64 within a 256-bit state, the chance of two random seeds having the same orbit is 2^-192. Of course this is not the case since orbits may be (and usually are) larger, but if you consider the RNG as having 2^192 slices of size 2^64 (each of which may jump to any slice at its end), the chance of two random seeds being within 2^64 of each other is still 2^-191, right? Thus I fail to see how the size of the slice is important to this argument. (Assuming that the state mapping function is bijective — perhaps this is an assumption too far?)
Yes, I read your recent paper attacking the Mersenne Twister. Let's hope it helps do what many past papers have not: convince an audience that is not expert in this subject to consider alternatives to MT19937. I would like to link it from our book.
Perhaps it's more that 5-in-7 repeats are a very obvious problem, no matter which scrambler/distribution you apply to the results. Also, 2^64 such patterns makes this more prevalent than zeroland issues, right? (There are far fewer than 2^64 ways to choose 6 bits from 256.)
I look forward to the read!
Usually you assume P processors using L outputs each and you want to compute the probability of overlap. I don't know if the next-state function of SFC is bijective, but let's assume it is. But yes: either the orbits are very long, and then there can be collisions on the orbits but not on the sequences, or they are short, and then there could be an overlap on a sequence, but since everyone is in a different orbit everything is fine. I have no idea whether this is more "efficient" than a single orbit; one would need to compute the probability of overlap in both cases. But you're right, my considerations about random seeding were mostly FUD (given that the next-state function is bijective). Note that "escape from zeroland" is not my invention; it is a term used in the paper about WELL.

As for the "no matter which scrambler" remark: no, if you use a scrambler like ++ that combines several parts of the state, the observation about 5-in-7 is no longer valid: there might be this bias in one of the combined words, but it will be hidden by the others. ** is different, because it remaps bijectively a single word of state, and that word is specifically the one with the reported bias (in fact, we could probably change the word we use if this makes someone happier). For the zeroland part, no, it's not less frequent; it is just a matter of how close you want to look. Since there are 256 bits of state, you can consider the more than 2^64 states containing at most 11 ones. You will definitely see a bias in the number of ones in the next state (which will be partially hidden by the scrambler, but it's there).
Here's the SFC-64 transition function. I think …
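For reference, the SFC64 step as commonly implemented (a reconstruction using the constants from PractRand: barrel shift 24, right shift 11, left shift 3; the struct, naming, and seeding routine here are mine, not from the thread):

```rust
// SFC64: three words of chaotic state plus a counter, which guarantees
// a period of at least 2^64 regardless of the orbit structure.
struct Sfc64 {
    a: u64,
    b: u64,
    c: u64,
    counter: u64,
}

impl Sfc64 {
    fn from_seed(seed: u64) -> Self {
        let mut rng = Sfc64 { a: seed, b: seed, c: seed, counter: 1 };
        // Discard some outputs to mix the initial state (a common
        // convention; the exact discard count varies by implementation).
        for _ in 0..12 {
            rng.next_u64();
        }
        rng
    }

    fn next_u64(&mut self) -> u64 {
        let tmp = self.a.wrapping_add(self.b).wrapping_add(self.counter);
        self.counter = self.counter.wrapping_add(1);
        self.a = self.b ^ (self.b >> 11);
        self.b = self.c.wrapping_add(self.c << 3);
        self.c = self.c.rotate_left(24).wrapping_add(tmp);
        tmp
    }
}
```

Whether the next-state function is bijective is exactly the open question raised below; the counter is what provides the 2^64 period floor either way.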
BTW, if you're interested, our paper is out: https://arxiv.org/abs/2001.05304
For your information, I have written a short document on testing PRNGs for high-quality randomness. There is more to testing PRNGs than testing a single random number sequence. For instance:
However, if we limit the scope to PRNGs with 128 or maybe 256 bits of state, maybe a PractRand threshold of 1 TiB, as I suggest in that document, is too much for …
It's time we drew a conclusion here. I'm inclined to take @vks's suggestion:
Vigna noted that microbenchmarks may be misleading; still, without more to go on (and knowing our target platforms), I believe the above is a good choice.
Due to close correlations of PCG streams (rust-random#907) and lack of right-state propagation (rust-random#905), the `SmallRng` algorithm is switched to xoshiro{128,256}++. The implementation is taken from the `rand_xoshiro` crate and slightly simplified. Fixes rust-random#910.
Due to close correlations of PCG streams and lack of right-state propagation, we should consider replacing PCG with another algorithm (or algorithms) for the next Rand version (0.8).
From the docs, the purpose of `SmallRng` is: …

Ideally (in my opinion), `SmallRng` should be small but not too small; preferably 128-bit, or 256-bit if we must. @vigna, have you thoughts on this (given that you recommend a 256-bit variant of your generator for general usage, but in this case we already have a ChaCha-based generator for general usage)?

There are other generators besides PCG and Xo(ro)shiro, e.g. GJrand, JSF and SFC, though I've seen less analysis of these. Previous decisions on this topic have been somewhat influenced by this thread, though it only considers benchmarks and some very basic analysis.