
Evaluate Mwc256XXA64 #18

Closed
tkaitchuck opened this issue Jan 11, 2021 · 23 comments


@tkaitchuck

tkaitchuck commented Jan 11, 2021

Background

There has been a lot of discussion about what to do about SmallRng when special instructions such as AES-NI are not available:
rust-random/rand#910, rust-random/rand#767, rust-random/rand#603

Previously it was PCG, but there were some issues:
rust-random/rand#907, rust-random/rand#905
including the small state size. It was switched to Xoshiro, which was an improvement but is still not ideal.

I set about creating a better option for this case.

Design

I've written up the design and more about the motivation here:
https://tom-kaitchuck.medium.com/designing-a-new-prng-1c4ffd27124d

Explanation of quality is detailed there as well.

The code is checked in here:
https://github.com/tkaitchuck/Mwc256XXA64

@rust-random If there is agreement that this is worth adding as an option I can create a PR to add a rand_mwc crate.

@dhardy
Member

dhardy commented Jan 14, 2021

Sorry @tkaitchuck for not replying sooner; yes I have seen this. I haven't had time to take a look yet, so the only thing I can say is that we like to have third-party review before switching to a new RNG (none of us here are experts on this).

Perhaps though @vigna might be interested?

@vigna

vigna commented Jan 14, 2021

I started to read the link. After a few lines, there's a phrase "Xoshiro128+ does not pass the PractRand test suite" linked to a site that does not discuss xoshiro128+ at all. The phrase "also fails BigCrush" links to a page about a completely different generator whose linear engine is by Marsaglia. And finally "has non-linear problems" points at a paper about linear problems 🤦🏻‍♂️.

I really don't have time to really read this, sorry. 😂

At a glance, it's a standard MWC + some xorshift/add like the parallel generator by Agner Fog and many others—designs like this have been common in the last decades. It seems fine—it would be nice to have a C version and proper benchmarks word-at-a-time: filling arrays is OK but that's not the standard way you use a generator.

Personally (a matter of taste), I prefer the Goresky–Klapper generalization to Marsaglia's MWCs; it makes it possible to get very good spectral scores, as in

    uint64_t x = 1, c;

    inline uint64_t next() {
        const __uint128_t t = 0xff235b1bea19555a * (__uint128_t)x + c;
        x = 0xabe68618e06e7101 * (uint64_t)t;
        c = (t + 0xe38d70ff * (__uint128_t)x) >> 64;
        return x;
    }

at the expense of an additional multiplication. The generator above is excellent without any scrambling. MWCs in general have bad spectral scores because the multiplier is too small with respect to the modulus (MWCs and generalized MWCs simulate a multiplicative congruential generator). That's why they usually have very large state (e.g., Marsaglia's mother-of-all) or some scrambling on top—to hide the defects.

Both mechanisms have the usual issue of multiplicative generators—initial states differing by a small (and also not-so-small) multiplicative constant will generate strongly correlated sequences, because any multiplicative relationship between initial states is carried on forever. But that is fairly easy to fix with careful initialization, and it is lessened by the scrambling. Just don't let the user initialize the thing directly.
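The multiplicative-correlation point can be illustrated with a toy MCG (a small Lehmer generator, purely a stand-in for any generator discussed here): two seeds related by a constant factor stay related by that factor forever, because multiplication by the MCG constant preserves the ratio modulo m.

```rust
// Toy illustration (hypothetical, small-scale): in a plain MCG
// x_{n+1} = a * x_n mod m, two seeds related by a constant factor k
// stay related by k forever.
const M: u64 = (1 << 31) - 1; // prime modulus 2^31 - 1
const A: u64 = 48271;         // Park–Miller multiplier

fn mcg_step(x: u64) -> u64 {
    (A * x) % M // product fits comfortably in u64: < 2^47
}

fn main() {
    let (mut x, mut y) = (12345u64, (2 * 12345) % M); // seeds differ by factor 2
    for _ in 0..1000 {
        x = mcg_step(x);
        y = mcg_step(y);
        assert_eq!(y, (2 * x) % M); // the factor-of-2 relation never decays
    }
    println!("correlated for all 1000 steps");
}
```

This is why careful initialization (or scrambling on the output) matters for multiplicative designs.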

@tkaitchuck
Author

tkaitchuck commented Jan 14, 2021

I started to read the link. After a few lines, there's a phrase "Xoshiro128+ does not pass the PractRand test suite" linked to a site that does not discuss xoshiro128+ at all. The "also fails BigCrush", linked to a page talking about another completely different generator whose linear engine is by Marsaglia.

Sorry, that should have read Xoroshiro128+ not Xoshiro128+. Corrected.

@tkaitchuck
Author

initial states differing by a small (and also not-so-small) multiplicative constant will generate strongly correlated sequences, because any multiplicative relationship between initial states is carried on forever. But that is fairly easy to fix with careful initialization, and it is lessened by the scrambling. Just don't let the user initialize the thing directly.

Good point. The code as written does not allow user-initialized values to specify X3 or C, so trivial multiples should already be impossible with the code as-is.
This does introduce a potential additional test: I'll change the code to allow such trivial multiples and then create a composite generator which interleaves results from two generators initialized with trivial multiples. Then I will see if that passes BigCrush both with and without the permutation. (As it stands, a single generator will pass BigCrush without the permutation.) If the composite generator does not pass without the permutation but does pass with it, that would show the permutation is needed, and it might be a good way to compare different permutations.
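The interleaving harness described above can be sketched generically (names are illustrative, not from the repo): two generators, seeded with trivial multiples in the real test, have their outputs merged alternately before being fed to a test suite.

```rust
// Hypothetical sketch of the interleaving harness: merge the outputs
// of two generator closures alternately, so PractRand/BigCrush see the
// combined stream and can detect cross-stream correlation.
fn interleave(mut a: impl FnMut() -> u64, mut b: impl FnMut() -> u64, n: usize) -> Vec<u64> {
    (0..n).map(|i| if i % 2 == 0 { a() } else { b() }).collect()
}

fn main() {
    // Stand-in "generators": simple counters starting at 0 and 100.
    let (mut x, mut y) = (0u64, 100u64);
    let merged = interleave(|| { x += 1; x }, || { y += 1; y }, 6);
    println!("{:?}", merged); // alternates the two streams
}
```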

@vigna

vigna commented Jan 14, 2021

This does introduce a potential additional test: I'll change the code to allow such trivial multiples and then create a composite generator which interleaves results from two generators initialized with trivial multiples. Then I will see if that passes BigCrush both with and without the permutation.

I'd go for PractRand; it is much better than BigCrush for interleaving tests because you can easily measure when the tests start to fail. My 2€¢.

I don't understand how you can avoid small-constant multiples with a test—there could be millions of generators. You cannot keep track of all the seeds you used.

@tkaitchuck
Author

tkaitchuck commented Jan 14, 2021

I don't understand how you can avoid small-constant multiples with a test—there could be millions of generators. You cannot keep track of all the seeds you used.

I'm not sure what you mean. What I meant to say was: when it is not initialized from a random source but from the user's code, while the seed is 256 bits, the user is only allowed to specify 128 of those bits. The upper 64 and lower 64 are hard-coded constants.
In such a case it is not easy to accidentally create a small multiple. Whereas if the app specified all 256 bits and provided seeds '1', '2', '3', etc. for its generators, there would be a big problem.
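The seeding policy above can be sketched roughly as follows; the struct layout and the hard-coded constants are illustrative placeholders, not the actual Mwc256XXA64 values.

```rust
// Sketch of the seeding policy described above: the user supplies only
// 128 of the 256 state bits; the remaining two words are fixed, so two
// user seeds cannot make the full states trivial multiples of each
// other. Field names and constants are hypothetical.
struct State256 {
    x1: u64,
    x2: u64,
    x3: u64, // not user-settable
    c: u64,  // not user-settable
}

fn from_user_seed(s1: u64, s2: u64) -> State256 {
    State256 {
        x1: s1,
        x2: s2,
        x3: 0xcafe_f00d_d15e_a5e5, // hard-coded (illustrative value)
        c: 0x1405_7b7e_f767_814f,  // hard-coded (illustrative value)
    }
}

fn main() {
    // Even adjacent user seeds share half the state, blocking trivial
    // multiplicative relationships between full 256-bit states.
    let (a, b) = (from_user_seed(0, 1), from_user_seed(0, 2));
    assert_eq!(a.x3, b.x3);
    assert_eq!(a.c, b.c);
    println!("half the state is fixed");
}
```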

@vigna

vigna commented Jan 14, 2021

Perfect—that's what I meant by "Just, don't let the user initialize the thing directly.".

@vigna

vigna commented Jan 14, 2021

BTW, I have no idea why you claim "For some reason this is not disclosed on the homepage" talking about linearity of the low bits of xoroshiro128+. My home page has a complete description of the linearity problems, and also of the Hamming-weight bias:

"If, however, one has to generate only 64-bit floating-point numbers (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%) faster generator with analogous statistical properties. For general usage, one has to consider that its lowest bits have low linear complexity and will fail linearity tests; however, low linear complexity of the lowest bits can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits (we computed a precise estimate of the linear complexity of the lowest bits).

If you are tight on space, xoroshiro128++/xoroshiro128** (XOR/rotate/shift/rotate) and xoroshiro128+ have the same speed and use half of the space; the same comments apply. They are suitable only for low-scale parallel applications; moreover, xoroshiro128+ exhibits a mild dependency in Hamming weights that generates a failure after 5 TB of output in our test. We believe this slight bias cannot affect any application."

There are many other debatable claims in your text, but this really jumps out. There is an entire page dedicated to the issue (and linked in the text above) as so much misinformation has been spread about it: http://prng.di.unimi.it/lowcomp.php

@tkaitchuck
Author

BTW, I have no idea why you claim "For some reason this is not disclosed on the homepage" talking about linearity of the low bits of xoroshiro128+. My home page has a complete description of the linearity problems, and also of the Hamming-weight bias:

Except it's not just the low bits, nor is it just one test: https://www.pcg-random.org/posts/xoroshiro-fails-truncated.html

I would not go so far as to say "that is likely to impact an application". I agree these are subtle problems. But a big part of PRNGs is thinking about headroom. My expectation would be that on the "PRNG shootout" page under the "failures" column it would link to http://prng.di.unimi.it/lowcomp.php

@tkaitchuck
Author

The above discussion about the keys gave me a really great idea:

If the underlying MCG were an LCG, there wouldn't be such sequences and the period would be larger. Because the modulus is already prime, any number should work.
When the MWC adds the carry bit to C, it uses the ADC assembly instruction on x86. This means a constant can be added at each step at zero cost.

I'll do a second post with all the details.
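A toy sketch of the idea: a 64-to-128-bit MWC step with an additive increment folded into the wide multiply-accumulate. The constants here are illustrative, not those of Mwc256XXA64; the point is only that the increment rides along in the same add/adc pair that already handles the carry on x86-64.

```rust
// Hypothetical toy MWC step with an additive increment folded into the
// carry computation. Constants are illustrative stand-ins.
const MULT: u64 = 0xff3a_275c_007b_8ee6;
const INCREMENT: u64 = 0x9e37_79b9_7f4a_7c15;

fn mwc_step(x: &mut u64, c: &mut u64) -> u64 {
    // t = MULT * x + c + INCREMENT, computed in 128 bits.
    let t = (MULT as u128) * (*x as u128) + (*c as u128) + (INCREMENT as u128);
    *x = t as u64;         // low 64 bits: new state word
    *c = (t >> 64) as u64; // high 64 bits: new carry
    *x
}

fn main() {
    let (mut x, mut c) = (1u64, 1u64);
    for _ in 0..4 {
        println!("{:016x}", mwc_step(&mut x, &mut c));
    }
}
```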

@vigna

vigna commented Jan 15, 2021

Except it's not just the low bits

That is a false statement. There is no linearity test including only the upper bits (say, the bits used for floating-point generation) that fails. If you have any evidence of that, please show it. Evidence is a linearity test from BigCrush or PractRand failing on such bits. Since there is a mathematical proof stating that this is not gonna happen, I am curious to see such evidence.

nor is it just one test

The page does not claim that—this is another false statement. "moreover, xoroshiro128+ exhibits a mild dependency in Hamming weights that generates a failure after 5 TB of output in our test". This dependency can be detected by different tests (such as DC6 test in PractRand), but our test is presently the most sensitive and picks up the bias much before (e.g., PractRand needs 32TB), which is why we refer to it.

In any case, your claim of non-disclosure remains false (like most claims in your document, in fact).

@vigna

vigna commented Jan 15, 2021

If the underlying MCG were an LCG, there wouldn't be such sequences and the period would be larger. Because the modulus is already prime, any number should work.

There are no LCGs with prime modulus. See Knuth, TAoCP II.

@tkaitchuck
Author

Except it's not just the low bits

That is a false statement. There is no linearity test including only the upper bits (say, the bits used for floating-point generation) that fails. If you have any evidence of that, please show it. Evidence is a linearity test from BigCrush or PractRand failing on such bits.

That is literally what the blog I linked to above tested. They threw out the lower 32 bits and ran PractRand and it fails at 2^47 bits. I'm not doing original research here.

@tkaitchuck
Author

tkaitchuck commented Jan 15, 2021

Re LCG: Experimentally confirmed: adding an increment to the MWC has no effect on the period.

I also tested Mwc32XXA8 on PractRand with various variations to test the failure point.
As described in the blog post it fails after 2^29 outputs.
This is about as good as can be expected given that it has a period of about 2^30.

If I add an increment, it makes no discernible difference; it also fails after 2^29 outputs for all tested increment values.
If I take two Mwc32XXA8 instances and initialize them with the keys "0,1" and "0,2" and interleave their outputs, PractRand fails after 2^31 outputs.

If I change it so the application passes the full seed instead of just 2 of the 4 components and do the same test, it fails after 2^20 (1 MB) outputs (though I suspect it would actually have failed even sooner; it just didn't run any statistical tests until the first 1 MB of output). This is expected, as the even and odd outputs should be just a shift difference from one another.
If I do the same test but add the increment in the update function it fails after 2^31 outputs.

So adding the increment seems to be just as effective as only allowing the user to set two of the four parts of the key. Both have zero cost on x86, so I don't see much of an advantage one way or the other. Doing both doesn't seem to provide any noticeable additional benefit.

I ran these same tests on the reduced period 40 bit state version with similar results.

All else being equal I think it may be better to leave out the increment for the benefit of non-x86 architectures and simplicity of explanation, and just continue the policy of hardcoding X3 and C.

@vigna

vigna commented Jan 15, 2021

That is literally what the blog I linked to above tested. They threw out the lower 32 bits and ran PractRand and it fails at 2^47 bits. I'm not doing original research here.

If your notion of scientific evidence is a blog, the next step is chemtrails 😂..

I've checked the blog you linked—there is no linear test failing on the upper 32 bits of xoroshiro128+, as predicted by theory. After half a petabyte of testing there's a Hamming-weight dependency test failing (BCFN_FF), which is expected—as I already remarked, and as explained on the home page, Hamming-weight dependencies will show at 5 TB, much earlier, with a more sensitive test. If you don't like that mild bias, pick another generator.

It appears, however, that you are very confused about what a linearity test is. Linearity tests are of two kinds: tests that measure the jumps in linear complexity of an output bit as measured by the Berlekamp–Massey algorithm, like the LinearComp test from TestU01, and tests that measure the distribution of the ranks of random matrices on F₂ on a subset of bits (possibly all), like the MatrixRank test from TestU01. The only test of this kind in PractRand is BRank (the second kind), and there is no failure of this kind in the upper 32 bits of xoroshiro128+, even after half a petabyte of testing.

Your claims, thus, remain false.

Considering your confusion about very basic testing techniques, the slew of mistakes in your document, and your comments above about fiddling experimentally with the sophisticated (and fragile) mechanisms of MWCs or LCGs without understanding the mathematics behind them, I'm sorry but I would not trust particularly the generators you are proposing now.

@tkaitchuck
Author

If your notion of scientific evidence is a blog, the next step is chemtrails ..
...
Your claims, thus, remain false.

The only claims I made were:

  • "Xoroshiro128+ does not pass the PractRand test suite". - It doesn't.
  • Truncating the output does not allow it to pass. - It still fails after 2^47 bytes per the post.

If you don't like that these were posted on a blog, the results are very easy to reproduce.

I did not make any claims about which type of tests fail, why they fail, or say anything about Matrices or Hamming-weight. Nor did I say anything about what these failures should mean to an application. You seem to be arguing about something I did not say.

the slew of mistakes in your document, and your comments above about fiddling experimentally with the sophisticated (and fragile) mechanisms of MWCs or LCGs without understanding the mathematics behind them, I'm sorry but I would not trust particularly the generators you are proposing now.

If there is a factual inaccuracy, please post the relevant sentence and I will correct it.

You are correct that I do not understand the theory so deeply that I can operate without empirical testing. Hence the generator design, testing methodology, and post explaining it, rely heavily on empirical tests. My knowledge is not so deep as to be able to look at a failing PractRand or BigCrush result and say that it is not a problem. I don't pretend to know how applications use random numbers, so I treat any test failure as a failure.

If there is any bug in the code, please point it out and I will fix it. If there is a test or experiment that should be run, name it and I will run it.

@vks
Contributor

vks commented Apr 8, 2021

I read your blog post and I'm a bit confused about your argument against Rand's algorithm of choice for SmallRng, xoshiro256++. You write that xoshiro and its variants were "cracked" or "inverted", which I assume means that their results can be predicted by observing some generated values. However, this is explicitly not a design goal of a non-CSPRNG (otherwise it would be a CSPRNG). It does not seem like your proposal tries to address that either?

Later you look at randograms of xoshiro32++ and other reduced-state generators. However, this is only qualitative, as you acknowledge. You also look at the number of BigCrush failures for the reduced variants. I'm not sure whether the sum of failures is a good metric; it strongly depends on the tests you chose. Some were designed with specific RNG algorithms in mind, so this metric is biased against them! More importantly, it is not clear how your results generalize to variants with larger state. (Xoshiro32++ is also not an official variant, and it is not clear which constants you used. There is precedent for using a xoroshiro32++ variant that might be worth looking into.)

If I understood correctly, the number of BigCrush failures for reduced variants is the only metric (besides runtime performance) by which you want to improve over xoshiro256++?

@tkaitchuck
Author

tkaitchuck commented Apr 21, 2021

I read your blog post and I'm a bit confused about your argument against Rand's algorithm of choice for SmallRng, xoshiro256++.

I am not proposing that. I am only asking that Mwc256XXA64 be evaluated for inclusion here as an option, because it offers a useful alternative to PCG in that it has a 256-bit state and faster performance. The comparison with xoshiro is simply because it is the default generator today; hence any new algorithm must be measured against it.

You write that xoshiro and its variants were "cracked" or "inverted", which I assume means that their results can be predicted by observing some generated values. However, this is explicitly not a design goal of a non-CSPRNG (otherwise it would be a CSPRNG). It does not seem like your proposal tries to address that either?

I was referring to what you mentioned here: rust-random/rand#905 (comment)
It is not unreasonable to call PCG "difficult to predict", given that it took 20,000 CPU hours to recover the state, and it did so using a known increment (which the version here lacks). I mention it because it is an open issue in the repo that people (including yourself) seemed concerned with, presumably because it may be improved upon. My point was that the current "advantage" that xoshiro seems to have here may or may not be long-lived.

I do not think that the full size version of either PCG or xoshiro256++ have any statistical problems. My analysis with the reduced size versions on PractRand and BigCrush seems to confirm this.

Later you look at randograms of xoshiro32++ and other reduced-state generators. However, this is only qualitative, as you acknowledge. You also look at the number of BigCrush failures for the reduced variants. I'm not sure whether the sum of failures is a good metric, it strongly depends on the tests you chose.

Yes. Agreed. The point where it fails PractRand is far more relevant, and I cover that too. The two methodologies result in closely correlated results.

Some were designed with specific RNG algorithms in mind, so this metric is biased against them!

Test suites are by their very nature regression suites. They have tests added for failures that generators have been found to exhibit. This is their purpose. It is "biased" in the sense that any new algorithm built afterwards will make sure not to repeat the mistakes of any previous one. Of course, it is not guaranteed that it won't have some entirely new problem. At any given time, the most that can be said of any algorithm is that it passes every test invented so far.

The problem with this methodology is that it doesn't allow for progress. Usually what happens is that algorithms, because they cannot be proven correct, are adopted first and then eventually a problem is found, or an algorithm is considered "good" because it became popular without anyone finding a flaw first. In the meantime, most other ideas are dead on arrival because they aren't popular enough to be considered good and can't be validated, so little progress is made.

By looking at where things fail, I am attempting to break out of this model and have a way to validate a new algorithm. The theory is that if there is some statistical flaw that no existing test can catch at full scale (like the one in middle-square), it may be found in a reduced-size version. Algorithms like SHISHUA trivially fail this analysis and hence should be looked upon with suspicion, whereas PCG and xoshiro++ both do very well. As does MwcXXA.

More importantly, it is not clear how your results generalize to variants with larger state. (Xoshiro32++ is also not an official variant, and it is not clear which constants you used. There is precedent for using a xoroshiro32++ variant that might be worth looking into.)

They generalize very well. I am working on a follow-up blog post on this. I'll update here when it is ready.

The constants were provided by David Blackman who is one of the authors on the paper.

If I understood correctly, the number of BigCrush failures for reduced variants is the only metric (besides runtime performance) by which you want to improve over xoshiro256++?

Mwc256XXA64 should be compared against PCG64, as it is a member of that family. It is, after all, a permuted MCG internally.

Compared to PCG64 it offers twice the state size, both rightward and leftward propagation in the state update function, better-looking randograms, fewer BigCrush failures at different size points, longer-lasting PractRand passes at various state sizes, the ability to add some vectorization, more than a 2x performance increase despite the larger state, and a 128-bit version with vastly better performance on 32-bit hardware.

Many of these also apply in comparison with xoshiro256++, though it also does well on 32-bit hardware.

@dhardy
Member

dhardy commented Apr 21, 2021

Perhaps it's time (or past!) to draw this "issue" to some kind of conclusion:

Evaluate Mwc256XXA64

This thread has generated some commentary but no serious evaluation, and is unlikely to—doing so is considerably beyond the scope of the project (which, let's be clear, is a community effort only, lacking both the time and the expertise required for useful evaluation).

@rust-random If there is agreement that this is worth adding as an option I can create a PR to add a rand_mwc crate.

I am undecided on this — we don't have any particular criteria for inclusion in this repository, but I think it is worth having at least one: that PRNGs be published elsewhere and this repository not be their primary "home". This PRNG is already published (in a way) at https://github.com/tkaitchuck/Mwc256XXA64, so adding it here isn't really useful in my opinion. (It might however be useful to publish on crates.io, which does not currently appear to be the case.)

Some were designed with specific RNG algorithms in mind, so this metric is biased against them!

I agree with @vks here — "the number of failing tests" is not a particularly useful metric, even if trying to beat tests is a worthwhile goal.

To my (inexpert) eye, evaluating "scaled down" versions of the generators may be the most useful form of "metric" in your analysis, and it does appear that you have designed a reasonably good generator, but convincing me (or "the maintainers of this project") should not be your goal (see note above regarding evaluation).

I think the only other thing worth discussing here is whether we add to the comparison in the book — I don't see any reason we shouldn't (though it should have some rewording: these aren't exactly "our RNGs"). @vks do you agree? Filling out the table column should be easy enough. A prerequisite is that this is published on crates.io (as mentioned above, this might just as well happen from the existing repository).

@vks
Contributor

vks commented Apr 22, 2021

@dhardy
I also don't have a strong opinion on the repository where Mwc should be included. Putting it here would shift the maintenance burden to rust-random, but I don't think it would require much maintenance. Either way, I don't think it makes a big difference, and I agree that it is probably enough to publish the crate on crates.io.

I would be happy to add Mwc to the benchmarks here, but this does not require it to live in the same repository.

I think the only other thing worth discussing here is whether we add to the comparison in the book — I don't see any reason we shouldn't (though it should have some rewording: these aren't exactly "our RNGs").

I agree with putting more RNGs in the comparison table. (We might even want to add the Mersenne Twister for reference.) Maybe "Comparison of RNG algorithms" is a better heading?

@tkaitchuck

I am not proposing that.

Fair enough. I misunderstood your initial issue description as proposing a different algorithm for SmallRng.

I was referring to what you mentioned here: rust-random/rand#905 (comment)
It is not unreasonable to call PCG "difficult to predict", given that it took 20,000 CPU hours to recover the state, and it did so using a known increment (which the version here lacks). I mention it because it is an open issue in the repo that people (including yourself) seemed concerned with, presumably because it may be improved upon. My point was that the current "advantage" that xoshiro seems to have here may or may not be long-lived.

This is a bit off-topic, but I think the 20 000 CPU hours in the paper refer to PCG with unknown increment? Anyway, rust-random/rand#905 is not about being worried that PCG is predictable (this is expected for a non-CSPRNG), but about the claim that it is difficult to predict. As far as I can see, we are not claiming that any of our non-CSPRNG are "difficult to predict", including xoshiro. (We did in the past, which is what rust-random/rand#905 was about.) Xoshiro is not expected to be unpredictable.

Test suites are by their very nature regression suites. They have tests added to them that generators have been found to fail.

I think we are in agreement about the value of tests. I'm just saying that the metric of looking at the number of failed tests without more context is problematic. Some tests are extremely specific (I'm thinking of Marsaglia's clever binary rank test), so any algorithm that is fundamentally different would be expected to pass that test. Should such an algorithm be rewarded for this, while possibly failing a more important test?

Compared to PCG64 it offers twice the state size, both rightward and leftward propagation in the state update function, better-looking randograms, fewer BigCrush failures at different size points, longer-lasting PractRand passes at various state sizes, the ability to add some vectorization, more than a 2x performance increase despite the larger state, and a 128-bit version with vastly better performance on 32-bit hardware.

Those are nice improvements! As @dhardy said, we cannot really evaluate the quality of the randomness.

Note that doubling the state size is expected to double performance for small RNGs due to SIMD. You can just evaluate two RNGs at the same time.
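That point can be sketched concretely: two independent small generators advanced in lockstep have independent dependency chains, so the CPU can overlap them (and the compiler may auto-vectorize the pair). The generator below is a plain xorshift64 stand-in, not any generator discussed in this thread.

```rust
// Sketch: two independent generators advanced in lockstep. The two
// update chains do not depend on each other, which is roughly why
// doubling state can double throughput on modern hardware.
fn xorshift64(mut x: u64) -> u64 {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    x
}

fn next_pair(state: &mut [u64; 2]) -> (u64, u64) {
    state[0] = xorshift64(state[0]);
    state[1] = xorshift64(state[1]);
    (state[0], state[1])
}

fn main() {
    let mut s = [0x9e37_79b9_7f4a_7c15u64, 0x2545_f491_4f6c_dd1du64];
    let (a, b) = next_pair(&mut s);
    println!("{a:x} {b:x}");
}
```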

Better performance on 32 bit hardware is certainly good to have!

@tkaitchuck
Author

I have published the crate here: https://crates.io/crates/pcg-mwc
Are there any other traits it should implement?

@dhardy
Member

dhardy commented Jun 7, 2021

@tkaitchuck the API looks fine. I would suggest in the code you put both 32- and 64-bit versions into (private) sub-modules, then use pub use gen32::Mwc256XXA32 to export. Also you could implement Serde if you want; some people have requested it with our RNGs (though not many).

I think I may as well close this now.

@dhardy dhardy closed this as completed Jun 7, 2021
@tkaitchuck
Author

Done
