Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for deterministic random number generation #131606

Open
1 of 4 tasks
joboet opened this issue Oct 12, 2024 · 13 comments
Open
1 of 4 tasks

Tracking Issue for deterministic random number generation #131606

joboet opened this issue Oct 12, 2024 · 13 comments
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@joboet
Copy link
Member

joboet commented Oct 12, 2024

Feature gate: #![feature(deterministic_random_chacha8)]

This is a tracking issue for DeterministicRandomSource, a deterministic random number generation that will always generate the same values for a given seed.

Public API

// core::random;

pub struct DeterministicRandomSource { .. }

impl DeterministicRandomSource {
    pub const fn from_seed(seed: [u8; 32]) -> DeterministicRandomSource;
    pub const fn seed(&self) -> &[u8; 32];
}

impl RandomSource for DeterministicRandomSource { .. }
impl Random for DeterministicRandomSource { .. }

Steps / History

Unresolved Questions

  • Is ChaCha8 the right random number generator to use? Given that this RNG should not be used for cryptographic purposes, ChaCha8 might be a bit overkill.
  • Does the seed function make sense?

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@joboet joboet added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Oct 12, 2024
joboet added a commit to joboet/rust that referenced this issue Oct 12, 2024
ACP: rust-lang/libs-team#394
Tracking issue: rust-lang#131606

The version implemented here uses ChaCha8 as RNG. Whether this is the right choice is still open for debate, so I've included the RNG name in the feature gate to make sure people need to recheck their code if we change the RNG.

Also, I've made one minor change to the API proposed in the ACP: in accordance with [C-GETTER](https://rust-lang.github.io/api-guidelines/naming.html#getter-names-follow-rust-convention-c-getter), `get_seed` is now named `seed`.
@Noratrieb
Copy link
Member

Given that this RNG should not be used for cryptographic purposes, ChaCha8 might be a bit overkill.

https://go.dev/blog/randv2 might be relevant here

joboet added a commit to joboet/rust that referenced this issue Oct 12, 2024
ACP: rust-lang/libs-team#394
Tracking issue: rust-lang#131606

The version implemented here uses ChaCha8 as RNG. Whether this is the right choice is still open for debate, so I've included the RNG name in the feature gate to make sure people need to recheck their code if we change the RNG.

Also, I've made one minor change to the API proposed in the ACP: in accordance with [C-GETTER](https://rust-lang.github.io/api-guidelines/naming.html#getter-names-follow-rust-convention-c-getter), `get_seed` is now named `seed`.
@juntyr
Copy link
Contributor

juntyr commented Oct 12, 2024

If a counter-based PRNG is chosen instead, exposing the current state (in addition or instead of?) the original seed would be valuable

@juntyr
Copy link
Contributor

juntyr commented Oct 12, 2024

Could the source also support seeding from a u64 (e.g. adopting how rand expands the seed), so that there is one well-supported approaches for this very common use case?

@hanna-kruppe
Copy link
Contributor

Is ChaCha8 the right random number generator to use? Given that this RNG should not be used for cryptographic purposes, ChaCha8 might be a bit overkill.

Since it's always supposed to give the same output forever after stabilization, it seems prudent to choose an algorithm that can be expected to last for a decade or more without regrets. It would be awkward to deprecate this part of std because it later turns out to have some flaws that can't be fixed without breaking reproducibility (see the Go math/rand/v2 story). A well-established cipher that's considered secure today almost certainly won't turn out have major flaws as a source of statistical randomness: any feasible way to distinguish the output from random is a big deal for cryptanalysis, and even an "academic break" (say, key recovery in 2^100 time) doesn't necessarily mean anything about the suitability for Monte Carlo simulations. The same can't be said about the myriad of non-cryptographic designs, where it's often only a matter of time and eyeballs until serious flaws are discovered.

For example, AES (Rijndael) is from 1998 and still going strong while the non-cryptographic MT19937 from 1997 was very popular for many years but is now considered flawed and obsolete. ChaCha is from 2008. Several years later, people were still publishing new non-crypto RNGs that fail TestU01, a suite of statistical tests dating to 2007--2009.

@dhardy
Copy link
Contributor

dhardy commented Nov 28, 2024

Forgive me, but why re-invent this functionality in the std lib?

Provision of an OS-getrandom API makes sense since much of the code is in std already. Provision of an alternative to rand::rng (aka ThreadRng) may also make sense: it's a single API which may make use of the OS's random source to varying (platform-dependant) extent.

Yes, I get it: rand is complicated, so it's tempting to start fresh. I fear you'd eventually end up in a similar place however, after multiple iterations of additional frequently-requested functionality.

PRNGs

There is no particular best PRNG. ChaCha8 should be fine for this use-case, but it's fundamentally a block-based RNG whereas a word-based PRNG like Xoshiro or PCG will be substantially faster for many use-cases. Since this is explicitly about a user-seeded PRNG with user-managed-state there is no good reason not to let the user choose the algorithm too (from some set of options).

The name DeterministicRandomSource does not imply portable results. rand::rngs::SmallRng is deterministic but not portable for example (different 32-bit and 64-bit impls). If you only want one generator with portable results, name it PortableRandomSource or maybe ReproducibleRandomSource to avoid the confusion.

Random trait

The proposed trait Random appears to be the equivalent of StandardUniform (possibly restricted to primitive integer types). This is only one form of random-value-generation, and while it's mostly easy to support (there may be questions around precision for FP types), it's not that useful on its own.

Uniform ranged sampling is the obvious other application. There are quite a few algorithms for this. If you want reproducible outputs, pick one of these and stick with it (or keep the impl unstable until the algorithm is fixed). Now implement for f32, Duration, char, simd::i16x8 etc.

On the topic of reproducibility, we recently removed support for sampling isize and usize in rand due to multiple reports of it being a portability hazard. Exception: we do support usize sampling for fixed ranges but use 32-bit sampling wherever possible.

Additional scope?

There are several obvious possibilities for scope creep:

  • shuffling slices
  • choosing one or multiple elements of slices
  • choosing one or multiple elements of iterators
  • sampling random alphanumeric values
  • sampling a bool with a defined probability or ratio
  • weighted sampling of sequences

This is most of what we elected to keep in rand (vs rand_distr). If some alternative to ThreadRng (now rand::rng()) were to be implemented within std, then rand could be shrunk down to almost just this (we'd no longer need the getrandom dependency, we could merge a simpler RNG trait from rand_core).

Proposal

What I'd propose therefore is:

  • Stabilise trait RandomSource and DefaultRandomSource from Tracking Issue for secure random data generation in std #130703. These serve cryptography applications and other needs for a random-byte-generator Edit: this trait may not be the best design; see discussion below
  • We drop ThreadRng along with rand::rngs::OsRng, ReseedingRng and likely StdRng, thus removing everything "cryptographic" from rand
  • We merge a simpler RngCore trait into rand, possibly just with next_u32 and next_u64, and remove the rand_core dependency
  • We implement rand::rngs::RngCore for all RandomSource and make rand::rng a wrapper over DefaultRandomSource
  • We also provide deterministic and portable PRNGs in rand::rngs
  • Publish a new rand version with significantly less code and dependencies

I believe this would reduce many people's issues with rand, namely that it's quite large and complex. (Not completely though; there would still be significant code involved in impls for Uniform etc.)

By itself, this wouldn't remove all usages of unsafe in rand; that is another issue which I believe is addressable (though possibly not without performance impact), if people care sufficiently about usage of unsafe outside std.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Nov 28, 2024

Under your proposal, why even have the RandomSource trait at all? It wouldn't even work very well as a vocabulary type if rand::RngCore would still exist and there's a impl<T: RandomSource> rand::RngCore for T. That blanket impl rules out directly implementing both traits for any given generator, and many generators will want to implement next_{u32,u64} directly instead of the detour through RandomSource::fill_bytes.

And if std drops the trait, then DefaultRandomSource could be simplified down to std::random::fill_bytes(). That's also the primary interface of the getrandom crate, modulo fallibility.


I can see how this would remove some of the most tricky parts of rand's internals/dependency tree. But it would not address the stated motivation of rust-lang/libs-team#394 - practically every use of randomness still requires a third-party crate, or manually re-implementing non-trivial parts of such a crate (likely in worse ways). Even getting a uniformly random usize from a byte stream has subtle portability/reproducibility pitfalls, as you point out. There is a risk of slippery-sloping the standard library into re-inventing all of rand in the limit. But the other extreme also has a risk: if you give people who want to limit themselves to the standard library just a raw byte stream, then many will implement basic operations like "pick an element from a slice" roughly like this:

let mut buf = [0; size_of::<usize>()];
DefaultRandomSouce::default().fill_bytes(&mut buf);
// ^ or a hand-rolled seedable PRNG of choice
let r = usize::from_ne_bytes(buf);
let elem = slice[r % slice.len()];

... and that's a worse outcome than std providing std::random::range(0..slice.len()) or something at roughly that level of abstraction.

@dhardy
Copy link
Contributor

dhardy commented Nov 29, 2024

Under your proposal, why even have the RandomSource trait at all? It wouldn't even work very well as a vocabulary type if rand::RngCore would still exist and there's a impl<T: RandomSource> rand::RngCore for T. That blanket impl rules out directly implementing both traits for any given generator, and many generators will want to implement next_{u32,u64} directly instead of the detour through RandomSource::fill_bytes.

This is a good point, but I think it's workable (though there is possibly reason to keep RngCore::fill_bytes too).

For block-based PRNGs like ChaCha as well as for OsRng / DefaultRandomSource (whether or not this uses a local cache), I see no reason to implement next_{u32,u64} directly.

For word-based PRNGs, it's questionable whether these should implement the RandomSource trait. Firstly, is there any implication regarding security of the output as there is with rand::CryptoRng? Second, the only reasons to use a word-based PRNG are simplicity of implementation, small state size, and performance of word-sized outputs. I can think of some cases where it might be useful having a fill_bytes type interface on such a PRNG (we use this in rand's SIMD implementations, and it might be used to seed another PRNG, though one needs to be careful to avoid correlations there), but for the most part I would expect a block-based PRNG to be a better choice when significant usage of byte-output is required. Thus, I think it is reasonable for word-based PRNGs not to implement RandomSource.

But it would not address the stated motivation of rust-lang/libs-team#394 - practically every use of randomness still requires a third-party crate, or manually re-implementing non-trivial parts of such a crate (likely in worse ways).

This goes for some other areas too, e.g. std::time can only tell you relative time, not local time, because that's complicated (though I suspect most uses of chrono will only use a small subset of its functionality like Local and NaiveDateTime). Similarly, I think an interface for seeding a randomised hash state or cryptographic key fits well within std, but that including machinery for generating random values is a slippery slope.

But the other extreme also has a risk: if you give people who want to limit themselves to the standard library just a raw byte stream, then many will implement basic operations like "pick an element from a slice" roughly like this:

I mean, we can tell them they shouldn't do that. But if you do want such functionality, then:

// surely it's better to support this:
let elt = slice.choose(&mut rng);
// instead of expecting people to write this:
let elt = slice[rng.random_range(..slice.len())];

? (This doesn't remove the need for fn random_range but is an obvious extension.)

And even if the above functionality is in std, I still don't see much need for trait Random, except possibly as a helper to implement the above. Random is easy to implement but most of the time not what people actually want to use (except possibly fuzzing, but arguably that might also want a different API).

@newpavlov
Copy link
Contributor

newpavlov commented Nov 29, 2024

I believe only the following things should be in std/core:

  • Secure "system" source.
  • Insecure "system" source used for stuff like seeding HashMaps and non-security-critical PRNGs.
  • ThreadRng, i.e. cryptographically strong and reasonably fast RNG periodically reseeded from the secure "system" source. It's implementation could benefit from the unstable #[thread_local] feature.
  • RNG traits to generalize over the previous items.

The "system" sources should allow registration of an alternative implementation similarly to #[global_allocator]. The latter 3 items can be considered "optional", but I think it's reasonable to have them as part of std. It also may be worth to add a weaker version of ThreadRng (i.e. SmallRng periodically reseeded from the "insecure" source), but its benefits are less clear.

Everything else should be just part of a third-party crate like rand.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Nov 30, 2024

@dhardy

For block-based PRNGs like ChaCha as well as for OsRng / DefaultRandomSource (whether or not this uses a local cache), I see no reason to implement next_{u32,u64} directly.

OS interfaces are generally byte-based, but anything ChaCha-based (and related constructions like Blake3 used as XOF) does all the computations over words. If your output buffer is [u32; N] anyway you might as well read from it in word sized chunks. rand_chacha does that via BlockRng! Even when the internal buffer is byte-based, it's easier to implement a good fast path for word-sized reads from that buffer than to make sure that the generic fill_bytes loop is optimized into something reasonable for small fixed-size reads. (And you have the option of simplifying it by skipping some bytes when the read would straddle a block boundary, which may be significantly faster when that case occurs frequently.)

Thus, I think it is reasonable for word-based PRNGs not to implement RandomSource.

This would make RandomSource much less useful than rand::Rng as a bound or trait object. Security-sensitive code that would be okay with the restriction to blocks of bytes will frequently just use a single, trusted source instead of abstracting over it. So what use cases are left? Who would implement such a trait, who would use it, and why do these two groups of people need the trait to be in the standard library instead of a crates.io library?

I mean, we can tell them they shouldn't do that. But if you do want such functionality, then [surely it's better to support slice.choose() than expecting people to sample 0..slice.len()]

Just telling people to not do it is unhelpful and unlikely to be heeded. I've done the modulo thing myself more than once, in full awareness of its issues, just because it was more expedient in that case and I judged it to be good enough. Including "random integer in range" is a sweet spot on the slippery slope to re-inventing all of rand because it's low cost and high value to provide:

  • Doing better than rand_u32() % n is just tricky and subtle enough that any given user is very unlikely to do it when needed. Even more so if they don't already know a better algorithm (the vast majority won't), or aren't even aware that there's a problem with modulo.
  • However, it's a small and simple interface backed by a very limited amount of (tricky, but completely portable and low-maintenance) code, so it's a much smaller addition to the standard library than basically anything else in this space.
  • Once this operation is available, a lot of other tasks becomes much easier to implement correctly. While slice.choose() is nicer, implementing it in terms of "random integer in range" is just a small amount of boilerplate and no harder to get wrong than any other slice indexing. Having random-in-range available is even useful for decidedly less trivial tasks like shuffling a slice (I'd rather re-implement Fisher-Yates than any of the random-in-range algorithms).

I think the situation with std::time is quite different: anything touching on time zones immediately opens huge can of worms, both in implementation and in API design. There doesn't seem to be a minimal "easy" subset of "local time" handling that would be useful to provide because it's an inherently difficult topic and glossing over the hard parts makes it easier to write incorrect programs.

@dhardy
Copy link
Contributor

dhardy commented Dec 2, 2024

If your output buffer is [u32; N] anyway you might as well read from it in word sized chunks. rand_chacha does that via BlockRng!

Yes, you are right. Replacing the RngCore for BlockRng impls as follows:

impl<R: BlockRngCore<Item = u32>> RngCore for BlockRng<R> {
    #[inline]
    fn next_u32(&mut self) -> u32 {
        super::impls::next_u32_via_fill(self)
    }

    #[inline]
    fn next_u64(&mut self) -> u64 {
        super::impls::next_u64_via_fill(self)
    }
    
    // ...
}

causes a large regression (approx 2-4x cost for u32 / u64 output). So then it does make sense to consider using a trait like RngCore instead of RandomSource (also since WASI has get-random-u64).


Who would implement such a trait, who would use it,

This is the wrong place to discuss RandomSource, but briefly: this allows writing generic RNG-agnostic code which can then be tested deterministically and benchmarked without concern for OS overhead. Compare with why rand::CryptoRng exists.

As for why this should be in std and not an external crate: there is call for a "getrandom" interface within std and this is part of what was proposed. But yes, it is worth asking whether we should instead do one of the following:

  • Add next_u32, next_u64 or maybe something like fn fill_words(&mut self, buf: &mut [u32]) to this trait
  • Or drop the RandomSource trait and only implement a function like fn fill_bytes(but: &mut [u8]), suggesting people use rand::RngCore whenever they want a trait (with rand::OsRng being a shim over fn core::random::fill_bytes).

Just telling people to not do it is unhelpful and unlikely to be heeded. I've done the modulo thing myself more than once, in full awareness of its issues, just because it was more expedient in that case and I judged it to be good enough.

My apology for the tongue-in-cheek comment ("tell them they shouldn't do that"). So there is a need for "random integer in a range".

The first question regarding random-value-generation-in-std is when do you stop the scope creep? Slice shuffling seems useful, as does choose for slices and maybe iterators. choose_multiple and choose_weighted also made the cut into rand proper (not rand_distr). Also ranged sampling for f32, f64, char, Duration, SIMD types. Also different implementations of ranged sampling for "sample once" and "make an object for repeated sampling".

The second question is whether bringing this into std better than recommending usage of rand? This is a question of how large the standard library should be and how much one should rely on third-party crates. It's also a question of trust (but less so for choose than for getrandom). It's also worth noting that rand could be a significantly small dependency if RngCore and a default CSPRNG were brought into std.

The third question is whether there might be some other advantages to merging a subset of rand's functionality into std. For example, IndexedRandom is a trait enabling additional functionality on implementations of core::ops::Index<usize>; this might not need a separate trait in std. Actually, it might: all other methods require fn len, which is not provided by Index<usize>. SliceRandom at least could have its functionality merged into [T].


I think the situation with std::time is quite different

Yes, time zones are complicated. An API for "give me a UTC time stamp for now" is not, though its implementation may be.

The fact that there are plenty of details to discuss regarding random generators and algorithms despite many people "just wanting a random integer in a range" shows the cases are not entirely dissimilar.

No, I am not fundamentally opposed to incorporating random-value functionality into std, though I dislike some of the proposals so far. Especially the way it ignores a lot of what has been built and field-tested already.

@joshtriplett
Copy link
Member

joshtriplett commented Dec 15, 2024

If it'd be substantially more efficient (or enable types of generators which are substantially more efficient) to add methods like next_u32, next_u64, next_u128, and similar to RandomSource (which I assume can have reasonable default implementations), then by all means we should do that. That trait was an attempt to implement a minimal interface, something sufficient to build initial consensus on for experimentation.

There was no desire or attempt to ignore rand here. Rather, I think that attempting to propose a larger set of functionality all at once would not have been successful. The original proposals here were, effectively, just enough to say "we should have a system RNG in the standard library" and "we should have a seedable portable deterministic RNG in the standard library for testing and replay". Those were the most controversial aspects, and needed to be settled first before we could consider what set of utility functionality to supply atop them.

Along similar lines, while I think "give me a random value of this type" is a useful trait, it's not by any means the only one, and we've already discussed having some mechanism for sampling a distribution so that we can support random floats and similar.

Slice shuffling seems useful, as does choose for slices and maybe iterators.

I expect that these at a minimum would make the cut, along with random-in-range functionality (e.g. for things like die rolls).

@dhardy
Copy link
Contributor

dhardy commented Dec 15, 2024

If it'd be substantially more efficient (or enable types of generators which are substantially more efficient) to add methods like next_u32, next_u64, next_u128,

Short answer (from memory): this is important for small (non-block) RNGs for the word size of the output (usually u32 or u64). For rand we considered that (almost?) all real targets are 32-bit or 64-bit native, and though CPUs may have some 128-bit support there isn't usually much application for this (other than generating keys/blocks which can use fill on &mut [u8] instead). Also e.g. pcg32 operates on 64-bit words internally (and pcg64 128-bit words while Xoshiro256++ operates on [u64; 4] for u64 output). next_u32 + next_u64 gives us loss-less outputs in the most important two cases.

If you want benchmarks, we should be able to hack rand removing RngCore::next_u* easily enough, to compare vs the current design.

The other important point here is the generator (and intended application). If std::random::RandomSource is only used to produce random keys/seeds direct from the OS, then I don't see any need for next_u*. But if it's intended to replace rand::RngCore supporting deterministic seeded RNGs in games/sims, then I think there is good justification for these methods.

This is the point I'm getting a little lost on here: is the intended scope to cover only getrandom or a large part of what rand currently does? Because in the latter case, I think it would be better to start from what rand has now and whittle it down (e.g. if weighted sampling is not wanted).

Along similar lines, while I think "give me a random value of this type" is a useful trait

  • For key-generation, yes, but this is what Rng::fill is for.
  • In other cases, for integer types, I don't think this is a very useful trait, except...
  • ...as part of the implementation of Rng::random_range, yes.
  • For floats, an impl which returns values in the range [0, 1) (0.0 .. 1.0 in Rust parlance) is common, though usually only to then multiply/offset into a different target range.
  • For other applications, though, I don't think it's that useful, which is why rand named this functionality rand::distr::StandardUniform implementing the same Distribution<T> trait used for other types of random distribution.

So this is another point of potential scope creep: do we want a Distribution trait in std which then gets implemented by e.g. rand_distr::Normal? Or should this trait stay in a third-party crate like rand?

Should a Uniform distribution object be included in std or only a method like Rng::random_range?

Without having answers to these questions it's hard to know what exactly should be included in std which is why above I proposed only having support for generating keys/seeds in std and recommending usage of rand for everything else. But we've already had talk of people using random() % range or similar quick hacks, so I don't think it's useful to only consider trait Random for now (and incrementally extend the scope later).

@newpavlov
Copy link
Contributor

If std::random::RandomSource is only used to produce random keys/seeds direct from the OS, then I don't see any need for next_u*.

Some targets (Hermit, RDRAND, RNDR, WASI p2) directly generate random u32/u64 values. Relying on the byte API may be less efficient in these cases. Yes, it's not super important, but it's a relatively cheap addition which also can be quite convenient in practice. Plus, I think it's reasonable to assume that system RNG, thread RNG, and user-defined PRNGs will use the same RNG trait (there is an open question about error handling, but let's not focus on it right now), and, as you've wrote, it's important for small PRNGs to have next_u32/64 methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants