Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Rayon support to Rand #566

Closed
pitdicker opened this issue Apr 16, 2018 · 14 comments
Closed

Adding Rayon support to Rand #566

pitdicker opened this issue Apr 16, 2018 · 14 comments

Comments

@pitdicker
Copy link

I have made a PR to Rand to add support for Rayon. It is very much in the 'just exploring' phase. It would be great if someone who is more familiar with Rayon could weight in.

We have one basic problem: every PRNG needs to store its state somewhere and mutate it to produce the next random number. Just cloning the state would produce two identical streams of random numbers.

One solution that works right now is to use thread_rng() rust-random/rand#398. It does nothing special (from our perspective), it just sets up one RNG in thread-local storage for every thread in Rayons thread-pool. This may be a costly operation, but only has to be done once during the lifetime of the application. At least if the thread-pool also remains available all that time.

What I am exploring is somehow splitting/reseeding an RNG every time Rayon splits a job. If this a good idea at all, or in no way better than using RNGs in thread-local storage, is something to explore. And so are the details around splitting RNGs and keeping things correct, but that seems like a problem for us in Rand. It would have the advantage that users can choose their favourite, or most appropriate, RNG.

Can someone review the approach, and maybe answer some questions?

Is implementing IndexedParallelIterator, and setting the number of values to generate up front the right choice? I tried working with UnindexedProducer, but that expects to somehow finish by itself, which is something an RNG never does.

Is my abuse of DoubleEndedIterator okay, to implement the necessary traits? It will just produce the next random number like Iterator, it can't produce something like the previous random number. For normal PRNGs that is a computationally expensive operation (and not supported by Rand), and for cryptographic PRNGs impossible.

It is hard to predict what will happen to the statistical quality of PRNGs when they are split off many times, and only very short runs are used. So it seems sensible to set a minimum for the number of items that should be used from the PRNG, to hopefully keep similar statistical properties as one PRNG used continuously. Is setting a hard limit with min_len the best approach? Which Rayon supports increasing by users?

And from the final comment (currently): I wonder if Rayon is designed do deal with situations like ours, where every split of the job is not a basically free operation? So ideally it would split the job as little as possible, instead of splitting it into many manageable chunks. I don't know what drives Rayons decision here...

@dhardy
Copy link

dhardy commented Apr 17, 2018

It would have the advantage that users can choose their favourite, or most appropriate, RNG.

Considering that one of the reasons to choose an RNG is to have control over reproducibility, which cannot be guaranteed when using automatic migration of some samples to sub-RNGs, there is less need to have this flexibility.

It is hard to predict what will happen to the statistical quality of PRNGs when they are split off many times, and only very short runs are used.

If we switch streams and ensure each thread has a unique stream this isn't an issue. If we use freshly-seeded RNGs this is harder to reason about; with CSPRNGs there is no problem since these must ensure any output cannot be used to predict future output; with some simple PRNGs like Xorshift this is definitively a problem (without an extra permutation step); with other PRNGs I don't think statistical quality/independence is easy to reason about, though e.g. the PCG family are supposed to be difficult to predict.

There is also a potential circular dependency problem.

@vks
Copy link

vks commented Apr 17, 2018

Wouldn't it be much simpler if the thread pool managed the RNGs? This should also be more performant, because then the RNG initialization is only done once per worker instead of once per stolen work.

This could possibly implemented in ThreadPoolBuilder:

/// For each thread, initialize an RNG from the thread ID.
pub fn with_rng<C>(self, rng_constructor: C) -> ThreadPoolBuilder where
    C: Fn(usize) + Send + Sync + 'static -> impl rand_core::RngCore, 

ThreadPool (or some new type wrapping it) would need an additional type parameter for the RNG.

@dhardy
Copy link

dhardy commented Apr 17, 2018

Isn't that equivalent to using thread_rng()? Not quite I suppose; it could use SmallRng or another PRNG instead for some applications and could maybe get its entropy from somewhere other than EntropyRng.

@vks
Copy link

vks commented Apr 17, 2018

Yes, you could choose and initialize the RNG, which is not possible with thread_rng().

Interestingly enough, rayon already uses an internal per-thread RNG as an implementation detail for deciding which job to steal:

pub struct WorkerThread {
    /// the "worker" half of our local deque
    worker: Worker<JobRef>,

    index: usize,

    /// are these workers configured to steal breadth-first or not?
    breadth_first: bool,

    /// A weak random number generator.
    rng: UnsafeCell<rand::XorShiftRng>,

    registry: Arc<Registry>,
}
  • Does it make sense to expose this RNG to the user?
  • If yes, does it make sense to let the users swap out this RNG for another? This can break rayon if the RNG is pathological.
  • If no, does it make sense to have an optional user-defined RNG in addition to the internal one?

@cuviper
Copy link
Member

cuviper commented Apr 19, 2018

@pitdicker

Is implementing IndexedParallelIterator, and setting the number of values to generate up front the right choice? I tried working with UnindexedProducer, but that expects to somehow finish by itself, which is something an RNG never does.

We ran into a similar dilemma with repeat in #397, and decided to leave it unindexed and infinite, but also add a repeatn which is bounded. Repeat also has its own take and zip methods to turn it into RepeatN or Zip<RepeatN, _>. I think this design is OK, but I'm not sure if it really gets used.

But once you implement IndexedParallelIterator, you do need to be precise with the splits and how many items you produce. If nothing else, collect_into_vec() will panic if it gets the wrong number of items.

Is my abuse of DoubleEndedIterator okay, to implement the necessary traits? It will just produce the next random number like Iterator, it can't produce something like the previous random number. For normal PRNGs that is a computationally expensive operation (and not supported by Rand), and for cryptographic PRNGs impossible.

If "normal" PRNGs are supposed to be deterministic, then I would say this is problematic, but may be something you could just paper over as a documented limitation. The simplest test is that .rev() could be observed not to actually reverse anything.

It is hard to predict what will happen to the statistical quality of PRNGs when they are split off many times, and only very short runs are used. So it seems sensible to set a minimum for the number of items that should be used from the PRNG, to hopefully keep similar statistical properties as one PRNG used continuously.

I don't know too much about PRNGs, but it seems to me that one which actually degrades on splitting is not really suitable here.

Is setting a hard limit with min_len the best approach? Which Rayon supports increasing by users?

Suppose I tried to zip your random source with my own vector of 100 heavy work items. Your forced min_len of 100 means I won't get any parallelism at all!

Generally, I think splitting isn't something you should have to worry much about. Rayon will start with approximately the number of threads, and only split further when work is actively stolen. This can lead to small splits at the tail end of a run, but it's usually not noticeable except in microbenchmarks. Plus your benchmark can always call with_min_len itself, just as a user could.

I wonder if Rayon is designed do deal with situations like ours, where every split of the job is not a basically free operation? So ideally it would split the job as little as possible, instead of splitting it into many manageable chunks. I don't know what drives Rayons decision here...

Is your split so expensive that it would be noticeable compared to practical use of this data? Your benchmarks are just collecting the random data, so the workload is basically just a memory write.

@vks - I don't think we'd want to make rand a public dependency of rayon-core, at least not until rand is 1.0. If anything, I'm wondering if we even need to use rand internally, because our needs are so simple -- a direct implementation of a weak PRNG would probably suffice.

@cuviper
Copy link
Member

cuviper commented Apr 19, 2018

Taking a step back - If you just had some RNG wrapper which resets/reseeds when cloned, then I think one could just use that as the value for map_with. Then rand and rayon don't need to implement anything specifically for each other. It would be used something like:

(0..100_000).into_par_iter().map_with(RngCloner(r), |r, i| i + r.gen())...

Would this meet your needs? (Are your needs even well defined?)

@pitdicker
Copy link
Author

Thank you for the reply, very helpful!

We ran into a similar dilemma with repeat in #397, and decided to leave it unindexed and infinite, but also add a repeatn which is bounded. Repeat also has its own take and zip methods to turn it into RepeatN or Zip<RepeatN, _>. I think this design is OK, but I'm not sure if it really gets used.

Yes, that is exactly my problem! In one way it looks a bit ugly (zip or take have to be the first method in the chain), on the other hand very smart. I'll try the same thing.

But once you implement IndexedParallelIterator, you do need to be precise with the splits and how many items you produce. If nothing else, collect_into_vec() will panic if it gets the wrong number of items.

Returning the right number of items seems easy, I now just use a counter.

If "normal" PRNGs are supposed to be deterministic, then I would say this is problematic, but may be something you could just paper over as a documented limitation.

In a way it is funny how much people can value that their "random" numers are deterministic 😄. But I understand why (reproducibility of results), so definitely something to document. But it does not seem like Rayon depends on DoubleEndedIterator for correctness or something, only to offer the rev method? In that case the ability to collect results seems worth the trade-off to me.

Suppose I tried to zip your random source with my own vector of 100 heavy work items. Your forced min_len of 100 means I won't get any parallelism at all!

I have a little trouble imagining when somebody would do so, but it would not be good style then... Maybe it is best to just recommend users use with_min_len, but I'll think a bit more about the ensuring the quality.

Is your split so expensive that it would be noticeable compared to practical use of this data?

For some RNGs it could be basically free, for others I suspect one split can easily take as much time as 50 iterations of practical use (and for HC-128 something like a 1000 iterations, but I suppose that just disqualifies it here).

Taking a step back - If you just had some RNG wrapper which resets/reseeds when cloned, then I think one could just use that as the value for map_with.

That is an interesting function! Sounds almost as something we could use. But how do you prevent a second clone from ending up the same as the first clone? Except by always reseeding the RNG with something like OsRng? On a split I both modify the source iterator/RNG and produce another one.

@cuviper
Copy link
Member

cuviper commented Apr 20, 2018

Taking a step back - If you just had some RNG wrapper which resets/reseeds when cloned, then I think one could just use that as the value for map_with.

That is an interesting function! Sounds almost as something we could use. But how do you prevent a second clone from ending up the same as the first clone? Except by always reseeding the RNG with something like OsRng? On a split I both modify the source iterator/RNG and produce another one.

We don't prevent them from looking the same, nor does rayon even care. But with a custom wrapper, Clone can behave however you like. It could even use Cell or RefCell if you really need to modify the source -- note the map_with type is not required to be Sync, unlike most things rayon uses.

@vdanjean
Copy link

I'm not a specialist of rayon at all, but about parallel independent random generator, you should look at the Lecuyer random generator. It allows one to have parallel independent (reproducible) streams of random numbers.
Here is a link to the research paper: https://www.iro.umontreal.ca/~lecuyer/myftp/papers/handstat.pdf
And a link to some implementation: https://www.iro.umontreal.ca/~lecuyer/myftp/streams00/

@vks
Copy link

vks commented Apr 23, 2018

@cuviper

I don't think we'd want to make rand a public dependency of rayon-core, at least not until rand is 1.0.

Does this matter if it is behind a feature flag?

Would this meet your needs?

I think it would be nice to make it reproducible given a seed, but I'm not sure whether that is feasible with the work-stealing approach.

@cuviper
Copy link
Member

cuviper commented Apr 23, 2018

I don't think we'd want to make rand a public dependency of rayon-core, at least not until rand is 1.0.

Does this matter if it is behind a feature flag?

I don't see how the flag changes that consideration -- it's still a first-class part of the API for semver concerns. Even our former "unstable" feature caused problems when we made any changes, which is why we switched to the out-of-band --cfg rayon_unstable flag to enable it. (making it harder to opt into)

I think it would be nice to make it reproducible given a seed, but I'm not sure whether that is feasible with the work-stealing approach.

Note that any kind of thread-local solution also fails to be reproducible, since you can't control which thread each part will run on. Only an indexed Producer has a chance at generating deterministic results.

@pitdicker
Copy link
Author

Thanks for all the help so far! I am going with the wrapper approach. Wrote a lengthier comment a couple of days ago in this issue, but somehow it is lost...

Using Cell does the trick.

And if a user wants reproducible results I am thinking of adding a comment somewhere in our documentation to use both with_min_len and with_max_len. I don't think it really fits the philosophy of Rayon, with work-stealing and making the parallelism basically invisible. But it fits the common case where random number are not expected to be deterministic. And it also fits the solution I believe larger simulations in other languages now use of just splitting up the job over the number of cores, or some multiple of that.

@cuviper For me this issue can be closed. But as there are some more ideas floating around in this issue, can I leave deciding on that to you?

@pitdicker
Copy link
Author

Maybe one more question: Do you maybe want to give some input on rust-random/rand#413? The idea is to require a lower rustc version for rand_core and the planned crate with PRNGs. Except for a few tests that now use thread_rng, I think that would fill Rayons needs (without sticking to an old version of rand).

@cuviper
Copy link
Member

cuviper commented Apr 27, 2018

There's also #136 as an enhancement request for a parallel PRNG. I'll file a quick bug about the possibility of rolling our own PRNG for rayon-core. If there are any other individual ideas that should be tracked, please comment or raise them again in separate issues!

@cuviper cuviper closed this as completed Apr 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants