User-supplied RNG algorithm in push()
#113
This seems sensible to enable. As I have just implemented the use of L'Ecuyer-CMRG streams in The motivation is to ensure good statistical properties for computations which are split into parallel processes and then brought back together - the classic The implementation in the base package At the moment, the implementation in Non-dispatcher However, the changes should also have no downsides vs before (just to note that the RNGkind in the worker processes is now by default L'Ecuyer-CMRG, although you may also choose to override this at the |
Is the main issue statistical reproducibility, or is it how "random" the draws look? For reproducibility, The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised. |
Or, I might actually be forgetting how seeds work. |
If this covers reproducibility, would I really need L'Ecuyer-CMRG? |
https://stackoverflow.com/a/13807851 seems relevant. The statistical guarantees are supposed to be:
The current approach of |
Exactly this. If it were as simple as setting the seed then BD Ripley would probably not have had to invent such an elaborate solution. The use of L'Ecuyer or not has no bearing on reproducibility. I don't have a definitive answer as of now as to how much of an issue setting the seed deterministically might actually be, especially as each 'task' where this is done could be atomic on one hand or involve a very long sequence of statistical draws on the other.
The topic probably merits a deeper dive at some point. But at least we are incrementally making improvements! |
So if the computation runs long enough on an existing set of parallel processes, then the PRNG state in one process could potentially overlap the PRNG state in a different parallel process? Because it's not just one long sequence which e.g. Mersenne Twister alone could mitigate?
Yeah, so I guess |
Just because Mersenne-Twister has a long period does not guarantee that two different processes will not start at similar points and hence overlap, I guess. The L'Ecuyer-CMRG streams (at least as implemented in base R) solve this problem by creating the streams beforehand and passing the random seeds to the child processes. Each of these streams is then guaranteed to be independent of the others. This is what is now implemented in |
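For illustration, here is a minimal sketch of the idea described above: generate the streams up front in the parent process, then have each worker install its own stream before drawing. It uses only base R and `parallel::nextRNGStream()`. The helper `draw_with_stream()` is a hypothetical name for this example, and the workers are simulated sequentially rather than launched as real processes.

```r
# Sketch only: pre-generate one L'Ecuyer-CMRG stream per worker.
# `draw_with_stream()` is a hypothetical helper, not part of any package.
RNGkind("L'Ecuyer-CMRG")
set.seed(123L)

n_workers <- 3L
streams <- vector("list", n_workers)
streams[[1L]] <- .Random.seed
for (i in seq_len(n_workers - 1L)) {
  # Each stream is derived from the previous one.
  streams[[i + 1L]] <- parallel::nextRNGStream(streams[[i]])
}

draw_with_stream <- function(stream, n) {
  # A worker installs the stream it was given as its RNG state, then draws.
  assign(".Random.seed", stream, envir = globalenv())
  runif(n)
}

# Simulate three workers, each drawing from its own stream.
draws <- lapply(streams, draw_with_stream, n = 5L)
```

Re-running `draw_with_stream()` with the same stream reproduces the same draws, while different streams give different draws.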
Thanks, that helps. I see So this is my understanding of how to create independent RNG streams. First create an initial stream, which is just a vector of 7 integers.

    RNGkind("L'Ecuyer-CMRG")
    set.seed(0L) # global seed doesn't matter except for reproducibility
    streams <- list()
    streams[[1L]] <- .Random.seed

Then each subsequent stream is created recursively from the previous one.

    streams[[2L]] <- parallel::nextRNGStream(streams[[1L]])
    streams[[3L]] <- parallel::nextRNGStream(streams[[2L]])
    # ...

What's more, each If |
Yes that's right. The only addition in my |
Not an issue for crew to do that. But currently this implementation only ensures the statistical properties without being reproducible. To do so would require mapping tasks to workers beforehand or recording what happens so that it can be repeated. Is that something crew can do? Basically I have just replicated what happens in |
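One way to picture the "mapping beforehand" option is to assign a stream to each task, rather than to each worker, so that a task's draws do not depend on where or when it runs. This is a sketch of that alternative under stated assumptions, not what either package does. `task_streams()` and `run_task()` are hypothetical names.

```r
# Sketch only: one pre-assigned L'Ecuyer-CMRG stream per task.
# `task_streams()` and `run_task()` are hypothetical helpers.
task_streams <- function(seed, n_tasks) {
  set.seed(seed, kind = "L'Ecuyer-CMRG")
  streams <- vector("list", n_tasks)
  streams[[1L]] <- .Random.seed
  for (i in seq_len(n_tasks - 1L)) {
    streams[[i + 1L]] <- parallel::nextRNGStream(streams[[i]])
  }
  streams
}

run_task <- function(stream) {
  # Install the task's own stream before doing any random draws.
  assign(".Random.seed", stream, envir = globalenv())
  mean(rnorm(10L))
}

streams <- task_streams(seed = 1L, n_tasks = 4L)

# Run the tasks in their natural order, then in a shuffled order.
in_order <- vapply(streams, run_task, numeric(1L))
shuffled <- vapply(streams[c(3L, 1L, 4L, 2L)], run_task, numeric(1L))
```

Because each task carries its own stream, the shuffled run returns the same per-task values as the ordered run, only in a different order.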
If I understand you correctly, you are saying that the seed used is recorded by If that's the case then great - yes you can change the seed recorded to the length 7 integer vector. In which case you would not want the algorithm to be changed. Note that the actual seed is 6 integers - the first just identifies the |
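As a small check of the layout described above, the sketch below inspects a L'Ecuyer-CMRG `.Random.seed` in base R. The first element encodes the RNG kind settings and the remaining six integers are the generator state. The variable names are for illustration only.

```r
# Sketch only: inspect the structure of a L'Ecuyer-CMRG seed in base R.
RNGkind("L'Ecuyer-CMRG")
set.seed(0L)

seed <- .Random.seed
kind_code <- seed[1L]  # encodes the RNG kind settings, not the state
state <- seed[-1L]     # the six integers of actual generator state

# Advancing to the next stream changes the state but keeps the kind code.
next_seed <- parallel::nextRNGStream(seed)
```

Recording the full length-7 vector therefore captures both the algorithm and the state, which is why the algorithm should not be changed separately once the vector is stored.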
If that works then shall I open an interface to get and advance the stream for each compute profile? I think this will be best practice for maintainability. |
Both |
That's a nice name for the algorithm :) Agree there! In addition, note that it is the responsibility of the launcher to get and advance the stream for each worker. |
Hmm... so then it looks like
Yeah, I think that would help a lot. |
Ok! I'm currently on my 'commute' so I'll get this to you with some pointers a bit later. Should be straightforward. |
Next to `seed`. c.f. #112