User-supplied RNG algorithm in `push()` #113

wlandau · 2023-09-04T23:52:38Z

Add a user-supplied RNG algorithm argument next to `seed`. Cf. #112

shikokuchuo · 2023-09-05T09:52:49Z

This seems sensible to enable.

As I have just implemented the use of L'Ecuyer-CMRG streams in mirai, I have not yet had the time to think about the best way to integrate this with crew. There may be a clever way that it can use the .Random.seed directly from mirai, although that probably also necessitates further improvements - I'll try to summarise below.

The motivation is to ensure good statistical properties for computations which are split into parallel processes and then brought back together - the classic parLapply type functional programming. From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are. The solution was devised by Luke Tierney and BD Ripley himself it seems (from the source attribution) - using L'Ecuyer-CMRG streams which are generated iteratively, with each one used to generate the next.

The implementation in the base package parallel actually sends the seeds once the cluster is set up as in the function parallel::clusterSetRNGStream, whereas I have made it part of the command line argument to daemon() to ensure that it is always set. This means that for non-programmeRs they at least get a good default out of the box.

At the moment, the implementation in mirai does, I believe, at least what parallel does. It is also an important improvement from the previous situation where the random seed was reset (randomly) after each evaluation - it is now persisted. Statistical 'safety' is ensured as each process uses a different stream. Reproducibility, however, is not guaranteed when dispatcher (or any load-balancing algorithm) is used, because the assignment of tasks to workers is then not deterministic.

Non-dispatcher mirai however does now generally allow reproducibility due to the round-robin behaviour of NNG, so we know that tasks are allocated to workers sequentially. As I alluded to above, a better more-generalised solution is likely possible but at the cost of more complexity. This also means that for crew, until this is found, it may not benefit directly as I believe targets does require a high level of reproducibility.

However, the changes should also have no downsides vs before (just to note that the RNGkind in the worker processes is now L'Ecuyer-CMRG by default, although you may also choose to override this at the crew level for consistency with previous behaviour if that is important).

wlandau · 2023-09-05T12:42:09Z

Is the main issue statistical reproducibility, or is it how "random" the draws look?

For reproducibility, targets assigns each target a unique deterministic seed of digest::digest2int(as.character(TARGET_NAME), seed = GLOBAL_SEED), where GLOBAL_SEED is configurable and has a fixed default of 0L. I was thinking crew users could emulate this on a task-by-task basis if needed.

The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised.

wlandau · 2023-09-05T12:43:42Z

> The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised.

Or, I might actually be forgetting how seeds work.

wlandau · 2023-09-05T12:45:08Z

> For reproducibility, targets assigns each target a unique deterministic seed of digest::digest2int(as.character(TARGET_NAME), seed = GLOBAL_SEED), where GLOBAL_SEED is configurable and has a fixed default of 0L. I was thinking crew users could emulate this on a task-by-task basis if needed.

If this covers reproducibility, would I really need L'Ecuyer-CMRG?

wlandau · 2023-09-05T13:12:13Z

https://stackoverflow.com/a/13807851 seems relevant. The statistical guarantees are supposed to be:

Reproducibility.
Independence.

The current approach of targets guarantees (1), but in hindsight, I am not so sure about (2).

shikokuchuo · 2023-09-05T13:14:11Z

> The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised.
>
> Or, I might actually be forgetting how seeds work.

Exactly this. If it were as simple as setting the seed then BD Ripley would probably not have had to invent such an elaborate solution. The use of L'Ecuyer or not has no bearing on reproducibility.

I don't have a definitive answer as of now as to how much an issue setting the seed deterministically actually might be - especially as each 'task' where this is done could be atomic on one hand or involve a very long sequence of statistical draws on the other.

~~If helpful, I could export the function nextstream() for you to access (and advance) the stream currently stored on host, as an alternative approach.~~ This is currently not reproducible, as I mentioned previously.

The topic probably merits a deeper dive at some point. But at least we are incrementally making improvements!

wlandau · 2023-09-05T13:30:30Z

> From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are.

So if the computation runs long enough on an existing set of parallel processes, then the PRNG state in one process could potentially overlap the PRNG state in a different parallel process? Because it's not just one long sequence which e.g. Mersenne Twister alone could mitigate?

> The solution was devised by Luke Tierney and BD Ripley himself it seems (from the source attribution) - using L'Ecuyer-CMRG streams which are generated iteratively, with each one used to generate the next.

> The use of L'Ecuyer or not has no bearing on reproducibility.

Yeah, so I guess RNGkind()[1L] could be the default. Changed in b0066d2.

shikokuchuo · 2023-09-05T13:37:51Z

> From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are.

> So if the computation runs long enough on an existing set of parallel processes, then the PRNG state in one process could potentially overlap the PRNG state in a different parallel process? Because it's not just one long sequence which e.g. Mersenne Twister alone could mitigate?

The long period of Mersenne-Twister does not guarantee that two different processes will not start at similar points and hence overlap, I guess.

The L'Ecuyer-CMRG streams (at least as implemented in base R) solve this problem: the streams are created beforehand and their random seeds are passed to the child processes. Each of these streams is then guaranteed to be independent of the others. This is what is now implemented in mirai.

wlandau · 2023-09-05T14:11:48Z

Thanks, that helps.

I see mirai uses nextRNGStream(), and the documentation is clear.

So this is my understanding of how to create independent RNG streams. First create an initial stream, which is just a vector of 7 integers.

RNGkind("L'Ecuyer-CMRG")
set.seed(0L) # global seed doesn't matter except for reproducibility
streams <- list()
streams[[1L]] <- .Random.seed # double brackets: store the whole length-7 vector as one element

Then each subsequent stream is created recursively from the previous one.

streams[[2L]] <- parallel::nextRNGStream(streams[[1L]])
streams[[3L]] <- parallel::nextRNGStream(streams[[2L]])
...

What's more, each nextRNGStream(streams[[i]]) is deterministic.

If mirai already does all this, I wonder if crew should step aside by default and avoid setting seeds altogether. Does that sound reasonable? Users in crew could still set seeds and algorithms if they really care, but this would not be the default.

shikokuchuo · 2023-09-05T14:15:37Z

Yes that's right. The only addition in my create_stream() is that the .Random.seed in the host process is restored, analogous to parallel::clusterSetRNGStream().
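To illustrate the "restore the host seed" step, here is a minimal sketch in base R. The helper name `initial_stream()` is hypothetical and this is not mirai's actual `create_stream()`; it only follows the same idea as `parallel::clusterSetRNGStream()`, which saves and restores the host's `.Random.seed` around the seeding step.

```r
# Hypothetical helper, not mirai's actual create_stream():
# derive an initial L'Ecuyer-CMRG stream from `seed` without
# disturbing the .Random.seed of the calling (host) process.
initial_stream <- function(seed) {
  # Check for an existing seed first, before any RNG call can create one.
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = globalenv())
  # On exit, restore the host's .Random.seed (or remove it if none existed).
  on.exit(
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  )
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  get(".Random.seed", envir = globalenv())
}

# Usage: the host state is the same before and after the call.
set.seed(42L)
before <- get(".Random.seed", envir = globalenv())
stream <- initial_stream(seed = 1L)
unchanged <- identical(before, get(".Random.seed", envir = globalenv()))
```

Because the first element of `.Random.seed` encodes the RNG kind, putting the old vector back also puts the host back on its previous generator.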

shikokuchuo · 2023-09-05T14:18:59Z

> If mirai already does all this, I wonder if crew should step aside by default and avoid setting seeds altogether. Does that sound reasonable? Users in crew could still set seeds and algorithms if they really care, but this would not be the default.

Not an issue for crew to do that. But currently this implementation only ensures the statistical properties without being reproducible. To do so would require mapping tasks to workers beforehand or recording what happens so that it can be repeated. Is that something crew can do?

Basically I have just replicated what happens in parallel to this point. It is an improvement from completely unreproducible / random statistical properties.
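One way to picture "mapping beforehand" is to attach a stream to each task rather than to each worker. The sketch below assumes that design; `task_streams()` and `run_with_stream()` are hypothetical names, not functions from mirai or crew, and the tasks run sequentially in one process purely for illustration.

```r
library(parallel)

# Build one L'Ecuyer-CMRG stream per task, up front, from a single seed.
# Note: this switches the session's RNG kind to L'Ecuyer-CMRG.
task_streams <- function(n_tasks, seed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  streams <- vector("list", n_tasks)
  streams[[1L]] <- get(".Random.seed", envir = globalenv())
  for (i in seq_len(n_tasks - 1L)) {
    streams[[i + 1L]] <- nextRNGStream(streams[[i]])
  }
  streams
}

# A task installs its pre-assigned stream before drawing, so its result
# does not depend on which worker runs it or in what order.
run_with_stream <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
  rnorm(2L)
}

streams <- task_streams(4L, seed = 0L)
in_order <- lapply(streams, run_with_stream)

# Simulate a load balancer handing out the same tasks in a different order.
new_order <- c(3L, 1L, 4L, 2L)
shuffled <- lapply(streams[new_order], run_with_stream)
same_results <- identical(in_order[new_order], shuffled)
```

The trade-off is that the host has to generate and ship one stream per task instead of one per worker.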

wlandau · 2023-09-05T14:29:33Z

> currently this implementation only ensures the statistical properties without being reproducible. To do so would require mapping tasks to workers beforehand or recording what happens so that it can be repeated. Is that something crew can do?

crew records the seed supplied by the user to push(). I could change that to the length-7 L'Ecuyer-CMRG seed vector from .Random.seed before the task begins, and I could make sure it is meaningful by disabling the newly added algorithm argument. Sound appropriate?

shikokuchuo · 2023-09-05T14:38:42Z

If I understand you correctly, you are saying that the seed used is recorded by targets and hence allows reproducibility if re-run?

If that's the case then great - yes you can change the seed recorded to the length 7 integer vector. In which case you would not want the algorithm to be changed. Note that the actual seed is 6 integers - the first just identifies the .Random.seed as L'Ecuyer I think.
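The layout of the length-7 vector can be checked directly in base R. The variable names below are only for illustration. Per the `?Random` documentation, the lowest two decimal digits of the first element encode the RNG kind, and L'Ecuyer-CMRG is kind 7 when counting from 0.

```r
# Temporarily switch to L'Ecuyer-CMRG and inspect the seed vector.
old_kind <- RNGkind()[1L]
RNGkind("L'Ecuyer-CMRG")
set.seed(0L)
seed <- get(".Random.seed", envir = globalenv())

n_total <- length(seed)        # total number of integers in the vector
kind_code <- seed[1L] %% 100L  # lowest two decimal digits: the RNG kind
state <- seed[-1L]             # remaining integers: the generator state

# Restore the RNG kind that was in use before.
RNGkind(old_kind)
```

So the recorded vector carries both the generator's identity and its 6-integer state, which is why the algorithm should not be changed separately once this vector is what gets stored.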

shikokuchuo · 2023-09-05T14:44:40Z

If that works then shall I open an interface to get and advance the stream for each compute profile? I think this will be best practice for maintainability.

wlandau · 2023-09-05T15:04:15Z

> If I understand you correctly, you are saying that the seed used is recorded by targets and hence allows reproducibility if re-run?

Both targets and crew do this. For crew, I am thinking a task could return .Random.seed if algorithm = "mirai" (will be the default) and otherwise the length 1 integer supplied to set.seed().
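A rough sketch of that idea, in base R only. The functions `run_recorded()` and `replay_recorded()` and the shape of the returned list are hypothetical and are not crew's actual API: with no user seed, the task records the `.Random.seed` in place before it runs; with a user seed, it records the length-1 integer passed to `set.seed()`.

```r
# Hypothetical task wrapper, not crew's actual API.
run_recorded <- function(task, seed = NULL) {
  if (is.null(seed)) {
    # Stream mode: make sure an RNG state exists, then record it
    # before the task draws anything.
    if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      invisible(runif(1L)) # one draw initializes .Random.seed
    }
    recorded <- get(".Random.seed", envir = globalenv())
  } else {
    # User-seed mode: record the single integer given to set.seed().
    set.seed(seed)
    recorded <- seed
  }
  list(result = task(), seed = recorded)
}

# Replay a task from whichever kind of seed was recorded.
replay_recorded <- function(task, recorded) {
  if (length(recorded) == 1L) {
    set.seed(recorded)
  } else {
    assign(".Random.seed", recorded, envir = globalenv())
  }
  task()
}

draw <- function() runif(3L)
first <- run_recorded(draw)                  # records the full state vector
again <- replay_recorded(draw, first$seed)   # same draws as first$result
```

Replaying the integer seed assumes the same RNG kind is active as when the task first ran, whereas the full state vector carries the kind with it.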

shikokuchuo · 2023-09-05T15:20:39Z

That's a nice name for the algorithm :) Agree there!

In addition, note that it is the responsibility of the launcher to get and advance the stream for each worker. mirai does that for the ones it launches itself e.g. locally. Each time a compute profile (environment) is created, a stream is also created and stored there. So my question above is just to confirm if a slightly modified nextstream(.compute) function should be exported?

wlandau · 2023-09-05T15:46:41Z

Hmm... so then it looks like crew needs to do more manual work than I realized. Seems doable though, using something like #113 (comment).

> So my question above is just to confirm if a slightly modified nextstream(.compute) function should be exported?

Yeah, I think that would help a lot.

shikokuchuo · 2023-09-05T15:50:56Z

Ok! I'm currently on my 'commute' so I'll get this to you with some pointers a bit later. Should be straightforward.

shikokuchuo · 2023-09-05T19:57:20Z

nextstream() in mirai is now ready to go in 9495f5c. I've tested with crew and will put up a PR with the minimal changes required to make it work.

wlandau added the type: new feature label Sep 4, 2023

wlandau self-assigned this Sep 4, 2023

shikokuchuo mentioned this issue Sep 5, 2023

Test fails due to incomplete specification of random seed (RNGkind) #112

Closed

1 task

wlandau-lilly closed this as completed in 0a5ccfe Sep 5, 2023

wlandau mentioned this issue Sep 5, 2023

Statistical independence of pseudo-random numbers ropensci/targets#1139

Closed

wlandau reopened this Sep 5, 2023

shikokuchuo mentioned this issue Sep 5, 2023

Changes to enable L'Ecuyer-CMRG Streams #115

Merged

2 tasks

wlandau-lilly closed this as completed in 0ed6b0c Sep 6, 2023
