Default seeding and replicates issues #94

gaow · 2018-03-03T13:58:09Z

User specifies pipeline seeds to be used
Before each module in the pipeline is run, DSC sets the seed to a value (maybe the seed itself) that depends on the pipeline seed.

The main focus here is reproducibility, and an important feature is that users should not think or worry about setting it.

I was in fact still against the idea that DSC should take care of seeding.

1. It is difficult to set a default seed for every module, simply because we cannot do it properly for all languages. In R it is most likely configured via set.seed(). For Python and Shell programs the procedure is not unique (Python via numpy, random, or other packages? For Shell, with admixture ... --seed, I would have no idea that the program accepts such a flag!). For those languages users will have to take care of this anyway. That is, behavior is going to be inconsistent between languages.
2. Even if we can work out 1, how do we handle replicates? We cannot assume that only the first module of a pipeline needs seeds. For example, what if the first module is just:

preprocess_data:
   data: ...

then the 2nd module needs replicates:

simulation_based_on_data:
   replicates: 1,2,3,4,5

In that case, setting default, global seeds dedicated to replicates is not enough because we need to know when to apply them.

For these reasons, my proposal in #70 essentially still relies on users to set seeds. In the DSC Road Map I proposed to spend an entire tutorial discussing it, e.g. with tips such as:

method: R(set.seed(999)) + method.R
 ...

That said, I believe I can fully appreciate what @stephens999 has in mind for R users. I hate this from an engineering perspective, but how about something like:

DSC:
   some_keyword: 999

so that we automatically set the first line of R code to set.seed(999)?
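A minimal sketch of what such automatic injection could look like, written in Python. The function name `prepend_seed` and the idea of treating the module script as a plain string are assumptions for illustration only, not a description of how DSC is implemented.

```python
def prepend_seed(r_script: str, seed: int) -> str:
    """Return the R script text with a set.seed() call added as its first line."""
    return f"set.seed({seed})\n{r_script}"


# Hypothetical module script; 999 stands in for the value of the DSC-level keyword.
print(prepend_seed("x <- rnorm(10)\nprint(mean(x))", 999))
```

This only helps for code that draws from R's built-in generator, which is the limitation raised in point 1 above.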


pcarbo · 2018-03-04T20:12:42Z

@gaow I thought we agreed that every module instance would be run (by default) with a unique seed. This would take care of one of the issues you have raised.

I think it is okay (and still important!) to attempt to provide a default method for setting the seed, even if it is not guaranteed to work all the time. R is a simpler case because the standard method is to use set.seed, but even in R we can't guarantee that all users will use the standard method; for example, they may be using a "true" random number sequence from random.org, which can be generated using the random package.

gaow · 2018-03-06T02:33:15Z

@stephens999 Sorry, I think what we have agreed on has a problem. Consider this scheme, i.e.,

# replicate 1
set.seed(seed 1)
module(A)
set.seed(seed 1)
module(B)
# replicate 2
set.seed(seed 2)
module(A)
set.seed(seed 2)
module(B)

Then module B, the method module, will get a different seed at each replicate, which is not good because we do not want variation in a method across replicates. So it goes back to the setting where we only allow replicates with different seeds set at the first module, the beginning of the pipeline; other modules will get their "hash-based" seed. E.g.:

# replicate 1
set.seed(seed 1)
module(A)
set.seed(fixed seed)
module(B)
# replicate 2
set.seed(seed 2)
module(A)
set.seed(fixed seed)
module(B)

Then my initial post on this ticket raises a case where replicates are needed in the 2nd module ...

Of course we can still proceed with the above scheme (2nd code block), but we will have to explain all the caveats in a dedicated document in the wiki. Is my understanding correct? Is there a different proposal than my 2nd code block above?
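To make the scheme in the 2nd code block concrete, here is a small Python sketch. The two module functions, the value of `FIXED_SEED`, and the use of Python's `random.Random` are all assumptions made for illustration; they stand in for module(A), module(B) and the "hash-based" seed.

```python
import random

FIXED_SEED = 12345  # hypothetical stand-in for the "hash-based" seed of module B


def module_a(seed):
    """Simulation module: its output should differ from replicate to replicate."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(3)]


def module_b(data, seed):
    """Method module with internal randomness, e.g. a random starting value."""
    rng = random.Random(seed)
    start = rng.random()
    return start, sum(data) / len(data)


def run_replicate(replicate_seed):
    data = module_a(replicate_seed)           # seeded per replicate
    start, estimate = module_b(data, FIXED_SEED)  # same seed in every replicate
    return data, start, estimate


for replicate_seed in (1, 2):
    print(run_replicate(replicate_seed))
```

With this arrangement the simulated data change across replicates while the method's own random draws stay the same. The sketch does not address the case from the initial post, where replicates are needed in a later module.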

@pcarbo to fill you in, Matthew and I have discussed this and we agree that, while there exist various caveats, we would still prefer to offer limited built-in support to deal with replicates, and we will dedicate a documentation page to explain exactly what we do for the currently supported languages, along with cautions.

pcarbo · 2018-03-06T03:02:36Z

@gaow @stephens999 My understanding was that the default seed would work like this:

# replicate 1
set.seed(seed1)
module(A)
set.seed(seed2)
module(B)
# replicate 2
set.seed(seed3)
module(A)
set.seed(seed4)
module(B)

The point of a "default" is not to solve all the cases; it is only to provide a reasonable behaviour that will work in many cases. I still think that providing a default is much better than not providing a default (for reproducibility).

stephens999 · 2018-03-06T16:20:47Z

I don't see a problem. If someone really wants their module to always behave the same, they should be responsible for that by setting a seed at the start of the module, e.g. set.seed(123). Matthew


gaow · 2018-03-07T01:09:28Z

Great! I have implemented it; the seed for a module is now i + h, where i is the replicate ID and h is the hash of the input module script, converted from hexadecimal to decimal and chopped to 8 digits. That is, we use 8-digit integer seeds. DSC::replicate and the command option --replicate have also been added. I believe this issue is now resolved.
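The i + h rule described in this comment can be sketched in Python as follows. This is an illustration under stated assumptions, not the DSC source: the function name `module_seed` is hypothetical, SHA-256 is used only because the comment does not say which hash function is involved, and "chopped to 8 digits" is read here as keeping the first 8 decimal digits.

```python
import hashlib


def module_seed(script: str, replicate: int, digits: int = 8) -> int:
    """Combine a replicate ID with a truncated hash of the module script."""
    # Assumption: SHA-256; the hash function is not named in the thread.
    hex_digest = hashlib.sha256(script.encode("utf-8")).hexdigest()
    # Hexadecimal -> decimal, then keep the first `digits` decimal digits
    # (assumption: the thread does not say which digits are kept).
    h = int(str(int(hex_digest, 16))[:digits])
    return replicate + h


script = "x <- rnorm(10)"  # hypothetical module script
for i in (1, 2, 3):
    print(i, module_seed(script, i))
```

Under this rule, two replicates of the same module get consecutive seeds, and editing the module script changes h and therefore every seed derived from it.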


gaow closed this as completed Mar 7, 2018

gaow added a commit that referenced this issue May 22, 2020

Add configuration option seed to switch between HASH based and REPLICATE based #94

8a3768d

gaow added a commit that referenced this issue May 23, 2020

Use module ID as part of seed #94

b224e50

gaow added a commit that referenced this issue May 23, 2020

Do not use HASH for seeds #94

217e795
