Default seeding and replicates issues #94

Closed
gaow opened this issue Mar 3, 2018 · 5 comments

@gaow
Member

gaow commented Mar 3, 2018

In #70 @stephens999 has proposed that

  1. User specifies pipeline seeds to be used
  2. Before each module in the pipeline is run, DSC sets the seed to a value (maybe the seed itself) that depends on the pipeline seed.

The main focus here is reproducibility, and an important feature is that users should not think or worry about setting it.

I was in fact still against the idea that DSC should take care of seeding.

  1. It is difficult to set a default seed for every module simply because we cannot do it properly for all languages. In R the seed is most likely set via set.seed(). For Python and Shell programs there is no single procedure (in Python, is it numpy, random, or some other package? For a Shell command such as admixture ... --seed, I have no way of knowing that the program accepts such a flag!). For those languages users will have to take care of seeding themselves anyway. That is, behavior is going to be inconsistent between languages.

  2. Even if we can work out 1, how do we handle replicates? We cannot assume that only the first module of a pipeline needs seeds. For example, what if the first module is just:

preprocess_data:
   data: ...

then the 2nd module needs replicates:

simulation_based_on_data:
   replicates: 1,2,3,4,5

In that case, setting default, global seeds dedicated to replicates is not enough because we need to know when to apply them.
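To illustrate point 1 for Python: seeding the global random module does not touch a generator that a package creates for itself. A minimal sketch, where own_rng is a hypothetical stand-in for a third-party package's private generator (nothing here is DSC code):

```python
import random


def draw_with_global_seed(seed):
    """Seed the module-level generator, then draw from two sources."""
    # This only affects the generator behind random.random() and friends.
    random.seed(seed)

    # Hypothetical stand-in for a package that keeps its own generator;
    # it is seeded from OS entropy and ignores random.seed() above.
    own_rng = random.Random()

    return random.random(), own_rng.random()


first = draw_with_global_seed(999)
second = draw_with_global_seed(999)
print("global draws match:", first[0] == second[0])
print("private draws match:", first[1] == second[1])
```

The first draw is reproducible across calls while the second is not, which is why a single default seeding call cannot cover every program.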

For these reasons, my proposal in #70 essentially still relies on users to set seeds. In the DSC Road Map I proposed to spend an entire tutorial discussing it, e.g., some tips:

method: R(set.seed(999)) + method.R
 ...

That said, I believe I can fully appreciate what @stephens999 has in mind for R users. I hate this from an engineering perspective, but how about something like:

DSC:
   some_keyword: 999  

so that we automatically set the first line of R code to set.seed(999)?
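A minimal sketch of what such a keyword could do, assuming the R module script is available as a string. prepend_seed is a hypothetical helper for illustration, not an existing DSC function:

```python
def prepend_seed(r_script, seed):
    """Return the R script with set.seed(<seed>) as its first line."""
    # Hypothetical helper: the seed value would come from the DSC keyword.
    return f"set.seed({int(seed)})\n{r_script}"


print(prepend_seed("x <- rnorm(10)", 999))
```

This only helps scripts that rely on R's standard generator; it does nothing for Python or Shell modules.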

@pcarbo
Member

pcarbo commented Mar 4, 2018

@gaow I thought we agreed that every module instance would be run (by default) with a unique seed. This would take care of one of the issues you have raised.

I think it is okay (and still important!) to attempt to provide a default method for setting the seed, even if it is not guaranteed to work all the time. R is a simpler case because the standard method is to use set.seed, but even in R we can't guarantee that all users will use the standard method; for example, they may be using a "true" random number sequence from random.org, which can be generated using the random package.

@gaow
Member Author

gaow commented Mar 6, 2018

@stephens999 Sorry, I think what we have agreed on has a problem. Consider this scheme, i.e.,

# replicate 1
set.seed(seed 1)
module(A)
set.seed(seed 1)
module(B)
# replicate 2
set.seed(seed 2)
module(A)
set.seed(seed 2)
module(B)

Then module B, the method module, will get a different seed at each replicate, which is not good because we do not want variation in a method across replicates. So it goes back to the setting where we only allow replicates with different seeds set at the first module, the beginning of the pipeline; other modules will get their “hash-based” seed. E.g.,

# replicate 1
set.seed(seed 1)
module(A)
set.seed(fixed seed)
module(B)
# replicate 2
set.seed(seed 2)
module(A)
set.seed(fixed seed)
module(B)

Then my initial post on this ticket raises a case where replicates are needed in the 2nd module ...
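The second scheme can be sketched as a seed table. This is only an illustration under stated assumptions: seed_plan and fixed_seed are hypothetical names, and MD5 of the module name reduced to 8 digits stands in for whatever “hash-based” seed is actually used:

```python
import hashlib


def fixed_seed(module_name):
    """Hypothetical hash-based seed that does not depend on the replicate."""
    digest = hashlib.md5(module_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10**8


def seed_plan(modules, replicates):
    """Map (replicate, module) to a seed: only the first module varies."""
    plan = {}
    for rep in replicates:
        for position, name in enumerate(modules):
            # First module in the pipeline gets the replicate seed;
            # every later module gets the same fixed seed each time.
            plan[(rep, name)] = rep if position == 0 else fixed_seed(name)
    return plan


plan = seed_plan(["A", "B"], [1, 2])
print(plan[(1, "A")], plan[(2, "A")])  # differ across replicates
print(plan[(1, "B")] == plan[(2, "B")])  # same seed for the method module
```

The sketch also shows the caveat: a pipeline whose first module is deterministic never passes the replicate seed to the module that needs it.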

Of course we can still proceed with the above scheme (2nd code block), but we will have to explain all the caveats in a dedicated document in the wiki. Is my understanding correct? Is there a different proposal than my 2nd code block above?

@pcarbo to fill you in, Matthew and I have discussed this and we agree that, while there exist various caveats, we would still prefer to offer limited built-in support for replicates, and we will dedicate a documentation page to explain exactly what we do for the currently supported languages, along with cautions.

@pcarbo
Member

pcarbo commented Mar 6, 2018

@gaow @stephens999 My understanding was that the default seed would work like this:

# replicate 1
set.seed(seed1)
module(A)
set.seed(seed2)
module(B)
# replicate 2
set.seed(seed3)
module(A)
set.seed(seed4)
module(B)

The point of a "default" is not to solve all the cases; it is only to provide a reasonable behaviour that will work in many cases. I still think that providing a default is much better than not providing a default (for reproducibility).
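As a rough sketch of this default, each (replicate, module) instance could simply receive the next unused seed. unique_seed_plan is a hypothetical name and the consecutive numbering is an assumption made for illustration only:

```python
from itertools import count


def unique_seed_plan(modules, replicates, start=1):
    """Give every (replicate, module) instance its own seed."""
    counter = count(start)
    # Replicates form the outer loop, modules the inner loop, so seeds
    # are handed out in the order the instances would run.
    return {(rep, name): next(counter) for rep in replicates for name in modules}


plan = unique_seed_plan(["A", "B"], [1, 2])
for instance, seed in plan.items():
    print(instance, seed)
```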

@stephens999
Contributor

stephens999 commented Mar 6, 2018 via email

@gaow
Member Author

gaow commented Mar 7, 2018

Great! I have implemented it; the seed for a module is now i + h, where i is the replicate ID and h is the hash of the input module script, converted from hexadecimal to decimal and truncated to 8 digits. That is, we use 8-digit integer seeds. DSC::replicate and the command option --replicate have also been added. I believe this issue is now resolved.
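The i + h rule can be sketched roughly as follows. This is not the DSC source: module_seed is a hypothetical name, MD5 is an assumed hash function, and "chopped to 8 digits" is read here as keeping the first 8 decimal digits:

```python
import hashlib


def module_seed(script_text, replicate_id, digits=8):
    """Return replicate_id + h, with h derived from the module script."""
    # Assumption: MD5 of the script text; the thread does not name the hash.
    hex_digest = hashlib.md5(script_text.encode("utf-8")).hexdigest()
    # Hexadecimal -> decimal, then keep the leading `digits` decimal digits.
    h = int(str(int(hex_digest, 16))[:digits])
    return replicate_id + h


script = "x <- rnorm(100)"
for i in (1, 2, 3):
    print(i, module_seed(script, i))
```

Under this reading the same script always yields the same h, so seeds for consecutive replicates of one module differ by exactly 1, and editing the script changes h.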

@gaow gaow closed this as completed Mar 7, 2018
gaow added a commit that referenced this issue May 23, 2020