Seed issue: module seed or pipeline seed or both? #70

Closed
gaow opened this issue Mar 6, 2017 · 11 comments

@gaow
Member

gaow commented Mar 6, 2017

See this post:

https://github.com/stephenslab/dsc-wiki/blob/master/development/initial_thoughts_on_terminology_and_extraction.md

We definitely want to set a seed at the start of the pipeline. What about at the start of each module? My suggestion is yes. But by default derive this somehow from the pipeline seed.

I suggest we not have a pipeline seed and use only module seeds, since a module may require its own seed. The idea of a pipeline seed is already reflected by the dsc -x .. --seeds option in the dsc command interface, which resets all seeds in a pipeline to the same specified values for a particular execution (overriding the defaults).
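To make the quoted suggestion of deriving a module seed from the pipeline seed concrete, here is a minimal sketch. It is not DSC code: derive_module_seed and resolve_seed are hypothetical names, and hashing the pipeline seed together with the module name is only one possible rule.

```python
import hashlib


def derive_module_seed(pipeline_seed: int, module_name: str) -> int:
    """Hypothetical rule: hash (pipeline seed, module name) into a 32-bit seed."""
    key = f"{pipeline_seed}:{module_name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the first 4 bytes so the value fits the 32-bit range many RNGs accept.
    return int.from_bytes(digest[:4], "big")


def resolve_seed(pipeline_seed: int, module_name: str, module_seed=None) -> int:
    """An explicit module seed wins; otherwise fall back to the derived default."""
    if module_seed is not None:
        return module_seed
    return derive_module_seed(pipeline_seed, module_name)


default_seed = resolve_seed(1, "data")            # derived from the pipeline seed
explicit_seed = resolve_seed(1, "mcmc", 999)      # module supplies its own seed
```

The derived default is deterministic, so rerunning with the same pipeline seed reproduces every module seed, while a module that sets its own seed is left alone.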

@gaow gaow added the discussion label Mar 6, 2017
@pcarbo
Member

pcarbo commented Mar 7, 2017

It is hard to think of useful situations in which you would want to define multiple seed values for a given pipeline (i.e., for a given sequence of block evaluations). So I think a "global" seed is okay, and is probably "best practice". But I suppose there may be situations in which a user may want to do this; e.g., to simulate multiple data sets and take all combinations of data sets with different seeds. Having block-specific seeds certainly allows for more flexibility.

@stephens999
Contributor

stephens999 commented Mar 7, 2017 via email

@pcarbo
Member

pcarbo commented Mar 8, 2017

The question of how to store and access the seeds seems slightly independent of this discussion.

The more general question is whether we store the values of the variables used in every module instance (i.e., all the local variables and inputs), or only record the environment in which the module instance was evaluated. The latter is potentially more efficient, especially if the variables are associated with complex data structures, because values do not have to be stored more than once, but it may be less convenient.
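One way to picture the second option is a store that keeps each distinct value once and lets module instances hold references to it. The sketch below is only an illustration of that trade-off; ValueStore and record_instance are hypothetical names, not part of DSC.

```python
import hashlib
import pickle


class ValueStore:
    """Hypothetical content-addressed store: each distinct value is kept once."""

    def __init__(self):
        self._blobs = {}

    def put(self, value) -> str:
        blob = pickle.dumps(value)
        key = hashlib.sha256(blob).hexdigest()
        # Storing the same value again is a no-op; only the key is returned.
        self._blobs.setdefault(key, blob)
        return key

    def get(self, key: str):
        return pickle.loads(self._blobs[key])

    def __len__(self) -> int:
        return len(self._blobs)


def record_instance(store: ValueStore, module_name: str, variables: dict) -> dict:
    """Record a module instance as references into the store, not as copies."""
    refs = {name: store.put(value) for name, value in variables.items()}
    return {"module": module_name, "refs": refs}


store = ValueStore()
data = list(range(1000))
first = record_instance(store, "method_a", {"x": data, "seed": 1})
second = record_instance(store, "method_b", {"x": data, "seed": 1})
```

Both instances point at the same two stored values, which is the efficiency gain; the cost is that reading a variable back takes an extra lookup through the store.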

@stephens999
Contributor

stephens999 commented Mar 8, 2017 via email

@pcarbo
Member

pcarbo commented Mar 8, 2017

I'm not sure which document you are referring to, but my general impression is that "parameter" has meant slightly different things in our discussions. Variable and parameter can be used interchangeably, although parameter is often used to specifically refer to the function/module inputs.

Consider this function in R:

f <- function (x, y) {
   e <- 0.01
   a <- fit.model(x,y,e)
   return(a)
}
  • x, y, e, a are all variables. Specifically, they are "local variables" in that they have no meaning outside function f.

  • x, y are also input parameters; they are associated with values when the function is evaluated in a given environment.

  • variables x, y and a are determined by the environment in which the function is evaluated (in DSC, this is what we are calling "dependencies" since the environment depends on evaluation of other modules).

  • e is a variable that does not depend on the evaluation environment. We could call this a "free variable".

  • a is also an output; that is, it is the only variable whose value is accessible outside the function.
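The same distinctions can be checked mechanically. Below is a rough Python analogue of the R function above; fit_model is a made-up stand-in for fit.model, whose definition is not shown in this thread. Python's code object lists the input parameters first and the remaining local variables after them.

```python
def fit_model(x, y, e):
    # Hypothetical stand-in for fit.model in the R example.
    return (x + y) * e


def f(x, y):
    e = 0.01                 # local variable that does not depend on the caller
    a = fit_model(x, y, e)   # local variable that is also the output
    return a


code = f.__code__
# co_varnames holds every local name, with the arguments first.
input_parameters = code.co_varnames[:code.co_argcount]
other_locals = code.co_varnames[code.co_argcount:]
```

Here input_parameters is ('x', 'y') and other_locals is ('e', 'a'), matching the split between input parameters and the remaining local variables described in the list.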

@stephens999
Contributor

stephens999 commented Mar 8, 2017 via email

@stephens999
Contributor

stephens999 commented Mar 8, 2017 via email

@gaow gaow mentioned this issue Mar 8, 2017
@gaow
Member Author

gaow commented May 2, 2017

Here I propose an interface to set the pipeline seed:

data:
  exec: simulator.R
  input:
    seed: $(seed)
    ...
  output:
    ...

DSC:
   run: ...
   variables: 
     seed: R(1:5)

So we have DSC::variables to define quantities that can be accessed in any module via $() syntax. We can therefore define things such as pipeline seeds that can be used in many modules.
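As a rough illustration of how the $() lookup could behave, here is a small sketch. It is an assumption about the semantics rather than the DSC parser: resolve_inputs is a hypothetical helper, the input n is invented for the example, and list(range(1, 6)) stands in for R(1:5).

```python
import re

# Matches a value that is exactly a reference such as "$(seed)".
REFERENCE = re.compile(r"^\$\((\w+)\)$")


def resolve_inputs(inputs: dict, binding: dict) -> dict:
    """Replace every "$(name)" input with the value bound to name."""
    resolved = {}
    for name, value in inputs.items():
        match = REFERENCE.match(value) if isinstance(value, str) else None
        resolved[name] = binding[match.group(1)] if match else value
    return resolved


variables = {"seed": list(range(1, 6))}        # stands in for seed: R(1:5)
data_inputs = {"seed": "$(seed)", "n": 100}    # "n" is a made-up extra input

# One concrete set of module inputs per value of the shared variable.
runs = [resolve_inputs(data_inputs, {"seed": s}) for s in variables["seed"]]
```

Each entry of runs is the input set for one execution of the data module, so five seed values give five executions.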

@stephens999
Contributor

I'm inclined to think the modules should not have access to the seeds.
DSC should deal with the seeds itself; there is no need for the modules to have access to them or to worry about them.

Here is how I propose the seeds be dealt with:
i) The user specifies the pipeline seeds to be used (1:5 in your example above). This would be done similarly to the way you do it above.
ii) Before each module in the pipeline is run, DSC sets the seed to a value that depends on the pipeline seed. Simplest would be to have it set the seed to the pipeline seed, and I would start with that behaviour. There may be edge cases where we want to use a different rule, but this would be fine to begin with, I think.
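A minimal sketch of the two steps above, under the simplest rule (reset the generator to the pipeline seed before every module). The modules simulate and analyze are invented for illustration, and Python's random module stands in for whatever generator the modules actually use.

```python
import random


def simulate():
    # Module 1: draws random data; it never sees or sets a seed itself.
    return [random.random() for _ in range(3)]


def analyze(data):
    # Module 2: a randomized method, again unaware of any seed.
    return sum(data) + random.random()


def run_pipeline(pipeline_seed: int):
    """The driver, not the modules, sets the seed before each module runs."""
    random.seed(pipeline_seed)
    data = simulate()
    random.seed(pipeline_seed)
    result = analyze(data)
    return data, result


# Step i): the user asks for pipeline seeds 1..5; each gives one replicate.
replicates = {seed: run_pipeline(seed) for seed in range(1, 6)}
```

Rerunning run_pipeline with the same seed gives identical output. One consequence visible in this sketch is that both modules start from the same generator state, so the first draw in analyze repeats the first draw in simulate.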

@gaow
Member Author

gaow commented May 3, 2017

A seed serves two goals: ensuring reproducibility and generating replicates. To make my point, consider this example, written in the current syntax where the seed is explicit:

data:
  input:
    seed: $(seed)
  ...

mcmc:
  input:
    seed: $(seed)

DSC:
  variables:
    seed: R(1:5)

Would we then run 5 data sets and 5 MCMC rounds, generating 25 different outputs? Instead, one might simply want:

data:
  input:
    seed: $(seed)
  ...

mcmc:
  input:
    seed: 999

DSC:
  variables:
    seed: R(1:5)

i.e., setting a single fixed seed for the mcmc method just to ensure reproducibility, not to generate more replicates.
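The counts in question can be enumerated directly. The sketch below is only an illustration of the possible readings, not a statement of what DSC does; range(1, 6) stands in for R(1:5).

```python
from itertools import product

pipeline_seeds = range(1, 6)   # stands in for seed: R(1:5)

# Reading A: $(seed) varies independently in data and mcmc: 5 x 5 combinations.
crossed = [{"data": d, "mcmc": m} for d, m in product(pipeline_seeds, pipeline_seeds)]

# Reading B: both modules share the same value of $(seed) in each run.
shared = [{"data": s, "mcmc": s} for s in pipeline_seeds]

# Second configuration: data uses $(seed), mcmc is pinned to 999.
fixed = [{"data": s, "mcmc": 999} for s in pipeline_seeds]

counts = (len(crossed), len(shared), len(fixed))
```

Reading A yields 25 runs, while reading B and the pinned configuration each yield 5; only the pinned configuration keeps the mcmc seed the same across all replicates.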

Notice that previously I made seed a standalone keyword:

data:
  seed: R(1:5)
  input:
     ...

@gaow
Member Author

gaow commented Jan 25, 2018

We have finally reached an agreement on this issue, and we will stick to what was agreed in this thread. The ticket is closed for now unless the design changes; an implementation request has been added to the project TODO list.
