Seed issue: module seed or pipeline seed or both? #70

gaow · 2017-03-06T17:16:04Z

See this post:

https://github.com/stephenslab/dsc-wiki/blob/master/development/initial_thoughts_on_terminology_and_extraction.md

We definitely want to set a seed at the start of the pipeline. What about at the start of each module? My suggestion is yes. But by default derive this somehow from the pipeline seed.

I suggest we not have a pipeline seed and only use module seeds, since a module may require its own seed. The idea of a pipeline seed is already reflected by the dsc -x .. --seeds option in the dsc command interface, which resets all seeds in a pipeline to the same specified values for a particular execution (overriding the defaults).


pcarbo · 2017-03-07T22:28:10Z

It is hard to think of useful situations in which you would want to define multiple seed values for a given pipeline (i.e., for a given sequence of block evaluations). So I think a "global" seed is okay, and is probably "best practice". But I suppose there may be situations in which a user wants to do this; e.g., to simulate multiple data sets and take all combinations of data sets with different seeds. Having block-specific seeds certainly allows for more flexibility.

stephens999 · 2017-03-07T23:49:47Z

I think that to ensure reproducibility we have to set a seed before executing any module. So a module instance will want to include a record of the seed value that was set (for reproducibility). This seed value may well be the same for all module instances in a pipeline instance. But I think it will make sense conceptually (and be more flexible in future) to store the value for each module instance as part of the value of that module instance. Matthew
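As an illustration of the idea in this comment (set a seed before every module and store it with the module instance), here is a minimal Python sketch. It is not DSC's implementation; `run_module`, `simulate`, and the record layout are hypothetical names chosen for the example.

```python
import random

def run_module(name, func, seed):
    """Seed the RNG, run the module, and return a record that includes the seed."""
    random.seed(seed)  # set the seed immediately before the module executes
    value = func()
    # The seed is stored as part of the module instance record (hypothetical layout).
    return {"module": name, "seed": seed, "value": value}

def simulate():
    """Toy stand-in for a module that uses random numbers."""
    return [random.random() for _ in range(3)]

first = run_module("simulate", simulate, seed=1)
# Re-running with the recorded seed reproduces the same value.
again = run_module("simulate", simulate, seed=first["seed"])
print(first["value"] == again["value"])
```

Because the seed travels with the record, any single module instance can be rerun on its own, whether or not every instance in the pipeline happened to share the same seed.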


pcarbo · 2017-03-08T03:52:51Z

The question of how to store and access the seeds seems slightly independent of this discussion.

The more general question is whether we store values of variables used in every module instance (i.e., all the local variables and inputs), or do we only record the environment in which the module instance was evaluated. The latter is potentially more efficient, especially if the variables are associated with complex data structures, because it means that values do not have to be stored more than once, but maybe less convenient.
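The trade-off described above can be sketched roughly in Python. This is only one possible reading of "record the environment", in which each distinct value is stored once and module instances keep references to it; `digest`, `record_by_value`, and `record_by_reference` are hypothetical helpers, not part of DSC.

```python
import hashlib
import json

def digest(value):
    """Content hash of a JSON-serializable value, used as a storage key."""
    blob = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def record_by_value(inputs):
    """Strategy 1: copy every input value into the instance record."""
    return {"inputs": dict(inputs)}

def record_by_reference(inputs, store):
    """Strategy 2: store each distinct value once and keep only its key."""
    refs = {}
    for name, value in inputs.items():
        key = digest(value)
        store.setdefault(key, value)  # stored only the first time it is seen
        refs[name] = key
    return {"inputs": refs}

big = list(range(1000))  # stands in for a complex data structure
store = {}
rec1 = record_by_reference({"x": big, "e": 0.01}, store)
rec2 = record_by_reference({"x": big, "e": 0.05}, store)
print(len(store))  # number of distinct values actually stored
```

The by-reference records are smaller when large values are shared, but reading a value back takes an extra lookup in `store`, which is the convenience cost mentioned above.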

stephens999 · 2017-03-08T14:05:08Z

Here, does "local variables" mean "parameters" in my document?


pcarbo · 2017-03-08T14:22:48Z

I'm not sure which document you are referring to, but my general impression is that "parameter" has meant slightly different things in our discussions. Variable and parameter can be used interchangeably, although parameter is often used to specifically refer to the function/module inputs.

Consider this function in R:

f <- function (x, y) {
   e <- 0.01
   a <- fit.model(x,y,e)
   return(a)
}

- x, y, e, a are all variables. Specifically, they are "local variables" in that they have no meaning outside function f.
- x, y are also input parameters; they are associated with values when the function is evaluated in a given environment.
- Variables x, y and a are determined by the environment in which the function is evaluated (in DSC, this is what we are calling "dependencies", since the environment depends on the evaluation of other modules).
- e is a variable that does not depend on the evaluation environment. We could call this a "free variable".
- a is also an output; that is, it is the only variable whose value is accessible outside the function.

stephens999 · 2017-03-08T14:30:44Z

@pcarbo: this document, https://github.com/stephenslab/dsc-wiki/blob/master/development/initial_thoughts_on_terminology_and_extraction.md (I have updated it to note your comment about no side effects and to clarify that parameters are local to a module)


stephens999 · 2017-03-08T14:33:10Z

I think what @gaow is doing now is storing the parameter values for each module instance explicitly (in the "table" for that module). See, for example, the datamaker table here: #71, where min_pi0, n, etc. are parameter values. In contrast, the pipeline output variable values are stored separately in a file and linked to the module instance record via the "returns" column.


gaow · 2017-05-02T23:43:15Z

Here I propose an interface to set the pipeline seed:

data:
  exec: simulator.R
  input:
    seed: $(seed)
    ...
  output:
    ...

DSC:
   run: ...
   variables: 
     seed: R(1:5)

So we have DSC::variables to define quantities that can be accessed in any module via $() syntax. We can therefore define things such as pipeline seeds that can be used in many modules.

stephens999 · 2017-05-03T16:31:25Z

I'm inclined to think the modules should not have access to the seeds. DSC should deal with the seeds itself; there is no need for the modules to have access to them or to worry about them.

Here is how I propose the seeds be dealt with:
i) The user specifies the pipeline seeds to be used (1:5 in your example above). This would be done similarly to the way you do it above.
ii) Before each module in the pipeline is run, DSC sets the seed to a value that depends on the pipeline seed. The simplest rule would be to set the seed to the pipeline seed itself, and I would start with that behaviour. There may be edge cases where we want to use a different rule, but this would be fine to begin with, I think.
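The two steps above could be sketched in Python as follows. This is a hedged illustration, not DSC code: `run_pipeline`, `identity_rule`, and `derived_rule` are hypothetical names, and the hash-based rule is just one example of a "different rule" that might be used in an edge case.

```python
import hashlib
import random

def identity_rule(pipeline_seed, module_name):
    """Simplest rule: every module is seeded with the pipeline seed itself."""
    return pipeline_seed

def derived_rule(pipeline_seed, module_name):
    """Hypothetical alternative: a reproducible seed that depends on the module name."""
    text = f"{pipeline_seed}:{module_name}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(text).digest()[:4], "big")

def run_pipeline(modules, pipeline_seed, rule=identity_rule):
    """Run modules in order; the runner, not the module, sets the seed."""
    outputs = {}
    for name, func in modules:
        random.seed(rule(pipeline_seed, name))  # seed set before each module
        outputs[name] = func()
    return outputs

# Toy modules that never see or handle a seed themselves.
modules = [("data", lambda: random.random()), ("mcmc", lambda: random.random())]

# One pipeline execution per user-specified pipeline seed (1:5).
results = {seed: run_pipeline(modules, seed) for seed in range(1, 6)}
```

With `identity_rule`, every module in a pipeline instance starts from the same random stream, so the two toy modules above return the same number. If that correlation is undesirable, swapping in something like `derived_rule` would avoid it while staying reproducible.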

gaow · 2017-05-03T16:38:28Z

A seed serves two goals: ensuring reproducibility and generating replicates. To make my point, consider this example, written in the current syntax where the seed is explicit:

data:
  input: 
    seed: $(seed)
 ...

mcmc:
   input:
     seed: $(seed)

DSC:
   variable:
     seed: R(1:5)

Then will we run 5 data sets and 5 MCMC rounds, generating 25 different outputs? Instead, one might simply want:

data:
  input: 
    seed: $(seed)
 ...

mcmc:
   input:
     seed: 999
 
DSC:
   variable:
     seed: R(1:5)

i.e., setting a single fixed seed for the MCMC method just to ensure reproducibility, not to generate more replicates.
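The counting behind the question can be checked with a short Python sketch. It assumes the 25-output case arises when the two modules' seed values are crossed independently; whether DSC would actually cross them is exactly what is being asked, so this is arithmetic only, not a statement about DSC's behaviour.

```python
from itertools import product

seeds = [1, 2, 3, 4, 5]

# Case 1: data and mcmc seeds vary independently (all combinations).
crossed = list(product(seeds, seeds))

# Case 2: both modules share the same seed within a pipeline instance.
shared = [(s, s) for s in seeds]

# Case 3: data seed varies, mcmc seed is fixed at 999 for reproducibility only.
fixed = list(product(seeds, [999]))

print(len(crossed), len(shared), len(fixed))  # 25 5 5
```

Only the first case multiplies the number of replicates; the other two give one MCMC run per simulated data set.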

Notice that previously I made seed a standalone keyword:

data:
  seed: R(1:5)
  input:
     ...

gaow · 2018-01-25T06:13:06Z

We have finally reached an agreement on this issue. We will stick to this thread. The ticket is closed for now until the design changes otherwise -- implementation request has been added to project TODO list.

gaow added the discussion label Mar 6, 2017

gaow mentioned this issue Mar 8, 2017

Extraction syntax #72

Closed

gaow mentioned this issue May 5, 2017

Global variables in DSC #81

Closed

gaow closed this as completed Jan 25, 2018

gaow mentioned this issue Mar 3, 2018

Default seeding and replicates issues #94

Closed
