Multi-experiment studies #5

Open
tmalsburg opened this issue Jun 2, 2015 · 6 comments

Comments

@tmalsburg

If a study consists of multiple experiments, how should the data and materials be structured? The most natural way would be to have a directory for each experiment, but that runs counter to the approach proposed here. If the files for each experiment are instead scattered across the various directories (data, R, analysis, ...), it might make sense to have some sort of naming convention, e.g.:

  • data/experiment1_results.dat
  • data/experiment2_results.dat
  • analysis/experiment1_analysis.R
  • analysis/experiment2_analysis.R
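
A minimal sketch of how tooling could use such a naming convention, written in Python for illustration only (the thread itself is about R). The pattern `experiment<N>_<rest>` is read off the file names above; the function name `group_by_experiment` is hypothetical and not part of the proposal:

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed convention: file names look like "experiment<N>_<rest>".
EXPERIMENT_FILE = re.compile(r"^experiment(\d+)_(.+)$")


def group_by_experiment(paths):
    """Map experiment number -> sorted list of paths that follow the convention."""
    groups = defaultdict(list)
    for path in paths:
        match = EXPERIMENT_FILE.match(PurePosixPath(path).name)
        if match:  # files that do not follow the convention are ignored
            groups[int(match.group(1))].append(path)
    return {number: sorted(files) for number, files in sorted(groups.items())}


files = [
    "data/experiment1_results.dat",
    "data/experiment2_results.dat",
    "analysis/experiment1_analysis.R",
    "analysis/experiment2_analysis.R",
]
grouped = group_by_experiment(files)
```

With a convention like this, the files belonging to one experiment can be recovered from their names even though they live in different directories.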
@benmarwick
Contributor

Yes, good question, thanks for your comments. As I noted in #4, it would be great to have an actual example of someone making a good attempt at this. Maybe it could be your next publication? :) Or one you've already got out?

There are no hard-and-fast rules (yet). What we're doing here is mostly looking at how people already solve these problems in their own research, in ways that can be generalised to be useful and practical for many other researchers (rather than trying to be too prescriptive and disconnected from the norms of practice).

@tmalsburg
Author

I agree that this proposal shouldn't be too prescriptive. However, if I understand correctly, one goal is to be compatible with R's package structure (so that compendia can be installed like ordinary R-packages). Doesn't that mean that we inherit many of the conventions described in (the notorious) "Writing R extensions"?

Having one directory for each experiment makes sense in the project that I'm currently working on, but it would not conform to R's package structure, which requires that data is stored in the top-level data directory.

@benmarwick
Contributor

That's a good question. One relevant factor is whether your goal is to have the package on CRAN or not. For me, I don't think I'll ever submit any of my research compendia packages to CRAN, so I don't feel too bound to the manual. I'm happy with the minimum to allow the package to build and don't mind some warnings and notes. Others may have different standards and goals for their compendia, and I'm keen to see what standards emerge in others' work.

Regarding multi-experiments, I might organise my compendium something like this:

project
|- DESCRIPTION          
|- README.md             
|- NAMESPACE           
|- LICENSE                  
|
|- data/                      
|  +- exp_1/
|       + my_exp1_data.csv
|       + README.md    
|  +- exp_2/
|       + my_exp2_data.csv
|       + README.md   
|  +- exp_3/
|       + my_exp3_data.csv
|       + README.md  
|
|- analysis/           
|  +- my_report.Rmd    
|  +- exp_1_analysis.R
|  +- exp_2_analysis.R
|  +- exp_3_analysis.R
|
|- R/                    
|  +- my_functions.R    
|
|- man/
|  +- my_functions.Rd   

But that might not make sense for your project, I don't know. I'd be curious to know what structure you use for your multi-experiment project. Would you mind posting it here when you've done it?

To include things like experimental materials (cf. #4), you could either put them in a directory in inst/ or just in a top-level directory like experimental_materials, similar to how some of us have a manuscript directory. That second option is a non-standard extension to the classic R package structure (and probably wouldn't be allowed on CRAN), but it seems to make sense, and packages with these extra directories still work as installable objects.

@tmalsburg
Author

My study is in progress and unfortunately I can't share the package at this stage. The structure is the following:

├─ README.org
├─ DESCRIPTION
├─ R
│  ├─ geometric.functions.R
│  ├─ ordered_plots.functions.R
│  └─ waic.functions.R
├─ Experiment1
│  ├─ stimuli.txt
│  ├─ presentation.py
│  ├─ results.csv
│  ├─ read_data.functions.R
│  ├─ inspect_raw_data.script.R
│  ├─ participants.csv.gpg
│  ├─ descriptive_stats.org
│  └─ analysis.script.R
├─ Experiment2
│  └─ …
├─ Experiment3
│  └─ …
└─ Manuscript
   └─ manuscript.org

At the top level we have:

  • README.org: a literate org file (similar to R-markdown). GitHub understands org, so this file is nicely rendered.
  • DESCRIPTION: as defined in Writing R Extensions.
  • R: contains general-purpose R functions as in the current proposal.

Within Experiment1:

  • stimuli.txt
  • presentation.py: the script used for presenting the stimuli during the experiment.
  • results.csv: the data.
  • read_data.functions.R: experiment-specific functions for reading raw and cleaned-up data. Used in inspect_raw_data.script.R, descriptive_stats.org, and analysis.script.R.
  • inspect_raw_data.script.R: generates a series of plots for screening the raw data.
  • participants.csv.gpg: contains participant information and for each participant a flag indicating whether or not their data should be included in the analysis. This file is encrypted because at the current stage it contains sensitive information. This will change once we publish the repository.
  • descriptive_stats.org: a literate org file. Would probably go to vignettes if I followed the R-package style (but there is no vignette engine for literate org, hm …).
  • analysis.script.R: inferential stats. Will be converted to a literate org file at a later stage.

The other experiment directories are similar.

That’s what I have so far. Work in progress. What I like very much about this approach is that the directory structure reflects the structure of the study. That would not be the case if I adopted the R-package approach.

@gmbecker

gmbecker commented Jun 4, 2015

Titus,

My major objection to this approach, which for me is a deal breaker, is
that you lose a ton of the benefits of a unified structure. Your directory
structure makes perfect sense to a human looking at it, but it is difficult
to impossible to compute on and even where not impossible I would argue it
is pretty far from optimal.

What should data(stimuli) do if your analysis package is loaded? Where
should R look for the data? How can R even know what data is available?
That is the reason that data lives in one of a few different places/forms.
Because that guarantees that R can find it and give it to the user for any
package loaded in the session. That doesn't seem like it is the case here
without a lot of pretty ugly hacks ("just grep for directory names that
start with 'Experiment' and look at all the files in there every
time...").

A naming scheme within the data/ directory is much more reasonable from a
tooling perspective, or at the very least, an extension thereof with
directories under data/ (I'd have to look at how that might work, though).
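
To make the tooling point concrete, here is a rough sketch in Python (for illustration only; it is not how R's data() is implemented). It contrasts looking in one known data/ directory with scanning per-experiment directories. Both function names, the 'Experiment' prefix rule, and the "anything ending in .csv is data" rule are assumptions made for the example:

```python
import tempfile
from pathlib import Path


def datasets_in_data_dir(root):
    """Unified layout: every data set lives in one known directory."""
    return sorted(p.name for p in (Path(root) / "data").iterdir() if p.is_file())


def datasets_in_experiment_dirs(root, prefix="Experiment"):
    """Per-experiment layout: the tool has to guess which directories hold
    data (here: names starting with `prefix`) and which files in them are
    data (here: anything ending in .csv). Both rules are assumptions."""
    found = []
    for directory in sorted(Path(root).iterdir()):
        if directory.is_dir() and directory.name.startswith(prefix):
            found.extend(
                f"{directory.name}/{p.name}" for p in sorted(directory.glob("*.csv"))
            )
    return found


# Build a tiny throwaway tree containing both layouts side by side.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "experiment1_results.csv").touch()
    (root / "Experiment1").mkdir()
    (root / "Experiment1" / "results.csv").touch()
    (root / "Experiment1" / "analysis.script.R").touch()
    unified = datasets_in_data_dir(root)
    scattered = datasets_in_experiment_dirs(root)
```

The first function needs no knowledge of the study; the second only works as long as every project follows the same guessed rules for directory and file names.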

Speaking of such extensions though, I do think it might be reasonable to
write the "spec" in such a way that the analysis package can contain either
individual .R/.Rmd/.org/etc files OR subdirectories of such files grouped
together. Since we are defining what an analysis package does we can say
that, for example, if no top-level files are present, the directories
define "subanalyses" (a formal notion we would come up with).

Best,
~G

Gabriel Becker, PhD
Computational Biologist
Bioinformatics and Computational Biology
Genentech, Inc.

@tmalsburg
Author

@gmbecker, I'm fully aware that data(...) doesn't work in my approach. That's why I contrasted it with the "R-package approach" and that's why I have the file read_data.functions.R, which contains functions that fill in for data(...). One basic fact that I have to acknowledge is that many of my colleagues are not R-hackers. They are Python-, Julia-, or Matlab-hackers, or, more likely, no hackers at all. Given that this is my audience, human readability is something that I would not like to give up easily. Having said that, I do see the appeal of being able to install a compendium and use R's package infrastructure, but to me this seems like a nice-to-have convenience, not an essential requirement. Please note that I do not propose adoption of my approach in the context of the current effort. I just responded to Ben's request above.
