Repository Structure
The overriding goal of our approach is to guarantee that all output is reproducible.
A user with no special experience should be able to clone the repository, delete everything other than the code and raw input files, and reproduce all output including intermediate data files, statistical analysis, tables and figures, and PDFs of the paper draft and slides. Doing this should be a straightforward and intuitive process, and while the computational time may be substantial, the human time required should not. Not only must the current output be reproducible, but all previous output must be as well (e.g., results in any previous draft or issue deliverable). The output not only needs to be reproducible today; it needs to remain so in the future.
There are two key rules that are the bedrock of reproducibility.
Important commits are any commit or merge to master, any commit that defines a final issue deliverable, and the final commit before a pull request. See discussion on the Workflow page here.
This is our version of the "one rule" of GitHub Flow: that anything in the master branch is deployable.
Without this rule, a user has no way of knowing what combination of code and inputs produced a given set of output, and there is no guarantee that the code that produced it is the same as the code in the commit. Perhaps some scripts were run but not others. Perhaps edits were made to a script after it was run. Perhaps some output files were changed manually or by processes outside the repository.
Running all relevant build scripts before committing eliminates this ambiguity. The first step in a build script is to delete and recreate the relevant /output/ directory. So long as no changes are made between the completion of the build script and the time of commit, the user can be confident that everything in /output/ was created by the build script and the code that it calls as of that commit. Re-running the build script at that commit must reproduce the output (provided Rule #3 is followed as well!).
An external dependency is anything outside the repository that the code uses as input. This can include data files that are too large to be committed directly (even with Git LFS); code libraries / packages / modules for Python, R, or Stata; and output of other repositories.
When there are no external dependencies, following Rule #2 is sufficient to guarantee replicability. Otherwise, replicability may fail because a user does not know exactly what external resources were used to produce the output and/or the state of those resources has changed.
Dependencies are robustly documented if: (i) a user can easily see what external resources are used as inputs; (ii) we record these resources' location and state at the time the build script is run; (iii) we provide enough provenance information that a user has a good chance of being able to locate or replicate these resources in the future.
This method is not foolproof. External resources are outside of the repository and so may be outside of our control, and in some cases we may not be able to guarantee that they do not disappear or change. The practices below are designed to minimize this risk.
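Points (i) and (ii) above can be illustrated with a short sketch that records where each external file lives and what state it is in when the build runs. This is not the gslab_make implementation; the function names and the log layout (path, size, modification time, SHA-256 separated by `|`) are assumptions made for illustration.

```python
import hashlib
import time
from pathlib import Path


def describe_dependency(path):
    """Return location and state details for one external file."""
    p = Path(path).resolve()
    stat = p.stat()
    digest = hashlib.sha256()
    with p.open("rb") as f:
        # Hash in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "path": str(p),
        "size_bytes": stat.st_size,
        "modified": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stat.st_mtime)),
        "sha256": digest.hexdigest(),
    }


def write_dependency_log(paths, log_file):
    """Write one line per dependency (hypothetical layout) and return the lines."""
    keys = ("path", "size_bytes", "modified", "sha256")
    lines = []
    for path in paths:
        info = describe_dependency(path)
        lines.append("|".join(str(info[key]) for key in keys))
    Path(log_file).write_text("\n".join(lines) + "\n")
    return lines
```

Recording a content hash alongside the path means a later user can tell whether a file at the same location is still the file that produced the committed output.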
A repository typically contains all work related to a single research project (e.g., a single journal article). This includes the data, analysis, paper, slides, and any supplementary information like data use agreements, notes from seminar presentations, and so on.
In some cases, a repository builds a dataset or other resources that are then used by multiple projects.
For small to medium scale projects we commit data directly to the repository (using Git LFS). Large data and other large inputs / outputs are stored external to the repository, typically on Dropbox. The procedure for storing large inputs / outputs external to the repository can be found on the external dependencies page.
The key building block of our repositories is a module. A module is a directory with a build script at its top level, an /output/ subdirectory, and one or more input subdirectories such as /code/ or /input/.
A module is a self-contained unit whose output can always be reproduced by running the build script. A module must declare all of its inputs (files from other parts of the repository that it requires to run) as well as its external dependencies (files from outside the repository that it requires to run). Provided that the inputs and external dependencies are available, the module should run successfully regardless of what machine it is on or where it is located on disk. To help ensure this, it is critical that all paths should be referenced relative to the root of the module.
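One way to keep every path relative to the module root is to derive all directories from the location of the build script itself rather than from the current working directory. The sketch below assumes the directory names from the template; the helper name `module_paths` is hypothetical.

```python
from pathlib import Path

# Directory names follow the module layout described in this document.
MODULE_DIRS = ("code", "input", "external", "temp", "log", "output")


def module_paths(build_script):
    """Map each standard subdirectory name to its path under the module root.

    The module root is taken to be the directory containing the build script,
    so the result does not depend on where the module sits on disk.
    """
    root = Path(build_script).resolve().parent
    return {name: root / name for name in MODULE_DIRS}
```

Inside a build script this would typically be called as `module_paths(__file__)`, after which `paths["output"] / "table.csv"` resolves correctly no matter where the repository was cloned.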
In our template repository, we use the following structure as the basis for all of our modules:
- make.py: Build script.
- input.txt: File where inputs for the module are declared.
- external.txt: File where external dependencies for the module are declared.
- /code/: Scripts used to generate outputs.
- /input/: Files or links to files produced in the repository outside the module.
- /external/: Files or links to files outside the repository.
- /temp/: Temporary files used only within the module that should not be committed.
- /log/: Log files.
- /output/: Module output committed to the repository.
- /output_local/: Module output used outside the module but not committed (e.g., because the files are too large). See here for output local protocol.
To avoid potential mishaps with file duplication, inputs in /input/ and external dependencies in /external/ should preferably exist as uncommitted symbolic links to the original source. For example, we use the code library gslab_make to automatically create symbolic links using the information contained in input.txt and external.txt. One exception is for paper_slides, where copies of the inputs and external dependencies may be created and committed to facilitate easier LyX compilation.
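The idea of building /input/ from a declaration file can be sketched with the standard library alone. The format of input.txt is not described here, so the sketch assumes a hypothetical one: each non-comment line reads `link_name | target`, with the target given relative to the declaration file. It does not reproduce gslab_make's actual behavior or file format.

```python
import os
from pathlib import Path


def parse_declarations(text):
    """Parse 'link_name | target' lines, skipping blank lines and '#' comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, target = (part.strip() for part in line.split("|", 1))
        pairs.append((name, target))
    return pairs


def link_inputs(declaration_file, link_dir):
    """Create one symbolic link in link_dir per declared input."""
    declaration_file = Path(declaration_file).resolve()
    link_dir = Path(link_dir)
    link_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name, target in parse_declarations(declaration_file.read_text()):
        # Targets are interpreted relative to the declaration file's directory.
        source = (declaration_file.parent / target).resolve()
        if not source.exists():
            raise FileNotFoundError(f"Declared input not found: {source}")
        link = link_dir / name
        if link.is_symlink():
            link.unlink()  # replace a stale link from a previous run
        elif link.exists():
            raise FileExistsError(f"Refusing to overwrite a real file: {link}")
        os.symlink(source, link)
        created.append(link)
    return created
```

Failing loudly when a declared input is missing, and refusing to overwrite a real file, keeps the link directory an exact reflection of the declaration file.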
A build script is a script at the top level of a module that does the following things in sequence: (i) deletes the contents of /output/; (ii) checks and records the state of all inputs and external dependencies; (iii) executes the code and other steps needed to produce the output. Deleting the contents of /output/ in step (i) is essential; it guarantees that any output following Rule #2 is reproducible.
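The three steps can be sketched as a minimal standard-library build script. This is an illustration only and does not use gslab_make; the log file name `inputs.log`, the choice to record file sizes, and the assumption that all scripts are Python files under /code/ are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def clear_dir(path):
    """Step (i): delete the directory and recreate it empty."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)


def record_inputs(dirs, log_file):
    """Step (ii): list every file under the given directories with its size."""
    lines = []
    for d in dirs:
        d = Path(d)
        if not d.is_dir():
            continue  # a module may have no /external/ directory
        for f in sorted(d.rglob("*")):
            if f.is_file():
                lines.append(f"{f}|{f.stat().st_size}")
    log_file = Path(log_file)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    log_file.write_text("\n".join(lines) + "\n")
    return lines


def run_scripts(scripts, cwd):
    """Step (iii): run each script in order, stopping at the first failure."""
    for script in scripts:
        subprocess.run([sys.executable, str(script)], cwd=cwd, check=True)


def build(module_root, scripts):
    """Run steps (i) through (iii) for the module rooted at module_root."""
    root = Path(module_root).resolve()
    clear_dir(root / "output")
    record_inputs([root / "input", root / "external"], root / "log" / "inputs.log")
    run_scripts([root / "code" / name for name in scripts], cwd=root)
```

Because `check=True` raises on a non-zero exit code, a failing script stops the build instead of leaving a partially populated /output/ that looks complete.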
In our repository template, build scripts are written in Python and always named make.py. We use the code library gslab_make to automate steps (i) through (iii). For more information on gslab_make, refer to its documentation.
For some large projects, we use a more sophisticated suite of build tools called SCons. In this case, we declare the relationship between inputs and outputs more explicitly, and this allows the builder to reproduce only the portions of /output/ that need to be updated given what has been changed. We discuss the use of SCons in the appendix here.
Our template repository is organized into the following modules:
- /raw/: Raw data and source files.
- /data/: Produces intermediate data files from raw data.
- /analysis/: Executes statistical analysis and creates tables and figures.
- /paper_slides/: Creates PDFs of the paper draft and slides.
- /setup/: Contains setup instructions for packages / modules required for Python, R, and Stata.
- /lib/: Contains code libraries used by multiple modules in the repository.
- /docs/: Notes, data use agreements, design documents, and other documentation that is not code.
/raw/ is a repository-wide location for all raw source files and any associated documentation. /raw/ should additionally include a README.txt with the following information: (i) name and description of the raw source files; (ii) when and where the raw source files were obtained; (iii) the original form of the raw source files if any modifications were made. The template includes an example README.txt file; you should follow the format of this file by default.
For more complex projects, you may wish to further organize the contents of a module into submodules (e.g., having /descriptive/ and /estimation/ under /analysis/). In such cases, each submodule should function as an independent module with its own build script and the parent module can be left substantively empty.
Two configuration files are placed at the root of every repository: config.yaml and config_user.yaml.
config.yaml contains all settings and metadata for the repository that can be shared across users. In our template repository, this includes software requirements to check for in setup/setup_repository.py as well as maximum allowed file sizes to pass into gslab_make.check_repo_size. Metadata that pertains only to a single module can be stored in config_module.yaml at the top level of the module.
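To make the purpose of a maximum file size setting concrete, the sketch below scans a directory tree and reports files above a limit. It is not the gslab_make.check_repo_size implementation, and the function name and the megabyte-based limit are assumptions for illustration.

```python
from pathlib import Path


def files_over_limit(root, max_mb):
    """Return (path, size in MB) for each regular file under root above max_mb."""
    root = Path(root)
    limit = max_mb * 1024 * 1024
    oversized = []
    for f in sorted(root.rglob("*")):
        # Skip Git's own object store, symbolic links, and directories.
        if ".git" in f.relative_to(root).parts or f.is_symlink() or not f.is_file():
            continue
        size = f.stat().st_size
        if size > limit:
            oversized.append((f, size / (1024 * 1024)))
    return oversized
```

Symbolic links are skipped because linked inputs live elsewhere and are not committed with the module.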
config_user.yaml contains settings and metadata such as local paths that are specific to an individual user and thus should not be committed to Git. In our template repository, this includes local paths to external dependencies as well as executable names for locally installed software. A template named config_user_template.yaml is maintained in setup. When a user first downloads a repository they should copy this file to the repository root, rename it config_user.yaml, and update it with the appropriate local information.
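The copy-and-rename step can be scripted. The sketch below assumes the template sits at setup/config_user_template.yaml relative to the repository root, as described above; the helper name `ensure_user_config` is hypothetical.

```python
import shutil
from pathlib import Path


def ensure_user_config(repo_root):
    """Copy the template to config_user.yaml at the repository root if absent.

    Returns True if a new file was created, False if one already existed.
    """
    root = Path(repo_root)
    template = root / "setup" / "config_user_template.yaml"
    target = root / "config_user.yaml"
    if target.exists():
        return False  # never overwrite a user's local settings
    shutil.copyfile(template, target)
    return True
```

The user still needs to edit the new config_user.yaml by hand afterwards, since the local paths and executable names cannot be inferred.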
The default config_user.yaml sets the following executable names for locally installed software. If any executable name differs on your local machine, adjust the value in config_user.yaml accordingly.
software: executable
git-lfs: git-lfs
python: python
r: Rscript
stata: stata-mp
matlab: matlab
lyx: lyx
latex: latex
The wiki for a repository can contain administrative documentation such as:
- Longer-term ideas and todos
- Notes from team meetings, seminar presentations, etc.
- Correspondence with outside parties
- Status updates and replies for journal submissions
- Press coverage
- Account details for any relevant web services
Every repository should contain a README at the root level. This README should contain information about the software requirements necessary to reproduce all outputs as well as detailed setup instructions for initializing the repository. Furthermore, as discussed above, the /raw/ directory should have its own README.txt containing the provenance of all source files.
Git LFS is a separate piece of software that allows Git to handle large files. We require everyone running one of our repositories to have Git LFS installed because inadvertently committing large files directly can cause performance issues with the repository as it becomes overly large. This is particularly an issue with binary files as Git is incapable of diff-ing them.
To have a file tracked by Git LFS instead of Git, a pattern matching it must be added to the .gitattributes file. The default .gitattributes for the template repository tracks data, image, and PDF files including those with the following extensions: *.pdf, *.csv, *.dta, *.rda, *.rds, *.png, *.zip, *.tar.gz. If a repository creates data, image, or other large files that are not included in this list, their extensions should be added to .gitattributes.
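As a rough aid for keeping .gitattributes up to date, the sketch below generates tracking lines in the form that `git lfs track` writes and reports which patterns an existing file does not yet cover. The function names are hypothetical, and the check is a simple text match, not a full implementation of Git's attribute rules.

```python
LFS_ATTRIBUTES = "filter=lfs diff=lfs merge=lfs -text"


def lfs_attribute_lines(patterns):
    """Return one .gitattributes line per pattern, e.g. for '*.pdf'."""
    return [f"{pattern} {LFS_ATTRIBUTES}" for pattern in patterns]


def missing_patterns(gitattributes_text, patterns):
    """Return the patterns that have no LFS tracking line in the given text."""
    tracked = {
        line.split()[0]
        for line in gitattributes_text.splitlines()
        if "filter=lfs" in line.split()[1:]
    }
    return [pattern for pattern in patterns if pattern not in tracked]
```

For example, passing the current file contents together with a new pattern such as `*.parquet` returns that pattern, and `lfs_attribute_lines` then produces the line to append.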
Often, we do not want to commit a file to Git. For instance, we would not want to commit large outputs that we anticipate storing on Dropbox. The default .gitignore for the template repository ignores files including config_user.yaml, /external/ and /input/ subdirectories outside paper_slides (they should contain symbolic links, which are user-specific and can be automatically created via gslab_make), /output_local/ subdirectories (they contain large outputs that will be stored outside of GitHub), and intermediate/temporary files such as *.pyc files created by Python and *.Rhistory files created by R.