Repository Structure

Matthew Gentzkow edited this page May 1, 2019 · 54 revisions

The overriding goal of our approach is to guarantee that all output is reproducible.

A user with no special experience should be able to clone the repository, delete everything other than the code and raw input files, and reproduce all output including intermediate data files, statistical analysis, tables and figures, and PDFs of the paper draft and slides. Doing this should be a straightforward and intuitive process, and while the computational time may be substantial, the human time required should not. Not only must the current output be reproducible, but all previous output must be as well (e.g., results in any previous draft or issue deliverable). The output not only needs to be reproducible today; it needs to remain so in the future.

Golden Rules

There are two key rules that are the bedrock of reproducibility.

Golden Rule #1: Important commits must follow a complete run of the relevant build script

Important commits are any commit or merge to master, any commit that defines a final issue deliverable, and the final commit before a pull request. See discussion on the Workflow page here.

This is our version of the "one rule" of Github Flow -- that anything in the master branch is deployable.

Without this rule, a user has no way of knowing what combination of code and inputs produced a given set of output, and there is no guarantee that the code that produced it is the same as the code in the commit. Perhaps some scripts were run but not others. Perhaps edits were made to a script after it was run. Perhaps some output files were changed manually or by processes outside the repository.

Running the build scripts before committing eliminates this ambiguity. The first step in a build script is to delete and recreate the relevant /output/ directory. So long as no changes are made between the completion of the build script and the time of commit, the user can be confident that everything in /output/ was created by the build script and the code that it calls as of that commit. Re-running the build script at that commit must reproduce the output (provided Rule #2 is followed as well!).

Golden Rule #2: All external dependencies must be robustly documented

An external dependency is anything outside the repository that the code uses as input. This can include data files that are too large to be committed directly (even with Git LFS); code libraries / packages / modules for Python, R, or Stata; and output of other repositories.

When there are no external dependencies, following Rule #1 is sufficient to guarantee replicability. Otherwise, replicability may fail because a user does not know exactly what external resources were used to produce the output and/or the state of those resources has changed.

Dependencies are robustly documented if: (i) a user can easily see what external resources are used as inputs; (ii) we record these resources' location and state at the time the build script is run; (iii) we provide enough provenance information that a user has a good chance of being able to locate or replicate these resources in the future.
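As an illustration of point (ii), the sketch below records the location and state of external files using only the Python standard library. It is not the gslab_make implementation; the function names `describe_resource` and `write_state_log` and the choice of fields (size, modification time, SHA-256 digest) are assumptions made for this example.

```python
import hashlib
import json
import os
import time


def describe_resource(path, chunk_size=1 << 20):
    """Return the location and state of one external file as a plain dict."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": info.st_size,
        "modified": time.strftime(
            "%Y-%m-%dT%H:%M:%S", time.localtime(info.st_mtime)
        ),
        "sha256": digest.hexdigest(),
    }


def write_state_log(paths, log_path):
    """Write one JSON record per external resource to log_path."""
    records = [describe_resource(p) for p in paths]
    with open(log_path, "w") as handle:
        json.dump(records, handle, indent=2)
    return records
```

Committing a log like this alongside the output lets a later user check whether an external file still matches the one that was used.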

This method is not foolproof. External resources are outside of the repository and so may be outside of our control, and in some cases we may not be able to guarantee that they do not disappear or change. The practices below are designed to minimize this risk.

Scope

A repository typically contains all work related to a single research project (e.g., a single journal article). This includes the data, analysis, paper, slides, and any supplementary information like data use agreements, notes from seminar presentations, and so on.

In some cases, a repository builds a dataset or other resources that are then used by multiple projects.

For small to medium scale projects we commit data directly to the repository (using Git LFS). Large data and other large inputs / outputs are stored external to the repository, typically on Dropbox. The procedure for storing large inputs / outputs external to the repository can be found here.

Modules

The key building block of our repositories is a module. A module is a directory with a build script at its top level, an /output/ subdirectory, and one or more input subdirectories such as /code/ or /input/.

A module is a self-contained unit whose output can always be reproduced by running the build script. A module must declare all of its inputs (files from other parts of the repository that it requires to run) as well as its external dependencies (files from outside the repository that it requires to run). Provided that the inputs and external dependencies are available, the module should run successfully regardless of what machine it is on or where it is located on disk. To help ensure this, all paths must be referenced relative to the root of the module.
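One way to keep paths relative to the module root in Python is to derive the root from the location of the build script and join every other path onto it. The helper names `module_root` and `in_module` below are hypothetical and are not part of gslab_make; this is a minimal sketch of the idea.

```python
from pathlib import Path


def module_root(script_file):
    """Return the directory that contains the given build script."""
    return Path(script_file).resolve().parent


def in_module(root, relative):
    """Join a module-relative path onto the module root."""
    relative = Path(relative)
    if relative.is_absolute():
        # Absolute paths tie the module to one machine, so reject them.
        raise ValueError(f"expected a path relative to the module: {relative}")
    return Path(root) / relative


# Typical use inside a build script at the top level of a module:
#   ROOT = module_root(__file__)
#   table = in_module(ROOT, "output/table.csv")
```

Because nothing here depends on the current working directory, the module behaves the same wherever the repository is cloned.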

In our template repository, we use the following structure as the basis for all of our modules:

Standard Module Structure

  • /raw/: Raw data and source files.
  • /code/: Scripts used to generate outputs.
  • /input/: Files or links to files produced in the repository outside the module.
  • /external/: Files or links to files outside the repository.
  • /log/: Log files.
  • /output/: Outputs for the module.
  • externals.txt: File where external dependencies for the module are declared.
  • inputs.txt: File where inputs for the module are declared.
  • make.py: Build script.

/raw/ should contain raw source files for the module and any associated documentation. /raw/ should additionally include a readme.txt with the following information: (i) name and description of the raw source files; (ii) when and where the raw source files were obtained; (iii) the original form of the raw source files if any modifications were made. The template includes an example readme.txt file; you should follow the format of this file by default.

To avoid potential mishaps with file duplication, inputs in /input/ and external dependencies in /external/ should exist as symbolic links to the original source. We use the code library gslab_make to create these symbolic links automatically from the information contained in inputs.txt and externals.txt.
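The sketch below shows the general idea of turning a declaration file into symbolic links. It assumes, for illustration only, that each line has the form `destination | source`; check the gslab_make documentation for the format it actually expects. The functions `parse_link_file` and `create_links` are hypothetical stand-ins, not the library's API.

```python
import os


def parse_link_file(text):
    """Parse lines of the assumed form 'destination | source'."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        destination, source = (part.strip() for part in line.split("|", 1))
        pairs.append((destination, source))
    return pairs


def create_links(pairs, link_dir):
    """Create one symbolic link in link_dir per (destination, source) pair."""
    os.makedirs(link_dir, exist_ok=True)
    created = []
    for destination, source in pairs:
        link = os.path.join(link_dir, destination)
        if os.path.lexists(link):
            os.remove(link)  # replace a stale link left by an earlier run
        # Relative sources are resolved against the current working directory.
        os.symlink(os.path.abspath(source), link)
        created.append(link)
    return created
```

Because the links point at the original files, nothing is copied and there is only ever one version of each input on disk.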

Build Scripts

A build script is a script at the top level of a module that does the following things in sequence: (i) deletes the contents of /output/; (ii) checks and records the state of all inputs and external dependencies; (iii) executes the code and other steps needed to produce the output. Step (i) is the most important. Deleting the contents of /output/ guarantees that everything in /output/ after the run was produced by that run, so any output committed under Rule #1 is reproducible.
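The three steps can be sketched with the Python standard library as follows. This is an illustrative outline only, not gslab_make's API: the function names and the log file name `inputs.log` are assumptions, and each script is assumed to be a Python file given by a path relative to the module root.

```python
import os
import shutil
import subprocess
import sys


def clear_output(output_dir):
    """Step (i): delete and recreate the output directory."""
    if os.path.isdir(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)


def record_inputs(input_paths, log_file):
    """Step (ii): log the size and modification time of each input."""
    with open(log_file, "w") as log:
        for path in sorted(input_paths):
            info = os.stat(path)
            log.write(f"{path}\t{info.st_size}\t{info.st_mtime}\n")


def run_scripts(scripts, cwd):
    """Step (iii): run each script in order, stopping at the first failure."""
    for script in scripts:
        subprocess.run([sys.executable, script], cwd=cwd, check=True)


def build(module_root, scripts, input_paths):
    """Run steps (i) through (iii) for the module at module_root."""
    output_dir = os.path.join(module_root, "output")
    log_dir = os.path.join(module_root, "log")
    os.makedirs(log_dir, exist_ok=True)
    clear_output(output_dir)
    record_inputs(
        [os.path.join(module_root, p) for p in input_paths],
        os.path.join(log_dir, "inputs.log"),
    )
    run_scripts(scripts, cwd=module_root)
```

Using `check=True` makes the build stop as soon as a script fails, so a partial run is not mistaken for a complete one.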

In our repository template, build scripts are written in Python and always named make.py. We use the code library gslab_make to automate steps (i) through (iii). For more information on gslab_make, refer to its documentation.

For some large projects, we use a more sophisticated suite of build tools called SCons. In this case, we declare the relationship between inputs and outputs more explicitly, and this allows the builder to reproduce only the portions of /output/ that need to be updated given what has been changed. We discuss the use of SCons in the appendix here.

Standard Directories

Our template repository contains the following standard directories:

  • /data/: Module to generate intermediate data files.
  • /analysis/: Module to conduct statistical analysis and create tables and figures.
  • /paper_slides/: Module to create PDFs of the paper draft and slides.
  • /setup/: Directory containing setup instructions for packages / modules required for Python, R, and Stata.
  • /lib/: Directory containing repository-wide code libraries.

You should customize your directories as needed. One common additional directory is /docs/, where documents such as data use agreements and experimental design are saved.

config.yaml and config_user.yaml

At the root of the template repository are two configuration files: config.yaml and config_user.yaml.

config.yaml is a one-stop shop for all repository-wide metadata. In our template repository, this includes software requirements to check for in setup/setup_repository.py as well as maximum allowed file sizes to pass into gslab_make.check_repo_size.

Metadata that pertains only to a single module can be stored in config_module.yaml at the top level of the module.

User-specific metadata should be stored in config_user.yaml and should not be committed to Github. In our template repository, this includes paths to external dependencies as well as executable names for software. Because config_user.yaml is not committed, a template named config_user_template.yaml should be maintained in /setup/ so that new users know what user-specific metadata is needed to initialize the repository.
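A simple check that a user's configuration covers everything in the template might look like the sketch below. It works on already-parsed dictionaries (loading the YAML files is left out), and the function name `missing_keys` and the keys in the usage example are hypothetical, not taken from the template repository.

```python
def missing_keys(template, user, prefix=""):
    """List dotted names of template keys that the user config lacks."""
    missing = []
    for key, value in template.items():
        name = f"{prefix}{key}"
        if key not in user:
            missing.append(name)
        elif isinstance(value, dict):
            if isinstance(user[key], dict):
                # Descend into nested sections of the configuration.
                missing.extend(missing_keys(value, user[key], prefix=name + "."))
            else:
                missing.append(name)
    return missing


# Hypothetical example: the template asks for a data root and a Stata executable.
template = {"external": {"data_root": ""}, "local": {"stata": ""}}
user = {"external": {}}
print(missing_keys(template, user))  # → ['external.data_root', 'local']
```

Running a check like this during setup gives a new user an explicit list of what still needs to be filled in.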

External Dependencies

External dependencies refer to any file outside the repository that the code uses as an input. For instance, this can include data files that are too large to be committed directly.

All external dependencies should be specified in config_user.yaml and any reference to external dependencies in code should be made via an import of config_user.yaml. Within a module, the following protocol for external dependencies should be used:

  • Specify external dependencies in config_user.yaml.
  • Create symbolic links to external dependencies using gslab_make.link_externals.
  • Reference external dependencies via symbolic links in /external/ rather than by their actual paths.

When specifying external dependencies in config_user.yaml, refer to the top level directory containing the external dependencies. Paths to individual files or subdirectories within it should instead be specified when creating symbolic links. The motivation is to reflect accurately that it is the location of the top level directory, not its contents, that is user-specific.
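The split between a user-specific root and a repository-fixed relative path can be sketched as follows. In the template repository the linking is done by gslab_make.link_externals; the functions below are hypothetical stand-ins that show the idea, and the `external` section of the configuration dictionary is an assumed layout.

```python
import os


def resolve_external(config_user, root_key, relative_path):
    """Join a user-specific root with a repository-fixed relative path."""
    root = os.path.expanduser(config_user["external"][root_key])
    return os.path.join(root, relative_path)


def link_external(config_user, root_key, relative_path, external_dir, link_name):
    """Create a symbolic link in external_dir to one external file."""
    source = resolve_external(config_user, root_key, relative_path)
    if not os.path.exists(source):
        raise FileNotFoundError(f"external dependency not found: {source}")
    os.makedirs(external_dir, exist_ok=True)
    link = os.path.join(external_dir, link_name)
    if os.path.lexists(link):
        os.remove(link)  # replace a stale link left by an earlier run
    os.symlink(os.path.abspath(source), link)
    return link
```

Only the root stored under `root_key` differs between users; the relative path and the link name are the same for everyone and can be committed.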

Output local

A special case of an external dependency arises when a module creates a large intermediate file that must be stored outside of Github. In that case, the following protocol should be used:

  • In the module, save the large intermediate file in subdirectory /output_local/.
  • Manually store the large intermediate file outside of Github (e.g., on Dropbox), making sure to document when the file was stored as well as the hash of the commit used to generate it.
  • For any downstream modules that use the large intermediate file, follow the standard procedure for using an external dependency. Do not reference the file in /output_local/.
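The documentation called for in the second step could be kept as a small record written next to the stored file. The sketch below is one possible approach rather than a prescribed procedure: the function names, the `.provenance.json` suffix, and the recorded fields are all assumptions, and `current_commit` requires Git to be installed and `repo_dir` to be inside a repository.

```python
import datetime
import json
import os
import subprocess


def current_commit(repo_dir="."):
    """Return the hash of the checked-out commit via `git rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def write_provenance(file_path, commit_hash, stored_at=None):
    """Write a JSON record of when a file was stored and which commit made it."""
    if stored_at is None:
        stored_at = datetime.datetime.now().isoformat(timespec="seconds")
    record = {
        "file": os.path.basename(file_path),
        "size_bytes": os.path.getsize(file_path),
        "commit": commit_hash,
        "stored_at": stored_at,
    }
    with open(file_path + ".provenance.json", "w") as handle:
        json.dump(record, handle, indent=2)
    return record
```

A call such as `write_provenance(path, current_commit())` made right after the build ties the stored file to the commit that generated it.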

Wiki & Readme

The wiki for a repository should contain the following administrative documentation:

  • Longer-term ideas and todos
  • Notes from team meetings, seminar presentations, etc.
  • Correspondence with outside parties
  • Status updates and replies for journal submissions
  • Press coverage
  • Account details for any relevant web services

Every repository should contain a README at the root level. This README should contain information about the software requirements necessary to reproduce all outputs as well as detailed setup instructions for initializing the repository. Furthermore, each module of the repository should contain an additional README (e.g., readme.txt) documenting all raw source files.

LFS and .gitattributes

Git LFS is a separate piece of software that allows Git to handle large files. We require everyone running one of our repositories to have Git LFS installed because inadvertently committing large files directly can cause performance issues with the repository as it becomes overly large. This is particularly an issue with binary files as Git is incapable of diff-ing them.

To have a file be tracked by Git LFS instead of Git, a pattern matching it must be added to the .gitattributes file. The default .gitattributes for the template repository tracks the following files with Git LFS:

*.csv filter=lfs diff=lfs merge=lfs -text
*.dta filter=lfs diff=lfs merge=lfs -text
*.rda filter=lfs diff=lfs merge=lfs -text
*.rds filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text

.gitignore

Often, we do not want to commit a file to Git. For instance, we would not want to commit large outputs that we anticipate storing on Dropbox. The default .gitignore for the template repository ignores the following files:

# `config_user.yaml` is ignored as it contains user-specific settings
*config_user.yaml

# `/external/` and `/input/` subdirectories are ignored as they should contain symbolic links, which can be automatically created via `gslab_make`
*external/
*input/

# `/output_local/` subdirectories are ignored as they contain large outputs that will be stored outside of Github
*output_local/

# The following extensions indicate intermediate files generated by certain programs during compiling
*.pyc
*.lyx~
*.lyx#
*.lyx.emergency
*.Rhistory
*.Rapp.history
*.DS_Store
*.aux
*.fls
*.lof
*.lot
*.nav
*.snm
*.toc
*.out
*.svn
*.ipynb