---
title: "containerit: Generating Dockerfiles for reproducible research with R"
date: 22 July 2019
bibliography: paper.bib
---
Linux containers have become a promising tool to increase transparency, portability, and reproducibility of research in several domains and use cases: data science [@boettiger_introduction_2015], software engineering research [@cito_using_2016], multi-step bioinformatics pipelines [@kim_bio-docklets_2017], standardised environments for exchangeable software [@belmann_bioboxes_2015], computational archaeology [@marwick_computational_2017], packaging algorithms [@hosny_algorun_2016], or geographic object-based image analysis [@knoth_reproducibility_2017].
Running an analysis in a container increases the reliability of a workflow, as the packaged code can be executed independently of the author's computer and its particular configuration and dependencies.
However, capturing a computational environment in containers can be complex, making container use difficult for domain scientists with limited programming experience.
`containerit` opens up the advantages of containerisation to a much larger user base by assisting researchers who are unfamiliar with Linux, command lines, or containerisation in packaging workflows based on R [@r_2018] in container images, using only user-friendly R commands.
Recently, containerisation has taken off as a technology for packaging applications and their dependencies for fast, scalable, and secure sandboxed deployments in cloud-based infrastructures [cf. @osnat_brief_2018].
The most widely used containerisation software is Docker, which has the following core building blocks (cf. Docker: Get Started):

- The image is built from the instructions in a recipe called `Dockerfile`.
- The image is executed as a container using a container runtime.
- An image can be moved between systems as a file (image tarball) or via an image registry.
- A `Dockerfile` may use the image created by another `Dockerfile` as its starting point, the so-called base image.
- While containers can be manually altered, the common practice is to conduct all configuration with the scripts and instructions in the `Dockerfile`.
An important advantage of containers over virtual machines is that the duality between recipe and image provides an additional layer of transparency and safeguarding.
The `Dockerfile` and image can be published alongside a scientific paper to support peer review and, to some extent, preserve the original results [@nust_opening_2017].
Even if an image cannot be executed or a `Dockerfile` can no longer be built, the instructions in the `Dockerfile` are human-readable, and files in the image can be extracted to recreate an environment that closely resembles the original.
Further useful features are (a) portability, thanks to a single runtime dependency, which allows readers to explore an author's virtual laboratory, including complex dependencies or custom-made code, either on their machines or in cloud-based infrastructures [e.g., by using Binder, see @jupyter_binder_2018], and (b) transparency, because an image's filesystem can be easily inspected.
This way, containers can enable verification of reproducibility and auditing without requiring reviewers to manually download, install, and re-run analyses [@beaulieu-jones_reproducibility_2017].
Container preservation is an active field of research [@rechert_preserving_2017; @emsley_framework_2018]. It is reasonable to assume that key stakeholders interested in workflow preservation, such as universities or scientific publishers, should be able to operate container runtimes on a time scale comparable to the data storage requirements of funding agencies, e.g., 10 years in the case of the German DFG or the British EPSRC. To enable and leverage the stakeholders' infrastructure, container creation must become easier and more widespread.
The package `containerit` automates the generation of `Dockerfile`s for workflows in R, based on images by the Rocker project [@RJ-2017-065].
The core feature of `containerit` is that it transforms the local session information into a set of instructions that can be serialised as a `Dockerfile`, as shown in the code snippet below:
```
> suppressPackageStartupMessages(library("containerit"))
> my_dockerfile <- containerit::dockerfile(from = utils::sessionInfo())
> print(my_dockerfile)
FROM rocker/r-ver:3.5.2
LABEL maintainer="daniel"
RUN export DEBIAN_FRONTEND=noninteractive; apt-get -y update \
  && apt-get install -y git-core \
	libcurl4-openssl-dev \
	libssl-dev \
	pandoc \
	pandoc-citeproc
RUN ["install2.r", "curl", "digest", "evaluate", "formatR", \
 "futile.logger", "futile.options", "htmltools", "jsonlite", \
 "knitr", "lambda.r", "magrittr", "Rcpp", "rjson", \
 "rmarkdown", "rsconnect", "semver", "stevedore", "stringi", \
 "stringr", "xfun", "yaml"]
WORKDIR /payload/
CMD ["R"]
```
The created `Dockerfile` has installation instructions for the loaded packages and their system dependencies.
It uses the `r-ver` stack of Rocker images, matching the R version to the environment encountered locally by `containerit`.
These images use MRAN snapshots to control installed R package versions in a reproducible way.
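For illustration, the pinned repository can be inspected from within such a container; the snapshot date shown below is a hypothetical example and depends on the image version:

```r
# Run inside a rocker/r-ver container: the default CRAN mirror is pinned
# to a dated MRAN snapshot (the date shown here is illustrative)
getOption("repos")
#>                                             CRAN
#> "https://mran.microsoft.com/snapshot/2019-03-11"
```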
The system dependencies required by these packages are identified using the `sysreqs` package [@csardi_sysreqs_2019] and the corresponding database and API.
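As a rough sketch of that lookup, the `sysreqs` package can query the web API for the system requirements of the packages declared in a `DESCRIPTION` file (network access is required; the file path is a placeholder):

```r
# Return the system dependencies (e.g., apt packages) for the
# dependencies listed in a DESCRIPTION file
sysreqs::sysreqs("DESCRIPTION")
```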
`dockerfile()` is the package's main user-facing function and accepts session information objects, session information saved in a file, a set of R commands, an R script file, a `DESCRIPTION` file, or an R Markdown document [@allaire_rmarkdown_2018].
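The following hedged sketch illustrates some of these inputs; the file names are placeholders, not part of a concrete project:

```r
library("containerit")

# From an R script, which is executed in a fresh session (see below)
df_script <- dockerfile(from = "analysis.R")

# From an R Markdown document
df_rmd <- dockerfile(from = "paper.Rmd")

# From a previously saved session information object
df_session <- dockerfile(from = readRDS("session_info.rds"))
```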
Static program analysis using the package `automagic` [@brokamp_automagic_2017] is used to increase the chances that the capturing environment has all required packages available, for example when creating Dockerfiles for R Markdown documents as a service [@nust_reproducibility_2018].
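As a hedged sketch of such an analysis, `automagic`'s `parse_packages()` function can scan a code file, here a placeholder script, for package usage without executing it:

```r
# Detect packages from library()/require()/:: usage in the code
automagic::parse_packages("analysis.R")
```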
To capture the workflow environment, `containerit` executes the whole workflow in a new R session using the package `callr` [@csardi_callr_2018], because static program analysis can be broken by helper functions such as `xfun::pkg_attach()` [@xie_xfun_2018], by unintended side effects, or by seemingly clever or user-friendly yet customised ways of loading packages (cf. the first lines of the R script file `tgis_a_1579333_sm7524.r` in https://doi.org/10.6084/m9.figshare.7757069.v1).
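The snippet below illustrates one such pattern that a purely static scan can miss, because the package names only exist in a character vector at run time:

```r
# Packages named in a variable and attached via a helper function;
# a static search for library() calls would not find them
pkgs <- c("knitr", "rmarkdown")
xfun::pkg_attach(pkgs)
```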
Further parameters of the function control, for example, image metadata, the base image, versioned installations, and the filtering of R packages already installed in the base image.
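A hedged sketch of such a call is shown below; the argument names follow the package documentation at the time of writing, and all values are placeholders:

```r
df <- dockerfile(
  from = "analysis.R",
  image = "rocker/geospatial:3.5.2",  # alternative base image
  maintainer = "Jane Doe",            # image metadata (LABEL maintainer)
  versioned_packages = TRUE,          # pin R package versions
  filter_baseimage_pkgs = TRUE        # skip packages already in the base image
)
```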
The package `containerit`'s main contribution is that it allows automated capturing of runtime environments as `Dockerfile`s based on literate programming workflows [@gentleman_statistical_2007] to support reproducible research.
Together with `stevedore` [@fitzjohn_stevedore_2019], `containerit` enables a completely R-based creation and manipulation of Docker containers.
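A minimal sketch of such an all-R workflow, assuming a local Docker daemon is available; the image tag is a placeholder:

```r
library("containerit")

# Generate and serialise the Dockerfile for the current session
df <- dockerfile(from = utils::sessionInfo())
write(df, file = "Dockerfile")

# Build and run the image without leaving R, using stevedore
docker <- stevedore::docker_client()
docker$image$build(context = ".", tag = "my-workflow")
docker$container$run("my-workflow", rm = TRUE)
```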
Using `containerit` only minimally affects researchers' workflows, because it can be applied after completing a workflow, while at the same time the captured snapshots can enhance the scholarly publication process (in particular review, interaction, and preservation) and may form a basis for more reusable and transparent publications.
In the future, `containerit` may support alternative container software such as Singularity [@kurtzer_singularity_2017], enable parametrisation of container executions and pipelines as demonstrated by Kliko [@molenaar_klikoscientific_2018], or support proper accreditation of software [@codemeta; @katz_software_2018].
# Related Work
`renv` is an R package for managing reproducible environments, providing isolation, portability, and pinned versions of R packages, but it does not handle system dependencies.
The Experiment Factory similarly focuses on ease of use when creating `Dockerfile`s for behavioural experiments, yet it uses CLI-based interaction and generates extra shell scripts to be included in the images.
ReproZip [@ChirigatiRSF16] packages files identified by tracing into a self-contained bundle, which can be unpacked to a Docker container/`Dockerfile`.
In the R domain, the package `dockerfiler` [@fay_dockerfiler_2018] provides an object-oriented API for manual `Dockerfile` creation, and `liftr` [@xiao_liftr_2018] creates a `Dockerfile` based on fields added to the metadata header of an R Markdown document.
`automagic` [@brokamp_automagic_2017], Whales, `dockter`, and `repo2docker` use static program analysis to create environment descriptions from common project configuration files for multiple programming languages.
Specifically, `automagic` analyses R code and can store dependencies in a bespoke YAML format.
Whales and `dockter` provide different output formats, including `Dockerfile`s.
Finally, `repo2docker` primarily creates containers for interactive notebooks to run as a Binder [@jupyter_binder_2018] but does not actively expose a `Dockerfile`.
None of these tools applies the strict code-execution approach that `containerit` does.
# Acknowledgements

This work is supported by the project Opening Reproducible Research (Offene Reproduzierbare Forschung) funded by the German Research Foundation (DFG) under project numbers PE 1632/10-1 and PE 1632/17-1.