Molecular biology experiments, mass spectrometry-based proteomics, and reproducible data analysis in R
Brendon Smith
Launch in Google Colaboratory

Provided on GitHub with a CC-BY-4.0 license, which is commonly used for open-access scientific publications. I encourage you to use the materials in this repository for your own work. If you use this material, please attribute me and explain what you changed.
Science is an incredible tool for learning about the world. We use theory and experiment to generate new knowledge. In science, reproducibility occurs when different scientists do the same experiment and get results that agree. Our current scientific practices do not promote or reward reproducibility. As a result, the scientific community is experiencing a reproducibility crisis, in which published discoveries can't be reproduced by other labs, and sometimes can't even be repeated within the same lab by the same person. This is not authentic knowledge.
The reproducibility crisis is troubling. During my postdoc in a large molecular biology lab, I saw the reproducibility crisis unfold, both in the scientific literature and among my colleagues. Even more striking than the crisis itself was the lack of an insightful solution.
Documentation is the sine qua non of reproducibility. How can we hope to reproduce experiments if we don't know how they were done? Documentation must start at the beginning, with reproducible data analysis being preceded by reproducible experimental practices. No statistical adjustment can make up for lack of detailed metadata collected at the time the experiment is performed. Clear, annotated raw data should be provided, and data analyses should clearly describe each action taken from raw data to final analysis.
This repository is a practical example of reproducible scientific data analysis. I have attempted to provide, to the greatest extent possible, a complete documentation of the methods that led to the results presented. It's not perfect. The information is not complete, as I worked with others who don't care about documenting their work. Some aspects of the experiment didn't work well, but that's the point. Experiments don't usually turn out exactly according to plan. By carefully documenting the experiment, and sharing the results openly, I can understand what went wrong, and how to move forward in the most efficient way. That's how science should be.
It might seem strange to my scientific colleagues, who are mostly focused on career advancement and personal aggrandizement, that I would take so much time to analyze preliminary data from a pilot study like this. It's not just about the end result. If we want to address the reproducibility crisis, we need to focus on the process.
- NPR correspondent Richard Harris sums up the reproducibility crisis in his book Rigor Mortis.
- The surgeon and writer Atul Gawande wrote a book about his research, in which he found that distributing a checklist (protocol) to surgical team members reduced patient deaths by half. If world-class surgeons can benefit from improved protocols, scientists certainly can also.
- The scientific journal eLife is a leader in reproducible data analysis and publishing. I particularly enjoy the eLife Labs blog. The Editor-in-Chief, Nobel laureate Randy Schekman, has written about the reproducibility crisis and the damaging effects of luxury journals. Naomi Penfold, Innovation Officer at eLife, has made me aware of many of the tools below.
- These articles provide general discussions of reproducibility:
- Barba LA. The hard road to reproducibility. Science 2016.
- Etchells P. What are the roadblocks to successful scientific replications? The Guardian 2015.
- Loscalzo J. Irreproducible experimental results: Causes, (mis)interpretations, and consequences. Circulation 2012.
- Morrison SJ. Time to do something about reproducibility. eLife 2014.
- Sarewitz D. The pressure to publish pushes down quality. Nature 2016.
- This Cell Reports commentary is a valuable example of the importance of documentation for experimental reproducibility.
- Hines WC, Su Y, Kuhn I, Polyak K, Bissell MJ. Sorting out the FACS: A devil in the details. Cell Rep. 2014.
- Research reagents and materials, including antibodies, cell lines, and mice, appear to contribute substantially to lack of reproducibility.
- Baker M. Reproducibility crisis: Blame it on the antibodies. Nature 2015.
- Couzin-Frankel J. When mice mislead. Science 2013.
- Ioannidis JPA. Extrapolating from animals to humans. Sci. Transl. Med. 2012.
- Lorsch JR, Collins FS, Lippincott-Schwartz J. Fixing problems with cell lines: Technologies and policies can improve authentication. Science 2014.
- Martin B, Ji S, Maudsley S, Mattson MP. 'Control' laboratory rodents are metabolically morbid: Why it matters. PNAS 2010.
- Nature ironically reports on the reproducibility crisis, while continuing to publish trendy, irreproducible articles weekly.
- Baker M. 1,500 scientists lift the lid on reproducibility. Nature 2016.
- Shen H. Interactive notebooks: Sharing the code. Nature 2014.
- The citation style used above is my own custom style, adapted from the style I developed for my dissertation. It combines aspects of the Nature and BioMed Central (BMC) styles. The style is concise yet informative, light on punctuation and formatting, and designed for electronic viewing. DOI HTTPS hyperlinks are provided. Journal volume and page numbers are no longer relevant and have been removed. In addition to the style specifications, I use sentence case for article titles and title case for journal titles. A CSL version for use with Zotero and other citation managers is available in this public GitHub Gist.
- Binder turns GitHub repositories into reproducible computing environments. It uses code and dependency files to create Docker images that run in web browsers. Binder is a potentially great feature, but in my experience so far it's extremely slow and does not properly load additional R packages.
- Gigantum: Research project management and collaboration system. It version-controls your research materials, allows them to be easily shared and published, and bundles everything to run reproducibly in the cloud.
- Greene Integrative Genomics Laboratory at Penn:
- Bioinformatics lab that prioritizes transparent and computationally reproducible research.
- They developed a pipeline for continuous analysis of research data (see Nature Biotechnology paper, GitHub and eLife Labs blog post). Continuous Integration (CI) is combined with a persistent archive like Figshare or Zenodo, and data analysis is re-run every time a change is made.
- They wrote an open, collaborative deep learning review article.
- They published an analysis of Sci-Hub.
- Check out their GitHub.
- Hypothesis: Open annotations on the web.
- Project Jupyter
- Google Cloud Platform Podcast Episode 122 on Project Jupyter with Jessica Forde, Yuvi Panda and Chris Holdgraf
- Somers J. The scientific paper is obsolete. The Atlantic 2018.
- See below for details on Jupyter Notebook.
- Open Science Framework: Research project management and collaboration system. Integrates many other software tools and forms of data.
- Protocols.io: Open access repository for creation and sharing of scientific protocols.
- ScienceFair: Decentralized p2p science literature client. See the eLife Labs blog post about ScienceFair. So far, it can only access eLife articles, and even that doesn't really work.
- sciNote: Free electronic lab notebook.
- Stencila: Open document suite that can be used to write and run code in a computationally reproducible way. I recently attended an eLife webinar about Stencila. eLife is considering Stencila as part of a "Reproducible Document Stack" to generate their manuscripts.
- We-Sci: Tool to ensure proper attribution for scientific work.
- Whole Tale: Research project management system.
- Zenodo: Repository for digital materials to be permanently archived and stored with DOI versioning. Figshare is similar.
- Data Carpentry, which is sponsored by NumFOCUS, has a Reproducible Science Curriculum and holds workshops on reproducible data analysis in Python and R.
- The Harvard Institute for Applied Computational Science (IACS) provides free resources to the scientific computing community, such as the annual Computefest. See EDA.ipynb and grammarofdata.ipynb from Computefest 2018 for info on reproducible Exploratory Data Analysis (EDA) workflows.
- Vincent Carey (Harvard Medical School, Brigham & Women's Hospital) provided helpful resources for reproducible data analyses associated with his Repro2017 Harvard Catalyst talk.
To promote reproducible scientific work:
- Comprehensively document experiments and analyses.
- Format code files as computational narratives mixing prose and code with a tool like Jupyter Notebook.
- Version control code with Git and share code on a website like GitHub.
- Create a reproducible cloud computing environment using a tool like Binder.
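The version-control step above can be sketched with a few shell commands. This is a minimal illustration: the project directory, file names, and remote URL are placeholders, not part of this repository.

```shell
# Start fresh in a throwaway location (placeholder path)
rm -rf /tmp/my-analysis-project
mkdir -p /tmp/my-analysis-project && cd /tmp/my-analysis-project

# Initialize a Git repository and commit the analysis files
git init -q
echo "# My analysis" > README.md
git add README.md
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Add README"
git log --oneline | wc -l   # one commit so far

# Publish to a hosting service such as GitHub (placeholder URL; requires
# an existing remote, so these lines are left commented out)
# git remote add origin https://github.com/username/my-analysis-project.git
# git push -u origin main
```

From there, every change to the analysis is recorded as a commit, and the full history is visible to anyone who clones the repository.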
This is a summary report of an experiment I performed during my postdoc. The goal of this experiment was to identify a molecular complex associated with Nrf1, a protein our research group was studying. Nrf1 is also abbreviated NFE2L1, and should not be confused with Nuclear Respiratory Factor 1.
We began studying Nrf1 because it resides in a cellular organelle called the Endoplasmic Reticulum (ER). We study the ER and its roles in metabolism. We found that Nrf1 mediates the cellular response to cholesterol, and that it seemed to do this separately from its known function as a genetic transcription factor in the nucleus. Cholesterol metabolism occurs at the ER, and is very important in the liver, where cholesterol is metabolized and prepared for excretion.
We hypothesize that a group of other proteins interacts with Nrf1 to mediate its response to cholesterol at the ER. To test our hypothesis, we used proteomics, which identifies the proteins present in a sample using a technique called mass spectrometry.
I incorporated practices for reproducible scientific experimentation and data analysis throughout the project.
Supplementary data files, including the electronic lab notebook, protocols, datasheets and information on materials used, raw data, other data analyses, slides, and images, are available here.
- Data analysis was performed with the R computing language, and is provided in R Markdown and Jupyter Notebook formats.
- Jupyter Notebook combines prose and code to promote construction of reproducible computational narratives that configure the computing environment and precisely describe each step in the data analysis. When code from a reproducible computational narrative is run on another computer, there is a high probability that the same result will be obtained. Reproducibility.
If you haven't used R or R Markdown before, see my R guide.
- R Markdown is a document creation package based on Markdown (syntax for easy generation of formatted HTML documentation), knitr (report generation package) and pandoc (universal document converter).
- An R Markdown file contains three types of content: a YAML front matter header at the top of the file to specify output formats, Markdown-formatted text, and functional code chunks.
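A minimal sketch of those three components is shown below. The title, chunk name, and data file are illustrative, not the actual contents of this repository's analysis file.

````markdown
---
title: "Proteomics analysis"
author: "Brendon Smith"
output: github_document
---

Markdown-formatted text describes the analysis between code chunks.

```{r load-data}
# A functional code chunk (hypothetical file name)
data <- read.csv("raw_data.csv")
summary(data)
```
````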
- I use RStudio to work with R Markdown.
- I created an RStudio project, which is required for version control and package management.
- The R Markdown file uses the GitHub document output format, which produces standard Markdown in addition to HTML, for compatibility with GitHub.
- `renv` was used to manage R packages for the project.
  - `renv` helps avoid problems caused by different package versions and installations by giving each project its own isolated package library.
  - `renv` is separate from general package managers like Homebrew, which is used to install R and RStudio.
- This project previously used Packrat, a predecessor to `renv` that is now "soft-deprecated."
  - Migration from Packrat to `renv` is simple. Run the following command in the console: `renv::migrate("~/path/to/repo")`.
  - Then, if packages aren't already installed, run `renv::restore()`, or in the RStudio Packages pane, navigate to renv -> Restore Library.
JupyterLab can be used to run Jupyter Notebook files.
If running the Jupyter notebook file locally, I would suggest using JupyterLab within a virtual environment. Here are some setup instructions:
- I install Python on macOS via Homebrew, and then install JupyterLab inside a virtual environment. Once installation is complete, navigate to your project's directory, install dependencies, and run JupyterLab.
- Here are the necessary commands:

```sh
brew install python3
cd path/where/you/want/jupyterlab
python3 -m venv .venv
source .venv/bin/activate
pip install jupyterlab
# Install any JupyterLab extensions at this point
jupyter labextension install @jupyterlab/toc
# Run JupyterLab
jupyter lab
```
- Binder can run Jupyter Notebooks in the cloud by creating Docker containers. It takes a long time to build containers. It works well with Python, but I found that it was not properly loading additional packages when running R.
- Google provides a cloud-based Jupyter Notebook environment called Colaboratory. It originally just supported Python, but can now run R. Unfortunately, it requires a Google login to run code.
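For reference, Binder builds its Docker images from dependency files in the repository root (via the repo2docker tool). For an R project, a minimal configuration might look like the following; the R version, snapshot date, and package names are illustrative assumptions, not this repository's actual configuration.

`runtime.txt` pins the R version and a CRAN snapshot date:

```
r-3.6-2019-01-01
```

`install.R` installs the R packages the notebooks need:

```r
# Packages to install when the Binder image is built (illustrative)
install.packages(c("tidyverse", "knitr"))
```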
Complement C1q proteins A, B, C (green dots in the plot above) were identified as potentially interacting with Nrf1 in the setting of liver cholesterol accumulation. The experiment did have notable limitations, which prompted us to refine our methods and continue with further experiments.