Terra incognita, or land unknown, was a term used by cartographers to describe lands that were known to exist but had not yet been explored well enough to map. While this phrase disappeard from maps over the last few centuries, it is ever-present in science. Biases in microbiome analysis have been known to exist since nearly the time the field began, but for many experiments, they remain terra incognita. Exploring this area provides us with a critical next step in the evolution of microbiome research and the constant drive to improve our work.
There is always one more thing to learn.
– Steve Jobs
At Zymo Research, our motto is, "The beauty of science is to make things simple," and it is with this principle guiding our efforts that we offer you this software package. Like many great scientific tools, this package hides much of the complexity of its inner workings in order simplify a common analysis. Under the hood of this system, you'll find DADA2, FIGARO, and the NGS MIQ Score Calculator, the last two of which were initially developed for this application. The mission of this application is identical to the ZymoBIOMICS mission: increasing the reproducibility and standardization of microbiome analysis. Due to the multiple dependencies and need for reproducibility of results and environment in this software package, we have packaged it entirely within a Docker image to ensure consistency between machines and runs.
Publication on this method is pending. Watch this space for news and a link once it is available on bioRXiv.
This guide assumes you already have Docker installed and running on your system as well as the privileges to execute Docker commands. To learn more about Docker, please check out Katacoda for several excellent interactive tutorials on the use of Docker.
git clone --recursive https://github.com/Zymo-Research/miqScore16SPublic.git
cd miqScore16SPublic
docker build -t miqscore16s .
The simplest method to run this analysis (in this writer's opinion, at least) is to have the data ready and waiting in a file structure already set up to mount to the container. Using this practice will ensure that the container finds the expected input files and will output data to the expected location as well as providing an easy system to script against for automation, with iteration only requiring the movement of files.
To do this, create the following directory structure somewhere on your system where you have read and write permissions:
.
+-- input
| +-- sequence
| +-- standard_submitted_R1.fastq
| +-- standard_submitted_R2.fastq
+-- working
+-- output
This directory can have any name desired (dataMountDirectory will be used from here on as its name) so long as it conforms to good naming practices. The files standard_submitted_R1.fastq and standard_submitted_R2.fastq are where this docker will look for read 1 and read 2 data, respectively, if no other files are specified. These file names will work even if the fastq files are compressed using gzip.
Due to the nature of 16S sequencing and library preparation, you will need to have a little information on your library preparation method prior to running this container. This information will consist of the maximum expected length of your amplicon and the lengths of both your forward and reverse primer. You will also need a name that describes the sample and conforms to good file-naming practice (avoiding the use of spaces and special characters, underscores are acceptable) A minimal command to run this container will be as follows, assuming that the container image was built under the name miqscore16s and the directory structure described above is in a directory called dataMountDirectory:
docker container run -v [pathTo]/dataMountDirectory:/data -e SAMPLENAME=My_Sample_Name -e FORWARDPRIMERLENGTH=16 -e REVERSEPRIMERLENGTH=24 -e AMPLICONLENGTH=510 miqscore16s
For more advanced usage, several parameters can be set using environment variables passed into the container at runtime. The environment variables that can be passed are as follows:
Variable | Type | Default | Description |
---|---|---|---|
SAMPLENAME | string | REQUIRED | A name for the sample, should conform to good file-naming practices |
FORWARDPRIMERLENGTH | integer | REQUIRED | Length of the forward primer |
REVERSEPRIMERLENGTH | integer | REQUIRED | Length of the reverse primer |
AMPLICONLENGTH | integer | REQUIRED | Maximum expected length for amplicon |
MAXREADCOUNT | integer | 0 | Maximum reads to allow for analysis, will stop if a larger set of reads is provided. (Enter a value less than 1 for no limits) |
FORWARDREADS | string | /data/input/sequence/standard_submitted_R1.fastq | Path to forward reads within the container (likely a mounted folder) |
REVERSEREADS | string | /data/input/sequence/standard_submitted_R2.fastq | Path to forward reads within the container (likely a mounted folder) |
SEQUENCEFOLDER | string | /data/input/sequence | Path to folder containing input sequences within the container |
RSCRIPTFOLDER | string | [folderWithPackage]/rscripts | Path to folder containing the R scripts to run DADA2 (you will probably never use this one) |
R1MAXEE | integer | 3 | Maximum expected error to allow for forward reads |
R2MAXEE | integer | 6 | Maximum expected error to allow for reverse reads |
TRUNCQ | integer | 2 | Truncate a read after this quality score |
MAXMISMATCH | integer | 2 | Maximum mismatches allowed during read merge |
MINOVERLAP | integer | 10 | Minimum desired overlap for read merging |
NOCLEANUP | boolean | FALSE | Leave all intermediate files behind |
OUTPUTFOLDER | string | /data/output | Folder within the container to write output data |
DATABASEFILE | string | [folderWithPackage]/reference/rdp_train_set_16.fa.gz | File containing the rRNA reference sequences |
TRIMPARAMETERDOWNSAMPLE | integer | -1 | Downsampling for FIGARO trimming parameter prediction |
TRIMPARAMETERPERCENTILE | integer | 83 | Expected error percentile for FIGARO to use when calculating trim parameters |
FILENAMINGSTANDARD | string | ZYMO | How sequence files will be named (other option is "illumina") |
All outputs will be written to the designated output folder for the container, which will unmount upon completion of the run.
There will be two primary output files, one report in HTML format and one in JSON format. Examples of these can be seen here for HTML and here for JSON. The HTML report is designed to be viewed in a browser and gives an overview of the results for the sample. The JSON report, while human-readable, is designed primarily to facilitate analysis using an automated script. It also provides much more detailed information on the results than the HTML report.
In addition to the two primary files, there will be a log file that can be used in the event of a problem with analysis for additional information on the run. Finally, there will be several files generated by the DADA2 pipeline. If you are familiar with DADA2, you will be familiar with these outputs.
We welcome and encourage contributions to this project from the microbiomics community and will happily accept and acknowledge input (and possibly provide some free kits as a thank you). We aim to provide a positive and inclusive environment for contributors that is free of any harassment or excessively harsh criticism. Our Golden Rule: Treat others as you would like to be treated.
We use a modification of Semantic Versioning to identify our releases.
Release identifiers will be major.minor.patch
Major release: Newly required parameter or other change that is not entirely backwards compatible Minor release: New optional parameter Patch release: No changes to parameters
- Michael M. Weinstein - Project Lead, Programming and Design - michael-weinstein
- Aishani Prem - Testing, Design - AishaniPrem
- Mingda Jin - Testing, Code Review - jinmingda
- Shuiquan Tang - Design - shuiquantang
- Jeffrey Bhasin - Design, Code Review - jeffbhasin
- Ryan Kemp - Design - Zymo Research
- Brian Janssen - Design - Zymo Research
- Christopher E. Mason - Guidance, Testing - Cornell University
See also the list of contributors who participated in this project.
This project is not currently licensed, but will likely be licensed under the GNU GPLv3 License - see the LICENSE file for details. This license restricts the usage of this application for non-open sourced systems. Please contact the authors for questions related to relicensing of this software in non-open sourced systems.
We would like to thank the following, without whom this would not have happened:
- The Python Foundation
- The staff at Zymo Research
- The microbiomics community
- Our customers