FastQCS3: Fast Quantitative Checking of 16S Gene Sequencing
The purpose of this software is to give users, primarily microbiome researchers in academia, a tool to run quality checking immediately after sequencing results become available. This package is run from the command line and outputs a link to an interactive dashboard in a web browser. The four main analyses performed include: sequencing quality, relative abundances, alpha diversity metrics, and beta diversity metrics.
This package relies on QIIME2, a previously published microbiome analysis package, to pre-process input data and on Biopython to extract sequencing quality metrics. FastQCS3 uses QIIME2's data processing capabilities while focusing on building an interactive dashboard with more useful visualization tools than QIIME2's .qza formats. To use FastQCS3, QIIME2, Biopython, and Dash must all be installed in your working environment. QIIME2 runs natively only on macOS and Linux; Windows users need Windows Subsystem for Linux. Information on installing Windows Subsystem for Linux is found below.
You can follow the instructions here to install Miniconda.
conda update conda
sudo apt-get install zip unzip
Once you have Miniconda, you can create an environment using the provided environment.yml file contained in this repo and the following command.
conda env create -n fastqcs3 --file environment.yml
Due to the size of the QIIME2 package, creating the environment can take a while. Once it has finished, activate your environment with `conda activate fastqcs3`.
You can deactivate at any time using `conda deactivate`.
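Once the environment is active, a quick way to confirm that the three required packages are visible to Python is to look up their import names. The snippet below is a minimal sketch and not part of FastQCS3; the helper name `missing_packages` is hypothetical, and it assumes the usual import names `qiime2`, `Bio` (Biopython), and `dash`.

```python
import importlib.util

# Import names for the three packages this README lists as requirements.
REQUIRED_MODULES = {"QIIME2": "qiime2", "Biopython": "Bio", "Dash": "dash"}


def missing_packages(required=REQUIRED_MODULES):
    """Return the display names of required packages that cannot be found."""
    return [
        name
        for name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing from this environment:", ", ".join(missing))
else:
    print("QIIME2, Biopython, and Dash are all importable.")
```

If anything is reported missing, re-creating the environment from `environment.yml` is likely the simplest fix.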
If you have followed the instructions up to this point and you have created your environment from the `environment.yml` file, move straight to the Operating Instructions section.
To create an environment from scratch, you'll have to download QIIME2 using the following instructions and install Biopython and Dash as well.
conda install wget
If `conda install wget` doesn't work, you can also try `pip install wget` or `sudo apt-get install wget`, or use a package manager like Homebrew (`brew install wget`).
wget https://data.qiime2.org/distro/core/qiime2-2020.11-py36-osx-conda.yml
conda env create -n qiime2-2020.11 --file qiime2-2020.11-py36-osx-conda.yml
rm qiime2-2020.11-py36-osx-conda.yml
wget https://data.qiime2.org/distro/core/qiime2-2020.11-py36-linux-conda.yml
conda env create -n qiime2-2020.11 --file qiime2-2020.11-py36-linux-conda.yml
rm qiime2-2020.11-py36-linux-conda.yml
On native Windows, the process to install wget is more complicated. You'll have to download wget and move the correct exe files into the correct system directories as shown here.
This can be difficult if you don't already have administrator privileges for these directories on your machine; we recommend using macOS or Windows Subsystem for Linux instead. Instructions on installing Windows Subsystem for Linux can be found here.
conda activate qiime2-2020.11
You can deactivate at any time with `conda deactivate`.
After git cloning this repo, copy your directory of gzipped fastq files (`.fastq.gz` file format) to your local version of this repo. Note: fastQCS3 has only been developed to perform analysis on either multiplexed single-end sequence data generated following the Earth Microbiome Project protocol or demultiplexed single-end sequence data in the Casava 1.8 format. Sequence data is assumed to have been generated by amplifying the V4 region of the 16S rRNA gene using the 515F/806R primer pair. If reads are still multiplexed, a `barcode.fastq.gz` file mapping sample-IDs to their adapter barcode sequences will need to be included in the directory containing your multiplexed reads.
You will also need to add your metadata file, which matches sample-IDs with their associated metadata, to the `metadata/` directory (see the demo metadata files in `metadata/` to match formatting). Once your sequence data and metadata are in their appropriate locations, you can move on to the next step.
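For demultiplexed data, it can help to check file names before starting the pipeline. The sketch below is not part of FastQCS3: the function names are hypothetical, and the regular expression is an assumption about the Casava 1.8 naming layout (sample ID, barcode identifier, lane, read direction, set number). It also flags sample IDs that do not start with a letter, which the notes at the end of this README ask you to avoid.

```python
import re
from pathlib import Path

# Assumed Casava 1.8 layout: <sample>_<barcode>_L<lane>_R<read>_<set>.fastq.gz
CASAVA_PATTERN = re.compile(
    r"^(?P<sample>.+?)_[^_]+_L\d{3}_R[12]_\d{3}\.fastq\.gz$"
)


def check_fastq_names(filenames):
    """Return (filename, problem) pairs for names that look wrong."""
    problems = []
    for name in filenames:
        match = CASAVA_PATTERN.match(name)
        if match is None:
            problems.append((name, "does not match the Casava 1.8 pattern"))
        elif not match.group("sample")[0].isalpha():
            problems.append((name, "sample ID should start with a letter"))
    return problems


def check_directory(directory):
    """Apply check_fastq_names to every .fastq.gz file in a directory."""
    names = sorted(p.name for p in Path(directory).glob("*.fastq.gz"))
    return check_fastq_names(names)
```

Calling `check_directory("demo_data_v1")` would return an empty list if every file name passes both checks. Multiplexed Earth Microbiome Project data uses different file names, so this check does not apply there.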
If you have already activated your environment, go to step 2.
1. Activate your fastQCS3 environment by typing the command `conda activate fastqcs3`.
2. Run `python fastQCS3_pkl.py`. This will begin the automated QIIME2 process of importing your data and performing the analysis.
3. You will be asked if your reads are multiplexed or demultiplexed. Choose `y` or `n`.
4. You will then be prompted to enter the name of the directory containing your sequence data. For example, you would enter `demo_data_v1` to import the demo data from that directory.
5. Enter the name of your metadata file. The response should be in the format of `sample-metadata.tsv`. Your data will then begin importing into objects the QIIME2 software can handle.
6. You will then be given information about the quality of your reads and the positions at which their quality begins to drop below a certain threshold. Use this information to inform your choice of a sequencing trim length. We recommend setting the trim length to an integer (e.g. `120`) at the position where your average quality drops below a Phred score of ~18-22. If all positions are `None`, set the trim length to `0` to trim nothing and retain the entire length of your reads.
7. After you choose a trim length, DADA2 will run. DADA2 will generate a feature table, where each feature corresponds to an ASV. If you have tens of millions of reads, DADA2 can take several hours to run. If you have hundreds of thousands, DADA2 will finish in a matter of minutes.
8. You will then be presented with information about the abundance of features in your dataset. If some samples perform poorly and have few features (0 to ~30), you'll want to exclude them from further analysis. Depending on the distribution of your samples, the software will suggest a sampling depth. Either take the suggestion, or read more about choosing a sampling depth. Sampling depth must be an integer (e.g. `800`).
9. The pipeline will then compute alpha and beta diversity metrics, generate a phylogenetic tree, perform alpha rarefaction, and run taxonomic analysis, all in one step. If this completes successfully, you will be prompted to enter a name for the visualization file you will use in step 11. Do not include spaces or periods in your file name.
10. The visualization objects are then packaged into a `.pkl` file. Follow the prompt and run `python fastQCS3_dashboard.py`.
11. Enter the name of the visualization file you created in step 9. The script will generate the dashboard app and display a link that can then be copied into your browser to visualize your data.
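The trim-length guidance in step 6 can be sketched in a few lines of Python. This is an illustration of the idea only, not the code FastQCS3 runs (the package uses Biopython for its quality metrics). The function names are hypothetical, and the sketch assumes four-line FASTQ records with Phred+33 quality encoding.

```python
import gzip


def mean_quality_by_position(fastq_lines, offset=33):
    """Average Phred score at each read position from FASTQ text lines."""
    totals, counts = [], []
    # In a four-line FASTQ record the quality string is the fourth line.
    for index, line in enumerate(fastq_lines):
        if index % 4 != 3:
            continue
        for pos, char in enumerate(line.strip()):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def suggest_trim_length(mean_quality, threshold=20):
    """Index of the first position below threshold, or 0 if there is none."""
    for position, quality in enumerate(mean_quality):
        if quality < threshold:
            return position
    return 0


def suggest_for_file(path, threshold=20):
    """Convenience wrapper for a single gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return suggest_trim_length(mean_quality_by_position(handle), threshold)
```

The returned index equals the number of leading bases whose mean quality stays at or above the threshold, so it can be used directly as a trim length. A return value of `0` matches the "trim nothing" convention from step 6, although it would also be returned if the very first position were already below the threshold.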
A demo video can be found here. We have provided two sets of demo data in `demo_data_v1` (from the QIIME2 "Moving Pictures" tutorial) and `demo_data_v2` (Evan's personal sequencing results that inspired this project) if you wish to practice.
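When practicing with the demo data, the sampling depth choice in step 8 is the one most likely to need some thought. This README does not describe how FastQCS3 computes its suggestion, so the sketch below is only one plausible heuristic, stated as an assumption: drop samples at or below a small cutoff, then use the smallest remaining per-sample total as the depth. The function name and the `min_count` parameter are hypothetical.

```python
def suggest_sampling_depth(sample_counts, min_count=30):
    """Pick the smallest per-sample total that is above min_count.

    sample_counts maps sample ID -> per-sample total from the feature table.
    Returns (depth, dropped_sample_ids); depth is None if nothing qualifies.
    """
    kept = {s: c for s, c in sample_counts.items() if c > min_count}
    dropped = sorted(s for s in sample_counts if s not in kept)
    depth = min(kept.values()) if kept else None
    return depth, dropped


# Hypothetical per-sample totals, for illustration only.
depth, dropped = suggest_sampling_depth({"A1": 1200, "A2": 800, "A3": 12})
print(depth, dropped)
```

With these made-up totals, sample `A3` is dropped and every remaining sample can be rarefied to the chosen depth without losing more samples.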
1. Your fastq file names should begin with an alphabetical character:
   e.g. `A1_S1_L001_R1_001.fastq.gz` instead of `1_S1_L001_R1_001.fastq.gz`
2. To visualize multiple datasets at the same time, each dashboard needs its own address, so you will
   have to change the port on line 204 of the `fastQCS3_dashboard.py` file before launching each
   additional dataset (`128.0.0.1` is not a local loopback address, so keep the host and vary the port).
   e.g. `app.run_server(host='127.0.0.1', port='8050', debug=False)` for the first dataset and
   `app.run_server(host='127.0.0.1', port='8051', debug=False)` for the second.
   We hope to modify this in the future to allow users to run multiple directories at once.
3. Unit testing in Dash can be complicated, so most of our unit tests focus on the accurate
   generation of the dataframes that are later plotted in Dash. The main testing functions can
   be found in the `py_files` directory, and they test the data manipulation scripts within
   that directory. Running nosetests on that directory after running the `fastQCS3_pkl.py` script on the
   `demo_data_v1` directory indicates passing results.
4. The classifier we use in this repo was not trained by us; it's a commonly used microbiome classifier
   that we borrowed from the QIIME2 tutorials. In the future, we'd like to develop our own pipeline to train
   a user-specific classifier.
5. Future steps could also include: making the package pip installable, expanding to include paired-end
   sequencing, adding additional output plots including an option to display data before DADA2 filtering,
   and eventually developing our own functions for data processing to remove the QIIME2 dependence.
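One possible way to avoid editing the `app.run_server` call by hand for every dataset (note 2 above) is to read the host and port from the command line. This is a sketch of the idea, not something `fastQCS3_dashboard.py` currently supports; `parse_server_args` and the `--host` and `--port` flags are hypothetical.

```python
import argparse


def parse_server_args(argv=None):
    """Parse --host and --port so each dashboard can bind to its own port."""
    parser = argparse.ArgumentParser(description="Launch a dashboard server")
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8050)
    return parser.parse_args(argv)


# Example: the arguments a second dashboard instance might be started with.
args = parse_server_args(["--port", "8051"])
print(f"Would serve on http://{args.host}:{args.port}")

# In the dashboard script, the parsed values would replace the hard-coded ones:
# app.run_server(host=args.host, port=args.port, debug=False)
```

Each dataset could then be launched in its own terminal with a different `--port` value while the host stays on the local loopback address.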
CI can be difficult when running in such a large and complicated environment (like one containing QIIME2).
Check back for updates about continuous integration through Travis CI.