GitHub - FastQCS3-direct/fastqcs3: Main Repo for CHEM E 545/546 Final Project

FastQCS3: Fast Quantitative Checking of 16S Gene Sequencing

Main Repo for CHEM E 545/546 Final Project

Overview/Purpose

The purpose of this software is to give users, primarily microbiome researchers in academia, a tool to run quality checking immediately after sequencing results become available. This package is run from the command line and outputs a link to an interactive dashboard in a web browser. The four main analyses performed include: sequencing quality, relative abundances, alpha diversity metrics, and beta diversity metrics.
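To make the analyses named above concrete, here is a minimal sketch of three of the quantities involved: relative abundance, one alpha diversity metric (the Shannon index), and one beta diversity metric (Bray-Curtis dissimilarity). This is an illustration only. In FastQCS3 these values come from QIIME2; the function names, the choice of these two specific metrics, and the use of the natural log are assumptions made for this example.

```python
import math
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw feature counts for one sample into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {feature: n / total for feature, n in counts.items()}


def shannon_index(counts: Dict[str, int]) -> float:
    """Alpha diversity of one sample: Shannon entropy using the natural log.

    Tools differ in the log base they use, so absolute values may not match
    other software.
    """
    fractions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in fractions if p > 0)


def bray_curtis(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Beta diversity between two samples: 0 = identical, 1 = nothing shared."""
    features = set(a) | set(b)
    shared = sum(min(a.get(f, 0), b.get(f, 0)) for f in features)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * shared / total
```

Alpha diversity summarizes a single sample, while beta diversity compares a pair of samples, which is why the first function takes one count table and the second takes two.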

Installing Dependencies

This package relies on QIIME2, a previously published microbiome analysis package, to pre-process input data, and on Biopython to extract sequencing quality metrics. FastQCS3 uses QIIME2's data processing capabilities while focusing on building an interactive dashboard with more useful visualization tools than QIIME2's native .qza/.qzv files. To use FastQCS3, QIIME2, Biopython, and Dash must all be installed in your working environment. QIIME2 runs natively on macOS and Linux; on Windows it requires Windows Subsystem for Linux. Information on installing Windows Subsystem for Linux is found below.
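As a quick sanity check, the three dependencies can be probed from Python before running anything. The sketch below is not part of FastQCS3; the helper name `missing_dependencies` is hypothetical, and it only tests whether each package can be located, not whether its version is suitable.

```python
from importlib.util import find_spec

# Import names for each dependency: QIIME2 is imported as "qiime2",
# Biopython as "Bio", and Dash as "dash".
REQUIRED_MODULES = {"QIIME2": "qiime2", "Biopython": "Bio", "Dash": "dash"}


def missing_dependencies(required=REQUIRED_MODULES):
    """Return the display names of packages that cannot be found for import."""
    return [name for name, module in required.items() if find_spec(module) is None]
```

Calling `missing_dependencies()` inside the activated environment should return an empty list; any names it does return are the packages still to install.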

Installing Miniconda

You can follow the instructions here to install Miniconda.

Updating Miniconda and Installing Zip/Unzip

conda update conda

sudo apt-get install zip unzip

Creating the FastQCS3 Environment from this Repo

Once you have Miniconda, you can create an environment using the provided environment.yml file contained in this repo and the following command.

conda env create -n fastqcs3 --file environment.yml

Due to the size of the QIIME2 package, creating the environment can take a while. Once it has finished, activate your environment with conda activate fastqcs3.

You can deactivate at any time using conda deactivate.
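If you are unsure whether the environment is active, conda exposes the active environment's name in the `CONDA_DEFAULT_ENV` environment variable. The following is a minimal sketch built on that; the helper names are hypothetical and not part of this repo.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if none is active."""
    return environ.get("CONDA_DEFAULT_ENV")


def in_expected_env(expected="fastqcs3", environ=os.environ):
    """True when the active conda environment matches the expected name."""
    return active_conda_env(environ) == expected
```

`in_expected_env()` returns `False` both when a different environment is active and when conda has not been activated at all.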

If you have followed the instructions up to this point and you have created your environment from the environment.yml file, move straight to the Operating Instructions section.

Creating an Environment from Scratch

To create an environment from scratch, you'll have to download QIIME2 using the following instructions and install Biopython and Dash as well.

Installing wget

conda install wget

If conda install wget doesn't work, you can also try pip install wget or sudo apt-get install wget, or use a package manager such as Homebrew (brew install wget).

If you have a Mac OS...

wget https://data.qiime2.org/distro/core/qiime2-2020.11-py36-osx-conda.yml

conda env create -n qiime2-2020.11 --file qiime2-2020.11-py36-osx-conda.yml

rm qiime2-2020.11-py36-osx-conda.yml

If you have Linux or Windows Subsystem for Linux...

wget https://data.qiime2.org/distro/core/qiime2-2020.11-py36-linux-conda.yml

conda env create -n qiime2-2020.11 --file qiime2-2020.11-py36-linux-conda.yml

rm qiime2-2020.11-py36-linux-conda.yml
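The Mac and Linux commands above differ only in the environment file name. As a sketch, the right file can be chosen from the running platform; the function names here are hypothetical, and the URLs are the two given above. Inside Windows Subsystem for Linux, `platform.system()` reports `Linux`, so the Linux file is selected there.

```python
import platform

BASE_URL = "https://data.qiime2.org/distro/core"
RELEASE = "qiime2-2020.11-py36"


def qiime2_env_file(system=None):
    """Return the QIIME2 environment file name for the given or current platform."""
    system = system or platform.system()
    suffix = {"Darwin": "osx", "Linux": "linux"}.get(system)
    if suffix is None:
        # Native Windows is not covered by the two files listed in this README.
        raise RuntimeError(f"no QIIME2 environment file listed for {system!r}")
    return f"{RELEASE}-{suffix}-conda.yml"


def qiime2_env_url(system=None):
    """Return the full download URL for the platform's environment file."""
    return f"{BASE_URL}/{qiime2_env_file(system)}"
```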

If you have a Windows OS...

The process to install wget will be more complicated. You'll have to download wget and move the correct .exe files into the correct system directories, as shown here.

This can be difficult if you don't already have administrator privileges for these directories on your machine; we recommend using macOS or Windows Subsystem for Linux instead. Instructions on installing Windows Subsystem for Linux can be found here.

To activate your environment,

conda activate qiime2-2020.11

You can deactivate at any time with conda deactivate.

Operating Instructions

After cloning this repo with git, copy your directory of gzipped FASTQ files (.fastq.gz file format) into your local copy of the repo. Note: FastQCS3 has only been developed to analyze either multiplexed single-end sequence data generated following the Earth Microbiome Project protocol or demultiplexed single-end sequence data in the Casava 1.8 format. Sequence data is assumed to have been generated by amplifying the V4 region of the 16S rRNA gene using the 515F/806R primer pair. If reads are still multiplexed, a barcode.fastq.gz file mapping sample-IDs to their adapter barcode sequences will need to be included in the directory containing your multiplexed reads.
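Before starting the pipeline, it can help to check which of the two supported layouts a directory looks like. The sketch below guesses from file names alone. The regular expression is an assumed approximation of Casava 1.8 naming (sample, barcode or sample number, lane, read, set number), and the function names are hypothetical; FastQCS3 itself asks you which layout you have rather than detecting it.

```python
import re
from pathlib import Path

# Assumed Casava 1.8 naming: <sample>_<id>_L<lane>_R<read>_<set>.fastq.gz
CASAVA_NAME = re.compile(r"^(?P<sample>.+)_[^_]+_L\d{3}_R[12]_\d{3}\.fastq\.gz$")


def classify_filenames(names):
    """Guess the layout of a set of file names.

    Returns "demultiplexed", "multiplexed", "unknown", or "empty" when no
    .fastq.gz files are present.
    """
    fastq_names = [n for n in names if n.endswith(".fastq.gz")]
    if not fastq_names:
        return "empty"
    if all(CASAVA_NAME.match(n) for n in fastq_names):
        return "demultiplexed"
    # A barcode file alongside the reads suggests multiplexed EMP-style data.
    if any(n.startswith("barcode") for n in fastq_names):
        return "multiplexed"
    return "unknown"


def classify_directory(path):
    """Apply classify_filenames to the files directly inside a directory."""
    return classify_filenames(p.name for p in Path(path).iterdir())
```

An "unknown" result usually means the names do not follow the Casava 1.8 pattern and no barcode file was found, which is worth resolving before importing.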

You will also need to add your metadata file to the metadata/ directory (see demo metadata files in metadata/ to match formatting), which matches sample-IDs with their associated metadata. Once your sequence data and metadata are in their appropriate locations, you can move onto the next step.
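Because the metadata file must match the sample-IDs in your sequence data, a mismatch is worth catching early. The sketch below assumes a tab-separated file whose first column holds the sample-IDs, with a single header row and optional directive rows beginning with `#` (as in QIIME2-style metadata). The function names are hypothetical, and the demo files in metadata/ remain the reference for formatting.

```python
import csv
import io


def metadata_sample_ids(tsv_text):
    """Return the sample-IDs from the first column of tab-separated metadata."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if row and row[0].strip()]
    # rows[0] is the header; later rows starting with "#" are treated as
    # comment or directive rows rather than samples.
    return [row[0].strip() for row in rows[1:] if not row[0].startswith("#")]


def compare_ids(metadata_ids, sequence_ids):
    """Report IDs present on one side but not the other."""
    metadata_ids, sequence_ids = set(metadata_ids), set(sequence_ids)
    return {
        "missing_metadata": sorted(sequence_ids - metadata_ids),
        "unused_metadata": sorted(metadata_ids - sequence_ids),
    }
```

Samples listed under `missing_metadata` have reads but no metadata row; those under `unused_metadata` have a metadata row but no matching reads.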

To begin running the software:

If you have already activated your environment, go to step 2.

1. Activate your FastQCS3 environment by typing the command conda activate fastqcs3.
2. Run python fastQCS3_pkl.py. This will begin the automated QIIME2 process of importing your data and performing the analysis.
3. You will be asked if your reads are multiplexed or demultiplexed. Choose y or n.
4. You will then be prompted to enter the name of the directory containing your sequence data. For example, you would enter demo_data_v1 to import the demo data from that directory.
5. Enter the name of your metadata file. The response should be in the format of sample-metadata.tsv. Your data will then begin importing into objects the QIIME2 software can handle.
6. You will then be given information about the quality of your reads and the positions at which their quality begins to drop below a certain threshold. Use this information to guide your choice of a sequencing trim length. We recommend setting trim length to an integer (e.g. 120) at the position where your average quality drops below a Phred score of ~18-22. If all positions are None, then set trim length to 0 to trim off nothing and retain the entire length of your reads.
7. After choosing a trim length, DADA2 will run. DADA2 will generate a feature table, where each feature corresponds to an ASV. If you have tens of millions of reads, DADA2 can take several hours to run. If you have hundreds of thousands, DADA2 will finish in a matter of minutes.
8. You will then be presented with information about the abundance of features in your dataset. If some samples perform poorly and have few features (0 to ~30), you'll want to exclude them from further analysis. Depending on the distribution of your samples, the software will suggest a sampling depth integer value. Either take the suggestion, or read more about choosing a sampling depth. Sampling depth must be an integer (e.g. 800).
9. The pipeline will then compute alpha and beta diversity metrics, generate a phylogenetic tree, perform alpha rarefaction, and run taxonomic analysis all in one step. If completed successfully, you will be prompted to enter a name for the visualization file you will use in step 11. Do not include spaces or periods in your file name.
10. The visualization objects are then packaged into a .pkl file. Follow the prompt and run python fastQCS3_dashboard.py.
11. Enter the name of the visualization file you created in step 9. The script will generate the dashboard app and display a link that can then be copied into your browser to visualize your data.
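The trim-length choice in the quality step can be illustrated with a small sketch: average the Phred scores at each read position, then report the first position where the average falls below a threshold. This is not the code FastQCS3 runs (the package uses Biopython for its quality metrics). The function names, the Phred+33 encoding, and the default threshold of 20 are assumptions chosen to sit inside the ~18-22 range recommended above.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Average Phred score at each read position (Phred+33 ASCII assumed)."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, char in enumerate(qual):
            if i == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def suggest_trim_length(mean_quality, threshold=20):
    """Return the number of leading positions to keep.

    This is the 0-based index of the first position whose mean quality is
    below the threshold. Returns 0 when quality never drops, matching the
    convention above that 0 means "retain the entire read". A drop at the
    very first position also yields 0 and needs manual inspection.
    """
    for position, quality in enumerate(mean_quality):
        if quality < threshold:
            return position
    return 0
```

For example, the quality strings from the fourth line of each FASTQ record can be passed to `mean_quality_by_position`, and its result to `suggest_trim_length`.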

A demo video can be found here. We have provided two sets of demo data in demo_data_v1 (from the QIIME2 "Moving Pictures" tutorial) and demo_data_v2 (Evan's personal sequencing results that inspired this project) if you wish to practice.

Important Notes about Usage and Future Steps

1. Your fastq files should be labelled with an alphabetical character in front:
	e.g. `A1_S1_L001_R1_001.fastq.gz` instead of `1_S1_L001_R1_001.fastq.gz`
	
2. If running multiple datasets at once, you will have to change the host or the port
on line 204 of the `fastQCS3_dashboard.py` file every time
you run a new dataset, so that each dashboard is served at its own address,
	e.g. `app.run_server(host='127.0.0.1', port='8050', debug=False)` followed by
	`app.run_server(host='127.0.0.1', port='8051', debug=False)`
We hope to modify this in the future to allow users to run multiple directories at once.

3. Unit testing in Dash can be complicated, so most of our unit tests are designed to focus on the 
accurate generation of dataframes to then be plotted later in Dash. The main testing functions can
be found in the `py_files` directory and they perform tests on our data manipulation scripts within 
that directory. Running nosetests on that directory after running the `fastQCS3_pkl.py` script on the
demo_data_v1 directory indicates passing results.

4. The classifier we use in this repo was not trained by us; it's a commonly used microbiome classifier
that we borrowed from the QIIME2 tutorials. In the future, we'd like to develop our own pipeline to train
a user-specific classifier.

5. Future next steps could also include: making the package pip installable, expanding to include 
paired-end sequencing, adding additional output plots including an option to display data before DADA2
filtering, and eventually developing our own functions for data processing to remove QIIME2 dependence.
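For note 1 above, file names that start with a digit can be found and given a leading letter. The sketch below only plans the renames and does not touch any files. The function names and the default prefix `A` are hypothetical choices for this example.

```python
from pathlib import Path


def safe_fastq_name(name, prefix="A"):
    """Prepend `prefix` when a file name does not start with an ASCII letter."""
    first = name[:1]
    if first.isascii() and first.isalpha():
        return name
    return prefix + name


def plan_renames(directory, prefix="A"):
    """List (old_path, new_path) pairs for .fastq.gz files that need a prefix.

    Nothing is renamed here; the caller decides whether to apply the plan.
    """
    plan = []
    for path in sorted(Path(directory).glob("*.fastq.gz")):
        new_name = safe_fastq_name(path.name, prefix)
        if new_name != path.name:
            plan.append((path, path.with_name(new_name)))
    return plan
```

If you apply such a plan (for instance with `old_path.rename(new_path)`), remember that the sample-IDs change too, so the metadata file must be updated to match.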

Notes about CI

CI can be difficult when running in such a large and complicated environment (like one containing QIIME2).
Check back for updates about continuous integration through Travis CI.
