04-qiime2.Rmd

# | Qiime2 workflow
<font size="1">**Created by:** Matt Selensky on 2019-11-19 </br>
**Last updated:** 2019-11-30 </font>


<h2>Workflow for 16S amplicon sequence analysis in Qiime2</h2>

## Import data

This protocol is designed for the processing of 16S rRNA gene amplicon data using 515F/806R primers. Qiime2 requires us to convert the raw data the sequencing center sends us into Qiime-zipped artifacts, or a ```.qza``` extension. We must first import our data (paired-end .fastq files) into this format:

```{bash, echo=TRUE, eval=FALSE}
qiime tools import \
  --type EMPPairedEndSequences \
  --input-path emp-paired-end-sequences \
  --output-path emp-paired-end-sequences.qza
```

The above function requires three .fastq files in the folder ```emp-paired-end-sequences```. One file is for the forward reads, one is for the reverse reads, and another is for the barcodes. They *must* be named ```forward.fastq.gz```, ```reverse.fastq.gz```, and ```barcodes.fastq.gz```, respectively.

```qiime tools import``` will yield a single output, ```emp-paired-end-sequences.qza```, that will contain all of the barcoded reads from every single sample submitted to the sequencing center.

## Demultiplexing

At the sequencing center, DNA sequences were given a barcode specific to each sample so we can track which sample our reads in the ```emp-paired-end-sequences.qza``` file originated from. We do this by *demultiplexing* our sequences.

In Qiime2, we need to create a metadata file that contains the barcodes used for each sample. Check out this [example](https://docs.google.com/spreadsheets/d/1y3yM50tW_23H7fXeou9XwyM92VNd8dCtgk8ndHOMSMs/edit#gid=0) from the Qiime2 documentation of how this metadata file should be formatted. The sequencing center should send a mapping file from which you can obtain the barcodes for each sample. Be sure to save the metadata file as a .tsv. Specify the barcodes and other associated metadata only for the samples you are interested in analyzing. As is often the case in our lab, sequencing data is typically sent back as a mix of samples from different projects. Only including your samples in the metadata file will subset the large ```.qza``` file and will significantly cut down on computation time.

Because we are demultiplexing EMP paired-end sequences, we should use the ```demux emp-paired``` command. The column which contains the barcode in the metadata file for each sample must be specified using the argument ```barcodes-column```:

```{bash, echo=TRUE, eval=FALSE}
qiime demux emp-paired \
  --m-barcodes-file sample-metadata.tsv \
  --m-barcodes-column barcode-sequence \
  --i-seqs emp-paired-end-sequences.qza \
  --o-per-sample-sequences demux.qza 
```

Note: if you have reverse complement sequences, you must pass the argument, ```--p-rev-comp-mapping-barcodes``` to your ```demux``` command to account for this.

You can look at this on the [Qiime2 viewer](https://view.qiime2.org/) by producing a Qiime-zipped visualization file, ```.qzv```, from your now-demultiplexed ```.qza``` output:

```{bash, echo=TRUE, eval=FALSE}
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
```

From the interactive quality plot in ```demux.qzv```, we can see the distribution of quality scores for each sequenced base. If analyzing paired-end data, you will see two plots: one for the forward read, and one for the reverse read. We use this visualization to inform how we will trim and truncate the ends of the reads in the next denoising step using ```dada2 denoise-paired```

## Denoising and ASV generation

We will use the [DADA2](https://www.ncbi.nlm.nih.gov/pubmed/27214047) algorithm to denoise our data and generate amplicon sequence variants (ASVs). DADA2 is a robust way to filter out noisy sequences, correct errors in marginal sequences, remove chimeras, remove singletons, join denoised paired-end reads, *and* dereplicate sequences. Previously, each of these functions would require separate commands, but DADA2 does it all-in-one. Therefore, this is a particularly computationally intense process. One should consider running ```dada2``` on a computer that can handle it (perhaps by accessing Northwestern's high-performance computing cluster, [Quest](https://www.it.northwestern.edu/research/user-services/quest/)).

```{bash, echo=TRUE, eval=FALSE}
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \ 
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```

By inspecting the interactive ```demux.qzv``` file produced in the previous step on the [Qiime2 viewer](https://view.qiime2.org/), we observe that sequence quality scores are lower than average until base #14 in both the forward and reverse reads. We will want to trim these low-quality sequences from our data. Use the argument ```p-trim``` to specify the number of nucleotides that should be trimmed from the left end of the forward (```left-f```) and reverse (```left-r```) reads, which we define as ```13``` here. Similarly, the ```p-trunc-len``` argument is used to trim the right ends of, or *truncate*, our reads.  Since we have paired-end data, our amplicons are 150 nucleotides long. We define ```p-trunc-len``` for the forward and reverse reads as ```150``` because we do not observe a drop off in quality scoring on their right ends in our ```demux.qzv``` file.

The output ```rep-seqs.qza``` contains a list of the ASVs found across all samples, and be used in the next step of our processing workflow: assigning taxonomy.

## Taxonomy

At this point, our ASVs lack any meaningful identification - we don't know whether ASV 'A' comes from the bacterium *E. coli* or the archaeum *S. solfataricus*! We determine who is present in our samples by assigning taxonomic IDs to each "query" sequence (from ```rep-seqs.qza```). We do this by comparing query sequences to a database of known reference sequences ( [Silva](https://www.arb-silva.de/documentation/silva-taxonomy/) is an excellent choice for our purposes). A major advantage of using Qiime2 is that it contains the ```classify-sklearn``` algorithm, which uses machine learning via Naive Bayes to classify sequences. As is the case with other machine learning applications, the classifier must be *trained*. Classifier training is required for every new reference database/amplicon pair, and would be the most resource-intensive step in our workflow by far. Luckily for us, Silva is routinely used to classify sequences coming from 16S rRNA gene amplification using the 515F/806R primers, and the Qiime2 developers provide a free, [pre-trained Silva classifier](https://docs.qiime2.org/2019.10/data-resources/) in their documentation for just that! Download this classifier - you will need it! At the time this was written, I used ```silva-132-99-515-806-nb-classifier-2018.qza```, the latest version of the pre-trained Silva classifier.

Even without the extra training step, it is highly recommended to run ```classify-sklearn``` on a high-performance computing cluster, as it is very memory intensive and slow (budget several hours or even a day for this to complete!). Please refer to our Quest tutorial to get started on how to submit jobs on Northwestern's cluster.

```{bash, echo=TRUE, eval=FALSE}
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```

If you so choose, you can visualize the resultant taxonomy file on the [Qiime 2 viewer](https://view.qiime2.org/) to verify that classification was successful:

```{bash, echo=TRUE, eval=FALSE}
qiime metadata tabulate \
  --m-input-file taxonomy.qza \ 
  --o-visualization taxonomy.qzv
```

## Taxa barplots and diversity analyses in Qiime2

You can quickly visualize the community composition of your samples via the ```taxa barplot``` command. This requires your clustered feature table and taxonomy.qza from the previous step as inputs. In the [Qiime 2 viewer](https://view.qiime2.org/), you can export the data that feeds the taxa barplot as .csv files specific to each level of taxonomic classification. Use those files in R to produce publication-quality figures. 

```{bash, echo=TRUE, eval=FALSE}
qiime taxa barplot \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \ 
  --m-metadata-file metadata.tsv \ 
  --o-visualization taxa-bar-plots.qzv
```

In fact, we don't really want to use Qiime2 for making any sort of figure for presentations or publications, but it does have a few handy tools to quickly visualize your data to inform which types of figures you want to make. Let's start with the built-in diversity analyses offered by Qiime2.

Many diversity analyses compute diversity by incorporating phylogeny. That means we have to generate a phylogenetic tree of how our sequences are related to each other! Both rooted and unrooted trees are outputs of the  ```align-to-tree-mafft-fasttree``` command. UniFrac and Faith's Phylogenetic Diverstiy require the use of a rooted tree. 

```{bash, echo=TRUE, eval=FALSE}
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza
```

Additionally, the analyses we are about to perform will be subsampling our data to estimate diversity. This *rarefaction* is done so we can compare diversity across samples of different sizes, thereby minimizing bias. We need to know the sequencing depth we should take so we don't miss out on too many rare sequences (by choosing too low of a depth) or too many samples themselves (by choosing too high of a depth). By making an alpha rarefaction plot that we can visualize on the Qiime2 viewer, we can make an informed decision:

```{bash, echo=TRUE, eval=FALSE}
qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --i-phylogeny rooted-tree.qza \
  --p-max-depth 20000 \
  --m-metadata-file metadata.tsv \
  --o-visualization alpha-rarefaction.qzv
```

From this visualization, you should choose a sequencing depth at which the observed OTUs from most samples level off, without excluding too many samples. In our example, we will choose a depth of 6500.

After determining the degree of rarefaction, we can compute core diversity metrics in Qiime2:

```{bash, echo=TRUE, eval=FALSE}
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 6500 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-results
```

The ```core-metrics-results``` folder will contain both alpha and beta diversity metrics. For each metric, you can determine diversity significance using ``` diversity alpha-group-significance``` or ```diversity beta-group-significance```:

```{bash, echo=TRUE, eval=FALSE}
qiime diversity alpha-group-significance \
  --i-alpha-diversity core-metrics-results/alpha-div-metric-of-interest.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization core-metrics-results/metric-group-significance.qzv
```

```{bash, echo=TRUE, eval=FALSE}
qiime diversity beta-group-significance \
  --i-distance-matrix core-metrics-results/distance_matrix.qza \
  --m-metadata-file metadata.tsv \
  --m-metadata-column comparisonofinterest \
  --o-visualization core-metrics-results/unweighted-unifrac-comparisonofinterest-significance.qzv
```

Beta diversity visualizations can be viewed via Qiime2 View's Emperor, which offers an interactive three-dimensional platform to explore relationships in your data.