---
title: "16S Data Processing Training"
author: Katie McCauley from the Bioinformatics and Computational Biosciences Branch (BCBB)
format:
  html:
    toc: true
    toc-location: left
    number-sections: true
    embed-resources: true
    code-overflow: wrap
    theme: yeti
editor: source
editor_options:
  chunk_output_type: console
---
Thank you for joining us for this 16S rRNA microbiome data processing tutorial! Before we jump in, let's start by going over our learning objectives for the technical component.
- Identify Primers and/or Adapters
- Check the Quality of Microbiome Data
- Develop Amplicon Sequence Variants (ASVs) using the Divisive Amplicon Denoising Algorithm 2 (DADA2)
- Assign Taxonomy
- Check for Chimeras
- Make a Phylogenetic Tree
- Collate Data in a `phyloseq` Object
This tutorial is designed to introduce you to the steps involved in generating data from DADA2. While you may never actually create an ASV table using R code again (perhaps preferring Nephele's code-free DADA2 pipeline[^1]), we will hopefully introduce you to the steps involved and points to watch out for or consider when choosing parameters or evaluating your results.
[^1]: https://nephele.niaid.nih.gov/
## Quick Note about ASVs vs OTUs
What we'll cover today are considered **Amplicon Sequence Variants** or **Sequence Variants** because they can differ by as little as one nucleotide (we'll cover how this happens later). Before DADA2 was widely accepted, though, the field primarily used **Operational Taxonomic Units** or **OTUs**, built by clustering sequences at 97% similarity. In practice:
- Take your most frequent sequence and consider it the first "seed"
- Take your next most-frequent sequence and determine how similar it is to the first. If it's at least 97% similar, consider it the "same" as the first sequence; if it's less similar, consider it a new "seed"
- Repeat for all dereplicated sequences, checking against all other "seed" sequences (a toy sketch of this greedy procedure follows below)
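To make the greedy clustering concrete, here is a toy sketch in base R. This is purely illustrative, not how production OTU pickers work: it assumes equal-length sequences (so "similarity" reduces to the fraction of matching positions), and all function names here are hypothetical.
```{r}
#| label: toy-otu-clustering
## Toy illustration of greedy 97%-identity clustering (assumes
## equal-length sequences; real OTU pickers align sequences first)
pct_identity <- function(a, b) {
  mean(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
}
greedy_cluster <- function(seqs, cutoff = 0.97) {
  seeds <- character(0)
  assignment <- integer(length(seqs))
  for (i in seq_along(seqs)) {
    sims <- vapply(seeds, pct_identity, numeric(1), b = seqs[i])
    if (length(seeds) > 0 && max(sims) >= cutoff) {
      assignment[i] <- which.max(sims)  # close enough: join an existing seed
    } else {
      seeds <- c(seeds, seqs[i])        # too different: become a new seed
      assignment[i] <- length(seeds)
    }
  }
  assignment
}
## Sequences must be sorted most- to least-frequent before clustering
greedy_cluster(c("ACGTACGTAC", "ACGTACGTAT", "TTTTACGTAC"), cutoff = 0.9)
```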
The 97% cutoff was chosen because it was presumed to account for potential PCR error in the sequence data, but in practice the clustering was often fairly greedy. The figure below from Benjamin Callahan's supplemental material for the DADA2 paper[^2] helps describe this comparison:
[^2]: Callahan, B., McMurdie, P., Rosen, M. *et al.* DADA2: High-resolution sample inference from Illumina amplicon data. *Nat Methods* **13**, 581--583 (2016). https://doi.org/10.1038/nmeth.3869
[![Comparing OTUs to ASVs](images/OTUs_vs_ASVs.png)](https://www.nature.com/articles/nmeth.3869)
## A Primer on Sequencing
[A short video from Illumina on "Sequencing by Synthesis"](https://www.youtube.com/watch?v=fCd6B5HRaZ8)
[![Indexed Sequencing Overview](images/index_orientation.webp)](https://knowledge.illumina.com/library-preparation/general/library-preparation-general-reference_material-list/000002099)
When the data is pulled from the sequencer, it typically arrives in BCL format, which is a highly compressed and proprietary Illumina format from which we need to create human- and computer-readable files (FASTQ). Illumina now frequently provides individual forward and reverse sequence files for each sample (called "Demultiplexed"), and this is the format that gets uploaded to public repositories like the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).
[![Library Multiplexing Overview](images/SequencingBySynthesis.png){fig-align="center"}](https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf)
This is what a "FASTQ" file ends up looking like:
```
@M03213:59:000000000-AWR6D:1:1101:12406:1145 2:N:0:NCCTGAGC+NTATTAAG
NGACTACTGGGGTTTCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTACATTCCCAAGNGGCTGCCTTCGCCTTCGGTATTCCTCCACATCTCTACGCNTTTCACCGCTACACGTGGAATTCTACCCCTCCCTAAAGTACTCTAGATTCCCAGTCTGAAATGCAATTCCCAGGTTAAGCCCGGGGCTTTCACACCTCACTTAAAAATCCGCCTGCGTGCCCTTTACGCCCAGTTATTCCGATTAACGCT
+
#8ACCGGGGGGGFFGGGGFGGGGGGGGGGGGGGGGGGGGGGGEGGFFGGGGGGGGGGGGGGFGFFGG<E#:BFFGGGFGGGGGCGGFEFGFFGGGGG<CFFCFGGGGGG#99@FFGGEGBGGFGGF8CFFFEFGGG<=9DC>DDGGD?C=,;EGFGBFDGFFGGGGCC;@EEFGGGGFGGGGGGGGGFGGFDEGGGAADE5;EEB*07/9<FFCFGFGD@=@EDFF>7>9;C?E</2(2.6;<=)7,9<?(*)6(7,,(-
@M03213:59:000000000-AWR6D:1:1101:9817:1174 2:N:0:NCCTGAGC+CTATTAAG
NGACTACTGGGGTTTCTAATCCTGTTCGCTACCCACGCTTTCGAGCCTCAGCGTCAGTTACAAGCCAGAGAGCCGCTTTCGCCACAGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCCCTCTTGCACTCAAGTTAAACAGTTTCCAAAGCAAACTATGGTTGAGCCACAGCCTTTGACTTCAGACTTATCTAACCGCCTGCGCTCGCTTTCCGCCCACTAAATCCGTATAACTCTCG
+
#8ACCGGGGGGGFCGGGGGGGGGGGGFGGGGGGGGGGGGGGGGFGGGGFGFFGGGFGFGGGGGGG7CFGCFFGGGGBEGGGGGGGG?EGGGGGFGGGGGGGGFGGGGGGGE@DFAFGGGFGEGGGGGGGGGF<,@,DDFGGDGDG=EFGGGFF,=D?8DF,?EGGCFCF,DFFGGFCDGFGG8@E3<?FF8DG8BFDFEEFCBEFA7;@6;A@CGGC7915>8)702/8:4*4A+7=;((/(,/6<29(((*.//,,/)/(
```
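Each record is four lines: a header beginning with `@`, the sequence, a `+` separator, and a quality string whose characters encode Phred scores (ASCII code minus 33 under the Phred+33 convention used by modern Illumina instruments). As a quick demonstration, we can decode the start of the first quality line above:
```{r}
#| label: decode-quality
## Phred+33 decoding: quality score = ASCII code - 33
qual_string <- "#8ACCGGGG"
utf8ToInt(qual_string) - 33
## '#' decodes to 2 (a "no-call" base); the letters are scores in the 30s
```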
## Find and Organize our Data
Let's dive into some data and generate some ASVs! First things first: we need to organize our sequencing data. If you didn't bring your own data, you will find the files that we work with today under `raw_files`, and we will use this location throughout the tutorial today. If you did bring your own data, figure out where they live and put that location here:
```{r}
#| label: data-loc
data_location <- "~/OneDrive/training/16S-data-processing/raw_files/"
```
The data we're working with today comes from a small subset of publicly available samples (Accession `PRJEB42394`) analyzed here[^3] and utilizes 16S rRNA V4 sequencing of nasal microbiome samples from children with asthma. Samples were sequenced on the NextSeq 500.
[^3]: McCauley KE, Flynn K, Calatroni A, DiMassa V, LaMere B, Fadrosh DW, Lynch KV, Gill MA, Pongracic JA, Khurana Hershey GK, Kercsmar CM, Liu AH, Johnson CC, Kim H, Kattan M, O'Connor GT, Bacharier LB, Teach SJ, Gergen PJ, Wheatley LM, Togias A, LeBeau P, Presnell S, Boushey HA, Busse WW, Gern JE, Jackson DJ, Altman MC, Lynch SV; National Institute of Allergy and Infectious Diseases--sponsored Inner-City Asthma Consortium. Seasonal airway microbiome and transcriptome interactions promote childhood asthma exacerbations. J Allergy Clin Immunol. 2022 Jul;150(1):204-213. doi: 10.1016/j.jaci.2022.01.020. Epub 2022 Feb 8. PMID: 35149044.
I have another markdown file with a tutorial for how I found and downloaded these files if you're interested (See `DownloadingFromSRA.html`). For now, I have selected a handful of samples from this study that we will use to create an ASV/OTU table with DADA2 and those files reside in the `raw_files` directory we linked to above.
```{r}
#| label: data-org
library(dada2)
read_indicator <- "_1"
all_fastqs <- list.files(data_location, full.names = T)
all_fastqs[1:15]
## Split files into forward (R1) and reverse (R2) reads; this assumes the
## directory contains only FASTQs and "_1" appears only in R1 file names
r1_fastqs <- all_fastqs[grepl(read_indicator, all_fastqs)]
r2_fastqs <- all_fastqs[!grepl(read_indicator, all_fastqs)]
r1_fastqs[1:10]
r2_fastqs[1:10]
```
## Identifying Primers and Adapters
Primers and adapters are typically present in sequence data as artifacts of the sequencing process, and need to be removed before we start processing our data. For the purposes of our discussion, I'm going to use R to ask if there are any primers in our sequencing data. This dataset examines the V4 region of the 16S rRNA gene, and uses 515F and 806R primers. You can get the primer sequences from the Earth Microbiome Project's website (https://earthmicrobiome.org/protocols-and-standards/16s/). You may need to obtain different sequences for your primer strategy, but the overall code to search for our (known) primers remains the same.
Note that **Primer sequences** are at the beginning of the sequence read.
```{r}
#| eval: false
#| message: false
#| label: find-primers
library(ShortRead)
library(Biostrings)
FWD <- "GTGCCAGCMGCCGCGGTAA"
REV <- "GGACTACHVGGGTWTCTAAT"
primerHits <- function(primer, fn) {
  ## Keep just the start of each read, where primer sequences would sit
  start_seq <- sread(readFastq(fn))
  start_seq <- subseq(start_seq, start=1, end=nchar(primer)+1)
  ## fixed=FALSE lets IUPAC ambiguity codes (M, H, V, W) match the bases they cover
  nhits <- vcountPattern(primer, start_seq, fixed = FALSE)
  return(sum(nhits > 0))
}
sapply(FWD, primerHits, r1_fastqs[1:10])
sapply(REV, primerHits, r2_fastqs[1:10])
```
**Adapters**, however, are typically at the end of the read. You can search for them with tools like FastQC[^4], which is used in Nephele's Pre-Process/Quality Check pipeline (https://nephele.niaid.nih.gov/details_qc/). This provides a good first pass over the data in the early stages, and I'll walk through a demo.
[^4]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Most often, adapters are found at the end of the read because of "read-through". This happens when the length of the read is longer than the actual size of the region of interest. For instance, if you were sequencing the V4 region, which is 253bp, but you used a 600-cycle MiSeq (which produces 300-base forward and reverse reads), you would read through the amplicon and into the adapter.
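As a quick back-of-the-envelope check of that read-through arithmetic:
```{r}
#| label: read-through-check
## How far past a 253bp V4 amplicon does a 300-base read extend?
amplicon_len <- 253
read_len <- 300
read_len - amplicon_len  # ~47 bases of adapter/downstream sequence per read
```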
Since the sequencing methods used for these samples resulted in reads that are shorter than 253 bases and the FASTQC analysis didn't identify any adapters, we can move forward!
::: callout-important
# Quiz
For the following scenarios, would you expect to find adapters? If so, what adapter removal strategies could work?
1. V4 region (253bp) on a 2x150 NextSeq run
2. V1-V3 region (\~480bp) on a 2x300 MiSeq run
3. V4 region (253bp) on a 2x300 MiSeq run
4. *BONUS* ITS2 (160-390bp) on a 2x300 MiSeq run
:::
## Quality Filtering and Trimming
While DADA2 will physically correct the reads based on quality scores, it is helpful to have relatively high-quality data going in. This is a good time to review the [multiQC report](Training-Sample-QC_multiqc_report.html) to help us identify areas we may want to address.
::: callout-important
# Quiz
Review the QC report. Notice what MultiQC highlights as "good" or "bad". Do you agree?
:::
Before we start processing, we need somewhere for our filtered reads to go, so we do some housekeeping. Keep in mind that I'm performing very simple string modification, but you may need to do something more complex for your data and samples.
```{r}
#| label: prep-filter
r1_filt <- gsub("raw_files", "filtered", r1_fastqs)
r2_filt <- gsub("raw_files", "filtered", r2_fastqs)
## In my case, the filtered directory could exist before I make the document
## The `unlink` function just makes sure it's not there when I run filterAndTrim
unlink("filtered/", recursive = T)
```
Given this information, we now use the `filterAndTrim` function to clean our data. There are several different parameters you can set, and these can be reviewed with `?filterAndTrim`.
```{r}
#| label: filter-trim
out <- filterAndTrim(r1_fastqs, r1_filt, r2_fastqs, r2_filt,
maxN=0, maxEE=2, multithread = F,
verbose=T, rm.lowcomplex = 8, rm.phix = T)
out
```
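The returned matrix has `reads.in` and `reads.out` columns, so a quick way to see the fraction of reads that survived filtering in each sample is:
```{r}
#| label: filter-fraction
## Per-sample fraction of input reads that passed the filter
round(out[, "reads.out"] / out[, "reads.in"], 2)
```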
::: callout-important
# Quiz
Under the following scenarios, would you expect *more* or *fewer* reads to pass QC?
1. truncQ = 25, minLen = 125 (default: truncQ = 2, minLen=20)
2. maxEE = 5
3. truncLen = c(150,125) (default: truncLen = 0)
If you have time, feel free to test these parameters!
:::
## Denoise Reads
To actually get to "sequence variants", the DADA2 method uses a combination of an "error" model, which estimates transition probabilities, and an "abundance p-value". This p-value tests the null hypothesis that a sequence's abundance can be explained by sequencing errors from more abundant sequences; when a sequence is too abundant for error alone to explain, it is split off as a real sequence variant. We don't end up seeing these p-values, but we can delve into the objects if we're ever curious.
[![Schematic of the DADA Algorithm](images/dada_schematic.webp){fig-align="center"}](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-283)
If you want to learn more about the method, you can read either the DADA[^5] or the DADA2 papers[^6]
[^5]: Rosen MJ, Callahan BJ, Fisher DS, Holmes SP. Denoising PCR-amplified metagenome data. BMC Bioinformatics. 2012 Oct 31;13:283. doi: 10.1186/1471-2105-13-283. PMID: 23113967; PMCID: PMC3563472.
[^6]: Callahan BJ, Wong J, Heiner C, Oh S, Theriot CM, Gulati AS, McGill SK, Dougherty MK. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res. 2019 Oct 10;47(18):e103. doi: 10.1093/nar/gkz569. PMID: 31269198; PMCID: PMC6765137.
First, we train the algorithm on our current data so it knows how to apply error correction. Notice that only a portion of the samples and sequences is used for training.
```{r}
#| warning: false
#| label: learn-errors
r1_err <- learnErrors(r1_filt, multithread=F, verbose=T)
r2_err <- learnErrors(r2_filt, multithread=F, verbose=T)
```
These plots tell us, for each nucleotide transition (A2A, A2T, etc.), the average error frequency across quality scores. The points are the observed values, the red lines indicate the error rates expected from the nominal quality scores, and the black lines are the final error rates estimated by the model.
While somewhat messy, they are still trending in the correct direction. As long as the model (black line) fits the observed points, and the trend is generally negative across all transitions, DADA2 should perform as expected.
```{r}
#| label: plot-errors
#| warning: false
plotErrors(r1_err, nominalQ = T)
plotErrors(r2_err, nominalQ = T)
```
Now we apply these error rates to our filtered data and perform the DADA2 denoising!
```{r}
#| label: dada-run
#| warning: false
r1_dada <- dada(r1_filt, r1_err)
r2_dada <- dada(r2_filt, r2_err)
```
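If you're curious about the abundance p-values mentioned earlier, you can peek inside the returned `dada`-class objects. A sketch, with the caveat that the internal `$clustering` table is diagnostic output rather than a stable API:
```{r}
#| label: inspect-dada
#| eval: false
## Per-variant diagnostics for the first sample, including the "birth"
## p-value that split each new variant off from its parent cluster
head(r1_dada[[1]]$clustering[, c("abundance", "birth_pval", "birth_ham")])
```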
::: callout-tip
## Pooling
As you watched each sample go through the DADA2 process, you may have noticed that each sample is denoised independently. However, DADA2 also provides options that allow information to be shared between samples.
If you choose `pool=TRUE`, information from all samples is pooled before determining the exact sequence variants in each sample. However, this can be quite resource-intensive (in my tests on this data, it took about 20 minutes with `pool=TRUE`).
Thus, the authors created `pool="pseudo"`, or pseudo-pooling, which allows for information sharing between samples, but also allows for each sample to be processed individually. This reduces time and memory requirements compared to true pooling, while providing similar benefits. You can read more about this option [here](https://benjjneb.github.io/dada2/pseudo.html#pseudo-pooling).
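A minimal sketch of how you would request either mode (not run here; `pool` is an argument of `dada()`):
```{r}
#| label: pooling-options
#| eval: false
r1_dada_pooled <- dada(r1_filt, r1_err, pool = TRUE)      # full pooling
r1_dada_pseudo <- dada(r1_filt, r1_err, pool = "pseudo")  # pseudo-pooling
```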
:::
## Merge Reads
Now that we've corrected our reads, we can merge our forward and reverse reads. Here, we need to recall our sequencing strategy again! Our region of interest is 253 bases, which we cover with 153 bases from the forward read and 153 bases from the reverse read. Since 153 + 153 - 253 = 53, we can't expect an overlap greater than \~53 bases. Here is a schematic that might help:
![Read Merging Schematic](images/ReadMerging.png){fig-align="center"}
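The same arithmetic as a reusable one-liner, which may help when planning other primer and read-length combinations (the quiz below is a good place to try it):
```{r}
#| label: overlap-check
## Expected overlap = forward length + reverse length - amplicon length
overlap <- function(fwd_len, rev_len, amplicon_len) fwd_len + rev_len - amplicon_len
overlap(153, 153, 253)  # ~53 bases for our V4 data
```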
```{r}
#| label: merge-reads
merged_reads <- mergePairs(r1_dada, r1_filt, r2_dada, r2_filt, verbose=TRUE, minOverlap = 20)
head(merged_reads[[1]])
```
::: callout-important
## Quiz
Let's assume that you have a V1-V3 (\~480bp) sequencing strategy with a 2x300 MiSeq run. On these types of runs, quality tends to decrease quite precipitously once you get to the ends of these longer reads, so you decide to trim at 270 on the forward reads and 230 on the reverse read (`truncLen = c(270,230)`).
What is the overlap length?
:::
::: callout-tip
## Loss of reads at merged step
Sometimes you might find that, even though you've ensured sufficient overlap, a large number of reads is still lost at the merging step. This can happen for several reasons.
First, it can indicate that primers are indeed present in your data and need trimming.
Second, it could indicate that a filtering/trimming step needs to be better optimized for your data (see above).
Losing some reads to merging is expected, but if you find you're losing 50% or more, it might be worth diving in a little deeper; a quick diagnostic sketch follows below.
:::
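One quick way to check merge loss per sample is to compare the number of denoised forward reads to the number that successfully merged (a sketch using the objects created above):
```{r}
#| label: merge-loss-check
## Fraction of denoised reads that survived merging, per sample
denoised_n <- sapply(r1_dada, function(x) sum(getUniques(x)))
merged_n <- sapply(merged_reads, function(x) sum(x$abundance))
round(merged_n / denoised_n, 2)
```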
## Make Sequence Table
And finally we can make the first draft of our table!
```{r}
#| label: seq-table
seq_table <- makeSequenceTable(merged_reads)
dim(seq_table)
```
From this information, we can see that there are `r nrow(seq_table)` rows, one for each of the `r length(merged_reads)` samples, and `r ncol(seq_table)` columns, one for each sequence variant.
We can use basic R functions to explore the dataset, much like we would any table of counts.
Below is the distribution of the sequence lengths for our data. Notice that a large majority of sequences are 253 bases long, which is how long the V4 region is.
```{r}
#| label: base-lengths
table(nchar(getSequences(seq_table)))
```
::: {.callout-tip appearance="simple"}
## Filter for sequence length!
If you look at the table and find that you have sequences that are much larger or smaller than your target amplicon, you can remove them through basic R commands like: `filt_seq_table <- seq_table[, nchar(getSequences(seq_table)) %in% 250:255]`
These abnormal lengths can arise from non-specific priming and typically are not cause for concern.
:::
## Remove Chimeras
Chimeras are hybrids of two "parent" sequences, meaning that you have one bacterial sequence belonging to, say, *E. coli*, on one side of the read and another belonging to *Streptococcus* on the other. These are typically artifacts of the PCR process and should be filtered out. We do that with the `removeBimeraDenovo` function, which checks lower-abundance sequence variants against higher-abundance ones to determine whether any can be reconstructed from two separate "parents".
```{r}
#| label: chimeras
chimera_check <- removeBimeraDenovo(seq_table, verbose=T)
dim(chimera_check)
```
While we lose about half of our sequence *variants* to chimeras, we can determine how many *reads* were retained with this quick one-liner:
```{r}
#| label: chimera-summary
sum(chimera_check)/sum(seq_table)
```
::: callout-important
## Workflow Checkpoint!
Let's take a look at how our data looks so far in a nice, convenient table!
```{r}
#| label: track-reads
getN <- function(x) sum(getUniques(x))  # total reads in a dada/merger object
track <- cbind(out, sapply(r1_dada, getN), sapply(r2_dada, getN), sapply(merged_reads, getN), rowSums(chimera_check))
colnames(track) <- c("input","filtered","denoised_r1","denoised_r2","merged","nonchimeric")
track
```
As you can see, there was a slight loss of reads during filtering and merging for these particular samples. For now, there isn't enough evidence that we need to double-check our workflow, but if the data ends up more problematic, a table like this can help pinpoint which step to focus on.
:::
## Assign Taxonomy
We now have `r ncol(chimera_check)` sequence variants and we want to determine which bacteria they actually represent. In order to do this, we need to download specially formatted databases (you can find several different databases [here](https://benjjneb.github.io/dada2/training.html), some of which are contributed by outside teams). For our purposes, we'll download the [SILVA database](https://zenodo.org/record/4587955)[^7]. You'll need all three files, and we'll save them to a folder called `databases` in our working directory.
[^7]: Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl. Acids Res. 41 (D1): D590-D596.
```{r}
#| label: taxonomy
taxa <- assignTaxonomy(chimera_check, "databases/silva_nr99_v138.1_train_set.fa.gz", multithread=F, verbose=T, minBoot = 80)
taxa_print <- taxa
rownames(taxa_print) <- NULL
head(taxa_print,10)
```
You'll notice a parameter called `minBoot`, which is the minimum bootstrapped probability that the call is accurate. The default is 50. **Play** with different values. What do you notice about the specificity of the classifications? You may need to adjust the number of rows of `taxa_print` that are shown to be able to see some of these effects.
For today, we will only assign taxonomy to the genus level. Species-level assignment happens in a separate function (`assignSpecies`, or its convenience wrapper `addSpecies`) that relies on exact matching, which is considered fully valid for DADA2 output because sequence variants are exact sequences. The method identifies the species independently and then only assigns it when it is consistent with the genus call. By default, a species is only assigned if there is one unique species that meets the criteria; however, there is also an option to list all species that match the sequence variant at 100% identity.
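A hedged sketch of what that would look like, assuming you also saved the species-assignment file from the Zenodo record linked above (the `silva_species_assignment_v138.1.fa.gz` file) into the same `databases` folder:
```{r}
#| label: species-sketch
#| eval: false
## Append a Species column to the genus-level taxonomy via exact matching;
## allowMultiple=TRUE lists every species matching at 100% identity
taxa_sp <- addSpecies(taxa, "databases/silva_species_assignment_v138.1.fa.gz",
                      allowMultiple = TRUE, verbose = TRUE)
```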
## Make Phylogenetic Tree
Next, we align the sequences to each other so that we can create a phylogenetic tree. This tree can be used for statistical questions or visualizations later. As a disclaimer, there are a few additional steps needed to "optimize" the fit of the tree, but we won't do them here because they can take a long time. If you would like to explore this method of tree building further, you can find additional documentation [here](https://f1000research.com/articles/5-1492).
```{r}
#| label: phy-tree
#| message: false
library(DECIPHER)
library(phangorn)
seqs <- as.character(getSequences(chimera_check))
names(seqs) <- seqs
alignment <- DECIPHER::AlignSeqs(DNAStringSet(seqs), anchor=NA)
phang.align <- as.phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm)
fit <- pml(treeNJ, data=phang.align)
```
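For reference, the optimization step we're skipping looks roughly like this in the F1000 workflow linked above (a sketch; it is slow on anything but small datasets):
```{r}
#| label: optim-tree
#| eval: false
## Refit under a GTR+Gamma+Invariant-sites model and optimize the topology
fitGTR <- update(fit, k = 4, inv = 0.2)
fitGTR <- optim.pml(fitGTR, model = "GTR", optInv = TRUE, optGamma = TRUE,
                    rearrangement = "stochastic",
                    control = pml.control(trace = 0))
```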
## Generate a Phyloseq Object
Now we have the components we need to make an object that is compatible with various microbiome packages. These microbiome packages help bring the data to life and make both visualization and analysis much easier. My personal preference is [phyloseq](https://joey711.github.io/phyloseq/), but there are other packages out there that do the same or similar things, including [ampvis2](https://kasperskytte.github.io/ampvis2/articles/ampvis2.html) and [SummarizedExperiment](https://github.com/Bioconductor/SummarizedExperiment). I currently don't have sample data on-hand for these samples, so we will make some dummy data first.
```{r}
#| label: make-phyloseq
library(phyloseq)
## size = nrow(chimera_check) keeps the dummy table in sync with the number
## of samples (a hard-coded size would break on a different dataset)
dummy_sample_data <- data.frame(samples=rownames(chimera_check), treatment_group=sample(c("A","B"), size=nrow(chimera_check), replace=TRUE))
rownames(dummy_sample_data) <- dummy_sample_data$samples
```
Now we import our components (OTU table, sample data, phylogenetic tree, and taxonomic table) into a single unified object:
```{r}
#| label: phyloseq-object
phy_obj <- phyloseq(otu_table(chimera_check, taxa_are_rows=F), sample_data(dummy_sample_data), phy_tree(fit$tree), tax_table(taxa))
phy_obj
```
With this new phyloseq object, we can perform visualizations. For example, we will create a bar plot and some PCoA plots.
```{r}
#| label: phyloseq-plots
library(ggplot2)
plot_bar(phy_obj, fill="Genus") + guides(fill="none")
ord <- ordinate(phy_obj, method="PCoA", distance = "bray")
plot_ordination(phy_obj, ord)
ord <- ordinate(phy_obj, method="PCoA", distance = "wunifrac")
plot_ordination(phy_obj, ord)
```
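A common next step is to convert raw counts to relative abundances so the bar plot compares composition rather than sequencing depth; a short sketch using phyloseq's `transform_sample_counts`:
```{r}
#| label: rel-abundance
## Scale each sample's counts to proportions that sum to 1
phy_rel <- transform_sample_counts(phy_obj, function(x) x / sum(x))
plot_bar(phy_rel, fill = "Genus") + guides(fill = "none")
```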
## Nephele
We can also process the data pretty much exactly the same way with Nephele. I have some jobs that I've already completed; feel free to review and experiment with them.
**QC**: [a2e52d4e7445](https://nephele.niaid.nih.gov/results_page/a2e52d4e7445)
**DADA2**: [4cac15889e25](https://nephele.niaid.nih.gov/results_page/4cac15889e25)
If you want to re-run one of these jobs, visit the main Nephele page (nephele.niaid.nih.gov) and place the job IDs above into the "Need to re-run a Nephele job?" text box. Jobs remain available for 90 days, and I ran these on October 22, so they should be available for 3 more months.
As you review the results, you'll notice that some specific tools or implementations are a bit different (i.e., different phylogenetic tree builders, etc.), but broadly the same techniques are applied.
## Wrap-up
Thank you for joining today!
If you have feedback, ideas for additional sessions/trainings, questions about your own project, etc., feel free to reach out to me (kathryn.mccauley@nih.gov) or make a ticket to meet with the Bioinformatics team (go to bioinformatics.niaid.nih.gov and click "Start Your Project Today"); we would be happy to assist!