Skip to content

Latest commit

 

History

History
635 lines (452 loc) · 29.8 KB

16s.md

File metadata and controls

635 lines (452 loc) · 29.8 KB

16s Metabarcoding Analysis

This tutorial is largely inspired of the MiSeq SOP from the Schloss Lab.

Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 79(17):5112-20.

Table of Contents

Introduction

The 16S rRNA gene is a section of prokaryotic DNA found in all bacteria and archaea. This gene codes for an rRNA, and this rRNA in turn makes up part of the ribosome. The first 'r' in rRNA stands for ribosomal. The ribosome is composed of two subunits, the large subunit (LSU) and the small subunit (SSU).

The 16S rRNA gene is a commonly used tool for identifying bacteria for several reasons. First, traditional characterization depended upon phenotypic traits like gram positive or gram negative, bacillus or coccus, etc. Taxonomists today consider analysis of an organism's DNA more reliable than classification based solely on phenotypes. Secondly, researchers may, for a number of reasons, want to identify or classify only the bacteria within a given environmental or medical sample. While there is a homologous gene in eukaryotes, the 18S rRNA gene, it is distinct, thereby rendering the 16S rRNA gene a useful tool for extracting and identifying bacteria as separate from plant, animal, fungal, and protist DNA within the same sample. Thirdly, the 16S rRNA gene is relatively short at 1.5 kb, making it faster and cheaper to sequence than many other unique bacterial genes.

Mothur is a command-line computer program for analyzing sequence data from microbial communities and namely 16s data. mothur is licensed under the GPL and is free to use.

Softwares Required for this Tutorial

Downloading the Data and Start Mothur

Firstly, download and unzip the sample dataset:

wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
unzip MiSeqSOPData.zip

In the MiSeq_SOP directory, you'll find the reads files in fastq format, as well as a file called stability.files

The first lines of stability.files look like this:

F3D0 F3D0_S188_L001_R1_001.fastq F3D0_S188_L001_R2_001.fastq F3D141 F3D141_S207_L001_R1_001.fastq F3D141_S207_L001_R2_001.fastq F3D142 F3D142_S208_L001_R1_001.fastq F3D142_S208_L001_R2_001.fastq F3D143 F3D143_S209_L001_R1_001.fastq F3D143_S209_L001_R2_001.fastq

The first column is the name of the sample. The second column is the name of the forward read for that sample and the third columns in the name of the reverse read for that sample.

Now it's time to start mothur. Type mothur in your terminal. You should see your prompt changing to

mothur >

Reducing Sequencing and PCR Errors

The first thing we want to do is combine our two sets of reads for each sample and then to combine the data from all of the samples. This is done using the make.contigs command, which requires stability.files as input. This command will extract the sequence and quality score data from your fastq files, create the reverse complement of the reverse read and then join the reads into contigs.

make.contigs(file=stability.files, processors=8)  

It took 30 secs to process 152360 sequences.  

Group count:
F3D0	7793
F3D1	5869
F3D141	5958
F3D142	3183
F3D143	3178
F3D144	4827
F3D145	7377
F3D146	5021
F3D147	17070
F3D148	12405
F3D149	13083
F3D150	5509
F3D2	19620
F3D3	6758
F3D5	4448
F3D6	7989
F3D7	5129
F3D8	5294
F3D9	7070
Mock	4779

Total of all groups is 152360

Output File Names:
stability.trim.contigs.fasta
stability.trim.contigs.qual
stability.contigs.report
stability.scrap.contigs.fasta
stability.scrap.contigs.qual
stability.contigs.groups

The stability.contigs.report file will tell you something about the contig assembly for each read. Let's see what these sequences look like using the summary.seqs command:

summary.seqs(fasta=stability.trim.contigs.fasta)

Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	248	248	0	3	1
2.5%-tile:	1	252	252	0	3	3810
25%-tile:	1	252	252	0	4	38091
Median: 	1	252	252	0	4	76181
75%-tile:	1	253	253	0	5	114271
97.5%-tile:	1	253	253	6	6	148552
Maximum:	1	502	502	249	243	152360
Mean:	1	252.811	252.811	0.70063	4.44854
# of Seqs:	152360

This tells us that we have 152360 sequences that for the most part vary between 248 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this, the reads are supposed to be 251 bp each. This read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs.

screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)

You'll notice that mothur remembered that we used 8 processors in make.contigs. To see what else mothur knows about you, run the following:

get.current()

Current files saved by mothur:
fasta=stability.trim.contigs.good.fasta
group=stability.contigs.good.groups
qfile=stability.trim.contigs.qual
processors=8
summary=stability.trim.contigs.summary

What this means is that mothur remembers your latest fasta file and group file as well as the number of processors you have. So you could run:

mothur > summary.seqs(fasta=stability.trim.contigs.good.fasta)
mothur > summary.seqs(fasta=current)
mothur > summary.seqs()

and get the same output for each command.

But, now that we have filtered the sequencing errors, let's move to the next step.

Processing Improved Sequences

We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same sequences several times, we'll make our sequences unique:

unique.seqs(fasta=stability.trim.contigs.good.fasta)

If two sequences have the same identical sequence, then they're considered duplicates and will get merged. In the screen output there are two columns - the first is the number of sequences characterized and the second is the number of unique sequences remaining

Another thing to do to make our lives easier is to simplify the names and group files. If you look at the most recent versions of those files you'll see together they are 13 MB. This may not seem like much, but with a full MiSeq run those long sequence names can add up and make life tedious. So we'll run count.seqs to generate a table where the rows are the names of the unique seqeunces and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.

This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands we'll use it by using the count option:

count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
summary.seqs(count=stability.trim.contigs.good.count_table)

Using stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	250	250	0	3	1
2.5%-tile:	1	252	252	0	3	3227
25%-tile:	1	252	252	0	4	32265
Median: 	1	252	252	0	4	64530
75%-tile:	1	253	253	0	5	96794
97.5%-tile:	1	253	253	0	6	125832
Maximum:	1	270	270	0	12	129058
Mean:	1	252.462	252.462	0	4.36663
# of unique seqs:	16477
total # of seqs:	129058

Now we need to align our sequences to the reference alignment.

First we need to download the SILVA database.

# This step should be done outside mothur
wget http://www.mothur.org/w/images/9/98/Silva.bacteria.zip
unzip Silva.bacteria.zip

If you have quit mothur to download the database, rerun the mothur command, then take a look at the database you have downloaded:

summary.seqs(fasta=silva.bacteria/silva.bacteria.fasta, processors=8)

Now do the alignment using align.seqs:

align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.bacteria/silva.bacteria.fasta)

We can then run summary.seqs again to get a summary of our alignment:

summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)

You'll see that the bulk of the sequences start at position 13862 and end at position 23444. Some sequences start at position 13144 or 13876 and end at 22587 or 25294. These deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the aliignments. Sometimes you'll see sequences that start and end at the same position indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region we'll re-run screen.seqs to get sequences that start at or before position 1968 and end at or after position 11550. We'll also set the maximum homopolymer length to 8 since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that we can update the table for the sequences we're removing and we're also using the summary file so we don't have to figure out again all the start and stop positions:

screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=13862, end=23444, maxhomop=8)
summary.seqs(fasta=current, count=current)

No we can trim both ends of the aligned reads to be sure the all overlap exactly the same region. We can do this with fliter.seqs

filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)

We may have introduced redundancy by trimming the ends of the sequences, so we will re-run unique.seqs

unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)

This identified 3 duplicate sequences that we've now merged with previous unique sequences. The next thing we want to do to further de-noise our sequences is to pre-cluster the sequences using the pre.cluster command allowing for up to 2 differences between sequences. This command will split the sequences by group and then sort them by abundance and go from most abundant to least and identify sequences that are within 2 nt of each other. If they are then they get merged. We generally favor allowing 1 difference for every 100 bp of sequence:

pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)

At this point we have removed as much sequencing error as we can and it is time to turn our attention to removing chimeras. We'll do this using the UCHIME algorithm that is called within mothur using the chimera.uchime command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:

chimera.uchime(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

Running chimera.uchime with the count file will remove the chimeric sequences from the count file. But you still need to remove those sequences from the fasta file. We do this using remove.seqs:

remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)

As a final quality control step, we need to see if there are any "undesirables" in our dataset. Sometimes when we pick a primer set they will amplify other stuff that gets to this point in the pipeline such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondira. There's also just the random stuff that we want to get rid of. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs command:

classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.bacteria/silva.bacteria.fasta, taxonomy=silva.bacteria/silva.bacteria.rdp.tax, cutoff=80)

Now that everything is classified we want to remove our undesirables. We do this with the remove.lineage command:

remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

Analysis

OTUs

We will use cluster.split for clustering sequences into OTUs

cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)

We used taxlevel=4, which corresponds to the level of Order

Next we want to know how many sequences are in each OTU from each group and we can do this using the make.shared command. Here we tell mothur that we're really only interested in the 0.03 cutoff level:

make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.03)

We also want to know the taxonomy for each of our OTUs. We can get the consensus taxonomy for each OTU using the classify.otu command

classify.otu(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.pick.taxonomy, label=0.03)

If you open the file stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy, you can get information about your OTUs.

OTU	Size	Taxonomy
Otu0001	12328	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0002	8918	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0003	7850	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0004	7478	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0005	7478	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0006	6650	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0007	6341	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Bacteroidaceae(100);Bacteroides(100);Bacteroides_unclassified(100);
Otu0008	5374	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Rikenellaceae(100);Alistipes(100);Alistipes_unclassified(100);
Otu0009	3618	Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);

This is telling you that Otu0001 was observed 12328 times in your sample and that 100% of the sequences were from Barnesiella

In order to vizualise the composition of our datasets, we'll use phyloseq, a R package to work with microbiom data.

Most of the phyloseq functionalities require aand a tree file. We need to generate it with mothur:

dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, output=lt, processors=8)
clearcut(phylip=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.phylip.dist)

Batch Mode

It is perfectly acceptable to enter the commands for your analysis from within mothur. We call this the interactive mode. If you are doing a lot these types of analysis or you want to use this SOP on your own data without thinking too much, you can run mothur in batch mode using

./mothur script.batch

where script.batch (or whatever name you want, really) is a text file containing all the commands that you previously entered in interactive mode.

If you have time, copy all the commands from this tutorial in a file, a try to make mothur work in batch mode!

PhyloSeq Analysis

First, install and load the phyloseq package:

source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')

library("phyloseq")
library("ggplot2")
library("plyr")
theme_set(theme_bw())  # set the ggplot theme

The PhyloSeq package has an import_mothur function that you can use to import the files you generated with mothur. As an example, import the example mothur data provided by phyloseq as an example:

mothlist <- system.file("extdata", "esophagus.fn.list.gz", package="phyloseq")
mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package="phyloseq")
mothtree <- system.file("extdata", "esophagus.tree.gz", package="phyloseq")

show_mothur_cutoffs(mothlist)
cutoff <- '0.10'
x <- import_mothur(mothlist, mothgroup, mothtree, cutoff)
x

Note: If if you ever work with 16s data and decide to use QIIME instead of mothur, phyloseq also has an import_qiime function. Also, newer version of qiime and mothur have the ability to produce a .biom file.

“The biom file format (canonically pronounced ‘biome’) is designed to be a general-use format for representing counts of observations in one or more biological samples. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium candidate project.”

More info on http://biom-format.org/

For the rest of this tutorial, we will work with an example dataset provided by the phyloseq package. Load the data with the following command:

data(enterotype)
data("GlobalPatterns")

Ordination and distance-based analysis

Let's do some preliminary filtering. Remove the OTUs that included all unassigned sequences ("-1")

enterotype <- subset_species(enterotype, Genus != "-1")

The available distance methods coded in the phyloseq package:

dist_methods <- unlist(distanceMethodList)
print(dist_methods)

##     UniFrac1     UniFrac2        DPCoA          JSD     vegdist1
##    "unifrac"   "wunifrac"      "dpcoa"        "jsd"  "manhattan"
##     vegdist2     vegdist3     vegdist4     vegdist5     vegdist6
##  "euclidean"   "canberra"       "bray" "kulczynski"    "jaccard"
##     vegdist7     vegdist8     vegdist9    vegdist10    vegdist11
##      "gower"   "altGower"   "morisita"       "horn"  "mountford"
##    vegdist12    vegdist13    vegdist14    vegdist15   betadiver1
##       "raup"   "binomial"       "chao"        "cao"          "w"
##   betadiver2   betadiver3   betadiver4   betadiver5   betadiver6
##         "-1"          "c"         "wb"          "r"          "I"
##   betadiver7   betadiver8   betadiver9  betadiver10  betadiver11
##          "e"          "t"         "me"          "j"        "sor"
##  betadiver12  betadiver13  betadiver14  betadiver15  betadiver16
##          "m"         "-2"         "co"         "cc"          "g"
##  betadiver17  betadiver18  betadiver19  betadiver20  betadiver21
##         "-3"          "l"         "19"         "hk"        "rlb"
##  betadiver22  betadiver23  betadiver24        dist1        dist2
##        "sim"         "gl"          "z"    "maximum"     "binary"
##        dist3   designdist
##  "minkowski"        "ANY"

Remove the two distance-methods that require a tree, and the generic custom method that requires user-defined distance arguments.

# These require tree
dist_methods[(1:3)]

# Remove them from the vector
dist_methods <- dist_methods[-(1:3)]
# This is the user-defined method:
dist_methods["designdist"]

# Remove the user-defined distance
dist_methods = dist_methods[-which(dist_methods=="ANY")]

Loop through each distance method, save each plot to a list, called plist.

plist <- vector("list", length(dist_methods))
names(plist) = dist_methods
for( i in dist_methods ){
    # Calculate distance matrix
    iDist <- distance(enterotype, method=i)
    # Calculate ordination
    iMDS  <- ordinate(enterotype, "MDS", distance=iDist)
    ## Make plot
    # Don't carry over previous plot (if error, p will be blank)
    p <- NULL
    # Create plot, store as temp variable, p
    p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
    # Add title to each plot
    p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
    # Save the graphic to file.
    plist[[i]] = p
}

Combine results and shade according to Sequencing technology:

df = ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p = ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
p = p + geom_point(size=3, alpha=0.5)
p = p + facet_wrap(~distance, scales="free")
p = p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p

Print individual plots:

print(plist[["jsd"]])
print(plist[["jaccard"]])
print(plist[["bray"]])
print(plist[["euclidean"]])

Alpha diversity graphics

Here is the default graphic produced by the plot_richness function on the GP example dataset:

GP <- prune_species(speciesSums(GlobalPatterns) > 0, GlobalPatterns)
plot_richness(GP)

Note that in this case, the Fisher calculation results in a warning (but still plots). We can avoid this by specifying a measures argument to plot_richness, which will include just the alpha-diversity measures that we want.

plot_richness(GP, measures=c("Chao1", "Shannon"))

We can specify a sample variable on which to group/organize samples along the horizontal (x) axis. An experimentally meaningful categorical variable is usually a good choice – in this case, the "SampleType" variable works much better than attempting to interpret the sample names directly (as in the previous plot):

plot_richness(GP, x="SampleType", measures=c("Chao1", "Shannon"))

Now suppose we wanted to use an external variable in the plot that isn’t in the GP dataset already – for example, a logical that indicated whether or not the samples are human-associated. First, define this new variable, human, as a factor (other vectors could also work; or other data you might have describing the samples).

sampleData(GP)$human <- getVariable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")

Now tell plot_richness to map the new human variable on the horizontal axis, and shade the points in different color groups, according to which "SampleType" they belong.

plot_richness(GP, x="human", color="SampleType", measures=c("Chao1", "Shannon"))

We can merge samples that are from the environment (SampleType), and make the points bigger with a ggplot2 layer. First, merge the samples.

GPst = merge_samples(GP, "SampleType")
# repair variables that were damaged during merge (coerced to numeric)
sample_data(GPst)$SampleType <- factor(sample_names(GPst))
sample_data(GPst)$human <- as.logical(sample_data(GPst)$human)

p = plot_richness(GPst, x="human", color="SampleType", measures=c("Chao1", "Shannon"))
p + geom_point(size=5, alpha=0.7)

Trees

head(phy_tree(GlobalPatterns)$node.label, 10)

The node data from the GlobalPatterns dataset are strange. They look like they might be bootstrap values, but they sometimes have two decimals.

phy_tree(GlobalPatterns)$node.label = substr(phy_tree(GlobalPatterns)$node.label, 1, 4)

Additionally, the dataset has many OTUs, too many to fit them all on a tree. Let's take the 50 more abundant and plot a basic tree:

physeq = prune_taxa(taxa_names(GlobalPatterns)[1:50], GlobalPatterns)
plot_tree(physeq)

dots are annotated next to tips (OTUs) in the tree, one for each sample in which that OTU was observed. Let's color the dots by taxonomic ranks, and sample covariates:

plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="SampleType")

by taxonomic class:

plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="Class")

It can be useful to label the tips:

plot_tree(physeq, color="SampleType", label.tips="Genus")

Making a radial tree is easy with ggplot2, simply recognizing that our vertically-oriented tree is a cartesian mapping of the data to a graphic – and that a radial tree is the same mapping, but with polar coordinates instead.

plot_tree(physeq, nodelabf=nodeplotboot(60,60,3), color="SampleType", shape="Class", ladderize="left") + coord_polar(theta="y")

Bar plots

Bar plots are one of the easiest way to vizualize your data. But be careful, they can be misleading if grouping sample!

Let's take a subset of the GlobalPatterns dataset, and produce a basic bar plot:

gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
plot_bar(gp.ch)

The dataset is plotted with every sample mapped individually to the horizontal (x) axis, and abundance values mapped to the veritcal (y) axis. At each sample’s horizontal position, the abundance values for each OTU are stacked in order from greatest to least, separate by a thin horizontal line. As long as the parameters you choose to separate the data result in more than one OTU abundance value at the respective position in the plot, the values will be stacked in order as a means of displaying both the sum total value while still representing the individual OTU abundances.

The bar plot will be clearer with color to represent the Genus to which each OTU belongs.

plot_bar(gp.ch, fill="Genus")

Now keep the same fill color, and group the samples together by the SampleType variable; essentially, the environment from which the sample was taken and sequenced.

plot_bar(gp.ch, x="SampleType", fill="Genus")

A more complex example using facets:

plot_bar(gp.ch, "Family", fill="Genus", facet_grid=~SampleType)

Heatmaps

The following two lines subset the dataset to just the top 300 most abundant Bacteria taxa across all samples (in this case, with no prior preprocessing. Not recommended, but quick).

data("GlobalPatterns")
gpt <- subset_taxa(GlobalPatterns, Kingdom=="Bacteria")
gpt <- prune_taxa(names(sort(taxa_sums(gpt),TRUE)[1:300]), gpt)
plot_heatmap(gpt, sample.label="SampleType")

subset a smaller dataset based on an Archaeal phylum

gpac <- subset_taxa(GlobalPatterns, Phylum=="Crenarchaeota")
plot_heatmap(gpac)

Plot microbiome network

There is a random aspect to some of the network layout methods. For complete reproducibility of the images produced later in this tutorial, it is possible to set the random number generator seed explicitly:

set.seed(711L)

Because we want to use the enterotype designations as a plot feature in these plots, we need to remove the 9 samples for which no enterotype designation was assigned (this will save us the hassle of some pesky warning messages, but everything still works; the offending samples are anyway omitted).

enterotype = subset_samples(enterotype, !is.na(Enterotype))

Create an igraph-based network based on the default distance method, “Jaccard”, and a maximum distance between connected nodes of 0.3.

ig <- make_network(enterotype, max.dist=0.3)
plot_network(ig, enterotype)

The previous graphic displayed some interesting structure, with one or two major subgraphs comprising a majority of samples. Furthermore, there seemed to be a correlation in the sample naming scheme and position within the network. Instead of trying to read all of the sample names to understand the pattern, let’s map some of the sample variables onto this graphic as color and shape:

plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)

In the previous examples, the choice of maximum-distance and distance method were informed, but arbitrary. Let’s see what happens when the maximum distance is lowered, decreasing the number of edges in the network

ig <- make_network(enterotype, max.dist=0.2)
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)

Let’s repeat the previous exercise, but replace the Jaccard (default) distance method with Bray-Curtis

ig <- make_network(enterotype, dist.fun="bray", max.dist=0.3)
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)