This tutorial is largely inspired of the MiSeq SOP from the Schloss Lab.
Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 79(17):5112-20.
- Introduction
- Softwares Required for this Tutorial
- Downloading the Data and Start Mothur
- Reducing Sequencing and PCR Errors
- Processing Improved Sequences
- Analysis
- Batch Mode
The 16S rRNA gene is a section of prokaryotic DNA found in all bacteria and archaea. This gene codes for an rRNA, and this rRNA in turn makes up part of the ribosome. The first 'r' in rRNA stands for ribosomal. The ribosome is composed of two subunits, the large subunit (LSU) and the small subunit (SSU).
The 16S rRNA gene is a commonly used tool for identifying bacteria for several reasons. First, traditional characterization depended upon phenotypic traits like gram positive or gram negative, bacillus or coccus, etc. Taxonomists today consider analysis of an organism's DNA more reliable than classification based solely on phenotypes. Secondly, researchers may, for a number of reasons, want to identify or classify only the bacteria within a given environmental or medical sample. While there is a homologous gene in eukaryotes, the 18S rRNA gene, it is distinct, thereby rendering the 16S rRNA gene a useful tool for extracting and identifying bacteria as separate from plant, animal, fungal, and protist DNA within the same sample. Thirdly, the 16S rRNA gene is relatively short at 1.5 kb, making it faster and cheaper to sequence than many other unique bacterial genes.
Mothur is a command-line computer program for analyzing sequence data from microbial communities and namely 16s data. mothur is licensed under the GPL and is free to use.
Firstly, download and unzip the sample dataset:
wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
unzip MiSeqSOPData.zip
In the MiSeq_SOP
directory, you'll find the reads files in fastq format, as well as a file called stability.files
The first lines of stability.files
look like this:
F3D0 F3D0_S188_L001_R1_001.fastq F3D0_S188_L001_R2_001.fastq F3D141 F3D141_S207_L001_R1_001.fastq F3D141_S207_L001_R2_001.fastq F3D142 F3D142_S208_L001_R1_001.fastq F3D142_S208_L001_R2_001.fastq F3D143 F3D143_S209_L001_R1_001.fastq F3D143_S209_L001_R2_001.fastq
The first column is the name of the sample. The second column is the name of the forward read for that sample and the third columns in the name of the reverse read for that sample.
Now it's time to start mothur. Type mothur
in your terminal. You should see your prompt changing to
mothur >
The first thing we want to do is combine our two sets of reads for each sample and then to combine the data from all of the samples. This is done using the make.contigs
command, which requires stability.files
as input. This command will extract the sequence and quality score data from your fastq files, create the reverse complement of the reverse read and then join the reads into contigs.
make.contigs(file=stability.files, processors=8)
It took 30 secs to process 152360 sequences.
Group count:
F3D0 7793
F3D1 5869
F3D141 5958
F3D142 3183
F3D143 3178
F3D144 4827
F3D145 7377
F3D146 5021
F3D147 17070
F3D148 12405
F3D149 13083
F3D150 5509
F3D2 19620
F3D3 6758
F3D5 4448
F3D6 7989
F3D7 5129
F3D8 5294
F3D9 7070
Mock 4779
Total of all groups is 152360
Output File Names:
stability.trim.contigs.fasta
stability.trim.contigs.qual
stability.contigs.report
stability.scrap.contigs.fasta
stability.scrap.contigs.qual
stability.contigs.groups
The stability.contigs.report
file will tell you something about the contig assembly for each read. Let's see what these sequences look like using the summary.seqs
command:
summary.seqs(fasta=stability.trim.contigs.fasta)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 248 248 0 3 1
2.5%-tile: 1 252 252 0 3 3810
25%-tile: 1 252 252 0 4 38091
Median: 1 252 252 0 4 76181
75%-tile: 1 253 253 0 5 114271
97.5%-tile: 1 253 253 6 6 148552
Maximum: 1 502 502 249 243 152360
Mean: 1 252.811 252.811 0.70063 4.44854
# of Seqs: 152360
This tells us that we have 152360 sequences that for the most part vary between 248 and 253 bases. Interestingly, the longest read in the dataset is 502 bp. Be suspicious of this, the reads are supposed to be 251 bp each. This read clearly didn't assemble well (or at all). Also, note that at least 2.5% of our sequences had some ambiguous base calls. We'll take care of these issues in the next step when we run screen.seqs
.
screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)
You'll notice that mothur remembered that we used 8 processors in make.contigs
. To see what else mothur knows about you, run the following:
get.current()
Current files saved by mothur:
fasta=stability.trim.contigs.good.fasta
group=stability.contigs.good.groups
qfile=stability.trim.contigs.qual
processors=8
summary=stability.trim.contigs.summary
What this means is that mothur remembers your latest fasta file and group file as well as the number of processors you have. So you could run:
mothur > summary.seqs(fasta=stability.trim.contigs.good.fasta)
mothur > summary.seqs(fasta=current)
mothur > summary.seqs()
and get the same output for each command.
But, now that we have filtered the sequencing errors, let's move to the next step.
We anticipate that many of our sequences are duplicates of each other. Because it's computationally wasteful to align the same sequences several times, we'll make our sequences unique:
unique.seqs(fasta=stability.trim.contigs.good.fasta)
If two sequences have the same identical sequence, then they're considered duplicates and will get merged. In the screen output there are two columns - the first is the number of sequences characterized and the second is the number of unique sequences remaining
Another thing to do to make our lives easier is to simplify the names and group files. If you look at the most recent versions of those files you'll see together they are 13 MB. This may not seem like much, but with a full MiSeq run those long sequence names can add up and make life tedious. So we'll run count.seqs to generate a table where the rows are the names of the unique seqeunces and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.
This will generate a file called stability.trim.contigs.good.count_table. In subsequent commands we'll use it by using the count option:
count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
summary.seqs(count=stability.trim.contigs.good.count_table)
Using stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 252 252 0 3 3227
25%-tile: 1 252 252 0 4 32265
Median: 1 252 252 0 4 64530
75%-tile: 1 253 253 0 5 96794
97.5%-tile: 1 253 253 0 6 125832
Maximum: 1 270 270 0 12 129058
Mean: 1 252.462 252.462 0 4.36663
# of unique seqs: 16477
total # of seqs: 129058
Now we need to align our sequences to the reference alignment.
First we need to download the SILVA database.
# This step should be done outside mothur
wget http://www.mothur.org/w/images/9/98/Silva.bacteria.zip
unzip Silva.bacteria.zip
If you have quit mothur to download the database, rerun the mothur
command, then take a look at the database you have downloaded:
summary.seqs(fasta=silva.bacteria/silva.bacteria.fasta, processors=8)
Now do the alignment using align.seqs
:
align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.bacteria/silva.bacteria.fasta)
We can then run summary.seqs
again to get a summary of our alignment:
summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)
You'll see that the bulk of the sequences start at position 13862 and end at position 23444. Some sequences start at position 13144 or 13876 and end at 22587 or 25294. These deviants from the mode positions are likely due to an insertion or deletion at the terminal ends of the aliignments. Sometimes you'll see sequences that start and end at the same position indicating a very poor alignment, which is generally due to non-specific amplification. To make sure that everything overlaps the same region we'll re-run screen.seqs to get sequences that start at or before position 1968 and end at or after position 11550. We'll also set the maximum homopolymer length to 8 since there's nothing in the database with a stretch of 9 or more of the same base in a row (this really could have been done in the first execution of screen.seqs above). Note that we need the count table so that we can update the table for the sequences we're removing and we're also using the summary file so we don't have to figure out again all the start and stop positions:
screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=13862, end=23444, maxhomop=8)
summary.seqs(fasta=current, count=current)
No we can trim both ends of the aligned reads to be sure the all overlap exactly the same region. We can do this with fliter.seqs
filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
We may have introduced redundancy by trimming the ends of the sequences, so we will re-run unique.seqs
unique.seqs(fasta=stability.trim.contigs.good.unique.good.filter.fasta, count=stability.trim.contigs.good.good.count_table)
This identified 3 duplicate sequences that we've now merged with previous unique sequences. The next thing we want to do to further de-noise our sequences is to pre-cluster the sequences using the pre.cluster
command allowing for up to 2 differences between sequences. This command will split the sequences by group and then sort them by abundance and go from most abundant to least and identify sequences that are within 2 nt of each other. If they are then they get merged. We generally favor allowing 1 difference for every 100 bp of sequence:
pre.cluster(fasta=stability.trim.contigs.good.unique.good.filter.unique.fasta, count=stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
At this point we have removed as much sequencing error as we can and it is time to turn our attention to removing chimeras. We'll do this using the UCHIME algorithm that is called within mothur using the chimera.uchime
command. Again, this command will split the data by sample and check for chimeras. Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we've seen rare sequences get flagged as chimeric when they're the most abundant sequence in another sample. This is how we do it:
chimera.uchime(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
Running chimera.uchime
with the count file will remove the chimeric sequences from the count file. But you still need to remove those sequences from the fasta file. We do this using remove.seqs
:
remove.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)
As a final quality control step, we need to see if there are any "undesirables" in our dataset. Sometimes when we pick a primer set they will amplify other stuff that gets to this point in the pipeline such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondira. There's also just the random stuff that we want to get rid of. Let's go ahead and classify those sequences using the Bayesian classifier with the classify.seqs
command:
classify.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.bacteria/silva.bacteria.fasta, taxonomy=silva.bacteria/silva.bacteria.rdp.tax, cutoff=80)
Now that everything is classified we want to remove our undesirables. We do this with the remove.lineage
command:
remove.lineage(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
We will use cluster.split
for clustering sequences into OTUs
cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15)
We used taxlevel=4
, which corresponds to the level of Order
Next we want to know how many sequences are in each OTU from each group and we can do this using the make.shared command
. Here we tell mothur that we're really only interested in the 0.03 cutoff level:
make.shared(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.03)
We also want to know the taxonomy for each of our OTUs. We can get the consensus taxonomy for each OTU using the classify.otu
command
classify.otu(list=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.rdp.wang.pick.taxonomy, label=0.03)
If you open the file stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy
, you can get information about your OTUs.
OTU Size Taxonomy
Otu0001 12328 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0002 8918 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0003 7850 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0004 7478 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0005 7478 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0006 6650 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
Otu0007 6341 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Bacteroidaceae(100);Bacteroides(100);Bacteroides_unclassified(100);
Otu0008 5374 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Rikenellaceae(100);Alistipes(100);Alistipes_unclassified(100);
Otu0009 3618 Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Barnesiella(100);Barnesiella_unclassified(100);
This is telling you that Otu0001 was observed 12328 times in your sample and that 100% of the sequences were from Barnesiella
In order to vizualise the composition of our datasets, we'll use phyloseq, a R package to work with microbiom data.
Most of the phyloseq functionalities require aand a tree file. We need to generate it with mothur:
dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, output=lt, processors=8)
clearcut(phylip=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.phylip.dist)
It is perfectly acceptable to enter the commands for your analysis from within mothur. We call this the interactive mode. If you are doing a lot these types of analysis or you want to use this SOP on your own data without thinking too much, you can run mothur in batch mode
using
./mothur script.batch
where script.batch (or whatever name you want, really) is a text file containing all the commands that you previously entered in interactive mode.
If you have time, copy all the commands from this tutorial in a file, a try to make mothur work in batch mode!
First, install and load the phyloseq package:
source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')
library("phyloseq")
library("ggplot2")
library("plyr")
theme_set(theme_bw()) # set the ggplot theme
The PhyloSeq package has an import_mothur
function that you can use to import the files you generated with mothur. As an example, import the example mothur data provided by phyloseq as an example:
mothlist <- system.file("extdata", "esophagus.fn.list.gz", package="phyloseq")
mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package="phyloseq")
mothtree <- system.file("extdata", "esophagus.tree.gz", package="phyloseq")
show_mothur_cutoffs(mothlist)
cutoff <- '0.10'
x <- import_mothur(mothlist, mothgroup, mothtree, cutoff)
x
Note: If if you ever work with 16s data and decide to use QIIME instead of mothur, phyloseq also has an import_qiime
function. Also, newer version of qiime and mothur have the ability to produce a .biom
file.
“The biom file format (canonically pronounced ‘biome’) is designed to be a general-use format for representing counts of observations in one or more biological samples. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium candidate project.”
More info on http://biom-format.org/
For the rest of this tutorial, we will work with an example dataset provided by the phyloseq package. Load the data with the following command:
data(enterotype)
data("GlobalPatterns")
Let's do some preliminary filtering. Remove the OTUs that included all unassigned sequences ("-1")
enterotype <- subset_species(enterotype, Genus != "-1")
The available distance methods coded in the phyloseq package:
dist_methods <- unlist(distanceMethodList)
print(dist_methods)
## UniFrac1 UniFrac2 DPCoA JSD vegdist1
## "unifrac" "wunifrac" "dpcoa" "jsd" "manhattan"
## vegdist2 vegdist3 vegdist4 vegdist5 vegdist6
## "euclidean" "canberra" "bray" "kulczynski" "jaccard"
## vegdist7 vegdist8 vegdist9 vegdist10 vegdist11
## "gower" "altGower" "morisita" "horn" "mountford"
## vegdist12 vegdist13 vegdist14 vegdist15 betadiver1
## "raup" "binomial" "chao" "cao" "w"
## betadiver2 betadiver3 betadiver4 betadiver5 betadiver6
## "-1" "c" "wb" "r" "I"
## betadiver7 betadiver8 betadiver9 betadiver10 betadiver11
## "e" "t" "me" "j" "sor"
## betadiver12 betadiver13 betadiver14 betadiver15 betadiver16
## "m" "-2" "co" "cc" "g"
## betadiver17 betadiver18 betadiver19 betadiver20 betadiver21
## "-3" "l" "19" "hk" "rlb"
## betadiver22 betadiver23 betadiver24 dist1 dist2
## "sim" "gl" "z" "maximum" "binary"
## dist3 designdist
## "minkowski" "ANY"
Remove the two distance-methods that require a tree, and the generic custom method that requires user-defined distance arguments.
# These require tree
dist_methods[(1:3)]
# Remove them from the vector
dist_methods <- dist_methods[-(1:3)]
# This is the user-defined method:
dist_methods["designdist"]
# Remove the user-defined distance
dist_methods = dist_methods[-which(dist_methods=="ANY")]
Loop through each distance method, save each plot to a list, called plist.
plist <- vector("list", length(dist_methods))
names(plist) = dist_methods
for( i in dist_methods ){
# Calculate distance matrix
iDist <- distance(enterotype, method=i)
# Calculate ordination
iMDS <- ordinate(enterotype, "MDS", distance=iDist)
## Make plot
# Don't carry over previous plot (if error, p will be blank)
p <- NULL
# Create plot, store as temp variable, p
p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
# Add title to each plot
p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
# Save the graphic to file.
plist[[i]] = p
}
Combine results and shade according to Sequencing technology:
df = ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p = ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
p = p + geom_point(size=3, alpha=0.5)
p = p + facet_wrap(~distance, scales="free")
p = p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p
Print individual plots:
print(plist[["jsd"]])
print(plist[["jaccard"]])
print(plist[["bray"]])
print(plist[["euclidean"]])
Here is the default graphic produced by the plot_richness function on the GP example dataset:
GP <- prune_species(speciesSums(GlobalPatterns) > 0, GlobalPatterns)
plot_richness(GP)
Note that in this case, the Fisher calculation results in a warning (but still plots). We can avoid this by specifying a measures argument to plot_richness, which will include just the alpha-diversity measures that we want.
plot_richness(GP, measures=c("Chao1", "Shannon"))
We can specify a sample variable on which to group/organize samples along the horizontal (x) axis. An experimentally meaningful categorical variable is usually a good choice – in this case, the "SampleType" variable works much better than attempting to interpret the sample names directly (as in the previous plot):
plot_richness(GP, x="SampleType", measures=c("Chao1", "Shannon"))
Now suppose we wanted to use an external variable in the plot that isn’t in the GP dataset already – for example, a logical that indicated whether or not the samples are human-associated. First, define this new variable, human, as a factor (other vectors could also work; or other data you might have describing the samples).
sampleData(GP)$human <- getVariable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
Now tell plot_richness to map the new human variable on the horizontal axis, and shade the points in different color groups, according to which "SampleType" they belong.
plot_richness(GP, x="human", color="SampleType", measures=c("Chao1", "Shannon"))
We can merge samples that are from the environment (SampleType), and make the points bigger with a ggplot2 layer. First, merge the samples.
GPst = merge_samples(GP, "SampleType")
# repair variables that were damaged during merge (coerced to numeric)
sample_data(GPst)$SampleType <- factor(sample_names(GPst))
sample_data(GPst)$human <- as.logical(sample_data(GPst)$human)
p = plot_richness(GPst, x="human", color="SampleType", measures=c("Chao1", "Shannon"))
p + geom_point(size=5, alpha=0.7)
head(phy_tree(GlobalPatterns)$node.label, 10)
The node data from the GlobalPatterns
dataset are strange. They look like they might be bootstrap values, but they sometimes have two decimals.
phy_tree(GlobalPatterns)$node.label = substr(phy_tree(GlobalPatterns)$node.label, 1, 4)
Additionally, the dataset has many OTUs, too many to fit them all on a tree. Let's take the 50 more abundant and plot a basic tree:
physeq = prune_taxa(taxa_names(GlobalPatterns)[1:50], GlobalPatterns)
plot_tree(physeq)
dots are annotated next to tips (OTUs) in the tree, one for each sample in which that OTU was observed. Let's color the dots by taxonomic ranks, and sample covariates:
plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="SampleType")
by taxonomic class:
plot_tree(physeq, nodelabf=nodeplotboot(), ladderize="left", color="Class")
It can be useful to label the tips:
plot_tree(physeq, color="SampleType", label.tips="Genus")
Making a radial tree is easy with ggplot2, simply recognizing that our vertically-oriented tree is a cartesian mapping of the data to a graphic – and that a radial tree is the same mapping, but with polar coordinates instead.
plot_tree(physeq, nodelabf=nodeplotboot(60,60,3), color="SampleType", shape="Class", ladderize="left") + coord_polar(theta="y")
Bar plots are one of the easiest way to vizualize your data. But be careful, they can be misleading if grouping sample!
Let's take a subset of the GlobalPatterns dataset, and produce a basic bar plot:
gp.ch = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
plot_bar(gp.ch)
The dataset is plotted with every sample mapped individually to the horizontal (x) axis, and abundance values mapped to the veritcal (y) axis. At each sample’s horizontal position, the abundance values for each OTU are stacked in order from greatest to least, separate by a thin horizontal line. As long as the parameters you choose to separate the data result in more than one OTU abundance value at the respective position in the plot, the values will be stacked in order as a means of displaying both the sum total value while still representing the individual OTU abundances.
The bar plot will be clearer with color to represent the Genus to which each OTU belongs.
plot_bar(gp.ch, fill="Genus")
Now keep the same fill color, and group the samples together by the SampleType variable; essentially, the environment from which the sample was taken and sequenced.
plot_bar(gp.ch, x="SampleType", fill="Genus")
A more complex example using facets:
plot_bar(gp.ch, "Family", fill="Genus", facet_grid=~SampleType)
The following two lines subset the dataset to just the top 300 most abundant Bacteria taxa across all samples (in this case, with no prior preprocessing. Not recommended, but quick).
data("GlobalPatterns")
gpt <- subset_taxa(GlobalPatterns, Kingdom=="Bacteria")
gpt <- prune_taxa(names(sort(taxa_sums(gpt),TRUE)[1:300]), gpt)
plot_heatmap(gpt, sample.label="SampleType")
subset a smaller dataset based on an Archaeal phylum
gpac <- subset_taxa(GlobalPatterns, Phylum=="Crenarchaeota")
plot_heatmap(gpac)
There is a random aspect to some of the network layout methods. For complete reproducibility of the images produced later in this tutorial, it is possible to set the random number generator seed explicitly:
set.seed(711L)
Because we want to use the enterotype designations as a plot feature in these plots, we need to remove the 9 samples for which no enterotype designation was assigned (this will save us the hassle of some pesky warning messages, but everything still works; the offending samples are anyway omitted).
enterotype = subset_samples(enterotype, !is.na(Enterotype))
Create an igraph-based network based on the default distance method, “Jaccard”, and a maximum distance between connected nodes of 0.3.
ig <- make_network(enterotype, max.dist=0.3)
plot_network(ig, enterotype)
The previous graphic displayed some interesting structure, with one or two major subgraphs comprising a majority of samples. Furthermore, there seemed to be a correlation in the sample naming scheme and position within the network. Instead of trying to read all of the sample names to understand the pattern, let’s map some of the sample variables onto this graphic as color and shape:
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)
In the previous examples, the choice of maximum-distance and distance method were informed, but arbitrary. Let’s see what happens when the maximum distance is lowered, decreasing the number of edges in the network
ig <- make_network(enterotype, max.dist=0.2)
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)
Let’s repeat the previous exercise, but replace the Jaccard (default) distance method with Bray-Curtis
ig <- make_network(enterotype, dist.fun="bray", max.dist=0.3)
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.4, label=NULL)