Skip to content

Latest commit

 

History

History
200 lines (136 loc) · 7.58 KB

demo02.md

File metadata and controls

200 lines (136 loc) · 7.58 KB

Analyzing Read Counts from PathoScope

By now you already obtained read counts from your metagenomic experiment. The thing is what do we do with them. Most of the time, obtaining read counts is only the beggining to exploring the metagenomic content of your samples. The first step in analyzing read count data is to put it in a contingency table format.

Let's combine multiple PathoScope tsv files into a contingency table

We need to create a table where rows are genomes and columns are samples. In the previous demo we analyzed a downsampled version of sample ES_211. The original dataset consisted of 16 cases (patients with schisophrenia) and 16 healthy controls.

I've already analyzed these data and I'm providing all tsv files (rather the top 50 hits from each file). If you want to give it a try, you can always download the original data which was deposited in NCBI under BioProject [PRJNA255439] (http://www.ncbi.nlm.nih.gov/sra/?term=PRJNA255439).
The following code is from an R script that I use but that it's not as tidy and neat as I'd like to.

Let's load the required libraries and set our working directory

library(xlsx)
library(gtools)

directory<-"~/PS_demo/tsv"
setwd(directory)

If you don't have those libraries installed, simple issue the following commands

install.packages(c("xlsx","gtools"))

We need to create a list of files, a list of "patient names", and finally read in relevant columns in the tsv's so that we can combine everything into a contingency table

# make a list of the files to loop through

list_of_files <- list.files(path=directory, pattern = ".tsv")

# make a list of just the patient names

patient_names<-NULL
for( i in 1:length(list_of_files)){
  patient_names[i]<-substring(list_of_files[i], 1, 6)
}

# read in each table

read_counts <- lapply(list_of_files, read.table, sep="\t", header = FALSE, skip =2)
read_counts <- lapply(read_counts, function(x) x[, c(1,4)])
read_counts <- lapply(read_counts, function(x) x[complete.cases(x),])

Now we have a list of tables (or data frames) that we need to combine, in a non-redundant fashion, to create our contingency table

# for each table make the first col name OTU and the second the patient name

for( i in 1:length(list_of_files)){
  colnames(read_counts[[i]])<- c("OTU", patient_names[i])
}

# list of lists called otu which stores the first column otu names for each dataframe

otu<-NULL
for( i in 1:length(list_of_files)){
  otu[i]<- list(as.character(read_counts[[i]][, 1]))
}

# for each dataframe in read_counts transpose and then 

read_counts <- lapply(read_counts, function(x) t(x[,2]))

# add the otus back as the column name

for( i in 1:length(list_of_files)){
  read_counts[[i]]<-data.frame(read_counts[[i]])
  colnames(read_counts[[i]])<-otu[[i]]
  read_counts[[i]]<-data.frame(patient = patient_names[i], read_counts[[i]])
}

# combine the different dataframes together

otu_table <- read_counts[[1]]
for( i in 2:length(list_of_files)){
  otu_table <- smartbind(otu_table, read_counts[[i]], fill = 0)
}

# transpose the table back so that the microbes are the rows and the patients are the col

otu_table<-t(data.matrix(otu_table))
colnames(otu_table)<-patient_names
otu_table<-otu_table[2:nrow(otu_table), ]

# remove zeroes
otu_table_noZeroes<-otu_table[apply(otu_table[,-1], 1, function(x) !all(x==0)),]

write.csv(otu_table_noZeroes,"otu_table.csv")

Now let's jump ahead and create some fun plots and charts to explore our data

Since we are in a limited time setting, we are going to load three data frames, i.e., read counts from above, a taxonomy table, and metadata associated with the patients).

Let's load some libraries and create a phyloseq object:

source("http://bioconductor.org/biocLite.R")
biocLite("phyloseq")
library(phyloseq)

# we need to create a phyloseq object manually. See tutorial 
# http://joey711.github.io/phyloseq/import-data#manual

# read in data
setwd("~/PS_demo/data")
read.csv("otu_table.csv", header=TRUE, row.names=1) ->otu_table
read.csv("taxmat.csv", header=TRUE, row.names=1) ->taxmat
read.table('metadata.txt', header=TRUE, sep='\t', row.names=1) -> metadata

as.matrix(otu_table)->otumat
as.matrix(taxmat)->taxmat

rownames(otumat) <- paste0("OTU", 1:nrow(otumat))
rownames(taxmat) <- rownames(otumat)
OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)

# create a phyloseq object with the three dataframes

physeq = phyloseq(OTU, TAX)
sampledata = sample_data(metadata)
physeq = merge_phyloseq(physeq, sampledata)

Now that we have a phyloseq object, we can easily create plots using the functions provided in the PhyloSeq Bioconductor package (here and here).

Let's start with some alpha diversity measures

library("ggplot2")
plot_bar(physeq, fill = "Superkingdom")
plot_heatmap(physeq, taxa.label = "Superkingdom")
plot_richness(physeq, x = "state", color = "state",measures=c("Observed","Chao1", "Fisher","ACE", "Shannon", "Simpson", "InvSimpson")) + geom_boxplot()

alpha

We can also test for whether microbial composition in cases and controls is significantly different.

# Load required libraries

library("DESeq2")

# relevel data so that results are expressed in comparison to "control" samples

sample_data(physeq)$state <- relevel(sample_data(physeq)$state, "control")

# convert phyloseq object to deseq object, accounting for cigarrette smoking...

diagdds = phyloseq_to_deseq2(physeq, ~ cigsmoker + state)

# ...and perform a test to find out if cases and controls are different

diagdds = DESeq(diagdds,test="Wald", fitType = "local")

# now subset results to show only significant results

res = results(diagdds, cooksCutoff = FALSE)
res = res[order(res$padj, na.last = NA), ]
alpha = 0.001
sigtab = res[(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(physeq)[rownames(sigtab), ], "matrix"))

# and here we look only at taxa that has a positive fold change in comparison to controls

posigtab = sigtab[sigtab[, "log2FoldChange"] > 0, ]
posigtab = posigtab[, c("baseMean", "log2FoldChange", "lfcSE", "padj", "Phylum", "Class", "Family", "Genus","Species")]

# we can always create a plot to visualize the results

theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
  scale_fill_brewer(palette = palname, ...)
}
sigtabgen = subset(sigtab, !is.na(Genus))
# Phylum order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels = names(x))
# Genus order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Genus = factor(as.character(sigtabgen$Genus), levels = names(x))
ggplot(sigtabgen, aes(x = Genus, y = log2FoldChange, color = Phylum)) + geom_point(size = 6) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5))

fold

PhyloSeq and DESeq packages allow for all sorts of statistical comparisons and visualization options. I encourage people interested in metagenomic analysis to check out Bioconductor pages for these packages for further information.

That concludes our demo/tutorial. If you have further questions don't hesitate to contact me at castronallar@gmail.com or in GitHub or [Google Groups] (https://groups.google.com/forum/#!forum/pathoscope). Thanks!