-
Notifications
You must be signed in to change notification settings - Fork 3
/
case00B-Batch_effects.Rmd
219 lines (171 loc) · 10.2 KB
/
case00B-Batch_effects.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
---
title: "The UK Crop Microbiome Cryobank - How-to Guide: Correcting Batch Effects"
author: "Payton Yau"
date: "2024-04-14"
output: md_document
---
# How-to Guide: Correcting Batch Effects
The UK Crop Microbiome Cryobank integrates genomic (DNA) data with a cryobank collection of samples for the soil microbiomes of the UK major crop plant systems. For this project, the microbiomes are from the rhizosphere (the soil surrounding the crop plant roots) and from bulk soil (soil outside the rhizosphere). The Cryobank provides a facility for researchers to source data and samples, including cryo-preserved microbial material and genomic and metagenomic sequences from different soil microbiome environments.
### Convert Qiime2 objects to Phyloseq objects
`Qiime2` is a microbial community analysis tool used for sequencing analysis, while `Phyloseq` is an R package for analyzing high-throughput sequencing data. The `Qiime2R` package allows conversion of `Qiime2` data to `Phyloseq` within R. R enhances features through external packages from sources like CRAN, Bioconductor, and GitHub. After installation, packages must be loaded into the R session using the **library()** function.Once data is imported, Phyloseq enables data manipulation, analysis, and visualization. This conversion leverages R's analysis tools like ggplot2 for visualization, dplyr for data manipulation, and vegan for ecological community analysis.
```{r packages, warning=FALSE, message=FALSE}
# Download phyloseq from Bioconductor
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("phyloseq")
library("phyloseq")
# Download qiime2R from Github
# if (!requireNamespace("devtools", quietly = TRUE)){install.packages("devtools")}
# devtools::install_github("jbisanz/qiime2R")
library("qiime2R")
# install.packages("RColorBrewer")
library("RColorBrewer")
# install.packages("doParallel")
library("doParallel")
```
```{r qiime2R, warning=FALSE, message=FALSE}
# Convert qiime2 results to phyloseq format
physeq <- qza_to_phyloseq(
features = "C:/Users/Admin/OneDrive/Desktop/temp/428_228_220_table_silva138-with-phyla-no-mitochondria-no-chloroplast.qza", # table.qza
taxonomy = "C:/Users/Admin/OneDrive/Desktop/temp/428_228_220_taxonomy_silva138.qza",
metadata = "C:/Users/Admin/OneDrive/Desktop/temp/16s_meta-table.txt"
#, tree = "rooted-tree.qza"
)
physeq ## confirm the object
```
### Remove unwanted (failed and controls) samples before the normalisation
Removing unwanted samples before normalisation is a common step in microbiome data analysis pipelines. In many cases, some samples may fail during sequencing or quality control, while others may be controls or blanks that are not of interest. These samples can introduce noise and bias in downstream analyses if not removed. By removing the unwanted samples before normalization, the remaining samples can be normalised based on their true biological variation, allowing for more accurate comparisons between samples.
```{r Remove, warning=FALSE, message=FALSE}
# sample_names(physeq)
# rank_names(physeq) # "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
## unwanted samples removel (including failed samples)
physeq.ori <- subset_samples(physeq, Analysis == "Include")
## remove object
rm(physeq)
```
### Batch effects correction using Constrained Quantile Normalisation (ConQUR)
Constrained Quantile Normalisation (ConQUR, <https://www.nature.com/articles/s41467-022-33071-9>) is a normalisation technique used in high-throughput sequencing data, particularly in microbiome studies. It is a type of **quantile normalisation** approach that preserves the relative abundances of taxa between samples while simultaneously removing systematic technical variations that can arise due to differences in sequencing depth, PCR amplification bias, or other factors. ConQUR uses kernel density estimation to model the distribution of taxon abundances across all samples, and then constrains the normalisation process to maintain the relative position of each taxon within that distribution.
```{r ConQuR_install, warning=FALSE, message=FALSE}
# devtools::install_github("wdl2459/ConQuR")
library(ConQuR)
```
```{r ConQuR_run, warning=FALSE, message=FALSE}
# Convert ASV table to a data frame and transpose
B <- as.data.frame(physeq.ori@otu_table) # taxa
B <- t(B)
B <- as.data.frame(B)
# Extract batch ID from sample data
batchid = physeq.ori@sam_data$Plate # batchid
# Extract covariates
D = physeq.ori@sam_data[, c('Type', 'Soil', 'Location')] #covar
summary(D)
# Correct for batch effects using ConQuR package
options(warn=-1) # required to call
taxa_correct1 = ConQuR(tax_tab = B,
batchid = batchid,
covariates = D,
batch_ref="1"
) # warning messages may appear & it can be ignored
# Transpose the corrected matrix and convert it to a data frame
taxa_correct2 <- t(taxa_correct1)
taxa_correct2 <- as.data.frame(taxa_correct2)
# Create new ASV table, taxonomy table, and sample data
ASV = otu_table(taxa_correct2, taxa_are_rows = TRUE)
TAXA = tax_table(physeq.ori)
sampledata = sample_data(physeq.ori)
# repack the objects into a level 4 phyloseq structural data
physeq.norm = phyloseq(ASV, TAXA, sampledata)
# Save the "physeq.norm" object for the other case study analysis
# save(physeq.norm, file = "norm.RData")
# load("C:/Users/Admin/OneDrive/Desktop/temp/norm.RData")
# remove
rm(B, D, batchid, taxa_correct1, taxa_correct2, ASV, TAXA, sampledata, to_skip)
```
### Bata diversity - before and after the normalisation
Beta diversity quantifies variation in microbial composition among samples, aiding in identifying patterns in microbial distribution. Non-Metric Multidimensional Scaling (NMDS) and Principal Coordinates Analysis (PCoA) are ordination techniques used for beta diversity analysis.
**NMDS** preserves the rank order of pairwise dissimilarities between samples in a lower-dimensional space, making it suitable for cases where distances between samples are not well-preserved. The distances on the NMDS plot reflect the similarities or dissimilarities between samples but are not directly interpretable.
**PCoA**, a metric multidimensional scaling technique, attempts to preserve the actual distances between samples in a lower-dimensional space. The distances on the PCoA plot reflect the actual dissimilarities between samples. Unlike NMDS, PCoA may not perform as well with non-linear or rank-based dissimilarity measures.
Here, we employ NMDS to analyze Beta diversity, allowing us to draw comparisons between the states before and after normalisation.
```{r ggplot2, warning=FALSE, message=FALSE}
# install.packages("ggplot2")
library("ggplot2")
# install.packages("dplyr")
library("dplyr")
# install.packages("ggpubr")
library("ggpubr")
```
### Bata diversity - before the normalisation
```{r Beta_1, warning=FALSE, message=FALSE}
# method options: NMDS / PCoA
NMDS1 <- ordinate(physeq = physeq.ori,
method = "NMDS",
distance = "bray"
)
# Plot ordination
plot_ordination(physeq = physeq.ori,
ordination = NMDS1,
color = "Plate",
shape = "Type"
) +
theme_classic() +
geom_point(aes(color = Plate), alpha = 1, size = 3.5) +
stat_ellipse(aes(color = Plate, group = Plate), geom = "path", alpha = 1.5) + # Add ellipses
theme(
text = element_text(size = 18, colour = "black"),
axis.ticks = element_line(colour = "black", size = 1.1),
axis.line = element_line(colour = 'black', size = 1.1),
axis.text.x = element_text(colour = "black", angle = 0, hjust = 0.5,
size = 13, face = "bold"),
axis.text.y = element_text(colour = "black", angle = 0, hjust = 0.5,
size = 13, face = "bold"),
axis.title.y = element_text(color = "black", size = 20, face = "bold"),
axis.title.x = element_text(color = "black", size = 20, face = "bold")) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
scale_shape_manual(values = c(15, 17, 3, 4, 16, 21, 22, 23)) # Set custom shapes
```
### Bata diversity - after the normalisation
```{r Beta_2, warning=FALSE, message=FALSE}
##### based on Plate and Type - after the normalisation
# method options: NMDS / PCoA
NMDS2 <- ordinate(physeq = physeq.norm,
method = "NMDS",
distance = "bray"
)
# Plot ordination
plot_ordination(physeq = physeq.norm,
ordination = NMDS2,
color = "Plate",
shape = "Type"
) +
theme_classic() +
geom_point(aes(color = Plate), alpha = 1, size = 3.5) +
stat_ellipse(aes(color = Plate, group = Plate), geom = "path", alpha = 1.5) + # Add ellipses
theme(
text = element_text(size = 18, colour = "black"),
axis.ticks = element_line(colour = "black", size = 1.1),
axis.line = element_line(colour = 'black', size = 1.1),
axis.text.x = element_text(colour = "black", angle = 0, hjust = 0.5,
size = 13, face = "bold"),
axis.text.y = element_text(colour = "black", angle = 0, hjust = 0.5,
size = 13, face = "bold"),
axis.title.y = element_text(color = "black", size = 20, face = "bold"),
axis.title.x = element_text(color = "black", size = 20, face = "bold")) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") +
scale_shape_manual(values = c(15, 17, 3, 4, 16, 21, 22, 23)) # Set custom shapes
```
#### Convert `physeq.norm` object to ASV matrix and Save the "physeq.norm" object for the other case study analysis
```{r export ASV, echo=FALSE}
###### Extract abundance matrix from the phyloseq object #####
# OTU1 = as(otu_table(physeq.norm), "matrix")
# Transpose the matrix
# OTU1 <- t(OTU1)
# Coerce to data.frame
# OTUdf = as.data.frame(physeq.norm)
# save(physeq.norm, file = "norm.RData")
```
### Session Info
```{r sessionInfo, warning=FALSE, message=FALSE}
sessionInfo()
```