publications-scripts/How-to-paper/case_00-preparation/case00B-Batch_effects.Rmd

---
title: "The UK Crop Microbiome Cryobank - How-to Guide: Correcting Batch Effects"
author: "Payton Yau"
date: "2024-04-14"
output: md_document
---

# How-to Guide: Correcting Batch Effects

The UK Crop Microbiome Cryobank integrates genomic (DNA) data with a cryobank collection of samples for the soil microbiomes of the UK major crop plant systems. For this project, the microbiomes are from the rhizosphere (the soil surrounding the crop plant roots) and from bulk soil (soil outside the rhizosphere). The Cryobank provides a facility for researchers to source data and samples, including cryo-preserved microbial material and genomic and metagenomic sequences from different soil microbiome environments.



### Convert Qiime2 objects to Phyloseq objects

`Qiime2` is a microbial community analysis tool used for sequencing analysis, while `Phyloseq` is an R package for analyzing high-throughput sequencing data. The `Qiime2R` package allows conversion of `Qiime2` data to `Phyloseq` within R. R enhances features through external packages from sources like CRAN, Bioconductor, and GitHub. After installation, packages must be loaded into the R session using the **library()** function.Once data is imported, Phyloseq enables data manipulation, analysis, and visualization. This conversion leverages R's analysis tools like ggplot2 for visualization, dplyr for data manipulation, and vegan for ecological community analysis.

```{r packages, warning=FALSE, message=FALSE}
# Download phyloseq from Bioconductor
# if (!require("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# BiocManager::install("phyloseq")
library("phyloseq")

# Download qiime2R from Github
# if (!requireNamespace("devtools", quietly = TRUE)){install.packages("devtools")}
# devtools::install_github("jbisanz/qiime2R")
library("qiime2R")

# install.packages("RColorBrewer")
library("RColorBrewer")

# install.packages("doParallel")
library("doParallel")
```

```{r qiime2R, warning=FALSE, message=FALSE}
# Convert qiime2 results to phyloseq format
physeq <- qza_to_phyloseq(
  features = "C:/Users/Admin/OneDrive/Desktop/temp/428_228_220_table_silva138-with-phyla-no-mitochondria-no-chloroplast.qza", # table.qza
  taxonomy = "C:/Users/Admin/OneDrive/Desktop/temp/428_228_220_taxonomy_silva138.qza",
  metadata = "C:/Users/Admin/OneDrive/Desktop/temp/16s_meta-table.txt"
  #, tree = "rooted-tree.qza"
)

physeq ## confirm the object
```

### Remove unwanted (failed and controls) samples before the normalisation

Removing unwanted samples before normalisation is a common step in microbiome data analysis pipelines. In many cases, some samples may fail during sequencing or quality control, while others may be controls or blanks that are not of interest. These samples can introduce noise and bias in downstream analyses if not removed. By removing the unwanted samples before normalization, the remaining samples can be normalised based on their true biological variation, allowing for more accurate comparisons between samples.

```{r Remove, warning=FALSE, message=FALSE}
# sample_names(physeq) 
# rank_names(physeq) # "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"

## unwanted samples removel (including failed samples)
physeq.ori <- subset_samples(physeq, Analysis == "Include")

## remove object
rm(physeq)
```

### Batch effects correction using Constrained Quantile Normalisation (ConQUR)

Constrained Quantile Normalisation (ConQUR, <https://www.nature.com/articles/s41467-022-33071-9>) is a normalisation technique used in high-throughput sequencing data, particularly in microbiome studies. It is a type of **quantile normalisation** approach that preserves the relative abundances of taxa between samples while simultaneously removing systematic technical variations that can arise due to differences in sequencing depth, PCR amplification bias, or other factors. ConQUR uses kernel density estimation to model the distribution of taxon abundances across all samples, and then constrains the normalisation process to maintain the relative position of each taxon within that distribution.

```{r ConQuR_install, warning=FALSE, message=FALSE}
# devtools::install_github("wdl2459/ConQuR")
library(ConQuR)
```

```{r ConQuR_run, warning=FALSE, message=FALSE}
# Convert ASV table to a data frame and transpose
B <- as.data.frame(physeq.ori@otu_table) # taxa
B <- t(B)
B <- as.data.frame(B) 

# Extract batch ID from sample data
batchid = physeq.ori@sam_data$Plate # batchid

# Extract covariates
D = physeq.ori@sam_data[, c('Type', 'Soil', 'Location')] #covar
summary(D)

# Correct for batch effects using ConQuR package
options(warn=-1) # required to call
taxa_correct1 = ConQuR(tax_tab = B, 
                       batchid = batchid, 
                       covariates = D, 
                       batch_ref="1"
                       ) #  warning messages may appear & it can be ignored

# Transpose the corrected matrix and convert it to a data frame
taxa_correct2 <- t(taxa_correct1)
taxa_correct2 <- as.data.frame(taxa_correct2)

# Create new ASV table, taxonomy table, and sample data
ASV = otu_table(taxa_correct2, taxa_are_rows = TRUE)
TAXA = tax_table(physeq.ori)
sampledata = sample_data(physeq.ori)

# repack the objects into a level 4 phyloseq structural data
physeq.norm = phyloseq(ASV, TAXA, sampledata)

# Save the "physeq.norm" object for the other case study analysis
# save(physeq.norm, file = "norm.RData")
# load("C:/Users/Admin/OneDrive/Desktop/temp/norm.RData")

# remove
rm(B, D, batchid, taxa_correct1, taxa_correct2, ASV, TAXA, sampledata, to_skip)
```

### Bata diversity - before and after the normalisation

Beta diversity quantifies variation in microbial composition among samples, aiding in identifying patterns in microbial distribution. Non-Metric Multidimensional Scaling (NMDS) and Principal Coordinates Analysis (PCoA) are ordination techniques used for beta diversity analysis.

**NMDS** preserves the rank order of pairwise dissimilarities between samples in a lower-dimensional space, making it suitable for cases where distances between samples are not well-preserved. The distances on the NMDS plot reflect the similarities or dissimilarities between samples but are not directly interpretable.

**PCoA**, a metric multidimensional scaling technique, attempts to preserve the actual distances between samples in a lower-dimensional space. The distances on the PCoA plot reflect the actual dissimilarities between samples. Unlike NMDS, PCoA may not perform as well with non-linear or rank-based dissimilarity measures.

Here, we employ NMDS to analyze Beta diversity, allowing us to draw comparisons between the states before and after normalisation.

```{r ggplot2, warning=FALSE, message=FALSE}
# install.packages("ggplot2")
library("ggplot2")
# install.packages("dplyr")
library("dplyr")
# install.packages("ggpubr")
library("ggpubr")
```

### Bata diversity - before the normalisation

```{r Beta_1, warning=FALSE, message=FALSE}
# method options: NMDS / PCoA
NMDS1 <- ordinate(physeq = physeq.ori, 
                  method = "NMDS", 
                  distance = "bray"
                  )

# Plot ordination
plot_ordination(physeq = physeq.ori,
                      ordination = NMDS1,
                      color = "Plate",
                      shape = "Type"
                      ) + 
  theme_classic() + 
  geom_point(aes(color = Plate), alpha = 1, size = 3.5) +
  stat_ellipse(aes(color = Plate, group = Plate), geom = "path", alpha = 1.5) + # Add ellipses
  theme(
    text = element_text(size = 18, colour = "black"), 
    axis.ticks = element_line(colour = "black", size = 1.1),
    axis.line = element_line(colour = 'black', size = 1.1),
    axis.text.x = element_text(colour = "black", angle = 0, hjust = 0.5, 
                               size = 13, face = "bold"),
    axis.text.y = element_text(colour = "black", angle = 0, hjust = 0.5, 
                               size = 13, face = "bold"),
    axis.title.y = element_text(color = "black", size = 20, face = "bold"), 
    axis.title.x = element_text(color = "black", size = 20, face = "bold")) + 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") + 
  scale_shape_manual(values = c(15, 17, 3, 4, 16, 21, 22, 23)) # Set custom shapes

```

### Bata diversity - after the normalisation

```{r Beta_2, warning=FALSE, message=FALSE}
##### based on Plate and Type - after the normalisation
# method options: NMDS / PCoA
NMDS2 <- ordinate(physeq = physeq.norm, 
                  method = "NMDS", 
                  distance = "bray"
                  )

# Plot ordination
plot_ordination(physeq = physeq.norm,
                ordination = NMDS2,
                color = "Plate",
                shape = "Type"
                ) +
  theme_classic() + 
  geom_point(aes(color = Plate), alpha = 1, size = 3.5) +
  stat_ellipse(aes(color = Plate, group = Plate), geom = "path", alpha = 1.5) + # Add ellipses
  theme(
    text = element_text(size = 18, colour = "black"), 
    axis.ticks = element_line(colour = "black", size = 1.1),
    axis.line = element_line(colour = 'black', size = 1.1),
    axis.text.x = element_text(colour = "black", angle = 0, hjust = 0.5, 
                               size = 13, face = "bold"),
    axis.text.y = element_text(colour = "black", angle = 0, hjust = 0.5, 
                               size = 13, face = "bold"),
    axis.title.y = element_text(color = "black", size = 20, face = "bold"), 
    axis.title.x = element_text(color = "black", size = 20, face = "bold")) + 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Dark2") + 
  scale_shape_manual(values = c(15, 17, 3, 4, 16, 21, 22, 23)) # Set custom shapes


```

#### Convert `physeq.norm` object to ASV matrix and Save the "physeq.norm" object for the other case study analysis

```{r export ASV, echo=FALSE}
###### Extract abundance matrix from the phyloseq object #####
# OTU1 = as(otu_table(physeq.norm), "matrix")
# Transpose the matrix
# OTU1 <- t(OTU1)
# Coerce to data.frame
# OTUdf = as.data.frame(physeq.norm)
# save(physeq.norm, file = "norm.RData")
```

### Session Info

```{r sessionInfo, warning=FALSE, message=FALSE}
sessionInfo()
```