---
title: "Normalization and Statistical Analysis"
author: "Christian Ayala"
output:
  html_document:
    df_print: paged
  html_notebook: default
  pdf_document: default
editor_options:
  chunk_output_type: console
---
This Notebook is to perform normalization of the area under the curve (AUC) of the peaks detected by *Compound Discoverer*.
# 1. Importing Libraries
```{r libraries, message=FALSE, warning=FALSE}
library(tidyverse)
library(readxl)
library(ggpubr)
library(ggsci)
library(gridExtra)
library(vegan)
library(factoextra)
source('functions_cdis_norm_stats.R')
```
# 2. Import data
Set if the data to be used is going to be labeled or unlabeled
```{r}
# Flag for labeled / unlabeled data, set TRUE or FALSE.
# Use `<-` for assignment (tidyverse style), not `=`.
label <- TRUE
```
The input data is the **compounds-table** that was generated by the previous scripts. This table is used to avoid problems with some tests, such as PCA, which does not allow many zeroes or missing values.
```{r set_path, message=FALSE}
# Set path variables relative to the current working directory
project_dir <- getwd()
project_name <- 'Bog_1e5_label'
figures_dir <- file.path(project_dir, paste0(project_name, '_output_figures'))
tables_dir <- file.path(project_dir, paste0(project_name, '_output_tables'))

# Labeled data is not gap filled, so it uses compounds_table.csv;
# unlabeled samples use the gap-filled table instead.
# `label` is logical, so test it directly instead of `label == TRUE`.
if (label) {
  compounds_table_file <- file.path(tables_dir, 'compounds_table.csv')
} else {
  compounds_table_file <- file.path(tables_dir, 'gap_filled_compounds_table.csv')
}

# Load the per-feature compounds table produced by the previous notebook
compounds_table <- read_csv(compounds_table_file)

# Import sample metadata (SampleID plus experimental factors)
metadata_file <- file.path(tables_dir, 'fixed_metadata.csv')
metadata <- read_csv(metadata_file)
```
# 3. Data Manipulation and Transformation
```{r Data_manipulation}
# Build a feature x sample matrix of AUC values.
# pivot_wider() supersedes the retired spread(): one column per SampleID.
auc_table <- compounds_table %>%
  select(FeatureID, SampleID, AUC) %>%
  pivot_wider(names_from = SampleID, values_from = AUC)

# Order features numerically (e.g. Feature2 before Feature10)
auc_table$FeatureID <- factor(auc_table$FeatureID,
                              levels = str_sort(auc_table$FeatureID, numeric = TRUE))
auc_table <- auc_table %>%
  arrange(FeatureID)

# Save untransformed data with FeatureID as row names
auc_table <- column_to_rownames(auc_table, var = 'FeatureID')
table_file <- file.path(tables_dir, 'raw_auc_table.csv')
write.csv(auc_table, table_file, row.names = TRUE)
```
# 4. Data Normalization by multiple methods
Data is normalized by multiple methods to decide which method is best suited for this dataset.
```{r Data_normalization, warning=FALSE}
# Boxplots comparing every normalization method implemented in
# functions_cdis_norm_stats.R, used to pick one for this dataset
normalization_plot <- normalize_by_all(auc_table)
boxplot_file <- file.path(figures_dir, 'all_normalized.boxplot.png')
ggsave(filename = boxplot_file, plot = normalization_plot, dpi = 300)
```
Based on the plot select the best normalization method for the sample.
In this case the best normalization method was **Median normalization**
```{r Best normalization}
# Median normalization without transformation: these values feed the
# differential analysis, which expects untransformed intensities
norm.matrix <- median.norm(auc_table, transform_data = FALSE)

# Replace missing values with zeroes (replace() is equivalent to
# norm.matrix[is.na(norm.matrix)] <- 0)
norm.matrix <- replace(norm.matrix, is.na(norm.matrix), 0)

# Write the normalized, untransformed matrix to disk
untransformed_file <- file.path(tables_dir, 'normalized_untransformed_auc_table.csv')
write.csv(norm.matrix, untransformed_file, row.names = TRUE)
```
For the rest of the analysis in this Notebook, the transformed values will be used
```{r Non-transformed.norm}
# Median normalization with the default transformation (presumably log --
# see functions_cdis_norm_stats.R): these values are used for the
# multivariate statistical analyses below
norm.matrix <- median.norm(auc_table)

# Replace missing values with zeroes
norm.matrix <- replace(norm.matrix, is.na(norm.matrix), 0)

# Write the normalized, transformed matrix to disk
transformed_file <- file.path(tables_dir, 'normalized_transformed_auc_table.csv')
write.csv(norm.matrix, transformed_file, row.names = TRUE)
```
# 5. Statistical Analysis
## 5.1 NMDS
Choose if analysis will be done based on relative abundance or presence absence
```{r typeofanalysis}
# Portions adapted from the statistical analyses of the MetaboTandem and
# MetaboDirect pipelines.
# 'ra' = relative abundance, 'pa' = presence/absence
type <- 'ra'

if (type == 'ra') {
  # Bray-Curtis dissimilarity suits abundance data
  nmds.matrix <- t(norm.matrix)
  dm.method <- 'bray'
  dm <- vegdist(nmds.matrix, method = dm.method)
  print('Relative abundance method selected')
} else if (type == 'pa') {
  # decostand(..., 'pa') converts abundances to presence/absence (0/1)
  nmds.matrix <- decostand(t(norm.matrix), 'pa')
  dm.method <- 'euclidean'
  dm <- vegdist(nmds.matrix, method = dm.method)
  print('Presence/absence method selected')
} else {
  # Fail fast: downstream chunks need `dm`, so an invalid choice must stop
  # the notebook instead of merely printing a hint and leaving `dm` undefined
  stop('Select analysis method: "pa" for presence absence or "ra" for relative abundance')
}
```
Perform the actual nmds analysis
**A good rule of thumb for interpretation**:
- < 0.05 provides an excellent representation in reduced dimensions,
- < 0.1 is great,
- < 0.2 is good/ok,
- < 0.3 provides a poor representation.
```{r nmds}
# Fix the random seed so the NMDS configuration is reproducible
set.seed(123)
nmds <- metaMDS(dm, k = 2, maxit = 999, trymax = 500, wascores = TRUE)

# Shepard diagram: goodness of fit of the ordination
stressplot(nmds)

# Extract sample scores and attach the metadata for plotting
nmds.scores <- scores(nmds) %>%
  as.data.frame() %>%
  rownames_to_column(var = 'SampleID') %>%
  left_join(metadata, by = 'SampleID')

nmds_plot <- plot_nmds(nmds.scores, Time, Material) +
  labs(title = 'NMDS plot by relative abundance')
nmds_plot

figure_file <- file.path(figures_dir, 'nmds_relative_abundance.png')
ggsave(figure_file, nmds_plot, dpi = 300)
```
## 5.2 PCA
```{r}
# Principal component analysis on samples (features as variables)
pca <- prcomp(t(norm.matrix))

# Eigenvalues / explained variance per component (factoextra).
# NOTE(review): this name masks base::eigen within the notebook session.
eigen <- get_eigenvalue(pca)

# Scree plot of the percentage of variance explained per component
scree_plot <- fviz_eig(pca, addlabels = TRUE) +
  theme_bw() +
  theme(plot.title = element_text(face = 'bold', hjust = 0.5))
scree_plot
figure_file <- file.path(figures_dir, 'screeplot.png')
ggsave(figure_file, scree_plot, dpi = 300)

# Cumulative variance explained across components
cumvar_plot <- plot_cumvar(eigen)
cumvar_plot
figure_file <- file.path(figures_dir, 'cumulative_variance.png')
ggsave(figure_file, cumvar_plot, dpi = 300)
```
```{r}
# Sample coordinates in PC space.
# as.tibble() is deprecated since tibble 2.0; use as_tibble()
pca_coordinates <- as_tibble(pca$x)
pca_coordinates$SampleID <- rownames(pca$x)

# Merge with metadata
pca_coordinates <- left_join(pca_coordinates, metadata, by = 'SampleID')

# Axis labels carrying the percentage of variance explained by PC1 / PC2
pc1 <- paste0('PC1 (', round(eigen$variance.percent[1], digits = 1), '%)')
pc2 <- paste0('PC2 (', round(eigen$variance.percent[2], digits = 1), '%)')

# Individuals (samples) PCA plot
pca_plot <- plot_dotplot(pca_coordinates, PC1, PC2, Time, Material) +
  labs(title = 'PCA plot',
       x = pc1,
       y = pc2)
pca_plot
figure_file <- file.path(figures_dir, 'PCA-plot.png')
ggsave(figure_file, pca_plot, dpi = 300)
```
*Labeled data* obtained from Compound Discoverer is not gap-filled and contains multiple *zeroes*. For that reason the **NMDS plot** is more informative
## 5.3 PERMANOVA
Permutational Multivariate Analysis of Variance Using Distance Matrices
```{r Permanova}
# Use SampleID as row names so metadata rows match the distance matrix.
# NOTE(review): metadata comes from read_csv() and is a tibble; setting
# row names on a tibble is deprecated -- consider as.data.frame() first.
rownames(metadata) <- metadata$SampleID
set.seed(456)
# adonis() was deprecated and removed from vegan; adonis2() is the
# maintained replacement with the same formula interface
permanova <- adonis2(dm ~ Time,
                     data = metadata,
                     permutations = 999,
                     method = "bray")
permanova
```
The permanova shows that AUC at **different time points** contribute to 61% of the differences among the samples
## 5.4 Feature contribution
Explore to determine which features are driving the difference in time
```{r Feature contribution, message=FALSE, warning=FALSE}
# Comparisons are done in pairs: here T0 vs T3.
# %in% replaces the equivalent `Time == 'T0' | Time == 'T3'` chain.
sub_metadata <- metadata %>%
  filter(Time %in% c('T0', 'T3'))
sub_norm.matrix <- select(norm.matrix, rownames(sub_metadata))
set.seed(456)
# adonis2() replaces the deprecated adonis().
# NOTE(review): adonis2() results do not expose coefficients(); the
# commented-out feature-extraction code further below relied on the old
# adonis() output and would need rewriting to be re-enabled.
perm <- adonis2(t(sub_norm.matrix) ~ Time,
                data = sub_metadata,
                permutations = 999,
                method = "bray")
perm
```
There are no features that significantly drive the difference between **T0** and any of the other time points.
```{r echo=FALSE}
# NOTE(review): this feature-extraction code is intentionally disabled.
# It pulled per-feature coefficients from the adonis() result to rank which
# features drive the T0 vs T3 difference; kept for reference since no
# features were found to contribute significantly.
#
# # Feature extraction and inspection
# perm.features <- coefficients(perm)['Time1',]
# perm.features <- perm.features[rev(order(abs(perm.features )))]
# perm.features <- as.data.frame(perm.features)
# colnames(perm.features) <- paste0('f.contrib')
# perm.features <- signif(perm.features, 2)
# perm.features$Features <- rownames(perm.features) #NEGATIVE = Drives T3, POSITIVE = Drives T0
```
```{r echo=FALSE}
# NOTE(review): disabled continuation of the feature-contribution analysis
# above. It classified features by contribution threshold, plotted them, and
# wrote the top contributors per time point to CSV. It depends on the
# `perm.features` object from the previous (also disabled) chunk.
# t2 = quantile(abs(perm.features$f.contrib), 0.60)
# t2
#
# # Classify features
# perm.features.g <- mutate(perm.features,
# Group = if_else(abs(f.contrib)<t2, 'low contribution',
# if_else(f.contrib>t2, 'T0', 'T3')))
#
# file_table <- (file.path(tables_dir, 'features_contribution_T0vsT3.csv'))
# write_csv(perm.features.g, file_table)
# plot.contrib <-
# perm.features.g %>%
# filter(abs(f.contrib)>t2) %>%
# ggplot(aes(x = reorder(Features, f.contrib), y = f.contrib, fill = f.contrib > 0)) +
# geom_bar(stat="identity", width = 1, color='black', size=0.1, alpha=0.6) +
# scale_fill_jama(labels = c('T3', 'T0')) +
# theme_bw() +
# labs(title = 'Feature contribution to the difference observed at different times',
# x = 'MS Features',
# y = 'Contribution',
# fill = 'Drives:') +
# theme(plot.title = element_text(face = 'bold',
# hjust = 0.5),
# axis.text.x = element_blank(),
# axis.ticks.x = element_blank())
#
#
# plot.contrib
#
# figure_file <- file.path(figures_dir, 'Contribution in time.png')
# ggsave(figure_file, plot.contrib, dpi = 300)
# ```
#
#
# ```{r}
# neg.features <- perm.features.g %>%
# slice_min(f.contrib, n = 10)
#
# neg.features <- left_join(neg.features, compounds_table, by = c('Features' = 'FeatureID')) %>%
# select(f.contrib, Features, Name, Formula) %>%
# distinct()
#
# table_file <- file.path(tables_dir, 'Features_contributing_T3.csv')
# write_csv(neg.features, table_file)
#
# pos.features <- perm.features.g %>%
# slice_max(f.contrib, n = 10)
#
# pos.features <- left_join(pos.features, compounds_table, by = c('Features' = 'FeatureID')) %>%
# select(f.contrib, Features, Name, Formula) %>%
# distinct()
#
# table_file <- file.path(tables_dir, 'Features_contributing_T0.csv')
# write_csv(pos.features, table_file)
```
Explore to determine which features are driving the difference based on the label status.
```{r}
# Comparison by label status.
# All samples are either labeled or unlabeled, so no filtering of the
# metadata is necessary.
sub_metadata <- metadata
sub_norm.matrix <- select(norm.matrix, rownames(sub_metadata))
set.seed(456)
# adonis2() replaces the deprecated adonis()
perm <- adonis2(t(sub_norm.matrix) ~ Label,
                data = sub_metadata,
                permutations = 999,
                method = "bray")
perm
```
No significant difference was found in the pairwise comparison based on label status.