pda_quantification_preprocessing_noframes.Rmd

---
title: "Statistical Methods for Quantitative MS-based Proteomics: Part I. Preprocessing"
author: "Lieven Clement"
date: "[statOmics](https://statomics.github.io), Ghent University"
output:
    html_document:
      code_download: true
      theme: flatly
      toc: true
      toc_float: true
      highlight: tango
      number_sections: true
bibliography: msqrob2.bib

---

<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>

- [Playlist PDA Preprocessing](https://www.youtube.com/playlist?list=PLZH1hP8_LbJJXQeQ_KYDNoq-AXyFBG6nX)

# Outline {-}

1. Introduction

2. Preprocessing

    - Log-transformation
    - Filtering
    - Normalization
    - Summarization

Note, that the R-code is included for learners who are aiming to develop R/markdown scripts to automate their quantitative proteomics data analyses.
According to the target audience of the course we either work with a graphical user interface (GUI) in a R/shiny App msqrob2gui (e.g. Proteomics Bioinformatics course of the EBI and the Proteomics Data Analysis course at the Gulbenkian institute) or with R/markdowns scripts (e.g. Bioinformatics Summer School at UCLouvain or the Statistical Genomics Course at Ghent University). 

---

# Intro: Challenges in Label-Free Quantitative Proteomics


## MS-based workflow

```{r echo=FALSE}
knitr::include_graphics("./figures/ProteomicsWorkflow.png")
```
  
- Peptide Characteristics
  
  - Modifications
  - Ionisation Efficiency: huge variability
  - Identification
    - Misidentification $\rightarrow$ outliers
    - MS$^2$ selection on peptide abundance
    - Context depending missingness
    - Non-random missingness

$\rightarrow$ Unbalanced pepide identifications across samples and messy data

---

## Level of quantification

- MS-based proteomics returns peptides: pieces of proteins

```{r echo=FALSE}
knitr::include_graphics("./figures/challenges_peptides.png")
```

- Quantification commonly required on the protein level

```{r echo=FALSE}
knitr::include_graphics("./figures/challenges_proteins.png")
```

---

## Label-free Quantitative Proteomics Data Analysis Workflows

```{r echo=FALSE}
knitr::include_graphics("./figures/proteomicsDataAnalysis.png")
```

---

## CPTAC Spike-in Study

```{r echo=FALSE, out.width="50%"}
knitr::include_graphics("./figures/cptacLayoutLudger.png")
```

- Same trypsin-digested yeast proteome background in each sample
- Trypsin-digested Sigma UPS1 standard: 48 different human proteins spiked in at 5 different concentrations (treatment A-E) 
- Samples repeatedly run on different instruments in different labs
- After MaxQuant search with match between runs option

  - 41\% of all proteins are quantified in all samples
  - 6.6\% of all peptides are quantified in all samples

$\rightarrow$ vast amount of missingness


## Maxquant output

```{r echo=FALSE}
knitr::include_graphics("./figures/maxquantOutputDir.png")
```

---


# Import the data in R 

## Data infrastructure

<details><summary> Click to see background on data infrastructure used in R to store proteomics data </summary><p>
- We use the `QFeatures` package that provides the infrastructure to
  - store,  
  - process, 
  - manipulate and 
  - analyse quantitative data/features from mass spectrometry
experiments. 

- It is based on the `SummarizedExperiment` and
`MultiAssayExperiment` classes. 


```{r fig.cap = "Conceptual representation of a `SummarizedExperiment` object.  Assays contain information on the measured omics features (rows) for different samples (columns). The `rowData` contains information on the omics features, the `colData` contains information on the samples, i.e. experimental design etc.", echo=FALSE, out.width="80%"}
knitr::include_graphics("./figures/SE.png")
```

- Assays in a QFeatures object have a
hierarchical relation: 
  
  - proteins are composed of peptides, 
  - themselves produced by spectra
  - relations between assays are tracked and recorded throughout data processing

```{r featuresplot, fig.cap = "Conceptual representation of a `QFeatures` object and the aggregative relation between different assays.", echo = FALSE}
par(mar = c(0, 0, 0, 0))
plot(NA, xlim = c(0, 12), ylim = c(0, 20),
     xaxt = "n", yaxt = "n",
     xlab = "", ylab = "", bty = "n")
for (i in 0:7)
    rect(0, i, 3, i+1, col = "lightgrey", border = "white")
for (i in 8:12)
    rect(0, i, 3, i+1, col = "steelblue", border = "white")
for (i in 13:18)
    rect(0, i, 3, i+1, col = "orange", border = "white")
for (i in 19)
    rect(0, i, 3, i+1, col = "darkgrey", border = "white")
for (i in 5:7)
    rect(5, i, 8, i+1, col = "lightgrey", border = "white")
for (i in 8:10)
    rect(5, i, 8, i+1, col = "steelblue", border = "white")
for (i in 11:13)
    rect(5, i, 8, i+1, col = "orange", border = "white")
for (i in 14)
    rect(5, i, 8, i+1, col = "darkgrey", border = "white")
rect(9, 8, 12, 8+1, col = "lightgrey", border = "white")
rect(9, 9, 12, 9+1, col = "steelblue", border = "white")
rect(9, 10, 12, 10+1, col = "orange", border = "white")
rect(9, 11, 12, 11+1, col = "darkgrey", border = "white")
segments(3, 8, 5, 8, lty = "dashed")
segments(3, 6, 5, 7, lty = "dashed")
segments(3, 4, 5, 6, lty = "dashed")
segments(3, 0, 5, 5, lty = "dashed")
segments(3, 10, 5, 9, lty = "dashed")
segments(3, 11, 5, 10, lty = "dashed")
segments(3, 13, 5, 11, lty = "dashed")
segments(3, 14, 5, 12, lty = "dashed")
segments(3, 16, 5, 13, lty = "dashed")
segments(3, 19, 5, 14, lty = "dashed")
segments(3, 20, 5, 15, lty = "dashed")
segments(8, 5, 9, 8, lty = "dashed")
segments(8, 8, 9, 9, lty = "dashed")
segments(8, 11, 9, 10, lty = "dashed")
segments(8, 14, 9, 11, lty = "dashed")
segments(8, 15, 9, 12, lty = "dashed")
```

</p></details>

## Import data in R


### Load libraries 

<details><summary> Click to see code </summary><p>
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(plotly)
library(ggplot2)
```
</p></details>

### Read data 

<details><summary> Click to see background and code </summary><p>
1. We use a peptides.txt file from MS-data quantified with maxquant that 
contains MS1 intensities summarized at the peptide level. 
```{r}
peptidesFile <- "https://raw.githubusercontent.com/statOmics/PDA/data/quantification/fullCptacDatasSetNotForTutorial/peptides.txt"
```

2. Maxquant stores the intensity data for the different samples in columnns that start with Intensity. We can retreive the column names with the intensity data with the code below: 

```{r}
ecols <- grep("Intensity\\.", names(read.delim(peptidesFile)))
```

3. Read the data and store it in  QFeatures object 

```{r}
pe <- readQFeatures(
  table = peptidesFile,
  fnames = 1,
  ecol = ecols,
  name = "peptideRaw", sep="\t")
```
</p></details>

### Explore object

<details><summary> Click to see background and code </summary><p>
- The rowData contains information on the features (peptides) in the assay. E.g. Sequence, protein, ...

```{r}
rowData(pe[["peptideRaw"]])
```

- The colData contains information on the samples

```{r} 
colData(pe)
```

- No information is stored yet on the design. 


```{r} 
pe %>% colnames
```

- Note, that the sample names include the spike-in condition. 
- They also end on a number. 
  
  - 1-3 is from lab 1, 
  - 4-6 from lab 2 and 
  - 7-9 from lab 3. 

- We update the colData with information on the design

```{r}
colData(pe)$lab <- rep(rep(paste0("lab",1:3),each=3),5) %>% as.factor
colData(pe)$condition <- pe[["peptideRaw"]] %>% colnames %>% substr(12,12) %>% as.factor
colData(pe)$spikeConcentration <- rep(c(A = 0.25, B = 0.74, C = 2.22, D = 6.67, E = 20),each = 9)
```

- We explore the colData again

```{r}
colData(pe)
```

</p></details>

---

# Preprocessing

## Log-transformation

### Explore the data with plots

Peptide AALEELVK from spiked-in UPS protein P12081. 
We only show data from lab1.

<details><summary> Click to see code to make plot </summary><p>
```{r}
subset <- pe["AALEELVK",colData(pe)$lab=="lab1"]
plotWhyLog <- data.frame(concentration = colData(subset)$spikeConcentration,
           y = assay(subset[["peptideRaw"]]) %>% c
           ) %>% 
  ggplot(aes(concentration, y)) +
  geom_point() +
  xlab("concentration (fmol/l)") +
  ggtitle("peptide AALEELVK in lab1")
```
</p></details>

```{r}
plotWhyLog
```

- Variance increases with the mean
$\rightarrow$ Multiplicative error structure 

<details><summary> Click to see code to make plot </summary><p>
```{r}
plotLog <- data.frame(concentration = colData(subset)$spikeConcentration,
           y = assay(subset[["peptideRaw"]]) %>% c
           ) %>% 
  ggplot(aes(concentration, y)) +
  geom_point() + 
  scale_x_continuous(trans='log2') + 
  scale_y_continuous(trans='log2') +
  xlab("concentration (fmol/l)") +
  ggtitle("peptide AALEELVK in lab1 with axes on log scale")
```
</p></details>

```{r}
plotLog
```

- Data seems to be homoscedastic on log-scale $\rightarrow$ log transformation of the intensity data
- In quantitative proteomics analysis on $\log_2$ 

$\rightarrow$ Differences on a $\log_2$ scale: $\log_2$ fold changes

$$
\log_2 B - \log_2 A = \log_2 \frac{B}{A} = \log FC_\text{B - A}
$$
$$ 
\begin{array} {l}
log_2 FC = 1 \rightarrow FC = 2^1 =2\\
log_2 FC = 2 \rightarrow FC = 2^2 = 4\\
\end{array}
$$


### log-transformation of the data 

<details><summary> Click to see code to log-transfrom the data </summary><p>
- We calculate how many non zero intensities we have for each peptide and this can be useful for filtering.

```{r}
rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)
```


- Peptides with zero intensities are missing peptides and should be represent
with a `NA` value rather than `0`.

```{r}
pe <- zeroIsNA(pe, "peptideRaw") # convert 0 to NA
```

- Logtransform data with base 2

```{r}
pe <- logTransform(pe, base = 2, i = "peptideRaw", name = "peptideLog")
```
</p></details>

---

## Filtering

- Reverse sequences
- Only identified by modification site (only modified peptides detected)
- Razor peptides: non-unique peptides assigned to the protein group with the most other peptides 
- Contaminants
- Peptides few identifications
- Proteins that are only identified with one or a few peptides

Filtering does not induce bias if the criterion is independent from the downstream data analysis!

<details><summary> Click to see code to filter the data </summary><p>

1. Handling overlapping protein groups

In our approach a peptide can map to multiple proteins, as long as there is
none of these proteins present in a smaller subgroup.

```{r}
pe <- filterFeatures(pe, ~ Proteins %in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins))
```

2. Remove reverse sequences (decoys) and contaminants

We now remove the contaminants, peptides that map to decoy sequences, and proteins
which were only identified by peptides with modifications.

```{r}
pe <- filterFeatures(pe,~Reverse != "+")
pe <- filterFeatures(pe,~ Potential.contaminant != "+")
```

3. Drop peptides that were only identified in one sample

We keep peptides that were observed at last twice.

```{r}
pe <- filterFeatures(pe,~ nNonZero >=2)
nrow(pe[["peptideLog"]])
```

We keep `r nrow(pe[["peptideLog"]])` peptides upon filtering.
</p></details>

---

## Normalization 

<details><summary> Click to see code to make plot </summary><p>

```{r}
densityConditionD <- pe[["peptideLog"]][,colData(pe)$condition=="D"] %>% 
  assay %>%
  as.data.frame() %>%
  gather(sample, intensity) %>% 
  mutate(lab = colData(pe)[sample,"lab"]) %>%
  ggplot(aes(x=intensity,group=sample,color=lab)) + 
    geom_density() +
    ggtitle("condition D")

densityLab2 <- pe[["peptideLog"]][,colData(pe)$lab=="lab2"] %>% 
  assay %>%
  as.data.frame() %>%
  gather(sample, intensity) %>% 
  mutate(condition = colData(pe)[sample,"condition"]) %>%
  ggplot(aes(x=intensity,group=sample,color=condition)) + 
    geom_density() +
    ggtitle("lab2")
```
</p></details>

```{r}
densityConditionD
```

```{r}
densityLab2
```
- Even in very clean synthetic dataset (same background, only 48 UPS
proteins can be different) the marginal peptide intensity distribution
across samples can be quite distinct

- Considerable effects between and within labs for replicate samples
- Considerable effects between samples with different spike-in
concentration

$\rightarrow$ Normalization is needed

---


### Mean or median?

- Miller and Fishkin (1997) reported that over a period of 30 years males would like to have on average 64.3 partners and females 2.8. 


<details><summary> </summary><p>
- Miller and Fishkin (1997) reported that the median number of partners someone would like to have over a period of 30 years males is 1 for both males and females. 
</p></details>

<details><summary> </summary><p>
Mean is very sensitive to outliers!

```{r echo=FALSE}
knitr::include_graphics("./figures/partners.png")
```
</p></details>


---

### Normalization of the data by median centering

$$y_{ip}^\text{norm} = y_{ip} - \hat\mu_i$$ 
with $\hat\mu_i$ the median intensity over all observed peptides in sample $i$.

<details><summary> Click to see R-code to normalize the data </summary><p>
```{r}
pe <- normalize(pe, 
                i = "peptideLog", 
                name = "peptideNorm", 
                method = "center.median")
```
</p></details>

### Plots of normalized data


<details><summary> Click to see code to make plot </summary><p>
```{r}
densityConditionDNorm <- pe[["peptideNorm"]][,colData(pe)$condition=="D"] %>% 
  assay %>%
  as.data.frame() %>%
  gather(sample, intensity) %>% 
  mutate(lab = colData(pe)[sample,"lab"]) %>%
  ggplot(aes(x=intensity,group=sample,color=lab)) + 
    geom_density() +
    ggtitle("condition D")

densityLab2Norm <- pe[["peptideNorm"]][,colData(pe)$lab=="lab2"] %>% 
  assay %>%
  as.data.frame() %>%
  gather(sample, intensity) %>% 
  mutate(condition = colData(pe)[sample,"condition"]) %>%
  ggplot(aes(x=intensity,group=sample,color=condition)) + 
    geom_density() +
    ggtitle("lab2")
```
</p></details>

```{r}
densityConditionDNorm
```

```{r}
densityLab2Norm
```

- Upon normalization the marginal distributions of the peptide intensities across samples are much more comparable
- We still see deviations
- This can be due to technical variability
- In micro-array literature, quantile normalisation is used to force the median and all other quantiles to be equal across samples
- In proteomics quantile normalisation often introduces artifacts due to a difference in missing peptides across samples 
- More advanced methods should be developed for normalizing proteomics data
- If there are differences in the width of the marginal distributions of the data across samples. They can also be standardized by using a robust estimator for location and scale, i.e. 
$$y_{ip}^\text{norm} = \frac{y_{ip} - \mu_i}{s_i}$$ 

---

## Summarization 

- We illustrate summarization issues using a subset of the cptac study (Lab 2, condition A and E) for a spiked protein (UPS P12081).

<details><summary> Click to see code to make plot </summary><p>
```{r plot = FALSE}
summaryPlot <- pe[["peptideNorm"]][
    rowData(pe[["peptideNorm"]])$Proteins == "P12081ups|SYHC_HUMAN_UPS",
    colData(pe)$lab=="lab2"&colData(pe)$condition %in% c("A","E")] %>%
  assay %>%
  as.data.frame %>%
  rownames_to_column(var = "peptide") %>%
  gather(sample, intensity, -peptide) %>% 
  mutate(condition = colData(pe)[sample,"condition"]) %>%
  ggplot(aes(x = peptide, y = intensity, color = sample, group = sample, label = condition), show.legend = FALSE) +
  geom_line(show.legend = FALSE) +
  geom_text(show.legend = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  xlab("Peptide") + 
  ylab("Intensity (log2)")
```
</p></details>

```{r}
summaryPlot
```

We observe:  

- intensities from multiple peptides for each protein in a sample
- Strong peptide effect 
-Unbalanced peptide identification
- Pseudo-replication: peptide intensities from a particular protein in the same sample are correlated, i.e. they more alike than peptide intensities from a particular protein between samples. 


$\rightarrow$ Summarize all peptide intensities from the same protein in a sample into a single protein expression value

Commonly used methods are 

- Mean summarization
$$
y_{ip}=\beta_i^\text{samp} + \epsilon_{ip}
$$

- Median summarization
- Maxquant's maxLFQ summarization (in protein groups file)
- Model based summarization: 
$$
y_{ip}=\beta_i^\text{samp} + \beta_p^\text{pep} + \epsilon_{ip}
$$


<details><summary> Click to see R-code to normalize the data </summary><p>
We use the standard sumarization in aggregateFeatures, which is
robust model based summarization. 

```{r,warning=FALSE}
pe <- aggregateFeatures(pe,
    i = "peptideNorm", 
    fcol = "Proteins", 
    na.rm = TRUE,
    name = "protein")
```

Other summarization methods can be implemented by using the `fun` argument in the `aggregateFeatures` function. 

- `fun = MsCoreUtils::medianPolish()` to fits an additive model (two way decomposition) using Tukey's median polish_ procedure using stats::medpolish()

- `fun = MsCoreUtils::robustSummary()` to calculate a robust aggregation using MASS::rlm() (default)

- `fun = base::colMeans()` to use the mean of each column

- `fun = matrixStats::colMedians()` to use the median of each column

- `fun = base::colSums()` to use the sum of each column

</p></details>

---

# Exercise

1. We will evaluate different summarization methods (Maxquant maxLFQ, median and robust model based) in the tutorial session before discussing on their advantages/disadvantages.
2. Can you anticipate on potential problems related to the summarization?

---

# Code

- Our R/Bioconductor package `msqrob2` can be used in R markdown scripts or with a GUI/shinyApp in the `msqrob2gui` package or . 

- Users who want to learn how to code and automate proteomics data analyses in reproducible R markdown scripts can get more information on all code used in this script in the video below. 

1. Data infrastructure

2. Import proteomics data

3. Preprocessing

    -   Log-transformation

    - Filtering

    - Normalisation

    - Summarization