scAPA_vignette.rmd

---
title: "scAPA package vignette"
output: rmarkdown::github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Install and load scAPA package.

```{r load packeges}
if(!require(devtools)) install.packages("devtools")
require(devtools)
if(!require(scAPA)) devtools::install_github("ElkonLab/scAPA/scAPA") 
require(scAPA)
```

The file Peaks.RDS used for this example is the result of running scAPA.shell.script.R on data from [Lukassen et al](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6132189/).

This **is not** a downsampled version, but an analysis of the full bams.
Download and load the scAPAList object Peaks.RDS.
```{r Download, echo=TRUE, eval=FALSE}
peaks.url <- "https://github.com/ElkonLab/scAPA/blob/master/Rfiles/Peaks.RDS"
download.file(url = peaks.url, destfile = "Peaks.RDS")
```

```{r load, echo=TRUE}
a <- readRDS("Peaks.RDS")
a
```
The object contains the following: 

* The peak count matrix:
```{r show count matrix, echo=T}
head(a@cells.counts[,1:3])
```
Rows are peaks, and columns are cells. The cells name is composed of its sample and barcodes (the sample is added to cope with cells barcodes that are duplicated between samples).

* Cell-cluster annotions:
```{r show cluster anotations}
head(a@cluster.anot)
```
The first column is the cell barcodes as it appears in the peak count matrix. The second is their assigned cell clusters. Other columns, such as the tsne coordinates, are optional.

* A data.frame with the sequence downstream the 3' edge of each peak:
```{r show down seq}
head(a@down.seq,2)
```
This is required for filtering peaks that may result from internal priming.

* Information regarding the peaks genomic location (optional slot): 
```{r show row.Data}
head(a@row.Data)
```

First, calculate the sum of the counts from each peak in each cell cluster.

```{r calc_clusters_counts}
a <- calc_clusters_counts(a)
a
```
This step is required if the option sc was specified in scAPA.shell.scipt.R If sc = false, clusters' counts have already been calculated.

##Peak filtering

First, we calculate the peak counts-per-million (CPM) peak cluster matrix.

```{r calc_cpm}
a <- calc_cpm(a)
head(a@norm$CPM)
```
Then, only peaks whose total sum over all cell clusters is above a certion CPMs threshold are considered. Here we choose 10 as the threshold:  

```{r filter law CPMs}
keep.cpm <- rowSums(a@norm$CPM) > 10
a <- a[keep.cpm,]
a
```


To exclude internal priming suspected peaks, peaks having a stretch of at least 8 consecutive As in the region between 10 nt to 140 nt. to the peak's end are filtered:

```{r filter_IP}
a <- filter_IP(x = a, int.priming.seq = "AAAAAAAA",
               left = 10,
               right = 140)
a
```

The sequence and the regions can be varied using the arguments "int.priming.seq", "left", "right". See the function manual page for more details by typing ?filter_IP in R studio.


## Statistical analysis – detection of dynamic APA events between cell clusters

* Set an scAPAresults object:

```{r}
results <- set_scAPAresults(a)
results
```

* The slot clus.counts is a list such that each 3’UTR with more than one peak is represented by a table where rows are its associated peaks and columns are cell clusters. 

```{r show results clus.counts}
results@clus.counts[1:2]
```

An analogous list is produced for the 'cells.counts' slot:
```{r show results cells.counts}
results@cells.counts[[1]][,1:3]
```
Here each column is a cell.

* For each such table, perform a Chi-squared test, p-values are corrected for multiple testing using BH FDR.

```{r test_APA}
results <- test_APA(results, clus = "all")
head(results@pvalues$all)
```

The clus argument specifies the clusters to be tested. Default is "all" for all cluster. specify, for example, clus = c("ES", "RS") to test ES vs RS.

* For 3' UTR with more than two peaks that show significant usage change, for each peak, chi-square test for goodness of fit is performed. The threshold for significant change is set by the argument sig.level (FDR value)

```{r test_peaks}
results <- test_peaks(results, clus = "all", sig.level = 0.05)
```

We get a slot 'peak.pvaues' with such peaks (p-values, q-values, and peak PUI for each cluster).

```{r}
head(results@peak.pvaues$Peaks_all)
```

## Inferring global trends of APA modulation

* The proximal pA site usage index (proximal PUI) is used to quantify the relative usage of the most proximal pA site within a 3' UTR (with two or more peaks). See [pipeline.description.pdf](pipeline.description.pdf) for the definition of this index.

```{r calc_p_pui_mat}
results <- calc_p_pui_mat(results)
head(results@ppui.clus)
```

This function also calculates the mean PUI index for each cell and assin it to the slot 'ppui.cells'.

```{r ppui.cells}
head(results@ppui.cells)
```

Typing the name of the scAPAresult object in R now will show a summary of the results:
```{r summary}
results
```

* We creat a scAPAresults object containing only 3'UTR with significant APA events. 
```{r significants}
sig.utrs <- as.vector(results@pvalues$all[,2] < 0.05)
sig <- results[which(sig.utrs),]
```

* Plot the cumulative distribution function of the proximal PUI of the significant 3'UTRs:
```{r plot ppui, eval=F}
require(ggplot2)
tidy.ppui <- as.data.frame(sig@ppui.clus)
colnames(tidy.ppui) <- gsub(x = colnames(tidy.ppui), 
                            pattern = "_proximal_PUI", replacement = "")
tidy.ppui <- tidyr::gather(data = tidy.ppui)
colnames(tidy.ppui) <- c("Cluster", "value")
p <- ggplot(data = tidy.ppui, aes(x = value, color = Cluster)) + 
  stat_ecdf(size = 1) + theme_bw() + xlab("Proximal peak usage index") +
  ylab("Cumulative fraction") + ggplot2::coord_cartesian(xlim = c(-3, 4.1))
print(p)
```

* Consider 3' UTRs with exactly two peaks to identify events of 3'UTR shortening
(or lengthening). 

```{r}
two.peaks <- sapply(sig@clus.counts, function(x){nrow(x) == 2})
sig.two <- sig[which(two.peaks),]
sig.two
```
For such 3' UTRs, an increase in the value of the proximal PUI in cell type 1 relative to cell type 2 indicates 3' UTR shortening in cell type 1. For example, comparing RS to SC:
```{r RS vs SC}
RS.SC.shortening = sum(sig.two@ppui.clus[,2] > sig.two@ppui.clus[,3], 
                       na.rm = T)
RS.SC.lengthening = sum(sig.two@ppui.clus[,2] < sig.two@ppui.clus[,3],
                        na.rm = T)
RS.SC.shortening 
RS.SC.lengthening
```

* We can plot a tSNE coloring cells according to their mean proximal PUI, as follows:

```{r tsne, eval=F}
require(ggplot2)
mean_pui <- as.data.frame(results@ppui.cells)

mean_pui$Cell_BC <- gsub(pattern = "_proximal_PUI", replacement = "", 
                         x = rownames(mean_pui))
colnames(mean_pui)[1] <- "Mean Proximal PUI"
tsne <- merge(a@cluster.anot, mean_pui, by.x = "Cell_BC")
require(gridExtra)
colnames(tsne)[2:4] <- c("Cell Cluster", "tsne1", "tsne2")
              p <-ggplot(tsne, ggplot2::aes(x = tsne1 , 
                                            y = tsne2, 
                                            color = `Mean Proximal PUI`)) +
                geom_point(size=0.8) + xlab("tSNE1") + ylab("tSNE2") + 
                theme_bw() +scale_color_gradient(low = "blue", high = "red") +
                ggtitle("tSNE mean proximal PUI") + 
                theme(legend.position="top")
              g <- ggplot(tsne, ggplot2::aes(x = tsne1, y = tsne2,
                                             color = `Cell Cluster`)) +
                ggplot2::geom_point(size=0.8) + xlab("tSNE1") + 
                ylab("tSNE2") + theme_bw() + ggtitle("tSNE cell clusters") + 
                theme(legend.position="top")
              gridExtra::grid.arrange(g, p, nrow = 1)

```