15-remove-conf2.Rmd_

---
output: html_document
---

# Dealing with confounders 2

# Effectiveness by kBET

```{r}
library(kBET)
# library(data.table)
# library(RColorBrewer)
# library(ggplot2)
library(RUVSeq)
library(scater)
```

## Prelude - Load data and try different normalisations


```{r}
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[fData(umi)$use, pData(umi)$use]
endog_genes <- !fData(umi.qc)$is_feature_control
erccs <- fData(umi.qc)$is_feature_control
```


```{r}
umi.qc <- 
  scater::normaliseExprs(umi.qc,
                         method = "TMM",
                         feature_set = endog_genes)
qclust <- scran::quickCluster(umi.qc, min.size = 30)
umi.qc <- scran::computeSumFactors(umi.qc, sizes = 15, clusters = qclust)
umi.qc <- scater::normalize(umi.qc)

umi.qc <- 
  scater::normaliseExprs(umi.qc,
                         method = "RLE", 
                         feature_set = endog_genes)

umi.qc <- 
  scater::normaliseExprs(umi.qc,
                         method = "upperquartile", 
                         feature_set = endog_genes,
                         p = 0.99)
```

Use regression methods.


```{r}
#RUVg

ruvg <- RUVg(counts(umi.qc), erccs, k = 1)
set_exprs(umi.qc, "ruvg1") <- ruvg$normalizedCounts
ruvg <- RUVg(counts(umi.qc), erccs, k = 2)
set_exprs(umi.qc, "ruvg2") <- ruvg$normalizedCounts
set_exprs(umi.qc, "ruvg2_logcpm") <- log2(t(t(ruvg$normalizedCounts) / 
                                              colSums(ruvg$normalizedCounts) * 1e6) + 1)

#RUVs

scIdx <- matrix(-1, ncol = max(table(umi.qc$individual)), nrow = 3)
tmp <- which(umi.qc$individual == "NA19098")
scIdx[1, 1:length(tmp)] <- tmp
tmp <- which(umi.qc$individual == "NA19101")
scIdx[2, 1:length(tmp)] <- tmp
tmp <- which(umi.qc$individual == "NA19239")
scIdx[3, 1:length(tmp)] <- tmp
cIdx <- rownames(umi.qc)
ruvs <- RUVs(counts(umi.qc), cIdx, k = 1, scIdx = scIdx, isLog = FALSE)
set_exprs(umi.qc, "ruvs1") <- ruvs$normalizedCounts
ruvs <- RUVs(counts(umi.qc), cIdx, k = 2, scIdx = scIdx, isLog = FALSE)
set_exprs(umi.qc, "ruvs2") <- ruvs$normalizedCounts
set_exprs(umi.qc, "ruvs2_logcpm") <- log2(t(t(ruvs$normalizedCounts) / 
                                              colSums(ruvs$normalizedCounts) * 1e6) + 1)
```

## kBET - k-nearest neighbour batch estimation test

We estimate if the batch effect was successfully removed from the data, we test how well replicates of the same patients would mix. Theoretically, all replicates should be sampled from the same high dimensional distribution. In this ideal situation, any neighbourhood of a randomly chosen cell (i.e. some spatial subset) has equal contributions of each batch. In a biased situation, some of these subsets consist only of cells from one batch. The tool **kBET** tells how many of these subsets have a *biased* composition. Hence, the smaller the observed kBET value, the less bias. However, the confounded design of the dataset would not allow to test for effects of overcorrection.  


```{r}
compare_kBET_results <- function(sce){
  indiv <- unique(sce$individual)
  norms <- names(assayData(sce))
  results <- list()
  for (i in indiv){
    for (j in norms){
      tmp <- kBET(df = t(get_exprs(sce[,sce$individual== i], exprs_values = j)), batch = sce$batch[sce$individual==i], 
                  heuristic = TRUE, verbose = FALSE, addTest = FALSE, plot = FALSE)
      results[[i]][[j]] <- tmp$summary$kBET.observed[1]
    }
  }
  return(as.data.frame(results))
}
```


```{r}
eff_debatching <- compare_kBET_results(umi.qc)
```

## Plot results


```{r}
dod <- melt(as.matrix(eff_debatching),  value.name="kBET")
colnames(dod)[1:2] <- c("Normalisation","Individual")

colorset <- c('gray', brewer.pal(n = 9, "RdYlBu"))

ggplot(dod, aes(Normalisation, Individual, fill=kBET)) +  geom_tile() +
  scale_fill_gradient2(na.value = "gray",  low = colorset[2], mid=colorset[6], high = colorset[10],
                       midpoint = 0.5, limit = c(0,1)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 12, hjust = 1)) + 
ggtitle("Effect of batch regression methods per individual")

```

In conclusion, batch effects have not been removed completely by any of the tested methods. According to kBET results, we obtained the best results in individual *NA19098* with **RUVs** and two factors of unwanted variation. In individual *NA19101*, **RUVg** with k=2 and log(CPM) normalisation appears to be least batch affected. Some methods tend to increase the batch effect even - compare *counts* to *exprs* columns, for example.