forked from hemberg-lab/scRNA.seq.course
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path15-remove-conf2.Rmd_
127 lines (98 loc) · 4.51 KB
/
15-remove-conf2.Rmd_
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
output: html_document
---
# Dealing with confounders 2
# Effectiveness by kBET
```{r}
library(kBET)
# library(data.table)
# library(RColorBrewer)
# library(ggplot2)
library(RUVSeq)
library(scater)
```
## Prelude - Load data and try different normalisations
```{r}
umi <- readRDS("tung/umi.rds")
umi.qc <- umi[fData(umi)$use, pData(umi)$use]
endog_genes <- !fData(umi.qc)$is_feature_control
erccs <- fData(umi.qc)$is_feature_control
```
```{r}
umi.qc <-
scater::normaliseExprs(umi.qc,
method = "TMM",
feature_set = endog_genes)
qclust <- scran::quickCluster(umi.qc, min.size = 30)
umi.qc <- scran::computeSumFactors(umi.qc, sizes = 15, clusters = qclust)
umi.qc <- scater::normalize(umi.qc)
umi.qc <-
scater::normaliseExprs(umi.qc,
method = "RLE",
feature_set = endog_genes)
umi.qc <-
scater::normaliseExprs(umi.qc,
method = "upperquartile",
feature_set = endog_genes,
p = 0.99)
```
Use regression methods.
```{r}
#RUVg
ruvg <- RUVg(counts(umi.qc), erccs, k = 1)
set_exprs(umi.qc, "ruvg1") <- ruvg$normalizedCounts
ruvg <- RUVg(counts(umi.qc), erccs, k = 2)
set_exprs(umi.qc, "ruvg2") <- ruvg$normalizedCounts
set_exprs(umi.qc, "ruvg2_logcpm") <- log2(t(t(ruvg$normalizedCounts) /
colSums(ruvg$normalizedCounts) * 1e6) + 1)
#RUVs
scIdx <- matrix(-1, ncol = max(table(umi.qc$individual)), nrow = 3)
tmp <- which(umi.qc$individual == "NA19098")
scIdx[1, 1:length(tmp)] <- tmp
tmp <- which(umi.qc$individual == "NA19101")
scIdx[2, 1:length(tmp)] <- tmp
tmp <- which(umi.qc$individual == "NA19239")
scIdx[3, 1:length(tmp)] <- tmp
cIdx <- rownames(umi.qc)
ruvs <- RUVs(counts(umi.qc), cIdx, k = 1, scIdx = scIdx, isLog = FALSE)
set_exprs(umi.qc, "ruvs1") <- ruvs$normalizedCounts
ruvs <- RUVs(counts(umi.qc), cIdx, k = 2, scIdx = scIdx, isLog = FALSE)
set_exprs(umi.qc, "ruvs2") <- ruvs$normalizedCounts
set_exprs(umi.qc, "ruvs2_logcpm") <- log2(t(t(ruvs$normalizedCounts) /
colSums(ruvs$normalizedCounts) * 1e6) + 1)
```
## kBET - k-nearest neighbour batch estimation test
We estimate if the batch effect was successfully removed from the data, we test how well replicates of the same patients would mix. Theoretically, all replicates should be sampled from the same high dimensional distribution. In this ideal situation, any neighbourhood of a randomly chosen cell (i.e. some spatial subset) has equal contributions of each batch. In a biased situation, some of these subsets consist only of cells from one batch. The tool **kBET** tells how many of these subsets have a *biased* composition. Hence, the smaller the observed kBET value, the less bias. However, the confounded design of the dataset would not allow to test for effects of overcorrection.
```{r}
compare_kBET_results <- function(sce){
indiv <- unique(sce$individual)
norms <- names(assayData(sce))
results <- list()
for (i in indiv){
for (j in norms){
tmp <- kBET(df = t(get_exprs(sce[,sce$individual== i], exprs_values = j)), batch = sce$batch[sce$individual==i],
heuristic = TRUE, verbose = FALSE, addTest = FALSE, plot = FALSE)
results[[i]][[j]] <- tmp$summary$kBET.observed[1]
}
}
return(as.data.frame(results))
}
```
```{r}
eff_debatching <- compare_kBET_results(umi.qc)
```
## Plot results
```{r}
dod <- melt(as.matrix(eff_debatching), value.name="kBET")
colnames(dod)[1:2] <- c("Normalisation","Individual")
colorset <- c('gray', brewer.pal(n = 9, "RdYlBu"))
ggplot(dod, aes(Normalisation, Individual, fill=kBET)) + geom_tile() +
scale_fill_gradient2(na.value = "gray", low = colorset[2], mid=colorset[6], high = colorset[10],
midpoint = 0.5, limit = c(0,1)) +
scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1)) +
ggtitle("Effect of batch regression methods per individual")
```
In conclusion, batch effects have not been removed completely by any of the tested methods. According to kBET results, we obtained the best results in individual *NA19098* with **RUVs** and two factors of unwanted variation. In individual *NA19101*, **RUVg** with k=2 and log(CPM) normalisation appears to be least batch affected. Some methods tend to increase the batch effect even - compare *counts* to *exprs* columns, for example.