-
Notifications
You must be signed in to change notification settings - Fork 0
/
analysis.Rmd
356 lines (322 loc) · 14.5 KB
/
analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
---
title: "Benchmarking approaches for the analysis of single-cell expression data"
author: "Alan Murphy"
date: "Most recent update:<br> `r Sys.Date()`"
output:
rmarkdown::html_document:
theme: spacelab
highlight: zenburn
code_folding: show
toc: true
toc_float: true
smooth_scroll: true
number_sections: false
self_contained: true
editor_options:
chunk_output_type: inline
---
```{r setup, include=T, message=F}
#root.dir <- here::here()
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
#root.dir = root.dir
fig.height = 6,
fig.width = 8 #178mm
)
knitr::opts_knit$set(#root.dir = root.dir,
dpi = 350)
library(data.table)
library(cowplot)
library(wesanderson)
library(ggplot2)
library(patchwork)
library(pROC)
```
## Background
Recently, [Zimmerman et al.](https://www.nature.com/articles/s41467-021-21038-1)
proposed the use of mixed models over pseudobulk aggregation approaches,
reporting improved performance on a novel simulation approach of hierarchical
single-cell expression data. However, their reported results could not prove the
superiority of mixed models as they only calculate performance based on type 1
error (performance of the models on non-differentially expressed genes). To
correctly benchmark the models, a reanalysis using a balanced measure of
performance, considering both the type 1 and type 2 errors (both the
differentially and non-differentially expressed genes), is necessary.
This benchmark was conducted using the R/benchmark_script.R in this repository
and an edited version of hierarchicell which returns the Matthews correlation
coefficient performance metric as well as the type 1 error rates, uses the same
simulated data across approaches and has checkpointing capabilities (so runs can
continue from where they left off if aborted or crashed), available at:
https://github.com/neurogenomics/hierarchicell.
We tested the models using the Matthews Correlation Coefficient (MCC). MCC is a
well-known and frequently adopted metric in the machine learning field which
offers a more [informative and reliable score on binary classification problems](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7).
MCC produces scores in [-1,1] and will only assign a high score if a model
performs well on both non-differentially and differentially expressed genes.
Moreover, MCC scores are proportional to both the size of the differentially and
non-differentially expressed genes, so it is robust to imbalanced datasets.
The benchmark was conducted based on the average MCC from 20,000 iterations;
50 runs for each of the 5 to 40 individuals and 50 to 500 cells at a p-value
cut-off of 0.05 on 10,000 genes (5,000 differentially expressed genes at varying
fold changes and 5,000 non-differentially expressed genes).
## Results of benchmark analysis
We can visualise the results from the analysis, first load the data which is
stored in this repo:
```{r}
res <- data.table::fread("./results/method_performances.csv")
print(head(res))
```
Now let's plot the results. We will split this plot to also show just the top
four performing results (left) so the differences between these are clearer:
```{r}
#get colour palette set up - choose distinctive colours
#with similar tones for similar methods
q10 <- c(
#orange - GEE
"orange2",
#yellow - modified t
"yellow2",
#purple - PB
"plum2",
"purple3",
#gold - Tobit
"goldenrod4",
#green - GLMs
"seagreen1",
"green4",
#blue - MAST
"dodgerblue4",
"cadetblue4",
"cadetblue2"
)
size1 <- 7.4
size2 <- 10.5
all_n_ind <- c(5,10,20,30,40)
all_n_cells <- c(50,100,250,500)
tmp <- expand.grid(all_n_ind, all_n_cells)
tmp <- tmp[order(tmp$Var1),]
#create plotting column n_ind,n_cells - make factor so order maintained
res[,n_ind_n_cells:=factor(paste0(n_ind,"-\n",n_cells),
levels=apply(tmp, 1, paste, collapse="-\n"))]
#add descriptive method name
res[,method_des:=fcase(
method == "MAST", "Two-part hurdle: Default",
method == "MAST_Combat", "Two-part hurdle: Corrected",
method == "MAST_RE","Two-part hurdle RE",
method == "Monocle","Tobit",
method == "ROTS","Modified t",
method == "GLMM_tweedie","Tweedie: GLMM",
method == "GLM_tweedie","Tweedie: GLM",
method == "Pseudobulk_mean","Pseudobulk: Mean",
method == "Pseudobulk_sum","Pseudobulk: Sum",
method == "GEE1","GEE1"
)]
mcc_res <- res[,mean(MCC),
c("n_ind_n_cells","n_ind","n_cells","method_des")]
setorder(mcc_res,method_des)
#plot all res
plt_mcc1 <-
ggplot(mcc_res)+
geom_line(aes(x=n_ind_n_cells,y=V1,colour=method_des,group=method_des))+
geom_vline(xintercept=4.5,color="grey",linetype = 3)+
geom_vline(xintercept=8.5,color="grey",linetype = 3)+
geom_vline(xintercept=12.5,color="grey",linetype = 3)+
geom_vline(xintercept=16.5,color="grey",linetype = 3)+
theme_cowplot()+
scale_colour_manual(values=q10)+
labs(colour='Method',y="Matthews Correlation Coefficient (MCC)",
x="Number of Individuals - Number of cells")+
theme(axis.text.x = element_text(size=size1),
axis.text.y = element_text(size=size1),
axis.title.x = element_text(size=size2),
axis.title.y = element_text(size=size2),
legend.text = element_text(size=size2),
legend.title = element_blank())
#plot top 4
plt_mcc2 <-
ggplot(mcc_res[method_des %in% c("Pseudobulk: Mean","Pseudobulk: Sum",
"GEE1","Tweedie: GLMM"),])+
geom_line(aes(x=n_ind_n_cells,y=V1,colour=method_des,group=method_des),
show.legend = FALSE)+
geom_vline(xintercept=4.5,color="grey",linetype = 3)+
geom_vline(xintercept=8.5,color="grey",linetype = 3)+
geom_vline(xintercept=12.5,color="grey",linetype = 3)+
geom_vline(xintercept=16.5,color="grey",linetype = 3)+
theme_cowplot()+
labs(colour='Method',y="Matthews Correlation Coefficient (MCC)",
x="Number of Individuals - Number of cells")+
theme(axis.text.x = element_text(size=size1),
axis.text.y = element_text(size=size1),
axis.title.x = element_text(size=size2),
axis.title.y = element_blank(),
legend.text = element_text(size=size2),
legend.title = element_blank())+
#get matching colours for DEG approaches from full graph
scale_colour_manual(values=q10[c(1,3,4,7)])
plt_mcc1 + plt_mcc2 +
plot_layout(guides = "collect") & theme(legend.position = "bottom")
```
So it is clear that our results demonstrate that pseudobulk approaches are far
from being "too conservative" and are, in fact, the best performing models
based on this simulated dataset for the analysis of single-cell expression data.
Now let's check on an imbalanced number of cells, plotted with the balanced
number of cells for comparison for 20 individuals. This time we will also plot
the standard deviation around the mean to show how much the performance varies:
```{r}
res_imbal <- data.table::fread("./results/method_performances_imbal.csv")
#add descriptive method name
res_imbal[,method_des:=fcase(
method == "MAST", "Two-part hurdle: Default",
method == "MAST_Combat", "Two-part hurdle: Corrected",
method == "MAST_RE","Two-part hurdle RE",
method == "Monocle","Tobit",
method == "ROTS","Modified t",
method == "GLMM_tweedie","Tweedie: GLMM",
method == "GLM_tweedie","Tweedie: GLM",
method == "Pseudobulk_mean","Pseudobulk: Mean",
method == "Pseudobulk_sum","Pseudobulk: Sum",
method == "GEE1","GEE1"
)]
res_imbal[,n_cells:="imbal"]
res_imbal[,n_ind_n_cells:=factor(paste0(n_ind,"-\n",n_cells))]
mcc_res_imbal <- res_imbal[,mean(MCC),
c("n_ind_n_cells","n_ind","n_cells","method_des")]
#add in runs for 20 ind not imbal
mcc_res_imbal <-
rbindlist(list(mcc_res[n_ind==20,],mcc_res_imbal))
#plot with error bands
#get sd
sd_imbal <-
res_imbal[,sd(MCC),
c("n_ind_n_cells","n_ind","n_cells","method_des")]
sd_bal <-
res[,sd(MCC),
c("n_ind_n_cells","n_ind","n_cells","method_des")]
sd_imbal <- rbindlist(list(sd_bal[n_ind==20,],sd_imbal))
setnames(sd_imbal,"V1","sd")
mcc_res_imbal[sd_imbal,sd:=i.sd, on=c("n_ind_n_cells","method_des")]
ggplot(mcc_res_imbal)+
geom_line(aes(x=n_ind_n_cells,y=V1,colour=method_des,group=method_des))+
geom_vline(xintercept=4.5,color="grey",linetype = 3)+
#error bars with sd
geom_errorbar(aes(x=n_ind_n_cells,ymin=V1-sd, ymax=V1+sd,colour=method_des), width=.2,
position=position_dodge(0.05))+
theme_cowplot()+
scale_colour_manual(values=q10)+
labs(colour='Method',y="Matthews Correlation Coefficient (MCC)",
x="Number of Individuals - Number of cells")+
theme(axis.text.x = element_text(size=size1),
axis.text.y = element_text(size=size1),
axis.title.x = element_text(size=size2),
axis.title.y = element_text(size=size2),
legend.text = element_text(size=size2),
legend.title = element_blank())
```
Pseudobulk mean outperformed all other approaches on this analysis. The
pseudobulk approach which aggregated by averaging rather than taking the sum
appears to be the top performing overall, however, it is worth noting that
hierarchicell does not normalise the simulated datasets before passing to the
pseudobulk approaches. This is a standard step in such analysis to account for
differences in sequencing depth and library sizes. This approach was taken by
Zimmerman et al. as their data is simulated one independent gene at a time
without considering differences in library size. The effect of this step is more
apparent on the imbalanced number of cells where pseudobulk sum’s performance
degraded dramatically. Pseudobulk mean appears invariant to this missing
normalisation step because of averaging’s own normalisation effect. It should be
noted that this is a flaw in the simulation software strategy and does not show
an improved performance of pseudobulk mean over sum.
Finally, it may be of interest to consider the performance of the different
methods at the same type 1 error rates. We can inspect their sensitivity
(1-type 2 error) at different type 1 error with Receiver Operating Curves
(ROCs).
```{r}
#load results
dat_rocs <-
data.table::fread("unzip -p ./results/method_performances_imbal_DEGs.csv.zip")
#add descriptive method name
dat_rocs[,method_des:=fcase(
method == "MAST", "Two-part hurdle: Default",
method == "MAST_Combat", "Two-part hurdle: Corrected",
method == "MAST_RE","Two-part hurdle RE",
method == "Monocle","Tobit",
method == "ROTS","Modified t",
method == "GLMM_tweedie","Tweedie: GLMM",
method == "GLM_tweedie","Tweedie: GLM",
method == "Pseudobulk_mean","Pseudobulk: Mean",
method == "Pseudobulk_sum","Pseudobulk: Sum",
method == "GEE1","GEE1"
)]
setorder(dat_rocs,method_des)
method_desc <- c("Two-part hurdle: Default","Two-part hurdle RE",
"Two-part hurdle: Corrected","Tweedie: GLM","Tweedie: GLMM",
"GEE1","Pseudobulk: Mean","Pseudobulk: Sum","Tobit",
"Modified t")
#loop through all proportions of DEGs
prop_DEGs <- c(0.05,0.1,0.2,0.3)
prop_rocs <- vector(mode="list",length=length(prop_DEGs))
names(prop_rocs) <- prop_DEGs
for(prop_i in prop_DEGs){
dat_rocs_prop <- dat_rocs[prop_DEGs==prop_i,]
#need to create a separate ROC obj per method
rocs <- vector(mode="list",length=length(method_desc))
names(rocs) <- unique(dat_rocs$method_des)
for(method_i in unique(dat_rocs$method_des)){
#define object to plot
rocs[[method_i]] <-
pROC::roc(dat_rocs_prop[method_des==method_i,]$actual,
dat_rocs_prop[method_des==method_i,]$pred)
}
#create ROC plot
prop_rocs[[as.character(prop_i)]]<-
ggroc(rocs,legacy.axes=TRUE)+
annotate("text", size=2.8, family="Helvetica",fontface="bold",
x = rep(0.9,10),
y=rev(c(0,0.03,0.06,0.09,0.12,0.15,0.18,0.21,0.24,0.27)),
label = lapply(rocs,function(x) round(x$auc[1],3)),color=q10)+
#geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
# color="darkgrey", linetype="dashed")+
#geom_text(lapply(rocs,function(x) x$auc[1]))+
theme_cowplot()+
scale_colour_manual(values=q10)+
labs(colour='Method',y="1 - Type 2 error",
x="Type 1 error")+
ggtitle(as.character(prop_i))+
theme(plot.title = element_text(size = size2, face = NULL),
axis.text.x = element_text(size=size1),
axis.text.y = element_text(size=size1),
axis.title.x = element_text(size=size2),
axis.title.y = element_text(size=size2),
legend.text = element_text(size=size2),
legend.title = element_blank())
}
#plot all
prop_rocs[["0.05"]] + prop_rocs[["0.1"]] +
prop_rocs[["0.2"]]+prop_rocs[["0.3"]] +
plot_layout(guides = "collect",ncol = 4) & theme(legend.position = "bottom")
```
## Future work
Pseudobulk approaches were also found to be the top performing approaches in a
recent review by [Squair et al.](https://www.nature.com/articles/s41467-021-25960-2).
Notably, the pseudobulk method used here; DESeq2, performed worse than other
pseudobulk models in Squair et al.,’s analysis and so their adoption may further
increase the performance of pseudobulk approaches on our dataset. Conversely,
Squair et al., did not consider all models included in our analysis or the
different forms of pseudobulk aggregation. Therefore, our results on sum and
mean pseudobulk extend their findings and indicate that mean aggregation may be
the best performing.
However, you should be cognizant that the lack of a normalisation step based on
the flaw in the simulation software strategy likely causes the increased
performance of mean over sum aggregation. Further, the use of simulated datasets
in our analysis may not accurately reflect the differences between individuals
that are present in biological datasets. Thus, despite both our results and
those reported by Squair et al., there is still room for further analysis,
benchmarking more models, including different combinations of pseudobulk
aggregation methods and models, on more representative simulated datasets and
biological datasets to identify the optimal approach. Specifically, we would
expect pseudobulk sum with a normalisation step to outperform pseudobulk mean
since it can account for the intra-individual variance which is otherwise lost
with pseudobulk mean but this should be tested.
In conclusion, these results demonstrate that pseudobulk approaches are far from
being too conservative and are, in fact, the best performing models based on
this simulated dataset for the analysis of single-cell expression data.