---
title: "A comparison of single-cell trajectory inference methods"
output: dynbenchmark::pdf_manuscript
---
```{r setup, include=FALSE}
library(dynbenchmark)
library(tidyverse)
tools <- read_rds(result_file("tools.rds", "03-methods"))
methods <- read_rds(result_file("methods.rds", "03-methods"))
```
Wouter Saelens* ¹ ², Robrecht Cannoodt* ¹ ² ³, Helena Todorov¹ ² ⁴, Yvan Saeys¹ ²
\* Equal contribution
¹ Data mining and Modelling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium.
² Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium.
³ Center for Medical Genetics, Ghent University Hospital, Ghent, Belgium.
⁴ Centre International de Recherche en Infectiologie, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, École Normale Supérieure de Lyon, Univ Lyon, F-69007, Lyon, France
Published in Nat Biotechnol 37, 547–554 (2019). doi: [10.1038/s41587-019-0071-9](https://doi.org/10.1038/s41587-019-0071-9).
```{r}
n_tools <- nrow(tools)
n_tools_evaluated <- nrow(filter(tools, method_evaluated))
n_methods_evaluated <- nrow(filter(methods, method_evaluated, method_source == "tool"))
n_control_methods_evaluated <- nrow(filter(methods, method_evaluated, method_source == "control"))
n_datasets_real <- nrow(list_datasets("real"))
n_datasets_synthetic <- nrow(list_datasets("synthetic"))
```
# Abstract
Recent technological advances allow the unbiased modelling of cellular dynamic processes by generating -omics profiles of thousands of single cells and computationally ordering these cells along a trajectory. Since 2014, at least `r n_tools` tools for trajectory inference have been developed, but due to the high variability in their inputs and outputs, these methods are difficult to compare. As a result, a comprehensive assessment of the performance of trajectory inference methods is still lacking, as are guidelines for users in the field.
In this study, we evaluated a total of `r n_methods_evaluated` methods in terms of cellular ordering, topology, scalability, and usability. Our results highlight the complementarity of existing tools and show that there is no generally best-performing method. Instead, the choice of method mostly depends on the dataset dimensions and the trajectory topology present in the dataset. To assist users in selecting the most appropriate method, we therefore developed a set of guidelines, available as an interactive app at `r print_url("http://guidelines.dynverse.org")`. While the field has certainly matured significantly since its inception, further progress is still necessary to cope with increasingly complex and novel use cases. We hope to spearhead such developments, and the field as a whole, by making our evaluation pipeline easily extensible and freely available at `r print_url("https://www.github.com/dynverse/dynverse")`.
# Introduction
<!-- §1: TI overview -->
Single-cell -omics technologies now make it possible to model biological systems more accurately than ever before [@tanay_scaling_2017]. One area where single-cell data has been particularly useful is in the study of cellular dynamic processes, such as the cell cycle, cell differentiation and cell activation [@etzrodt_quantitativesinglecellapproaches_2014]. Such dynamic processes can be modelled computationally using trajectory inference (TI) methods, also called pseudotime analysis, which order cells along a trajectory by their overall similarity in -omics datasets [@trapnell_definingcelltypes_2015; @cannoodt_computational_2016; @moon_manifoldlearningbasedmethods_2018]. The resulting trajectories are most often linear, bifurcating or tree-shaped, but more recent methods also allow identifying more complex trajectory topologies such as cyclic [@liu_reconstructingcellcycle_2017] or disconnected graphs [@wolf_graphabstractionreconciles_2017]. TI methods offer an unbiased and transcriptome-wide understanding of a dynamic process [@tanay_scaling_2017], thereby allowing the objective identification of new (primed) subsets of cells [@schlitzer_identificationcdc1cdc2committed_2015], delineation of a differentiation tree [@velten_humanhaematopoieticstem_2017; @see_mappinghumandc_2017] and inference of regulatory interactions responsible for one or more bifurcations [@aibar_scenicsinglecellregulatory_2017].
Current applications of TI focus on specific subsets of cells, but ongoing efforts to construct transcriptomic catalogues of whole organisms [@regev_scienceforumhuman_2017; @han_mappingmousecell_2018; @schaum_singlecelltranscriptomics20_2018] underline the urgency for accurate, scalable [@aibar_scenicsinglecellregulatory_2017; @angerer_singlecellsmake_2017] and user-friendly TI methods.
<!-- §2: Plethora of new TI methods -->
A plethora of TI methods has been developed in recent years, and new ones are being created every month (`r ref("stable", "tools")`). Indeed, in several repositories listing single-cell tools, such as `r print_url("http://www.omictools.org")` [@henry_omictoolsinformativedirectory_2014], the "awesome-single-cell" list (`r print_url("https://github.com/seandavi/awesome-single-cell")`) [@davis_awesomesinglecelllistsoftware_2018] and `r print_url("http://www.scRNA-tools.org")` [@zappia_exploringsinglecellrnaseq_2017], TI methods are one of the largest categories. While each method has its own unique set of characteristics in terms of underlying algorithm, required prior information and produced outputs, two of the most distinctive differences between TI methods are whether they fix the topology of the trajectory, and what type(s) of graph topologies they can detect. Early TI methods typically fixed the topology algorithmically (e.g. linear [@bendall_singlecelltrajectorydetection_2014; @schlitzer_identificationcdc1cdc2committed_2015; @shin_singlecellrnaseqwaterfall_2015; @campbell_bayesiangaussianprocess_2015] or bifurcating trajectories [@haghverdi_diffusionpseudotimerobustly_2016; @setty_wishboneidentifiesbifurcating_2016]), or through parameters provided by the user [@trapnell_dynamics_2014; @matsumoto_scoupprobabilisticmodel_2016]. These methods therefore mainly focused on correctly ordering the cells along the fixed topology. More recent methods also infer the topology itself [@qiu_reversedgraphembedding_2017; @street_slingshotcelllineage_2017; @wolf_graphabstractionreconciles_2017], which increases the difficulty of the problem at hand, but allows an unbiased identification of both the ordering inside a branch and the topology connecting these branches.
```{r}
add_stable(
result_file("tools.xlsx", "03-methods"),
"tools",
paste0("Overview of available trajectory inference tools, and whether they were included in this study."),
""
)
```
<!-- §3: Problem statement -->
Given the diversity in TI methods, an important issue to address is a quantitative assessment of their performance, scalability, robustness and usability. Many attempts at tackling this issue have already been made [@haghverdi_diffusionpseudotimerobustly_2016; @ji_tscanpseudotimereconstruction_2016; @welch_slicerinferringbranched_2016; @matsumoto_scoupprobabilisticmodel_2016; @duverle_celltreebioconductorpackage_2016; @cannoodt_scorpiusimprovestrajectory_2016; @lonnberg_singlecellrnaseqcomputational_2017; @campbell_probabilisticmodelingbifurcations_2017; @wolf_graphabstractionreconciles_2017], but a comprehensive benchmarking evaluation of TI methods, which compares a large set of methods on a large set of datasets, is still lacking. This is problematic, as new users to the field are confronted with an overwhelming choice of TI methods, without a clear idea of which method would optimally solve their problem. Moreover, the strengths and weaknesses of existing methods need to be assessed, so that new developments in the field can focus on improving the current state-of-the-art.
```{r}
add_fig(
fig_path = raw_file("overview_evaluation_v6.svg", ""),
ref_id = "ti_evaluation_overview",
caption_main = "Overview of several key aspects of the evaluation.",
caption_text = glue::glue("a) A schematic overview of our evaluation pipeline. b) In order to make the trajectories comparable to each other, a common trajectory model was used to represent reference trajectories from the real and synthetic datasets, as well as any predictions of TI methods. c) Trajectories are automatically classified into one of seven trajectory types, with increasing complexity. d) We defined four metrics, each assessing the quality of a different aspect of the trajectory. The Hamming-Ipsen-Mikhailov (HIM) score will assess the similarity between the two topologies, taking into account differences in edge lengths and degree distributions. The {label_metric('F1_branches')} assesses the similarity of the assignment of cells onto branches. The {label_metric('correlation')} quantifies the similarity in cellular positions between two trajectories, by calculating the correlation between pairwise geodesic distances. Finally, {label_metric('featureimp_wcor')} quantifies the agreement between trajectory differentially expressed features from the known trajectory and the predicted trajectory."),
width = 14,
height = 8.5
)
```
# Results
## Trajectory inference methods
<!-- Method characterisation -->
In this study, we performed a comprehensive evaluation of `r n_methods_evaluated` TI methods (`r ref("fig", "ti_evaluation_overview", "a")`). In order to make the outputs from different methods directly comparable to each other, we developed a common probabilistic model for representing trajectories from all possible sources (`r ref("fig", "ti_evaluation_overview", "b")`). In this model, the overall topology is represented by a network of "milestones", and the cells are placed within the space formed by each set of connected milestones. Although almost every method returned a unique set of outputs, we were able to classify these outputs into seven distinct groups (`r ref("sfig", "overview_wrapping")`), and wrote a common output converter for each of these groups (`r ref("fig", "results_summary", "a")`). When strictly required, we also provided prior information to the method. These priors range from weak priors, which are relatively easy to acquire, such as a start cell, to strong priors, such as a known grouping of cells, which are much harder to know a priori and which can potentially introduce a large bias into the analysis (`r ref("fig", "results_summary", "a")`).
<!-- Topology characterisation -->
The largest difference between TI methods is whether a method fixes the topology, and if it does not, what kind of topologies it can detect. We defined seven possible types of topologies, ranging from very basic topologies (linear, cyclical and bifurcating) to the more complex ones (connected and disconnected graphs). Most methods either focus on inferring linear trajectories, or limit the search to trees or less complex topologies, with only a select few attempting to infer cyclic or disconnected topologies (`r ref("fig", "results_summary", "a")`).
<!-- Overall evaluation results -->
We evaluated each method on four core aspects: (i) accuracy of a prediction, given a gold or silver standard, on `r n_datasets_real` real and `r n_datasets_synthetic` synthetic datasets, (ii) scalability with respect to the number of cells and features (e.g. genes), (iii) stability of the predictions after subsampling the datasets, and (iv) the usability of the tool in terms of software, documentation and the manuscript. Overall, we found a large diversity across the four evaluation criteria, with only a few methods, such as PAGA, Slingshot and SCORPIUS, performing well across the board (`r ref("fig", "results_summary", "b")`). We will discuss each evaluation criterion in more detail (`r ref("fig", "results_detailed")` and `r ref("sfig", "results")`), after which we conclude with guidelines for method users and future perspectives for method developers.
```{r}
add_fig(
fig_path = result_file("results_summary.pdf", experiment_id = "08-summary"),
ref_id = "results_summary",
caption_main = glue::glue("A characterisation of the {n_methods_evaluated} methods evaluated in this study, and their overall evaluation results."),
caption_text = "(a) We characterised the methods according to the wrapper type, their required priors, whether the inferred topology is constrained by the algorithm (fixed) or a parameter (param), and the types of inferable topologies. The methods are grouped vertically based on the most complex trajectory type they can infer. (b) The overall results of the evaluation on four criteria: accuracy using a reference trajectory on real and synthetic data, scalability with increasing number of cells and features, stability across dataset subsamples, and quality of the implementation. Methods which errored on more than 50% of the datasets were removed from this figure, and are shown in {ref('sfig', 'results')} instead.",
width = 13,
height = 18
)
```
```{r}
add_fig(
fig_path = result_file("results_detailed.pdf", experiment_id = "08-summary"),
ref_id = "results_detailed",
caption_main = glue::glue("Detailed results of the four main evaluation criteria: accuracy, scalability, stability and usability."),
caption_text = glue::glue("(a) The names of the methods, ordered as in {ref('fig', 'results_summary')}. (b) Accuracy of trajectory inference methods across metrics, dataset sources and dataset trajectory types. The performance of a method is generally more stable across dataset sources, but very variable depending on the metric and trajectory type. (c) Predicted execution times for varying numbers of cells and features (\\# cells × \\# features). Predictions were made by training a regression model after running each method on bootstrapped datasets with varying numbers of cells and features. (d) Stability results by calculating the average pairwise similarity between models inferred across multiple runs of the same method. (e) Usability scores of the tool and corresponding manuscript, grouped per category. Off-the-shelf methods were directly implemented in R and thus do not have a usability score."),
width = 13,
height = 18
)
```
```{r}
add_sfig(
fig_path = result_file("results_suppfig.pdf", experiment_id = "08-summary"),
ref_id = "results",
caption_main = glue::glue("Results from the evaluation, for all methods and across all evaluation criteria."),
caption_text = glue::glue("(a) We characterised the methods according to the wrapper type, their required priors, whether the inferred topology is constrained by the algorithm (fixed) or a parameter (param), and the types of inferable topologies. The methods are grouped vertically based on the most complex trajectory type they can infer. (b) The overall results of the evaluation on four criteria: benchmarking using a reference trajectory on real and synthetic data, scalability with increasing number of cells and features, stability across dataset subsamples, and quality of the implementation. (c) Accuracy of trajectory inference methods across metrics, dataset sources and dataset trajectory types. The performance of a method is generally more stable across dataset sources, but very variable depending on the metric and trajectory type. (d) Predicted execution times and memory usage for varying numbers of cells and features (\\# cells × \\# features). Predictions were made by training a regression model after running each method on bootstrapped datasets with varying numbers of cells and features. (e) Stability results by calculating the average pairwise similarity between models inferred across multiple runs of the same method. (f) Usability scores of the tool and corresponding manuscript, grouped per category."),
width = 30,
height = 18,
integrate = FALSE
)
```
## Accuracy
<!-- Accuracy methods -->
We defined several metrics to compare a prediction to a reference trajectory (`r ref("snote", "metrics")`). Based on an analysis of their robustness and conformity to a set of rules (`r ref("snote", "metrics")`), we chose four metrics, each assessing a different aspect of a trajectory (`r ref("fig", "ti_evaluation_overview", "d")`): the topology (Hamming-Ipsen-Mikhailov, `r label_metric("him")`), the quality of the assignment of cells to branches (`r label_metric("F1_branches")`), the cell positions (`r label_metric("correlation")`) and the accuracy of the differentially expressed features along the trajectory (`r label_metric("featureimp_wcor")`). The data compendium consisted of both synthetic datasets, which offer the most exact reference trajectory, and real datasets, which provide the highest biological relevance. These real datasets come from a variety of single-cell technologies, organisms, and dynamic processes, and contain several types of trajectory topologies (`r ref("stable", "datasets")`). Real datasets were classified as "gold standard" if the reference trajectory was derived independently from the expression data itself, for example via cellular sorting or cell mixing [@tian_scrnaseqmixologybetter_2018]. All other real datasets were classified as "silver standard". For synthetic datasets we used several data simulators, including a simulator of gene regulatory networks using a thermodynamic model of gene regulation [@schaffter_genenetweaversilicobenchmark_2011]. For each simulation, we used a real dataset as a reference to match its dimensions, number of differentially expressed genes, drop-out rates and other statistical properties [@zappia_splattersimulationsinglecell_2017].
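As an illustration of the cell-position metric, the sketch below correlates pairwise geodesic distances between a reference and a predicted trajectory. It is a simplified stand-in for the dyneval implementation and assumes that the geodesic distances have already been computed as cell-by-cell matrices.

```{r eval=FALSE}
# Minimal sketch of the cell-position metric (not the exact dyneval code):
# correlate pairwise geodesic distances between a reference and a predicted
# trajectory. `dist_reference` and `dist_prediction` are assumed to be
# cell-by-cell geodesic distance matrices with the same cell ordering.
geodesic_correlation <- function(dist_reference, dist_prediction) {
  stopifnot(all(dim(dist_reference) == dim(dist_prediction)))
  pairs <- lower.tri(dist_reference)  # count every cell pair once
  cor(dist_reference[pairs], dist_prediction[pairs], method = "spearman")
}

# toy example with Euclidean distances standing in for geodesic distances
set.seed(1)
d1 <- as.matrix(dist(matrix(rnorm(100), ncol = 2)))
d2 <- as.matrix(dist(matrix(rnorm(100), ncol = 2)))
geodesic_correlation(d1, d1)  # identical trajectories: 1
geodesic_correlation(d1, d2)  # unrelated trajectories: close to 0
```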
<!-- Accuracy overview -->
We found that method performance was highly variable across datasets, indicating that there is no "one-size-fits-all" method that works well on every dataset (`r ref("sfig", "benchmark_interpretation", "a")`). Even methods which can detect most of the trajectory types, such as PAGA, RaceID/StemID and SLICER, were not the best methods across all trajectory types (`r ref("fig", "results_detailed", "b")`). The overall scores on the different dataset sources were moderately to highly correlated (0.5 - 0.9) with the scores on real datasets containing a gold standard (`r ref("sfig", "benchmark_interpretation", "b")`), confirming both the accuracy of the gold standard trajectories and the relevance of the synthetic data. On the other hand, the different metrics frequently disagreed with each other, with Monocle and PAGA Tree scoring better on the topology scores, while other methods, such as Slingshot, were better at ordering the cells and placing them into the correct branches (`r ref("fig", "results_detailed", "b")`).
<!-- Topologies-->
The performance of a method was strongly dependent on the type of trajectory present in the data (`r ref("fig", "results_detailed", "b")`). Slingshot typically performed better on datasets containing simpler topologies, while PAGA, pCreode and RaceID/StemID had higher scores on datasets with trees or more complex trajectories (`r ref("sfig", "benchmark_interpretation", "c")`). This was reflected in the types of topologies detected by every method, as those predicted by Slingshot tended to contain fewer branches, while those detected by PAGA, pCreode and Monocle DDRTree gravitated towards more complex topologies (`r ref("sfig", "benchmark_interpretation", "d")`). This analysis therefore indicates that detecting the right topology is still a difficult task for most of these methods, because methods tend to be either too optimistic or too pessimistic regarding the complexity of the topology in the data.
<!-- Complementarity between methods -->
The high variability between datasets, together with the diversity in detected topologies between methods, could indicate some complementarity between the different methods. To test this, we calculated the likelihood of obtaining a top model when using only a subset of all methods. A top model in this case was defined as a model with an overall score of at least 95% of that of the best model. On all datasets, using a single method resulted in a top model about 27% of the time. This increased to up to 74% with the addition of six other methods (`r ref("fig", "complementarity", "a")`). This resulted in a relatively diverse set of methods, containing strictly linear or cyclic methods as well as methods with a broad trajectory type range such as PAGA. We found similar indications of complementarity between the top methods on data containing only linear, bifurcating or multifurcating trajectories (`r ref("fig", "complementarity", "b")`), although in these cases fewer methods were necessary to obtain at least one top model for a given dataset. Altogether, this shows that there is considerable complementarity between the different methods, and that users should try out a diverse set of methods on their data, especially when the topology is unclear a priori. Moreover, it also opens up possibilities for new ensemble methods which utilize this complementarity.
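The sketch below illustrates this complementarity analysis under the same definition of a top model. The `scores` object is a hypothetical datasets × methods matrix of overall scores, and the greedy selection is a simplification of the procedure used to build the method combinations shown in the figure.

```{r eval=FALSE}
# Sketch of the complementarity analysis. `scores` is assumed to be a
# datasets x methods matrix of overall scores (not shipped with this chunk).
top_model_likelihood <- function(scores, method_subset) {
  # fraction of datasets on which at least one selected method reaches
  # >= 95% of the best overall score for that dataset
  best_per_dataset <- apply(scores, 1, max)
  best_of_subset <- apply(scores[, method_subset, drop = FALSE], 1, max)
  mean(best_of_subset >= 0.95 * best_per_dataset)
}

greedy_complementary_set <- function(scores, n_methods = 7) {
  selected <- character(0)
  for (i in seq_len(n_methods)) {
    candidates <- setdiff(colnames(scores), selected)
    gains <- vapply(
      candidates,
      function(method) top_model_likelihood(scores, c(selected, method)),
      numeric(1)
    )
    selected <- c(selected, names(which.max(gains)))
  }
  selected
}
```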
```{r echo=FALSE, results="asis"}
add_fig(
fig_path = result_file("complementarity.pdf", experiment_id = "10-benchmark_interpretation"),
ref_id = "complementarity",
caption_main = "Complementarity between different trajectory inference methods",
caption_text = "(a) We assessed the likelihood for different combinations of methods to lead to a ‘top model’ (defined as a model with overall score at least 95\\% of the best model) when applied to all datasets. (b) The likelihood for different combinations of methods to lead to a ‘top model’ was assessed separately on different trajectory types. For this figure, we did not include any methods requiring strong prior information (cell grouping or time course).",
width = 13,
height = 6
)
```
```{r echo=FALSE, results="asis"}
add_sfig(
fig_path = result_file("benchmark_interpretation.pdf", experiment_id = "10-benchmark_interpretation"),
ref_id = "benchmark_interpretation",
caption_main = "Accuracy of trajectory inference methods.",
caption_text = "(a) Overall score for all methods and datasets, colored by the source of the datasets. (c) Similarity between the overall scores of all dataset sources, compared to real datasets with a gold standard. (b) Bias in the overall score towards trajectory types. (d) Distributions of the difference in size between predicted and reference topologies. A positive difference means that the topology predicted by the method is more complex than the one in the reference.",
width = 12,
height = 15
)
```
## Scalability
While early TI methods were developed at a time when profiling more than a thousand cells was exceptional, methods now have to cope with hundreds of thousands of cells, and perhaps soon with more than ten million [@svensson_exponentialscalingsinglecell_2018]. Moreover, the recent application of TI methods on multi-omics single-cell data also showcases the increasing demands on the number of features [@cao_jointprofilingchromatin_2018]. To assess the scalability, we ran each method on up- and downscaled versions of five distinct real datasets. We modelled the running time and memory usage using a Shape Constrained Additive Model [@pya_shapeconstrainedadditive_2015] (`r ref("sfig", "scaling", "a")`). As a control, we compared the predicted time (and memory) with the actual time (resp. memory) on all benchmarking datasets, and found that these were highly correlated overall (> 0.9, `r ref("sfig", "predcors")`), and moderately to highly correlated (0.5 - 0.9) for almost every method, depending on the extent to which the method executed successfully during the scalability experiments (`r ref("fig", "results_detailed", "c")` and `r ref("sfig", "results", "a")`).
We found that the scalability of most methods was overall very poor, with most graph and tree methods not finishing within an hour on a dataset with ten thousand cells and ten thousand features (`r ref("fig", "results_detailed", "c")`), around the size of a typical droplet-based single-cell dataset [@svensson_exponentialscalingsinglecell_2018]. Running times increased further with an increasing number of cells, with only a handful of graph/tree methods completing within a day on a million cells (PAGA, PAGA Tree, Monocle DDRTree, Stemnet and GrandPrix). Some methods, such as Monocle DDRTree and GrandPrix, also suffered from unsatisfactory running times when given a high number of features.
Methods with a low running time typically had two defining aspects: they had a linear time complexity with respect to the features and/or cells, and adding new cells or features led to a relatively low increase in time (`r ref("sfig", "scaling", "b")`). We found that more than half of all methods had a quadratic or superquadratic complexity with respect to the number of cells, which would make it difficult to apply any of these methods in a reasonable time frame on datasets with more than a thousand cells (`r ref("sfig", "scaling", "b")`).
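The sketch below shows, in simplified form, how a time-complexity class can be estimated from such timings. The actual analysis fits shape-constrained additive models; here an ordinary log-log linear fit is used instead, and both the `timings` data frame and the classification thresholds are illustrative.

```{r eval=FALSE}
# Simplified sketch of the complexity classification. The actual analysis uses
# shape-constrained additive models; here a plain log-log linear fit estimates
# the scaling exponent of running time with the number of cells.
# `timings` is assumed to have columns `n_cells` and `time_s`.
classify_time_complexity <- function(timings) {
  fit <- lm(log(time_s) ~ log(n_cells), data = timings)
  slope <- unname(coef(fit)[2])
  # illustrative thresholds on the estimated exponent
  cut(
    slope,
    breaks = c(-Inf, 0.9, 1.5, 2.5, Inf),
    labels = c("sublinear", "linear", "quadratic", "superquadratic")
  )
}

# toy example: timings that grow roughly quadratically with the number of cells
set.seed(1)
timings <- data.frame(n_cells = c(1e2, 1e3, 1e4, 1e5))
timings$time_s <- 1e-6 * timings$n_cells^2 * runif(4, 0.8, 1.2)
classify_time_complexity(timings)  # "quadratic"
```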
We also assessed the memory requirements of each method (`r ref("sfig", "results", "c")`). Most methods had reasonable memory requirements for modern workstations or computer clusters (≤12GB), with PAGA and STEMNET in particular having a low memory usage with both a high number of cells and a high number of features. Notably, the memory requirements were very high for several methods on datasets with a high number of cells (RaceID/StemID, pCreode and MATCHER) or features (Monocle DDRTree, SLICE and MFA).
Altogether, the scalability analysis indicated that the dimensions of the data are an important factor in the choice of method and that method developers should pay more attention to maintaining reasonable running times and memory usage.
```{r echo=FALSE, results="asis"}
add_sfig(
fig_path = result_file("scaling.pdf", experiment_id = "05-scaling"),
ref_id = "scaling",
caption_main = "Scalability of trajectory inference methods.",
caption_text = "(a) Three examples of average observed running times across five datasets (left) and the predicted running time (right). (b) Overview of the scalability results of all methods, ordered by their average predicted running time from (a). We predicted execution times and memory usage for each method with increasing number of features or cells, and used these values to classify each method into sublinear, linear, quadratic and superquadratic based on the shape of the curve.",
width = 12,
height = 15
)
```
```{r}
add_sfig(
fig_path = result_file("predcors.pdf", experiment_id = "10-benchmark_interpretation"),
ref_id = "predcors",
caption_main = glue::glue("Agreement between actual values and predictions for execution times and memory usage."),
caption_text = glue::glue(""),
width = 10,
height = 10
)
```
## Stability
It is not only important that a method is able to infer an accurate model in a reasonable time frame, but also that it produces a similar model when given very similar input data. In order to test the stability of each method, we executed each method on 10 different subsamples of the datasets (95% of the cells, 95% of the features), and calculated the average similarity between each pair of models using the same scores used to assess the accuracy of a trajectory (`r ref("fig", "results_detailed", "d")`).
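A minimal sketch of this stability procedure is given below; `run_method()` and `compare_trajectories()` are placeholders for a wrapped TI method and for the similarity scores described in the accuracy section.

```{r eval=FALSE}
# Sketch of the stability procedure: run a method on several 95%/95%
# subsamples of a dataset and average the pairwise similarity between the
# resulting models. `run_method()` and `compare_trajectories()` are
# placeholders for a wrapped TI method and the accuracy metrics.
stability_score <- function(expression, run_method, compare_trajectories,
                            n_subsamples = 10, fraction = 0.95) {
  models <- lapply(seq_len(n_subsamples), function(i) {
    cells <- sample(rownames(expression), round(fraction * nrow(expression)))
    features <- sample(colnames(expression), round(fraction * ncol(expression)))
    run_method(expression[cells, features])
  })
  pairs <- combn(n_subsamples, 2)
  similarities <- apply(pairs, 2, function(pair) {
    compare_trajectories(models[[pair[1]]], models[[pair[2]]])
  })
  mean(similarities)
}
```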
Given that the trajectories of methods which fix the topology either algorithmically or through a parameter are already very constrained, it is to be expected that such methods tend to generate very stable results. Nonetheless, some fixed-topology methods still produced slightly more stable results than others, such as SCORPIUS and MATCHER for linear methods, and MFA for multifurcating methods. Stability was much more diverse among methods with a free topology. Slingshot produced more stable models than PAGA (Tree), which in turn produced more stable results than pCreode and Monocle DDRTree.
## Usability
While not directly related to the accuracy of the inferred trajectory, it is also important to assess the quality of the implementation and how easy the tool is for a biological user to use [@taschuk_ten_2017]. We scored each method using a transparent checklist of important scientific and software development practices, including software packaging, documentation, automated code testing, and publication in a peer-reviewed journal (`r ref("stable", "qc_scoresheet")`). Note that there is a selection bias in the tools included in this analysis, as we did not include a substantial set of tools due to issues with installation, code availability and executability on a freely available platform (which excludes MATLAB). The reasons for not including certain tools are all discussed on our repository (`r print_url('https://github.com/dynverse/dynmethods/issues?q=label:\"won\'t+wrap\"')`). Installation issues seem to be quite common in bioinformatics [@mangul_comprehensiveanalysisusability_2018], and the trajectory inference field is no exception.
We found that most methods fulfilled the basic criteria, such as the availability of a tutorial and elementary code quality criteria (`r ref("fig", "results_detailed", "e")` and `r ref("sfig", "qc_overview")`). While recent methods had a slightly better quality score than older methods, several quality aspects were consistently lacking for the majority of the methods (`r ref("sfig", "qc_overview", " right")`) and we believe that these should receive extra attention from developers. Although these outstanding issues covered all five categories, code assurance and documentation in particular were problematic areas, notwithstanding several studies pinpointing these as good practices [@wilson_best_2014; @artaza_top10metrics_2016]. Only two methods had a nearly perfect usability score (Slingshot and Celltrails), and these could serve as an inspiration for future methods. We observed no clear relation between method quality and method performance (`r ref("fig", "results_summary", "b")`).
```{r}
add_sfig(
fig_path = result_file("qc_overview.rds", experiment_id = "03-methods/02-tool_qc"),
ref_id = "qc_overview",
caption_main = "Usability of trajectory inference methods.",
caption_text = glue::glue("Shown is the score given for each method on every item from our usability score sheet ({ref('stable', 'qc_scoresheet')}). Each aspect of the quality control was part of a category, and each category was weighted so that it contributed equally to the final quality score. Within each category, each aspect also received a weight depending on how often it was mentioned in a set of papers discussing good practices in tool development and evaluation. This is represented in the plot as the height on the y-axis. Top: Average usability score for each method. Right: The average score of each quality control item. Shown into more detail are those items which had an average score lower than 0.5."),
width = 15,
height = 18
)
```
```{r}
add_stable(
result_file("qc_scoresheet.xlsx", "03-methods/02-tool_qc"),
"qc_scoresheet",
paste0("Scoring sheet for assessing usability of trajectory inference methods. Each quality aspect was given a weight based on how many times it was mentioned in a set of articles discussing best practices for tool development."),
""
)
```
# Discussion
<!-- The context -->
In this study, we presented a large-scale evaluation of the performance of `r n_methods_evaluated` TI methods. By using a common trajectory representation and four metrics to compare the methods’ outputs, we were able to assess the accuracy of the methods on more than two hundred datasets. We also assessed several other important quality measures, such as the quality of the method's implementation, the scalability to hundreds of thousands of cells, and the stability of the output under small variations of the datasets.
<!-- FOR METHOD USERS -->
<!-- Practical guidelines -->
Based on the results of our benchmark, we propose a set of practical guidelines for method users (`r ref("fig", "user_guidelines")` and `r print_url("http://guidelines.dynverse.org")`). We postulate that, as a method's performance is heavily dependent on the trajectory type being studied, the choice of method should currently be primarily driven by the anticipated trajectory topology in the data. For the majority of use cases, the user will know very little about the expected trajectory, except perhaps whether the data is expected to contain multiple disconnected trajectories, cycles or a complex tree structure. In each of these use cases, our evaluation suggests a different set of optimal methods, as shown in `r ref("fig", "user_guidelines")`. Several other factors will also impact the choice of methods, such as the dimensions of the dataset, and the prior information which is available. These factors and several others can all be dynamically explored in our interactive app (`r print_url("https://guidelines.dynverse.org")`). This app can also be used to query the results of this evaluation, such as filtering the datasets, or changing the importance of the evaluation metrics for the final ranking.
```{r}
add_fig(
fig_path = result_file("guidelines.pdf", experiment_id = "09-guidelines"),
ref_id = "user_guidelines",
caption_main = "Practical guidelines for method users.",
caption_text = "As the performance of a method mostly depends on the topology of the trajectory, the choice of TI method will be primarily influenced by the user's existing knowledge about the expected topology in the data. We therefore devised a set of practical guidelines, which combines the method's performance, user friendliness and the number of assumptions a user is willing to make about the topology of the trajectory. Methods to the right are ranked according to their performance on a particular (set of) trajectory type. Further to the right are shown the accuracy (+: scaled performance ≥ 0.9, ±: > 0.6), usability scores (+: ≥ 0.9, ± ≥ 0.6), estimated running times and required prior information.",
width = 13,
height = 10
)
```
<!-- Further considerations -->
When inferring a trajectory on a dataset of interest, it is important to take two further points into account. First, it is critical that a trajectory, and the downstream results and/or hypotheses originating from it, are confirmed by multiple TI methods. This is to make sure the prediction is not biased due to the given parameter setting or the particular algorithms underlying a TI method. The value of using different methods is further supported by our analysis indicating substantial complementarity between the different methods. Second, even if the expected topology is known, it can be beneficial to also try out methods which make fewer assumptions about the trajectory topology. When the expected topology is confirmed using such a method, it provides additional evidence to the user. When a more complex topology is produced, this could indicate that the underlying biology is much more complex than anticipated by the user.
<!-- A common trajectory analysis framework -->
Critical to the broad applicability of TI methods is the standardisation of their input and output interfaces, so that users can effortlessly execute TI methods on their dataset of interest, compare different predicted trajectories, and apply downstream analyses such as finding genes important for the trajectory, network inference [@aibar_scenicsinglecellregulatory_2017] or finding modules of genes [@saelens_comprehensiveevaluationmodule_2018]. Our framework is an initial attempt at tackling this problem, and we illustrate its usefulness here by comparing the predicted trajectories of several top-performing methods on datasets containing a linear, tree, cyclic and disconnected graph topology (`r ref("fig", "example_predictions")`). Using our framework, this figure can be recreated using only a couple of lines of R code (`r print_url("https://methods.dynverse.org")`). In the future, this framework could be extended to allow additional input data, such as spatial and RNA velocity information [@manno_rnavelocitysingle_2018], and easier downstream analyses. In addition, further discussion within the field is required in order to arrive at a consensus concerning a common interface for trajectory models, which can include additional features such as uncertainty and gene importance.
<!-- FOR METHOD DEVELOPERS -->
<!-- New tools: challenges and possibilities -->
Our study indicates that the field of trajectory inference is maturing, primarily for linear and bifurcating trajectories (as we illustrate with two datasets in `r ref("fig", "example_predictions", "a and b")`). However, we also highlight several ongoing challenges, which should be addressed before TI can be a reliable tool for analysing single-cell -omics datasets with complex trajectories. Foremost, new methods should focus on improving the unbiased inference of tree, cyclic graph and disconnected topologies, as we found that methods repeatedly overestimate or underestimate the complexity of the underlying topology, even if the trajectory could easily be identified using a dimensionality reduction method (illustrated with some synthetic datasets in `r ref("fig", "example_predictions", "c and d")`). Furthermore, higher standards for code assurance and documentation could help the adoption of these tools across the single-cell -omics field. Finally, new tools should be designed to scale well with the increasing number of cells and features. We found that only a handful of current methods are able to handle datasets with more than 10,000 cells within a reasonable time frame, which may soon render the others obsolete given the rapid pace of technological development. To support the development of these new tools, we provide a series of vignettes on how to wrap and evaluate a method and on the different measures proposed in this study at `r print_url("https://benchmark.dynverse.org")`.
<!-- The robustness of the evaluation & development of new methods -->
We found that the performance of a method can be very variable between datasets, and therefore included a large set of both real and synthetic data within our evaluation, leading to a robust overall ranking of the different methods. Nonetheless, we want to avoid a situation in which a race to the top of the ranking stifles creativity, given the high risk that a new methodology will not improve upon the state of the art. Indeed, we firmly believe that "good-yet-not-the-best" methods [@norel_selfassessmenttrap_2011] can still provide a very valuable contribution to the field, especially if they make use of novel algorithms, return a more scalable solution, or provide unique insights in specific use cases. This is also supported by our analysis of method complementarity. Some examples of the latter include PhenoPath, which can include additional covariates in its model, Ouija, which returns a measure of uncertainty of each cell's position within the trajectory, and StemID, which can infer the directionality of edges within the trajectory. We feel that such methods should also be valued, and encourage their use provided that they are contained in a user-friendly tool.
```{r}
add_fig(
fig_path = result_file("example_predictions.pdf", experiment_id = "11-example_predictions"),
ref_id = "example_predictions",
caption_main = "Demonstration of how a common framework for TI methods facilitates broad applicability using some example datasets.",
caption_text = glue::glue("Trajectories inferred by each method were projected to a common dimensionality reduction using multi-dimensional scaling. For each dataset, we also calculated a \"consensus\" prediction, by calculating the {label_metric('correlation')} between each pair of models, and picking the model with the highest score on average. (a) The top methods applied on a dataset containing a linear trajectory of differentiation dendritic cells, going from MDP, CDP to PreDC. (b) The top methods applied on a dataset containing a bifurcating trajectory of reprogrammed fibroblasts. (c) A synthetic dataset generated by dyntoy, containing four disconnected trajectories. (d) A synthetic dataset generated by dyngen, containing a cyclic trajectory."),
width = 14,
height = 8
)
```
________________
# Online Methods
## Data and code availability
The processed real and synthetic datasets used in this study are deposited on Zenodo ([doi.org/10.5281/zenodo.1443566](https://doi.org/10.5281/zenodo.1443566)) [@cannoodt_goldstandardsinglecell_].
The main analysis repository is available at `r print_url("https://benchmark.dynverse.org")` and is divided into several experiments. Each experiment has its own set of scripts and results, each accompanied by an illustrated readme which can be easily browsed and explored on the GitHub website.
The analysis scripts call several other R packages, of which an overview is available at `r print_url("http://dynverse.org")`. These packages include **dynwrap**, used to wrap the output of methods into the common trajectory model, **dyneval**, which contains the evaluation metrics, **dynguidelines**, the guidelines app, and **dynplot** for plotting trajectories.
## Trajectory inference methods
<!-- Overview -->
We gathered a list of `r n_tools` trajectory inference tools (`r ref("stable", "tools")`) by searching the literature for "trajectory inference" and "pseudotemporal ordering", and based on two existing lists found online: `r print_url("https://github.com/seandavi/awesome-single-cell")` [@davis_awesomesinglecelllistsoftware_2018] and `r print_url("https://github.com/agitter/single-cell-pseudotime")` [@gitter_singlecellpseudotimeoverviewalgorithms_2018].
We welcome any contributions, which can be made by creating an issue at `r print_url("https://methods.dynverse.org")`.
<!-- Inclusion criterion -->
Methods were excluded from the evaluation based on several criteria: `r glue::glue_collapse(glue::glue("({R.utils::decapitalize(non_inclusion_reasons$footnote)}) {non_inclusion_reasons$long}"), ", ")`. The discussions on why these methods were excluded can be found at `r print_url('https://github.com/dynverse/dynmethods/issues?q=label:"won\'t wrap"')`. In the end, we included `r n_methods_evaluated` methods in the evaluation.
### Method wrappers
To make it easy to run each method in a reproducible manner, each method was wrapped within Docker and Singularity containers (available at `r print_url("https://methods.dynverse.org")`). These containers are automatically built on both Singularity Hub (`r print_url("https://singularity-hub.org/")`) and Docker Hub (`r print_url("https://hub.docker.com/u/dynverse")`).
For each method, we wrote a wrapper script based on example scripts or tutorials provided by the authors (as mentioned in the respective wrapper scripts). This script reads in the input data, runs the method, and outputs the files required to construct a trajectory. We also created a script to generate an example dataset, which is used to automatically test every method with continuous integration on travis-ci.org (`r print_url("https://travis-ci.org/dynverse")`).
We used the GitHub issues system to contact the authors of each method, and asked for feedback on the wrappers, the metadata, and the quality control. About one third of the authors responded, and we improved the wrappers based on their feedback. These discussions can be viewed on GitHub: `r print_url('https://github.com/dynverse/dynmethods/issues?q=label:"method+discussion"')`.
### Method input
As input, we provided each method with either the raw count data (after cell and gene filtering) or normalised expression values, based on the description in the method documentation or from the study describing the method.
<!-- Prior information -->
A large portion of the methods requires some form of prior information (e.g. a start cell) in order to be executable. Other methods can optionally exploit certain prior information. Prior information can be supplied as a starting cell from which the trajectory will originate, a set of important marker genes, or even a grouping of cells into cell states. Providing prior information to a TI method can be both a blessing and a curse. On the one hand, prior information can help the method find the correct trajectory among many equally likely alternatives. On the other hand, incorrect or noisy prior information can bias the trajectory towards current knowledge. Moreover, prior information is not always easily available, and its subjectivity can therefore lead to multiple equally plausible solutions, restricting the applicability of such TI methods to well-studied systems.
The prior information was extracted from the reference trajectory as follows (a sketch of how the start cells, end cells and number of end states follow from the milestone network is shown after this list):
* **Start cells** The identity of one or more start cells. For both real and synthetic data, a cell was chosen which was the closest (in geodesic distance) to each milestone with only outgoing edges. For ties, one random cell was chosen. For cyclic datasets, a random cell was chosen.
* **End cells** The identity of one or more end cells. Similar as the start cells, but now for every state with only incoming edges.
* **# end states** Number of terminal states. Number of milestones with only incoming edges.
* **Grouping** For each cell a label to which state/cluster/branch it belongs. For real data, the states from the gold/silver standard. For synthetic data, each milestone was seen as one group, and cells were assigned to their closest milestone.
* **# branches** Number of branches/intermediate states. For real data, the number of states in the gold/silver standard. For synthetic data, the number of milestones.
* **Discrete time course** For each cell a time point from which it was sampled. If available, directly extracted from the reference trajectory, otherwise the geodesic distance from the root milestone was used. For synthetic data, four discrete timepoints were chosen, at which the cells were “sampled” to provide time course information reflecting that obtained in real experiments.
* **Continuous time course** For each cell a time point from which it was sampled. For real data this was equal to the discrete time course, for synthetic data we used the internal simulation time of each simulator.
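The sketch below illustrates how the start-cell, end-cell and number-of-end-states priors follow from the reference milestone network. The network shown is a hypothetical example; the actual start and end cells are then chosen as the cells closest (in geodesic distance) to these milestones.

```{r eval=FALSE}
# Sketch of deriving start/end milestones from a reference milestone network
# (a hypothetical example with columns `from`, `to` and `length`).
milestone_network <- data.frame(
  from   = c("A", "B", "B"),
  to     = c("B", "C", "D"),
  length = c(1, 1, 1)
)
milestones <- union(milestone_network$from, milestone_network$to)

# milestones with only outgoing edges -> starts; only incoming edges -> ends
start_milestones <- setdiff(milestones, milestone_network$to)    # "A"
end_milestones   <- setdiff(milestones, milestone_network$from)  # "C", "D"
n_end_states     <- length(end_milestones)  # the "# end states" prior
```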
### Common trajectory model
Due to the absence of a common format for trajectory models, most methods return a unique set of output formats with few overlaps. We therefore post-processed the output of each method into a common probabilistic trajectory model (`r ref("sfig", "overview_wrapping", "a")`). This model consisted of three parts: (i) The milestone network represents the overall network topology, and contains edges between different milestones and the length of the edge between them. (ii) The milestone percentages contain, for each cell, its position between milestones, and sum to one for each cell. (iii) The regions of delayed commitment define connections between three or more milestones. These must be explicitly defined in the trajectory model, and per region one milestone must be directly connected to all other milestones of the region.
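A minimal sketch of this model, written with plain tibbles, is shown below. It mirrors the representation used by dynwrap but does not rely on the exact package interface; the milestone identifiers and values are illustrative.

```{r eval=FALSE}
# Minimal sketch of the common trajectory model as plain data frames.
library(tibble)

# (i) milestone network: overall topology with edge lengths
milestone_network <- tibble(
  from   = c("M1", "M2", "M2"),
  to     = c("M2", "M3", "M4"),
  length = c(1, 2, 1.5)
)

# (ii) milestone percentages: each cell's position between milestones;
# the percentages of each cell must sum to one
milestone_percentages <- tibble(
  cell_id      = c("cell_a", "cell_a", "cell_b"),
  milestone_id = c("M1", "M2", "M3"),
  percentage   = c(0.25, 0.75, 1)
)
stopifnot(all(tapply(
  milestone_percentages$percentage,
  milestone_percentages$cell_id,
  sum
) == 1))

# (iii) regions of delayed commitment: one milestone (here M2) directly
# connected to all other milestones of the region (M3 and M4)
divergence_regions <- tibble(
  divergence_id = "D1",
  milestone_id  = c("M2", "M3", "M4"),
  is_start      = c(TRUE, FALSE, FALSE)
)
```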
```{r}
add_sfig(
raw_file("overview_wrapping_v3.svg", "03-methods"),
"overview_wrapping",
"A common interface for TI methods.",
"a) The input and output of each TI method is standardised. As input, each TI method receives either raw or normalised counts, several parameters, and a selection of prior information. After its execution, a method uses one of the seven wrapper functions to transform its output to the common trajectory model. This common model then allows to perform common analysis functions on trajectory models produced by any TI method. b) Illustrations of the specific transformations performed by each of the wrapper functions.",
13,
7.5
)
```
Depending on the output of a method, we used different strategies to convert the output to our model (`r ref("sfig", "overview_wrapping", "b")`). Special conversions are denoted by an \*, and will be explained in more detail below.
```{r}
methods_evaluated <- load_methods()
special_methods <- c("mfa", "dpt", "slice", "sincell", "calista", "urd")
# create a function to list the methods within a particular conversion category
list_wrapper_type_methods <- function(wrapper_types_oi) {
  methods_evaluated %>%
    filter(wrapper_type %in% wrapper_types_oi) %>%
    mutate(name = paste0(method_name, ifelse(method_id %in% special_methods, "\\*", ""))) %>%
    pull(name) %>%
    sort() %>%
    label_vector()
}
```
* **Type 1, direct:** `r list_wrapper_type_methods(c("direct", "branch_trajectory"))`. The wrapped method directly returned a network of milestones, the regions of delayed commitment, and, for each cell, the extent to which it belongs to each milestone. In some cases, additional transformations were required for the method that are not covered by any of the following output formats. Some methods returned a branch network instead of a milestone network, and this network was converted by calculating the line graph of the branch network.
* **Type 2, linear pseudotime:** `r list_wrapper_type_methods("linear")`. The method returned a pseudotime, which is translated into a linear trajectory where the milestone network contains two milestones and cells are positioned between these two milestones.
* **Type 3, cyclical pseudotime:** `r list_wrapper_type_methods("cyclic")`. The method returned a pseudotime, which is translated into a cyclical trajectory where the milestone network contains three milestones and cells are positioned between these three milestones. These milestones were positioned at pseudotime 0, ⅓ and ⅔ .
* **Type 4, end state probability:** `r list_wrapper_type_methods("end_state_prob")`. The method returned a pseudotime and, for each cell and end state, a probability ($\textrm{Pr}$) that the cell will end up in that end state. This was translated into a star-shaped milestone network, with one starting milestone ($M_0$) and several outer milestones ($M_i$), with regions of delayed commitment between all milestones. The milestone percentage of a cell to one of the outer milestones was equal to $\textrm{pseudotime} \times \textrm{Pr}_{M_i}$. The milestone percentage to the starting milestone was equal to $1-\textrm{pseudotime}$ (a sketch of this conversion is shown after this list).
* **Type 5, cluster assignment:** `r list_wrapper_type_methods("cluster_assignment")`. The method returned a milestone network, and an assignment of each cell to a specific milestone. Cells were positioned onto the milestones they are assigned to, with milestone percentage equal to 1.
* **Type 6, orthogonal projection:** `r list_wrapper_type_methods("orth_proj")`. The method returned a milestone network, and a dimensionality reduction of the cells and milestones. The cells were projected onto the nearest segment, thus determining the cells' position along the milestone network. If a method also returned a cluster assignment (type 5), we limited the projection of each cell to the closest edge connecting to the milestone of that cell. For these methods, we usually wrote two wrappers, one which included the projection, and one without.
* **Type 7, cell graph:** `r list_wrapper_type_methods("cell_graph")`. The method returned a network of cells and an indication of which cell-cell transitions are part of the 'backbone' structure. Backbone cells with degree $\neq$ 2 were regarded as milestones, and all other cells were placed on transitions between the milestones. If a method did not return a distance between pairs of cells, the cells were uniformly positioned between the two milestones. Otherwise, we first calculated the distance between two milestones as the sum of the distances between the cells, and then divided the distance of each pair of cells by the total distance to get the milestone percentages.
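The sketch below illustrates two of these conversions, type 2 (linear pseudotime) and type 4 (end state probability), on hypothetical inputs; the milestone names and values are made up for the example.

```{r eval=FALSE}
# Sketch of two conversions on hypothetical inputs.
library(tibble)

# Type 2, linear pseudotime: two milestones, cells positioned between them.
pseudotime <- c(cell_a = 0.1, cell_b = 0.6, cell_c = 0.9)  # rescaled to [0, 1]
linear_percentages <- tibble(
  cell_id      = rep(names(pseudotime), each = 2),
  milestone_id = rep(c("begin", "end"), times = length(pseudotime)),
  percentage   = as.vector(rbind(1 - pseudotime, pseudotime))
)

# Type 4, end state probability: star-shaped topology with start milestone M0
# and outer milestones M1..Mk; the percentage for outer milestone Mi is
# pseudotime * Pr(Mi), and the percentage for M0 is 1 - pseudotime.
end_state_prob <- cbind(cell_a = c(M1 = 0.8, M2 = 0.2),
                        cell_b = c(M1 = 0.3, M2 = 0.7))
pseudotime4 <- c(cell_a = 0.4, cell_b = 0.9)

end_state_percentages <- do.call(rbind, lapply(names(pseudotime4), function(cell) {
  tibble(
    cell_id      = cell,
    milestone_id = c("M0", rownames(end_state_prob)),
    percentage   = c(1 - pseudotime4[[cell]],
                     pseudotime4[[cell]] * end_state_prob[, cell])
  )
}))
```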
Special conversions were necessary for certain methods:
* **CALISTA:** We assigned the cells to the branch at which the sum of the cluster probabilities of two connected milestones was the highest. The cluster probabilities of the two selected milestones were then used as milestone percentages. This was then processed as a type 1, direct, method.
* **DPT:** We projected the cells onto the cluster network, consisting of a central milestone (this cluster contains the cells which were assigned to the "unknown" branch) and three terminal milestones, each corresponding to a tip point. This was then processed as a type 1, direct, method.
* **Sincell:** To constrain the number of milestones this method creates, we iteratively merged pairs of cell clusters until the percentage of leaf nodes was below a certain cutoff (25% by default). This was then processed as a type 7, cell graph, method.
* **SLICE:** As discussed in the vignette of SLICE, we ran principal curves one by one for every edge detected by SLICE. This was then processed as a type 1, direct, method.
* **MFA:** We used the branch assignment as state probabilities, which together with the global pseudotime were processed as a type 4, end state probabilities, method.
* **URD:** We extracted the pseudotime of a cell within each branch using the y positions in the tree layout. This was then further processed as a type 1, direct, method.
More information on how each method was wrapped can be found within the comments of each wrapper script, listed at `r print_url("https://methods.dynverse.org")`.
### Off-the-shelf methods
For baseline performance, we added several "off-the-shelf" TI methods which can be run using a few lines of code in R (a minimal sketch of the first two is shown after this list).
* **Component 1:** This method returns the first component of a PCA dimensionality reduction as a linear trajectory. This method is especially relevant as it has been used in a few studies already [@kouno_temporaldynamicstranscriptional_2013; @zeng_pseudotemporalorderingsingle_2017].
* **Angle:** Similar to the previous method, this method computes the angle with respect to the origin in a two dimensional PCA, and uses this angle as a pseudotime for generating a cyclical trajectory.
* **MST:** This method performs PCA dimensionality reduction, followed by clustering using the R mclust package, after which the clusters are connected using a minimum spanning tree. The cells are then orthogonally projected onto the nearest segment of the tree. This baseline is highly relevant as many methods follow the same methodology: dimensionality reduction, clustering, topology inference, and projection of the cells onto the topology. A minimal sketch of these baselines is given below.
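The sketch below is an illustration only (not the actual wrapper code); the random `expression` matrix and the choice of five clusters are hypothetical stand-ins.

```{r, eval=FALSE}
library(tidyverse)
library(igraph)
library(mclust)

# `expression`: a cells x genes matrix of normalised expression (random stand-in here)
set.seed(1)
expression <- matrix(rnorm(200 * 50), nrow = 200)
space <- prcomp(expression)$x[, 1:2]

# "Component 1": the first principal component as a linear pseudotime
comp1_pseudotime <- space[, 1]

# "Angle": the angle with respect to the origin in the first two components,
# used as a pseudotime on a cyclical trajectory
angle_pseudotime <- atan2(space[, 2], space[, 1]) %% (2 * pi)

# "MST": cluster the cells with mclust and connect the cluster centers with a
# minimum spanning tree; in the full baseline the cells are then orthogonally
# projected onto the nearest segment of this tree
clustering <- Mclust(space, G = 5)$classification
centers <- apply(space, 2, function(x) tapply(x, clustering, mean))
tree <- mst(graph_from_adjacency_matrix(as.matrix(dist(centers)), weighted = TRUE, mode = "undirected"))
igraph::as_data_frame(tree, what = "edges")  # the milestone network of the MST baseline
```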
## Trajectory types
We classified all possible trajectory topologies into distinct trajectory types, based on topological criteria (`r ref("fig", "ti_evaluation_overview", "c")`). These trajectory types start from the most general trajectory type, a disconnected graph, and move down (within a directed acyclic graph structure), progressively becoming simpler until the two basic types: linear and cyclical. A disconnected graph is a graph in which only one edge can exist between two nodes. A (connected) graph is a disconnected graph in which all nodes are connected. An acyclic graph is a graph containing no cycles. A tree is an acyclic graph containing no convergences (no nodes with in-degree higher than 1). A binary tree is a rooted tree with every node having maximally two outgoing edges. A convergence is a directed acyclic graph in which only one node has a degree larger than one and this same node has an out-degree of one. A multifurcation is a rooted tree in which only one node has a degree larger than one. A bifurcation is a multifurcation in which only one node has a degree equal to 3. A linear topology is a graph in which no node has a degree larger than 2. Finally, a cycle is a connected graph in which no node has a degree larger than 2 but which contains one cycle. In most cases, a method which was able to detect a complex trajectory type was also able to detect less complex trajectory types, with some exceptions shown in `r ref("fig", "results_summary", "a")`.
For simplicity, we merged the bifurcation and convergence trajectory types, and the acyclic graph and connected graph trajectory types, in the main figures of the paper.
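As an illustration of how such criteria translate into code, the following is a simplified sketch that classifies an undirected milestone network with the igraph package; it paraphrases a few of the rules above (ignoring edge directions, and therefore not distinguishing convergences from multifurcations) and is not the actual dynbenchmark implementation.

```{r, eval=FALSE}
library(igraph)

classify_topology <- function(milestone_network) {
  gr <- graph_from_data_frame(milestone_network, directed = FALSE)
  degrees <- degree(gr)
  if (!is_connected(gr)) {
    "disconnected graph"
  } else if (ecount(gr) >= vcount(gr)) {
    # a connected graph contains a cycle iff it has at least as many edges as nodes
    if (max(degrees) <= 2) "cycle" else "graph"
  } else if (max(degrees) <= 2) {
    "linear"
  } else if (sum(degrees > 2) == 1) {
    # a single branching milestone: a (bi/multi)furcation, or a convergence if directed
    if (max(degrees) == 3) "bifurcation" else "multifurcation"
  } else {
    "tree"
  }
}

classify_topology(data.frame(from = c("A", "B", "B"), to = c("B", "C", "D")))
#> [1] "bifurcation"
```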
## Real datasets
We gathered real datasets by searching for "single-cell" at the Gene Expression Omnibus and selecting those datasets in which the cells are sampled from different stages in a dynamic process (`r ref("stable", "datasets")`). The scripts to download and process these datasets are available on our repository (`r print_url("https://benchmark.dynverse.org/tree/master/scripts/01-datasets")`). Whenever possible, we preferred to start from the raw counts data. These raw counts were all normalised and filtered using a common pipeline, as discussed later. Some original datasets contained more than one trajectory, in which case we split the dataset into its separate connected trajectories, but also generated several combinations of connected trajectories to include some datasets with disconnected trajectories in the evaluation. In the end, we included `r nrow(list_datasets("real"))` datasets for this evaluation.
For each dataset, we extracted a reference trajectory, consisting of two parts: the cellular grouping (milestones) and the connections between these groups (milestone network). The cellular grouping was provided by the authors of the original study, and we classified it as a gold standard when it was created independently from the expression matrix (such as from cell sorting, the origin of the sample or the time it was sampled) or silver standard otherwise (usually by clustering the expression values). To connect these cell groups, we used the original study to determine the network which the authors validated or otherwise found to be the most likely. In the end, each group of cells was placed on a milestone, having percentage = 1 for that particular milestone. The known connections between these groups were used to construct the milestone network. If there was biological or experimental time data available, we used this as the length of the edge, otherwise we set all the lengths equal to one.
```{r}
add_stable(
result_file("real_datasets.xlsx","01-datasets/01-real"),
"datasets",
"Real datasets used in this study."
)
```
## Synthetic datasets
To generate synthetic datasets, we used four different synthetic data simulators:
* **dyngen:** Simulations of gene regulatory networks, available at `r print_url("https://github.com/dynverse/dyngen")`
* **dyntoy:** Random gradients of expression in the reduced space, available at `r print_url("https://github.com/dynverse/dyntoy")`
* **PROSSTT:** Expression is sampled from a linear model which depends on pseudotime [@papadopoulosPROSSTTProbabilisticSimulation2018]
* **Splatter:** Simulations of non-linear paths between different expression states [@zappia_splattersimulationsinglecell_2017]
For every simulator, we took great care to make the datasets as realistic as possible. To do this, we extracted several parameters from all real datasets. We calculated the number of differentially expressed features within a trajectory using a two-sided Mann–Whitney U test between every pair of cell groups. These values were corrected for multiple testing using the Benjamini-Hochberg procedure (FDR < 0.05), and we required that a gene was expressed in at least 5% of cells and had at least a fold-change of 2. We also calculated several other parameters, such as drop-out rates and library sizes, using the Splatter package [@zappia_splattersimulationsinglecell_2017]. These parameters were then given to the simulators when applicable, as described for each simulator below. Not every real dataset was selected to serve as a reference for a synthetic dataset. Instead, we chose a set of ten distinct reference real datasets by clustering all the parameters of the real datasets, and used the real datasets at the cluster centers (medoids) of a PAM clustering ($k = 10$, implemented in the R cluster package) to generate synthetic data.
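The following sketch illustrates both steps on hypothetical stand-in data: the pairwise Mann–Whitney test with Benjamini–Hochberg correction for one pair of cell groups, and the selection of reference real datasets as the medoids of a PAM clustering. The toy matrices and object names are made up; this is a sketch of the described procedure, not the actual dynbenchmark code.

```{r, eval=FALSE}
library(tidyverse)
library(cluster)

set.seed(1)

# differential expression between two cell groups (one pair of groups shown)
# `counts`: cells x genes matrix, `groups`: the milestone group of each cell (stand-ins)
counts <- matrix(rpois(100 * 20, lambda = 5), nrow = 100, dimnames = list(NULL, paste0("gene", 1:20)))
groups <- rep(c("A", "B"), each = 50)

pvals <- apply(counts, 2, function(expr) wilcox.test(expr[groups == "A"], expr[groups == "B"])$p.value)
qvals <- p.adjust(pvals, method = "BH")
pct_expressed <- colMeans(counts > 0)
fold_change <- apply(counts, 2, function(expr) mean(expr[groups == "A"]) / mean(expr[groups == "B"]))
de_genes <- names(which(qvals < 0.05 & pct_expressed >= 0.05 & pmax(fold_change, 1 / fold_change) >= 2))

# selecting ten reference real datasets: the medoids of a PAM clustering on the
# dataset parameters (one row per real dataset: number of cells, dropout rate, ...)
dataset_parameters <- matrix(rnorm(40 * 5), nrow = 40)
reference_datasets <- pam(scale(dataset_parameters), k = 10)$id.med
```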
### dyngen
```{r}
design_dyngen <- read_rds(result_file("design_dyngen.rds", "01-datasets/02-synthetic"))
```
The dyngen (`r print_url("https://github.com/dynverse/dyngen")`) workflow to generate synthetic data is based on the well-established workflow used in the evaluation of network inference methods [@schaffter_genenetweaversilicobenchmark_2011; @marbach_wisdom_2012] and consists of four main steps: network generation, simulation, gold standard extraction and simulation of the scRNA-seq experiment. At every step, we tried to mirror real regulatory networks, while keeping the model simple and easily extendable. We simulated a total of `r nrow(design_dyngen)` datasets, with `r length(unique(design_dyngen$modulenet_name))` different topologies.
#### Network generation
One of the main processes driving cellular dynamics is gene regulation, where regulatory cascades and feedback loops lead to progressive changes in expression and decision making. The exact way a cell chooses a certain path during its differentiation is still an active research field, although certain models have already emerged and been tested in vivo. One driver of bifurcation seems to be mutual antagonism, where two genes strongly repress each other [@xu_regulationbifurcatingcell_2015], forcing one of the two to become inactive [@graf_forcingcellschange_2009]. Such mutual antagonism can be modelled and simulated [@wang_quantifyingwaddingtonlandscape_2011; @ferrell_bistabilitybifurcationswaddington_2012]. Although such a two-gene model is simple and elegant, the reality is frequently more complex, with multiple genes (grouped into modules) repressing each other [@yosef_dynamicregulatorynetwork_2013].
To simulate certain trajectory topologies, we therefore designed module networks in which the cells follow a particular trajectory topology given certain parameters. Two module networks generated linear trajectories (linear and linear long), one generated a bifurcation, one generated a convergence, one generated a multifurcation (trifurcating), two generated a tree (consecutive bifurcating and binary tree), one generated an acyclic graph (bifurcating and converging), one generated a complex fork (trifurcating), one generated a rooted tree (consecutive bifurcating) and two generated simple graph structures (bifurcating loop and bifurcating cycle). The structure of these module networks is available at `r print_url("https://github.com/dynverse/dyngen/tree/master/inst/ext_data/modulenetworks")`.
From these module networks we generated gene regulatory networks in two steps: the main regulatory network was first generated, and extra target genes from real regulatory networks were added. For each dataset, we used the same number of genes as were differentially expressed in the real datasets. 5% of the genes were assigned to be part of the main regulatory network, and were randomly distributed among all modules (with at least one gene per module). We sampled edges between these individual genes (according to the module network) using a uniform distribution between 1 and the number of possible targets in each module. To add additional target genes to the network, we assigned every regulator from the network to a real regulator in a real network (from regulatory circuits [@marbach_tissuespecificregulatorycircuits_2016]), and extracted for every regulator a local network around it using personalized pagerank (with damping factor set to 0.1), as implemented in the `page_rank` function of the *igraph* package.
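A minimal sketch of this last step, using a random scale-free network as a stand-in for the regulatory circuits network; the cutoff of 20 genes is arbitrary and only for illustration.

```{r, eval=FALSE}
library(igraph)

set.seed(1)
# stand-in for a real regulatory network; vertex 1 plays the role of the regulator
real_network <- sample_pa(1000, m = 2, directed = TRUE)

# personalised PageRank with the reset probability concentrated on the regulator
# and a damping factor of 0.1
reset <- rep(0, vcount(real_network))
reset[1] <- 1
scores <- page_rank(real_network, damping = 0.1, personalized = reset)$vector

# keep the highest-scoring genes as the local network around this regulator
local_genes <- order(scores, decreasing = TRUE)[1:20]
local_network <- induced_subgraph(real_network, local_genes)
```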
#### Simulation of gene regulatory systems using thermodynamic models
To simulate the gene regulatory network, we used a system of differential equations similar to those used in evaluations of gene regulatory network inference methods [@marbach_wisdom_2012]. In this model, the changes in gene expression ($x_i$) and protein expression ($y_i$) are modeled using ordinary differential equations [@schaffter_genenetweaversilicobenchmark_2011] (ODEs):
$$
\begin{aligned}
\label{eq:mrna_ode}
\frac{dx_i}{dt} &= \underbrace{m \times f(y_1, y_2, ...)}_\text{production} - \underbrace{\lambda \times x_i}_\text{degradation}
\end{aligned}
$$
$$
\begin{aligned}
\label{eq:prot_ode}
\frac{dy_i}{dt} &= \underbrace{r \times x_i}_\text{production} - \underbrace{\Lambda \times y_i}_\text{degradation}
\end{aligned}
$$
where $m$, $\lambda$, $r$ and $\Lambda$ represent production and degradation rates, the ratio of which determines the maximal gene and protein expression. The two types of equations are coupled because the production of protein $y_i$ depends on the amount of gene expression $x_i$, which in turn depends on the amount of other proteins through the activation function $f(y_1, y_2, ...)$.
The activation function is inspired by a thermodynamic model of gene regulation, in which the promoter of a gene can be bound or unbound by a set of transcription factors, each representing a certain state of the promoter. Each state is linked with a relative activation $\alpha_j$, a number between 0 and 1 representing the activity of the promoter at this particular state. The production rate of the gene is calculated by combining the probabilities of the promoter being in each state with the relative activation:
$$
\begin{aligned}
\label{eq:input_function_probabilities}
f(y_1, y_2, ..., y_n) = \sum_{j \in \{0, 1, ..., 2^n - 1\}} \alpha_j \times P_j
\end{aligned}
$$
The probability of being in a state is based on the thermodynamics of transcription factor binding. When only one transcription factor is bound in a state:
$$
\begin{aligned}
P_j \propto \nu = \left(\frac{y}{k}\right)^{n}
\end{aligned}
$$
where the hill coefficient $n$ represents the cooperativity of binding and $k$ the transcription factor concentration at half-maximal binding. When multiple regulators are bound:
$$
\begin{aligned}
P_j \propto \nu = \rho \times \prod_j \left(\frac{y_j}{k_j}\right)^{n_j}
\end{aligned}
$$
where $\rho$ represents the cooperativity of binding between the different transcription factors.
$P_j$ is only proportional to $\nu$ because $\nu$ is normalised such that $\sum_{j} P_j = 1$.
To each differential equation, we added an additional stochastic term:
$$
\begin{aligned}
\frac{dx_i}{dt} = m \times f(y_1, y_2, ...) - \lambda \times x_i &+ \eta \times \sqrt{x_i} \times \Delta W_t \\
\frac{dy_i}{dt} = r \times x_i - \Lambda \times y_i &+ \eta \times \sqrt{y_i} \times \Delta W_t
\end{aligned}
$$
with $\Delta W_t \sim \mathcal{N}(0, h)$.
Similar to [@schaffter_genenetweaversilicobenchmark_2011], we sample the different parameters from random distributions, given in `r ref("stable", "samplers")`.
```{r}
add_stable(
read_rds(result_file("samplers.rds","01-datasets/02-synthetic")),
"samplers",
"Distributions from which each parameter in the thermodynamic model was sampled."
)
```
We converted each ODE to an SDE by adding a chemical Langevin equation, as described in [@schaffter_genenetweaversilicobenchmark_2011]. These SDEs were simulated using the Euler–Maruyama approximation, with time-step $h = 0.01$ and noise strength $\eta = 8$. The total simulation time varied between 5 for linear and bifurcating datasets, 10 for consecutive bifurcating, trifurcating and converging datasets, 15 for bifurcating converging datasets and 30 for linear long, cycle and bifurcating loop datasets. The burn-in period was 2 for each simulation. Each network was simulated 32 times.
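As an illustration of this simulation scheme, the following is a simplified Euler–Maruyama sketch for a single, constitutively activated gene (i.e. $f(\ldots) = 1$), using the $h$ and $\eta$ above; the production and degradation rates and the number of sampled steps are hypothetical, and the full dyngen model simulates the complete regulatory network instead.

```{r, eval=FALSE}
set.seed(1)
h <- 0.01; eta <- 8                          # time step and noise strength, as above
total_time <- 5; burn_in <- 2
m <- 20; lambda <- 2; r <- 10; Lambda <- 1   # hypothetical production/degradation rates

n_steps <- total_time / h
x <- y <- numeric(n_steps)
for (t in seq_len(n_steps - 1)) {
  dW <- rnorm(2, mean = 0, sd = sqrt(h))     # Delta W_t ~ N(0, h)
  x[t + 1] <- x[t] + h * (m * 1    - lambda * x[t]) + eta * sqrt(x[t]) * dW[1]
  y[t + 1] <- y[t] + h * (r * x[t] - Lambda * y[t]) + eta * sqrt(y[t]) * dW[2]
  # clamp at zero so that the square-root noise term stays defined
  x[t + 1] <- max(0, x[t + 1]); y[t + 1] <- max(0, y[t + 1])
}

# cells are sampled uniformly from the steps after the burn-in period
n_cells <- 50   # hypothetical
expression_steps <- seq(burn_in / h + 1, n_steps)
sampled_steps <- sample(expression_steps, n_cells)
```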
#### Simulation of the single-cell RNA-seq experiment
For each dataset we sampled the same number of cells as were present in the reference real dataset, limited to the simulation steps after burn-in. These cells were sampled uniformly across the different steps of the 32 simulations. Next, we used the Splatter package [@zappia_splattersimulationsinglecell_2017] to estimate the different characteristics of a real dataset, such as the distributions of average gene expression, library sizes and dropout probabilities. We used Splatter to simulate the expression levels $\lambda_{i,j}$ of housekeeping genes $i$ (to match the number of genes in the reference dataset) in every cell $j$. These were combined with the expression levels of the genes simulated within a trajectory. Next, true counts were simulated using $Y'_{i,j} \sim \text{Poisson}(\lambda_{i,j})$. Finally, we simulated dropouts by setting true counts to zero, sampled from a Bernoulli distribution with dropout probability $\pi^D_{i,j} = \frac{1}{1+e^{-k(\text{ln}(\lambda_{i,j})-x_0)}}$. Both $x_0$ (the midpoint for the dropout logistic function) and $k$ (the shape of the dropout logistic function) were estimated by Splatter.
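The following sketch shows the last two steps (true counts and dropouts) on a random stand-in for the expression levels $\lambda_{i,j}$; in the actual workflow $\lambda_{i,j}$, $x_0$ and $k$ come from the simulations and Splatter's estimates, whereas the values below are hypothetical.

```{r, eval=FALSE}
set.seed(1)
lambda <- matrix(rexp(500 * 100, rate = 0.2), nrow = 500)   # cells x genes expression levels
x0 <- 1    # midpoint of the dropout logistic function
k <- -1    # shape; negative so that lowly expressed genes drop out more often

# true counts
true_counts <- matrix(rpois(length(lambda), lambda), nrow = nrow(lambda))

# dropout probability as a logistic function of log expression
dropout_prob <- 1 / (1 + exp(-k * (log(lambda) - x0)))
dropped <- matrix(rbinom(length(lambda), size = 1, prob = dropout_prob), nrow = nrow(lambda))
counts <- true_counts * (1 - dropped)
```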
This count matrix was then filtered and normalised using the pipeline described below.
#### Gold standard extraction
Because each cellular simulation follows the trajectory at its own speed, knowing the exact position of a cell within the trajectory topology is not straightforward. Furthermore, the speed at which simulated cells make a decision between two or more alternative paths is highly variable. We therefore first constructed a backbone expression profile for each branch within the trajectory. To do this, we first defined in which order the expression of the modules is expected to change, and then generated a backbone expression profile in which the expression of these modules increases and decreases smoothly between 0 and 1. We also smoothed the expression in each simulation using a rolling mean with a window of 50 time steps, and then calculated the average module expression along the simulation. We used dynamic time warping, implemented in the dtw R package [@giorgino_computingvisualizingdynamic_2009; @tormene_matchingincompletetime_2009], with an open end to align a simulation to all possible module progressions, and then picked the alignment which minimised the normalised distance between the simulation and the backbone. In case of cyclical trajectory topologies, the number of possible milestones a backbone could progress through was limited to 20.
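A minimal sketch of this alignment step with the dtw package, using toy one-module expression profiles; the backbone and the simulated profile below are hypothetical and already smooth, so the rolling-mean step is omitted.

```{r, eval=FALSE}
library(dtw)

backbone <- sin(seq(0, pi, length.out = 100))            # expected module progression
simulation <- sin(seq(0, 0.8 * pi, length.out = 250))    # a simulation covering only part of it

# open-ended alignment: the simulation does not have to reach the end of the backbone
alignment <- dtw(simulation, backbone, open.end = TRUE)

alignment$normalizedDistance   # used to pick the best-matching backbone
tail(alignment$index2, 1)      # how far along the backbone the simulation progressed
```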
### dyntoy
```{r}
design_dyntoy <- read_rds(result_file("design_dyntoy.rds", "01-datasets/02-synthetic"))
```
For more simplistic data generation ("toy" datasets), we created the dyntoy workflow (`r print_url("https://github.com/dynverse/dyntoy")`). We created `r length(unique(design_dyntoy$topology_model))` topology generators (described below), and with `r nrow(design_dyntoy)/length(unique(design_dyntoy$topology_model))` datasets per generator, this led to a total of `r nrow(design_dyntoy)` datasets.
We created the following topology generators, where $B(n, p)$ denotes a binomial distribution and $U(a, b)$ denotes a uniform distribution:
* Linear and cyclic, with number of milestones $\sim B(10, 0.25)$
* Bifurcating and converging, with four milestones
* Binary tree, with number of branching points $\sim U(3, 6)$
* Tree, with number of branching points $\sim U(3, 6)$ and maximal degree $\sim U(3, 6)$
For more complex topologies we first calculated a random number of "modifications" $\sim U(3, 6)$ and a $\textit{deg}_{\textit{max}} \sim B(10, 0.25) + 1$. For each type of topology, we defined what kind of modifications are possible: divergences, loops, convergences and divergence-convergences. We then iteratively constructed the topology by uniformly sampling from the set of possible modifications, and adding this modification to the existing topology. For a divergence, we connected an existing milestone to a number of new milestones. Conversely, for a convergence we connected a number of new milestones to an existing milestone. For a loop, we connected two existing milestones with a number of milestones in between. Finally, for a divergence-convergence we connected an existing milestone to several new milestones which again converged on a new milestone. The number of new milestones in each modification was sampled from $B(\textit{deg}_{\textit{max}} - 3, 0.25) + 2$.
* Looping, allowed loop modifications
* Diverging-converging, allowed divergence and convergence modifications
* Diverging with loops, allowed divergence and loop modifications
* Multiple looping, allowed looping modifications
* Connected, allowed looping, divergence and convergence modifications
* Disconnected, number of components sampled from $\sim B(5, 0.25) + 2$, for each component we randomly chose a topology from the ones listed above
After generating the topology, we sampled the length of each edge $\sim U(0.5, 1)$. We added regions of delayed commitment to a divergence in a random half of the cases. We then placed the cells (the same number as in the reference real dataset) uniformly on this topology, based on the length of the edges in the milestone network.
For each gene (same number as from the reference real dataset), we calculated the Kamada-Kawai layout in 2 dimensions, with edge weight equal to the length of the edge. For this gene, we then extracted for each cell a density value using a bivariate normal distribution with $\mu \sim U(x_{\textit{min}}, x_{\textit{max}})$ and $\sigma \sim U(x_{\textit{min}}/10, x_{\textit{min}}/8)$. We used this density as input for a zero-inflated negative binomial distribution with $\mu \sim U(100, 1000) \times \textit{density}$, $k \sim U(\mu / 10, \mu / 4)$ and $\pi$ from the parameters of the reference real dataset, to get the final count values.
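A simplified sketch of this expression model is shown below: a bivariate normal density over the Kamada–Kawai layout, converted into zero-inflated negative binomial counts. The toy topology, the diagonal covariance, the parameter ranges for $\sigma$ and the fixed dropout probability are simplifications made for illustration only.

```{r, eval=FALSE}
library(tidyverse)
library(igraph)

set.seed(1)
milestone_network <- tibble(from = c("A", "B", "B"), to = c("B", "C", "D"), length = c(1, 0.8, 0.6))
gr <- graph_from_data_frame(milestone_network, directed = TRUE)
kk_layout <- layout_with_kk(gr, weights = E(gr)$length)

# cell positions in the layout; here simply sampled from the milestone positions for brevity
cell_positions <- kk_layout[sample(nrow(kk_layout), 100, replace = TRUE), ]

# one gene: density of a bivariate normal (diagonal covariance) over the layout
mu <- runif(2, min(kk_layout), max(kk_layout))
sigma <- runif(1, 0.3, 0.8)
density <- dnorm(cell_positions[, 1], mu[1], sigma) * dnorm(cell_positions[, 2], mu[2], sigma)
density <- density / max(density)

# zero-inflated negative binomial counts
mu_counts <- runif(1, 100, 1000) * density
size <- runif(1, mean(mu_counts) / 10, mean(mu_counts) / 4)
pi_dropout <- 0.1
counts <- rnbinom(length(mu_counts), mu = mu_counts, size = size) *
  rbinom(length(mu_counts), size = 1, prob = 1 - pi_dropout)
```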
This count matrix was then filtered and normalised using the pipeline described below.
### PROSSTT
```{r}
design_prosstt <- read_rds(result_file("design_dyntoy.rds", "01-datasets/02-synthetic"))
```
PROSSTT is a recent data simulator [@papadopoulosPROSSTTProbabilisticSimulation2018], which simulates expression using linear mixtures of expression programs and random walks through the trajectory. We used 5 topology generators from dyntoy (linear, bifurcating, multifurcating, binary tree and tree), and simulated for each topology generator 10 datasets using different reference real datasets. However, due to frequent crashes of the tool, only 19 datasets produced output and could thus be used in the evaluation.
Using the `simulate_lineage` function, we simulated the lineage expression, with parameters $a \sim U(0.01, 0.1)$, $\textit{branch-tol}_{\textit{intra}} \sim U(0, 0.9)$ and $\textit{branch-tol}_{\textit{inter}} \sim U(0, 0.9)$. These parameter distributions were chosen to be very broad, so as to make sure that both easy and difficult datasets were simulated. After simulating base gene expression with `simulate_base_gene_exp`, we used the `sample_density` function to finally simulate the expression values of a number of cells (the same as in the reference real dataset), with $\alpha \sim \textit{Lognormal}(\mu = 0.3, \sigma = 1.5)$ and $\beta \sim \textit{Lognormal}(\mu = 2, \sigma = 1.5)$. Each of these parameters was centered around the default values of PROSSTT, but with enough variability to ensure a varied set of datasets.
This count matrix was then filtered and normalised using the pipeline described below.
### Splatter
Splatter [@zappia_splattersimulationsinglecell_2017] simulates expression values by constructing non-linear paths between different states, each having a distinct expression profile. We used 5 topology generators from dyntoy (linear, bifurcating, multifurcating, binary tree and tree), and simulated for each topology generator 10 datasets using different reference real datasets, leading to a total of 50 datasets.
We used the `splatSimulatePaths` function from Splatter to simulate datasets, with number of cells and genes equal to those in the reference real dataset, and with parameters $\textit{nonlinearProb}$, $\textit{sigmaFac}$ and $\textit{skew}$ all sampled from $U(0, 1)$.
This count matrix was then filtered and normalised using the pipeline described below.
## Dataset filtering and normalisation
We used a standard single-cell RNA-seq preprocessing pipeline which applies parts of the scran and scater Bioconductor packages [@lun_stepbystepworkflowlowlevel_2016]. The advantage of this pipeline is that it works both with and without spike-ins, and that it includes a stringent cell filtering which looks for abnormalities in library sizes, mitochondrial gene expression, and the number of genes expressed, using median absolute deviations (which we set to 3). We required that a gene was expressed in at least 5% of the cells, and that it had an average expression higher than 0.02. Furthermore, we used the pipeline to select the most highly variable genes, using a false discovery rate of 5% and a biological component higher than 0.5. As a final filter, we removed both all-zero genes and cells until convergence.
## Benchmark metrics
The importance of using multiple metrics to compare complex models has been stated repeatedly [@norel_selfassessmenttrap_2011]. Furthermore, a trajectory is a model with multiple layers of complexity, which calls for several metrics each assessing a different layer. We therefore defined several possible metrics for comparing trajectories, each investigating different layers. These are all discussed in `r ref("snote", "metrics")` along with examples and robustness analyses when appropriate.
Next, we created a set of rules to which we think a good trajectory metric should conform, and tested these empirically for each metric by comparing scores before and after perturbing a dataset (`r ref("snote", "metrics")`). Based on this analysis, we chose 4 metrics for the evaluation, each assessing a different aspect of the trajectory: (i) the `r label_metric("him")` measures the topological similarity, (ii) the `r label_metric("F1_branches")` compares the branch assignment, (iii) the `r label_metric("correlation")` assesses the similarity in pairwise cell-cell distances and thus the cellular positions, and (iv) the `r label_metric("featureimp_wcor")` looks at whether similar important features (genes) are found in both the reference dataset and the prediction.
### The Hamming-Ipsen-Mikhailov metric
The Hamming-Ipsen-Mikhailov (HIM) metric [@jurman_HIM_Glocal_Metric_2015] uses the two weighted adjacency matrices of the milestone networks as input (weighted by edge length). It is a linear combination of the normalised Hamming distance, which gives an indication of the differences in edge lengths, and the normalised Ipsen-Mikhailov distance, which assesses the similarity in degree distributions. The latter has a parameter $\gamma$, which was fixed at 0.1 so as to make the scores comparable between datasets. We illustrate the metric and discuss alternatives in `r ref("snote", "metrics")`.
### The F1 between branch assignments
To compare branch assignments, we used an $F_1$ score, also used for comparing biclustering methods [@saelens_comprehensiveevaluationmodule_2018]. To calculate this metric, we first calculated the similarity of all pairs of branches between the two trajectories using the Jaccard similarity. Next, we defined the $\textit{Recovery}$ (respectively $\textit{Relevance}$) as the average maximal similarity of all branches in the reference dataset (respectively the prediction). The `r label_metric("F1_branches")` was then defined as the harmonic mean between $\textit{Recovery}$ and $\textit{Relevance}$. We illustrate this metric further in `r ref("snote", "metrics")`.
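A minimal sketch of this computation on two hypothetical branch-assignment vectors (names are cell identifiers, values are branch labels):

```{r, eval=FALSE}
calculate_F1_branches <- function(reference, prediction) {
  jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
  ref_branches <- split(names(reference), reference)
  pred_branches <- split(names(prediction), prediction)
  # pairwise Jaccard similarities: rows are reference branches, columns are predicted branches
  similarities <- sapply(pred_branches, function(p) sapply(ref_branches, function(r) jaccard(r, p)))
  recovery <- mean(apply(similarities, 1, max))      # best match for every reference branch
  relevance <- mean(apply(similarities, 2, max))     # best match for every predicted branch
  2 * recovery * relevance / (recovery + relevance)  # harmonic mean
}

reference <- setNames(c("A", "A", "B", "B", "C"), paste0("cell", 1:5))
prediction <- setNames(c("x", "x", "x", "y", "y"), paste0("cell", 1:5))
calculate_F1_branches(reference, prediction)
```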
### Correlation between geodesic distances
When the position of a cell is the same in both the reference and the prediction, its relative distances to all other cells in the trajectory should also be the same. This observation is the basis for the `r label_metric("correlation")` metric.
To calculate the `r label_metric("correlation")`, we first sampled 100 waypoint cells in both the prediction and the reference dataset, using stratified sampling between the different milestones, edges and regions of delayed commitment, weighted by the number of cells in each collection. We then calculated the geodesic distances between the union of waypoint cells from both datasets and all other cells. The calculation of the geodesic distance depended on the location of the two cells within the trajectory, further discussed in `r ref("snote", "metrics")` and was weighted by the length of the edge in the milestone network. Finally, the `r label_metric("correlation")` was defined as the Spearman rank correlation between the distances of both datasets. We illustrate the metric and assess the effect of the number of waypoint cells in `r ref("snote", "metrics")`.
### The correlation between important features
The `r label_metric("featureimp_wcor")` assesses whether the same trajectory differentially expressed features are found in both the prediction and the reference. To calculate this metric, we used Random Forest regression (implemented in the R ranger package [@wright_rangerfastimplementation_2017]) to predict the expression values of each gene, based on the geodesic distances of a cell to each milestone. We then extracted feature importance values for each feature, and calculated the similarity of the feature importances using a weighted Pearson correlation, weighted by the feature importance in the reference dataset so as to give more weight to large differences. As hyperparameters we set the number of trees to 10,000, and the number of features on which to split to 1% of all available features. We illustrate this metric and assess the effect of its hyperparameters in `r ref("snote", "metrics")`.
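A minimal sketch of the two ingredients of this metric on hypothetical toy data: a random forest regression of one gene's expression on the geodesic distances to the milestones, and the weighted Pearson correlation between two vectors of feature importances. How the per-gene importances are summarised from the individual models is simplified here and the toy data are random stand-ins.

```{r, eval=FALSE}
library(ranger)

set.seed(1)

# random forest regression of one gene's expression on the geodesic distances
# of each cell to the milestones (toy data; 10,000 trees in the real evaluation)
cell_data <- data.frame(expression = rnorm(200), milestone1 = runif(200), milestone2 = runif(200))
fit <- ranger(expression ~ ., data = cell_data, num.trees = 100, importance = "impurity")
fit$variable.importance

# weighted Pearson correlation between the feature importances of the reference and
# the prediction, weighted by the importance in the reference (random stand-ins here)
weighted_pearson <- function(x, y, w) {
  w <- w / sum(w)
  mx <- sum(w * x); my <- sum(w * y)
  sum(w * (x - mx) * (y - my)) / sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))
}
importance_reference <- runif(50)
importance_prediction <- importance_reference + rnorm(50, sd = 0.2)
weighted_pearson(importance_prediction, importance_reference, w = importance_reference)
```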
### Score aggregation
To rank methods, we needed to aggregate the different scores on two levels: across datasets and across different metrics. This aggregation strategy is explained in more detail in `r ref("snote", "metrics")`.
To make sure easy and difficult datasets have an equal influence on the final score, we first normalised the scores on each dataset across the different methods. We shifted and scaled the scores to $\mu = 0$ and $\sigma = 1$, and then applied the cumulative distribution function of the standard normal distribution to map these values back into the $[0, 1]$ range.
Since there is a bias in dataset source and trajectory type (e.g. there are many more linear datasets), we aggregated the scores per method and dataset in multiple steps. We first aggregated the datasets with the same dataset source and trajectory type using an arithmetic mean of their scores. Next, the scores were averaged over the different dataset sources, using an arithmetic mean which was weighted based on how much the synthetic and silver scores correlated with the real gold scores. Finally, the scores were aggregated over the different trajectory types, again using an arithmetic mean.
Finally, to get an overall benchmarking score, we aggregated the different metrics using a geometric mean.
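A minimal sketch of this aggregation scheme on a hypothetical table of scores; for brevity the correlation-based weighting of the dataset sources is replaced by an unweighted mean, and the grouping variables and toy values are made up.

```{r, eval=FALSE}
library(tidyverse)

set.seed(1)
scores <- expand_grid(
  method = paste0("method", 1:3),
  dataset_source = c("real/gold", "real/silver", "synthetic/dyngen"),
  trajectory_type = c("linear", "bifurcation"),
  dataset = paste0("dataset", 1:5),
  metric = c("him", "F1_branches", "correlation", "featureimp_wcor")
) %>%
  mutate(score = runif(n()))

overall <- scores %>%
  # normalise per dataset across methods, then map back to [0, 1] with the normal CDF
  group_by(dataset_source, trajectory_type, dataset, metric) %>%
  mutate(score = pnorm(scale(score)[, 1])) %>%
  # average over the datasets with the same source and trajectory type
  group_by(method, dataset_source, trajectory_type, metric) %>%
  summarise(score = mean(score), .groups = "drop") %>%
  # average over dataset sources (unweighted here) and then over trajectory types
  group_by(method, trajectory_type, metric) %>%
  summarise(score = mean(score), .groups = "drop") %>%
  group_by(method, metric) %>%
  summarise(score = mean(score), .groups = "drop") %>%
  # geometric mean over the four metrics gives the overall benchmarking score
  group_by(method) %>%
  summarise(overall = exp(mean(log(score))), .groups = "drop")
```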
## Method execution
Each execution of a method on a dataset was performed in a separate task as part of a gridengine job. Each task was allocated one CPU core of an Intel(R) Xeon(R) CPU E5-2665 @ 2.40GHz, and one R session was started for each task. During the execution of a method on a dataset, if the time limit (1 hour) or memory limit (16 GB) was exceeded, or an error was produced, a zero score was returned for that execution.
## Complementarity
To assess the complementarity between different methods, we first calculated for every method and dataset whether the overall score is equal to or higher than 95% of the best overall score for that particular dataset. We then calculated for every method the weighted percentage of datasets which fulfilled this rule, weighted similarly as in the benchmark aggregation, and chose the best method. We iteratively added new methods until all methods were selected. For this analysis, we did not include any methods which require strong prior information, and only included methods which can detect the trajectory types present in at least one of the datasets.
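A minimal sketch of this greedy selection on a hypothetical matrix of overall scores (methods × datasets); the dataset weighting used in the actual analysis is omitted here.

```{r, eval=FALSE}
set.seed(1)
overall_scores <- matrix(
  runif(6 * 20), nrow = 6,
  dimnames = list(paste0("method", 1:6), paste0("dataset", 1:20))
)

# for every method and dataset: does the method reach 95% of the best score on that dataset?
top_performer <- sweep(overall_scores, 2, apply(overall_scores, 2, max), "/") >= 0.95

# greedily add the method that most increases the fraction of datasets
# on which at least one selected method is a top performer
selected <- character()
remaining <- rownames(top_performer)
while (length(remaining) > 0) {
  coverage <- sapply(remaining, function(method) {
    mean(apply(top_performer[c(selected, method), , drop = FALSE], 2, any))
  })
  best <- names(which.max(coverage))
  selected <- c(selected, best)
  remaining <- setdiff(remaining, best)
}
selected  # methods in the order in which they were added
```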
## Scalability
To assess the scalability of each method, we started from 5 real datasets, selected using the cluster centers from a k-medoids clustering as discussed in the synthetic data generation section. We up- and downscaled these datasets to between 10 and 100,000 cells and between 10 and 100,000 features, while never exceeding 1,000,000 values in total. To generate new cells or features, we first generated a 10-nearest-neighbour graph of both the cells and the features from the expression space. For every new cell or feature, we used a linear combination of one to three existing cells or features, where each cell or feature was given a weight sampled from $\textit{U}(0,1)$.
We ran each method on each dataset for maximally 1 hour and gave each process 10 GB of memory. To determine the running time of each method, we started the timer right after data loading and the loading of any packages, and stopped the timer before post-processing and saving of the output. Pre- and postprocessing steps specific to a method, such as dimensionality reduction and gene filtering, were included in the time. To estimate the maximal memory usage, we used the `max_vmem` value from the `qacct` command provided by a gridengine cluster. We acknowledge, however, that these memory estimates are very noisy, and the averages provided in this study are therefore only rough estimates.
The relationship between the dimensions of a dataset and the running time or maximal memory usage was modelled using shape constrained additive models [@pya_shapeconstrainedadditive_2015], with $\log_{10}|\textit{cells}|$ and $\log_{10}|\textit{features}|$ as predictor variables and $\log_{10}\textit{time}$ (or $\log_{10}\textit{memory}$) as outcome. We fitted this model using the `scam` function as implemented in the R scam package.
To classify the time complexity of each method with respect to the number of cells, we predicted the running time at 10,000 features with an increasing number of cells, from 100 to 100,000 in steps of 100. We trained a generalised linear model with the following function: `y ~ log(x) + sqrt(x) + x + I(x^2) + I(x^3)`. The time complexity of a method was then classified as follows:
$$
\begin{cases}
\text{superquadratic} & \text{if } w_{x^3} > 0.25,\\
\text{quadratic} & \text{if } w_{x^2} > 0.25,\\
\text{linear} & \text{if } w_x > 0.25,\\
\text{sublinear} & \text{if } w_{\text{log}(x)} > 0.25 \text{ or } w_{\text{sqrt}(x)} > 0.25,\\
\text{case with highest weight} & \text{else.}
\end{cases}
$$
This process was repeated for classifying the time complexity with respect to the number of features, and the memory complexity both with respect to the number of cells and features.
## Stability
In the ideal case, a method should produce a similar trajectory, even when the input data is slightly different. However, running the method multiple times on the same input data would not be the ideal approach to assess its stability, given that a lot of tools are artificially deterministic by internally resetting the pseudorandom number generator (for example, using the `set.seed` function in R or the `random.seed` function in numpy). To assess the stability of each method, we therefore selected a subset of datasets, consisting of 25% of the datasets and accounting for 15% of the total runtime, chosen such that after aggregation the overall scores still had a correlation of > 0.99 with the original overall ranking. We subsampled each dataset 10 times with 95% of the original cells and 95% of the original features. We ran every method on each of these subsampled datasets, and assessed the stability by calculating the benchmarking scores between each pair of subsequent models (run $i$ is compared to run $i+1$). For the `r label_metric("correlation")` and `r label_metric("F1_branches")`, we only used the intersection of the cells of two datasets, while the intersection of the features was used for the `r label_metric("featureimp_wcor")`.
## Usability
We created a transparent scoring scheme to quantify the usability of each method, based on several existing tool quality and programming guidelines in the literature and online (`r ref("stable", "qc_scoresheet")`). The goal of this quality control is first and foremost to stimulate the improvement of current methods and the development of user- and developer-friendly new methods. The quality control assessed six categories, each looking at several aspects, which are further divided into individual items. The availability category checks whether the method is easily available, whether the code and dependencies can be easily installed, and how the method can be used. The code quality category assesses the quality of the code, both from a user perspective (function naming, dummy proofing and availability of plotting functions) and from a developer perspective (consistent style and code duplication). The code assurance category is frequently overlooked, and checks for code testing, continuous integration [@beaulieu-jones_reproducibility_2017] and an active support system. The documentation category checks the quality of the documentation, both externally (tutorials and function documentation) and internally (inline documentation). The behaviour category assesses the ease with which the method can be run, by looking at unexpected output files and messages, required prior information, and how easily the trajectory model can be extracted from the output. Finally, we also assessed certain aspects of the study in which the method was proposed, such as publication in a peer-reviewed journal, the number of datasets in which the usefulness of the method was shown, and the scope of the method evaluation in the paper.
Each quality aspect received a weight depending on how frequently it was found in several papers and online sources which discuss tool quality (`r ref("stable", "qc_scoresheet")`). This was to make sure that more important aspects, such as the open source availability of the method, outweighed other less important aspects, such as the availability of a graphical user interface. For each aspect, we also assigned a weight to the individual questions being investigated (`r ref("stable", "qc_scoresheet")`). For calculating the final score, we weighed each of the six categories equally.
## Guidelines
For each set of outcomes in the guidelines figure, we selected one to four methods, by first filtering the methods to those which can detect all required trajectory types, and then ordering the methods according to their average accuracy score on datasets containing these trajectory types (aggregated according to the scheme presented in the accuracy section).
We used the same approach for selecting the best set of methods in the guidelines app (`r print_url("https://guidelines.dynverse.org")`), developed using the R shiny package. This app also filters the methods based on, among other things, the predicted running time and memory requirements, the prior information available, and the preferred execution environment (using the dynmethods package or standalone).
# Acknowledgements
We would like to thank the original authors of the methods for their feedback and improvements on the method wrappers. This study was supported by the Fonds Wetenschappelijk Onderzoek (FWO, R.C. and W.S.) and BOF (Ghent University, H.T.).
# Author contributions
R.C., W.S., H.T. and Y.S. designed the study. R.C. and W.S. performed the experiments and analysed the data. W.S., R.C., and H.T. implemented software packages. R.C., W.S., Y.S. and H.T. prepared the manuscript. Y.S. supervised the project.
# Competing financial interests
The authors declare no competing financial interests.
<!--
Tips for working in this document:
Go to tools -> preferences and remove Use smart quotes
Enable automatic zotero syncing to bibtex
Make sure to pin the zotero bibtex references after first using them!
-->