
diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/.nojekyll @@ -0,0 +1 @@ + diff --git a/404.html b/404.html new file mode 100644 index 0000000..ed99fc9 --- /dev/null +++ b/404.html @@ -0,0 +1,96 @@ + + +
+ + + + +YEAR: 2024 +COPYRIGHT HOLDER: Centre de Bioinformatique de Bordeaux (CBiB) ++ +
Copyright (c) 2024 Centre de Bioinformatique de Bordeaux (CBiB)
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+vignettes/SeuratIntegrate.Rmd
+ SeuratIntegrate.Rmd
SeuratIntegrate is an R package that aims to extend the pool of +single-cell RNA sequencing (scRNA-seq) integration methods available in +Seurat. Moreover, +SeuratIntegrate provides a set of tools to evaluate the performance of +the integrations produced.
+SeuratIntegrate provides access to R and Python methods to correct +batch effect:
+Method | +Type | +Implementation | +Underlying algorithm | +Reference | +
---|---|---|---|---|
ComBat | ++sva +(Bioconductor) | +Empirical Bayes adjustment | +Johnson et al., +2007 | +|
Harmony | ++harmony +(CRAN) | +Iterative embedding correction | +Korsunsky et al., +2019 | +|
MNN | ++batchelor +(Bioconductor) | +Mutual Nearest Neighbors | +Haghverdi +et al., 2018 | +|
BBKNN | ++bbknn +(GitHub) | +Batch-balanced nearest neighbors | +Polański et al., +2020 | +|
scVI | ++scvi-tools +(GitHub) | +Variational autoencoder | +Lopez et al., +2018 | +|
scANVI | ++scvi-tools +(GitHub) | +Semi-supervised variational autoencoder | +Xu et +al., 2021 | +|
Scanorama | ++scanorama (GitHub) | +Manifold alignment | +Hie et al., +2019 | +|
trVAE | ++scArches (GitHub) | +Conditional variational autoencoder | +Lotfollahi et al., +2020 | +
SeuratIntegrate provides a new interface to integrate the layers of
+an object: DoIntegrate()
. Moreover, SeuratIntegrate is
+compatible with CCA
and RPCA
(included in Seurat) and
+FastMNN
(from SeuratWrappers)
SeuratIntegrate incorporates 11 scoring metrics: 6 quantify the
+degree of batch mixing (batch
+correction), while 5 assess the preservation of biological
+differences
(bio-conservation) based
+on ground truth cell type labels.
Below is a table summarising each score’s input and type:
+Score name | +Require a cell type variable | +Require a clustering variable | +Input | +Score type | +
---|---|---|---|---|
Cell cycle regression | ++ | + | ||
PCA regression | ++ | + | ||
PCA density | ++ | + | ||
ASW batch | ++ | |||
ASW | ++ | |||
ARI | ++ | |||
NMI | ++ | |||
cLISI | ++ |
+ |
+||
iLISI | ++ |
+ |
+||
kBET | ++ |
+ |
+||
Graph connectivity | +
+per.component = TRUE ) |
++ |
Most scores are computed on an embedding
+(
Seurat::DimReduc
object) or a graph (
Seurat::Neighbor
or
+Seurat::Graph
object). The exceptions are ARI and NMI,
+which compare two categorical variables thus don’t need anything else
+than a cell-type and a cluster assignment variables.d anything else than
+a cell-type and a cluster assignment variables.
Most scores are based on a cell type label variable. This consists in +an estimate of each cell’s type obtained by analysing each batch +separately or by using an automatic cell annotation algorithm. This +estimate of cell types must be of sufficient quality to be considered +suitable as ground truth.
+We will use the small dataset of 200 immune liver cells and around +6,500 genes included in SeuratIntegrate.
+
+data("liver_small")
+dim(liver_small) # genes x cells
## [1] 6534 200
+It comprises 4 donors from 2 studies. Among the donors, 2 are healthy +and 2 are suffering from a hepatocellular carcinoma (HCC).
+Importantly, the Seurat object’s metadata also embeds cell type +annotation variables.
+
+liver_small[[]][,c(13, 15:17)]
Note that albeit not compulsory, a high quality cell-type +annotation is very important to evaluate the performance of +integrations because many scoring metrics use them as ground truth. +Moreover, it helps in verifying whether technical effects (usually +called batch effects) overweight true biological differences.
+Here, we use the last one and save it:
+
+cell.var <- "manual_cell_type_short"
We are going to test 3 integration methods that output different +types of objects. We will use ComBat (corrected counts), BBKNN +(corrected knn graph) and Harmony (corrected dimension reduction). BBKNN +is a Python method, so we need to have a conda environment to be able to +use it. Right now, we don’t have such an environment:
+
+getCache()
Hopefully, SeuratIntegrate facilitates the task with
+UpdateEnvCache()
:
+UpdateEnvCache("bbknn")
+Note: Similarly, you can call UpdateEnvCache()
+with all other Python methods (“scvi”, “scanorama”, “trVAE”) to set up
+their corresponding conda environments.
+
+If the process is successful, the cache of conda environments should +have been updated:
+
+getCache()
We also want to make sure we have a conda environment with a +umap-learn version compatible with Seurat:
+
+reticulate::conda_create('umap_0.5.4', packages = 'umap=0.5.4')
To ensure that SeuratIntegrate works well, it is indispensable to +split the Seurat object. This process distributes cells into +single-batch layers such that each batch is normalized independently. We +account for potential technical variability between samples by +designating “ID_sample” as the batch variable.
+
+batch.var <- "ID_sample" # save for later
+
+liver_small <- split(liver_small, f = liver_small$ID_sample)
+liver_small <- SCTransform(liver_small)
+
+liver_small <- RunPCA(liver_small)
+liver_small <- FindNeighbors(liver_small, dims = 1:20, k.param = 15L)
+liver_small <- RunUMAP(liver_small, dims = 1:20, n.neighbors = 15)
+DimPlot(liver_small, group.by = c("First_author", "ID_sample"))
+DimPlot(liver_small, group.by = cell.var)
+liver_small <-
+ DoIntegrate(object = liver_small,
+
+ # integrations
+ CombatIntegration(),
+ bbknnIntegration(orig = "pca", ndims.use = 20),
+ SeuratIntegrate::HarmonyIntegration(orig = "pca", dims = 1:20),
+
+ # additional parameters
+ use.hvg = TRUE,
+ use.future = c(FALSE, TRUE, FALSE)
+)
## Integration 1 in 3: integrating using 'CombatIntegration'
+## Integration 2 in 3: integrating using 'bbknnIntegration'
+## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
+##
+## Number of nodes: 200
+## Number of edges: 1784
+##
+## Running Louvain algorithm...
+## Maximum modularity in 10 random starts: 0.6251
+## Number of communities: 5
+## Elapsed time: 0 seconds
+## Integration 3 in 3: integrating using 'SeuratIntegrate::HarmonyIntegration'
+If we take a look at our Seurat object, we can note that it has been +enriched with many objects:
+ +## An object of class Seurat
+## 22602 features across 200 samples within 4 assays
+## Active assay: SCT (6534 features, 3000 variable features)
+## 3 layers present: counts, data, scale.data
+## 3 other assays present: RNA, combat.reconstructed, bbknn.ridge
+## 4 dimensional reductions calculated: pca, umap, pca.bbknn, harmony
+##
+## #####
+##
+## [1] "SCT_nn"
+## [2] "SCT_snn"
+## [3] "bbknn_scale.data_connectivities"
+## [4] "bbknn_scale.data_distances"
+## [5] "bbknn_ridge.residuals_connectivities"
+## [6] "bbknn_ridge.residuals_distances"
+Important outputs are:
+The type of output is important to consider, because scoring metrics +are not compatible with all output types. The simplest strategy is to +process each output separately in order to obtain at least a PCA out of +it, or even a knn graph (indispensable to compute clusters). Note that +several scores cannot be computed on knn graphs, hence knn graph outputs +(e.g. BBKNN) can only be evaluated by a reduced set of metrics.
+Below is a summary for each output type (bracketed steps are not +always necessary):
+ScaleData()
] ->
+RunPCA()
-> [FindNeighbors()
->
+FindOptimalClusters()
]RunPCA()
] ->
+[FindNeighbors()
->
+FindOptimalClusters()
]FindOptimalClusters()
]Here, we will use SymmetrizeKnn()
between
+FindNeighbors()
and FindOptimalClusters()
+because return.neighbor = TRUE
. This is useful to keep the
+distances between cells in the KNN
graph. Although not
+compulsory, this is used to stay in line with BBKNN’s output. To prevent
+the community detection algorithm to output a high fraction of
+singletons, we “symmetrize” the matrix which makes the graph
+“undirected”.
+# corrected counts outputs
+DefaultAssay(liver_small) <- "combat.reconstructed"
+VariableFeatures(liver_small) <- VariableFeatures(liver_small[["SCT"]])
+liver_small <- ScaleData(liver_small)
+
+liver_small <- RunPCA(liver_small, npcs = 50L, reduction.name = "pca.combat")
+liver_small <- FindNeighbors(liver_small, reduction = "pca.combat", dims = 1:20,
+ return.neighbor = TRUE, graph.name = "knn.combat")
+liver_small <- SymmetrizeKnn(liver_small, graph.name = "knn.combat")
+liver_small <- FindOptimalClusters(liver_small, graph.name = "knn.combat_symmetric",
+ cluster.name = "combat_{cell.var}_{metric}",
+ cell.var = cell.var,
+ optimisation.metric = c("nmi", "ari")) # default, compute both
+
+DefaultAssay(liver_small) <- "SCT"
+# dimension reduction outputs
+liver_small <- FindNeighbors(liver_small, reduction = "pca", dims = 1:20, k.param = 20L,
+ return.neighbor = TRUE, graph.name = "knn.unintegrated")
+liver_small <- SymmetrizeKnn(liver_small, graph.name = "knn.unintegrated")
+liver_small <- FindOptimalClusters(liver_small, graph.name = "knn.unintegrated_symmetric",
+ cluster.name = "unintegrated_{cell.var}_{metric}",
+ cell.var = cell.var)
+
+
+liver_small <- FindNeighbors(liver_small, reduction = "harmony", dims = 1:20,
+ return.neighbor = TRUE, graph.name = "knn.harmony")
+liver_small <- SymmetrizeKnn(liver_small, graph.name = "knn.harmony")
+liver_small <- FindOptimalClusters(liver_small, graph.name = "knn.harmony_symmetric", cell.var = cell.var,
+ cluster.name = "harmony_{cell.var}_{metric}")
+# graph outputs
+liver_small <- SymmetrizeKnn(liver_small, graph.name = "bbknn_ridge.residuals_distances")
+liver_small <- FindOptimalClusters(liver_small, graph.name = "bbknn_ridge.residuals_distances_symmetric",
+ cell.var = cell.var, cluster.name = "bbknn_{cell.var}_{metric}")
Now that we have post-processed each integration’s output, we can +compute multiple scores to estimate the accuracy of the +integrations.
+Those scores are run on a dimension reduction (such as a PCA) and are +not compatible with graph outputs. They are cell-type free score thus do +not require a cell-type label variable.
+
+liver_small <- AddScoreRegressPC(liver_small, integration = "unintegrated",
+ batch.var = batch.var, reduction = "pca")
+liver_small <- AddScoreRegressPC(liver_small, integration = "combat",
+ batch.var = batch.var, reduction = "pca.combat")
+liver_small <- AddScoreRegressPC(liver_small, integration = "harmony",
+ batch.var = batch.var, reduction = "harmony")
+
+liver_small <- AddScoreDensityPC(liver_small, integration = "unintegrated",
+ batch.var = batch.var, reduction = "pca")
+liver_small <- AddScoreDensityPC(liver_small, integration = "combat",
+ batch.var = batch.var, reduction = "pca.combat")
+liver_small <- AddScoreRegressPC(liver_small, integration = "harmony",
+ batch.var = batch.var, reduction = "harmony")
+
+liver_small <- CellCycleScoringPerBatch(liver_small, batch.var = batch.var,
+ s.features = cc.genes$s.genes,
+ g2m.features = cc.genes$g2m.genes)
+liver_small <- AddScoreRegressPC.CellCycle(liver_small, integration = "unintegrated",
+ batch.var = batch.var, what = "pca",
+ compute.cc = FALSE, dims.use = 1:20)
+liver_small <- AddScoreRegressPC.CellCycle(liver_small, integration = "combat",
+ batch.var = batch.var, what = "pca.combat",
+ compute.cc = FALSE, dims.use = 1:20)
+liver_small <- AddScoreRegressPC.CellCycle(liver_small, integration = "harmony",
+ batch.var = batch.var, what = "harmony",
+ compute.cc = FALSE, dims.use = 1:20)
Those scores return an average silhouette width (ASW), either per
+cell type label or per batch. They are run on a dimension reduction
+(such as a PCA) and are not compatible with graph outputs. They are
+based on a cell-type label variable. However,
+ScoreASWBatch()
can be run in a cell-type label independent
+manner with per.cell.var = FALSE
.
+liver_small <- AddScoreASW(liver_small, integration = "unintegrated",
+ cell.var = cell.var, what = "pca")
+liver_small <- AddScoreASW(liver_small, integration = "combat",
+ cell.var = cell.var, what = "pca.combat")
+liver_small <- AddScoreASW(liver_small, integration = "harmony",
+ cell.var = cell.var, what = "harmony")
+
+liver_small <- AddScoreASWBatch(liver_small, integration = "unintegrated",
+ batch.var = batch.var, cell.var = cell.var,
+ what = "pca")
+liver_small <- AddScoreASWBatch(liver_small, integration = "combat",
+ batch.var = batch.var, cell.var = cell.var,
+ what = "pca.combat")
+liver_small <- AddScoreASWBatch(liver_small, integration = "harmony",
+ batch.var = batch.var, cell.var = cell.var,
+ what = "harmony")
Those scores are graph outputs. They require a cell-type label +variable.
+
+liver_small <- AddScoreConnectivity(liver_small, integration = "unintegrated",
+ graph.name = "knn.unintegrated_symmetric",
+ cell.var = cell.var)
+liver_small <- AddScoreConnectivity(liver_small, integration = "combat",
+ graph.name = "knn.combat_symmetric",
+ cell.var = cell.var)
+liver_small <- AddScoreConnectivity(liver_small, integration = "harmony",
+ graph.name = "knn.harmony_symmetric",
+ cell.var = cell.var)
+liver_small <- AddScoreConnectivity(liver_small, integration = "bbknn",
+ graph.name = "bbknn_ridge.residuals_distances_symmetric",
+ cell.var = cell.var)
+
+liver_small <- AddScoreLISI(liver_small, integration = "unintegrated",
+ batch.var = batch.var, cell.var = cell.var,
+ reduction = "pca")
+liver_small <- AddScoreLISI(liver_small, integration = "combat",
+ batch.var = batch.var, cell.var = cell.var,
+ reduction = "pca.combat")
+liver_small <- AddScoreLISI(liver_small, integration = "harmony",
+ batch.var = batch.var, cell.var = cell.var,
+ reduction = "harmony")
+liver_small <- AddScoreLISI(liver_small, integration = "bbknn",
+ batch.var = batch.var, cell.var = cell.var,
+ reduction = NULL,
+ graph.name = "bbknn_ridge.residuals_distances_symmetric")
Those scores compare two categorical variables. Thus, they don’t need
+anything else than a cell-type and a cluster assignment variables
+(computed during Post-processing with
+FindOptimalClusters()
).
+liver_small <- AddScoreARI(liver_small, integration = "unintegrated",
+ cell.var = cell.var,
+ clust.var = paste("unintegrated", cell.var, "ari", sep = "_"))
+liver_small <- AddScoreARI(liver_small, integration = "combat",
+ cell.var = cell.var,
+ clust.var = paste("combat", cell.var, "ari", sep = "_"))
+liver_small <- AddScoreARI(liver_small, integration = "harmony",
+ cell.var = cell.var,
+ clust.var = paste("harmony", cell.var, "ari", sep = "_"))
+liver_small <- AddScoreARI(liver_small, integration = "bbknn",
+ cell.var = cell.var,
+ clust.var = paste("bbknn", cell.var, "ari", sep = "_"))
+
+liver_small <- AddScoreNMI(liver_small, integration = "unintegrated",
+ cell.var = cell.var,
+ clust.var = paste("unintegrated", cell.var, "nmi", sep = "_"))
+liver_small <- AddScoreNMI(liver_small, integration = "combat",
+ cell.var = cell.var,
+ clust.var = paste("combat", cell.var, "nmi", sep = "_"))
+liver_small <- AddScoreNMI(liver_small, integration = "harmony",
+ cell.var = cell.var,
+ clust.var = paste("harmony", cell.var, "nmi", sep = "_"))
+liver_small <- AddScoreNMI(liver_small, integration = "bbknn",
+ cell.var = cell.var,
+ clust.var = paste("bbknn", cell.var, "nmi", sep = "_"))
Now that we have obtained several scores per integration, we can +compare their performances. First, let’s scale the scores between zero +and one and standardize their direction (the closer to one, always the +better). This step makes things easier to compare and improves +interpretability.
+
+liver_small <- ScaleScores(liver_small)
Now, let’s plot the integrations.
+
+PlotScores(liver_small)
We can observe that 3 overall scores have been computed, namely batch +correction, bio-conservation and overall (the average of the last two). +To balance each score’s contribution to the overall scores, a min-max +rescaling is applied on each score after scaling, stretching the +original score’s bounds to zero and one.
+It can be disabled to preview the original scores:
+
+PlotScores(liver_small, rescale = FALSE)
We can also hide scores or integrations. For instance, “PCA.density” +and “PCA.regression” are hardly informative. Let’s exclude them and see +how it affects overall scores:
+
+PlotScores(liver_small, rescale = FALSE, exclude.score = c("PCA.density", "PCA.regression"))
Interestingly, we obtained the same ranking of integrations as in the +first figure.
+It is also possible to project the cells on a UMAP computed for each +integration’s output, as a complementary (visual) inspection of +integration performances.
+
+liver_small <- RunUMAP(liver_small, dims = 1:20, reduction = "pca.combat",
+ reduction.name = "umap.combat")
+liver_small <- RunUMAP(liver_small, dims = 1:20, reduction = "harmony",
+ reduction.name = "umap.harmony")
+
+library(future)
+plan(multisession)
+liver_small %<-% {
+ reticulate::use_condaenv('umap_0.5.4')
+ RunUMAP(liver_small, graph = "bbknn_ridge.residuals_connectivities",
+ umap.method = "umap-learn", n.epochs = 200L,
+ reduction.name = "umap.bbknn") }
+liver_small
+plan(sequential)
+DimPlot(liver_small, group.by = c(batch.var, cell.var), reduction = "umap")
+UMAP of unintegrated data +
+
+DimPlot(liver_small, group.by = c(batch.var, cell.var), reduction = "umap.combat")
+UMAP of ComBat integration +
+
+DimPlot(liver_small, group.by = c(batch.var, cell.var), reduction = "umap.bbknn")
+UMAP of BBKNN integration +
+ ## R version 4.4.2 (2024-10-31)
+ ## Platform: x86_64-pc-linux-gnu
+ ## Running under: Linux Mint 21
+ ##
+ ## Matrix products: default
+ ## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
+ ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+ ##
+ ## locale:
+ ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ ## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
+ ## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
+ ## [9] LC_ADDRESS=C LC_TELEPHONE=C
+ ## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
+ ##
+ ## time zone: Europe/Paris
+ ## tzcode source: system (glibc)
+ ##
+ ## attached base packages:
+ ## [1] stats graphics grDevices utils datasets methods base
+ ##
+ ## other attached packages:
+ ## [1] future_1.34.0 dplyr_1.1.4 SeuratIntegrate_0.4.0
+ ## [4] Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4
+ ##
+ ## loaded via a namespace (and not attached):
+ ## [1] RcppAnnoy_0.0.22 splines_4.4.2
+ ## [3] later_1.4.1 batchelor_1.20.0
+ ## [5] tibble_3.2.1 polyclip_1.10-7
+ ## [7] XML_3.99-0.17 fastDummies_1.7.4
+ ## [9] lifecycle_1.0.4 edgeR_4.2.2
+ ## [11] globals_0.16.3 lattice_0.22-6
+ ## [13] MASS_7.3-61 backports_1.5.0
+ ## [15] magrittr_2.0.3 limma_3.60.6
+ ## [17] plotly_4.10.4 sass_0.4.9
+ ## [19] rmarkdown_2.29 jquerylib_0.1.4
+ ## [21] yaml_2.3.10 httpuv_1.6.15
+ ## [23] glmGamPoi_1.16.0 sctransform_0.4.1
+ ## [25] spam_2.11-0 spatstat.sparse_3.1-0
+ ## [27] reticulate_1.40.0 DBI_1.2.3
+ ## [29] cowplot_1.1.3 pbapply_1.7-2
+ ## [31] RColorBrewer_1.1-3 ResidualMatrix_1.14.1
+ ## [33] abind_1.4-8 zlibbioc_1.50.0
+ ## [35] Rtsne_0.17 GenomicRanges_1.56.2
+ ## [37] purrr_1.0.2 BiocGenerics_0.50.0
+ ## [39] tweenr_2.0.3 rappdirs_0.3.3
+ ## [41] sva_3.52.0 GenomeInfoDbData_1.2.12
+ ## [43] IRanges_2.38.1 S4Vectors_0.42.1
+ ## [45] ggrepel_0.9.6 irlba_2.3.5.1
+ ## [47] listenv_0.9.1 spatstat.utils_3.1-1
+ ## [49] genefilter_1.86.0 goftest_1.2-3
+ ## [51] RSpectra_0.16-2 annotate_1.82.0
+ ## [53] spatstat.random_3.3-2 fitdistrplus_1.2-1
+ ## [55] parallelly_1.41.0 pkgdown_2.1.1
+ ## [57] DelayedMatrixStats_1.26.0 leiden_0.4.3.1
+ ## [59] codetools_0.2-20 DelayedArray_0.30.1
+ ## [61] ggforce_0.4.2 scuttle_1.14.0
+ ## [63] tidyselect_1.2.1 UCSC.utils_1.0.0
+ ## [65] farver_2.1.2 ScaledMatrix_1.12.0
+ ## [67] matrixStats_1.4.1 stats4_4.4.2
+ ## [69] spatstat.explore_3.3-3 jsonlite_1.8.9
+ ## [71] BiocNeighbors_1.22.0 progressr_0.15.1
+ ## [73] ggridges_0.5.6 survival_3.8-3
+ ## [75] systemfonts_1.1.0 tools_4.4.2
+ ## [77] ragg_1.3.3 ica_1.0-3
+ ## [79] Rcpp_1.0.13-1 glue_1.8.0
+ ## [81] gridExtra_2.3 SparseArray_1.4.8
+ ## [83] mgcv_1.9-1 xfun_0.49
+ ## [85] MatrixGenerics_1.16.0 GenomeInfoDb_1.40.1
+ ## [87] withr_3.0.2 fastmap_1.2.0
+ ## [89] digest_0.6.37 rsvd_1.0.5
+ ## [91] R6_2.5.1 mime_0.12
+ ## [93] textshaping_0.4.1 colorspace_2.1-1
+ ## [95] scattermore_1.2 tensor_1.5
+ ## [97] RSQLite_2.3.9 spatstat.data_3.1-4
+ ## [99] RhpcBLASctl_0.23-42 tidyr_1.3.1
+ ## [101] generics_0.1.3 data.table_1.16.4
+ ## [103] httr_1.4.7 htmlwidgets_1.6.4
+ ## [105] S4Arrays_1.4.1 uwot_0.2.2
+ ## [107] pkgconfig_2.0.3 gtable_0.3.6
+ ## [109] blob_1.2.4 lmtest_0.9-40
+ ## [111] SingleCellExperiment_1.26.0 XVector_0.44.0
+ ## [113] htmltools_0.5.8.1 dotCall64_1.2
+ ## [115] scales_1.3.0 Biobase_2.64.0
+ ## [117] png_0.1-8 lisi_1.0
+ ## [119] harmony_1.2.3 spatstat.univar_3.1-1
+ ## [121] knitr_1.49 rstudioapi_0.17.1
+ ## [123] reshape2_1.4.4 nlme_3.1-166
+ ## [125] cachem_1.1.0 zoo_1.8-12
+ ## [127] stringr_1.5.1 KernSmooth_2.23-24
+ ## [129] parallel_4.4.2 miniUI_0.1.1.1
+ ## [131] AnnotationDbi_1.66.0 desc_1.4.3
+ ## [133] pillar_1.10.0 grid_4.4.2
+ ## [135] vctrs_0.6.5 RANN_2.6.2
+ ## [137] promises_1.3.2 BiocSingular_1.20.0
+ ## [139] distances_0.1.11 beachmat_2.20.0
+ ## [141] xtable_1.8-4 cluster_2.1.8
+ ## [143] evaluate_1.0.1 locfit_1.5-9.10
+ ## [145] cli_3.6.3 compiler_4.4.2
+ ## [147] rlang_1.1.4 crayon_1.5.3
+ ## [149] future.apply_1.11.3 labeling_0.4.3
+ ## [151] plyr_1.8.9 forcats_1.0.0
+ ## [153] fs_1.6.5 stringi_1.8.4
+ ## [155] viridisLite_0.4.2 deldir_2.0-4
+ ## [157] BiocParallel_1.38.0 Biostrings_2.72.1
+ ## [159] munsell_0.5.1 lazyeval_0.2.2
+ ## [161] spatstat.geom_3.3-4 Matrix_1.7-1
+ ## [163] RcppHNSW_0.6.0 patchwork_1.3.0
+ ## [165] bit64_4.5.2 sparseMatrixStats_1.16.0
+ ## [167] ggplot2_3.5.1 statmod_1.5.0
+ ## [169] KEGGREST_1.44.1 shiny_1.10.0
+ ## [171] SummarizedExperiment_1.34.0 ROCR_1.0-11
+ ## [173] memoise_2.0.1 igraph_2.1.2
+ ## [175] broom_1.0.7 bslib_0.8.0
+ ## [177] bit_4.5.0.1
+vignettes/introduction.Rmd
+ introduction.Rmd
We will use the pbmcsca
dataset available from the
+package SeuratData
.
+# install `SeuratData` package (if not yet)
+if (! requireNamespace("SeuratData", quietly = TRUE)) {
+ devtools::install_github('satijalab/seurat-data')
+}
+
+# increase download timeout and install the dataset
+options(timeout = 300)
+SeuratData::InstallData('pbmcsca')
+# load the dataset (take 1,000 first cells to speed-up execution)
+seu <- SeuratData::LoadData('pbmcsca')[,1:1e3]
## Warning: Assay RNA changing from Assay to Assay
+## Warning: Assay RNA changing from Assay to Assay5
+Have a look at the metadata:
+
+# rmarkdown::paged_table -> prints data frames in a less ugly way than default
+rmarkdown::paged_table(head(seu[[]], n = 10))
The column Method
provides information about the
+single-cell technology used to acquire each cell’s trancriptome. We
+consider it as the batch of origin.
+table(seu$Method)
##
+## 10x Chromium (v2) A CEL-Seq2 Smart-seq2
+## 494 253 253
+The column CellType
contains cell-types annotation that
+we will use later on.
+rmarkdown::paged_table(as.data.frame(table(seu$CellType)))
Let’s now define the batch variable and the cell-type +variable(s):
+
+batch.var <- 'Method' # available in the metadata of the object
+cell.var <- 'CellType' # available in the metadata of the object
The current Seurat object is not split, meaning that each layer +contains cells from all batches.
+ +## Layers before split: counts, data
+We select the column Method
to separate the cells into
+batch-specific layers. This step is indispensable to run integration
+methods and enables to normalise each batch (i.e. layer)
+independently.
+seu[['RNA']] <- split(seu[['RNA']], f = seu$Method)
+
+cat('Layers after split:', paste(Layers(seu), collapse = ", "), '\n')
## Layers after split: counts.Smart-seq2, counts.CEL-Seq2, counts.10x_Chromium_v2_A, data.Smart-seq2, data.CEL-Seq2, data.10x_Chromium_v2_A
+Then, we proceed to the standard Seurat
workflow until
+we obtain the PCA reduction.
+seu <- SCTransform(seu)
+seu <- RunPCA(seu, verbose = F)
+seu <- FindNeighbors(seu, reduction = "pca", dims = 1:30, k.param = 20L)
We now further process the object until we can visualise the +dispersion of cells on a UMAP dimension reduction.
+
+seu <- FindClusters(seu, graph.name = 'SCT_snn', resolution = .5)
+seu <- RunUMAP(seu, dims = 1:30, reduction = 'pca')
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
+##
+## Number of nodes: 1000
+## Number of edges: 43767
+##
+## Running Louvain algorithm...
+## Maximum modularity in 10 random starts: 0.8715
+## Number of communities: 8
+## Elapsed time: 0 seconds
+
+DimPlot(seu, label = T) + NoLegend() + ggplot2::coord_fixed(ratio = .7)
Let’s colour cells according to batch and cell-type label.
+
+DimPlot(seu, group.by = batch.var) +
+ DimPlot(seu, group.by = cell.var) & ggplot2::coord_fixed(ratio = .7)
Albeit moderately, the dispersion of cells on the UMAP seems to be +influenced by a batch effect. According to the “CellType” variable, B +cells from 10x and CEL-Seq2 are not gathered together and monocyte cells +do not fully overlap.
+Now, we want to correct these technical differences.
+The integration commands are to some extent similar to the Seurat
+V5 vignette. The purpose of this package is to extend the set of
+available integration methods. See the bottom of
+?IntegrateLayers
to have a comprehensive list of relevant
+methods.
Many methods supported by SeuratIntegrate are implemented in Python,
+and the wrappers provided rely on the reticulate
package
+and conda
environments. If you are not familiar with the
+CondaEnvManager
, have a look at the
+vignette("setup_and_tips")
.
Here are some important considerations before performing an +integration:
+DoIntegrate()
from SeuratIntegrate to
+integrate the cell batches (see the next section)?bbknnIntegration
)CondaEnvManager
via
+UpdateEnvCache()
(check the
+vignette("setup_and_tips")
)conda_env
parameter. This overrides the default
+behaviour by loading the specified environment instead of fetching the
+CondaEnvManager
’s cache.Some methods expect a raw count matrix, others expect the scaled +counts of the variable features, etc. To help you with the choice, look +at the table below
+Method | +Layer | +Features | +
---|---|---|
ComBat | +data | +any but better with restricted number of features +(e.g. variable) | +
Harmony | +N/A (PCA reduction) | +N/A (PCA reduction) | +
MNN | +data | +any but better with restricted number of features +(e.g. variable) | +
bbknn | +N/A (PCA reduction) | +N/A (PCA reduction) | +
scVI / scANVI | +counts | +all | +
Scanorama | +counts or data | +any | +
trVAE | +
+ data (recon loss “mse”) +counts (otherwise) + |
+all | +
Layers:
+
/!\ IMPORTANT /!\ To use
+all features when calling an integration method with
+IntegrateLayers()
:
+IntegrateLayers(object, features = Features(object), scale.layer = NULL)
.
+Does not work for a SCTAssay
.
The function DoIntegrate()
is a very handy way to run
+multiple integrations in a single command. Note that:
...
is the place to specify the integration commands to
+run.FooIntegration()
,
+BarIntegration()
, etc.FooIntegration(layers = "data")
)use.hvg = TRUE
results in using variable featuresuse.future = TRUE
is useful to run Python-based methods
+(a normal R session cannot load more than one conda instance, while
+future
enables to launch background sessions, preventing
+the main on to load any conda environment.). It is highly
+recommended to set it to FALSE
for R-based
+methods.Most integration methods can be used without modifying default +parameters. In this vignette, we will change some arguments to meet our +needs. Notably, we will change the number of cores allocated to each +method (where possible).
+
+ncores <- parallel::detectCores() - 2
In this vignette, we are going to use 3 Python-based methods, namely +BBKNN, Scanorama and scANVI from the +scvi-tools suite. Let’s make sure they are available straight away:
+
+# BBKNN
+if (! isValid(getCache()$bbknn)) {
+ UpdateEnvCache("bbknn")
+}
+# Scanorama
+if (! isValid(getCache()$scanorama)) {
+ UpdateEnvCache("scanorama")
+}
+# scvi-tools
+if (! isValid(getCache()$scanvi)) {
+ UpdateEnvCache("scanvi")
+}
Let’s proceed to a few batch-effect corrections:
+seu <- DoIntegrate(
+ object = seu,
+ SeuratIntegrate::HarmonyIntegration(orig = "pca", dims = 1:30,
+ ncores = ncores),
+ CCAIntegration(orig = "pca", dims = 1:30 , new.reduction = "cca.integrated",
+ normalization.method = "SCT"),
+ RPCAIntegration(orig = "pca", dims = 1:30, new.reduction = "rpca.integrated",
+ normalization.method = "SCT"),
+ use.future = FALSE # R-based
+)
## Integration 1 in 3: integrating using 'SeuratIntegrate::HarmonyIntegration'
+## Integration 2 in 3: integrating using 'CCAIntegration'
+## Integration 3 in 3: integrating using 'RPCAIntegration'
+Note: We use SeuratIntegrate::
before
+HarmonyIntegration
to avoid any confusion with
+Seurat::HarmonyIntegration()
.
+seu <- DoIntegrate(
+ object = seu,
+ bbknnIntegration(orig = "pca", layers = "data", ndims = 30),
+ ScanoramaIntegration(orig = "pca", ncores = ncores),
+ scANVIIntegration(groups = seu[[]], groups.name = "Method",
+ labels.name = "CellType", layers = "counts",
+ torch.intraop.threads = ncores,
+ torch.interop.threads = ncores,
+ max_epochs = 20L),
+ use.future = TRUE, # Python-based
+ use.hvg = c(TRUE, TRUE, FALSE)
+)
## Integration 1 in 3: integrating using 'bbknnIntegration'
+## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
+##
+## Number of nodes: 1000
+## Number of edges: 6866
+##
+## Running Louvain algorithm...
+## Maximum modularity in 10 random starts: 0.6812
+## Number of communities: 10
+## Elapsed time: 0 seconds
+## Integration 2 in 3: integrating using 'ScanoramaIntegration'
+## Integration 3 in 3: integrating using 'scANVIIntegration'
+++Note: set
+max_epochs = 20L
for +scANVIIntegration
is only to save time ! The default number +of epochs (400) results in a superior integration.
If we take a look at our Seurat object, we can note that it has been +enriched with many objects:
+
+print(seu)
## An object of class Seurat
+## 69594 features across 1000 samples within 4 assays
+## Active assay: SCT (16450 features, 3000 variable features)
+## 3 layers present: counts, data, scale.data
+## 3 other assays present: RNA, bbknn.ridge, scanorama.reconstructed
+## 8 dimensional reductions calculated: pca, umap, harmony, cca.integrated, rpca.integrated, pca.bbknn, integrated.scanorama, integrated.scANVI
+
+cat("Graph objects:", paste(Graphs(seu), collapse = ", "), "\n")
+cat("Neighbor objects:", paste(Neighbors(seu), collapse = ", "), "\n")
+cat("Reduction dimensions:", paste(Reductions(seu), collapse = ", "), "\n")
+cat("Assays:", paste(Assays(seu), collapse = ", "), "\n")
## Graph objects: SCT_nn, SCT_snn, bbknn_scale.data_connectivities, bbknn_scale.data_distances, bbknn_ridge.residuals_connectivities, bbknn_ridge.residuals_distances
+## Neighbor objects:
+## Reduction dimensions: pca, umap, harmony, cca.integrated, rpca.integrated, pca.bbknn, integrated.scanorama, integrated.scANVI
+## Assays: RNA, SCT, bbknn.ridge, scanorama.reconstructed
+Great! We have successfully performed several integrations! However, +stopping here would be unsatisfactory because we still need to process +each integration’s output(s) to obtain at least one UMAP projection for +each. Here, we will also aim at generating assessable representations to +score.
+Several objects can be produced by each integration algorithm, namely +a layer in a new assay (i.e. corrected counts), a dimension reduction +(corrected embedding), or a knn network. Some even produce more than one +output (for instance Scanorama produces corrected counts and a dimension +reduction).
+The type of output is important to consider, because scoring metrics +are not compatible with all output types. The simplest strategy is to +process each output separately in order to obtain at least a PCA out of +it, or even a knn graph (essential to compute clusters). Note that most +scores cannot be computed on knn graphs, hence knn graph outputs +(e.g. BBKNN) can only be evaluated by a reduced set of metrics.
+Below is a summary of post-processing steps for each output type +(bracketed steps are not always necessary):
+ScaleData()
] ->
+RunPCA()
-> [FindNeighbors()
->
+FindOptimalClusters()
]RunPCA()
] ->
+[FindNeighbors()
->
+FindOptimalClusters()
]FindOptimalClusters()
]RunPCA()
is sometimes run even on dimension reduction
+objects (within scoring functions) because some scores require a
+variance associated with each dimension.
Let’s process all the outputs. Here, we will go through all the steps
+for a more exhaustive demonstration. However, it is to be noted that
+skipping the final step FindOptimalClusters()
makes the
+neighbour graph computation step (FindNeighbors()
)
+unnecessary. In such a case however, one will forgo two scoring metrics,
+namely ARI and NMI.
Here, we will use SymmetrizeKnn()
between
+FindNeighbors()
and FindOptimalClusters()
+because we set return.neighbor = TRUE
in
+FindNeighbors()
. This is useful to keep the distances
+between cells in the KNN
graph rather than what
+FindNeighbors()
does by default, which is converting the
+KNN
graph to an adjacency matrix with 0/1s and to a
+SNN
network with values bounded between 0 and 1. This not
+compulsory, but this is used to stay in line with BBKNN’s output. To
+prevent the community detection algorithm to output a high fraction of
+singletons, we “symmetrize” the matrix which makes the graph
+“undirected”.
+# corrected counts outputs
+DefaultAssay(seu) <- "scanorama.reconstructed"
+seu <- ScaleData(seu)
+seu <- RunPCA(seu, npcs = 50L, reduction.name = "pca.scanorama_counts")
+seu <- FindNeighbors(seu, reduction = "pca.scanorama_counts", dims = 1:30, return.neighbor = TRUE,
+ graph.name = "knn.scanorama_counts")
+seu <- SymmetrizeKnn(seu, graph.name = "knn.scanorama_counts")
+seu <- FindOptimalClusters(seu, graph.name = "knn.scanorama_counts_symmetric",
+ cluster.name = "scanorama_counts_{cell.var}_{metric}",
+ cell.var = cell.var,
+ optimisation.metric = c("nmi", "ari")) # default, compute both
+
+
+# dimension reduction outputs
+DefaultAssay(seu) <- "SCT"
+seu <- FindNeighbors(seu, reduction = "pca", dims = 1:30, k.param = 20L,
+ return.neighbor = TRUE, graph.name = "knn.unintegrated")
+seu <- SymmetrizeKnn(seu, graph.name = "knn.unintegrated")
+seu <- FindOptimalClusters(seu, graph.name = "knn.unintegrated_symmetric",
+ cluster.name = "unintegrated_{cell.var}_{metric}",
+ cell.var = cell.var)
+
+seu <- FindNeighbors(seu, reduction = "integrated.scanorama", dims = 1:30,
+ return.neighbor = TRUE, graph.name = "knn.scanorama_reduction")
+seu <- SymmetrizeKnn(seu, graph.name = "knn.scanorama_reduction")
+seu <- FindOptimalClusters(seu, graph.name = "knn.scanorama_reduction_symmetric",
+ cluster.name = "scanorama_reduction_{cell.var}_{metric}",
+ cell.var = cell.var)
+
+seu <- FindNeighbors(seu, reduction = "harmony", dims = 1:30,
+ return.neighbor = TRUE, graph.name = "knn.harmony")
+seu <- SymmetrizeKnn(seu, graph.name = "knn.harmony")
+seu <- FindOptimalClusters(seu, graph.name = "knn.harmony_symmetric", cell.var = cell.var,
+ cluster.name = "harmony_{cell.var}_{metric}")
+
+seu <- FindNeighbors(seu, reduction = "cca.integrated", dims = 1:30,
+ return.neighbor = TRUE, graph.name = "knn.cca")
+seu <- SymmetrizeKnn(seu, graph.name = "knn.cca")
+seu <- FindOptimalClusters(seu, graph.name = "knn.cca_symmetric", cell.var = cell.var,
+ cluster.name = "cca_{cell.var}_{metric}")
+
+seu <- FindNeighbors(seu, reduction = "rpca.integrated", dims = 1:30,
+ return.neighbor = TRUE, graph.name = "knn.rpca")
+seu <- SymmetrizeKnn(seu, graph.name = "knn.rpca")
+seu <- FindOptimalClusters(seu, graph.name = "knn.rpca_symmetric", cell.var = cell.var,
+ cluster.name = "rpca_{cell.var}_{metric}")
+
+seu <- FindNeighbors(seu, reduction = "integrated.scANVI",
+ return.neighbor = TRUE, graph.name = "knn.scanvi")
+seu <- SymmetrizeKnn(seu, graph.name = "knn.scanvi")
+seu <- FindOptimalClusters(seu, graph.name = "knn.scanvi_symmetric", cell.var = cell.var,
+ cluster.name = "scanvi_{cell.var}_{metric}")
+
+
+# graph outputs
+seu <- SymmetrizeKnn(seu, graph.name = "bbknn_ridge.residuals_distances")
+seu <- FindOptimalClusters(seu, graph.name = "bbknn_ridge.residuals_distances_symmetric",
+ cell.var = cell.var, cluster.name = "bbknn_{cell.var}_{metric}")
++Note: Instead of sticking with the default +
+FindNeighbors(return.neighbors = FALSE)
at the beginning, we could have switched it to +TRUE
right away, process theKNN
graph with +SymmetrizeKnn()
and use it for subsequent steps (umap, +clustering, etc.)
FindOptimalClusters()
adds metadata columns with the
+clustering results that maximized a metric (NMI or ARI) to the Seurat
+object:
+rmarkdown::paged_table(seu[[]][1:10, grep("CellType_[arinm]{3}$", colnames(seu[[]]))])
Now that we have computed objects ready to be scored, we will proceed
+with assessing each integration output. For this step, one can either
+use Score[score_name]()
and save scores in separate
+variables or use AddScore[score_name]()
to directly store
+scores within the Seurat object. The latter is far more convenient and
+allows to compare scores graphically. This is the strategy we are going
+to adopt here.
Last but not least, a cell-type variable is used in most scores, +hence it is highly recommended to have an estimate of each cell’s type +(or produce such an estimate) stored in the Seurat object as a column in +the metadata.
+++Note that if one doesn’t have a variable with cell-labels (used as +ground truth in multiple scores) he will have to produce it or do +without several scores. Alternatively, one can use automatic cell +annotation algorithms (e.g. the Azimuth package). One +can also have multiple cell label variables (e.g. Azimuth typically +returns as many cell label variables as levels of annotations contained +in the reference). Scores requiring cell type annotations always accept +multiple column names.
+
++There is a risk of confusion between cell annotations when using +automatic cell annotation tools. Furthermore, in the case of using +Azimuth to annotate cells, there is a specific risk of biasing results +by favouring RPCA integration because Azimuth uses RPCA to integrate the +query dataset onto the reference.
+
First of all, let’s organise outputs in lists:
+
+reductions <- list(
+ unintegrated = "pca",
+ scanorama_counts = "pca.scanorama_counts",
+ scanorama_reduction = "integrated.scanorama",
+ harmony = "harmony",
+ cca = "cca.integrated",
+ rpca = "rpca.integrated",
+ scanvi = "integrated.scANVI",
+ bbknn = NULL
+)
+
+graphs <- list(
+ unintegrated = "knn.unintegrated_symmetric",
+ scanorama_counts = "knn.scanorama_counts_symmetric",
+ scanorama_reduction = "knn.scanorama_reduction_symmetric",
+ harmony = "knn.harmony_symmetric",
+ cca = "knn.cca_symmetric",
+ rpca = "knn.rpca_symmetric",
+ scanvi = "knn.scanvi_symmetric",
+ bbknn = "bbknn_ridge.residuals_distances_symmetric"
+)
+
+integrations <- names(reductions)
Let’s finalise our preparations: :
+ScoreKBET()
+ScoreRegressPC.CellCycle()
(score of cell cycle
+conservation), run CellCycleScoringPerBatch()
beforehand
+(next chunk)
+seu <- CellCycleScoringPerBatch(seu, batch.var = batch.var,
+ s.features = cc.genes$s.genes,
+ g2m.features = cc.genes$g2m.genes)
Let’s now loop through the integration outputs:
+
+for (integ in integrations) {
+ reduc <- reductions[[integ]]
+ graph <- graphs[[integ]]
+ clust.var.ari <- paste(integ, cell.var, "ari", sep = "_") # produced by `FindOptimalClusters()`
+ clust.var.nmi <- paste(integ, cell.var, "nmi", sep = "_") # produced by `FindOptimalClusters()`
+ if (! is.null(reduc)) { # all TRUE except bbknn
+ seu <- AddScoreASW(seu, integration = integ, cell.var = cell.var,
+ what = reduc, dist.package = "distances")
+ seu <- AddScoreASWBatch(seu, integration = integ, batch.var = batch.var,
+ cell.var = cell.var, what = reduc,
+ dist.package = "distances")
+ seu <- AddScoreDensityPC(seu, integration = integ, batch.var = batch.var,
+ reduction = reduc)
+ seu <- AddScoreRegressPC(seu, integration = integ, batch.var = batch.var,
+ reduction = reduc)
+ seu <- AddScoreRegressPC.CellCycle(seu, integration = integ,
+ batch.var = batch.var, what = reduc,
+ compute.cc = FALSE) # because CellCycleScoringPerBatch was ran before
+ }
+ seu <- AddScoreARI(seu, integration = integ, cell.var = cell.var,
+ clust.var = clust.var.ari)
+ seu <- AddScoreNMI(seu, integration = integ, cell.var = cell.var,
+ clust.var = clust.var.nmi)
+ seu <- AddScoreConnectivity(seu, integration = integ, graph.name = graph,
+ cell.var = cell.var)
+ seu <- AddScoreKBET(seu, integration = integ, batch.var = batch.var,
+ cell.var = cell.var, what = reduc %||% sub("_symmetric$", "", graph),
+ graph.type = "distances", verbose = FALSE)
+ seu <- AddScoreLISI(seu, integration = integ, batch.var = batch.var,
+ cell.var = cell.var, reduction = reduc,
+ graph.name = if (is.null(reduc)) sub("_symmetric$", "", graph) else NULL,
+ graph.type = "distances", ncores = ncores)
+
+}
Now that we have computed multiple scores, we can look at them using
+IntegrationScores()
:
+rmarkdown::paged_table(IntegrationScores(seu))
Note that those scores are raw. Let’s scale them (and make them +comparable):
+
+seu <- ScaleScores(seu)
Once again, we can print them:
+
+rmarkdown::paged_table(IntegrationScores(seu, scaled = TRUE))
To readily compare the integrations, let’s plot the scores:
+
+PlotScores(seu)
One might notice a difference in scale for some of the scores, if +comparing the plot with the table just before. This is the case for the +PCA density score for instance.
+
+print(sort(IntegrationScores(seu, scaled = TRUE)$PCA.density))
## [1] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.05713768 0.12268377
+Indeed, PlotScores()
rescales the scores using min-max
+normalisation by default (rescale = TRUE
). One might chose
+to disable it:
+PlotScores(seu, rescale = FALSE)
We notice that PCA based methods output very low scores in this case. +Since they cannot be computed on knn graphs, scores are biased in favour +of BBKNN. You can exclude some scores (and recompute overall scores on +the fly)
+
+PlotScores(seu, rescale = FALSE, exclude.score = c("pca.density", "pca.regression"))
You can chose a different type of plot (radar
or
+lollipop
):
+library(ggplot2)
+PlotScores(seu, plot.type = "radar") +
+ # reduce overlap between axis names and figures
+ theme(legend.position = "top", panel.spacing = unit(3, "cm"),
+ plot.margin = margin(r = 3, l = 3, unit = "cm"),
+ axis.text.x = element_text(size = 10))
## Warning: Removed 5 rows containing missing values or values outside the scale range
+## (`geom_point()`).
+In this last plot, we also exclude the non-integrated dataset. Since
+the rescale
argument is true by default, the scores are
+rescaled without the excluded integration’s scores.
+ PlotScores(seu, plot.type = "lollipop",
+ exclude.integration = "unintegrated")
## Warning: Removed 5 rows containing missing values or values outside the scale range
+## (`geom_point()`).
+## Warning: Removed 5 rows containing missing values or values outside the scale range
+## (`geom_errorbarh()`).
+We want to compare the UMAP embeddings. For this, we first compute +the dimension reductions:
+
+for (integ in integrations) {
+ reduc <- reductions[[integ]]
+ if (! is.null(reduc)) { # all except BBKNN
+ seu <- RunUMAP(seu, dims = 1:min(30, ncol(seu[[reduc]])), reduction = reduc,
+ reduction.name = paste0(integ, ".umap"))
+ }
+}
As BBKNN’s output is a graph, we need to use the umap Python +package:
+
+if (! reticulate::condaenv_exists('umap_0.5.4')) {
+ reticulate::conda_create('umap_0.5.4', packages = 'umap=0.5.4')
+}
+
+library(future)
+plan(multisession)
+seu %<-% {
+ reticulate::use_condaenv('umap_0.5.4')
+ RunUMAP(seu, graph = "bbknn_ridge.residuals_connectivities", umap.method = "umap-learn",
+ n.epochs = 200L, reduction.name = "bbknn.umap") }
+seu
+plan(sequential)
+library(ggplot2)
+plot_list <- sapply(integrations, function(integ) {
+ DimPlot(seu, reduction = paste0(integ, ".umap"), group.by = batch.var) +
+ ggtitle(integ) +
+ theme(axis.title = element_blank())
+}, simplify = FALSE)
+
+patchwork::wrap_plots(plot_list, guides = "collect")
+library(ggplot2)
+plot_list <- sapply(integrations, function(integ) {
+ DimPlot(seu, reduction = paste0(integ, ".umap"), group.by = cell.var) +
+ ggtitle(integ) +
+ theme(axis.title = element_blank())
+}, simplify = FALSE)
+
+patchwork::wrap_plots(plot_list, guides = "collect")
There are several observations to be made here, that require further +explanations:
+CellType
, highlighting the importance of having cell
+annotations of sufficient quality to be considered suitable as ground
+truth (which is actually not the case here)SCT
normalisation, which
+is much more efficient at masking batch-specific differences than the
+classical LogNorm
. Thus, the inter-batch differences are
+not driving the principal components.ScaleScores()
produce scores that can have different
+scales (as long as rescale = FALSE
) . Thus, min-max
+rescaling is used by default in PlotScores()
, to balance
+each score’s contribution to the overall scores. This is especially
+suited for comparing a large number of integrations. However, it has
+some drawbacks: it can heavily distort the scale of scores when their
+maximum or minimum are far from 1 or 0 respectively (e.g. all cLISI
+scores are above 0.9). Hence, the final decision on
+whether to use min-max rescaling is left to the user’s
+discretion. ## R version 4.4.2 (2024-10-31)
+ ## Platform: x86_64-pc-linux-gnu
+ ## Running under: Linux Mint 21
+ ##
+ ## Matrix products: default
+ ## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
+ ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+ ##
+ ## locale:
+ ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ ## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
+ ## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
+ ## [9] LC_ADDRESS=C LC_TELEPHONE=C
+ ## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
+ ##
+ ## time zone: Europe/Paris
+ ## tzcode source: system (glibc)
+ ##
+ ## attached base packages:
+ ## [1] stats graphics grDevices utils datasets methods base
+ ##
+ ## other attached packages:
+ ## [1] future_1.34.0 ggplot2_3.5.1 SeuratIntegrate_0.4.0
+ ## [4] Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4
+ ##
+ ## loaded via a namespace (and not attached):
+ ## [1] fs_1.6.5 matrixStats_1.4.1
+ ## [3] spatstat.sparse_3.1-0 httr_1.4.7
+ ## [5] RColorBrewer_1.1-3 tools_4.4.2
+ ## [7] sctransform_0.4.1 backports_1.5.0
+ ## [9] R6_2.5.1 ResidualMatrix_1.14.1
+ ## [11] lazyeval_0.2.2 uwot_0.2.2
+ ## [13] mgcv_1.9-1 withr_3.0.2
+ ## [15] gridExtra_2.3 progressr_0.15.1
+ ## [17] cli_3.6.3 Biobase_2.64.0
+ ## [19] textshaping_0.4.1 spatstat.explore_3.3-3
+ ## [21] fastDummies_1.7.4 labeling_0.4.3
+ ## [23] sass_0.4.9 spatstat.data_3.1-4
+ ## [25] genefilter_1.86.0 ggridges_0.5.6
+ ## [27] pbapply_1.7-2 pkgdown_2.1.1
+ ## [29] systemfonts_1.1.0 hcabm40k.SeuratData_3.0.0
+ ## [31] harmony_1.2.3 parallelly_1.41.0
+ ## [33] limma_3.60.6 rstudioapi_0.17.1
+ ## [35] RSQLite_2.3.9 FNN_1.1.4.1
+ ## [37] generics_0.1.3 ica_1.0-3
+ ## [39] spatstat.random_3.3-2 dplyr_1.1.4
+ ## [41] Matrix_1.7-1 S4Vectors_0.42.1
+ ## [43] abind_1.4-8 lifecycle_1.0.4
+ ## [45] yaml_2.3.10 edgeR_4.2.2
+ ## [47] SummarizedExperiment_1.34.0 SparseArray_1.4.8
+ ## [49] Rtsne_0.17 glmGamPoi_1.16.0
+ ## [51] grid_4.4.2 blob_1.2.4
+ ## [53] ifnb.SeuratData_3.1.0 promises_1.3.2
+ ## [55] crayon_1.5.3 miniUI_0.1.1.1
+ ## [57] lattice_0.22-6 beachmat_2.20.0
+ ## [59] cowplot_1.1.3 annotate_1.82.0
+ ## [61] KEGGREST_1.44.1 pillar_1.10.0
+ ## [63] knitr_1.49 GenomicRanges_1.56.2
+ ## [65] kBET_0.99.6 future.apply_1.11.3
+ ## [67] codetools_0.2-20 leiden_0.4.3.1
+ ## [69] glue_1.8.0 spatstat.univar_3.1-1
+ ## [71] data.table_1.16.4 vctrs_0.6.5
+ ## [73] png_0.1-8 spam_2.11-0
+ ## [75] gtable_0.3.6 cachem_1.1.0
+ ## [77] xfun_0.49 S4Arrays_1.4.1
+ ## [79] mime_0.12 survival_3.8-3
+ ## [81] pbmcref.SeuratData_1.0.0 SingleCellExperiment_1.26.0
+ ## [83] statmod_1.5.0 fitdistrplus_1.2-1
+ ## [85] ROCR_1.0-11 nlme_3.1-166
+ ## [87] bit64_4.5.2 RcppAnnoy_0.0.22
+ ## [89] GenomeInfoDb_1.40.1 bslib_0.8.0
+ ## [91] irlba_2.3.5.1 KernSmooth_2.23-24
+ ## [93] colorspace_2.1-1 BiocGenerics_0.50.0
+ ## [95] DBI_1.2.3 tidyselect_1.2.1
+ ## [97] bit_4.5.0.1 compiler_4.4.2
+ ## [99] BiocNeighbors_1.22.0 desc_1.4.3
+ ## [101] DelayedArray_0.30.1 plotly_4.10.4
+ ## [103] scales_1.3.0 lmtest_0.9-40
+ ## [105] distances_0.1.11 rappdirs_0.3.3
+ ## [107] stringr_1.5.1 digest_0.6.37
+ ## [109] goftest_1.2-3 spatstat.utils_3.1-1
+ ## [111] rmarkdown_2.29 XVector_0.44.0
+ ## [113] RhpcBLASctl_0.23-42 htmltools_0.5.8.1
+ ## [115] pkgconfig_2.0.3 sparseMatrixStats_1.16.0
+ ## [117] MatrixGenerics_1.16.0 fastmap_1.2.0
+ ## [119] rlang_1.1.4 htmlwidgets_1.6.4
+ ## [121] UCSC.utils_1.0.0 shiny_1.10.0
+ ## [123] DelayedMatrixStats_1.26.0 farver_2.1.2
+ ## [125] jquerylib_0.1.4 zoo_1.8-12
+ ## [127] jsonlite_1.8.9 BiocParallel_1.38.0
+ ## [129] BiocSingular_1.20.0 magrittr_2.0.3
+ ## [131] scuttle_1.14.0 GenomeInfoDbData_1.2.12
+ ## [133] dotCall64_1.2 patchwork_1.3.0
+ ## [135] pbmcsca.SeuratData_3.0.0 munsell_0.5.1
+ ## [137] Rcpp_1.0.13-1 reticulate_1.40.0
+ ## [139] stringi_1.8.4 zlibbioc_1.50.0
+ ## [141] MASS_7.3-61 plyr_1.8.9
+ ## [143] parallel_4.4.2 listenv_0.9.1
+ ## [145] ggrepel_0.9.6 forcats_1.0.0
+ ## [147] deldir_2.0-4 Biostrings_2.72.1
+ ## [149] splines_4.4.2 tensor_1.5
+ ## [151] locfit_1.5-9.10 lisi_1.0
+ ## [153] bonemarrowref.SeuratData_1.0.0 igraph_2.1.2
+ ## [155] spatstat.geom_3.3-4 heartref.SeuratData_1.0.0
+ ## [157] RcppHNSW_0.6.0 reshape2_1.4.4
+ ## [159] stats4_4.4.2 ScaledMatrix_1.12.0
+ ## [161] XML_3.99-0.17 evaluate_1.0.1
+ ## [163] tweenr_2.0.3 httpuv_1.6.15
+ ## [165] batchelor_1.20.0 bmcite.SeuratData_0.3.0
+ ## [167] RANN_2.6.2 tidyr_1.3.1
+ ## [169] purrr_1.0.2 polyclip_1.10-7
+ ## [171] SeuratData_0.2.2.9001 scattermore_1.2
+ ## [173] ggforce_0.4.2 rsvd_1.0.5
+ ## [175] broom_1.0.7 xtable_1.8-4
+ ## [177] RSpectra_0.16-2 later_1.4.1
+ ## [179] viridisLite_0.4.2 ragg_1.3.3
+ ## [181] tibble_3.2.1 memoise_2.0.1
+ ## [183] AnnotationDbi_1.66.0 IRanges_2.38.1
+ ## [185] cluster_2.1.8 sva_3.52.0
+ ## [187] globals_0.16.3
+SeuratIntegrate provides access to R and Python methods to correct +batch effect:
+Method | +Type | +Implementation | +Underlying algorithm | +Reference | +
---|---|---|---|---|
ComBat | ++sva +(Bioconductor) | +Empirical Bayes adjustment | +Johnson et al., +2007 | +|
Harmony | ++harmony +(CRAN) | +Iterative embedding correction | +Korsunsky et al., +2019 | +|
MNN | ++batchelor +(Bioconductor) | +Mutual Nearest Neighbors | +Haghverdi +et al., 2018 | +|
BBKNN | ++bbknn +(GitHub) | +Batch-balanced nearest neighbors | +Polański et al., +2020 | +|
scVI | ++scvi-tools +(GitHub) | +Variational autoencoder | +Lopez et al., +2018 | +|
scANVI | ++scvi-tools +(GitHub) | +Semi-supervised variational autoencoder | +Xu et +al., 2021 | +|
Scanorama | ++scanorama (GitHub) | +Manifold alignment | +Hie et al., +2019 | +|
trVAE | ++scArches (GitHub) | +Conditional variational autoencoder | +Lotfollahi et al., +2020 | +
SeuratIntegrate provides a new interface to integrate the layers of
+an object: DoIntegrate()
. Moreover, SeuratIntegrate is
+compatible with CCA
and RPCA
(included in Seurat) and
+FastMNN
(from SeuratWrappers)
The table below summarizes the integrations methods you can run with
+DoIntegrate()
along with the expected inputs and produced
+outputs of each method.
Method | +Package | +Function | +Input | +Output(s) | +
---|---|---|---|---|
ComBat | +SeuratIntegrate | +CombatIntegration() |
+Layer | +Layer with corrected counts | +
Harmony | +SeuratIntegrate | +HarmonyIntegration() |
+PCA | +DimReduc (corrected PCA) | +
MNN | +SeuratIntegrate | +MNNIntegration() |
+Layer | +Layer with corrected counts | +
BBKNN | +SeuratIntegrate | +bbknnIntegration() |
+PCA | +Graphs (edge weights can be distances or +connectivities) | +
scVI | +SeuratIntegrate | +scVIIntegration() |
+Layer | +DimReduc (latent space) | +
scANVI | +SeuratIntegrate | +scANVIIntegration() |
+Layer | +DimReduc (latent space) | +
Scanorama | +SeuratIntegrate | +ScanoramaIntegration() |
+Layer | +
+ Layer with corrected counts +DimReduc (corrected embedding) + |
+
trVAE | +SeuratIntegrate | +trVAEIntegration() |
+
+ Data layer Counts layer |
+DimReduc (latent space) | +
CCA | +Seurat | +CCAIntegration() |
+PCA | +DimReduc (corrected PCA) | +
RPCA | +Seurat | +RPCAIntegration() |
+PCA | +DimReduc (corrected PCA) | +
FastMNN | +SeuratWrappers | +FastMNNIntegration() |
+Layer | +
+ Layer with corrected counts +DimReduc (corrected PCA) + |
+
SeuratIntegrate incorporates 11 scoring metrics: 6 quantify the
+degree of batch mixing (batch
+correction), while 5 assess the preservation of biological
+differences
(bio-conservation) based
+on ground truth cell type labels.
Below is a table summarising each score’s input and type:
+Score name | +Require a cell type variable | +Require a clustering variable | +Input | +Score type | +
---|---|---|---|---|
Cell cycle regression | ++ | + | ||
PCA regression | ++ | + | ||
PCA density | ++ | + | ||
ASW batch | ++ | |||
ASW | ++ | |||
ARI | ++ | |||
NMI | ++ | |||
cLISI | ++ |
+ |
+||
iLISI | ++ |
+ |
+||
kBET | ++ |
+ |
+||
Graph connectivity | +
+per.component = TRUE ) |
++ |
Most scores are computed on an embedding
+(
Seurat::DimReduc
object) or a graph (
Seurat::Neighbor
or
+Seurat::Graph
object). The exceptions are ARI and NMI,
+which compare two categorical variables thus don’t need anything else
+than a cell-type and a cluster assignment variables.d anything else than
+a cell-type and a cluster assignment variables.
Most scores are based on a cell type label variable. This consists in +an estimate of each cell’s type obtained by analysing each batch +separately or by using an automatic cell annotation algorithm. This +estimate of cell types must be of sufficient quality to be considered +suitable as ground truth.
+vignettes/setup_and_tips.Rmd
+ setup_and_tips.Rmd
SeuratIntegrate
’s main purpose is to extend the range of
+scRNA-seq integration tools available in R and compatible with
+Seurat
. Many of them being solely available in Python, we
+developed wrappers leveraging reticulate
package’s
+capabilities. reticulate
enables to directly call Python
+from R, give that conda environments have been set up beforehand.
The purpose of this vignette is to illustrate and ease the +installation and the use of those environments.
+Prior to starting, make sure you have:
+PATH
(or that you
+know its location)We need multiple conda environments. One for bbknn
, one
+for Scanorama
, one for
+scVI
/scANVI
and one for trVAE
. If
+you don’t plan on using any or some of these methods, whether you decide
+to set up their respective conda environments is up to you.
One the contrary, if you already have some conda environments with
+appropriate libraries on your machine, you can tell
+SeuratIntegrate
to use them. Let’s see how.
Have look at your CondaEnvManager
:
+getCache()
It’s a handy way to have a glance at all the implemented methods and
+the status of their conda environments. ComBat
,
+Harmony
and MNN
are R-based methods and don’t
+need any conda environment. The rest of them however are Python-based
+and will function through reticulate
, hence require conda
+environments.
If you don’t have any conda environments for them, look at the next sub-section. Conversely, if you want to add +an existing conda environment, directly go to the following one.
+SeuratIntegrate
+Note that the commands below have only been tested on Linux +distributions
+Try the following commands (be aware that execution might take +time):
+
+UpdateEnvCache('bbknn')
+UpdateEnvCache('scvi')
+UpdateEnvCache('scanorama')
+UpdateEnvCache('trvae')
Note that:
+conda.bin = /path/to/conda
+scVI
and scANVI
share the same
+environment. Hence, it is not necessary to run both
+UpdateEnvCache('scvi')
and
+UpdateEnvCache('scanvi')
+Have look at your CondaEnvManager
:
+getCache()
If you already have one (several) existing conda environment(s) for
+one (some) of the methods, you can tell SeuratIntegrate
to
+use it (them). Similarly, if you run into problems with
+UpdateEnvCache()
commands above, the alternative is to set
+up conda environments yourself and provide them to
+SeuratIntegrate
. Whatever the case, let’s proceed.
You’ll use UpdateEnvCache()
. You can specify the name of
+the conda environment or the path to it. By default,
+UpdateEnvCache()
will try to decide whether the provided
+value for conda.env
is a path or a name based on simple
+tests. To avoid any misinterpretation, you can use
+conda.env.is.path = TRUE
or FALSE
when your
+input is the path or the name of the environment, respectively.
+But beware not to make mistakes !!!
See examples below. You should adapt the arguments to your +situation:
+
+# environment for bbknn
+UpdateEnvCache('bbknn', conda.env = 'bbknn_env',
+ conda.env.is.path = FALSE) # default 'auto' would work
+
+# environment for bbknn in ./bbknn_env/
+UpdateEnvCache('bbknn', conda.env = 'bbknn_env',
+ conda.env.is.path = TRUE)
+
+# environment for bbknn, conda binary not in PATH
+UpdateEnvCache('bbknn', conda.env = 'bbknn_env', conda.bin = 'cutom/location/conda')
+
+# path for scvi-tools
+UpdateEnvCache('scvi', conda.env = '~/miniconda3/envs/scvi-tools_env',
+ conda.env.is.path = TRUE) # default 'auto' would work
Note that:
+conda.bin = /path/to/conda
+conda.bin
must correspond to the conda managing the
+conda.env
+scVI
and scANVI
share the same
+environment. Hence, it is not necessary to run both
+UpdateEnvCache('scvi')
and
+UpdateEnvCache('scanvi')
+Now you can use the Python-based methods !
+If you want to update a conda
environment, use
+# change 'method' by the name of the method
+UpdateEnvCache(method = 'method', overwrite.env = TRUE, ...)
To unset an environment, use
+
+# change 'method' by the name of the method
+resetCache(method = 'method')
It can happen that a conda environment cannot be installed on a +specific machine or os. In this case, there is hardly anything better to +do than browse the internet in hope that someone else has experienced a +similar problem and that the solution is public. Otherwise, try to +modify the set of packages to install, be more or less stringent with +package versions to install, etc. You can also create a conda +environment with Python and pip, and then to try to install packages +with pip rather than conda.
+Once the problem is solved (if it can be), you can save the new
+environment to the CondaEnvManager
with
+# change 'method' by the name of the method
+# change'difficult_conda_env' by the name of the working conda environment
+UpdateEnvCache(method = 'method', conda.env = 'difficult_conda_env')
It can happen that a conda environment does not work (or stops +working) on a specific machine. Below are some potential causes of +conflicts between Python libraries, conda itself or other components +that could lead to malfunctions (non exhaustive list):
+Check if the same command works outside of an rmarkdown (e.g. in a R +script), once you have closed any Rmardown file and closed the R session +(and restrated RStudio). This is something to consider notably when you +encounter an error like:
+ImportError: /opt/conda/envs/[env_name]/lib/python3.10/site-packages/PIL/../../../libtiff.so.6: undefined symbol: jpeg12_write_raw_data, version LIBJPEG_8.0
+Check if the same command works outside of RStudio. For instance, if +there is an error during an integration because scanpy cannot be +imported, try:
+# from a terminal
+R
+
+First, try to update reticulate. If it doesn’t work any better, check +if someone has encountered the same issue (browse the web or the issues +on the reticulate github repository. If nothing works, either post an +issue on the reticulate github repos or retry to import after you have +installed different Python package versions.
+If this is a problem withmkl
, install the conda
+environment again with the arguments nomkl
(and
+-c conda-forge
if not already).
+This is more tricky. But some packages are know to be incompatible.
+For instance, jax and jaxlib work better when their versions are
+identical. In my experience, the scvi-tools
environment can
+be set up with two discrepant versions of jax and jaxlib. To check,
+try:
conda list -n [conda_env_name] jax
+If the following bullet points seem obscure, further explanations are +given in the sections below. In brief,
++++
+- +disable future for R-based methods +(
+DoIntegrate([...], use.future = FALSE)
)- +never use
+CCA
andRPCA
+integration methods with multisession (previous advice prevents this, +especially for Windows users)- multicore futures are faster and less memory-intensive than +multisession, but unstable on RStudio and unavailable on +Windows +
+- to force
+DoIntegrate()
to use a multicore framework (at +your own risk), set +options(parallelly.fork.enable = FALSE)
. +Unavailable on Windows +
A R session can only initialise one Python environment at a time via
+reticulate. This known
+limitation
+of reticualte
is overcome by launching a “background
+session” using Future
. The environment is initialised
+there instead of the main user’s R session. This feature is embedded in
+DoIntegrate()
.
Futures are therefore useless for R-based methods and should be
+disabled with DoIntegrate([...], use.future = FALSE)
.
+Worse, it is discouraged with
+CCAIntegration
and RPCAIntegration
+(explanations are in the final part of the vignette)
In the vast majority of cases, the impact of the “futures” +is insignificant. The most obvious are the few needed seconds to launch +the future session and export globals, in addition to the reordering +of stdout and message output, resulting in messy and less +informative print statements intended for the user.
+The package implements sequential, multicore, multisession, and
+cluster futures. SeuratIntegrate
only uses the multicore
+and multisession ones. DoIntegrate()
automatically picks
+the preferred one based on the output of
+parallelly::supportsMulticore()
## [1] FALSE
+Here multicore is not supported, thus DoIntegrate()
+would start a multisession. Further explanations regarding settings
+giving priority to multicore are available in the function’s help (the
+important part is in the disclosure widget
+?supportsMulticore
)
+help('supportsMulticore', package = 'parallelly')
?supportsMulticore
+++Support for process forking + +
+While R supports forked processing on Unix-like operating system such as +Linux and macOS, it does not on the Microsoft Windows operating system. +
+For some R environments it is considered unstable to perform parallel +processing based on forking. +This is for example the case when using RStudio, cf. +RStudio Inc. recommends against using forked processing when running R from within the RStudio software. +This function detects when running in such an environment and returns +
+ + +FALSE
, despite the underlying operating system supports forked processing. +A warning will also be produced informing the user about this the first +time time this function is called in an R session. +This warning can be disabled by setting R option +parallelly.supportsMulticore.unstable
, or environment variable +R_PARALLELLY_SUPPORTSMULTICORE_UNSTABLE to"quiet"
. +Enable or disable forked processing + +
+It is possible to disable forked processing for futures by setting R +option
+ +parallelly.fork.enable
toFALSE
. Alternatively, one can +set environment variable R_PARALLELLY_FORK_ENABLE tofalse
. +Analogously, it is possible to override disabled forking by setting one +of these toTRUE
. +
In a nutshell, multicore is
options(parallelly.fork.enable = TRUE)
+The main reason for using multicore is that FORKing is considered to +be faster and result in a lower memory overhead than PSOCK +(i.e. multisession) (see there +for technical details)
+Furthermore, DoIntegrate()
uses not only future, but
+also NSE (non-standard evaluation), enabling to specify arguments within
+each integration function call. Briefly, each call inside
+DoIntegrate()
(such as
+DoIntegrate(bbknnIntegration())
) is not directly executed
+but captured to be executed later on (so that the proper value
+for the object
parameter can be passed on to
+bbknnIntegration
for instance). Further details are
+available in the book Advanced
+R by Hadley Wickham.
The important part is that, unlike for multicore, on a multisession
+future, DoIntegrate()
evaluates each argument before
+launching the integration on the background R session. Thus, a Seurat
+Assay object is passed as a str
ucture (output of
+str(object)
). This takes time and makes the call extremely
+long.
It has a unexpected side-effect. It slows down
+CCAIntegration
and RPCAIntegration
a lot when
+they are input with Seurat objects normalised with
+SCTransform()
. Indeed, they both call
+FindIntegrationAnchors()
-> merge()
->
+merge.SCTAssay()
. The latter performs a grep
+on the previous calls (output of sys.calls()
). In a
+multisession future, big objects such as the Seurat Assay object are
+passed as a str
ucture and the grep
can be
+unnecessarily time-consuming. To avoid this, one can either specify
+use.future = FALSE
for the R-based method (this is always
+preferable) or at least ban the use of CCA
and
+RPCA
integrations with a multisession future (note that
+Windows users can only pick the first option).