Skip to content

Using the graph.name argument generates clustering that is different when not using the graph.name argument? #4155

@Pancreas-Pratik

Description

@Pancreas-Pratik

Thank you Satija Lab for the incredible work that has been done with Seurat! You guys are miracle workers!

Here is the bug:

I am using the pbm3k dataset from SeuratData as the reproducible example.

I think there is a bug when using the graph.name argument. Please correct me if I'm wrong.

Simply using the graph.name argument generates clustering that is different when not using the graph.name argument. (either more or less clusters, more or less cells in clusters)

Similar clustering differences happen when using the default algorithm for FindClusters as well. Here I used algorithm 4.

I thought specifying a graph.name should keep everything consistent. I actually prefer the clustering generated with the graph.name argument included in my dataset. I'm wondering what is happening.

Here is what was done:

library(dplyr)
library(Seurat)
library(patchwork)

# Load the PBMC dataset
SeuratData::InstallData("pbmc3k")
pbmc <- SeuratData::LoadData("pbmc3k")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
#Without graph.name argument
#pbmc <- FindNeighbors(pbmc, graph.name ="test", dims = 1:10)
#pbmc <- FindClusters(pbmc, graph.name = "test", algorithm = 4, resolution = 0.5)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, algorithm = 4, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 6)

Here is the full workflow for with the graph.name argument:

> pbmc <- SeuratData::LoadData("pbmc3k")
> # Initialize the Seurat object with the raw (non-normalized data).
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
> pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> all.genes <- rownames(pbmc)
> pbmc <- ScaleData(pbmc, features = all.genes)
Centering and scaling data matrix
  |============================================================================================| 100%
> pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
PC_ 1 
Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
	   FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
	   PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
	   CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
	   MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
PC_ 2 
Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
	   HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
	   BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
	   CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
	   TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
PC_ 3 
Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
	   HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
	   PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B 
Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
	   HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
	   NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
PC_ 4 
Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A 
	   SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC 
	   GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
	   AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
	   LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
PC_ 5 
Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
	   GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5 
	   RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
	   PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B 
	   FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB 
> pbmc <- FindNeighbors(pbmc, graph.name ="test", dims = 1:10)
Computing nearest neighbor graph
Computing SNN
> pbmc <- FindClusters(pbmc, graph.name = "test", algorithm = 4, resolution = 0.5)
> pbmc <- RunUMAP(pbmc, dims = 1:10)
15:37:13 UMAP embedding parameters a = 0.9922 b = 1.112
15:37:13 Read 2638 rows and found 10 numeric columns
15:37:13 Using Annoy for neighbor search, n_neighbors = 30
15:37:13 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:14 Writing NN index file to temp file /tmp/Rtmp66O0pD/filee2d579073b5
15:37:14 Searching Annoy index using 1 thread, search_k = 3000
15:37:15 Annoy recall = 100%
15:37:15 Commencing smooth kNN distance calibration using 1 thread
15:37:16 Initializing from normalized Laplacian + noise
15:37:16 Commencing optimization for 500 epochs, with 105124 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:21 Optimization finished
> DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 6)

with-graph-name

Here is the full workflow **without** `graph.name` argument:
> pbmc <- SeuratData::LoadData("pbmc3k")
> # Initialize the Seurat object with the raw (non-normalized data).
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
> pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> all.genes <- rownames(pbmc)
> pbmc <- ScaleData(pbmc, features = all.genes)
Centering and scaling data matrix
  |============================================================================================| 100%
> pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
PC_ 1 
Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
	   FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
	   PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
	   CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
	   MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
PC_ 2 
Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
	   HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
	   BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
	   CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
	   TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
PC_ 3 
Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
	   HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
	   PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B 
Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
	   HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
	   NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
PC_ 4 
Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A 
	   SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC 
	   GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
	   AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
	   LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
PC_ 5 
Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
	   GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5 
	   RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
	   PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B 
	   FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB 
> #Without graph.name argument
> #pbmc <- FindNeighbors(pbmc, graph.name ="test", dims = 1:10)
> #pbmc <- FindClusters(pbmc, graph.name = "test", algorithm = 4, resolution = 0.5)
> pbmc <- FindNeighbors(pbmc, dims = 1:10)
Computing nearest neighbor graph
Computing SNN
> pbmc <- FindClusters(pbmc, algorithm = 4, resolution = 0.5)
> pbmc <- RunUMAP(pbmc, dims = 1:10)
15:38:52 UMAP embedding parameters a = 0.9922 b = 1.112
15:38:52 Read 2638 rows and found 10 numeric columns
15:38:52 Using Annoy for neighbor search, n_neighbors = 30
15:38:52 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:38:53 Writing NN index file to temp file /tmp/Rtmp66O0pD/filee2d55ca00c5
15:38:53 Searching Annoy index using 1 thread, search_k = 3000
15:38:54 Annoy recall = 100%
15:38:54 Commencing smooth kNN distance calibration using 1 thread
15:38:55 Initializing from normalized Laplacian + noise
15:38:55 Commencing optimization for 500 epochs, with 105124 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:39:00 Optimization finished
> DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 6)

without-graph-name

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pbmc3k.SeuratData_3.1.4 patchwork_1.1.1         SeuratObject_4.0.0      Seurat_4.0.0           
[5] dplyr_1.0.4            

loaded via a namespace (and not attached):
  [1] nlme_3.1-152         matrixStats_0.58.0   RcppAnnoy_0.0.18     RColorBrewer_1.1-2  
  [5] httr_1.4.2           sctransform_0.3.2    tools_4.0.4          utf8_1.1.4          
  [9] R6_2.5.0             irlba_2.3.3          rpart_4.1-15         KernSmooth_2.23-18  
 [13] uwot_0.1.10          mgcv_1.8-33          DBI_1.1.1            lazyeval_0.2.2      
 [17] colorspace_2.0-0     withr_2.4.1          tidyselect_1.1.0     gridExtra_2.3       
 [21] compiler_4.0.4       cli_2.3.1            plotly_4.9.3         labeling_0.4.2      
 [25] scales_1.1.1         lmtest_0.9-38        spatstat.data_2.0-0  ggridges_0.5.3      
 [29] pbapply_1.4-3        rappdirs_0.3.3       spatstat_1.64-1      goftest_1.2-2       
 [33] stringr_1.4.0        digest_0.6.27        spatstat.utils_2.0-0 pkgconfig_2.0.3     
 [37] htmltools_0.5.1.1    parallelly_1.23.0    fastmap_1.1.0        htmlwidgets_1.5.3   
 [41] rlang_0.4.10         rstudioapi_0.13      shiny_1.6.0          farver_2.0.3        
 [45] generics_0.1.0       zoo_1.8-8            jsonlite_1.7.2       ica_1.0-2           
 [49] magrittr_2.0.1       Matrix_1.3-2         Rcpp_1.0.6           munsell_0.5.0       
 [53] fansi_0.4.2          abind_1.4-5          reticulate_1.18      lifecycle_1.0.0     
 [57] stringi_1.5.3        MASS_7.3-53.1        Rtsne_0.15           plyr_1.8.6          
 [61] grid_4.0.4           parallel_4.0.4       listenv_0.8.0        promises_1.2.0.1    
 [65] ggrepel_0.9.1        crayon_1.4.1         miniUI_0.1.1.1       deldir_0.2-10       
 [69] lattice_0.20-41      cowplot_1.1.1        splines_4.0.4        tensor_1.5          
 [73] pillar_1.5.0         igraph_1.2.6         future.apply_1.7.0   reshape2_1.4.4      
 [77] codetools_0.2-18     leiden_0.3.7         glue_1.4.2           data.table_1.14.0   
 [81] BiocManager_1.30.10  vctrs_0.3.6          png_0.1-7            httpuv_1.5.5        
 [85] gtable_0.3.0         RANN_2.6.1           purrr_0.3.4          polyclip_1.10-0     
 [89] tidyr_1.1.2          SeuratData_0.2.1     scattermore_0.7      future_1.21.0       
 [93] assertthat_0.2.1     ggplot2_3.3.3        xfun_0.21            mime_0.10           
 [97] xtable_1.8-4         RSpectra_0.16-0      later_1.1.0.1        survival_3.2-7      
[101] viridisLite_0.3.0    tibble_3.1.0         tinytex_0.29         cluster_2.1.1       
[105] globals_0.14.0       fitdistrplus_1.1-3   ellipsis_0.3.1       ROCR_1.0-11   

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions