Problem in buildNhoodGraph on a large dataset #290

moinfar · 2023-09-01T14:20:42Z

Describe the bug
Hi,
First, thanks for providing and maintenance of the package.
I am using Milo on a dataset of 588k cells. I am trying to run Milo in Jupyter environment using the rpy2 interface (like here).
Everything is fine when I subset the cells to 10% (59k cells). However, when I run it on the whole data with 588k cells, I face an error on the buildNhoodGraph function. It looks like an overflow problem, but I am not sure. Unfortunately, the data is too big for me to upload. But please let me know if I can provide anything else.

Minimum code example
Minimum example to reproduce the error

# python
import os
import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import anndata as ad
import anndata2ri
import rpy2
from rpy2.robjects import r

sc.settings.set_figure_params(dpi=300, frameon=False)

anndata2ri.activate()
%load_ext rpy2.ipython

%%R
library(miloR)
library(igraph)
library(dplyr)
library(ggplot2)
library(scater)
library(patchwork)

MILO_K = 150

# python
oroginal_adata = sc.read("some/path.h5ad")
adata = oroginal_adata
# adata = sc.pp.subsample(oroginal_adata, fraction=0.1, copy=True)

sc.pp.neighbors(adata, n_neighbors=10, use_rep='latent')
sc.tl.umap(adata)

adata_no_knn = adata.copy()
adata_no_knn.obsp = None
adata_no_knn.uns.pop("neighbors")
adata_no_knn

knn_adjacency = adata.obsp["connectivities"]

%%R magic_args="-i adata_no_knn"
LATENT_DIM = ncol(reducedDim(adata_no_knn, 'latent'))
milo <- Milo(adata_no_knn)
milo

%%R magic_args="-i knn_adjacency"
milo_graph <- buildFromAdjacency(knn_adjacency, k=MILO_K, is.binary=TRUE)
graph(milo) <- miloR::graph(milo_graph)
milo <- buildGraph(milo, k=MILO_K, d=LATENT_DIM, reduced.dim='latent')
milo <- makeNhoods(milo, prop = 0.1, k = MILO_K, d=LATENT_DIM, refined = TRUE, reduced_dims='latent')
plotNhoodSizeHist(milo)

%%R
meta_data = data.frame(colData(milo))[,c("sample_id", "Site", "condition")]
milo <- countCells(milo, meta.data = meta_data, sample="sample_id")

## Calculate distances between cells in neighbourhoods
## for spatial FDR correction
milo <- calcNhoodDistance(milo, d=LATENT_DIM, reduced.dim='latent')

design_df <- distinct(meta_data)
rownames(design_df) <- design_df$sample_id
da_results <- testNhoods(milo, design = ~ Site + condition, design.df = design_df, reduced.dim='latent')

%%R
tryCatch({
    milo <- buildNhoodGraph(milo)
}, error=function(err) {
    print(err)
})

Full error traceback

Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In int2i(as.integer(i), n) : NAs introduced by coercion to integer range
Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed

Session info
Output of sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS/LAPACK: /dss/dsshome1/03/di93zoz/miniconda3/envs/multigrate2/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    tools     stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] Matrix_1.5-4.1              patchwork_1.1.3            
 [3] scater_1.28.0               scuttle_1.10.2             
 [5] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
 [7] Biobase_2.60.0              GenomicRanges_1.52.0       
 [9] GenomeInfoDb_1.36.1         IRanges_2.34.1             
[11] S4Vectors_0.38.1            BiocGenerics_0.46.0        
[13] MatrixGenerics_1.12.3       matrixStats_1.0.0          
[15] ggplot2_3.4.3               dplyr_1.1.2                
[17] igraph_1.5.1                miloR_1.8.1                
[19] edgeR_3.42.4                limma_3.56.2               

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0          viridisLite_0.4.2        
 [3] vipor_0.4.5               farver_2.1.1             
 [5] viridis_0.6.4             bitops_1.0-7             
 [7] ggraph_2.1.0              RCurl_1.98-1.12          
 [9] tweenr_2.0.2              digest_0.6.33            
[11] rsvd_1.0.5                lifecycle_1.0.3          
[13] statmod_1.5.0             magrittr_2.0.3           
[15] compiler_4.3.1            rlang_1.1.1              
[17] utf8_1.2.3                labeling_0.4.2           
[19] S4Arrays_1.0.5            graphlayouts_1.0.0       
[21] DelayedArray_0.26.7       RColorBrewer_1.1-3       
[23] abind_1.4-5               BiocParallel_1.34.2      
[25] withr_2.5.0               purrr_1.0.2              
[27] grid_4.3.1                polyclip_1.10-4          
[29] fansi_1.0.4               beachmat_2.16.0          
[31] colorspace_2.1-0          scales_1.2.1             
[33] gtools_3.9.4              MASS_7.3-60              
[35] cli_3.6.1                 crayon_1.5.2             
[37] generics_0.1.3            DelayedMatrixStats_1.22.5
[39] ggbeeswarm_0.7.2          ggforce_0.4.1            
[41] stringr_1.5.0             splines_4.3.1            
[43] zlibbioc_1.46.0           parallel_4.3.1           
[45] XVector_0.40.0            vctrs_0.6.3              
[47] BiocSingular_1.16.0       BiocNeighbors_1.18.0     
[49] ggrepel_0.9.3             irlba_2.3.5.1            
[51] beeswarm_0.4.0            packrat_0.9.1            
[53] locfit_1.5-9.8            tidyr_1.3.0              
[55] glue_1.6.2                codetools_0.2-19         
[57] cowplot_1.1.1             stringi_1.7.12           
[59] gtable_0.3.4              ScaledMatrix_1.8.1       
[61] munsell_0.5.0             tibble_3.2.1             
[63] pillar_1.9.0              GenomeInfoDbData_1.2.10  
[65] R6_2.5.1                  sparseMatrixStats_1.12.2 
[67] tidygraph_1.2.3           lattice_0.21-8           
[69] Rcpp_1.0.11               gridExtra_2.3            
[71] pkgconfig_2.0.3

P.S
Some minor differences between 10% and 100%:

Computing Milo on 100% of the data takes so long (maybe 10 hours) to reach the mentioned line.
I get this warning when subsetting to 10% of the data on da_results <- testNhoods(...):

In addition: Warning message:
In testNhoods(milo, design = ~Site + condition, design.df = design_df,  :
  Sample names in design matrix and nhood counts are not matched. Reordering

The text was updated successfully, but these errors were encountered:

MikeDMorgan · 2023-09-18T07:59:55Z

Hi @moinfar - two things to start with: 1) you don't need to run the nhood refinement with 100% of the data - that is entirely redundant and a sure-fire way to kill all of your statistical power, 2) use the graph-based nhood refinement and spatial FDR correction - it's orders of magnitude faster.

Finally, there is a milopy implementation maintained by @emdann that you may find useful: https://github.com/emdann/milopy

moinfar · 2023-09-18T11:35:53Z

Hi @MikeDMorgan,

Thanks for your reply.

Sorry, I think I am missing something. As far as I understood, the nhood refinement proportion is given as a parameter in the makeNhoods function. I use the following code in both mentioned runs:

milo <- makeNhoods(milo, prop = 0.1, k = MILO_K, d=LATENT_DIM, refined = TRUE, reduced_dims='latent')

The difference between the two runs is that I subset the whole data (initial adata) to 10% in the working one, while I use the whole data in the other. The latter results in the given error. Am I missing something?

Thanks. Can you please explain this in more detail? Or some links to the documentation. For graph-based nhood refinement, I think I can pass refinement_scheme="graph". Right? But, I am not sure if spatial FDR correction is something other than what I do now.

Thanks for mentioning the milopy implementation. I will take a look at it.
I would also appreciate it if you could help me with the original error I faced when using the main (588k) data.

MikeDMorgan · 2023-09-18T11:48:41Z

Hi @moinfar I see - sorry from your post I inferred you set p=1.0 - the original nhood refinement time complexity for the jth nhood is $O(n^2_j)$ so scales ~quadratically with the number of cells. The newer algorithms scale much better and run in a fraction of the time. The most up-to-date version on Bioconductor has the relevant function documentation. The manuscript is currently in the works.

I haven't seen that particular error before - it suggests there are NA of Inf values in your nhoods() matrix. Could you check that the scanpy adjacency matrix is binary.

ManarHashemTaha · 2024-04-24T09:43:44Z

Did you solve the error @moinfar ?

moinfar · 2024-04-25T08:01:29Z

@ManarHashemTaha
Unfortunately not.
I used the newest algorithm, but the problem still persisted.
I checked the scanpy neighborhood graph, and it was OK (no problem in scanpy results).

Xiaohan0416 · 2024-10-14T02:15:36Z

Oh no, I have the same issue here. My data includes 221k cells

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem in buildNhoodGraph on a large dataset #290

Problem in buildNhoodGraph on a large dataset #290

moinfar commented Sep 1, 2023 •

edited

Loading

MikeDMorgan commented Sep 18, 2023

moinfar commented Sep 18, 2023 •

edited

Loading

MikeDMorgan commented Sep 18, 2023

ManarHashemTaha commented Apr 24, 2024

moinfar commented Apr 25, 2024 •

edited

Loading

Xiaohan0416 commented Oct 14, 2024

Problem in buildNhoodGraph on a large dataset #290

Problem in buildNhoodGraph on a large dataset #290

Comments

moinfar commented Sep 1, 2023 • edited Loading

MikeDMorgan commented Sep 18, 2023

moinfar commented Sep 18, 2023 • edited Loading

MikeDMorgan commented Sep 18, 2023

ManarHashemTaha commented Apr 24, 2024

moinfar commented Apr 25, 2024 • edited Loading

Xiaohan0416 commented Oct 14, 2024

moinfar commented Sep 1, 2023 •

edited

Loading

moinfar commented Sep 18, 2023 •

edited

Loading

moinfar commented Apr 25, 2024 •

edited

Loading