Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] [r] SOMASparseNDArray$read_sparse_matrix_zero_based() raises error when calling from a sparse matrix created by python's from_anndata #1348

Closed
pablo-gar opened this issue May 9, 2023 · 9 comments
Assignees
Labels
bug Something isn't working r-api

Comments

@pablo-gar
Copy link
Member

pablo-gar commented May 9, 2023

Describe the bug

Hey @mlin doing some work with the R iterators I think I have uncovered an unexpected issue for SOMASparseNDArray$read_sparse_matrix_zero_based()

I can see that in your tests you used a mock SOMASparseNDArray with a shape c(10,10)

ndarray$create(arrow::int32(), shape = c(10, 10))

However, when building a SOMAExperiment from python using the from_anndata method we end up with SOMASparseNDArray that is of shape 9223372036854773760 by 9223372036854773760, because of this
#1327 (comment)

Then if I try to read the X sparse matrix of the SOMAExperiment created with from_anndata you hit a numeric overflow issue because the way we use shape to create the Matrix::SparseMatrix .

The actual error is from this

dims = as.integer(self$shape()), repr = repr)

Which triggers this
https://github.com/cran/Matrix/blob/d35b9ca2e877a6fc55dde6ad7c9ccfc0b35624d9/R/sparseMatrix.R#L426

To Reproduce

> library("tiledbsoma")
> human_experiment = SOMAExperimentOpen("./apis/python/notebooks/data/sparse/")
> x_sparse <- human_experiment$ms$get("RNA")$X$get("data")
> x_sparse$read_sparse_matrix_zero_based(coords = list(1:10, 1:10))
Error in Matrix::sparseMatrix(i = 1 + as.numeric(tbl$GetColumnByName("soma_dim_0")),  : 
  invalid 'dims'
In addition: Warning message:
In as.integer.integer64(self$shape()) : NAs produced by integer overflow
@pablo-gar pablo-gar added bug Something isn't working r-api labels May 9, 2023
@pablo-gar pablo-gar changed the title [Bug] [r] SOMASparseNDArray$read_sparse_matrix_zero_based() raises error when calling from a sparse matrix create by python's from_anndata [Bug] [r] SOMASparseNDArray$read_sparse_matrix_zero_based() raises error when calling from a sparse matrix created by python's from_anndata May 9, 2023
@mlin
Copy link
Member

mlin commented May 9, 2023

@pablo-gar @eddelbuettel would a reasonable solution here be to use dims = as.integer(min(self$shape(), .Machine$integer.max))?

@eddelbuettel
Copy link
Contributor

eddelbuettel commented May 9, 2023

I am wondering if there is a better way to not drop the range. When I use the tiledb package I can access this:

> uri <- "apis/python/notebooks/data/sparse/pbmc3k/ms/RNA/X/data"
> lapply(dimensions(domain(schema(uri))), \(x) domain(x))
[[1]]
integer64
[1] 0                   9223372036854773759

[[2]]
integer64
[1] 0                   9223372036854773759

> 

@eddelbuettel
Copy link
Contributor

Ah darn, and I fell again for the sparseMatrix() constraint on dim:

dims: optional length-2 integer vector of matrix dimensions.  If
      missing, then ‘!index1+c(max(i),max(j))’ is used.

Given that your suggested fix is likely our best bet to create a create sparse matrix object of this type. I think @mojaveazure had a times looked at packages spam (and especially spam64 for "spam with 64bit ints") as possible alternatives but as I recall there were always "reasons" why we are still with sparseMatrix.

@pablo-gar
Copy link
Member Author

@mlin

@pablo-gar @eddelbuettel would a reasonable solution here be to use dims = as.integer(min(self$shape(), .Machine$integer.max))?

Not a 100% sure if this is sustainable solution, in theory a SOMASparseNDarray could have dimensions (not capacity) that are larger than .Machine$integer.max, for which case the above will still throw an error (that is not very descriptive)

I think we should probably raise a clear error indicating that such matrices are not supported -- but the question is, how to efficiently get the actual dimension (not capacity).

Let me do a bit experimentation on this.

@mojaveazure
Copy link
Member

mojaveazure commented May 9, 2023

there were always "reasons" why we are still with sparseMatrix

The biggest reason is that spam/spam64 are not used anywhere; pretty much all downstream analysis methods are implemented for Matrix (particularly dgCMatrix) so while a spam/spam64 matrix allows us to read in massive datasets, we can't really do much with them (and they're exclusively CSR/dgR, so neither Seurat nor SingleCellExperiment are optimized for them)

That being said, we could suggest spam/spam64 and if they're installed, allow the user to read in a matrix as a spam/spam64 matrix (I can't remember if it works with spam64 loaded or if it has to be attached)

Also note: we'll get these errors if either the dimensions of the matrix are greater than .Machine$integer.max (generally 231 - 1) or if the number of non-zero values (eg. length(dgCMatrix@x)) is greater than .Machine$integer.max (as dgCMatrix@x is indexed with 32-bit integers at the C level)

@mlin
Copy link
Member

mlin commented May 9, 2023

Also note: we'll get these errors if either the dimensions of the matrix are greater than .Machine$integer.max (generally 231 - 1) or if the number of non-zero values (eg. length(dgCMatrix@x)) is greater than .Machine$integer.max (as dgCMatrix@x is indexed with 32-bit integers at the C level)

Yikes...I was ready to argue that limiting the dimensions to 2^31 is fine for now, but, I can see the limit on individual cells pinching much sooner.

@pablo-gar
Copy link
Member Author

pablo-gar commented May 10, 2023

Yikes...I was ready to argue that limiting the dimensions to 2^31 is fine for now, but, I can see the limit on individual cells pinching much sooner.

I experimented with this a little bit and it is not a feasible solution, somehow the Matrix package hits this error after creating 2 dgCMatrices with a small number of non-zero values but with shapes 2^31 by 2^31

> sparse <- x_sparse$read_sparse_matrix_zero_based(coords = list(1:10,1:10))
Error: vector memory exhausted (limit reached?)

@mlin
Copy link
Member

mlin commented May 10, 2023

OK. I originally filed #1099 (to set the sparseMatrix dims to the SOMASparseNDArray shape) while porting one of the Python notebooks that reported the dims of the numpy object, in trying to port that as 1:1 possible. Since this is now causing significant problems, I don't see big harm in reverting that so that the sparseMatrix dims will be the maximum row/column actually present (and documenting that this is so). SOMASparseNDArray$shape() would still report the "true" shape. Discussion?

The "vector memory exhausted" is disturbing though, like not consistent with the main idea of a sparse data structure? It sounds like in the future we may well need read_spam64_zero_based() added in a backwards-compatible way.

@pablo-gar
Copy link
Member Author

the memory exhausted issue was due to dgCMatrix and dgRMatrix allocating in memory the full corresponding axis, this was solved by defaulting to dgTMatrix here #1274

This issue can be closed when #1440 is merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working r-api
Projects
None yet
Development

No branches or pull requests

5 participants