[Bug] [r] `SOMASparseNDArray$read_sparse_matrix_zero_based()` raises error when calling from a sparse matrix created by python's `from_anndata` #1348

pablo-gar · 2023-05-09T01:00:39Z

Describe the bug

Hey @mlin doing some work with the R iterators I think I have uncovered an unexpected issue for SOMASparseNDArray$read_sparse_matrix_zero_based()

I can see that in your tests you used a mock SOMASparseNDArray with a shape c(10,10)

TileDB-SOMA/apis/r/tests/testthat/test-SOMASparseNDArray.R

Line 4 in d8a6fc9

ndarray$create(arrow::int32(), shape = c(10, 10))

However, when building a SOMAExperiment from python using the from_anndata method we end up with SOMASparseNDArray that is of shape 9223372036854773760 by 9223372036854773760, because of this
#1327 (comment)

Then if I try to read the X sparse matrix of the SOMAExperiment created with from_anndata you hit a numeric overflow issue because the way we use shape to create the Matrix::SparseMatrix .

The actual error is from this

TileDB-SOMA/apis/r/R/SOMASparseNDArray.R

Line 203 in d8a6fc9

dims = as.integer(self$shape()), repr = repr)

Which triggers this
https://github.com/cran/Matrix/blob/d35b9ca2e877a6fc55dde6ad7c9ccfc0b35624d9/R/sparseMatrix.R#L426

To Reproduce

> library("tiledbsoma")
> human_experiment = SOMAExperimentOpen("./apis/python/notebooks/data/sparse/")
> x_sparse <- human_experiment$ms$get("RNA")$X$get("data")
> x_sparse$read_sparse_matrix_zero_based(coords = list(1:10, 1:10))
Error in Matrix::sparseMatrix(i = 1 + as.numeric(tbl$GetColumnByName("soma_dim_0")),  : 
  invalid 'dims'
In addition: Warning message:
In as.integer.integer64(self$shape()) : NAs produced by integer overflow

The text was updated successfully, but these errors were encountered:

mlin · 2023-05-09T10:23:34Z

@pablo-gar @eddelbuettel would a reasonable solution here be to use dims = as.integer(min(self$shape(), .Machine$integer.max))?

eddelbuettel · 2023-05-09T11:23:17Z

I am wondering if there is a better way to not drop the range. When I use the tiledb package I can access this:

> uri <- "apis/python/notebooks/data/sparse/pbmc3k/ms/RNA/X/data"
> lapply(dimensions(domain(schema(uri))), \(x) domain(x))
[[1]]
integer64
[1] 0                   9223372036854773759

[[2]]
integer64
[1] 0                   9223372036854773759

>

eddelbuettel · 2023-05-09T11:55:33Z

Ah darn, and I fell again for the sparseMatrix() constraint on dim:

dims: optional length-2 integer vector of matrix dimensions.  If
      missing, then ‘!index1+c(max(i),max(j))’ is used.

Given that your suggested fix is likely our best bet to create a create sparse matrix object of this type. I think @mojaveazure had a times looked at packages spam (and especially spam64 for "spam with 64bit ints") as possible alternatives but as I recall there were always "reasons" why we are still with sparseMatrix.

pablo-gar · 2023-05-09T17:04:29Z

@mlin

@pablo-gar @eddelbuettel would a reasonable solution here be to use dims = as.integer(min(self$shape(), .Machine$integer.max))?

Not a 100% sure if this is sustainable solution, in theory a SOMASparseNDarray could have dimensions (not capacity) that are larger than .Machine$integer.max, for which case the above will still throw an error (that is not very descriptive)

I think we should probably raise a clear error indicating that such matrices are not supported -- but the question is, how to efficiently get the actual dimension (not capacity).

Let me do a bit experimentation on this.

mojaveazure · 2023-05-09T18:13:56Z

there were always "reasons" why we are still with sparseMatrix

The biggest reason is that spam/spam64 are not used anywhere; pretty much all downstream analysis methods are implemented for Matrix (particularly dgCMatrix) so while a spam/spam64 matrix allows us to read in massive datasets, we can't really do much with them (and they're exclusively CSR/dgR, so neither Seurat nor SingleCellExperiment are optimized for them)

That being said, we could suggest spam/spam64 and if they're installed, allow the user to read in a matrix as a spam/spam64 matrix (I can't remember if it works with spam64 loaded or if it has to be attached)

Also note: we'll get these errors if either the dimensions of the matrix are greater than .Machine$integer.max (generally 2³¹ - 1) or if the number of non-zero values (eg. length(dgCMatrix@x)) is greater than .Machine$integer.max (as dgCMatrix@x is indexed with 32-bit integers at the C level)

mlin · 2023-05-09T19:04:04Z

Also note: we'll get these errors if either the dimensions of the matrix are greater than .Machine$integer.max (generally 231 - 1) or if the number of non-zero values (eg. length(dgCMatrix@x)) is greater than .Machine$integer.max (as dgCMatrix@x is indexed with 32-bit integers at the C level)

Yikes...I was ready to argue that limiting the dimensions to 2^31 is fine for now, but, I can see the limit on individual cells pinching much sooner.

pablo-gar · 2023-05-10T03:09:14Z

Yikes...I was ready to argue that limiting the dimensions to 2^31 is fine for now, but, I can see the limit on individual cells pinching much sooner.

I experimented with this a little bit and it is not a feasible solution, somehow the Matrix package hits this error after creating 2 dgCMatrices with a small number of non-zero values but with shapes 2^31 by 2^31

> sparse <- x_sparse$read_sparse_matrix_zero_based(coords = list(1:10,1:10))
Error: vector memory exhausted (limit reached?)

mlin · 2023-05-10T06:08:22Z

OK. I originally filed #1099 (to set the sparseMatrix dims to the SOMASparseNDArray shape) while porting one of the Python notebooks that reported the dims of the numpy object, in trying to port that as 1:1 possible. Since this is now causing significant problems, I don't see big harm in reverting that so that the sparseMatrix dims will be the maximum row/column actually present (and documenting that this is so). SOMASparseNDArray$shape() would still report the "true" shape. Discussion?

The "vector memory exhausted" is disturbing though, like not consistent with the main idea of a sparse data structure? It sounds like in the future we may well need read_spam64_zero_based() added in a backwards-compatible way.

pablo-gar · 2023-06-05T23:50:59Z

the memory exhausted issue was due to dgCMatrix and dgRMatrix allocating in memory the full corresponding axis, this was solved by defaulting to dgTMatrix here #1274

This issue can be closed when #1440 is merged

pablo-gar added bug Something isn't working r-api labels May 9, 2023

pablo-gar assigned mlin and pablo-gar May 9, 2023

mojaveazure mentioned this issue May 11, 2023

[r] [RFC] Add SOMASparseNDArray$read_spam_matrix() #1367

Open

pablo-gar mentioned this issue May 11, 2023

[r] Add iterator classes #1274

Merged

johnkerl mentioned this issue Jun 6, 2023

[r/python] Parent/tracking issue for shape #1445

Closed

7 tasks

johnkerl closed this as completed Jul 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] [r] `SOMASparseNDArray$read_sparse_matrix_zero_based()` raises error when calling from a sparse matrix created by python's `from_anndata` #1348

[Bug] [r] `SOMASparseNDArray$read_sparse_matrix_zero_based()` raises error when calling from a sparse matrix created by python's `from_anndata` #1348

pablo-gar commented May 9, 2023 •

edited by johnkerl

Loading

mlin commented May 9, 2023

eddelbuettel commented May 9, 2023 •

edited

Loading

eddelbuettel commented May 9, 2023

pablo-gar commented May 9, 2023

mojaveazure commented May 9, 2023 •

edited

Loading

mlin commented May 9, 2023

pablo-gar commented May 10, 2023 •

edited

Loading

mlin commented May 10, 2023

pablo-gar commented Jun 5, 2023

[Bug] [r] SOMASparseNDArray$read_sparse_matrix_zero_based() raises error when calling from a sparse matrix created by python's from_anndata #1348

[Bug] [r] SOMASparseNDArray$read_sparse_matrix_zero_based() raises error when calling from a sparse matrix created by python's from_anndata #1348

Comments

pablo-gar commented May 9, 2023 • edited by johnkerl Loading

mlin commented May 9, 2023

eddelbuettel commented May 9, 2023 • edited Loading

eddelbuettel commented May 9, 2023

pablo-gar commented May 9, 2023

mojaveazure commented May 9, 2023 • edited Loading

mlin commented May 9, 2023

pablo-gar commented May 10, 2023 • edited Loading

mlin commented May 10, 2023

pablo-gar commented Jun 5, 2023

[Bug] [r] `SOMASparseNDArray$read_sparse_matrix_zero_based()` raises error when calling from a sparse matrix created by python's `from_anndata` #1348

[Bug] [r] `SOMASparseNDArray$read_sparse_matrix_zero_based()` raises error when calling from a sparse matrix created by python's `from_anndata` #1348

pablo-gar commented May 9, 2023 •

edited by johnkerl

Loading

eddelbuettel commented May 9, 2023 •

edited

Loading

mojaveazure commented May 9, 2023 •

edited

Loading

pablo-gar commented May 10, 2023 •

edited

Loading