[r] Improve handling of arrays with domains greater than 2^31 - 1 #1504

aaronwolen · 2023-06-23T19:21:21Z

Issue and/or context: #1487

Changes: This makes a few adjustments to improve our handling of arrays with domains greater than 2^31-1.

SOMASparseNDArray$read() no longer warns if nnz() > .Machine$integer.max, since these arrays can be read without issue using SOMASparseNDArray$read()$tables()
SOMASparseNDArrayRead$sparse_matrix() truncates shape to 2^31 - 1 if shape > 2^31-1 and lets the user know via a warning. I decided to add this logic here rather than deeper in the stack so that it executes only once when reading sparse matrices, rather than once per iteration.
Fixed SparseReaditer's shape assertion
arrow_table_to_sparse adds a final check to ensure that shape is <= 2^31 - 1 and all coordinates are within [0, 2^31 - 1) (since shape is 1-based and coords are 0-based)
arrow_table_to_sparse was also slightly simplified to work directly with 0-based coords and leverage Matrix::sparseMatrix()'s index1 argument
SOMAExperimentAxisQuery$to_seurat_graph() was refactored to leverage $to_sparse_matrix() and avoid the costly creation of a dgCMatrix with domains .Machine$integer.max
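
The final bounds check described for arrow_table_to_sparse can be illustrated roughly as follows. This is a minimal base-R sketch, not the package's code: `check_sparse_bounds()`, its arguments, and its error messages are hypothetical names chosen for illustration. It assumes the shape is 1-based (a count) and the coordinates are 0-based, as stated in the list above.

```r
# Hypothetical helper (not the package's arrow_table_to_sparse()): validate that
# a shape and a set of 0-based coordinates fit in R's 32-bit integer range.
check_sparse_bounds <- function(shape, coords) {
  int_max <- .Machine$integer.max  # 2^31 - 1

  # Shape is a count (1-based), so the largest permissible value is 2^31 - 1.
  if (any(shape > int_max)) {
    stop("shape exceeds 2^31 - 1; read the data as an arrow table instead")
  }

  # Coordinates are 0-based, so valid values lie in [0, 2^31 - 1).
  for (dim_coords in coords) {
    if (length(dim_coords) > 0 &&
        (min(dim_coords) < 0 || max(dim_coords) >= int_max)) {
      stop("coordinates fall outside [0, 2^31 - 1)")
    }
  }
  invisible(TRUE)
}

# Doubles are used here because they represent values beyond the integer range.
ok <- check_sparse_bounds(
  shape = c(2^31 - 1, 2^31 - 1),
  coords = list(soma_dim_0 = c(0, 5, 2^31 - 2), soma_dim_1 = c(1, 2, 3))
)

# A coordinate equal to 2^31 - 1 is one past the last valid 0-based index.
bad <- tryCatch(
  check_sparse_bounds(c(2^31 - 1, 2^31 - 1), list(c(0, 2^31 - 1), c(0, 1))),
  error = function(e) conditionMessage(e)
)
```

Checking `min()` and `max()` per dimension touches each coordinate vector only once, which keeps the check cheap relative to building the matrix.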

Values in this range are fine when reading into an arrow table

codecov-commenter · 2023-06-23T19:31:45Z

Codecov Report

Patch coverage has no change and project coverage change: -11.55 ⚠️

Comparison is base (cb6dd5a) 64.69% compared to head (24f8a14) 53.14%.

❗ Current head 24f8a14 differs from pull request most recent head 20d00ce. Consider uploading reports for the commit 20d00ce to get more accurate results

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1504       +/-   ##
===========================================
- Coverage   64.69%   53.14%   -11.55%     
===========================================
  Files         102       72       -30     
  Lines        8349     5833     -2516     
===========================================
- Hits         5401     3100     -2301     
+ Misses       2948     2733      -215

Flag	Coverage Δ
python	`?`
r	`53.14% <ø> (+0.13%)`	⬆️

see 35 files with indirect coverage changes


johnkerl

🚢

eddelbuettel

One question on Rd files -- something seems different between our machines. Ideas?

apis/r/man/ConfigList.Rd

apis/r/tests/testthat/helper-test-soma-objects.R

apis/r/R/utils-readerTransformers.R

eddelbuettel

I would need a wee bit more time for a full review but this looks ready to sail.

mojaveazure

LGTM

pablo-gar

Looks good to me with a big asterisk that truncating shapes and throwing a warning is prone to ghost errors for the user.

pablo-gar · 2023-06-23T23:51:45Z

apis/r/R/SOMASparseNDArrayRead.R

+
+
+      if (any(private$shape > .Machine$integer.max)) {
+        warning(


Should we just error out here? A warning and modifying the expected output is error-prone if the user is not paying close attention

aaronwolen · 2023-06-26

Thanks, @pablo-gar. I struggled with this question, but if we error out here we preclude the creation of sparse matrices from arrays with domains >= 2^31-1, even when all non-zero elements are within an R-friendly range of [0, 2^31-1). To me this seemed like a reasonable trade-off: we can still create sparse matrices from these arrays, but we warn the user that the resulting matrix's shape will be truncated to c(2^31-1, 2^31-1). In the unlikely scenario that the array (& query) contains non-zero elements at indices >= 2^31-1 (or more than 2^32-1 values), we error out and suggest they access the data as an arrow table.

What do you think?

johnkerl · 2023-06-26

@aaronwolen @pablo-gar is there a performant way in R to find the max used index and check if that is > 2^31-1?

aaronwolen · 2023-06-26

Currently we check the coordinates of the arrow table before attempting to construct a sparse matrix. We could potentially use the non-empty domain beforehand, but we'll have an easier option in place once we start adding the bounding box metadata.

johnkerl · 2023-06-26

@aaronwolen this is awesome!! Please either create a tracking task or make a note on #1445 so we remember.

pablo-gar · 2023-06-26

> Thanks, @pablo-gar. I struggled with this question, but if we error out here we preclude the creation of sparse matrices from arrays with domains >= 2^31-1, even when all non-zero elements are within an R-friendly range of [0, 2^31-1).

Fair.

> In the unlikely scenario that the array (& query) contains non-zero elements at indices >= 2^31-1 (or more than 2^32-1 values), we error out and suggest they access the data as an arrow table.

And great solution.

> is there a performant way in R to find the max used index and check if that is > 2^31-1?

> We could potentially use non-empty domain.

I tried this already and decided not to go with it and to make the shape representation a bona-fide representation of the sparse shape and for parity with Python. I agree with @aaronwolen that once the bounding-box design is finalized we can revisit (and revisit in Python as well)
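
The trade-off discussed in this thread (truncate the shape and warn, rather than error) can be sketched roughly as below. This is an illustration only: `truncate_shape()` and its warning text are hypothetical and are not taken from SOMASparseNDArrayRead.R beyond the `any(shape > .Machine$integer.max)` condition visible in the diff excerpt.

```r
# Hypothetical helper: clamp each dimension to .Machine$integer.max and warn
# once, so callers still get a usable (but truncated) shape.
truncate_shape <- function(shape) {
  int_max <- .Machine$integer.max  # 2^31 - 1
  if (any(shape > int_max)) {
    warning(
      "Array shape exceeds 2^31 - 1 and will be truncated to ", int_max,
      call. = FALSE,
      immediate. = TRUE  # surface the warning right away, not at top level
    )
    shape <- pmin(shape, int_max)
  }
  shape
}

# Capture the warning so this example runs quietly.
msgs <- character()
res <- withCallingHandlers(
  truncate_shape(c(2^63 - 1, 100)),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Only dimensions larger than the limit are changed; a shape that already fits is returned untouched and no warning is raised.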

pablo-gar · 2023-06-23T23:56:02Z

apis/r/R/utils-readerTransformers.R

+                              x = tbl$soma_data$as_vector(),
+                              dims = shape,
+                              repr = repr,
+                              index1 = FALSE)


nice! Mike and I missed this arg in the past

eddelbuettel · 2023-06-24

I recall all of us discussing it at some point a few weeks back. No panacea, but it shows the Matrix authors know about the issues and the need for a zero base.
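
For readers unfamiliar with the argument, here is a small sketch of what `index1 = FALSE` does in `Matrix::sparseMatrix()`. The coordinates, values, and dimensions are made up for illustration and are unrelated to any real SOMA array.

```r
library(Matrix)

# 0-based coordinates and values (illustrative data only).
i <- c(0, 1, 3)
j <- c(0, 2, 1)
x <- c(10, 20, 30)

# index1 = FALSE tells sparseMatrix() that i and j are 0-based,
# so no manual "+ 1" shift is needed before construction.
m <- sparseMatrix(i = i, j = j, x = x, dims = c(4, 3), index1 = FALSE)

# The same matrix built the 1-based way, for comparison.
m1 <- sparseMatrix(i = i + 1, j = j + 1, x = x, dims = c(4, 3))
```

Both calls yield the same matrix; the 0-based form simply avoids allocating shifted copies of the coordinate vectors.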

aaronwolen added 14 commits June 23, 2023 13:53

Temporarily stop erroring out on 32bit shape

4df98e3

Truncate dims/coords to 31bit

36baff5

Avoid creation of CsparseMatrix with full array domain

6eda833

Remove 32bit warning from SOMASparseNDArray$read()

0309a21

Truncate >=32bit array shapes in SOMASparseNDArrayRead$sparse_matrix()

8c5d740

Restore shape size assertions lower in the stack

802c086

Fix shape truncation

642f8b6

Stop removing 32bit coordinates from SOMASparseNDArray$convert_coords()

818729d

Values in this range are fine when reading into an arrow table

Warn immediately

3390c45

Refine assertion messages

904dd5e

Fix code alignment

939b2d6

Refine sparse matrix creation

eaf57eb

Add tests for 32-bit arrays

0e71ed8

Fix assertions again

82c2812

aaronwolen marked this pull request as ready for review June 23, 2023 20:07

aaronwolen requested review from johnkerl, eddelbuettel, pablo-gar and mojaveazure June 23, 2023 20:07

johnkerl approved these changes Jun 23, 2023

eddelbuettel reviewed Jun 23, 2023

apis/r/man/ConfigList.Rd (outdated)

apis/r/tests/testthat/helper-test-soma-objects.R

mojaveazure reviewed Jun 23, 2023

apis/r/R/utils-readerTransformers.R (outdated)

aaronwolen added 2 commits June 23, 2023 16:42

Bump version

a7e8ae0

Restore assertion that nnz < 2^31-1

20d00ce

aaronwolen force-pushed the aaronwolen/improve-handling-of-64bit-arrays branch from 9a01423 to 20d00ce Compare June 23, 2023 21:43

aaronwolen requested review from eddelbuettel and mojaveazure June 23, 2023 21:43

eddelbuettel approved these changes Jun 23, 2023

mojaveazure approved these changes Jun 23, 2023

pablo-gar approved these changes Jun 23, 2023

aaronwolen merged commit ba44f1f into main Jun 26, 2023

aaronwolen deleted the aaronwolen/improve-handling-of-64bit-arrays branch June 26, 2023 21:11

johnkerl mentioned this pull request Jun 27, 2023

[python] Use 31-bit-friendly default shape for ingest #1440

Merged

pablo-gar mentioned this pull request Jun 27, 2023

[r/python] Parent/tracking issue for shape #1445

Closed

7 tasks

This was referenced Jul 6, 2023

[r] Slow export of obsm/varm arrays to Seurat #1518

Closed

[r] Optimize export of obsm/varm arrays to Seurat #1521

Merged

aaronwolen commented Jun 23, 2023 (edited by eddelbuettel)

codecov-commenter commented Jun 23, 2023 (edited)

johnkerl left a comment

eddelbuettel left a comment

eddelbuettel left a comment

mojaveazure left a comment

pablo-gar left a comment

pablo-gar Jun 23, 2023

aaronwolen Jun 26, 2023 (edited)

johnkerl Jun 26, 2023 (edited)

aaronwolen Jun 26, 2023 (edited)

johnkerl Jun 26, 2023

pablo-gar Jun 26, 2023 (edited)

pablo-gar Jun 23, 2023

eddelbuettel Jun 24, 2023 (edited)
