Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Improve handling of arrays with with domains greater than 2^31 - 1 #1504

Merged
merged 16 commits into from
Jun 26, 2023

Conversation

aaronwolen
Copy link
Member

@aaronwolen aaronwolen commented Jun 23, 2023

Issue and/or context: #1487

Changes: This makes a few adjustment to improve our handling of arrays with domains greater than 2^31-1.

  • SOMASparseNDArray$read() no longer warns if nnz() > .Machine$integer.max, since these arrays can be read without issue using SOMASparseNDArray$read()$tables()
  • SOMASparseNDArrayRead$sparse_matrix() truncates shape to 2^31 - 1 if shape > 2^31-1 and lets the user know via a warning. I decided to add this logic here rather than deeper in the stack so that it executes only once when reading sparse matrices, rather than once per iteration.
  • Fixed SparseReaditer's shape assertion
  • arrow_table_to_sparse adds a final check to ensure that shape is <= 2^31 - 1 and all coordinates are within [0, 2^31 - 1) (since shape is 1-based and coords are 0-based)
  • arrow_table_to_sparse was also slightly simplified to work directly with 0-based coords and leverage Matrix::sparseMatrix()'s index1 argument
  • SOMAExperimentAxisQuery$to_seurat_graph() was refactored to leverage $to_sparse_matrix() and avoid the costly creation of a dgCMatrix with domains .Machine$integer.max

@codecov-commenter
Copy link

codecov-commenter commented Jun 23, 2023

Codecov Report

Patch coverage has no change and project coverage change: -11.55 ⚠️

Comparison is base (cb6dd5a) 64.69% compared to head (24f8a14) 53.14%.

❗ Current head 24f8a14 differs from pull request most recent head 20d00ce. Consider uploading reports for the commit 20d00ce to get more accurate results

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1504       +/-   ##
===========================================
- Coverage   64.69%   53.14%   -11.55%     
===========================================
  Files         102       72       -30     
  Lines        8349     5833     -2516     
===========================================
- Hits         5401     3100     -2301     
+ Misses       2948     2733      -215     
Flag Coverage Δ
python ?
r 53.14% <ø> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 35 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@aaronwolen aaronwolen marked this pull request as ready for review June 23, 2023 20:07
Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Copy link
Contributor

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question on Rd files -- something seems different between our machines. Ideas?

apis/r/man/ConfigList.Rd Outdated Show resolved Hide resolved
@aaronwolen aaronwolen force-pushed the aaronwolen/improve-handling-of-64bit-arrays branch from 9a01423 to 20d00ce Compare June 23, 2023 21:43
Copy link
Contributor

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need a wee bit more time for a full review but this looks ready to sail. :shipit:

Copy link
Member

@mojaveazure mojaveazure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@pablo-gar pablo-gar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me with a big asterisk that truncating shapes and throwing a warning is prone to ghost errors for the user.



if (any(private$shape > .Machine$integer.max)) {
warning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just error out here? A warning and modifying the expected output is error-prone if the user is not paying close attention

Copy link
Member Author

@aaronwolen aaronwolen Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @pablo-gar. I struggled with this question but if we error out here we preclude the creation of sparse matrices from arrays with domains >= 2^31-1, even when all non-zero elements are an within R-friendly range of [0, 2^31-1). To me this seemed like a reasonable trade-off: we can still create sparse matrices from these arrays, but we warn the user the resulting matrix's shape will be truncated to c(2^31-1, 2^31-1). In the unlikely scenario that the array (& query) contains non-zero elements at indices >= 2^31-1 (or more than than 2^32-1 values), then we error out and suggest they access the data as an arrow table.

What do you think?

Copy link
Member

@johnkerl johnkerl Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronwolen @pablo-gar is there a performant way in R to find the max used index and check if that is > 2^31-1?

Copy link
Member Author

@aaronwolen aaronwolen Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we check the coordinates of the arrow table before attempting to construct a sparse matrix. We could potentially use non-empty domain beforehand but we'll have an easier option in place after we start adding the bounding box metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronwolen this is awesome!! Please either create a tracking task, or, make a note on #1445 so we remember

Copy link
Member

@pablo-gar pablo-gar Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @pablo-gar. I struggled with this question but if we error out here we preclude the creation of sparse matrices from arrays with domains >= 2^31-1, even when all non-zero elements are an within R-friendly range of [0, 2^31-1).

Fair.

In the unlikely scenario that the array (& query) contains non-zero elements at indices >= 2^31-1 (or more than than 2^32-1 values), then we error out and suggest they access the data as an arrow table.

And great solution

is there a performant way in R to find the max used index and check if that is > 2^31-1?

We could potentially use non-empty domain.

I tried this already and decided not to go with it and to make the shape representation a bona-fide representation of the sparse shape and for parity with Python. I agree with @aaronwolen that once the bounding-box design is finalized we can revisit (and revisit in Python as well)

x = tbl$soma_data$as_vector(),
dims = shape,
repr = repr,
index1 = FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! Mike and I missed this arg in the past

Copy link
Contributor

@eddelbuettel eddelbuettel Jun 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall all of us discussing it at some point a few weeks back. No panacea, but it shows the Matrix authors know about issues / a need for zero base.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants