[r] Improve handling of arrays with domains greater than 2^31 - 1 #1504
Values in this range are fine when reading into an arrow table
Codecov Report
Patch coverage has no change and project coverage change: -11.55%
@@             Coverage Diff             @@
##             main    #1504       +/-   ##
===========================================
- Coverage   64.69%   53.14%   -11.55%
===========================================
  Files         102       72       -30
  Lines        8349     5833     -2516
===========================================
- Hits         5401     3100     -2301
+ Misses       2948     2733      -215
🚢
One question on Rd files -- something seems different between our machines. Ideas?
9a01423 to 20d00ce
I would need a wee bit more time for a full review but this looks ready to sail.
LGTM
Looks good to me with a big asterisk that truncating shapes and throwing a warning is prone to ghost errors for the user.
if (any(private$shape > .Machine$integer.max)) {
  warning(
Should we just error out here? Emitting a warning while also modifying the expected output is error-prone if the user is not paying close attention.
Thanks, @pablo-gar. I struggled with this question, but if we error out here we preclude the creation of sparse matrices from arrays with domains >= 2^31-1, even when all non-zero elements are within an R-friendly range of [0, 2^31-1). To me this seemed like a reasonable trade-off: we can still create sparse matrices from these arrays, but we warn the user that the resulting matrix's shape will be truncated to c(2^31-1, 2^31-1). In the unlikely scenario that the array (& query) contains non-zero elements at indices >= 2^31-1 (or more than 2^32-1 values), we error out and suggest they access the data as an arrow table.
What do you think?
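For readers following the thread, the trade-off described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `truncate_shape()` and `check_coords()` are hypothetical helper names, not functions from the package, and the messages are invented.

```r
# Hypothetical helpers (not package code) illustrating the warn-then-validate approach.
int_max <- .Machine$integer.max  # 2^31 - 1

# Clamp any dimension larger than int_max and tell the user via a warning.
truncate_shape <- function(shape) {
  if (any(shape > int_max)) {
    warning("shape exceeds .Machine$integer.max and was truncated")
    shape <- pmin(shape, int_max)
  }
  shape
}

# Fail if any 0-based coordinate falls outside [0, 2^31 - 1).
check_coords <- function(coords) {
  if (any(coords < 0 | coords >= int_max)) {
    stop("coordinates exceed R's integer range; read the data as an arrow table")
  }
  invisible(TRUE)
}

# A domain far beyond 2^31 - 1 is clamped (warning suppressed here for brevity).
shape <- suppressWarnings(truncate_shape(c(2^40, 100)))
```

With this split, a large domain alone only produces a warning, while data that actually lives beyond the representable range is still rejected.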
@aaronwolen @pablo-gar is there a performant way in R to find the max used index and check if that is > 2^31-1?
Currently we check the coordinates of the arrow table before attempting to construct a sparse matrix. We could potentially use non-empty domain beforehand but we'll have an easier option in place after we start adding the bounding box metadata.
@aaronwolen this is awesome!! Please either create a tracking task or make a note on #1445 so we remember.
> Thanks, @pablo-gar. I struggled with this question but if we error out here we preclude the creation of sparse matrices from arrays with domains >= 2^31-1, even when all non-zero elements are within an R-friendly range of [0, 2^31-1).
Fair.
> In the unlikely scenario that the array (& query) contains non-zero elements at indices >= 2^31-1 (or more than 2^32-1 values), then we error out and suggest they access the data as an arrow table.
And great solution.
> is there a performant way in R to find the max used index and check if that is > 2^31-1?
> We could potentially use non-empty domain.
I tried this already and decided against it, both to keep the shape a bona fide representation of the sparse shape and for parity with Python. I agree with @aaronwolen that once the bounding-box design is finalized we can revisit (and revisit in Python as well).
  x = tbl$soma_data$as_vector(),
  dims = shape,
  repr = repr,
  index1 = FALSE)
nice! Mike and I missed this arg in the past
I recall all of us discussing it at some point a few weeks back. No panacea, but it shows the Matrix authors know about issues / a need for zero base.
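As a small standalone illustration of the `index1` argument discussed here, the following sketch builds a triplet matrix directly from 0-based coordinates. The coordinates and values are made up for the example; only `Matrix::sparseMatrix()` and its `index1` and `repr` arguments come from the thread.

```r
library(Matrix)

# Made-up 0-based coordinates, in the style of the array's stored coords.
i <- c(0L, 1L, 2L)
j <- c(0L, 2L, 1L)
x <- c(10, 20, 30)

# index1 = FALSE tells sparseMatrix() that i and j are 0-based,
# so no manual "+ 1" shift is needed before construction.
m <- sparseMatrix(i = i, j = j, x = x, dims = c(3L, 3L),
                  repr = "T", index1 = FALSE)

# The value stored at 0-based (1, 2) is read back at 1-based [2, 3].
m[2, 3]
```

Regular R indexing on the result remains 1-based; `index1` only affects how the input coordinates are interpreted.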
Issue and/or context: #1487
Changes: This makes a few adjustments to improve our handling of arrays with domains greater than 2^31-1.
- SOMASparseNDArray$read() no longer warns if nnz() > .Machine$integer.max, since these arrays can be read without issue using SOMASparseNDArray$read()$tables()
- SOMASparseNDArrayRead$sparse_matrix() truncates shape to 2^31 - 1 if shape > 2^31-1 and lets the user know via a warning. I decided to add this logic here rather than deeper in the stack so that it executes only once when reading sparse matrices, rather than once per iteration.
- SparseReaditer's shape assertion
- arrow_table_to_sparse adds a final check to ensure that shape is <= 2^31 - 1 and all coordinates are within [0, 2^31 - 1) (since shape is 1-based and coords are 0-based)
- arrow_table_to_sparse was also slightly simplified to work directly with 0-based coords and leverage Matrix::sparseMatrix()'s index1 argument
- SOMAExperimentAxisQuery$to_seurat_graph() was refactored to leverage $to_sparse_matrix() and avoid the costly creation of a dgCMatrix with domains of .Machine$integer.max
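One plausible reason a `dgCMatrix` with such domains is costly (an assumption on my part; the description does not spell it out) is that the compressed-column format stores a column pointer vector `p` of length `ncol + 1`, regardless of how few non-zero values exist. A minimal sketch on a small, made-up matrix:

```r
library(Matrix)

# A single non-zero value in a 5 x 1000 column-compressed matrix.
m <- sparseMatrix(i = 0L, j = 0L, x = 1, dims = c(5L, 1000L),
                  repr = "C", index1 = FALSE)

# The column pointer has one entry per column plus one.
p_len <- length(m@p)

# Back-of-the-envelope size of that pointer alone at 2^31 - 1 columns,
# assuming 4 bytes per integer entry.
ptr_gib <- (.Machine$integer.max + 1) * 4 / 2^30
```

Here `p_len` equals `ncol(m) + 1`, and `ptr_gib` works out to 8 GiB, before any actual data is stored.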