-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AnnData dense matrix read support #146
Conversation
As a side-effect, it seems the SingletonNumReader was dead code and got deleted.
} | ||
std::vector<uint32_t> dims; | ||
bool row_major, is_sparse; | ||
readAnnDataDims(h5file, group, dims, row_major, is_sparse); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this is now a much cleaner function. However, I think that using a void typed function to get the row_major status makes it a little bit unclear on how you obtained row_major status. In terms of readability, having to look into the function call on readAnnDataDims
to see that your values changed as they're passed by reference is an extra step. Maybe we can just make it return the bool, or the tuple of bools?
As discussed offline, another to-do item is filtering out explicit zeros while reading from a dense matrix |
This required also changing openAnnDataMatrix() to return a unique_ptr, and open10xFeatureMatrix() got edited at the same time for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks super good! Admittedly I had to read through a lot of highfive APIs, as well as through AnnData, to get a full understanding, but this looks super clear. Runs on my end well!
// Read AnnData sparse matrix, with an implicit transpose to CSC format for | ||
// any data stored in CSR format. | ||
// Row/col names are handled as follows: | ||
// - If row_names or col_names are provided, they are assumed to already have taken into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to mention this in the header file too
// Filter out zero values from a MatrixLoader | ||
// This is useful when reading dense matrices that have many zero values, | ||
// or when performing operations that will cause new zeros to be created (e.g. multiplying a row by zero) | ||
template<typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool and elegant! Can already see this being used in a handful of derived matrixloader types
FilterZeros(std::unique_ptr<MatrixLoader<T>> &&loader) : MatrixLoaderWrapper<T>(std::move(loader)) {} | ||
|
||
// Return false if there are no more entries to load | ||
virtual bool load() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't load()
and capacity()
just have the override
keyword?
if (dims.size() != 1) | ||
throw std::runtime_error( | ||
std::string("readAnnDataDimname(): expected ") + axis + | ||
" index to be 1-dimensional aray." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
" index to be 1-dimensional aray." | |
" index to be 1-dimensional array." |
, cols_(d_.getDimensions()[1]) {} | ||
|
||
// Return total number of integers in the reader | ||
virtual uint64_t size() const { return rows_ * cols_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using override
keyword instead of virutal
would probably make more sense
@@ -260,14 +264,24 @@ test_that("AnnData read backwards compatibility", { | |||
open_matrix_anndata_hdf5(file.path(dir, f), group="layers/transpose") %>% | |||
as("dgCMatrix") %>% | |||
expect_identical(ans) | |||
open_matrix_anndata_hdf5(file.path(dir, f), group="layers/dense") %>% | |||
as("dgCMatrix") %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to check for writing matrices that were originally read dense. I did a little bit of testing with write_matrix_hdf5()
, and write_matrix_memory()
and they appear to be working. But it never hurts to have a check in case of a regression!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since as("dgCMatrix")
calls write_matrix_memory()
internally, I think this is already covered here
r/src/matrix_io.cpp
Outdated
template <class T> List dims_matrix(std::unique_ptr<MatrixLoader<T>> &&mat, bool transpose) { | ||
// This is only safe because we know the main `dims_matrix` function doesn't store any references to the object | ||
return dims_matrix(std::move(*mat), transpose); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to do a bit more research on the safety of this before merging. I was just running some small C++ tests and getting surprised by when destructors got called, so my mental model is not quite right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have figured this out now. The solution I used was:
- Make the base
dims_matrix
take an l-value reference, since that provides more ownership guarantees to the caller (i.e. the object will not get consumed by the function call) - Make a helper overload that takes an r-value reference. This provides fewer guarantees to the caller but is otherwise safe
- Dereferencing a std::unique_ptr with
*
produces an l-value, which now can be passed directly to the l-value reference overload ofdims_matrix
No qualms about the new changes you added! Looks good Ben. |
This addresses old issues #17 and #36 by adding support for reading the AnnData dense matrix format.
Changes:
obsm
,varm
, and dense matrices. Reduce the version numbers covered since v0.7.8 appeared to be identical to v0.7.0obsm
, orvarp
, etc.)The first commit is the main one to look at for review, the others are moving around code mostly.