Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AnnData dense matrix read support #146

Merged
merged 11 commits into from
Nov 25, 2024
Merged

Add AnnData dense matrix read support #146

merged 11 commits into from
Nov 25, 2024

Conversation

bnprks
Copy link
Owner

@bnprks bnprks commented Oct 23, 2024

This addresses old issues #17 and #36 by adding support for reading the AnnData dense matrix format.

Changes:

  • Add support for reading dense AnnData matrices
  • Adjust test files for old AnnData versions to also cover obsm, varm, and dense matrices. Reduce the version numbers covered since v0.7.8 appeared to be identical to v0.7.0
  • Adjust how dimnames are recognized from AnnData files (match based on length rather than always assuming var is rows and obs is cols. This helps support other matrix shapes as found in obsm, or varp, etc.)
  • Split the 10x and AnnData import code into separate files since they are getting a bit long otherwise

The first commit is the main one to look at for review, the others are moving around code mostly.

}
std::vector<uint32_t> dims;
bool row_major, is_sparse;
readAnnDataDims(h5file, group, dims, row_major, is_sparse);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is now a much cleaner function. However, I think that using a void typed function to get the row_major status makes it a little bit unclear on how you obtained row_major status. In terms of readability, having to look into the function call on readAnnDataDims to see that your values changed as they're passed by reference is an extra step. Maybe we can just make it return the bool, or the tuple of bools?

@bnprks
Copy link
Owner Author

bnprks commented Oct 25, 2024

As discussed offline, another to-do item is filtering out explicit zeros while reading from a dense matrix

This required also changing openAnnDataMatrix() to return a unique_ptr,
and open10xFeatureMatrix() got edited at the same time for consistency.
Copy link
Collaborator

@immanuelazn immanuelazn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks super good! Admittedly I had to read through a lot of highfive APIs, as well as through AnnData, to get a full understanding, but this looks super clear. Runs on my end well!

// Read AnnData sparse matrix, with an implicit transpose to CSC format for
// any data stored in CSR format.
// Row/col names are handled as follows:
// - If row_names or col_names are provided, they are assumed to already have taken into
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to mention this in the header file too

// Filter out zero values from a MatrixLoader
// This is useful when reading dense matrices that have many zero values,
// or when performing operations that will cause new zeros to be created (e.g. multiplying a row by zero)
template<typename T>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool and elegant! Can already see this being used in a handful of derived matrixloader types

FilterZeros(std::unique_ptr<MatrixLoader<T>> &&loader) : MatrixLoaderWrapper<T>(std::move(loader)) {}

// Return false if there are no more entries to load
virtual bool load() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't load() and capacity() just have the override keyword?

if (dims.size() != 1)
throw std::runtime_error(
std::string("readAnnDataDimname(): expected ") + axis +
" index to be 1-dimensional aray."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
" index to be 1-dimensional aray."
" index to be 1-dimensional array."

, cols_(d_.getDimensions()[1]) {}

// Return total number of integers in the reader
virtual uint64_t size() const { return rows_ * cols_; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using override keyword instead of virutal would probably make more sense

@@ -260,14 +264,24 @@ test_that("AnnData read backwards compatibility", {
open_matrix_anndata_hdf5(file.path(dir, f), group="layers/transpose") %>%
as("dgCMatrix") %>%
expect_identical(ans)
open_matrix_anndata_hdf5(file.path(dir, f), group="layers/dense") %>%
as("dgCMatrix") %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to check for writing matrices that were originally read dense. I did a little bit of testing with write_matrix_hdf5(), and write_matrix_memory() and they appear to be working. But it never hurts to have a check in case of a regression!

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since as("dgCMatrix") calls write_matrix_memory() internally, I think this is already covered here

Comment on lines 73 to 76
template <class T> List dims_matrix(std::unique_ptr<MatrixLoader<T>> &&mat, bool transpose) {
// This is only safe because we know the main `dims_matrix` function doesn't store any references to the object
return dims_matrix(std::move(*mat), transpose);
}
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to do a bit more research on the safety of this before merging. I was just running some small C++ tests and getting surprised by when destructors got called, so my mental model is not quite right

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have figured this out now. The solution I used was:

  • Make the base dims_matrix take an l-value reference, since that provides more ownership guarantees to the caller (i.e. the object will not get consumed by the function call)
  • Make a helper overload that takes an r-value reference. This provides fewer guarantees to the caller but is otherwise safe
  • Dereferencing a std::unique_ptr with * produces an l-value, which now can be passed directly to the l-value reference overload of dims_matrix

@immanuelazn
Copy link
Collaborator

No qualms about the new changes you added! Looks good Ben.

@bnprks bnprks merged commit 290863d into main Nov 25, 2024
4 checks passed
@bnprks bnprks deleted the bp/dense-anndata-read branch November 25, 2024 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants