[r][cpp] Add supports to write IterableMatrix to dense H5 dataset with AnnData format #166
Hi @ycli1995, thanks for submitting this! From a quick skim it looks pretty good. This is a welcome addition to fill out our support for dense anndata matrices. I'm a bit busy right now so it will be longer than usual to get a proper review of the C++ and merge. Assuming there aren't any issues in the C++ on a closer look, I'll probably just have some minor tweaks to request on the R API before merging. (I'm trying to think through whether it's better to just add a @immanuelazn might also take a look since he's a recent lab hire who is helping out with BPCells development
Thank you for the feedback! My reason for adding a new function is that the original
Hi @ycli1995 thanks for the contribution! This is great work, and I can see the reasoning behind a lot of your design decisions. I left a preliminary review, and I will probably continue to add on to it over the weekend. Ben will likely give it a second pass though. Overall, functionality looks great, but I commented on some styling things, as well as on the R interface that you and Ben had already discussed.
    for (uint32_t ii = 0; ii < capacity; ii++) {
        val_buf[*(mat.rowData() + i + ii)] = *(mat.valData() + i + ii);
    }
    idx += capacity;
I'm not sure what `idx` is providing here since it is not being reset. Is it just a check that there are more than 0 rows?
I think this section can actually be simplified a bit further. I don't think you need the inner loop or the `max_capacity` variable at all. In `StoredMatrixWriter`, an inner loop is used to minimize the buffer size for the downstream `NumWriter` objects, but in this case it's not needed since we are just staging everything in `val_buf` anyhow.
I also agree with Immanuel: you could also just replace `idx` with a `bool` that gets set to true if any data has been loaded from the column.
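A minimal sketch of the simplification suggested above, assuming a hypothetical `SparseChunk` struct in place of the loader's `rowData()`/`valData()` buffers (it is not a BPCells type). Each chunk is scattered straight into the dense column buffer, and a `bool` records whether anything was loaded:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one loaded chunk of a sparse column:
// parallel arrays of row indices and values. Not a BPCells type.
struct SparseChunk {
    std::vector<uint32_t> rows;
    std::vector<double> vals;
};

// Scatter every chunk of a single column into the dense buffer val_buf.
// Returns true if at least one entry was staged, replacing the idx counter.
bool stage_column(const std::vector<SparseChunk> &chunks, std::vector<double> &val_buf) {
    bool loaded = false;
    for (const SparseChunk &chunk : chunks) {
        assert(chunk.rows.size() == chunk.vals.size());
        // Single loop over the chunk: no inner sub-buffering is needed
        // because everything is staged in val_buf anyway.
        for (std::size_t i = 0; i < chunk.rows.size(); i++) {
            val_buf[chunk.rows[i]] = chunk.vals[i];
            loaded = true;
        }
    }
    return loaded;
}
```

The caller would size `val_buf` to the number of rows and reset it to zeros after each column is written.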
    std::vector<T> val_buf = zero_buf; // buffer for each column
    while (mat.nextCol()) {
        if (user_interrupt != NULL && *user_interrupt) return;
        if (mat.currentCol() < col)
I feel like you can remove the conditional on line 104 and just set this to `if (mat.currentCol() != col)`.
I think `mat.currentCol() != col` wouldn't work because in the case of a matrix with a column of all zeros, it's possible to have `col < mat.currentCol()`. (That said, I'm not sure the case of a missing column is being handled correctly right now.)
Hi @ycli1995, I got a chance to do my full look through as well. I added a few comments in addition to the ones Immanuel already made. Most stuff is just coding style to improve consistency with the rest of the BPCells code, but I did find one correctness issue.
Because `mat.nextCol()` can skip empty columns, writes don't work properly for matrices that contain some empty columns. Here's a test case that currently fails:
dir <- tempdir()
m <- matrix(0, nrow=3, ncol=4)
m[2,2] <- 1
m[3,4] <- 1
rownames(m) <- paste0("row", seq_len(nrow(m)))
colnames(m) <- paste0("col", seq_len(ncol(m)))
mat <- m |> as("dgCMatrix") |> as("IterableMatrix")
ans <- write_matrix_anndata_hdf5_dense(mat, file.path(dir, "zeros.h5"))
expect_identical(as.matrix(mat), as.matrix(ans))
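One way to handle skipped columns can be sketched as follows. This is only an illustration under stated assumptions: `SparseColumn` and `to_dense_col_major` are hypothetical names, and the iterator is modeled as a plain list of the non-empty columns in increasing order rather than the real `MatrixLoader` API. The idea is to track the next column index that still has to be written and emit zero columns for any gap, including trailing empty columns:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model of what a column iterator hands back: only the
// non-empty columns, in increasing order, each with (row, value) pairs.
struct SparseColumn {
    uint32_t col;
    std::vector<std::pair<uint32_t, double>> entries;
};

// Build a dense column-major matrix with nrow * ncol values, filling in
// all-zero columns wherever the iterator skipped an index.
std::vector<double> to_dense_col_major(const std::vector<SparseColumn> &cols, uint32_t nrow, uint32_t ncol) {
    std::vector<double> dense;
    dense.reserve(static_cast<std::size_t>(nrow) * ncol);
    std::vector<double> val_buf(nrow, 0.0);
    uint32_t next_col = 0; // next column index that still has to be written
    for (const SparseColumn &c : cols) {
        assert(c.col >= next_col && c.col < ncol);
        // Emit a zero column for every index that was skipped.
        for (; next_col < c.col; next_col++) {
            dense.insert(dense.end(), nrow, 0.0);
        }
        for (const auto &e : c.entries) {
            val_buf[e.first] = e.second;
        }
        dense.insert(dense.end(), val_buf.begin(), val_buf.end());
        std::fill(val_buf.begin(), val_buf.end(), 0.0); // reset for the next column
        next_col++;
    }
    // Trailing empty columns after the last non-empty one.
    for (; next_col < ncol; next_col++) {
        dense.insert(dense.end(), nrow, 0.0);
    }
    return dense;
}
```

With the 3x4 matrix from the test case (non-zeros at row 2, column 2 and row 3, column 4, in 1-based indexing), columns 1 and 3 are never visited by the iterator but still come out as zero columns in the result.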
    // Write a Dense Array to an AnnData file
    template <typename T>
    H5DenseMatrixWriter<T> createAnnDataDenseMatrix(
        std::string file,
        std::string dataset,
        uint32_t nrow,
        uint32_t ncol,
        bool row_major,
        uint32_t buffer_size,
        uint32_t chunk_size,
        uint32_t gzip_level
    ) {
Could we re-arrange the code so most of the implementation can be in a `.cpp` file?
I think the best way to do this would be to only put the declaration for `createAnnDataDenseMatrix` in the header file, and make it return a `std::unique_ptr<MatrixWriter<T>>`, then move everything else into the `.cpp` file with explicit template instantiations, just like `createAnnDataMatrix` does (link).
And as mentioned above, I think we'll probably want to get rid of the `nrow` and `ncol` parameters here and get them during the call to `write(mat_in)`.
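The header/`.cpp` split described above can be sketched in a single file like this. All names here (`MatrixWriterSketch`, `DenseWriterImpl`, `createDenseWriterSketch`) are hypothetical stand-ins and not the BPCells classes; the two halves are marked with comments to show what would go where. The factory returns a `std::unique_ptr` to the abstract interface, so the concrete writer never has to appear in the header:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// ---- What would live in the header: interface + factory declaration ----

// Hypothetical stand-in for an abstract matrix writer interface.
template <typename T> class MatrixWriterSketch {
  public:
    virtual ~MatrixWriterSketch() = default;
    virtual void write(const std::vector<T> &values) = 0;
    virtual std::size_t written() const = 0;
};

template <typename T>
std::unique_ptr<MatrixWriterSketch<T>> createDenseWriterSketch(std::string dataset);

// ---- What would live in the .cpp file: concrete class + definitions ----

namespace {
// Concrete writer, invisible to code that only includes the header.
template <typename T> class DenseWriterImpl : public MatrixWriterSketch<T> {
  public:
    explicit DenseWriterImpl(std::string dataset) : dataset_(std::move(dataset)) {
        assert(!dataset_.empty());
    }
    // Placeholder for the real write: just counts the values it was given.
    void write(const std::vector<T> &values) override { count_ += values.size(); }
    std::size_t written() const override { return count_; }

  private:
    std::string dataset_;
    std::size_t count_ = 0;
};
} // namespace

template <typename T>
std::unique_ptr<MatrixWriterSketch<T>> createDenseWriterSketch(std::string dataset) {
    return std::make_unique<DenseWriterImpl<T>>(std::move(dataset));
}

// Explicit instantiations: the only element types callers can link against.
template std::unique_ptr<MatrixWriterSketch<float>> createDenseWriterSketch<float>(std::string);
template std::unique_ptr<MatrixWriterSketch<double>> createDenseWriterSketch<double>(std::string);
```

Because the template definition sits in the `.cpp` half, each supported element type needs its own explicit instantiation line; other types would fail at link time rather than silently compile.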
* Delay the construction of H5 dataset in `H5DenseMatrixWriter::write()`
* Simplify the implementation of `H5DenseMatrixWriter::write()` while buffering values for each column
* Put the implementation of `createAnnDataDenseMatrix` in .cpp file
Hi, @bnprks
Hi @ycli1995, thanks for making these adjustments. Replying to your points:
I have a couple of small coding style recommendations for the
Click to show `write()` implementation
    void write(MatrixLoader<T> &mat, std::atomic<bool> *user_interrupt = NULL) override {
        HighFive::DataSet h5dataset = createH5Matrix(mat.rows(), mat.cols());
        bool loaded = false; // Set to true once any non-zero value has been loaded.
        std::vector<T> val_buf(mat.rows(), 0);
        while (mat.nextCol()) {
            if (user_interrupt != NULL && *user_interrupt) return;
            while (mat.load()) {
                loaded = true;
                uint32_t *row_data = mat.rowData();
                T *val_data = mat.valData();
                uint32_t capacity = mat.capacity();
                for (uint32_t i = 0; i < capacity; i++) {
                    val_buf[row_data[i]] = val_data[i];
                }
            }
            if (loaded) {
                if (row_major) {
                    h5dataset.select({(uint64_t)mat.currentCol(), 0}, {1, val_buf.size()}).write_raw(val_buf.data(), datatype);
                } else {
                    h5dataset.select({0, (uint64_t)mat.currentCol()}, {val_buf.size(), 1}).write_raw(val_buf.data(), datatype);
                }
            }
            for (auto &x : val_buf) {
                x = 0;
            }
        }
        h5dataset.createAttribute("encoding-type", std::string("array"));
        h5dataset.createAttribute("encoding-version", std::string("0.2.0"));
    }
There are a couple of remaining changes I'd like to see:
After that, I'll want to put in an update in the
…nse_cpp()`
* Currently `buffer_size` only affects the reader of dense matrix, instead of the writer.
* No need for `OrderRows`
* No need for extra `zero_buf`
* Simplify the buffer loading
Hi, @bnprks After some trials, I finally gave up on controlling the buffer size in A simplified Thank you again for the code review and all the great advice!
Thanks @ycli1995, all these changes look good! I just pushed a few final commits to update some docs, the NEWS file, and a small code re-organization to minimize the amount that needs to be exposed in the header file. Everything you wrote looks to have been working well, so I'll merge this in to main now.
Hi, @bnprks,
Since `BPCells` now supports reading dense matrices from `h5ad` files, I added `write_matrix_anndata_hdf5_dense()` to write an `IterableMatrix` into a dense H5 dataset with `AnnData` format. This feature may help when users want to store a result matrix after a series of transformations and reuse it in downstream analysis.
The implementation leverages the `H5DenseMatrixWriter` class, whose `write()` method handles row-major or column-major matrices. The standalone `write_matrix_anndata_hdf5_dense()` should work well without breaking the original designs for sparse matrices. In other words, we leave the choice of calling this feature to users. On the other hand, users don't need to worry about whether the input matrix is sparse: the function will always write the matrix into a 2D H5 dataset.