Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r][cpp]Add supports to write IterableMatrix to dense H5 dataset with AnnData format #166

Merged
merged 12 commits into from
Jan 9, 2025

Conversation

ycli1995
Copy link
Contributor

Hi, @bnprks,

Since BPCells now supports to read dense matrix from h5ad file, I add write_matrix_anndata_hdf5_dense() to write an IterableMatrix into a dense H5 dataset with AnnData format. This feature may help when users want to store a result matrix after a series of transformations, and reuse it in downstream analysis.

The implemention levervages H5DenseMatrixWriter class with write() method to handle row major or column major matrices. The standalone write_matrix_anndata_hdf5_dense() should work well without breaking the orignial designs for sparse matrices. In other words, we leave the choices of calling this feature to users. On the other hand, users don't need to concern about whether the input matrix is sparse, the function will always write the matrix into a 2D H5 dataset.

@bnprks
Copy link
Owner

bnprks commented Dec 13, 2024

Hi @ycli1995, thanks for submitting this! From a quick skim it looks pretty good. This is a welcome addition to fill out our support for dense anndata matrices. I'm a bit busy right now so it will be longer than usual to get a proper review of the C++ and merge.

Assuming there aren't any issues in the C++ on a closer look, I'll probably just have some minor tweaks to request on the R API before merging. (I'm trying to think through whether it's better to just add a sparse=TRUE argument onto write_matrix_anndata_hdf5() rather than adding another function)

@immanuelazn might also take a look since he's a recent lab hire who is helping out with BPCells development

@ycli1995
Copy link
Contributor Author

Thanks you for the feedback! My consideration of a new function is that the orignial write_matrix_anndata_hdf5() uses group to specify where the matrix to be stored within an H5 file. Since an H5 Dataset is quite different with a Group conceptually, I'm not sure whether it is a good idea to mix up these two storage formats in write_matrix_anndata_hdf5() which may have been widely used in community codes based on BPCells. Therefore, I finally decided to just add another function to insure the compatibility, not only for users' side but also for the underlying implemention.

@immanuelazn
Copy link
Collaborator

Hi @ycli1995 thanks for the contribution! This is great work, and I can see the reasoning behind a lot of your design decisions. I left a preliminary review, and I will probably continue to add on to it over the weekend. Ben will likely give it a second pass though.

Overall, functionality looks great, but I commented on some styling things, as well as on the R interface that you and Ben had already discussed.

r/R/matrix.R Show resolved Hide resolved
r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h Outdated Show resolved Hide resolved
r/R/matrix.R Show resolved Hide resolved
for (uint32_t ii = 0; ii < capacity; ii++) {
val_buf[*(mat.rowData() + i + ii)] = *(mat.valData() + i + ii);
}
idx += capacity;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what idx is providing here since it is not being reset. Is it just a check that there is more than 0 rows?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section can actually be simplified a bit further. I don't think you need the inner loop or the max_capacity variable at all. In StoredMatrixWriter, an inner loop is used to minimize the buffer size for the downstream NumWriter objects, but in this case it's not needed since we are just staging everything in val_buf anyhow.

I also agree with Immanuel: you could also just replace idx with a bool that gets set to true if any data has been loaded from the column.

std::vector<T> val_buf = zero_buf; // buffer for each column
while (mat.nextCol()) {
if (user_interrupt != NULL && *user_interrupt) return;
if (mat.currentCol() < col)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like you can remove the conditional on line 104 and just set this to if (mat.currentCol() != col)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mat.currentCol() != col wouldn't work because in the case of a matrix with a column of all zeros, it's possible to have col < mat.currentCol(). (That said, I'm not sure the case of a missing column is being handled correctly right now)

r/tests/testthat/test-matrix_io.R Show resolved Hide resolved
Copy link
Owner

@bnprks bnprks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ycli1995, I got a chance to do my full look through as well. I added a few comments in addition to the ones Immanuel already made. Most stuff is just coding style to improve consistency with the rest of the BPCells code, but I did find one correctness issue.

Because mat.nextCol() can skip columns if there are empty columns, then writes don't work properly for matrices with some empty columns. Here's a test case that currently fails:

dir <- tempdir()
m <- matrix(0, nrow=3, ncol=4)
m[2,2] <- 1
m[3,4] <- 1
rownames(m) <- paste0("row", seq_len(nrow(m)))
colnames(m) <- paste0("col", seq_len(ncol(m)))

mat <- m |> as("dgCMatrix") |> as("IterableMatrix")

ans <- write_matrix_anndata_hdf5_dense(mat, file.path(dir, "zeros.h5"))

expect_identical(as.matrix(mat), as.matrix(ans))

r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h Outdated Show resolved Hide resolved
r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h Outdated Show resolved Hide resolved
r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h Outdated Show resolved Hide resolved
Comment on lines 119 to 130
// Write a Dense Array to an AnnData file
template <typename T>
H5DenseMatrixWriter<T> createAnnDataDenseMatrix(
std::string file,
std::string dataset,
uint32_t nrow,
uint32_t ncol,
bool row_major,
uint32_t buffer_size,
uint32_t chunk_size,
uint32_t gzip_level
) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we re-arrange the code so most of the implementation can be in a .cpp file?

I think the best way to do this would be to only put the declaration for createAnnDataDenseMatrix in the header file, and make it return a std::unique_ptr<MatrixWriter<T>>, then move everything else into the .cpp file with explicit template instantiations, just like createAnnDataMatrix does (link).

And as mentioned above, I think we'll probably want to get rid of the nrow and ncol parameters here and get them during the call to write(mat_in)

* Delay the construction of H5 dataset in `H5DenseMatrixWriter::write()`
* Simplify the implementation of `H5DenseMatrixWriter::write()` wile buffering values for each column
* Put the implementation of `createAnnDataDenseMatrix` in .cpp file
@ycli1995
Copy link
Contributor Author

Hi, @bnprks
Sorry for the delayed reply.

  1. I finally get things correct when H5DenseMatrixWriter encounter columns with zeros. The write() method now should always refresh the col and val_buf after calling mat.nextCol(). Corresponding unit tests have also been commited.
  2. The implementation of H5DenseMatrixWriter and the whole writing workflow have been overhauled. write_matrix_anndata_hdf5_dense_base() now follows the similar logic with write_matrix_anndata_hdf5_base:
  • createAnnDataDenseMatrix() will create the H5 file and gather necessary information to construct a writer like createAnnDataMatrix().
  • Call write() with the input matrix with any size. This step will actually create the target dataset.
  • After the matrix has been written, create obs and var.
  1. For max_capacity, my consideration is that if we get a matrix with very large row numbers (such as 100,000-row expression matrix for some plants), we may need to load() more than one time to save RAM when writing one column. Therefore I keep this argument although it currently seems not work well. It may take me extra time to figure out how to reuse the val_buf to save memory usage.

@bnprks
Copy link
Owner

bnprks commented Dec 21, 2024

Hi @ycli1995, thanks for making these adjustments. Replying to your points:

  1. Yep, everything looks good now when zero values occur, looks good!
  2. Thanks for making that adjustment, seems good now
  3. I don't think your logic quite works here. val_buf will be a fixed size regardless of the value of max_capacity, so there is no memory savings when max_capacity is lower. I would recommend deleting this parameter and the corresponding buffer_size parameter. Because of the way writing an hdf5 matrix works, your approach with val_buf is basically the only way to go, but it doesn't benefit from doing this chunked approach.

I've have a couple small coding style recommendations for the write() function, and I think it's just easier to show the full implementation (see below). The main changes were:

  • Don't need to use OrderRows since the function will work fine regardless of the ordering
  • Don't need a safety check for out-of-order columns as dense HDF5 writes aren't sensitive to this
  • Don't need a separate zero buf, as it's less memory usage to just reset the val_buf to zero after each column
  • Removed unnecessary checks and variables after accounting for those changes
Click to show `write()` implementation
void write(MatrixLoader<T> &mat, std::atomic<bool> *user_interrupt = NULL) override {
        HighFive::DataSet h5dataset = createH5Matrix(mat.rows(), mat.cols());
        bool loaded = false; // Any non-zero values has been loaded.
        std::vector<T> val_buf(mat.rows(), 0);

        while (mat.nextCol()) {
            if (user_interrupt != NULL && *user_interrupt) return;

            while (mat.load()) {
                loaded = true;
                uint32_t *row_data = mat.rowData();
                T *val_data = mat.valData();
                uint32_t capacity = mat.capacity();
                for (uint32_t i = 0; i < capacity; i++) {
                    val_buf[row_data[i]] = val_data[i];
                }
            }
            if (loaded) {
                if (row_major) {
                    h5dataset.select({(uint64_t)mat.currentCol(), 0}, {1, val_buf.size()}).write_raw(val_buf.data(), datatype);
                } else {
                    h5dataset.select({0, (uint64_t)mat.currentCol()}, {val_buf.size(), 1}).write_raw(val_buf.data(), datatype);
                }
            }
            for (auto &x : val_buf) {
                x = 0;
            }
        }
        h5dataset.createAttribute("encoding-type", std::string("array"));
        h5dataset.createAttribute("encoding-version", std::string("0.2.0"));
    }

There are a couple remaining changes I'd like to see:

  1. Could we put in a simplified version of the write() function similar to what I've written above, then also delete the max_capacity and buffer_size parameters?
  2. (Optional) we could move the whole H5DenseMatrixWriter class into the .cpp file if we make createAnnDataDenseMatrix return a std::unique_ptr<MatrixWriter<T>> (Probably constructed with std::make_unique<H5DenseMatrixWriter<T>> in the implementation)

After that, I'll want to put in an update in the NEWS.md file, and maybe make some slight documentation tweaks once I see how everything looks rendered on the docs website. Would you prefer I push those changes as a modification on this PR so you can check them over, or prefer that I do the docs updates as a follow-up PR after this is merged?

…nse_cpp()`

* Currently `buffer_size` only affects the reader of dense matrix, instead of the writer.
* No need for `OrderRows`
* No need for extra `zero_buf`
* Simplify the buffer loading
@ycli1995
Copy link
Contributor Author

Hi, @bnprks
I'm so sorry for the delayed reply since I've been quite busy for my major job these days. T^T

After some trials, I finally gave up for controlling the buffer size in H5DenseMatrixWriter. All corresponding parameters have been removed. Currently the buffer_size parameter in write_matrix_anndata_hdf5_dense() will only work for the internal open_matrix_anndata_hdf5(). Users could need more RAM when they encounter matrices with large row numbers which, however, may be uncommon situations. Therefore extra consideration is not necessary at the momment.

A simplified write() method has been added according to your advices. In addition, I change the loaded to false after writing each column to avoid triggering the H5 writing with a zero buffer. It works well with current unit tests. For moving H5DenseMatrixWriter class into the .cpp file, I'm not 100% sure about how to do it correctly yet. So I left it as is.

Thank you again for the code review and those great advices!

@bnprks
Copy link
Owner

bnprks commented Jan 9, 2025

Thanks @ycli1995, all these changes look good! I just pushed a few final commits to update some docs, the NEWS file, and a small code re-organization to minimize the amount that needs to be exposed in the header file. Everything you wrote looks to have been working well, so I'll merge this in to main now.

@bnprks bnprks merged commit 7f4f502 into bnprks:main Jan 9, 2025
immanuelazn pushed a commit to immanuelazn/BPCells that referenced this pull request Jan 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants