[r][cpp]Add supports to write `IterableMatrix` to dense H5 dataset with AnnData format #166

ycli1995 · 2024-12-11T08:25:53Z

Since BPCells now supports to read dense matrix from h5ad file, I add write_matrix_anndata_hdf5_dense() to write an IterableMatrix into a dense H5 dataset with AnnData format. This feature may help when users want to store a result matrix after a series of transformations, and reuse it in downstream analysis.

The implemention levervages H5DenseMatrixWriter class with write() method to handle row major or column major matrices. The standalone write_matrix_anndata_hdf5_dense() should work well without breaking the orignial designs for sparse matrices. In other words, we leave the choices of calling this feature to users. On the other hand, users don't need to concern about whether the input matrix is sparse, the function will always write the matrix into a 2D H5 dataset.

…th AnnData format

bnprks · 2024-12-13T03:41:36Z

Hi @ycli1995, thanks for submitting this! From a quick skim it looks pretty good. This is a welcome addition to fill out our support for dense anndata matrices. I'm a bit busy right now so it will be longer than usual to get a proper review of the C++ and merge.

Assuming there aren't any issues in the C++ on a closer look, I'll probably just have some minor tweaks to request on the R API before merging. (I'm trying to think through whether it's better to just add a sparse=TRUE argument onto write_matrix_anndata_hdf5() rather than adding another function)

@immanuelazn might also take a look since he's a recent lab hire who is helping out with BPCells development

ycli1995 · 2024-12-14T02:12:40Z

Thanks you for the feedback! My consideration of a new function is that the orignial write_matrix_anndata_hdf5() uses group to specify where the matrix to be stored within an H5 file. Since an H5 Dataset is quite different with a Group conceptually, I'm not sure whether it is a good idea to mix up these two storage formats in write_matrix_anndata_hdf5() which may have been widely used in community codes based on BPCells. Therefore, I finally decided to just add another function to insure the compatibility, not only for users' side but also for the underlying implemention.

immanuelazn · 2024-12-14T10:42:09Z

Hi @ycli1995 thanks for the contribution! This is great work, and I can see the reasoning behind a lot of your design decisions. I left a preliminary review, and I will probably continue to add on to it over the weekend. Ben will likely give it a second pass though.

Overall, functionality looks great, but I commented on some styling things, as well as on the R interface that you and Ben had already discussed.

r/R/matrix.R

r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h

r/R/matrix.R

immanuelazn · 2024-12-14T10:30:36Z

r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h

+                    for (uint32_t ii = 0; ii < capacity; ii++) {
+                        val_buf[*(mat.rowData() + i + ii)] = *(mat.valData() + i + ii);
+                    }
+                    idx += capacity;


I'm not sure what idx is providing here since it is not being reset. Is it just a check that there is more than 0 rows?

I think this section can actually be simplified a bit further. I don't think you need the inner loop or the max_capacity variable at all. In StoredMatrixWriter, an inner loop is used to minimize the buffer size for the downstream NumWriter objects, but in this case it's not needed since we are just staging everything in val_buf anyhow.

I also agree with Immanuel: you could also just replace idx with a bool that gets set to true if any data has been loaded from the column.

immanuelazn · 2024-12-14T10:31:15Z

r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h

+        std::vector<T> val_buf = zero_buf;  // buffer for each column
+        while (mat.nextCol()) {
+            if (user_interrupt != NULL && *user_interrupt) return;
+            if (mat.currentCol() < col)


Feel like you can remove the conditional on line 104 and just set this to if (mat.currentCol() != col)

I think mat.currentCol() != col wouldn't work because in the case of a matrix with a column of all zeros, it's possible to have col < mat.currentCol(). (That said, I'm not sure the case of a missing column is being handled correctly right now)

r/tests/testthat/test-matrix_io.R

bnprks

Hi @ycli1995, I got a chance to do my full look through as well. I added a few comments in addition to the ones Immanuel already made. Most stuff is just coding style to improve consistency with the rest of the BPCells code, but I did find one correctness issue.

Because mat.nextCol() can skip columns if there are empty columns, then writes don't work properly for matrices with some empty columns. Here's a test case that currently fails:

dir <- tempdir()
m <- matrix(0, nrow=3, ncol=4)
m[2,2] <- 1
m[3,4] <- 1
rownames(m) <- paste0("row", seq_len(nrow(m)))
colnames(m) <- paste0("col", seq_len(ncol(m)))

mat <- m |> as("dgCMatrix") |> as("IterableMatrix")

ans <- write_matrix_anndata_hdf5_dense(mat, file.path(dir, "zeros.h5"))

expect_identical(as.matrix(mat), as.matrix(ans))

r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h

bnprks · 2024-12-15T01:55:11Z

r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h

+// Write a Dense Array to an AnnData file
+template <typename T>
+H5DenseMatrixWriter<T> createAnnDataDenseMatrix(
+    std::string file,
+    std::string dataset,
+    uint32_t nrow,
+    uint32_t ncol,
+    bool row_major,
+    uint32_t buffer_size,
+    uint32_t chunk_size,
+    uint32_t gzip_level
+) {


Could we re-arrange the code so most of the implementation can be in a .cpp file?

I think the best way to do this would be to only put the declaration for createAnnDataDenseMatrix in the header file, and make it return a std::unique_ptr<MatrixWriter<T>>, then move everything else into the .cpp file with explicit template instantiations, just like createAnnDataMatrix does (link).

And as mentioned above, I think we'll probably want to get rid of the nrow and ncol parameters here and get them during the call to write(mat_in)

* Delay the construction of H5 dataset in `H5DenseMatrixWriter::write()` * Simplify the implementation of `H5DenseMatrixWriter::write()` wile buffering values for each column * Put the implementation of `createAnnDataDenseMatrix` in .cpp file

ycli1995 · 2024-12-19T09:22:06Z

Hi, @bnprks
Sorry for the delayed reply.

I finally get things correct when H5DenseMatrixWriter encounter columns with zeros. The write() method now should always refresh the col and val_buf after calling mat.nextCol(). Corresponding unit tests have also been commited.
The implementation of H5DenseMatrixWriter and the whole writing workflow have been overhauled. write_matrix_anndata_hdf5_dense_base() now follows the similar logic with write_matrix_anndata_hdf5_base:

createAnnDataDenseMatrix() will create the H5 file and gather necessary information to construct a writer like createAnnDataMatrix().
Call write() with the input matrix with any size. This step will actually create the target dataset.
After the matrix has been written, create obs and var.

For max_capacity, my consideration is that if we get a matrix with very large row numbers (such as 100,000-row expression matrix for some plants), we may need to load() more than one time to save RAM when writing one column. Therefore I keep this argument although it currently seems not work well. It may take me extra time to figure out how to reuse the val_buf to save memory usage.

bnprks · 2024-12-21T03:48:19Z

Hi @ycli1995, thanks for making these adjustments. Replying to your points:

Yep, everything looks good now when zero values occur, looks good!
Thanks for making that adjustment, seems good now
I don't think your logic quite works here. val_buf will be a fixed size regardless of the value of max_capacity, so there is no memory savings when max_capacity is lower. I would recommend deleting this parameter and the corresponding buffer_size parameter. Because of the way writing an hdf5 matrix works, your approach with val_buf is basically the only way to go, but it doesn't benefit from doing this chunked approach.

I've have a couple small coding style recommendations for the write() function, and I think it's just easier to show the full implementation (see below). The main changes were:

Don't need to use OrderRows since the function will work fine regardless of the ordering
Don't need a safety check for out-of-order columns as dense HDF5 writes aren't sensitive to this
Don't need a separate zero buf, as it's less memory usage to just reset the val_buf to zero after each column
Removed unnecessary checks and variables after accounting for those changes

Click to show `write()` implementation

void write(MatrixLoader<T> &mat, std::atomic<bool> *user_interrupt = NULL) override {
        HighFive::DataSet h5dataset = createH5Matrix(mat.rows(), mat.cols());
        bool loaded = false; // Any non-zero values has been loaded.
        std::vector<T> val_buf(mat.rows(), 0);

        while (mat.nextCol()) {
            if (user_interrupt != NULL && *user_interrupt) return;

            while (mat.load()) {
                loaded = true;
                uint32_t *row_data = mat.rowData();
                T *val_data = mat.valData();
                uint32_t capacity = mat.capacity();
                for (uint32_t i = 0; i < capacity; i++) {
                    val_buf[row_data[i]] = val_data[i];
                }
            }
            if (loaded) {
                if (row_major) {
                    h5dataset.select({(uint64_t)mat.currentCol(), 0}, {1, val_buf.size()}).write_raw(val_buf.data(), datatype);
                } else {
                    h5dataset.select({0, (uint64_t)mat.currentCol()}, {val_buf.size(), 1}).write_raw(val_buf.data(), datatype);
                }
            }
            for (auto &x : val_buf) {
                x = 0;
            }
        }
        h5dataset.createAttribute("encoding-type", std::string("array"));
        h5dataset.createAttribute("encoding-version", std::string("0.2.0"));
    }

There are a couple remaining changes I'd like to see:

Could we put in a simplified version of the write() function similar to what I've written above, then also delete the max_capacity and buffer_size parameters?
(Optional) we could move the whole H5DenseMatrixWriter class into the .cpp file if we make createAnnDataDenseMatrix return a std::unique_ptr<MatrixWriter<T>> (Probably constructed with std::make_unique<H5DenseMatrixWriter<T>> in the implementation)

After that, I'll want to put in an update in the NEWS.md file, and maybe make some slight documentation tweaks once I see how everything looks rendered on the docs website. Would you prefer I push those changes as a modification on this PR so you can check them over, or prefer that I do the docs updates as a follow-up PR after this is merged?

…nse_cpp()` * Currently `buffer_size` only affects the reader of dense matrix, instead of the writer.

* No need for `OrderRows` * No need for extra `zero_buf` * Simplify the buffer loading

ycli1995 · 2024-12-25T07:53:00Z

Hi, @bnprks
I'm so sorry for the delayed reply since I've been quite busy for my major job these days. T^T

After some trials, I finally gave up for controlling the buffer size in H5DenseMatrixWriter. All corresponding parameters have been removed. Currently the buffer_size parameter in write_matrix_anndata_hdf5_dense() will only work for the internal open_matrix_anndata_hdf5(). Users could need more RAM when they encounter matrices with large row numbers which, however, may be uncommon situations. Therefore extra consideration is not necessary at the momment.

A simplified write() method has been added according to your advices. In addition, I change the loaded to false after writing each column to avoid triggering the H5 writing with a zero buffer. It works well with current unit tests. For moving H5DenseMatrixWriter class into the .cpp file, I'm not 100% sure about how to do it correctly yet. So I left it as is.

Thank you again for the code review and those great advices!

bnprks · 2025-01-09T01:51:13Z

Thanks @ycli1995, all these changes look good! I just pushed a few final commits to update some docs, the NEWS file, and a small code re-organization to minimize the amount that needs to be exposed in the header file. Everything you wrote looks to have been working well, so I'll merge this in to main now.

[r][cpp]Add supports to write IterableMatrix to dense H5 dataset wi…

23324d0

…th AnnData format

immanuelazn reviewed Dec 14, 2024

View reviewed changes

bnprks reviewed Dec 15, 2024

View reviewed changes

ycli1995 added 2 commits December 18, 2024 17:56

[r][cpp]Fix a bug when writing dense matrix with empty columns

7d68dfc

ycli1995 added 2 commits December 25, 2024 14:40

[r][cpp]Remove unused buffer_size for `write_matrix_anndata_hdf5_de…

518fb95

…nse_cpp()` * Currently `buffer_size` only affects the reader of dense matrix, instead of the writer.

[cpp]Simplify write() method for H5DenseMatrixWriter

b91bbad

* No need for `OrderRows` * No need for extra `zero_buf` * Simplify the buffer loading

bnprks and others added 7 commits January 8, 2025 15:30

Move H5DenseMatrixWriter to cpp file

ca07257

Update docs

63d111d

Update NEWS.md

7f27233

Update NEWS.md

e9a4aec

Merge branch 'dense' of github.com:ycli1995/BPCells into pr/ycli1995/166

c6511be

Merge branch 'main' into dense

289a51e

Remove duplicate NEWS line

2e8451d

bnprks merged commit 7f4f502 into bnprks:main Jan 9, 2025

immanuelazn pushed a commit to immanuelazn/BPCells that referenced this pull request Jan 18, 2025

[r][cpp] Support writing AnnData dense matrices (bnprks#166)

f1232b0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[r][cpp]Add supports to write `IterableMatrix` to dense H5 dataset with AnnData format #166

[r][cpp]Add supports to write `IterableMatrix` to dense H5 dataset with AnnData format #166

ycli1995 commented Dec 11, 2024

bnprks commented Dec 13, 2024

ycli1995 commented Dec 14, 2024

immanuelazn commented Dec 14, 2024

immanuelazn Dec 14, 2024

bnprks Dec 15, 2024

immanuelazn Dec 14, 2024

bnprks Dec 15, 2024

bnprks left a comment

bnprks Dec 15, 2024

ycli1995 commented Dec 19, 2024

bnprks commented Dec 21, 2024

ycli1995 commented Dec 25, 2024

bnprks commented Jan 9, 2025

[r][cpp]Add supports to write IterableMatrix to dense H5 dataset with AnnData format #166

[r][cpp]Add supports to write IterableMatrix to dense H5 dataset with AnnData format #166

Conversation

ycli1995 commented Dec 11, 2024

bnprks commented Dec 13, 2024

ycli1995 commented Dec 14, 2024

immanuelazn commented Dec 14, 2024

immanuelazn Dec 14, 2024

Choose a reason for hiding this comment

bnprks Dec 15, 2024

Choose a reason for hiding this comment

immanuelazn Dec 14, 2024

Choose a reason for hiding this comment

bnprks Dec 15, 2024

Choose a reason for hiding this comment

bnprks left a comment

Choose a reason for hiding this comment

bnprks Dec 15, 2024

Choose a reason for hiding this comment

ycli1995 commented Dec 19, 2024

bnprks commented Dec 21, 2024

ycli1995 commented Dec 25, 2024

bnprks commented Jan 9, 2025

[r][cpp]Add supports to write `IterableMatrix` to dense H5 dataset with AnnData format #166

[r][cpp]Add supports to write `IterableMatrix` to dense H5 dataset with AnnData format #166