Skip to content

Commit

Permalink
[r][cpp] Support writing AnnData dense matrices (#166)
Browse files Browse the repository at this point in the history
  • Loading branch information
ycli1995 authored Jan 9, 2025
1 parent d3d3d35 commit 7f4f502
Show file tree
Hide file tree
Showing 11 changed files with 383 additions and 12 deletions.
1 change: 1 addition & 0 deletions r/NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -131,6 +131,7 @@ export(write_fragments_memory)
export(write_insertion_bedgraph)
export(write_matrix_10x_hdf5)
export(write_matrix_anndata_hdf5)
export(write_matrix_anndata_hdf5_dense)
export(write_matrix_dir)
export(write_matrix_hdf5)
export(write_matrix_memory)
Expand Down
3 changes: 3 additions & 0 deletions r/NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,9 @@ Contributions welcome :)

# BPCells 0.3.1 (in-progress main branch)

## Features
- Add `write_matrix_anndata_hdf5_dense()` which allows writing matrices in AnnData's dense format, most commonly used for `obsm` or `varm` matrices. (Thanks to @ycli1995 for pull request #166)

## Bug-fixes
- Fix error message printing when MACS crashes during `call_peaks_macs()` (pull request #175)

Expand Down
4 changes: 4 additions & 0 deletions r/R/RcppExports.R
Original file line number Diff line number Diff line change
Expand Up @@ -457,6 +457,10 @@ write_matrix_anndata_hdf5_cpp <- function(matrix, file, group, type, row_major,
invisible(.Call(`_BPCells_write_matrix_anndata_hdf5_cpp`, matrix, file, group, type, row_major, buffer_size, chunk_size, gzip_level))
}

write_matrix_anndata_hdf5_dense_cpp <- function(matrix, file, dataset, type, row_major, chunk_size, gzip_level) {
invisible(.Call(`_BPCells_write_matrix_anndata_hdf5_dense_cpp`, matrix, file, dataset, type, row_major, chunk_size, gzip_level))
}

read_hdf5_string_cpp <- function(path, group, buffer_size) {
.Call(`_BPCells_read_hdf5_string_cpp`, path, group, buffer_size)
}
Expand Down
44 changes: 38 additions & 6 deletions r/R/matrix.R
Original file line number Diff line number Diff line change
Expand Up @@ -2136,22 +2136,32 @@ setMethod("short_description", "AnnDataMatrixH5", function(x) {

#' Read/write AnnData matrix
#'
#' @description
#' Read or write a matrix from an anndata hdf5 file. These functions will
#' automatically transpose matrices when converting to/from the AnnData
#' format. This is because the AnnData convention stores cells as rows, whereas the R
#' convention stores cells as columns. If this behavior is undesired, call `t()`
#' manually on the matrix inputs and outputs of these functions.
#'
#'
#' Most users writing to AnnData files should default to `write_matrix_anndata_hdf5()` rather
#' than the dense variant (see details for more information).
#'
#' @inheritParams open_matrix_hdf5
#' @return AnnDataMatrixH5 object, with cells as the columns.
#' @details Dimnames are inferred from `obs/_index` or `var/_index` based on length matching.
#' @details
#' **Efficiency considerations**: Reading from a dense AnnData matrix will generally be slower
#' than sparse for single cell datasets, so it is recommended to re-write any dense AnnData
#' inputs to a sparse format early in processing.
#'
#' `write_matrix_anndata_hdf5()` should be used by default, as it always writes in the more efficient sparse format.
#' `write_matrix_anndata_hdf5_dense()` writes in the AnnData dense format, and can be used for smaller matrices
#' when efficiency and file size are less of a concern than increased portability (e.g. writing to `obsm` or `varm` matrices).
#' See the [AnnData docs](https://anndata.readthedocs.io/en/latest/fileformat-prose.html#dense-arrays) for format details.
#'
#' **Dimension names:** Dimnames are inferred from `obs/_index` or `var/_index` based on length matching.
#' This helps to infer dimnames for `obsp`,` varm`, etc. If the number of `len(obs) == len(var)`,
#' dimname inference will be disabled.
#'
#' *Efficiency considerations*: Reading from a dense AnnData matrix will generally be slower
#' than sparse for single cell datasets, so it is recommended to re-write any dense AnnData
#' inputs to a sparse format early in processing. Note that `write_matrix_hdf5()` will only
#' write in the sparse format.
#' @export
open_matrix_anndata_hdf5 <- function(path, group = "X", buffer_size = 16384L) {
assert_is_file(path)
Expand Down Expand Up @@ -2191,6 +2201,28 @@ write_matrix_anndata_hdf5 <- function(mat, path, group = "X", buffer_size = 1638
open_matrix_anndata_hdf5(path, group, buffer_size)
}

#' @rdname open_matrix_anndata_hdf5
#' @inheritParams write_matrix_anndata_hdf5
#'
#' @param dataset The dataset within the hdf5 file to write the matrix to. Used for `write_matrix_anndata_hdf5_dense`
#'
#' @export
write_matrix_anndata_hdf5_dense <- function(mat, path, dataset = "X", buffer_size = 16384L, chunk_size = 1024L, gzip_level = 0L) {
assert_is(mat, "IterableMatrix")
assert_is(path, "character")
mat <- t(mat)
write_matrix_anndata_hdf5_dense_cpp(
iterate_matrix(mat),
path,
dataset,
matrix_type(mat),
mat@transpose,
chunk_size,
gzip_level
)
open_matrix_anndata_hdf5(path, dataset, buffer_size)
}

#' Import MatrixMarket files
#'
#' Read a sparse matrix from a MatrixMarket file. This is a text-based format used by
Expand Down
31 changes: 25 additions & 6 deletions r/man/open_matrix_anndata_hdf5.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions r/src/Makevars.in
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ bpcells-cpp/fragmentIterators/StoredFragments.o \
bpcells-cpp/fragmentUtils/BedWriter.o \
bpcells-cpp/fragmentUtils/FootprintMatrix.o \
bpcells-cpp/fragmentUtils/InsertionIterator.o \
bpcells-cpp/matrixIterators/H5DenseMatrixWriter.o \
bpcells-cpp/matrixIterators/ImportMatrix10xHDF5.o \
bpcells-cpp/matrixIterators/ImportMatrixAnnDataHDF5.o \
bpcells-cpp/matrixIterators/PeakMatrix.o \
Expand Down
17 changes: 17 additions & 0 deletions r/src/RcppExports.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1578,6 +1578,22 @@ BEGIN_RCPP
return R_NilValue;
END_RCPP
}
// write_matrix_anndata_hdf5_dense_cpp
void write_matrix_anndata_hdf5_dense_cpp(SEXP matrix, std::string file, std::string dataset, std::string type, bool row_major, uint32_t chunk_size, uint32_t gzip_level);
RcppExport SEXP _BPCells_write_matrix_anndata_hdf5_dense_cpp(SEXP matrixSEXP, SEXP fileSEXP, SEXP datasetSEXP, SEXP typeSEXP, SEXP row_majorSEXP, SEXP chunk_sizeSEXP, SEXP gzip_levelSEXP) {
BEGIN_RCPP
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< SEXP >::type matrix(matrixSEXP);
Rcpp::traits::input_parameter< std::string >::type file(fileSEXP);
Rcpp::traits::input_parameter< std::string >::type dataset(datasetSEXP);
Rcpp::traits::input_parameter< std::string >::type type(typeSEXP);
Rcpp::traits::input_parameter< bool >::type row_major(row_majorSEXP);
Rcpp::traits::input_parameter< uint32_t >::type chunk_size(chunk_sizeSEXP);
Rcpp::traits::input_parameter< uint32_t >::type gzip_level(gzip_levelSEXP);
write_matrix_anndata_hdf5_dense_cpp(matrix, file, dataset, type, row_major, chunk_size, gzip_level);
return R_NilValue;
END_RCPP
}
// read_hdf5_string_cpp
std::vector<std::string> read_hdf5_string_cpp(std::string path, std::string group, uint32_t buffer_size);
RcppExport SEXP _BPCells_read_hdf5_string_cpp(SEXP pathSEXP, SEXP groupSEXP, SEXP buffer_sizeSEXP) {
Expand Down Expand Up @@ -2650,6 +2666,7 @@ static const R_CallMethodDef CallEntries[] = {
{"_BPCells_dims_matrix_anndata_hdf5_cpp", (DL_FUNC) &_BPCells_dims_matrix_anndata_hdf5_cpp, 3},
{"_BPCells_iterate_matrix_anndata_hdf5_cpp", (DL_FUNC) &_BPCells_iterate_matrix_anndata_hdf5_cpp, 6},
{"_BPCells_write_matrix_anndata_hdf5_cpp", (DL_FUNC) &_BPCells_write_matrix_anndata_hdf5_cpp, 8},
{"_BPCells_write_matrix_anndata_hdf5_dense_cpp", (DL_FUNC) &_BPCells_write_matrix_anndata_hdf5_dense_cpp, 7},
{"_BPCells_read_hdf5_string_cpp", (DL_FUNC) &_BPCells_read_hdf5_string_cpp, 3},
{"_BPCells_hdf5_group_exists_cpp", (DL_FUNC) &_BPCells_hdf5_group_exists_cpp, 2},
{"_BPCells_hdf5_group_objnames_cpp", (DL_FUNC) &_BPCells_hdf5_group_objnames_cpp, 2},
Expand Down
149 changes: 149 additions & 0 deletions r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
// Copyright 2024 BPCells contributors
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// https://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

#include "../arrayIO/hdf5.h"
#include "../utils/filesystem_compat.h"

#include "H5DenseMatrixWriter.h"

namespace BPCells {

template <typename T> class H5DenseMatrixWriter : public MatrixWriter<T> {
private:
HighFive::File h5file;
std::string dataset_path;
uint64_t chunk_size;
uint32_t gzip_level;

HighFive::DataType datatype = HighFive::create_datatype<T>();
bool row_major;

HighFive::DataSet createH5Matrix(uint64_t nrow, uint64_t ncol) {
HighFive::SilenceHDF5 s;

// Create a dataspace with initial shape and max shape
uint64_t nrow_h5 = nrow;
uint64_t ncol_h5 = ncol;
if (row_major) {
nrow_h5 = ncol;
ncol_h5 = nrow;
}
HighFive::DataSpace dataspace({nrow_h5, ncol_h5});

// Use chunking
HighFive::DataSetCreateProps props;
props.add(HighFive::Chunking(std::vector<hsize_t>{std::min(chunk_size, nrow_h5), std::min(chunk_size, ncol_h5)}));
if (gzip_level > 0) {
props.add(HighFive::Shuffle());
props.add(HighFive::Deflate(gzip_level));
}
// Create the dataset
return h5file.createDataSet<T>(dataset_path, dataspace, props);
}

public:
H5DenseMatrixWriter(
HighFive::File h5file,
std::string dataset,
bool row_major,
uint64_t chunk_size = 1024,
uint32_t gzip_level = 0
)
: h5file(h5file)
, dataset_path(dataset)
, chunk_size(chunk_size)
, gzip_level(gzip_level)
, row_major(row_major) {}

void write(MatrixLoader<T> &mat_in, std::atomic<bool> *user_interrupt = NULL) override {

HighFive::DataSet h5dataset = createH5Matrix(mat_in.rows(), mat_in.cols());

bool loaded = false; // Any non-zero values has been loaded.
std::vector<T> val_buf(mat_in.rows()); // buffer for each column
while (mat_in.nextCol()) {
if (user_interrupt != NULL && *user_interrupt) return;
while (mat_in.load()) {
loaded = true;
uint32_t *row_data = mat_in.rowData();
T *val_data = mat_in.valData();
uint32_t capacity = mat_in.capacity();
for (uint32_t i = 0; i < capacity; i++) {
val_buf[row_data[i]] = val_data[i];
}
if (user_interrupt != NULL && *user_interrupt) return;
}
if (loaded) {
if (row_major) {
h5dataset.select({(uint64_t)mat_in.currentCol(), 0}, {1, val_buf.size()}).write_raw(val_buf.data(), datatype);
} else {
h5dataset.select({0, (uint64_t)mat_in.currentCol()}, {val_buf.size(), 1}).write_raw(val_buf.data(), datatype);
}
}
for (auto &x : val_buf) {
x = 0;
}
loaded = false;
}
h5dataset.createAttribute("encoding-type", std::string("array"));
h5dataset.createAttribute("encoding-version", std::string("0.2.0"));
}
};


template <typename T>
std::unique_ptr<MatrixWriter<T>> createAnnDataDenseMatrix(
std::string file,
std::string dataset,
bool row_major,
uint32_t chunk_size,
uint32_t gzip_level
) {
HighFive::SilenceHDF5 s;

// Create the HDF5 file
std_fs::path path(file);
if (path.has_parent_path() && !std_fs::exists(path.parent_path())) {
std_fs::create_directories(path.parent_path());
}
HighFive::File h5file(file, HighFive::File::OpenOrCreate);

return std::make_unique<H5DenseMatrixWriter<T>>(h5file, dataset, row_major, chunk_size, gzip_level);
}

// Explicit template instantiations
template std::unique_ptr<MatrixWriter<uint32_t>> createAnnDataDenseMatrix<uint32_t>(
std::string file,
std::string dataset,
bool row_major,
uint32_t chunk_size,
uint32_t gzip_level
);
template std::unique_ptr<MatrixWriter<uint64_t>> createAnnDataDenseMatrix<uint64_t>(
std::string file,
std::string dataset,
bool row_major,
uint32_t chunk_size,
uint32_t gzip_level
);
template std::unique_ptr<MatrixWriter<float>> createAnnDataDenseMatrix<float>(
std::string file,
std::string dataset,
bool row_major,
uint32_t chunk_size,
uint32_t gzip_level
);
template std::unique_ptr<MatrixWriter<double>> createAnnDataDenseMatrix<double>(
std::string file,
std::string dataset,
bool row_major,
uint32_t chunk_size,
uint32_t gzip_level
);

} // namespace BPCells
35 changes: 35 additions & 0 deletions r/src/bpcells-cpp/matrixIterators/H5DenseMatrixWriter.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
// Copyright 2024 BPCells contributors
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// https://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

#pragma once
#include <cstdint>
#include <memory>
#include <string>

#include "MatrixIterator.h"

namespace BPCells {

/**
* @brief Write a matrix to an AnnData HDF5 file in dense format
*
* @param file Path of the HDF5 file on disk
* @param dataset Path of the output dataset within the HDF5 file
* @param row_major If true, write output in transposed (row major) order. (Input is always interpeted as col major)
* @param chunk_size Chunk size used for HDF5 array storage
* @param gzip_level If non-zero, level of gzip compression to use while writing.
*/
template <typename T> std::unique_ptr<MatrixWriter<T>> createAnnDataDenseMatrix(
std::string file,
std::string dataset,
bool row_major,
uint32_t chunk_size,
uint32_t gzip_level
);

} // namespace BPCells
Loading

0 comments on commit 7f4f502

Please sign in to comment.