Skip to content

Dev Docs

Ben Parks edited this page Jul 1, 2024 · 1 revision

MatrixLoader types and object flow

Note: This setup will change in the near(ish) future, but it is accurate for now (July 2024)

We'll trace through the TransformPow function, which raises all entries in the matrix to a fixed power.

General Flow:

  1. We define an S4 object which inherits the "IterableMatrix" type (for all BPCells matrices) link
    • For direct loading from disk, this will store an absolute path to the input file
    • For transforms, generally one slot named @matrix will store the input matrix S4 object
  2. User calls a function that requires accessing the data (e.g. rowSums)
  3. BPCells R code calls iterate_matrix, which converts from a R object and returns a C++ pointer to R. R then passes this C++ pointer that can load data to an RCpp function that will use the data. rowSums R side
    • These pointers should generally be short-lived and not be returned to the user. (e.g. if a file is deleted while a C++ pointer wants to read it, that might create a badly-controlled crash rather than a nice R error)
    • Each different "IterableMatrix" subclass implements iterate_matrix via setMethod("iterate_matrix", ...) example link. Generally implemented by calling iterate_matrix on an input, then passing the pointer to an RCpp function.
  4. On the C++ side, these are represented as Rcpp::XPtr<std::unique_ptr<T>>.
    • Wrapping and unwrapping is handled by functions in R_xptr_wrapper.h
    • With few exceptions, this means that once C++ gets passed an object from R, it takes ownership and will destruct the object in a timely manner (thus closing any open files avoiding locking issues with HDF5 files)
  5. A RCpp top-level function takes the argument as an Rcpp SEXP type, then converts it using take_unique_xptr to the assumed C++ type. example link
  6. The top-level function calls a method/function that operates only on C++ objects (&MatrixLoader<double>::rowSums in our example)

These steps are basically the same for fragment objects

Important base types:

R "IterableMatrix" -- base class for all BPCells matrix types. Every transform or input file type gets a new subclass. New subclasses must implement the following methods:

code link

  • iterate_matrix() -> converts from R object to C++ pointer
  • matrix_type() -> Return the data type (uint32_t, float, double)
  • short_description() -> Return a vector of strings describing the steps
  • storage_order() (optional override) -> "row" for stored contiguously by row, "col" for stored contiguously by column
    • "col" is more common. By default types inherit and use the @transpose slot which is TRUE if the matrix is row-major. C++ treats everything as col-major.
  • matrix_inputs() (optional override) -> return a list of the input matrix or matrices for transform objects
  • matrix_inputs<- (optional override) -> Allow setting the input matrices given a list of IterableMatrix objects

Important slots:

  • @dim - dimensions of the matrix (row,col).
  • @dimnames -length-2 list with row/col names if present (in that order)

C++ MatrixLoader

code link

New operations implement a subclass, and override methods as needed. For convenience when writing transforms, use MatrixLoaderWrapper on the implementation side (will default to using the input's implementation for all functions). For convenience on the consumer side, use MatrixIterator, which allows accessing one element at a time rather than in batches via pointers. Example MatrixIterator usage

  • Conceptually, each entry is a tuple of (val, row), and these are loaded one column at a time (so the col index will not change within an inner loop). In practice, there's a pointer to a val array, and a pointer to a row array which have equal lengths. Often these are pointers to the .data() of a std::vector<T>
  • As metadata, rows and columns can have string names assigned to them. This is the only metadata supported on BPCells matrices

R IterableFragments link

  • Unlike IterableMatrix, does not include any slots in the base class
  • iterate_fragments() -> basically same deal as iterate_matrix(). Converts R object to C++ pointer
  • short_description() -> Same as IterableMatrix. Return a charater vector listing transformation steps for this object and all its inputs.
  • cellNames() -> return a vector of the cell names for this object
  • chrNames() -> return a vector of the chromosome names for this object

C++ FragmentsLoader

code link

Same idea as MatrixLoader. Similarly, FragmentLoaderWrapper is useful for implementing new transforms, and FragmentIterator is useful for writing a consuming function.

  • Each entry is a tuple of (cell, start, end) where start is 0-based inclusive index and end is a 0-based exclusive index. Cell is a numeric ID. Chromosome changes slowly and has a separate numeric ID. Range coordinate systems link. BPCells uses a zero-based, end-exclusive system similar to bed files.
  • Cell and chromosome string labels are the only metadata supported on BPCells fragment objects.

Supporting user interrupts

Problem: when a user presses Ctrl-C, we want to know to quit out of our computation promptly.

BPCells solution:

  1. Worker function takes a final argument of std::atomic<bool> *user_interrupt.
  2. If pointer is non-null, worker function checks periodically and exits early if it becomes true: if (user_interrupt != nullptr && count++ % 65536 == 0 && *user_interrupt) return; example link
    • Don't put this in the very inner loop ideally. For matrices make sure it gets checked at least once a column. For fragments once a chromosome is OK, but ideally something like every ~100K fragments (pick a power of 2 as in above example)
  3. In top-level function, call worker function through run_with_R_interrupt_check. example usage
    • Note that getting the C++ reference types mixed up will cause confusing compiler errors. Check example usage to see how it works.

What else to know:

  • run_with_R_interrupt_check will check for an interrupt in the main thread every 100ms, and run the worker in a background thread. It will signal an interrupt by setting user_interrupt to true.
  • As from the docs: "It is EXTREMELY IMPORTANT that no R objects are created/destroyed inside the spawned thread, which includes destructors that mess with R's GC protection."
  • i.e. worker functions should basically operate on pointers/references to C++ objects