-
Notifications
You must be signed in to change notification settings - Fork 18
Dev Docs
Note: This setup will change in the near(ish) future, but it is accurate for now (July 2024)
We'll trace through the TransformPow
function, which raises all entries in the matrix to a fixed power.
General Flow:
- We define an S4 object which inherits the "IterableMatrix" type (for all BPCells matrices) link
- For direct loading from disk, this will store an absolute path to the input file
- For transforms, generally one slot named
@matrix
will store the input matrix S4 object
- User calls a function that requires accessing the data (e.g.
rowSums
) - BPCells R code calls
iterate_matrix
, which converts from a R object and returns a C++ pointer to R. R then passes this C++ pointer that can load data to an RCpp function that will use the data. rowSums R side- These pointers should generally be short-lived and not be returned to the user. (e.g. if a file is deleted while a C++ pointer wants to read it, that might create a badly-controlled crash rather than a nice R error)
- Each different "IterableMatrix" subclass implements
iterate_matrix
viasetMethod("iterate_matrix", ...)
example link. Generally implemented by callingiterate_matrix
on an input, then passing the pointer to an RCpp function.
- On the C++ side, these are represented as
Rcpp::XPtr<std::unique_ptr<T>>
.- Wrapping and unwrapping is handled by functions in R_xptr_wrapper.h
- With few exceptions, this means that once C++ gets passed an object from R, it takes ownership and will destruct the object in a timely manner (thus closing any open files avoiding locking issues with HDF5 files)
- A RCpp top-level function takes the argument as an Rcpp
SEXP
type, then converts it usingtake_unique_xptr
to the assumed C++ type. example link - The top-level function calls a method/function that operates only on C++ objects (
&MatrixLoader<double>::rowSums
in our example)
These steps are basically the same for fragment objects
R "IterableMatrix" -- base class for all BPCells matrix types. Every transform or input file type gets a new subclass. New subclasses must implement the following methods:
- iterate_matrix() -> converts from R object to C++ pointer
- matrix_type() -> Return the data type (uint32_t, float, double)
- short_description() -> Return a vector of strings describing the steps
- storage_order() (optional override) -> "row" for stored contiguously by row, "col" for stored contiguously by column
- "col" is more common. By default types inherit and use the
@transpose
slot which is TRUE if the matrix is row-major. C++ treats everything as col-major.
- "col" is more common. By default types inherit and use the
- matrix_inputs() (optional override) -> return a list of the input matrix or matrices for transform objects
- matrix_inputs<- (optional override) -> Allow setting the input matrices given a list of IterableMatrix objects
Important slots:
-
@dim
- dimensions of the matrix (row,col). -
@dimnames
-length-2 list with row/col names if present (in that order)
C++ MatrixLoader
New operations implement a subclass, and override methods as needed. For convenience when writing transforms, use MatrixLoaderWrapper
on the implementation side (will default to using the input's implementation for all functions). For convenience on the consumer side, use MatrixIterator
, which allows accessing one element at a time rather than in batches via pointers. Example MatrixIterator usage
- Conceptually, each entry is a tuple of (val, row), and these are loaded one column at a time (so the col index will not change within an inner loop). In practice, there's a pointer to a
val
array, and a pointer to arow
array which have equal lengths. Often these are pointers to the.data()
of astd::vector<T>
- As metadata, rows and columns can have string names assigned to them. This is the only metadata supported on BPCells matrices
R IterableFragments link
- Unlike IterableMatrix, does not include any slots in the base class
- iterate_fragments() -> basically same deal as iterate_matrix(). Converts R object to C++ pointer
- short_description() -> Same as IterableMatrix. Return a charater vector listing transformation steps for this object and all its inputs.
- cellNames() -> return a vector of the cell names for this object
- chrNames() -> return a vector of the chromosome names for this object
C++ FragmentsLoader
Same idea as MatrixLoader. Similarly, FragmentLoaderWrapper
is useful for implementing new transforms, and FragmentIterator
is useful for writing a consuming function.
- Each entry is a tuple of (cell, start, end) where start is 0-based inclusive index and end is a 0-based exclusive index. Cell is a numeric ID. Chromosome changes slowly and has a separate numeric ID. Range coordinate systems link. BPCells uses a zero-based, end-exclusive system similar to bed files.
- Cell and chromosome string labels are the only metadata supported on BPCells fragment objects.
Problem: when a user presses Ctrl-C, we want to know to quit out of our computation promptly.
BPCells solution:
- Worker function takes a final argument of
std::atomic<bool> *user_interrupt
. - If pointer is non-null, worker function checks periodically and exits early if it becomes true:
if (user_interrupt != nullptr && count++ % 65536 == 0 && *user_interrupt) return;
example link- Don't put this in the very inner loop ideally. For matrices make sure it gets checked at least once a column. For fragments once a chromosome is OK, but ideally something like every ~100K fragments (pick a power of 2 as in above example)
- In top-level function, call worker function through
run_with_R_interrupt_check
. example usage- Note that getting the C++ reference types mixed up will cause confusing compiler errors. Check example usage to see how it works.
What else to know:
- run_with_R_interrupt_check will check for an interrupt in the main thread every 100ms, and run the worker in a background thread. It will signal an interrupt by setting user_interrupt to true.
- As from the docs: "It is EXTREMELY IMPORTANT that no R objects are created/destroyed inside the spawned thread, which includes destructors that mess with R's GC protection."
- i.e. worker functions should basically operate on pointers/references to C++ objects