The goal of malacoda is to enable Bayesian analysis of high-throughput genomic assays like massively parallel reporter assays (MPRA) and CRISPR screens.
It uses a negative-binomial-based Bayesian model shown in the Kruschke diagram below. This model offers numerous advantages over traditional null hypothesis significance testing based methods:
- Models raw data - The model is fit directly to the input counts
(MPRA barcodes or gRNAs)
- The lack of transformations avoids discarding 0 counts as in traditional methods.
- Prior information - Empirical priors are fit from the observed assay
globally, enabling estimate shrinkage that reduces errors due to
multiple testing
- Informative annotations (such as DNase hypersensitivity estimates or gene scores) can be included to further refine the empirical priors by conditional density estimation.
- The R interface provides clear and interpretable outputs and figures.
Other features include:
- automated barcode counting from MPRA library design and FASTQs
- custom Stan models for fast posterior evaluation
- variational Bayes support through
rstan::vb
that allows for quick first pass checks - Annotation checking - quantitatively evaluate how much a given genomic annotation source improves empirical prior estimation by prior ratios
- Convenience and QC functions -
count_barcodes()
to automatically go from FASTQs to raw MPRA counts.get_sample_correlations()
to check the consistency of your data.
The manuscript associated with this work was published in PLOS Computational Biology on July 21, 2020. If you use this software, please cite this source:
Ghazi, A. R., Kong, X., Chen, E. S., Edelstein, L. C., & Shaw, C. A. (2020). Bayesian modelling of high-throughput sequencing assays with malacoda. PLOS Computational Biology, 16(7), 1–18. https://doi.org/10.1371/journal.pcbi.1007504
The HTML versions of the supplements for the manuscript are available in this repository’s man/supplements/ directory.
This is a basic example which shows you how to fit the simplest form of the model. The input needs to provide the MPRA counts for each barcode of each allele of each variant with a column for each sequencing sample. This example only shows 8 rows, but realistic datasets will have thousands.
variant_id | allele | barcode | DNA1 | DNA2 | RNA1 | RNA2 | RNA3 | RNA4 |
---|---|---|---|---|---|---|---|---|
1_205247315_2-3 | ref | CGATATAGTTAC | 19 | 19 | 3 | 0 | 39 | 4 |
1_205247315_2-3 | ref | ACGCATTCACCT | 56 | 51 | 38 | 19 | 38 | 46 |
1_205247315_2-3 | alt | TAGTATCAACTA | 76 | 50 | 14 | 44 | 11 | 20 |
1_205247315_2-3 | alt | GTCGTCAGCGAT | 106 | 69 | 80 | 32 | 163 | 57 |
10_101274365_1-3 | ref | ATAAGCCGTCCG | 54 | 37 | 43 | 2 | 12 | 24 |
10_101274365_1-3 | ref | CCGACGGCGTTG | 61 | 35 | 18 | 7 | 19 | 10 |
10_101274365_1-3 | alt | GATACGAACACA | 42 | 34 | 29 | 48 | 194 | 129 |
10_101274365_1-3 | alt | TAACGATCGGTC | 33 | 25 | 14 | 9 | 11 | 409 |
The specific format requirements for the input can be found on the help
page for ?fit_mpra_model
. The code below will fit the basic malacoda
model using a dataset that comes with the package:
library(malacoda)
marg_prior = fit_marg_prior(umpra_example)
fit_mpra_model(mpra_data = umpra_example,
out_dir = '/path/to/outputs/',
priors = marg_prior,
n_cores = getOption('mc.cores', 2L),
tot_samp = 1000,
n_chains = 3,
vb_pass = TRUE,
save_nonfunctional = TRUE)
This will fit the model to each input in the assay (using some example
variants from Ulirsch et al.,
Cell, 2016 in the
built-in dataset umpra_example
) using a marginal prior, save the
outputs for each variant at the specified directory, and return a data
frame of summary statistics for each variant, including binary calls of
functional/non-functional, posterior means on activity levels &
transcription shift.
variant_id | ts_post_mean | ref_post_mean | alt_post_mean | is_functional | hdi_lower | hdi_upper |
---|---|---|---|---|---|---|
22_32880585_2-3 | 1.67 | -1.15 | 0.52 | TRUE | 0.95 | 2.32 |
15_66068074_1-3 | 0.73 | -0.83 | -0.10 | FALSE | -0.15 | 1.56 |
12_121167039_1-3 | 0.56 | -0.69 | -0.13 | FALSE | -0.25 | 1.26 |
16_68092850_2-3 | -0.44 | 0.18 | -0.26 | FALSE | -1.15 | 0.40 |
8_42424746_1-2 | 0.40 | -0.33 | 0.07 | FALSE | -0.12 | 0.87 |
19_33204544_1-3 | -0.27 | -0.15 | -0.42 | FALSE | -0.91 | 0.44 |
More sophisticated analyses that use annotations to create informative
priors for higher sensitivity are described in the MPRA Analysis
vignette accessible with vignette('mpra_vignette', package = 'malacoda')
. Other features like annotation checking and traditional
NHST analysis are also explained in the vignette. Each function provided
by the package have extensive help documentation that should help
elucidate what they do and how to use them e.g. ?fit_mpra_model
.
malacoda
is intended for use on Mac and Linux. Windows may work aside
from parallelization, however we do not intend to support Windows.
It’s best to have the most up-to-date version of R (3.6.0 as of May 2 2019).
The first step is to install rstan
and Rcpp
. The following command
will usually suffice to do this, if not you can find more in-depth
installation instructions on the rstan
documentation.
You should have root access.
# Sys.setenv(MAKEFLAGS = "-j4") # This compilation flag can help speed up
# installation on multi-threaded machines. This command uses 4 threads.
install.packages(c('Rcpp', 'rstan'), dependencies = TRUE, type = 'source')
Once malacoda is accepted up on CRAN it will be installable with:
# install.packages('malacoda')
You can install the development version of malacoda from github with:
# install.packages("devtools")
devtools::install_github("andrewGhazi/malacoda")
This should install the dependencies (which are mostly tidyverse packages), compile the malacoda Stan models, and install the package.
count_barcodes()
requires the
FASTX-Toolkit
(installation instructions provided at that link) and sed
which comes
with most Unix-like operating systems. It can also optionally take
advantage of the FreeBarcodes
package for barcode error-correction.
In addition to the summary statistics table output above, the sampler outputs for each variant are saved in the user-defined output directory. These are stanfit objects, thus they can be visualized using all the tools provided in packages like bayesplot.
malacoda
also provides several plotting function of its own.
mpra_tile_plot
can help to visualize the raw MPRA counts (and some
summary metrics):
And malacoda_intervals()
can help to visualize the posterior from the
malacoda model fit:
Other visualization functions are available for annotation checking. These help visualize the improvement induced by the use of informative conditional priors.
- A full analysis walk-through is available through the mpra analysis
vignette, accessible with
vignette("mpra_vignette")
(You will need to have built the vignette during package installation e.g.devtools::install_github('andrewGhazi/malacoda', build_vignettes = TRUE)
) - Each function in the package has an associated help page that can be
accessed from R, for example
?fit_mpra_model
. These are generally quite detailed. - You can see a list of the data objects included with the package
using the command
data(package = 'malacoda')
. You can then access the description of each data object as you would the help documentation e.g.?umpra_example
. - A pdf manual containing all documentation for all exported functions
can be built during package installation by setting
build_manual = TRUE
as an argument toinstall_github
. - Don’t hesitate to open an issue on this Github repository or send me a direct message if things don’t work how you expect. Even if the package is working as intended, if users are having problems I want to fix that.
Fleshed out CRISPR model support✓- Informative prior estimation for CRISPR dropout screen models
Categorical conditional priors✓- additional Quality Control functionality
- monoallelic variant analysis
model code✓- model interface
- multi-tissue / multi-allelic variant analysis
model code✓- model interface
- tissue level informative prior estimation
- single-step hierarchical model with efficient, within-chain
parallelization via
reduce_sum
The massively parallel functionalization field is still rapidly changing so there’s a lot of experimental structures out there that are not accounted for in this package yet. If you’re interested in these methods, feel free to contact me with some example data and I’ll be happy to take a look to see if I can adapt the existing models!
Please contact me through Github DM or my BCM email address if you use the package or have feature requests / comments.