R package to facilitate microbial fitness inference (using Generalized Linear Models) from barcode counts based on the Negative Binomial (NB) distribution.
Designed for ultra-high throughput screening with (more than) tens of millions of observations. This means:
- Estimation of batch effects based on known experimental metadata. This is used in final estimation of effect size, and also in the estimation of the NB dispersion parameter to guard against overly conservative p-values
- Computational short-cuts which allow very large datasets to be analyzed without large glm objects causing memory explosions
- Support for parallelization across cores or GridEngine cluster jobs
- Support for checkpointing in case of intermediate failure
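To make the modelling idea concrete, here is a minimal sketch of an NB GLM with a batch term, using glm.nb from MASS (one of the listed dependencies) on simulated counts. It is not the package's internal implementation: the data, the column names (compound, plate, well, count) and the effect sizes are all made up for illustration.

```r
library(MASS)  # provides glm.nb()

set.seed(1)

# Simulated, purely illustrative data: 2 compounds x 2 plates x 50 wells
toy <- expand.grid(
  compound = c("DMSO", "drug"),
  plate    = c("plate1", "plate2"),
  well     = seq_len(50)
)
# Use the negative control as the reference level
toy$compound <- relevel(factor(toy$compound), ref = "DMSO")

# Log-scale effects used only to simulate the counts
log_mu <- log(200) +
  ifelse(toy$compound == "drug", -1, 0) +  # treatment lowers abundance
  ifelse(toy$plate == "plate2", 0.5, 0)    # plate-level batch effect

# In rnbinom(), larger 'size' means less overdispersion
toy$count <- rnbinom(nrow(toy), mu = exp(log_mu), size = 5)

# NB GLM with the batch term fitted alongside the treatment term
fit <- glm.nb(count ~ compound + plate, data = toy)

coef(summary(fit))["compounddrug", ]  # log fold change, standard error, p-value
fit$theta                             # estimated NB size parameter
```

Leaving the plate term out of the formula would push the batch variation into the dispersion estimate, which is the situation the batch-effect step is meant to avoid.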
Not yet on CRAN or Bioconductor, so you need to install using devtools.
In an R session:
devtools::install_github('eachanjohnson/concensusGLM')
This will install the library in your default path. Check .libPaths() to see what that is.
The script contained in exec/concensusGLM.R is a command-line entry point intended to cover a range of anticipated use cases.
If you use the --sge option, you need to provide the path on your computer to the exec/sge-template.sh template submission script, which you may need to edit for your particular setup.
Usage:
concensusGLM.R --help
concensusGLM.R --data <data-file> --neg <negative-control> [--meta <experiment-metadata> --pos <positive-control> --no-count-threshold --checkpoint --parallel --sge <template-script>]
Options:
-h --help Show this message and exit
--data <data-file> Input count table CSV, one line per lane per strain per well
--meta <experiment-metadata> CSV of handling records
--neg <negative-control> Name of negative control compound, e.g. DMSO or none or untreated
--pos <positive-control> Name of positive control compound, e.g. rifampin or BRD-K01507359-001-19-5 or BRD-K01507359
--checkpoint Save checkpoints to allow restart in case of failure
--parallel Analyze strains in parallel (needs multiple cores)
--sge <template-script> Use GridEngine cluster with template to analyze strains in parallel.
--no-count-threshold Don't discard plates or strains based on low counts
This package provides the objects:
- concensusDataSet, a container for ConcensusGLM analysis
- concensusWorkflow, a wrapper to the workflow object from workflows, allowing scale-up to multi-core on a laptop or a GridEngine style cluster
Both concensusDataSet and concensusWorkflow objects have methods to carry out ConcensusGLM analysis. Using ?methodName in the R console will give documentation.
getRoughDispersions
getBatchEffects
getFinalDispersions
getFinalModel
write_concensusDataSet
Application of these methods in an order other than listed above is not fully supported and may not work.
Applying the methods to a concensusDataSet will execute them immediately. Applying them to a concensusWorkflow will add them to a list of commands, which can be executed later by calling execute on the concensusWorkflow object.
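The difference between immediate and deferred execution can be pictured with a small base-R sketch. Nothing here is the package's code: toy_workflow, add_step and run_steps are hypothetical names invented for this example, standing in for the idea of queuing commands and running them later.

```r
# Hypothetical deferred-execution container (not the concensusGLM classes)
toy_workflow <- function(data) {
  structure(list(data = data, steps = list()), class = "toy_workflow")
}

# Queue a function instead of calling it straight away
add_step <- function(wf, f) {
  wf$steps <- c(wf$steps, list(f))
  wf
}

# Apply the queued functions to the data, in the order they were added
run_steps <- function(wf) {
  Reduce(function(x, f) f(x), wf$steps, wf$data)
}

wf <- toy_workflow(c(1, 10, 100))
wf <- add_step(wf, log10)
wf <- add_step(wf, function(x) x + 1)

length(wf$steps)         # two steps queued, nothing computed yet
result <- run_steps(wf)  # log10 is applied first, then the increment
result
```

Because the steps are only stored until run_steps is called, the same queue can be replayed on different chunks of data, which is what makes this pattern convenient for scaling out.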
You can also scatter a concensusWorkflow based on categorical columns in the input data set to allow chunking for embarrassingly parallel computation when execute is called. This can be done on the local machine using multiple cores, or using a GridEngine style cluster like SGE or UGE, in which case the submission script exec/sge-template.sh may need to be edited for your specific cluster.
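As a rough illustration of chunking on a categorical column, the sketch below splits a toy table by strain and processes the chunks on two local worker processes with the parallel package (a listed dependency). It does not call the package's scatter or execute; the table, its column names and summarise_chunk are assumptions made up for this example.

```r
library(parallel)

# Toy count table; column names are illustrative only
counts <- data.frame(
  strain = rep(c("strainA", "strainB", "strainC"), each = 4),
  count  = c(10, 12, 9, 11, 200, 180, 210, 190, 55, 60, 50, 58)
)

# "Scatter": one chunk per level of a categorical column
chunks <- split(counts, counts$strain)

# Hypothetical stand-in for the per-chunk analysis
summarise_chunk <- function(chunk) {
  data.frame(strain = chunk$strain[1], mean_count = mean(chunk$count))
}

# Process the chunks on a small local cluster, then gather the results
cl <- makeCluster(2)
results <- parLapply(cl, chunks, summarise_chunk)
stopCluster(cl)

gathered <- do.call(rbind, results)
gathered
```

Each chunk is independent of the others, so the same split could in principle be handed to cluster jobs instead of local workers.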
Eachan O. Johnson et al., Large-scale chemical-genetics yields new M. tuberculosis inhibitor classes, Nature, July 2019, doi: 10.1038/s41586-019-1315-z
Dependencies:
- R
- broom
- dplyr
- errR
- future
- magrittr
- MASS
- parallel
- readr for fast loading of large CSV files
- workflows
Similar approaches include but are not limited to: