TensorSignatures is a tensor factorisation framework for mutational signature analysis. In contrast to other methods, it deciphers mutational processes not only in terms of mutational spectra but also assesses their properties with respect to various genomic variables, allows the inclusion of different mutation types, and integrates a robust noise model to perform the inference.
TensorSignatures is a young project and breaking changes are to be expected. We keep a changelog in which possible breakage is clearly documented.
TensorSignatures makes use of the TensorFlow 1.5.x framework, which requires the user to install a separate package to enable GPU support, i.e. tensorflow-gpu instead of tensorflow. We highly recommend installing TensorSignatures into an environment with tensorflow-gpu, as the tensor computations greatly benefit from GPU acceleration.
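Before installing, it can help to confirm that TensorFlow actually sees a GPU in the target environment. The following is a minimal sketch, assuming a TensorFlow 1.x installation; gpu_available is a hypothetical helper name and not part of TensorSignatures.

```python
def gpu_available():
    """Report whether TensorFlow can see a GPU.

    Returns None when TensorFlow is not installed in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # tf.test.is_gpu_available() is part of the TensorFlow 1.x API.
    return bool(tf.test.is_gpu_available())


if __name__ == "__main__":
    print("GPU visible to TensorFlow:", gpu_available())
```

If the function reports False although a GPU is present, the environment most likely contains tensorflow rather than tensorflow-gpu.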
To obtain the most recent version of TensorSignatures, we recommend downloading the repository directly from GitHub and installing the package into a virtual environment. To get started, clone the repository by executing the following command in your terminal:
$ git clone https://github.com/gerstung-lab/tensorsignatures.git && cd tensorsignatures
Then, create a new virtual environment and install all dependencies. If you have access to a GPU with CUDA support, use requirements-gpu.txt instead of requirements.txt.
$ python -m venv env
$ source env/bin/activate
$ pip install --upgrade pip setuptools wheel && pip install -r requirements.txt
Finally, install TensorSignatures.
$ python setup.py install
To install tensorsignatures via PyPI, simply type
$ pip install tensorsignatures
into your shell.
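Whichever installation route was used, a quick check from Python shows whether the package is visible in the active environment. This is a small sketch using only the standard library; is_installed is a hypothetical helper name introduced for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found by Python."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "tensorsignatures" is the package name used in the pip command above.
    print("tensorsignatures installed:", is_installed("tensorsignatures"))
```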
To run TensorSignatures within a Docker environment, clone the repository
$ git clone https://github.com/gerstung-lab/tensorsignatures.git
$ cd tensorsignatures
and spin up the container using docker-compose
$ docker-compose up --build
This spins up a Jupyter server, including notebooks with tutorials, on http://localhost:8889.
- Free software: MIT license
- Documentation: https://tensorsignatures.readthedocs.io.
Running TensorSignatures involves three steps: (1) preparing the input data, i.e. creating the mutation count tensor and the mutation count matrix; (2) computing a trinucleotide normalisation to account for differences in the nucleotide composition of different genomic regions; and (3) running TensorSignatures.
We provide a Docker image that contains all R and Bioconductor dependencies needed to create the variant tensor and the matrix of other mutation types. To use it, pull the image from Docker Hub. Note that the image is approximately 5 GB in size.
$ docker pull sagar87/tensorsignatures-data:latest
To use the image, switch into the folder containing your VCF data. Then run the image using the following command, supplying the VCF files as well as the name of the hdf5 output file (which must be the last argument) as arguments.
$ docker run -v $PWD:/usr/src/app/mount sagar87/tensorsignatures-data <vcf1.vcf> <vcf2.vcf> ... <vcfn.vcf> <output.h5>
Then continue with Step 2.
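When a folder holds many VCF files, typing the argument list by hand is error-prone. The sketch below assembles the docker run command shown above for every .vcf file in a folder. It assumes that file names are resolved relative to the mounted folder, as the command above suggests; build_docker_command and the default output name are hypothetical and only serve the illustration.

```python
import shlex
from pathlib import Path


def build_docker_command(vcf_dir, output_name="output.h5"):
    """Assemble the docker run command for all VCF files in vcf_dir.

    The output file name is placed last, as the image expects.
    """
    vcfs = sorted(p.name for p in Path(vcf_dir).glob("*.vcf"))
    if not vcfs:
        raise ValueError("no .vcf files found in {}".format(vcf_dir))
    # Mount the folder at the path used in the command above.
    mount = "{}:/usr/src/app/mount".format(Path(vcf_dir).resolve())
    return ["docker", "run", "-v", mount,
            "sagar87/tensorsignatures-data", *vcfs, output_name]


if __name__ == "__main__":
    try:
        command = build_docker_command(Path.cwd())
        # Print a shell-safe version of the command for inspection.
        print(" ".join(shlex.quote(arg) for arg in command))
    except ValueError as err:
        print(err)
```

The returned list can be passed to subprocess.run once the printed command looks right.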
Make sure you have R 3.4.x (!) and the packages VariantAnnotation and rhdf5 installed. You can install them, if necessary, by executing
$ Rscript -e "source('https://bioconductor.org/biocLite.R'); biocLite('VariantAnnotation')"
and
$ Rscript -e "source('https://bioconductor.org/biocLite.R'); biocLite('rhdf5')"
from your command line.
To get started, download the following files and place them in the same directory:
- Constants.RData (contains GRanges objects that annotate transcription/replication orientation, nucleosomal and epigenetic states)
- mutations.R (all required functions to partition SNVs, MNVs and indels)
- processVcf.R (loads vcf files and creates the SNV count tensor as well as the MNV and indel count matrix; may need custom modification to run on your VCFs)
To obtain the SNV count tensor and the matrices containing other mutation types, execute processVcf.R and pass the VCF files you want to convert, as well as a name for the output hdf5 file, as command line arguments, e.g.
$ Rscript processVcf.R <vcf1.vcf> <vcf2.vcf> ... <vcfn.vcf> <output.h5>
In case of errors, please check whether you have correctly specified the paths in lines 6-8. Also, take a look at the readVcfSave function and adjust it if it fails.
TensorSignatures requires a trinucleotide normalisation constant to account for differences in the nucleotide composition of genomic states. To compute it, invoke the prep subroutine of TensorSignatures and pass the hdf5 file from Step 1 as well as the path for the output file as positional arguments to the programme.
$ tensorsignatures prep <output.h5> <tsdata.h5>
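To see why such a normalisation is needed, consider how trinucleotide frequencies can differ between regions. The sketch below is a toy illustration in plain Python and not the computation performed by tensorsignatures prep; the two sequences and the helper name trinucleotide_frequencies are made up for the example.

```python
from collections import Counter


def trinucleotide_frequencies(sequence):
    """Return the relative frequency of each overlapping trinucleotide."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()}


# Two made-up regions with different nucleotide composition.
region_a = "ACGACGACGTTT"
region_b = "TTTTTTACGTTT"

freq_a = trinucleotide_frequencies(region_a)
freq_b = trinucleotide_frequencies(region_b)

# The same context (here ACG) offers a different number of opportunities
# for a mutation in each region, which a normalisation has to correct for.
ratio = freq_a["ACG"] / freq_b["ACG"]
print("ACG frequency ratio (region A / region B):", ratio)
```

Without correcting for such differences, a region that simply contains more of a given context would appear to carry more mutations in that context.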
There are two ways to run TensorSignatures: either using the refit option, which fits the exposures of a set of pre-defined signatures extracted from the PCAWG cohort to your dataset, or via the train subroutine, which performs a de novo extraction of tensor signatures. Refitting tensor signatures is computationally fast but does not allow the discovery of new signatures, while extracting new signatures from scratch is computationally intensive (GPU required) and ideally requires larger numbers of samples. For most use cases with a small number of samples, we advise using the refit option:
$ tensorsignatures --verbose refit tsData.h5 refit.pkl -n
To run a de novo extraction, use
$ tensorsignatures --verbose train tsData.h5 denovo.pkl <rank> -k <size> -n -ep <epochs>
where rank specifies the decomposition rank, size controls the dispersion of the model, and epochs sets the number of epochs used to fit the model. TensorSignatures outputs the value of the objective function (log likelihood) that is minimised during training, as well as the change of the objective during an epoch interval (delta). When deciding on the number of epochs to train the model, ensure that it is sufficiently large for the objective function to converge, i.e. the delta value is close to, or fluctuates around, zero. For more information on how to run TensorSignatures in a practical setting, see the documentation. Running TensorSignatures yields a pickle dump, which can subsequently be inspected using the tensorsignatures package.
- Run tensorsignatures on your dataset using the TensorSignature class provided by the package or via the command line tool.
- Compute percentile-based bootstrap confidence intervals for inferred parameters.
- Basic plotting tools to visualize tensor signatures and inferred parameters.
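For readers unfamiliar with percentile-based bootstrap confidence intervals, the following generic sketch shows the idea on a list of numbers using only the standard library. It is not the package's implementation; percentile_bootstrap_ci, its parameters and the sample values are hypothetical.

```python
import random
from statistics import mean


def percentile_bootstrap_ci(values, statistic, n_resamples=1000,
                            alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the estimates.
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Made-up sample for illustration only.
sample = [2.1, 2.5, 1.9, 2.8, 2.3, 2.0, 2.6, 2.4]
low, high = percentile_bootstrap_ci(sample, mean)
print("95% CI for the mean:", low, high)
```

The interval is read directly from the sorted resampled estimates, which is what distinguishes the percentile method from bootstrap variants that rely on a normal approximation.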
- Harald Vöhringer and Moritz Gerstung