This repository contains the source code for SigMa (Signature Markov model) and related experiments. SigMa is a probabilistic model of the sequential dependencies among mutation signatures.
Below, we provide an overview of the SigMa model from the corresponding paper. "The input data consists of (A) a set of predefined signatures that form an emission matrix E (here, for simplicity, represented over six mutation types), and (B) a sequence of mutation categories from a single sample and a distance threshold separating sky and cloud mutation segments. (C) The SigMa model has two components: (top) a multinomial mixture model (MMM) for isolated sky mutations and (bottom) an extension of a Hidden Markov Model (HMM) capturing sequential dependencies between close-by cloud mutations; all model parameters are learned from the input data in an unsupervised manner. (D) SigMa finds the most likely sequence of signatures that explains the observed mutations in sky and clouds."
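To make the two ideas in the overview concrete, here is a minimal sketch in plain Python. It is not taken from the SigMa source code. The function names `split_sky_and_clouds` and `mmm_most_likely_signatures` are hypothetical, and the sketch assumes that mutation positions are sorted along a single chromosome and that two consecutive mutations fall in the same segment when their distance is at most the threshold. The second function shows only the mixture-model part: because sky mutations are treated as independent, the most likely signature for each is the one maximizing mixture weight times emission probability. The HMM part for clouds is not shown.

```python
from typing import List, Sequence, Tuple


def split_sky_and_clouds(
    positions: Sequence[int], threshold: int
) -> Tuple[List[int], List[List[int]]]:
    """Split mutation indices into sky (isolated) and cloud (clustered) groups.

    Assumption (not from the SigMa code): positions are sorted, and
    consecutive mutations at distance <= threshold share a segment.
    """
    segments: List[List[int]] = []
    for i, pos in enumerate(positions):
        if segments and pos - positions[i - 1] <= threshold:
            # Close enough to the previous mutation: extend the segment.
            segments[-1].append(i)
        else:
            # Too far away (or first mutation): start a new segment.
            segments.append([i])
    sky = [seg[0] for seg in segments if len(seg) == 1]
    clouds = [seg for seg in segments if len(seg) > 1]
    return sky, clouds


def mmm_most_likely_signatures(
    categories: Sequence[int],
    emissions: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[int]:
    """For each mutation category, pick the signature k maximizing
    weights[k] * emissions[k][category] (independent mutations)."""
    return [
        max(range(len(weights)), key=lambda k: weights[k] * emissions[k][c])
        for c in categories
    ]


# Toy data: six mutations, two signatures over six mutation types.
positions = [100, 150, 180, 5000, 9000, 9020]
sky, clouds = split_sky_and_clouds(positions, threshold=100)

emissions = [
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],  # hypothetical signature 0
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.5],  # hypothetical signature 1
]
assignments = mmm_most_likely_signatures([0, 5], emissions, [0.5, 0.5])
```

With this toy input, the mutation at position 5000 is the only sky mutation, and the remaining mutations form two clouds.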
SigMa is written in Python 3. We recommend using Conda to manage dependencies, which you can do directly using the provided environment.yml file:
conda env create -f environment.yml
source activate sigma-env
On Windows, replace the last command with:
activate sigma-env
We use Snakemake to manage the workflow of running SigMa on hundreds of tumor samples.
First, download and preprocess the ICGC breast cancer whole-genomes and COSMIC mutation signatures. To do so, run:
cd data && snakemake all
Second, run SigMa and a multinomial mixture model (MMM) on each sample, and perform leave-one-out cross-validation (LOOCV):
snakemake all
This will create an output/ directory with two subdirectories: models/ and loocv/. models/ contains SigMa trained on each sample. loocv/ contains the results of LOOCV with SigMa using different cloud thresholds.
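For readers unfamiliar with LOOCV, the following generic sketch shows how leave-one-out splits are formed. It is an illustration only, not the workflow's implementation: the helper name `leave_one_out` is hypothetical, and which unit SigMa actually holds out in each fold is not stated in this README, so consult the Snakefile for the real procedure.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def leave_one_out(items: Sequence[T]) -> Iterator[Tuple[List[T], T]]:
    """Yield (training_items, held_out_item) pairs, one per item.

    Each item is held out exactly once; the model would be trained on
    the remaining items and evaluated on the held-out one.
    """
    for i, held_out in enumerate(items):
        training = [x for j, x in enumerate(items) if j != i]
        yield training, held_out


# Toy example with three placeholder units.
folds = list(leave_one_out(["a", "b", "c"]))
```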
To run the entire SigMa workflow on different mutation signatures or data, see the Snakefile for configuration options.
To train SigMa or MMM on individual mutation sequences, use the src/train_and_predict.py script. To get a list of command-line arguments, run:
python src/train_and_predict.py -h
Please report bugs and feature requests in the Issues tab of this GitHub repository.
For further questions, please email Max Leiserson and Itay Sason directly.
Xiaoqing Huang*, Itay Sason*, Damian Wojtowicz*, Yoo-Ah Kim, Mark Leiserson^, Teresa M Przytycka^, Roded Sharan^. Hidden Markov Models Lead to Higher Resolution Maps of Mutation Signature Activity in Cancer. Genome Medicine (2019) doi: 10.1186/s13073-019-0659-1.
* equal first author contribution ^ equal senior author contribution