This repository contains code for statistical analysis of recurrent mutations in whole-genome sequencing data.
This project includes code to perform the recurrent mutation analysis described in Melton et al., Nature Genetics 2015. The analysis consists of two steps: (1) build a sample- and genomic-location-specific mutation probability model, and (2) use the Poisson binomial distribution to compute, for each mutated site, the probability of observing k or more mutated samples. The Poisson binomial calculations are made possible by the poibin R package.
python Main.py --M MutationFileListFile --C CovariateFileListFile --CC CombinedCovariateFile --LR LRModelName --P parallel --MF MergedMutationFilename --G grid --L logFilePath --RS regionSize
Option | Description |
---|---|
MutationFileListFile | This should be a tab delimited file with patient id, mutation file location, and additional info (see below). |
CovariateFileListFile | This should be a tab delimited file with covariate file name and file location. |
CombinedCovariateFile | This should be a filename for the combined covariates. It can be generated as an intermediate but the name should be specified. |
LRModelName | The name of the logistic regression model. |
parallel | The number of jobs to run in parallel. |
MergedMutationFilename | The name of the merged mutation file that is generated as an intermediate. |
grid | 'T' to use grid engine. This option is not enabled yet. |
logFilePath | The path to a log file (only used if the grid option is 'T'). |
regionSize | Optional region size. Right now '1' is the only acceptable input. |
The MutationFileListFile should contain the following columns: pid, MutationFile, MutationWigFile, MutationCovariateFile, CoverageWigFile, WGCovariateFile, MutationCovariateSummaryFile, ModelData
'pid' is the patient id. MutationFile is an input file; the others are intermediate files generated during the application run.
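
As an illustration of the layout described above, the following is a minimal sketch of reading such a file list. It assumes the file has a header row with exactly the column names listed; that is an assumption of this example, not something this README confirms. The function name `read_mutation_file_list` and all paths are hypothetical.

```python
import csv
import io

# Column names as listed above. A header row is an ASSUMPTION of this sketch.
COLUMNS = [
    "pid", "MutationFile", "MutationWigFile", "MutationCovariateFile",
    "CoverageWigFile", "WGCovariateFile", "MutationCovariateSummaryFile",
    "ModelData",
]

def read_mutation_file_list(handle):
    """Return one dict per patient from a tab-delimited file list."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)

# Hypothetical example content; the paths are placeholders, not real files.
example = "\t".join(COLUMNS) + "\n" + "\t".join([
    "P001", "p001.mut.txt", "p001.mut.wig", "p001.mut.cov",
    "p001.coverage.wig", "p001.wg.cov", "p001.summary.txt", "p001.model.txt",
]) + "\n"

rows = read_mutation_file_list(io.StringIO(example))
print(rows[0]["pid"], rows[0]["MutationFile"])
```

Validating the columns up front makes a malformed list fail early, before any intermediate files are written.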
The covariates are base pair (AT or CG), replication timing, and coding/noncoding exon/intron annotations from GENCODE.
Same as for mutations, but across all bases with high coverage in the original WGS data.
Fit the logistic regression model to all data from all samples.
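
To make the model-fitting step concrete, here is a minimal sketch of logistic regression fitted by batch gradient descent on toy data. It is not the project's fitting code, which this README does not show. The single binary covariate, the toy outcomes, and the names `sigmoid` and `fit_logistic` are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=5000):
    """Fit intercept and weights by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi  # gradient of the log-loss w.r.t. the linear predictor
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Toy data (hypothetical): one binary covariate per base, 1 = mutated.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

weights = fit_logistic(X, y)
p_low = sigmoid(weights[0])                # fitted P(mutated | covariate = 0)
p_high = sigmoid(weights[0] + weights[1])  # fitted P(mutated | covariate = 1)
print(round(p_low, 3), round(p_high, 3))
```

With a single binary covariate, the fitted probabilities should approach the observed mutation fraction in each group, which gives a quick sanity check on the fit.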
Count the number of times each genomic position is mutated across samples.
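
A minimal sketch of this counting step, assuming mutations are available per sample as (chromosome, position) pairs. The data structure, the sample ids, and the function name `count_recurrence` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical per-sample mutation calls as (chromosome, position) pairs.
mutations_by_sample = {
    "P001": [("chr1", 1000), ("chr5", 12345)],
    "P002": [("chr5", 12345), ("chr7", 500)],
    "P003": [("chr5", 12345), ("chr1", 1000)],
}

def count_recurrence(mutations_by_sample):
    """Map each site to the number of samples in which it is mutated."""
    counts = Counter()
    for sites in mutations_by_sample.values():
        counts.update(set(sites))  # set(): each sample counts once per site
    return counts

site_counts = count_recurrence(mutations_by_sample)
for site, n in site_counts.most_common():
    print(site, n)
```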
Use the sample-specific probability model from above to compute the site-specific mutation probabilities for each sample.
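
As a sketch of how a fitted logistic model turns covariates into a probability, the example below evaluates the logistic function for one site in several samples. The coefficient values, the covariate encoding, and the use of a per-sample covariate to make the model sample specific are assumptions made for illustration, not the project's actual parameterization.

```python
import math

def mutation_probability(coefficients, covariates):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_j * x_j))))."""
    z = coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], covariates))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: intercept, base pair (1 = CG), per-sample covariate.
coefficients = [-10.0, 0.5, 1.2]

# Hypothetical covariates for ONE site in three samples. The base pair is
# shared across samples; the last value varies by sample.
covariates_by_sample = {
    "P001": [1.0, 0.2],
    "P002": [1.0, 1.5],
    "P003": [1.0, 0.8],
}

site_probs = {
    pid: mutation_probability(coefficients, x)
    for pid, x in covariates_by_sample.items()
}
for pid, p in site_probs.items():
    print(pid, p)
```

The resulting per-sample probabilities form the vector that is passed to the Poisson binomial calculation.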
Using a vector of probabilities (one for each sample) and the poibin package, compute the probability of seeing at least the observed number of mutations at each given site.
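
The project relies on the poibin R package for this calculation. Purely to illustrate what is being computed, the sketch below evaluates the Poisson binomial upper tail P(X >= k) with a simple dynamic-programming recursion in Python. This is a substitute technique for illustration, not the poibin implementation, and the function names and example probabilities are hypothetical.

```python
def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli(p_i) trials."""
    pmf = [1.0]  # with zero trials, P(0 successes) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this sample is not mutated
            nxt[k + 1] += mass * p      # this sample is mutated
        pmf = nxt
    return pmf

def tail_probability(probs, k):
    """P(X >= k) for X following a Poisson binomial with the given probabilities."""
    return sum(poisson_binomial_pmf(probs)[k:])

# Hypothetical per-sample mutation probabilities at one site.
probs = [0.1, 0.2, 0.3]
observed = 2  # number of samples mutated at this site

p_value = tail_probability(probs, observed)
print(p_value)
```

The recursion costs O(n^2) operations for n samples, which is fine for a sketch; summing the upper tail directly also avoids the precision loss of computing one minus a lower-tail sum when the result is very small.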