GitHub - mwilkers/unceqr: Somation mutation prediction by RNA and DNA integration

UNCeqR

####Author: Matthew D. Wilkerson Copyright: Matthew D. Wilkerson, 2011-2015. mdwilkerson@outlook.com

##Introduction Precision cancer medicine depends on accurate somatic mutation detection to inform patient treatment, which is complicated by low tumor purity. In this setting, standard techniques using DNA whole exome sequencing yield low sensitivity. UNCeqR is a first-of-its-kind method that integrates patient-matched DNA sequencing and RNA sequencing to detect somatic mutations. In a published study, UNCeqR has been shown to substantially boosts sensitivity for low purity tumors as well as advancing state-of-the-art mutation profiles with abundant novel driver and therapeutically-targeted mutation discoveries.

UNCeqR is versatile. There are different modes of operation, depending on input and options, including:

#####Configuring and tuning UNCeqR for different cohorts or sequencing types is important to achieve high performance. This is set through the set of options and library files described below. It is hard-coded for human mutation calling, but this can easily be adjusted for other organisms. For configuration and tuning questions, collaborations or consulting, email mdwilkerson@outlook.com.

##Citation Please cite this reference in publications or presentations using UNCeqR or UNCeqR results:

[1]
Wilkerson MD, Cabanski CR, Sun W, Hoadley KA, Walter V, Mose LE, Troester MA, Hammerman PS, Parker JS, Perou CM, Hayes DN. (2014) Integrated RNA and DNA sequencing improves mutation detection in low purity tumors Nucleic Acids Research. first published online June 26, 2014 Read the article.

##License MIT License. See LICENSE file.

##Building

###Prequisites:

Linux operating system

UNCeqR has been tested on RHEL6, Ubuntu.

Linux utilities

grep, cat, sed, date, md5sum

R

Available at www.r-project.org. Minimum version 2.15.

R libraries: VGAM, multicore.
Perl

Version 5.10.1 tested

Perl libraries Getopt::Long, Math::CDF, Math::Cephes Available via cpan.org

###Installation and test steps

cd samtools0.1.18_unceqr
make
cd examples
make
../samtools index ex1.bam
cd ../..
mv UNCeqR_conf.pl.template UNCeqR_conf.pl
Edit UNCeqR_conf.pl variables $localPath to be the full path to the unceqr directory. $st to be the full path to samtools.1.18/samtools $R to be the full path to the R binary $Rscript to be the full path to the Rscript $R_LIBS_USER to the full path to the R library $perlLib to full path of perl library
test sh example.cmd

###Example usage

sh example.cmd
This example calls the same toy data for tumor DNA-WES and tumor RNA-seq with reference being the germline.  This is only for testing for successful output generation and these are not recommended settings for somatic mutation detection.  See reference [1] for somatic mutation detection settings.

The VCF file show include thes lines:
seq1	548	.	C	A	.	.	.

seq2	505	.	A	G	.	.	.

seq2	784	.	C	CAATT	.	.	.

seq2	789	.	G	A	.	.	.

#Note this example has the dense option set to N, which is much slower than dense=Y.  The Dense=Y is hard coded for human chromosomes, which is currently not available for this example or other genome assemblies that do not start with chr or a digit between 1 and 22.

###Overview

UNCeqR_main.pl
UNCeqR_proc.pl
- output: unceqr_proc.all.csv. This is the main output file for unceqr and contains number of applied statistical tests in final lines: dtn, rtn, mtn for dna, rna and meta. Comment lines have # prefix.
UNCeqr_fit.r
- output: unceqr_proc.all.csv.fit. contains model parameters.
UNCeqR_2vcf.pl
- This a very minimal VCF. Users are recommended to use the unceqr_proc.all.csv output.
- output: unceqr_test.csv.vcf
UNCeqR_conf.pl
- this file contains the configuration parameters to run UNCeqR.

Options

Parameter	Description	Required (R), Optional (O)	Default
Options realted to input data
tumorRNA	String, path to RNA BAM file	O	blank
tumorRNAchr	String, tumor RNA has ‘chr’ prefix to chromosome names “Y” else “N”	Req if above defined	Y
tumorDNA	String, path to tumor DNA BAM file	O	blank
tumorDNAchr	String, tumor RNA has ‘chr’ prefix to chromosome names “Y” else “N”	Req if above defined	Y
normalDNA	String, pth to normal (germline) DNA BAM file	O	blank
normalDNAchr	String, normalDNA has ‘chr’ prefix to chromosome names “Y” else “N”	Req if above defined	Y
fastaWchr	String, full path to fasta file of ‘chr’ prefix chromosome sequences	Req
fastaWoChr	String, full path to fasta file of chromosome sequences, with ‘chr’ prefix	Req
regionsToQuery	String, full path to BED file format of genomic reqions to query; no chr in these regions; must be sorted by chr dictionary order and left position numeric order. Zero-based file, 0-100 is the first hundred bases. no header line. In dense=Y, overlapping sites are reported once, in dense=N, each region is independent and sites can be reported multiple times.	Req
resultsDir	String, full path to directory with flanking “/” for output to be written to; directory will be created if non-existent	R
dense	String, Query mode:“Y” – many close regions in bed file, “N” – few regions in bed file This affects the speed of UNCeqR.	Req
snpFile	String, full path to text file that has predefined SNP or artifactual positions to be considered as germline allele. Format is comma delimited with these fields: chromosome, position, alleles to be considered as germline [ATCGinsdel]. This has no header line and must be on same genome assembly as BAMs, 1-based positions, and must be sorted same as regionsToQuery file. See example snp file for reference	O
mainProc	Number of separate UNCeqR processes per round	[not used]
DnaOnly	For running with only one tumor sequencing, the DNA arguments are used, not RNA, and dnaOnly should be set to 1.	Opt	0
Alignment and Base Filtering: these options affect segregate these bases and
trimEnd	integer, number of characters to trim from every read; these do not contribute to counts	Opt	0
minTMapQ	Integer, minimum mapping quality value of read alignment, to be counted	O	0
minTBaseQ	Integer, minimum base quality value	O	0
flagFilter	Character: 0 – no filtering, 1 – only retain proper pairs, 2 – strike secondary, 3 – strike duplicates, 4 – strike duplicates or secondary, 5 – strike secondary, duplicate and fail. *qcFail and fusion reads always struck	O	NA
maxIH	Integer, maximum hit index of alignment to retain; IH tag.	Opt	1
maxNM	Integer, maximum number of mismatches and indel events from tag field to retain (MD tag plus indel count if not NM tag)	O	2
maxDepth	Integer, maximum read depth to consider in a genomic position	R	10000
indelShadow	Integer, number of positions to be queried for insertions or deletions in germline alignment. Positions (indelShadow – 1) / 2 to the right and to the left of the current position are evaluated. Needs to be odd. Setting at value == 0 will turn this off.	O
normIndelFrac	Numeric, minimum fraction of germline reads having an insertion or deletion at a position that results in striking all positions in the indelShadow region. Setting at value > 1 will turn this option off.
maxHomopolymer	Integer, maximum homopolymer allowed to retain position; Homopolymer is defined the largest number of identical reference nucleotides to the left or to the right, not including the current position. Runs larger than maxHomopolymer are not retained.	O	10000
normalLoose	String, “Y” - except all alignments from normalDNA “N” - apply filtering to normal bam also.	O	N
trainNum	Integer, number of usable genomic positions to use for betabinomial parameter estimation, these are taken the first usable positions according the sorted BED regions input	R	10000
Genomic position related: these options affect what is printed in output and supercede the ‘usable data’ options below
preMinTumorCov	Integer, number of usable bases in RNA or DNA, affects speed	O	5
preMinNormCov	Integer, number of usable bases in normal Req, affects speed	O	5
minInNormFrac	Numeric, minimum percent for an allele to count in the normal genotype estimate	R	0.02
regionRestrict	String,Y – only print bases from within BED file, not overlapping alignments; N – print all	O	Y
verboseOut	Numeric, 1, print all bases whether significant or not, to unceqr_proc.csv and unceqr_test.csv 0, only print positions with a p-value < 0.05 in any model to unceqr_test.csv.	O	0
Usable Data Filtering: these options define which positions will be tested for variant presence
minNormCnt	Integer, minimum depth in normal	R	5
maxNormPlural	Integer, maximum number of base variants allowed in normal	R	2
minDnaTumorCnt	Integer, minimum depth in tumor DNA	R	5
maxDnaBias	Float, minimum acceptable to test Fisher's exact p-value from test of strand bias in tumor dna versus germline reads; 0 disables	R	0.1
maxDnaStruckProp	Decimal, maximum acceptable percentage of reads struck at a position; ex: 90% would be 10% good data.	R	0.75
minRnaTumorCnt	Integer, minimum depth in tumor rna	R	5
maxRnaBias	Float, minimum acceptable to test Fisher's exact p-value from test of strand bias in tumor rna versus germline reads; 0 disables	R	0.1
maxRnaStruckProp	Decimal, maximum acceptable percentage of reads struck at a position; ex: 90% would be 10% good data. Gaps (introns) not counted towards struck	R	0.75
maxDnaTumorPluralProp	Decimal, minimum acceptable to test value of major variant allele / all variant alleles;	R	0.75
maxRnaTumorPluralProp	Decimal, minimum acceptable to test value of major variant allele / all variant alleles;	Y	0.75
medStart	Minimum acceptable median distance to end of alignment; 0 disables;	Y	1

##Description of output This table describes the columns of unceqr_proc.all.csv.

Variable	Description	Example
chr	chromosome label	1
pos	genomic position (1-based)	1233
universal	allele from reference assembly sequence	A
normCnt	number of admissible reads from germline bam	10
normPlural	number of distinct alleles in germline sequencing, that exceeded minInNormFrac	1
normChars	germline allele set for position	AG
normAllChars	distribution of all normal alleles
normTrimCount	count of germline allele set minus minNorm alleles
Variables repeated for DNA, RNA and POOL
*_refCnt	hq read count of tumor alleles matching germline allele set
*_refCntNoIndel
*_nonRefCnt	hq read count of tumor alleles not matching germline allele set
*_nonRefCntNoIndel
*_maxVal	hq read count of major alternate allele (majNonRef)
*_majNonRef	most common allele in tumor that is not in germline allele set
*_str	distribution of allele counts in form of allele[count]	A[100]T[1]
*_refPlus	read count of alleles in germline allele set aligned in plus orientation
*_refNeg	read count of alleles in germline allele set aligned in negative orientation
*_varPlus	read count of alleles not in germline allele set aligned in plus orientation
*_refNeg	read count of alleles not in germline allele set aligned in negative orientation
*_struckProp	proportion of reads struck at this position, inadmissible
*_medDist	median distance among reads of the major variant allele to their closest aligned end.
isSNP	indicator (1/0) to indicate if position was contained in snp file
featureId	feature id (line number) from regionsToQuery file
indelShadow	indicator if position was shadowed by indel in germline
DNA_hq	flag for high quality
RNA_hq	flag for high quality
p.value.dna	p-value from tumor DNA model
p.value.rna	p-value from tumor RNA model
p.value.meta	p-value from meta model

##History

v0.0.0
- 4/2011 – interrogation mode developed
- 11/2011 – integrated detection mode developed
... development ...
v0.1.12 - moved p-value computation from R to Perl, for runtime improvement.
v0.1.14 – manuscript version
v0.1.14.2 – updated documentation, example data
v0.2 - at github, new license.
v0.2.1 - added separate filtering for maxRnaBias level; prior used maxDnaBias for both dna and rna.

##Quirks

depending on the regionsToQuery (genomic locations queried for mutations) and other settings there can be more than one row per genomic locations, and users may want to account for, or collapse this in their analyses.
For running with only one tumor sequencing, the DNA arguments are used, not RNA, and dnaOnly should be set to 1.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
misc		misc
samtools-0.1.18_unceqr		samtools-0.1.18_unceqr
LICENSE		LICENSE
README.md		README.md
UNCeqR_2vcf.pl		UNCeqR_2vcf.pl
UNCeqR_conf.pl.template		UNCeqR_conf.pl.template
UNCeqR_fit.r		UNCeqR_fit.r
UNCeqR_main_simple.pl		UNCeqR_main_simple.pl
UNCeqR_proc.pl		UNCeqR_proc.pl
example.cmd		example.cmd
example_reg.txt		example_reg.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Options

About

Releases

Packages

Languages

License

mwilkers/unceqr

Folders and files

Latest commit

History

Repository files navigation

Options

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages