SplitFusion - a fast pipeline for detection of gene fusion based on fusion-supporting split alignment.
Gene fusion is a hallmark of cancer. Many gene fusions are effective therapeutic targets, such as BCR-ABL in chronic myeloid leukemia, EML4-ALK in lung cancer, and ROS1 fused to any of a number of partners in lung cancer. Accurate detection of gene fusions plays a pivotal role in precision medicine by matching the right drugs to the right patients.
Challenges in the diagnosis of gene fusions include poor sample quality, limited amounts of available clinical specimens, and complicated gene rearrangements. Anchored multiplex PCR (AMP) is a clinically proven technology designed specifically for robust detection of gene fusions across clinical samples of different types and varying quality, including RNA extracted from FFPE samples.
SplitFusion is a companion data pipeline for AMP that detects gene fusions based on split alignments, i.e. reads crossing fusion breakpoints, and can accurately infer whether the fusion partners of a given fusion candidate are in-frame or out-of-frame. SplitFusion also outputs example breakpoint-supporting sequences in FASTA format, allowing for further investigation.
Zheng Z, et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat Med. 2014
The analysis consists of ## computational steps:
- Retrieve all alignments that have supplementary alignments (the 'SA' tag in SAM format) from bam files generated by BWA MEM.
- Remove alignments with low mapping quality (default 20).
- ...
- ...
Lastly, SplitFusion outputs a summary table and breakpoint-spanning reads.
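The first two steps can be illustrated with a small shell function. This is only a sketch under stated assumptions, not SplitFusion's own code: the function name `filter_split_alignments` is hypothetical, and it assumes SAM text on standard input (for example from `samtools view -h`), with MAPQ in column 5 and optional tags from column 12 onward, as the SAM format defines.

```shell
# Hypothetical helper (not part of SplitFusion): keep SAM header lines plus
# alignments that carry an SA tag and have MAPQ >= the given threshold.
# Reads SAM text on stdin and writes SAM text on stdout.
filter_split_alignments() {
    sf_min_mq="${1:-20}"                      # default threshold taken from the step list
    awk -F'\t' -v min_mq="$sf_min_mq" '
        /^@/ { print; next }                  # pass header lines through unchanged
        $5 + 0 < min_mq { next }              # column 5 is MAPQ
        {
            for (i = 12; i <= NF; i++)        # optional tags start at column 12
                if ($i ~ /^SA:Z:/) { print; next }
        }
    '
}

# Example usage (assumes samtools is on PATH and example.bam exists):
#   samtools view -h example.bam | filter_split_alignments 20 > example.split.sam
```

Working on the text stream keeps the sketch independent of any particular samtools filtering options; the remaining steps of the pipeline are not shown here.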
Required software, libraries and data dependency
- R
- samtools
- bedtools
- snpEff
- java
git clone https://github.com/Zheng-NGS-Lab/SplitFusion.git
R CMD INSTALL SplitFusion
The dependency data (e.g. in 'data') should contain:
Filename | Content |
---|---|
panel-name.target.genes | Contains a list of targets (gene name, e.g. ALK, ROS1, etc.) |
fusion.gene-exon.filter.txt | Contains recurrent breakpoints that have been identified as data accumulates but are not of interest. |
fusion.gene-exon.txt | By default, SplitFusion only outputs fusions that are in-frame fusions of two different genes or whose number of breakpoint-supporting reads exceeds a predefined threshold. This file contains known breakpoints that do not belong to either of these two kinds but are clinically relevant, e.g. "MET_exon13---MET_exon15", an exon-skipping event that forms an important therapeutic target. Many exon-skipping/alternative splicing events are normal or of unknown clinical relevance and are thus not output by default. |
fusion.partners.txt | Contains a list of known fusion partners of targets. |
ENSEMBL.orientation.txt | The snpEff annotation lacks transcript orientation, so this file provides it in two columns: orientation (+ or -) and transcript ID (ENST*). |
The above files can be updated periodically as a backend database that supports automatic filtering and output of fusion candidates.
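The reporting rules described in the table can be sketched as a small filter. This is an illustration only, not the pipeline's implementation. The function name `select_fusions`, the three-column tab-separated input (`fusion`, `num_reads`, `frame`), the one-breakpoint-per-line layout of the two list files, and the precedence of the filter list over the keep list are all assumptions made for the example; the read threshold is passed in because its real value is not stated here.

```shell
# Hypothetical helper illustrating the default output rules.
# stdin: tab-separated rows "fusion<TAB>num_reads<TAB>frame",
#        where fusion looks like GENE1_exonN---GENE2_exonM.
# $1: breakpoints to suppress     (in the spirit of fusion.gene-exon.filter.txt)
# $2: breakpoints to always keep  (in the spirit of fusion.gene-exon.txt)
# $3: read-count threshold (assumed; supplied by the caller)
select_fusions() {
    awk -F'\t' -v min_reads="$3" '
        FILENAME == ARGV[1] { drop[$1] = 1; next }   # load the suppress list
        FILENAME == ARGV[2] { keep[$1] = 1; next }   # load the keep list
        $1 in drop { next }                          # assumed: suppression wins
        $1 in keep { print; next }                   # known relevant breakpoint
        {
            split($1, side, "---")
            g5 = side[1]; sub(/_.*/, "", g5)         # gene symbol on the 5-prime side
            g3 = side[2]; sub(/_.*/, "", g3)         # gene symbol on the 3-prime side
            if (($3 == "in-frame" && g5 != g3) || $2 + 0 > min_reads) print
        }
    ' "$1" "$2" -
}

# Example usage (file names are placeholders):
#   select_fusions filter.txt keep.txt 10 < candidates.tsv
```

With these assumptions, an exon-skipping event such as MET_exon13---MET_exon15 is reported only because it appears in the keep list, which mirrors the purpose described for fusion.gene-exon.txt.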
1. Sample information table: sample name (the prefix of the bam file name), cancer type or project name (not used by the script; for user labeling only), panel name (the prefix of the panel-name.target.genes file), and the cpuBWA number.
An example table (tab-separated):
AP7 Sample_ID Panel cpuBWA
example LungFusion ITFTNA 2
example LungFusion ITFTNA 2
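A table in this layout can be read with a few lines of shell. The sketch below is not part of SplitFusion; the function name `read_sample_info` is hypothetical, and it assumes that the first row is a header and that the columns appear in the order shown above (sample name, label, panel, cpuBWA).

```shell
# Hypothetical helper: print "sample<TAB>panel<TAB>cpuBWA" for each data row
# of a tab-separated sample information table.
read_sample_info() {
    # NR > 1 skips the assumed header row; NF >= 4 skips incomplete rows.
    awk -F'\t' 'NR > 1 && NF >= 4 { print $1 "\t" $3 "\t" $4 }' "$1"
}

# Example usage (sample_info.txt is a placeholder name):
#   read_sample_info sample_info.txt |
#   while IFS="$(printf '\t')" read -r sample panel cpus; do
#       echo "sample=$sample panel=$panel cpus=$cpus"
#   done
```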
2. Config file: set the paths and parameters of the required tools in this file.
### Input file
SplitFusionPath="/your path/SplitFusion"
sampleInfo="/your path/Sample information table file"
runInfo="/your path/Config file"
bam_path="/your directory of bam or fasta files/"
Panel_path="/your directory of dependency data/"
### Tools and database
hgRef="/your path/Homo_sapiens_assembly19.fasta"
samtools="/your path/samtools" #Version: 1.9 (using htslib 1.9)
bedtools="/your path/bedtools" #Version: v2.25.0
java="/your path/java" #openjdk version "1.8.0_181", OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.16.04.1-b13), OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
snpEff="/your path/snpEff.jar" #Downloaded from http://snpeff.sourceforge.net/download_donate.html
snpEff_ref="hg19" ### Note: annotation results vary with the version of $snpEff_ref, which should match the version of hgRef
R="/your path/R"
bwa="/your path/bwa" #Version: bwa-0.7.17
### Parameters of SplitFusion
StrVarMinStartSite=3
maxQueryGap=0
minMapLength=25
minExclusive=25
maxOverlap=9
minMQ=13
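Because the config file uses shell-style `name="value"` assignments, it can be sanity-checked before a run. The function below is a sketch, not a SplitFusion feature: the name `check_run_info` is hypothetical, and whether SplitFusion itself sources the file this way is an assumption. Sourcing executes the file as shell code, so this should only be done with config files you trust.

```shell
# Hypothetical helper: source a config file in a subshell and report any of
# the variables from the example config that are unset or empty.
check_run_info() {
    (
        . "$1" || exit 2
        missing=""
        for name in SplitFusionPath sampleInfo bam_path Panel_path hgRef \
                    samtools bedtools java snpEff snpEff_ref R bwa; do
            eval "value=\${$name:-}"          # indirect lookup of the variable
            [ -n "$value" ] || missing="$missing $name"
        done
        if [ -n "$missing" ]; then
            echo "missing settings:$missing" >&2
            exit 1
        fi
    )
}

# Example usage (the path is a placeholder):
#   check_run_info /path/example.runInfo && echo "config looks complete"
```

Running the check in a subshell keeps the sourced variables from leaking into the calling shell.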
R command line:
> library(SplitFusion)
> runSplitFusion(runInfo = "/path/example.runInfo", output = "/path/result/", sample.id = "example")
An example brief output table:
AP7 | GeneExon5'---GeneExon3' | num_unique_reads | frame | Gene_Exon_cDNA_5'_3' |
---|---|---|---|---|
A01-P701 | KIF5B_exon15---RET_exon12 | 7 | in-frame | KIF5B exon15 c.1723 .NM_004521.---RET exon12 c.2138 .NM_020630. |
A02-P702 | EML4_intronic---ALK_exon20 | 9 | NA | EML4 intronic c.NA .NM_001145076.---ALK exon20 c.3171 .NM_004304. |
A02-P702 | EML4_intronic---ALK_exon20 | 10 | NA | EML4 intronic c.NA .NM_001145076.---ALK exon20 c.3173 .NM_004304. |
A02-P702 | EML4_exon4---ALK_exon20 | 64 | in-frame | EML4 exon4 c.468 .NM_001145076.---ALK exon20 c.3171 .NM_004304. |
An example output fastq file for the KIF5B_exon15---RET_exon12 fusion of sample A01-P701 is:
CL100059760L2C005R002_288074 TTCCCACTTTGGATCCTCCTTTACATCATTATTTCCCACAGCAATTCCTATTTCTGCAAGGTCTTTTAGTAAAGATGC
CL100059760L2C008R017_227619 TTCCGAGGGAATTCCCACTTTGGATCCTCCTTTACATCATTATTTCCCACAGCAATTCCTATTTCTGCAAGGTCTTTT
CL100059760L2C005R026_187567 TTGACTGGAGTTCAGACGTGTGCTCTTCCGAAAGCCCTCCCCGGTGCGCATGTTGGCAGGCTCAGACAAGGCCCTGG
CL100059760L2C003R075_538109 TAGGAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGAA
CL100059760L2C006R047_14778 ATAGGAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGA
CL100059760L2C012R012_483791 GAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGAAAAA
CL100059760L2C013R006_484169 AATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGAAAAAC
CL100059760L2C017R020_507274 ATAGGAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGA
CL100059760L2C017R091_315407 GAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAGCTTGGTTCTTGGAAAAA
CL100059760L2C003R015_252083 GGGAATTCCCACTTTGGATCCTCCTATGTTGGAATTCCCTCGGAAGAACTTGGTTCTTGGAAAAACTCTAAGATCGGA
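For downstream tools that expect standard FASTA records, a listing like the one above can be converted with a short awk script. This is a sketch only: the function name `pairs_to_fasta` is hypothetical, and it assumes the input is whitespace-separated read names alternating with sequences, as in the example, regardless of where the line breaks fall.

```shell
# Hypothetical helper: turn alternating "read_name sequence" tokens on stdin
# into FASTA records (">read_name" followed by the sequence) on stdout.
pairs_to_fasta() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if (have_name) {                  # second token of a pair: the sequence
                print ">" name
                print $i
                have_name = 0
            } else {                          # first token of a pair: the read name
                name = $i
                have_name = 1
            }
        }
    }'
}

# Example usage (file names are placeholders):
#   pairs_to_fasta < breakpoint_reads.txt > breakpoint_reads.fa
```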
A visualization of the example output fastq for the EML4_intron6---ALK_exon20 fusion: