-
Notifications
You must be signed in to change notification settings - Fork 26
RetroSeq Tutorial
RetroSeq is a tool for discovery and genotyping of transposable element variants (TEVs) (also known as mobile element insertions) from next-gen sequencing reads aligned to a reference genome in BAM format. The goal is to call TEVs that are not present in the reference genome but present in the sample that has been sequenced. It should be noted that RetroSeq can be used to locate any class of viral insertion in any species where whole-genome sequencing data with a suitable reference genome is available.
If you want to find if a transposable element that is present in the reference genome is not found in your sample (a deletion in structural variation terminology), then you should use a structural variation deletion caller such as Breakdancer.
RetroSeq is a two phase process, the first being the read pair discovery phase where discorandant mate pairs are detected and assigned to a TE class (Alu, SINE, LINE, etc.) by using either the annotated TE elements in the reference and/or aligned with Exonerate to the supplied library of viral sequences.
RetroSeq requires that samtools, windowBED, and optionally Exonerate are in the path. It will check for these and verify the version is sufficiently recent.
NOTE: RetroSeq has been primarily tested on BAM files produced by MAQ and BWA. There is no guarantee it will work on BAM files from other aligners.
The goal here is to pass through the BAM and identify discordant read pairs that might support a TE insertion. You can either supply a tab delimited file specifying a set of TE types (e.g. Alu, LINE etc.) and the corresponding BED file of locations where these are in the reference genome (-refTEs parameter). Alternatively, you can provide a tab delimited file specifying a set of viral/TE types and the corresponding fasta file with a set of consensus sequences for these (-eref parameter).
RetroSeq: A tool for discovery and genotyping of transposable elements from short read alignments
Version: 1.2 Author: Thomas Keane (thomas.keane@sanger.ac.uk)
Usage: retroseq.pl -discover -bam -eref -output [-srmode] [-q ] [-id ] [-len -noclean]
-bam BAM file of paired reads mapped to reference genome
-output Output file to store candidate supporting reads (required for calling step)
[-eref Tab file with list of transposon types and the corresponding fasta file of reference sequences (e.g. SINE /home/me/refs/SINE.fasta). Required when the -align option is used.]
[-refTEs Tab file with TE type and BED file of reference elements. These will be used to quickly assign discordant reads the TE types and avoid alignment. Using this will speed up discovery dramatically.]
[-srmode Search for split reads in the BAM file]
[-minclip] Minimum length of soft clippped portion of read to be considered for split-read analysis. Default is 30bp.
[-noclean Do not remove intermediate output files. Default is to cleanup.]
[-q Minimum mapping quality for a read mate that anchors the insertion call. Default is 30. Parameter is optional.]
[-id Minimum percent ID for a match of a read to the transposon references. Default is 90.]
[-len Minimum length of a hit to the transposon references. Default is 36bp.]
[-rgs Comma separated list of readgroups to operate on. Default is all.]
[-exd Fofn of BED files of regions where discordant mates falling into will be excluded e.g. simple repeats, centromeres, telomeres]
[-align Do the computational expensive exonerate PE discordant mate alignment step]
The discovery stage will produce an output file (specified by the -output parameter) with lists of read pair names per TE type. Multiple discovery stage outputs can be inputted to the calling phase (i.e. to allow parallelisation of the discovery phase).
The calling phase takes one or more outputs from the discovery phase, clusters the reads, and carries out various checks on the breakpoints to decide if a TEV is present. You can provide a list of locations to ignore per TE type - this would typically be the list of locations of the reference elements of that type (-filter option).
RetroSeq: A tool for discovery and genotyping of transposable elements from short read alignments
Version: 1.2 Author: Thomas Keane (thomas.keane@sanger.ac.uk)
Usage: retroseq.pl -call -bam -input -ref -output [-srinput -filter -cleanup -reads -depth ]
-bam BAM file OR BAM fofn
-input Either a single output file from the PE discover stage OR a prefix of a set of files from discovery to be combined for calling OR a fofn of discovery stage output files
-ref Fasta of reference genome
-output Output file name (VCF)
[-srinput Either a single output from split-read discovery stage OR a prefix of a set of files to be combined for calling (will trigger split-read calling to run)]
[-orientate Attempt to predicate the orientation of the calls. Default is no.]
[-filter Tab file with TE type and BED file of reference elements. These will be filtered out from the calling.]
[-region Call a particular chromosome only (chr) OR region (chr:start-end) only]
[-depth Max average depth of a region to be considered for calling. Default is 200.]
[-reads It is the minimum number of reads required to make a call. Default is 5. Parameter is optional.]
[-q Minimum mapping quality for a read mate that anchors the insertion call. Default is 30. Parameter is optional.]
[-ignoreRGs Single read group name OR a file of readgroups that should be ignored. Default is none.]
[-novel] Call novel sequence insertions also (from non-TE mate pairs)
[-noclean Do not remove intermediate output files. Default is to cleanup.]
The final TE calls from RetroSeq are in VCF format (http://vcftools.sourceforge.net/). The calls are annotated with information on number of supporting reads (GQ tag). The FL tag ranges from 1-8 and gives information on the breakpoint with 8 being the most confident calls and lower values indicating calls that don’t meet the breakpoint criteria for reasons such as lack of 5’ or 3’ reads. The example in Fig. 1 shows two human Alu calls, one with a GQ of 15 and FL of 5 and the other has a GQ of 37 and FL of 8 (highly confident).
##source=RetroSeq v1.2
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY">
##ALT=<ID=INS:ME,Type=String,Description="Insertion of a mobile element">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype quality">
##FORMAT=<ID=FL,Number=1,Type=Integer,Description="Call Status - for reference calls a flag to say if the call failed a particular filter. Filters are ordered by priority in calling (higher number indicates closer to being called). 1 - depth too high in region, 2 - not enough reads in cluster, 3 - not enough total flanking reads, 4 - not enough inconsistently mapped reads, 5 - neither side passes ratio test, 6 - one side passes ratio test, 7 - distance too large at breakpoint, 8 - PASSED all filters">
##INFO=<ID=NOT_VALIDATED,Number=0,Type=Flag,Description="Not validated experimentally">
##INFO=<ID=1000G,Number=0,Type=Flag,Description="Overlaps with 1000G MEI call">
##INFO=<ID=REPEATMASKER,Number=0,Type=Flag,Description="Overlaps with a reference ME element">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891
10 135035407 . G <INS:ME> 15 . MEINFO=Alu,135035407,135035408,NA;NOT_VALIDATED;SVTYPE=INS GT:GQ:FL 1/1:15:5
11 98774710 . A <INS:ME> 37 . MEINFO=Alu,98774710,98774711,NA;NOT_VALIDATED;SVTYPE=INS GT:GQ:FL 1/1:37:8
To run RetroSeq on a human sample you will need:
- BAM file of you reads mapped to the human reference
- BED files of the locations in the reference of the TE types you want to call (1)
- Optional file of BED file names with regions to ignore (e.g. telomeres, centromeres, reference gaps etc) (2)
- Reference fasta file
(1) The BED files should be listed together in a single tab delimited file giving the TE type and BED location e.g. types.tab
Alu /home/me/data/Alu.bed
L1 /home/me/data/L1.bed
(2) A file of file names where each file is a BED file with regions to ignore e.g. exclude_regions.fofn
hg19.fa.satellites.bed
hg19.fa.simple_repeat.bed
hg19.fa.centromeres.bed
Command 1 - read pair discovery
perl retroseq.pl -discover -bam mysample.bam -output mysample.candidates.tab -refTEs types.tab -exd exclude_regions.fofn
Command 2 - TE calling
perl retroseq.pl -call -bam mysample.bam -input mysample.candidates.tab -ref hg19.fa -output mysample.out -filter types.tab -reads 10 -depth 400
This will produce a VCF file calls mysample.out.PE.vcf with the final calls. Remember to filter the final VCF file for proximity to reference copies of Alu and L1. It is suggested to filter out any calls less than the average fragment size of the sequencing libraries.