GitHub - adaptivegenome/repeatseq: Accurate microsatellite genotypes from high-throughput resequencing data

adaptivegenome / repeatseq Public

Notifications You must be signed in to change notification settings
Fork 9
Star 23

Accurate microsatellite genotypes from high-throughput resequencing data

www.mittelmanlab.com

View license

23 stars 9 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
regions		regions
CLParse.cpp		CLParse.cpp
INSTALL		INSTALL
README		README
estimatePhi.py		estimatePhi.py
license.txt		license.txt
makefile		makefile
repeatseq.cpp		repeatseq.cpp
repeatseq.h		repeatseq.h
structures.cpp		structures.cpp

Repository files navigation

RepeatSeq v0.8.2

1. Introduction
RepeatSeq determines genotypes for microsatellite repeats in high-throughput sequencing data. RepeatSeq is available
through the Virginia Tech non-commerical license. For more details on the license and use, see license.txt included in
this distribution.

If you use this code, we would be appreciative if you cite our manuscript:

G. Highnam, C. Franck, A. Martin, C. Stephens, A. Puthige, and D. Mittelman (2012) Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles, Nucleic Acids Res, Epub Oct 22.

PubMed Link: http://www.ncbi.nlm.nih.gov/pubmed/23090981

Any feedback, comments, or suggestions are greatly appreciated as we develop RepeatSeq further. Contact us at adaptivegenome@gmail.com.

2. Setup and Installation
See INSTALL for install instructions

3. Required Input
RepeatSeq requires a BAM file, a FASTA file, and a region file as the minimal parameters. 

4. Optional Input

The user to specify a number of command-line options to customize the behavior of RepeatSeq:

Command-line Options: 
	-r   	    use only a specific read length or range of read lengths (e.g. LENGTH or MIN:MAX)
	-L          required number of reference matching bases BEFORE the repeat [3]
	-R          required number of reference matching bases AFTER the repeat [3]
	-M          minimum mapping quality for a read to be used for allele determination
	-multi      exclude reads flagged with the XT:A:R tag
	-pp         exclude reads that are not properly paired (for PE reads only)
    	-haploid    assume a haploid rather than diploid genome
	-repeatseq  write .repeatseq file (**see below for more information**)
	-calls      write .calls file (**see below for more information**)
	-t          include user-defined tag in the output filename
	-o          number of flanking bases to output from each read

5. Running RepeatSeq

Usage: repeatseq [options] <in.bam> <in.fasta> <in.regions>,

If an improper command line option is found, RepeatSeq will exit and print usage information.

6. Output Formats for RepeatSeq
RepeatSeq can output a VCF file or two custom output formats: .REPEATSEQ and .CALLS. The VCF file is the only file produced by default, however the other two can be enabled through the “-repeatseq” and “-calls” command line options. We are in the process of making 1000G calls and to faciliate this process we have recently revised our VCF output to meet 4.1 specs as well as the 1000G GT:GL format for genotypes and likelihoods.

The .calls format is tab delimited and contains repeat annotations, genotypes and confidence values. The genotype field is “NA” when a genotype is not assigned, “N” if the genotype is homozygous with length N, or “NhM” if the genotype is heterozygous with lengths N and M. Similar to the QUAL field in the VCF format, the confidence field here is the phred-scaled quality score for the assertion made in the genotype field, i.e. it gives the value of -10log_10(p_incorrect-call); high values indicate high confidence calls. The TRF string is the concatenated output of Gary Benson's TRF.

Here is a sample .calls file:

[region]	[TRF string]			        [Genotype][Confidence]
2L:6146-6162    3.8_4_78_21_20_52_0_0_47_1.00_ATTA    17        39.3627
2L:7006-7017    4.0_3_100_0_24_66_0_0_33_0.92_AAT     NA        NA
2L:10589-10595  7.0_1_100_0_14_0_0_0_100_0.00_T       7h6       17.5857

The .repeatseq file contains a full annotated alignment of reads to the a reference repeat. Records consist of the following items:

(1) The first line of each record is the headerline, marked with a ~ character. The field label keys are as follows:
○	REF - Reference length
○	A - Alleles present
○	C - Concordance (number of reads that support majority allele - 1 / total reads - 1)
○	D - Total number of reads that span the midpoint of the microsatellite (pre-filtering)
○	R - Reads present after all filters
○	S - Number of reads with no CIGAR sequence present (not included in genotyping)
○	M - Average mapping quality of reads used in genotyping
○	GT - Genotype
○	L - Likelihood; similar to VCF format, it gives the value of -10log_10(p_incorrect-call)

(2) The second line is the reference sequence for quick comparison with read sequences.

(3) Each following line is a read that was used in genotyping. It’s space-delimited with the following fields:
○	Bases leading into the repeat (lowercase 'x' means no base at that position)
○	Bases of the repeat 
○	Bases following the repeat (lowercase 'x' means no base at that position)
○	Start position of the read
○	Read length
○	Number of consecutive bases that match preceding the repeat
○	Number of consecutive bases that match following the repeat
○	Average base quality of the bases of the read (phred scores are converted to base calling probabilities before averaging).
○	Mapping quality of the read
○	Read information flags (p=paired, P=properly paired, u=unmapped, U=mate unmapped, r=reverse strand, R=mate is reversed strand, 1=first mate, 2=second mate, f=failed quality check)
○	CIGAR string
○	Read-ID

Here is an example of a .repeatseq file:

~2L:21828-21836 3.0_3_100_0_18_0_0_33_66_0.92_TTG REF:9 A:9 C:1 D:4 R:3 S:0 M:75 GT:9 L:75.84
TGTAAAAT TTGTTGTTG CATCAAAC
TGTAAAAT TTGTTGTTG CATCAAAC 21770 75 8 8 B:0.9949 M:150 F:pPr1 C:75M ID:USI-EAS034:7:8:1591:864#0
TGTAAAAT TTGTTGTTG CATCAAAC 21813 75 8 8 B:0.9957 M:0 F:pPR1 C:75M ID:USI-EAS034:7:5:1114:284#0
xxTAAAAT TTGTTGTTG CATCAAAC 21822 75 6 8 B:0.9957 M:150 F:pPR1 C:75M ID:USI-EAS034:7:79:1478:1338#0

~3R:13465-13477 13.0_1_100_0_26_0_0_0_100_0.00_T REF:13 A:10 C:1 D:13 R:8 S:0 M:105.88 GT:10 L:79.89
CCGTTACC TTTTTTTTTTTTT ATCGTTAC
CCGTTACC ---TTTTTTTTTT ATCGTTAC 13413 75 8 8 B:0.9951 M:160 F:pUr1 C:52M3D23M ID:USI-EAS034:7:69:9:117#0
CCGTTACC ---TTTTTTTTTT ATCGTTAC 13431 75 8 8 B:0.9958 M:149 F:pPr2 C:34M3D41M ID:USI-EAS034:7:8:241#0
CCGTTACC ---TTTTTTTTTT ATCGTTAC 13434 75 8 8 B:0.9955 M:151 F:pPR2 C:31M3D44M ID:USI-EAS034:7:17:517#0
CCGTTACC ---TTTTTTTTTT ATCGTTAC 13440 75 8 8 B:0.9956 M:115 F:pPR1 C:25M3D50M ID:USI-EAS034:7:86:1708#0
CCGTTACC ---TTTTTTTTTT ATCGTTAC 13442 75 8 8 B:0.9844 M:103 F:pPR2 C:23M3D51M1S ID:USI-EAS034:7:19:557#0
CCGTTACC ---TTTTTTTTTT ATCGTTAC 13448 75 8 8 B:0.9921 M:80 F:pPR2 C:17M3D58M ID:USI-EAS034:7:60:1435#0
CCGTTACC ---TTTTTTTTTT ATCGTTAC 13449 75 8 8 B:0.996 M:160 F:pPr2 C:16M3D59M ID:USI-EAS034:7:45:467#0
CCGTTACC ---TTTTTTTTTT ATCGTTAC 13456 75 8 8 B:0.9959 M:35 F:pPR2 C:9M3D66M ID:USI-EAS034:7:99:1447#0

~X:318-324 2.3_3_100_0_14_0_57_42_0_0.99_GCC REF:7 A:7[3] 10[2] C:0.5 D:9 R:5 S:0 M:54 GT:7h10 L:36.04
GTGGTGGT G---CCGCCG TTGATTTG
GTGGTGGT G---CCGCCG TTGATTTG 3149184 75 8 8 B:0.9949 M:160 F:pPr1 C:15M3I57M ID:USI-EAS034:7:56:544#0
GTGGTGGT GGTGCCGCCG TTGATTTG 3149189 75 8 8 B:0.812 M:0 F:pPr1 C:22S30M3I20M ID:USI-EAS034:7:30:128#0
GTGGTGGT GGTGCCGCCG TTGATTTx 3149192 75 8 7 B:0.7287 M:0 F:pPr2 C:32S27M3I13M ID:USI-EAS034:7:83:1423#0
GTGGTGGT G---CCGCCG TTGATTTG 3149195 75 8 8 B:0.8156 M:13 F:pPr1 C:21S4M3I47M ID:USI-EAS034:7:35:457#0
xxGGTGGT G---CCGCCG TTGATTTG 3149212 75 6 8 B:0.8486 M:150 F:pPR1 C:58M17S ID:USI-EAS034:7:55:1592#0

Please contact us at evolvability@vt.edu or visit mittelmanlab.com for more information.