How to Run matchAnnot
Run matchAnnot from the command line, as follows:
~skellytf/MatchAnnot-master/matchAnnot.py --gtf gencode.v19.annotation.gtf myData.sam > annotations.out
MatchAnnot expects the following inputs:
--gtf Annotation file, in the format given by the --format option (Mandatory).
--format Format of the annotation file: 'standard', 'alt' or 'pickle' (default: standard).
--clusters cluster_report.csv as produced by IsoSeq (Optional).
--outpickle Output file for matches in pickle format, used by clusterView (Optional).
(pipe or arg) SAM file of IsoSeq transcripts aligned to a genomic reference (Mandatory).
The annotations file can be in standard GTF format (as with GENCODE), an alternate (more forgiving) format, or in Python's pickle format. Annotation files from different sources vary quite a bit in format. If '--format standard' doesn't work for you, try '--format alt'. You can also pickle a 'standard' file, using the pickleAnnot.py script. A pickled file loads a little more quickly and takes a bit less room, but there's not a lot to be gained.
You'll probably want to create a pickled version of the output, using the '--outpickle' option followed by an output file name. If you want to run clusterView, you will need that file as input.
If you have the cluster_report.csv file produced by IsoSeq, you can pass it to matchAnnot with the '--clusters' option. The matchAnnot output will then include lines listing the individual full-length and partial reads-of-insert that comprise each cluster. On the other hand, there may be a lot of reads, and the output isn't always useful, so you may want to skip this option.
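As a rough illustration of how the options above combine, the following Python sketch assembles a matchAnnot command line. The helper name `build_matchannot_command`, the default script path and the file name `matches.pickle` are assumptions made for this example; only the flags themselves come from the option list above.

```python
import shlex


def build_matchannot_command(gtf, sam=None, fmt=None, clusters=None,
                             outpickle=None, script="matchAnnot.py"):
    """Assemble an argument list for matchAnnot (hypothetical helper).

    Only the flags documented above are used. `script` is an assumed
    path to matchAnnot.py; adjust it to your installation.
    """
    cmd = [script, "--gtf", gtf]
    if fmt is not None:
        if fmt not in ("standard", "alt", "pickle"):
            raise ValueError("format must be 'standard', 'alt' or 'pickle'")
        cmd += ["--format", fmt]
    if clusters is not None:
        cmd += ["--clusters", clusters]
    if outpickle is not None:
        cmd += ["--outpickle", outpickle]
    if sam is not None:
        # The SAM file is the last argument, without a keyword.
        # Leave it out if the SAM data will be piped in instead.
        cmd.append(sam)
    return cmd


cmd = build_matchannot_command(
    "gencode.v19.annotation.gtf",
    sam="myData.sam",
    outpickle="matches.pickle",  # assumed file name
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run`, with standard output redirected to a file such as annotations.out.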
The aligner used must be splice-aware. MatchAnnot was developed using the STAR aligner (http://code.google.com/p/rna-star). The reference supplied to STAR was created using the GRCh38 human reference and the GENCODE-22 annotations file (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.annotation.gtf.gz).
Care should be taken that the versions of the genomic reference and the annotation file match. For example, the GENCODE-22 annotations are based on the GRCh38 reference. If you mix versions, you will get meaningless results. Also make sure that the chromosome names match between the two files.
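One way to check the chromosome naming before running matchAnnot is to compare the names in the two files. The sketch below is not part of matchAnnot. It assumes a standard GTF (chromosome name in the first tab-separated column) and a SAM file whose header carries @SQ lines; the function names and sample data are made up for illustration.

```python
def gtf_chromosomes(lines):
    """Collect chromosome names from column 1 of GTF lines."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def sam_chromosomes(lines):
    """Collect reference names from the SN: field of @SQ header lines."""
    names = set()
    for line in lines:
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


# Made-up sample data: the GTF says "chr1", the SAM header says "1".
gtf_lines = [
    "##description: example annotation\n",
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";\n',
]
sam_lines = [
    "@HD\tVN:1.4\tSO:coordinate\n",
    "@SQ\tSN:1\tLN:248956422\n",
]

shared = gtf_chromosomes(gtf_lines) & sam_chromosomes(sam_lines)
if not shared:
    print("No chromosome names in common; check the naming convention.")
```

With real files, pass open file handles instead of the lists. An empty intersection usually points to a naming mismatch such as "chr1" versus "1".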
The input SAM file can be specified either as the last argument on the command line (without a keyword), or can be piped in with |. The file should be sorted by chromosome and coordinate (e.g., with 'samtools sort').
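If you are unsure whether a SAM file is already sorted, a quick check along the following lines can help. This is a minimal sketch, not matchAnnot code: it treats a file as sorted when each chromosome's alignments appear in one contiguous block with non-decreasing positions, and the function name and sample records are assumptions.

```python
def is_coordinate_sorted(sam_lines):
    """Return True if alignments are grouped by chromosome and ordered by position."""
    seen = set()      # chromosomes whose block has already started
    current = None    # chromosome of the block being read
    last_pos = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split("\t")
        chrom, pos = fields[2], int(fields[3])  # RNAME and POS columns
        if chrom != current:
            if chrom in seen:
                return False  # chromosome reappears after another one
            seen.add(chrom)
            current = chrom
        elif pos < last_pos:
            return False      # position went backwards within a chromosome
        last_pos = pos
    return True


# Made-up alignment records, truncated to the first four SAM columns.
sorted_sam = ["r1\t0\tchr1\t100\n", "r2\t0\tchr1\t250\n", "r3\t0\tchr2\t50\n"]
unsorted_sam = ["r1\t0\tchr1\t250\n", "r2\t0\tchr1\t100\n"]

print(is_coordinate_sorted(sorted_sam))
print(is_coordinate_sorted(unsorted_sam))
```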
It's also useful for the SAM lines to include an MD string, which is used, along with the CIGAR string, to describe the differences between the aligned sequence and the reference. According to the SAM spec, the MD string is optional, and not all aligners include it in their output. You can add it after the fact to an existing SAM file using 'samtools calmd'. If an MD string is present in a SAM line, the exon: lines in the matchAnnot output include the number of substitution errors in the exon, and an overall Q score for the exon.
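To give a feel for what the MD string encodes, the sketch below pulls the MD tag out of a SAM line and counts the substitutions it records. It follows the MD format in the SAM specification (match counts, mismatched reference bases, and deletions prefixed with '^'). It is not matchAnnot's own implementation, whose per-exon counting and Q score calculation are not shown here; the function names and the sample line are made up.

```python
import re

# One MD token: a run of matches, a deletion (^ plus bases), or one mismatch.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def md_from_sam_line(line):
    """Return the MD:Z: value of a SAM alignment line, or None if absent."""
    # Optional fields start after the 11 mandatory SAM columns.
    for field in line.rstrip("\n").split("\t")[11:]:
        if field.startswith("MD:Z:"):
            return field[5:]
    return None


def count_md_substitutions(md):
    """Count mismatched bases in an MD string, ignoring deletions."""
    count = 0
    for _matches, _deletion, mismatch in MD_TOKEN.findall(md):
        if mismatch:
            count += 1
    return count


# Made-up alignment: two substitutions (A, T) and one 2-base deletion (^AC).
sam_line = "read1\t0\tchr1\t100\t255\t16M2D10M\t*\t0\t0\t*\t*\tNM:i:4\tMD:Z:10A5^AC6T3\n"

md = md_from_sam_line(sam_line)
print(md, count_md_substitutions(md))
```

A line without an MD tag returns None from `md_from_sam_line`, which is the case where 'samtools calmd' can be used to add the tag.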