GitHub - TomSkelly/MatchAnnot: Python scripts for matching output of Pacific Biosciences IsoSeq RNA-seq pipeline to an annotation file.

TomSkelly / MatchAnnot Public

Notifications You must be signed in to change notification settings
Fork 12
Star 23

Python scripts for matching output of Pacific Biosciences IsoSeq RNA-seq pipeline to an annotation file.

23 stars 12 forks Branches Tags Activity

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
images		images
.gitignore		.gitignore
Annotations.py		Annotations.py
Best.py		Best.py
CHANGES		CHANGES
COPYING		COPYING
CigarString.py		CigarString.py
Cluster.py		Cluster.py
ClusterReport.py		ClusterReport.py
INSTALL		INSTALL
PolyA.py		PolyA.py
README		README
Reference.py		Reference.py
SWAligner.py		SWAligner.py
clusterView.py		clusterView.py
mapAnnot.py		mapAnnot.py
matchAnnot.py		matchAnnot.py
pickleAnnot.py		pickleAnnot.py
pickleRef.py		pickleRef.py
postAnnot.py		postAnnot.py
showAnnot.py		showAnnot.py
splitSAM.py		splitSAM.py
trimPrimers.py		trimPrimers.py
trimSplit.py		trimSplit.py
tt_log.py		tt_log.py

Repository files navigation

MatchAnnot is a python script which accepts a SAM file of IsoSeq
transcripts aligned to a genomic reference and matches them to an
annotation database in GTF format.

The aligner used must be splice-aware. MatchAnnot has been developed
using the STAR aligner (http://code.google.com/p/rna-star). The
reference supplied to STAR was created using the hg19 human reference
and the GENCODE-19 annotations file:
(ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz)


MatchAnnot expects the following inputs:

    --gtf          Annotation file, in format as described by --format option (Mandatory).
    --format       Format of annotation file: 'standard', 'alt' or 'pickle' (default: standard).
    --clusters     cluster_report.csv as produced by IsoSeq (Optional).
    (pipe or arg)  SAM file of IsoSeq transcripts aligned to genomic reference (Mandatory).


The output of the gencode_isoseq.pl script contains several types of line:

isoform:     A mapped isoform, output of IsoSeq. Line shows isoform name,
             and start and end genomic coordinates of alignment.

cigar:       The cigar string from the SAM file entry for the isoform.

cl:          A list of the reads-of-insert which were clustered to create the
             isoform. This information is printed only if a cluster report file
             is supplied via the --clusters parameter. Each line lists one or 
	     more reads from a single SMRTcell, labelled as either full-length
             or non-FL. The mapping from SMRTcell number to full SMRTcell name
	     is in the summary at the end of the output.

polyA:       A list of the positions where polyadenylation motifs were found 
             near the 3' end of the isoform.

gene:        A gene in the annotation file whose position overlaps the
             aligned isoform. Line shows gene name, its start and end
             coordinates, and the differences between those and the
             isoform start and end.

tr:          An annotated transcript of the gene under consideration. Line
             shows transcript name, a score, and the exon-to-exon
             mapping. Each [] grouping in the exon mapping is
             a list of transcript exons which match the isoform exons
             (see example below). Scores are as follows:

             5: IsoSeq exons match annotation exons one-for-one. Sizes agree
                except for leading and trailing edges.

             4: Like 5, but leading and trailing edge sizes differ by a 
                larger amount than the score-5 transcript found for this gene.

             3: One-for-one exon match, but sizes of internal exons disagree.

             2: The best match among all score=1 transcripts.

             1: Some exons overlap, overlaps are 1-for-1 where they exist.

	     0: Everyting else: isoform overlaps gene, but little or
	        no exon congruance.

exon:        Details of a single exon match. Shown only for transcripts
             with score >= 3. Line shows isoform and transcript start and
             stop coordinates and the delta between them, plus the
             number of indels found in the alignment (per the cigar
             string).

result:      A one-line summary for the isoform, showing the best gene and
             trancript found, and the resulting score.

summary:     Bookkeeping information at the end.


An example of an exon mapping (exons are numbered from 0):

                   1             2         3               4           5
   isoform:        ==========    ======    ==============  ===         =======
   transcript      =====        =========    ====    =========  =====    ========
                   1            2            3       4          5        6

   maps as follows:

   [1] [2] [3,4] [4] [6]


   An ideal mapping is one-for-one:

   [1] [2] [3] [4] [5]


   To make it *really* ideal, the exon coordinates should be equal as well (or nearly so).