Home
Welcome to the Ribbon wiki!
The aim of this wiki is to give users a brief explanation of the tool and how to use it.
The aim of Ribbon is simple: from a group of transcripts, de novo contigs, reference sequences (or a mixture of all of the above), construct a long transcript (a SuperTranscript) containing all the bases from all the shorter sequences whilst preserving the order in which they occur in the shorter transcripts.
Installation
There are two pre-requisites for Ribbon:
- Anaconda for python3 (which will come with all necessary python packages) - https://www.continuum.io/downloads.
- BLAT v.35 (Download zip file and install as per README - https://users.soe.ucsc.edu/~kent/src/).
Annotations
Additionally, Ribbon can provide two complementary annotations of these SuperTranscripts:
1) An annotation by blocks, which are defined by matched, overlapping parts of two or more of the contigs that build the SuperTranscript; these are non-diverging paths in the graph. These SuperBlocks can be thought of as exon-like structures, although in practice they can contain fewer or more bases than a single exon in an annotated genome.
2) An annotation built from the transcript coverage over the SuperTranscript. In other words, it annotates where the transcripts used to build the SuperTranscript map back to the SuperTranscript.
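As a rough illustration of how the two annotation styles could be laid out on disk, the sketch below formats intervals on a SuperTranscript as GFF lines. The nine tab-separated columns follow the general GFF layout; the source label, the feature names (block, transcript), the attribute keys, and all coordinates are assumptions made for illustration and may differ from what Ribbon actually writes.

```python
def gff_line(seqname, feature, start, end, attributes, source="Ribbon"):
    """Format one interval as a nine-column GFF line.

    start and end are 1-based and inclusive, as in GFF.
    """
    if start < 1 or end < start:
        raise ValueError("GFF coordinates must satisfy 1 <= start <= end")
    attr = ";".join(f"{key}={value}" for key, value in attributes.items())
    # Columns: seqname, source, feature, start, end, score, strand, frame, attributes
    return "\t".join(
        [seqname, source, feature, str(start), str(end), ".", ".", ".", attr]
    )


# Hypothetical gene with three blocks and two transcripts mapped back to it.
blocks = [(1, 120), (121, 180), (181, 400)]
transcripts = {"contig_1": [(1, 120), (181, 400)], "contig_2": [(1, 400)]}

lines = []
# Block-style annotation: one line per block.
for i, (start, end) in enumerate(blocks, start=1):
    lines.append(gff_line("GeneA", "block", start, end, {"ID": f"block_{i}"}))
# Transcript-style annotation: one line per interval a transcript covers.
for name, intervals in transcripts.items():
    for start, end in intervals:
        lines.append(gff_line("GeneA", "transcript", start, end, {"ID": name}))

print("\n".join(lines))
```

Here contig_1 skips the middle block, so it appears as two intervals, while contig_2 spans the whole SuperTranscript.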
The Algorithm
The algorithm can be thought of as the following steps:
1) Input a list of contigs/transcripts and their sequences in a fasta file, and a text file with the clustering information stating which gene/cluster each transcript belongs to.
For each gene:
2) Using BLAT (https://genome.ucsc.edu/FAQ/FAQblat.html), pairwise align the transcripts in the cluster to find the regions which overlap.
3) Construct a directed graph, where each node is a base in one of the transcripts and the directed edges retain the ordering of the bases in each transcript. Using the pairwise alignments, merge shared bases (nodes) together.
4) Simplify the graph and remove all cycles to create a Directed Acyclic Graph (DAG), which is necessary for the next step.
5) Topologically sort the nodes (each node is now a string of bases from the original unsimplified graph) using Kahn's algorithm, which gives a non-unique ordering of the bases.
6) Extract the annotations (both SuperBlock style and transcript style).
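The topological sort in step 5 can be sketched as follows. This is a minimal, generic implementation of Kahn's algorithm on a toy simplified graph, not Ribbon's own code; the node names and sequences are hypothetical. It assumes cycles have already been removed, as in step 4, and raises an error otherwise.

```python
from collections import deque


def kahn_sort(nodes, edges):
    """Return the node ids of a DAG in a topological order (Kahn's algorithm).

    nodes: iterable of node ids; edges: iterable of (source, target) pairs.
    Raises ValueError if the graph still contains a cycle.
    """
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in successors}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from every node that has no incoming edge.
    ready = deque(n for n in successors if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(successors):
        raise ValueError("graph contains a cycle; break cycles before sorting")
    return order


# Toy simplified graph: each node is a run of bases (hypothetical sequences).
# One transcript follows start -> exon_a -> end, the other start -> exon_b -> end.
blocks = {"start": "AAAA", "exon_a": "CC", "exon_b": "GGG", "end": "TTTT"}
edges = [
    ("start", "exon_a"), ("exon_a", "end"),
    ("start", "exon_b"), ("exon_b", "end"),
]

order = kahn_sort(blocks, edges)
supertranscript = "".join(blocks[n] for n in order)
print(order)
print(supertranscript)
```

Because exon_a and exon_b are not ordered relative to each other, either may come first in a valid result, which is why the sorting is non-unique. Both orderings keep every transcript's bases in their original order.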
Example of a cycle break:
Visualisation
The output of SuperTranscript is a fasta file and two annotation files (.gff). These can be read into the Integrative Genomics Viewer (IGV - https://www.broadinstitute.org/igv/), where one can view the read coverage of the SuperTranscript and the various annotations.
The files used in this example can be found in the Examples folder of the repository.
Producing a SuperTranscript
python Ribbon.py Example_genome.fasta clusters.txt
Where Example_genome.fasta is a fasta file containing all the transcripts in all the genes/clusters you wish to construct a SuperTranscript for, and clusters.txt is a text file with two tab-separated columns: the transcript/contig name in the first column and the cluster/gene name in the second (as in the output of Corset).
This runs in parallel mode (each gene can be run as a separate stand-alone thread).
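To make the cluster file layout and the per-gene parallelism concrete, here is a minimal sketch that groups transcripts by cluster and then processes each gene independently in a worker pool. The function names and the example data are hypothetical, build_one is only a stand-in for the real per-gene work, and the use of a thread pool is an assumption; this page does not say how Ribbon schedules genes internally.

```python
import io
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def read_clusters(handle):
    """Group transcript names by cluster from a two-column, tab-separated file."""
    clusters = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # First column: transcript/contig name; second column: cluster/gene name.
        transcript, cluster = line.split("\t")[:2]
        clusters[cluster].append(transcript)
    return dict(clusters)


def build_one(item):
    """Hypothetical stand-in for the per-gene work; it only counts transcripts."""
    gene, transcripts = item
    return gene, len(transcripts)


# In practice this would be open("clusters.txt"); a small in-memory example is used here.
example = io.StringIO("contig_1\tGeneA\ncontig_2\tGeneA\ncontig_3\tGeneB\n")
clusters = read_clusters(example)

# Four workers mirrors the documented default of --cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(build_one, clusters.items()))

print(results)
```

Each gene depends only on its own transcripts, so the genes can be handed to workers in any order without coordination.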
To get the help options simply type:
python Ribbon.py --help
usage: Ribbon.py [-h] [--cores CORES] [--alternate] [--clear]
                 GenomeFile ClusterFile

positional arguments:
  GenomeFile         The name of the fasta file containing all transcripts
  ClusterFile        The name of the text file with the transcript to cluster
                     mapping

optional arguments:
  -h, --help         show this help message and exit
  --cores CORES      The number of cores you wish to run the job on (default =
                     4)
  --alternate, -aa   Create alternate annotations and create metrics on success
                     of SuperTranscript Building
  --clear, -c        Clear intermediate files after processing
  --maxTran MAXTRAN  Set a maximum for the number of transcripts from a
                     cluster to be included for building the SuperTranscript
                     (default=50).
Note: By default, all the fasta files and psl files required for the BLAT pairwise alignment will be produced in the folder where your cluster mapping text file is located.
Extracting the annotation of transcripts against the SuperTranscript
python Checker.py SuperDuper.fasta
usage: Checker.py [-h] [--cores CORES] SuperFile

positional arguments:
  SuperFile      The name of the SuperDuper.fasta file created by
                 SuperTranscript

optional arguments:
  -h, --help     show this help message and exit
  --cores CORES  The number of cores you wish to run the job on (default = 4)
IGV viewer
To start IGV from the command line, simply type:
igv
This will load IGV (if you have it installed). Then load the SuperDuper.fasta file, which contains the sequence for each gene, followed by the sorted .bam files containing the reads mapped to SuperDuper.fasta and the annotation files, SuperDuper.gff and SuperDuper_trans.gff. Remember to expand the annotation tracks by right-clicking on the annotation object in IGV and choosing the expanded view mode.
Viewing transcript coverage on SuperTranscript
Another function of the Ribbon package is to view, for a given gene, the coverage of each transcript on the SuperTranscript.
python STViewer.py GeneA
usage: STViewer.py [-h] GeneName

positional arguments:
  GeneName    The name of the gene whom you wish to view

optional arguments:
  -h, --help  show this help message and exit
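The idea of per-transcript coverage on a SuperTranscript can be illustrated with a small text rendering. This is only a sketch of the concept and not how STViewer.py draws its output, which is not described here; the gene length, contig names, and intervals are hypothetical.

```python
def coverage_track(length, intervals, mark="=", gap="-"):
    """Draw 1-based inclusive intervals on a line of the given length."""
    track = [gap] * length
    for start, end in intervals:
        for pos in range(start, end + 1):
            track[pos - 1] = mark
    return "".join(track)


# Hypothetical SuperTranscript of 20 bases with two transcripts mapped back to it.
length = 20
mapped = {"contig_1": [(1, 8), (15, 20)], "contig_2": [(1, 20)]}

# Pad the names so that the tracks line up underneath each other.
width = max(len(name) for name in mapped)
for name, intervals in mapped.items():
    print(name.ljust(width), coverage_track(length, intervals))
```

Gaps in a track mark the parts of the SuperTranscript that a given transcript does not cover, for example a skipped block.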
1) Run a de novo assembly (e.g. with Trinity).
2) Cluster the contigs into genes (e.g. using Trinity or Corset).
3) Build the SuperTranscript and annotations.
4) Map reads to the SuperTranscript.
5) Sort and index the bam files (if you want to view them in IGV).
Further optional analyses
- Extract differential expression at the gene level (e.g. using Corset or DESeq, limma/voom)
- Extract differential transcript/exon usage (e.g. with DEXSeq or edgeR/voom/diffSplice)
- Cryptic Cancer variants