Quarkins edited this page Aug 16, 2016 · 39 revisions

Welcome to the Ribbon wiki!

This wiki gives a brief explanation of the tool and how to use it.

The aim of Ribbon is simple: from a group of transcripts, de novo contigs, reference sequences (or a mixture of these), construct a single long transcript (a SuperTranscript) that contains all the bases from all the shorter sequences while preserving the order in which they occur in the shorter transcripts.

Installation

There are two prerequisites for Ribbon:

  1. Anaconda for Python 3 (which comes with all the necessary Python packages) - https://www.continuum.io/downloads.
  2. BLAT v.35 (Download zip file and install as per README - https://users.soe.ucsc.edu/~kent/src/).

Annotations

Additionally, Ribbon can provide two complementary annotations of these SuperTranscripts:

1) An annotation by blocks, which are defined by matched overlapping parts of two or more of the contigs that build the SuperTranscript; these are non-diverging paths on the graph. These SuperBlocks can be thought of as exon-like structures, although in reality they can include fewer or more bases than a single exon from an annotated genome.

2) An annotation built from the transcript coverage over the SuperTranscript. In other words, it annotates where the transcripts used to build the SuperTranscript map back to the SuperTranscript.

The Algorithm

The algorithm can be thought of as the following steps:

1) Input a list of contigs/transcripts and their sequences in a fasta file, and a text file with the clustering information stating which gene/cluster each transcript belongs to.

For each gene:
2) Using BLAT (https://genome.ucsc.edu/FAQ/FAQblat.html), pairwise align the transcripts in the cluster to find the regions which overlap.
3) Construct a directed graph, where each node is a base in one of the transcripts and each directed edge retains the ordering of the bases in each transcript. Using the pairwise alignments of all transcripts in the cluster, merge shared bases (nodes) together.
4) Simplify the graph and remove all cycles to create a Directed Acyclic Graph (DAG), which is necessary for the next step.
5) Topologically sort the nodes (each node is now a string of bases from the original unsimplified graph) using Kahn's algorithm, which gives a non-unique ordering of the bases.
6) Extract the annotations (both SuperBlock style and transcript style).
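
The graph and sorting steps can be illustrated with a small sketch. It is not Ribbon's own code: it assumes the transcripts have already been reduced to paths through merged nodes (the node names and sequences below are invented for illustration), and it breaks ties alphabetically so that the otherwise non-unique ordering is reproducible.

```python
import heapq
from collections import defaultdict


def build_graph(paths):
    """Collect directed edges between consecutive nodes of each transcript path."""
    nodes = set()
    successors = defaultdict(set)
    for path in paths:
        nodes.update(path)
        for src, dst in zip(path, path[1:]):
            successors[src].add(dst)
    return nodes, successors


def kahn_sort(nodes, successors):
    """Topologically sort a DAG with Kahn's algorithm (alphabetical tie-breaking)."""
    indegree = {node: 0 for node in nodes}
    for src in successors:
        for dst in successors[src]:
            indegree[dst] += 1

    # Nodes with no incoming edges can be emitted straight away.
    ready = [node for node in nodes if indegree[node] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors.get(node, ()):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle; break cycles before sorting")
    return order


# Hypothetical merged nodes: two transcripts share A and D but differ in the middle.
sequences = {"A": "ATG", "B": "CCC", "C": "GGT", "D": "TAA"}
transcript_paths = [["A", "B", "D"], ["A", "C", "D"]]

nodes, successors = build_graph(transcript_paths)
order = kahn_sort(nodes, successors)
supertranscript = "".join(sequences[node] for node in order)
print(order, supertranscript)
```

Both `A, B, C, D` and `A, C, B, D` are valid topological orders for this graph, which is why the sorting is described as non-unique; the heap simply picks one of them consistently.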

Example of a cycle break:
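
One way to break a cycle without losing any bases is to find an edge that closes the cycle and redirect it to a duplicate of the node it points to. The sketch below does this with a depth-first search. It is an illustrative assumption, not necessarily how Ribbon breaks cycles, and the example graph and the `_copy` naming are made up for the example.

```python
def find_back_edge(graph):
    """Return one edge (u, v) that closes a cycle, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    colour = {node: WHITE for node in graph}

    def visit(u):
        colour[u] = GREY
        for v in graph[u]:
            if colour[v] == GREY:  # v is an ancestor of u, so u -> v closes a cycle
                return (u, v)
            if colour[v] == WHITE:
                found = visit(v)
                if found:
                    return found
        colour[u] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def break_cycles(graph):
    """Redirect every cycle-closing edge to a fresh copy of its target node."""
    graph = {node: list(succ) for node, succ in graph.items()}  # work on a copy
    copies = {}  # new node name -> name of the node it duplicates
    while True:
        edge = find_back_edge(graph)
        if edge is None:
            return graph, copies
        u, v = edge
        copy = f"{v}_copy{len(copies) + 1}"
        copies[copy] = v
        graph[copy] = []  # the duplicate has no outgoing edges
        graph[u] = [copy if w == v else w for w in graph[u]]


# Hypothetical graph with the cycle A -> B -> C -> A; every node must be a key.
cyclic = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
acyclic, copies = break_cycles(cyclic)
print(acyclic, copies)
```

Each pass removes one edge between existing nodes and adds only an edge into a node with no outgoing edges, so the loop always terminates. The recursive search is fine for small graphs; a large graph would need an iterative version.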

Visualisation

The output of SuperTranscript is a fasta file and two annotation files (.gff). These can be read into the Integrative Genomics Viewer (IGV - https://www.broadinstitute.org/igv/), where one can view the read coverage of the SuperTranscript and the various annotations.

Code Example

The files used in this example can be found in the Examples folder of the repository.

Producing a SuperTranscript

python Ribbon.py Example_genome.fasta clusters.txt

Here Example_genome.fasta is a fasta file which contains all the transcripts in all the genes/clusters you wish to construct a SuperTranscript for, and clusters.txt is a text file with two tab-separated columns: the transcript/contig name in the first column and the cluster/gene name in the second column (as in the output of Corset).
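
To make the cluster file layout concrete, here is a minimal sketch of reading such a two-column, tab-separated file into a mapping from cluster to transcripts. The function name and the contig/gene names are invented for illustration; this is not taken from Ribbon's source.

```python
from collections import defaultdict


def read_clusters(lines):
    """Group transcript names by cluster from 'transcript<TAB>cluster' lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        transcript, cluster = line.split("\t")[:2]
        clusters[cluster].append(transcript)
    return dict(clusters)


# Hypothetical file contents; with a real file, pass an open file handle instead:
#     with open("clusters.txt") as handle:
#         clusters = read_clusters(handle)
example_lines = ["contig1\tGeneA\n", "contig2\tGeneA\n", "contig3\tGeneB\n"]
clusters = read_clusters(example_lines)
print(clusters)
```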

This runs in parallel mode (each gene can be run as a separate stand-alone thread).
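
The per-gene parallelism can be pictured as follows. This is only a sketch: `build_one` is a hypothetical stand-in for the real per-gene work, and the standard-library thread pool is an assumption about the mechanism, not a description of Ribbon's internals.

```python
from concurrent.futures import ThreadPoolExecutor


def build_one(item):
    """Hypothetical placeholder for building one gene's SuperTranscript."""
    gene, transcripts = item
    return gene, len(transcripts)  # here we only count the transcripts


def build_all(clusters, cores=4):
    """Process every gene independently, using at most `cores` worker threads."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return dict(pool.map(build_one, clusters.items()))


# Hypothetical clusters, in the same shape as a parsed cluster file.
clusters = {"GeneA": ["contig1", "contig2"], "GeneB": ["contig3"]}
summary = build_all(clusters, cores=2)
print(summary)
```

Because no gene depends on the result of another, the genes can be handed to the workers in any order.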

To get the help options simply type:
python Ribbon.py --help

usage: Ribbon.py [-h] [--cores CORES] [--alternate] [--clear]
             GenomeFile ClusterFile  

positional arguments:  
  GenomeFile        The name of the fasta file containing all transcripts  
  ClusterFile       The name of the text file with the transcript to cluster
                mapping  

optional arguments:  
  -h, --help        show this help message and exit  
  --cores CORES     The number of cores you wish to run the job on (default =  
                4)  
  --alternate, -aa  Create alternate annotations and create metrics on success  
                of SuperTranscript Building  
  --clear, -c       Clear intermediate files after processing
  --maxTran MAXTRAN  Set a maximum for the number of transcripts from a
                 cluster to be included for building the SuperTranscript
                 (default=50).  

Note: By default, all the fasta and psl files required for the BLAT pairwise alignment are written to the folder containing your cluster mapping text file.

Extracting the annotation of transcripts against the SuperTranscript

python Checker.py SuperDuper.fasta

usage: Checker.py [-h] [--cores CORES] SuperFile  

positional arguments:  
  SuperFile      The name of the SuperDuper.fasta file created by  
             SuperTranscript  

optional arguments:  
  -h, --help     show this help message and exit   
  --cores CORES  The number of cores you wish to run the job on (default = 4)  

IGV viewer

To start IGV from the command line, simply type:

igv

This will load IGV (if you have it installed). Then load the SuperDuper.fasta file, which contains the sequence for each gene, followed by the sorted .bam files containing the reads mapped to SuperDuper.fasta and the annotation files, SuperDuper.gff and SuperDuper_trans.gff (remember to expand the annotations by right-clicking on the annotation track in IGV and choosing the expanded view mode).

Viewing transcript coverage on SuperTranscript

The Ribbon package can also display, for a given gene, the coverage of each transcript on the SuperTranscript.

python STViewer.py GeneA

usage: STViewer.py [-h] GeneName  

positional arguments:  
  GeneName    The name of the gene whom you wish to view  

optional arguments:  
  -h, --help  show this help message and exit  

A suggested pipeline for de novo assembled non-model organisms

1) Run a de novo assembly (e.g. with Trinity).
2) Cluster the contigs into genes (e.g. using Trinity or Corset).
3) Build the SuperTranscripts and annotations.
4) Map reads to the SuperTranscripts.
5) Sort and index the bam files (if you want to view them in IGV).
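
The sorting and indexing step could be scripted as below. This page does not name a tool for it, so the use of samtools here is an assumption, as are the file names. The sketch only builds the command lines; `sort_and_index` would run them and needs samtools on the PATH.

```python
import subprocess


def sort_and_index_commands(bam_path):
    """Return the samtools commands that sort a bam file and index the result."""
    stem = bam_path[:-4] if bam_path.endswith(".bam") else bam_path
    sorted_path = stem + ".sorted.bam"
    return [
        ["samtools", "sort", "-o", sorted_path, bam_path],
        ["samtools", "index", sorted_path],
    ]


def sort_and_index(bam_path):
    """Run the commands in order, stopping at the first failure."""
    for command in sort_and_index_commands(bam_path):
        subprocess.run(command, check=True)


# Hypothetical input file; nothing is executed here, the commands are only built.
commands = sort_and_index_commands("sample.bam")
for command in commands:
    print(" ".join(command))
```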

Further optional analyses

  • Extract differential expression at the gene level (e.g. using Corset or DESeq, limma/voom)
  • Extract differential transcript/exon usage (e.g. with DEXSeq or edgeR/voom/diffSplice)
  • Cryptic Cancer variants