Tutorial

Executing SwitchSeq
- 1. Downloading the annotation data files
  - Option 1: switchseq -t get_data (faster)
  - Option 2: resources/get_data/get_data.sh
- 2. Finding, annotating and visualising splicing switch events: switchseq -t get_switch
  - Dealing with technical replicates
  - Filtering events
Interpreting SwitchSeq output

Executing SwitchSeq

1. Downloading the annotation data files

SwitchSeq needs a few data files to be able to collect information on the transcripts and genes involved in the switch events and to display this information properly in the html output. Here you will find how to obtain this information in the proper format.

Note: The version of the annotation used with SwitchSeq should coincide with that used to estimate transcript expression levels. Any switch events that affect either non-coding genes or transcripts that are not listed in the annotation provided will be listed in the file skipped.txt (see Output directory structure).

Option 1: `switchseq -t get_data` (faster)

The tool get_data offers an easy way to retrieve all the necessary files for the execution, since it can be executed within SwitchSeq. However, it will just work for a few selected species and Ensembl versions (see Supported species and Ensembl versions); which mostly coincide with the ones included in the latest APPRIS release (v3.2.2 - Dec 2013). If the species and/or Ensembl version that you are using are not included in that list, please refer to the next section (Option 2).

Let us imagine that we are working with a human dataset and have quantified transcript expression levels by using the models provided in Ensembl 66. We can proceed to obtain the necessary files with the following command:

switchseq -t get_data --species hsa --ensembl_v 66

This will automatically retrieve information from Ensembl and APPRIS for switch event annotation.

Note: More information on the tool get_data can be obtained with the --help option (switchseq -t get_data --help) or in the Software usage page.

List of supported species and Ensembl versions

Human - hsa - Ensembl v74
Human - hsa - Ensembl v73
Human - hsa - Ensembl v66
Mouse - mmu - Ensembl v75
Mouse - mmu - Ensembl v74
Mouse - mmu - Ensembl v72
Rat - rno - Ensembl v70
Zebrafish - dre - Ensembl v70

Option 2: `resources/get_data/get_data.sh`

Alternatively, if your species and/or Ensembl version are not included in the list above, you should proceed by executing the script resources/get_data/get_data.sh, which will create a similar output to the one provided by the tool switchseq -t get_data. Here an example on how to call this script (see as well the script resources/get_data/EXECUTE.sh):

species=hsa
ensembl_v=73
outdir=/homes/mar/public_html/tools/switchseq/data_for_download
biomart_path=http://sep2013.archive.ensembl.org/biomart/martservice?
    # url to the correct archive version
    # should be in the following format: 
    # http://release_date.archive.ensembl.org/biomart/martservice?
    # further info: http://www.ensembl.org/Help/ArchiveList
ensembl_api_path=/homes/mar/system/ensembl
    # path to the Ensembl Core API
    # see https://github.com/Ensembl/ensembl   
bioperl=/homes/mar/system/bioperl-live
    # path to bioperl

./get_data.sh $species $ensembl_v $outdir $biomart_path $ensembl_api_path $bioperl

2. Finding, annotating and visualising splicing switch events: `switchseq -t get_switch`

In order to identify switch events we need to provide a table with normalised counts. Such quantifications can be obtained using any normalisation method (e.g. RPKMs/FPKMs, TPMs, DESeq...), but the use of those that properly estimate size factors across samples is encouraged. The input table can contain any number of genes, but a useful approach is to focus on the genes for which we have already predicted differences in splicing through other tools (e.g. DEXSeq, MMDIFF). In this way, we can obtain a subset of that list that contains the most extreme changes in splicing across the two studied conditions, i.e. switch events.

It is important to provide SwitchSeq with an input file that is properly formatted:

The first two columns should specify the gene and transcript ids, respectively. Transcript quantifications for each of the evaluated conditions should follow afterwards.
The first row in the input file should contain the column names. Those will be used to produce the plots. Technical replicates are supported (see below).
Fields can be delimited by any number of spaces or tabs.
Ensure that the expression values are NOT in scientific notation nor logarithmic scale.

For example, for an experiment with 4 control and 4 knock-out (KO) samples, this would be a valid input file:

gId               tId              C1     C2     C3     C4     KO1    KO2    KO3    KO4
ENSG00000124193   ENST00000244020  21.07  27.39  24.32  21.80  11.83  13.71  3.13   22.42
ENSG00000124193   ENST00000483871  8.19   9.13   13.09  6.51   30.42  15.46  36.04  16.91
...

Once the input file is ready and the necessary files have been obtained (see the Downloading the annotation data files section), we can proceed with the execution as follows:

switchseq -t get_switch \
      --species hsa \           # species
      --ensembl_v 66 \          # Ensembl version used for transcript quantification
      --input ./expdata.txt \   # input file
      --cond1 3-6 \             # columns in input file that correspond to the first condition
      --cond2 7-10             # columns in input file that correspond to the second condition

This will generate the output directory ./html. For more information on how to interpret SwitchSeq output, please refer to the next section.

Dealing with technical replicates

Technical replicates can be included in the input table, where they should be labelled with identical column names. E.g:

gId               tId              C1     C1     C2     C3     C4     KO1    KO2    KO3    KO4     KO4
...

# corresponding condition intervals:
#    cond1 3-7
#    cond2 8-12

Under this scenario, only transcripts that have been consistently identified as major across technical replicates will be considered during the execution.

For tviz output, the median expression across technical replicates will be calculated before performing the standard analysis.

Filtering events

Gene expression

    --threshold_gexp | -g <float>
            Gene expression threshold. [default: 0.01]

Genes with an expression level below this threshold will not be considered in the analysis.

Major transcript dominance

    --threshold_dominance | -dom <float>
            Dominance threshold. [default: 1]

Major transcript dominance is calculated by dividing, for each gene, the abundance of the second transcript in the ranking by that of the major transcript. This value can be interpreted as follows:

Dominance = 1: the first and second most abundant transcripts of a given gene are expressed at the same level.
Dominance = 0.5: there is a 2-fold difference in the expression levels of the first and second most abundant transcripts of a given gene.
Dominance = 0.2: there is a 5-fold difference.
etc.

When filtering switch events based on this threshold, only those that involve an X-fold dominant transcript in each condition will be reported.

Major transcript expression breadth

    --threshold_breadth | -b <float>
            Expression breadth threshold. [default: 50]

Expression breadth refers to the percentage of samples in which a given transcript is detected as dominant relative to the number of samples in which the corresponding gene is expressed. Only switch events with an expression breadth bigger than the threshold in both conditions are reported. E.g.: an expression breadth of 50% will report switch events for which transcript A is detected as dominant in 50% of the samples in condition 1, and transcript B in 50% of the samples in condition 2.

Significant events

Tools like MMDIFF provide a significance threshold for each of the tested transcripts, and such list of significant events can be passed to SwitchSeq to filter transcripts accordingly with the --filt option:

switchseq -t get_switch \
      --species hsa \           
      --ensembl_v 66 \          
      --input ./expdata.txt \   
      --cond1 3-6 \             
      --cond2 7-10 \            
      --filt significant_dtu.txt    # list of events to filter

Where significant_dtu.txt contains the significant events in the following format:

ENSG00000153086 ENST00000356140
ENSG00000153086 ENST00000392928

Note: More information on the tool get_switch can be obtained with the --help option (switchseq -t get_switch --help) or in the Software usage page.

Interpreting SwitchSeq output

Output directory structure

SwitchSeq output is self-contained and is provided by default under the ./html directory, unless otherwise specified by the user. This directory has the following structure:

html/
|--index.html          # html output
|--css/                # css for output rendering
|--js/                 # js for output rendering
`--data/               # data files to be displayed from the html, including plots, 
   |--switch.txt       #    protein alignments and a txt version of the switch events table
   |--skipped.txt      #    events discarded because the gene/transcripts are not present in the annotation
   |--prot_aln/
   `--plots/           
      |--distrplots/
      `--starplots/

Output html sections

An example output report is provided here and has four sections, which can be interpreted as follows:

Execution information: Brief report with the options specified during SwitchSeq execution.
Filter: Summary table of the identified switch events with links to filtered versions of the output table.
Format: List of the columns that should be displayed in the output table.
Output table: Table with the all the identified switch events (default), which can be further filtered based on transcript biotype information (see Filter). In addition, each column can be sorted by clicking on its header, and multiple column sorting can be achieved by holding the Shift key.

Switch events reported in the output table have been annotated as follows:

General information (part 1)
- gId - Ensembl gene id
- gName - gene name
- nOfT - number of annotated transcripts in the Ensembl version used for execution
Condition specific information
- CX.tId - Ensembl id for the most abundant and recurrent transcript in condition 1 (i.e. major transcript in condition 1)
- CX.principal - is the major transcript in condition X classified as principal in APPRIS?
- CX.biotype - transcript biotype for the major transcript in condition X
- CX.tExp - in how many samples is the transcript detected as dominant in condition X? Only switch events that involve dominant major transcripts as defined by the corresponding dominance threshold are taken into account.
- CX.gExp - in how many samples is the gene expressed in condition X? Only genes expressed above the corresponding threshold are considered here.
- CX.breadth - major transcript expression breadth in condition X (i.e. percentage of samples in which the transcript is detected as dominant relative to the number of samples in which the gene is expressed: CX.tExp / CX.gExp * 100). Only switch events with an expression breadth bigger than the one indicated by the corresponding threshold in both conditions are reported.
General information (part 2)
- pIdentity - if the two transcripts involved in the switch are protein coding, what is the percentage of identity between the two coding sequences?
- pdbEntry - if the two transcripts involved in the switch are protein coding, is there any PDB entry available?
- distrplot - link to distrplot
- starplot - link to starplot
- rank - arbitrary value that enables sorting the switch events regarding the number of replicates involved in the switch. Obtained from CX.tExp and CX.gExp.

Plots

SwitchSeq will produce two different plots for each switch event:

A distrplot, which contains information on the distribution of gene expression levels, major transcript dominance and transcript relative abundances in each condition.
A starplot, which provides information on the transcript quantifications in a sample specific manner.

An example distrplot

This plot contains three panels:

left - Distribution of gene expression levels.
middle - Distribution of major transcript dominance. Major transcript dominance is calculated in each condition as the ratio of expression of the second/first most abundant transcripts. Thus, a ratio close to 0 will indicate that the most abundant transcript is dominating the expression for that gene, while a ratio close to 1 indicates that the first and second most abundant transcripts are expressed at the same level.
right - Relative abundances for all the transcripts annotated within the gene. Considering that gene expression is the result of adding up transcript expression levels, relative abundance can be understood as the fraction of gene expression explained by each transcript.

Depending on the number of samples, the information in each panel will be displayed as boxplots (n >= 10 in each condition) or as dots + error bars (which correspond to the mean + SD across samples, respectively).

An example starplot

In this plot each sample is represented as a pie chart, where each of the slices corresponds to a transcript. The size of each slice is proportional to the transcript expression level, and the overall size of the plot is proportional to the gene expression level, thus allowing comparisons across samples.

An example interpretation of a switch event

Let us focus on the switch event identified for the gene ACMSD in the example output report. From the data provided in the table we can interpret the following:

The expression estimates obtained from the RNA-seq data indicate that there is one protein coding transcript in condition 1 and a different one in condition 2.
The first protein coding transcript is widely detected as the most abundant one within the gene in condition 1. Similarly, the second transcript is detected as the most abundant one in 3 out of 4 samples for condition 2.
From the distrplot we learn the following:
- The gene is expressed at a lower level in condition 2.
- Major transcripts do not seem to be expressed in a dominant fashion in both conditions. In addition, major transcript dominance is more variable in condition 2.
- We can further visualise the relative abundances of each of the annotated transcripts in the third panel.
The starplot further confirms our previous observations by displaying transcript expression levels in a sample specific manner.
The first protein coding transcript is identified as a principal one by APPRIS, while the second one is not.
From the alignment of the two coding sequences we learn that 75% of the sequence is shared. Further inspection of the alignment output indicates that the differences are located at the N-terminal end of the protein (i.e. the protein that corresponds to the transcript identified in condition 2 is shorter). In addition, there is a PDB structure associated to that gene, which covers all of its coding sequence (PDB entry 2WM1). By loading this structure into a visualisation software and combining this two sources of information, we could further infer any impact of the detected changes in splicing at the protein structure level. Alternatively, we can automate this step by executing MAÌSTAS and using the FASTA sequences provided in the alignment report as input. The output of this MAÌSTAS execution suggests that the protein product expected in condition 2 is unlikely to fold properly.

In conclusion, thanks to the combination of different data sources, we now know (1) that this gene undergoes changes in splicing across the two investigated conditions, (2) that this changes result in two different major transcripts being expressed and (3) that even though both major transcripts are coding, the one detected in condition 2 might lead to a misfolded protein.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial

Table of contents

Executing SwitchSeq

1. Downloading the annotation data files

Option 1: `switchseq -t get_data` (faster)

List of supported species and Ensembl versions

Option 2: `resources/get_data/get_data.sh`

2. Finding, annotating and visualising splicing switch events: `switchseq -t get_switch`

Dealing with technical replicates

Filtering events

Gene expression

Major transcript dominance

Major transcript expression breadth

Significant events

Interpreting SwitchSeq output

Output directory structure

Output html sections

Plots

An example distrplot

An example starplot

An example interpretation of a switch event

Clone this wiki locally

Tutorial

Table of contents

Executing SwitchSeq

1. Downloading the annotation data files

Option 1: switchseq -t get_data (faster)

List of supported species and Ensembl versions

Option 2: resources/get_data/get_data.sh

2. Finding, annotating and visualising splicing switch events: switchseq -t get_switch

Dealing with technical replicates

Filtering events

Gene expression

Major transcript dominance

Major transcript expression breadth

Significant events

Interpreting SwitchSeq output

Output directory structure

Output html sections

Plots

An example distrplot

An example starplot

An example interpretation of a switch event

Clone this wiki locally

Option 1: `switchseq -t get_data` (faster)

Option 2: `resources/get_data/get_data.sh`

2. Finding, annotating and visualising splicing switch events: `switchseq -t get_switch`