-
Notifications
You must be signed in to change notification settings - Fork 4
Tutorial
SwitchSeq needs a few data files to be able to collect information on the transcripts and genes involved in the switch events and to display this information properly in the html
output. Here you will find how to obtain this information in the proper format.
Note: The version of the annotation used with SwitchSeq should coincide with that used to estimate transcript expression levels. Any switch events that affect either non-coding genes or transcripts that are not listed in the annotation provided will be listed in the file skipped.txt
(see Output directory structure).
The tool get_data
offers an easy way to retrieve all the necessary files for the execution, since it can be executed within SwitchSeq. However, it will just work for a few selected species and Ensembl versions (see Supported species and Ensembl versions); which mostly coincide with the ones included in the latest APPRIS release (v3.2.2 - Dec 2013). If the species and/or Ensembl version that you are using are not included in that list, please refer to the next section (Option 2).
Let us imagine that we are working with a human dataset and have quantified transcript expression levels by using the models provided in Ensembl 66. We can proceed to obtain the necessary files with the following command:
switchseq -t get_data --species hsa --ensembl_v 66
This will automatically retrieve information from Ensembl and APPRIS for switch event annotation.
Note: More information on the tool get_data
can be obtained with the --help
option (switchseq -t get_data --help
) or in the Software usage page.
- Human - hsa - Ensembl v74
- Human - hsa - Ensembl v73
- Human - hsa - Ensembl v66
- Mouse - mmu - Ensembl v75
- Mouse - mmu - Ensembl v74
- Mouse - mmu - Ensembl v72
- Rat - rno - Ensembl v70
- Zebrafish - dre - Ensembl v70
Alternatively, if your species and/or Ensembl version are not included in the list above, you should proceed by executing the script resources/get_data/get_data.sh
, which will create a similar output to the one provided by the tool switchseq -t get_data
. Here an example on how to call this script (see as well the script resources/get_data/EXECUTE.sh
):
species=hsa
ensembl_v=73
outdir=/homes/mar/public_html/tools/switchseq/data_for_download
biomart_path=http://sep2013.archive.ensembl.org/biomart/martservice?
# url to the correct archive version
# should be in the following format:
# http://release_date.archive.ensembl.org/biomart/martservice?
# further info: http://www.ensembl.org/Help/ArchiveList
ensembl_api_path=/homes/mar/system/ensembl
# path to the Ensembl Core API
# see https://github.com/Ensembl/ensembl
bioperl=/homes/mar/system/bioperl-live
# path to bioperl
./get_data.sh $species $ensembl_v $outdir $biomart_path $ensembl_api_path $bioperl
In order to identify switch events we need to provide a table with normalised counts. Such quantifications can be obtained using any normalisation method (e.g. RPKMs/FPKMs, TPMs, DESeq...), but the use of those that properly estimate size factors across samples is encouraged. The input table can contain any number of genes, but a useful approach is to focus on the genes for which we have already predicted differences in splicing through other tools (e.g. DEXSeq, MMDIFF). In this way, we can obtain a subset of that list that contains the most extreme changes in splicing across the two studied conditions, i.e. switch events.
It is important to provide SwitchSeq with an input file that is properly formatted:
- The first two columns should specify the gene and transcript ids, respectively. Transcript quantifications for each of the evaluated conditions should follow afterwards.
- The first row in the input file should contain the column names. Those will be used to produce the plots. Technical replicates are supported (see below).
- Fields can be delimited by any number of spaces or tabs.
- Ensure that the expression values are NOT in scientific notation nor logarithmic scale.
For example, for an experiment with 4 control and 4 knock-out (KO) samples, this would be a valid input file:
gId tId C1 C2 C3 C4 KO1 KO2 KO3 KO4
ENSG00000124193 ENST00000244020 21.07 27.39 24.32 21.80 11.83 13.71 3.13 22.42
ENSG00000124193 ENST00000483871 8.19 9.13 13.09 6.51 30.42 15.46 36.04 16.91
...
Once the input file is ready and the necessary files have been obtained (see the Downloading the annotation data files section), we can proceed with the execution as follows:
switchseq -t get_switch \
--species hsa \ # species
--ensembl_v 66 \ # Ensembl version used for transcript quantification
--input ./expdata.txt \ # input file
--cond1 3-6 \ # columns in input file that correspond to the first condition
--cond2 7-10 # columns in input file that correspond to the second condition
This will generate the output directory ./html
. For more information on how to interpret SwitchSeq output, please refer to the next section.
Technical replicates can be included in the input table, where they should be labelled with identical column names. E.g:
gId tId C1 C1 C2 C3 C4 KO1 KO2 KO3 KO4 KO4
...
# corresponding condition intervals:
# cond1 3-7
# cond2 8-12
Under this scenario, only transcripts that have been consistently identified as major across technical replicates will be considered during the execution.
For tviz output, the median expression across technical replicates will be calculated before performing the standard analysis.
--threshold_gexp | -g <float>
Gene expression threshold. [default: 0.01]
Genes with an expression level below this threshold will not be considered in the analysis.
--threshold_dominance | -dom <float>
Dominance threshold. [default: 1]
Major transcript dominance is calculated by dividing, for each gene, the abundance of the second transcript in the ranking by that of the major transcript. This value can be interpreted as follows:
- Dominance = 1: the first and second most abundant transcripts of a given gene are expressed at the same level.
- Dominance = 0.5: there is a 2-fold difference in the expression levels of the first and second most abundant transcripts of a given gene.
- Dominance = 0.2: there is a 5-fold difference.
- etc.
When filtering switch events based on this threshold, only those that involve an X-fold dominant transcript in each condition will be reported.
--threshold_breadth | -b <float>
Expression breadth threshold. [default: 50]
Expression breadth refers to the percentage of samples in which a given transcript is detected as dominant relative to the number of samples in which the corresponding gene is expressed. Only switch events with an expression breadth bigger than the threshold in both conditions are reported. E.g.: an expression breadth of 50% will report switch events for which transcript A is detected as dominant in 50% of the samples in condition 1, and transcript B in 50% of the samples in condition 2.
Tools like MMDIFF provide a significance threshold for each of the tested transcripts, and such list of significant events can be passed to SwitchSeq to filter transcripts accordingly with the --filt
option:
switchseq -t get_switch \
--species hsa \
--ensembl_v 66 \
--input ./expdata.txt \
--cond1 3-6 \
--cond2 7-10 \
--filt significant_dtu.txt # list of events to filter
Where significant_dtu.txt
contains the significant events in the following format:
ENSG00000153086 ENST00000356140
ENSG00000153086 ENST00000392928
Note: More information on the tool get_switch
can be obtained with the --help
option (switchseq -t get_switch --help
) or in the Software usage page.
SwitchSeq output is self-contained and is provided by default under the ./html
directory, unless otherwise specified by the user. This directory has the following structure:
html/
|--index.html # html output
|--css/ # css for output rendering
|--js/ # js for output rendering
`--data/ # data files to be displayed from the html, including plots,
|--switch.txt # protein alignments and a txt version of the switch events table
|--skipped.txt # events discarded because the gene/transcripts are not present in the annotation
|--prot_aln/
`--plots/
|--distrplots/
`--starplots/
An example output report is provided here and has four sections, which can be interpreted as follows:
- Execution information: Brief report with the options specified during SwitchSeq execution.
- Filter: Summary table of the identified switch events with links to filtered versions of the output table.
- Format: List of the columns that should be displayed in the output table.
-
Output table: Table with the all the identified switch events (default), which can be further filtered based on transcript biotype information (see Filter). In addition, each column can be sorted by clicking on its header, and multiple column sorting can be achieved by holding the
Shift
key.
Switch events reported in the output table have been annotated as follows:
-
General information (part 1)
- gId - Ensembl gene id
- gName - gene name
- nOfT - number of annotated transcripts in the Ensembl version used for execution
-
Condition specific information
- CX.tId - Ensembl id for the most abundant and recurrent transcript in condition 1 (i.e. major transcript in condition 1)
- CX.principal - is the major transcript in condition X classified as principal in APPRIS?
- CX.biotype - transcript biotype for the major transcript in condition X
- CX.tExp - in how many samples is the transcript detected as dominant in condition X? Only switch events that involve dominant major transcripts as defined by the corresponding dominance threshold are taken into account.
- CX.gExp - in how many samples is the gene expressed in condition X? Only genes expressed above the corresponding threshold are considered here.
- CX.breadth - major transcript expression breadth in condition X (i.e. percentage of samples in which the transcript is detected as dominant relative to the number of samples in which the gene is expressed: CX.tExp / CX.gExp * 100). Only switch events with an expression breadth bigger than the one indicated by the corresponding threshold in both conditions are reported.
-
General information (part 2)
- pIdentity - if the two transcripts involved in the switch are protein coding, what is the percentage of identity between the two coding sequences?
- pdbEntry - if the two transcripts involved in the switch are protein coding, is there any PDB entry available?
- distrplot - link to distrplot
- starplot - link to starplot
- rank - arbitrary value that enables sorting the switch events regarding the number of replicates involved in the switch. Obtained from CX.tExp and CX.gExp.
SwitchSeq will produce two different plots for each switch event:
- A
distrplot
, which contains information on the distribution of gene expression levels, major transcript dominance and transcript relative abundances in each condition. - A
starplot
, which provides information on the transcript quantifications in a sample specific manner.
This plot contains three panels:
- left - Distribution of gene expression levels.
- middle - Distribution of major transcript dominance. Major transcript dominance is calculated in each condition as the ratio of expression of the second/first most abundant transcripts. Thus, a ratio close to 0 will indicate that the most abundant transcript is dominating the expression for that gene, while a ratio close to 1 indicates that the first and second most abundant transcripts are expressed at the same level.
- right - Relative abundances for all the transcripts annotated within the gene. Considering that gene expression is the result of adding up transcript expression levels, relative abundance can be understood as the fraction of gene expression explained by each transcript.
Depending on the number of samples, the information in each panel will be displayed as boxplots (n >= 10 in each condition) or as dots + error bars (which correspond to the mean + SD across samples, respectively).
In this plot each sample is represented as a pie chart, where each of the slices corresponds to a transcript. The size of each slice is proportional to the transcript expression level, and the overall size of the plot is proportional to the gene expression level, thus allowing comparisons across samples.
Let us focus on the switch event identified for the gene ACMSD in the example output report. From the data provided in the table we can interpret the following:
- The expression estimates obtained from the RNA-seq data indicate that there is one protein coding transcript in condition 1 and a different one in condition 2.
- The first protein coding transcript is widely detected as the most abundant one within the gene in condition 1. Similarly, the second transcript is detected as the most abundant one in 3 out of 4 samples for condition 2.
- From the distrplot we learn the following:
- The gene is expressed at a lower level in condition 2.
- Major transcripts do not seem to be expressed in a dominant fashion in both conditions. In addition, major transcript dominance is more variable in condition 2.
- We can further visualise the relative abundances of each of the annotated transcripts in the third panel.
- The starplot further confirms our previous observations by displaying transcript expression levels in a sample specific manner.
- The first protein coding transcript is identified as a principal one by APPRIS, while the second one is not.
- From the alignment of the two coding sequences we learn that 75% of the sequence is shared. Further inspection of the alignment output indicates that the differences are located at the N-terminal end of the protein (i.e. the protein that corresponds to the transcript identified in condition 2 is shorter). In addition, there is a PDB structure associated to that gene, which covers all of its coding sequence (PDB entry 2WM1). By loading this structure into a visualisation software and combining this two sources of information, we could further infer any impact of the detected changes in splicing at the protein structure level. Alternatively, we can automate this step by executing MAÌSTAS and using the FASTA sequences provided in the alignment report as input. The output of this MAÌSTAS execution suggests that the protein product expected in condition 2 is unlikely to fold properly.
In conclusion, thanks to the combination of different data sources, we now know (1) that this gene undergoes changes in splicing across the two investigated conditions, (2) that this changes result in two different major transcripts being expressed and (3) that even though both major transcripts are coding, the one detected in condition 2 might lead to a misfolded protein.