## circRNA detection from RNA-seq reads ## This repository holds python code to detect head-to-tail spliced (back-spliced) sequencing reads, indicative of circular RNA (circRNA) in RNA-seq data. It is also used extensively by [circbase](http://circbase.org). ### Author notes and preface ### The scripts in this package were designed by Nikolaus Rajewsky, Antigoni Elefsinioti and Marvin Jens. The current implementation was written by Marvin Jens. This software is provided to you "as is", without any warranty. We used this code (v1) to produce the results presented in ``` Nature. 2013 Mar 21;495(7441):333-8. doi: 10.1038/nature11928. Epub 2013 Feb 27. Circular RNAs are a large class of animal RNAs with regulatory potency. Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, Maier L, Mackowiak SD, Gregersen LH, Munschauer M, Loewer A, Ziebold U, Landthaler M, Kocks C, le Noble F, Rajewsky N. ``` Have a look at [Memczak *et al.* 2013](http://www.ncbi.nlm.nih.gov/pubmed/23446348) and the supplementary material for additional information. Please don't contact the authors, who are very busy, regarding this software unless you have found a bug, wish to provide improvements to the code, propose a collaboration, or use our code for commercial purposes. Thank you very much for your understanding! ### Brief version history ### The current code (v1.2) has some small bug fixes and improvements. Each version deemed stable is tagged and a brief summary of the improvements is given here: - **v1** : as used in Memczak et al. 2013 - **v1.2** : - by default no longer report junctions with only one uniquely aligned anchor. Original behaviour can be restored using the `--halfuniq` switch. - fix the uniqueness handling. Occasionally reads would have either anchor align uniquely, but never both. These rare cases now get flagged as "NO_UNIQ_BRIDGES". - support all chromosomes in one fasta file - store the size of each chromosome upon first access, pad violations of chromosome bounds by padding with 'N' - category labels have been extended for clarity ("UNAMBIGUOUS_BP" instead of "UNAMBIGUOUS", etc.), in a manner which preserves functionality of `grep` commands. - **v2** : (not released yet): produce multiple anchor lengths to potentially yield more junctions with unique anchor mappings. ### License ### For non-commercial purposes the code is released under the General Public License (version 3). See the file LICENSE for more detail. If you are interested in commercial use of this software contact the authors. ### Prerequisites ### The scripts run on Ubuntu 12.04.2 on a 64Bit machine with python 2.7. We do not know if it runs in different environments but other 64Bit unix versions should run if you can get the required 3rd party software installed. You need to install the python packages numpy and pysam. If there are no packages in your linux distro's repositories, try the very useful python installer (building numpy requires many dependencies, so obtaining pre- compiled packages from a repository is a better option). ``` pip install --user pysam ``` Next you need the short read mapper bowtie2 and samtools up and running. samtools now is in the repositories of many distros, but here you can get the most fresh versions: ``` http://samtools.sourceforge.net/ http://bowtie-bio.sourceforge.net/bowtie2/index.shtml ``` At this point you should have everything to run a built-in test data set ``` cd test_data ./run_test.sh ``` If you get error messages here, first make sure the dependencies are really installed correctly and run on their own. The authors of this code can not give support bowtie2, samtools, python, or any other third-party packages! Sorry, but not enough time for this. If you are sure that the problem is with our code, just zip the test_data folder and e-mail it to us. MAYBE, we can help. If everything goes well you can get started with your real data! :) You need to have the reference genome and a bowtie2 index for it. As an example, let's assume you use C.elegans genome ce6 (WS190): ``` wget -c http://hgdownload.cse.ucsc.edu/goldenPath/ce6/bigZips/chromFa.tar.gz -P tmp mkdir genome cd genome; tar -xzf ../tmp/chromFa.tar.gz; cd .. cat genome/chr*.fa > genome/ce6.fa ``` This will retrieve the genome from the UCSC website, unpack it into a folder 'genome' and create one big fasta file with all chromosomes to build the index: ``` bowtie2-build genome/ce6.fa bt2_ce6 ``` ### How to use the unmapped2anchors.py script ### It is recommended to map your RNA-seq reads against the genome first and keep the part that can not be mapped contiguously to look for splice- junctions afterwards. The genome alignments can be used for gene expression analysis and the unmapped reads will represent a fraction of the input, thus downstream analysis will be faster. (Note that --phred64 is only for Illumina quality scores. Sanger scores are --phred33, but that's the default anyway, so it could be omitted) ``` bowtie2 -p16 --very-sensitive --phred64 --mm -M20 --score-min=C,-15,0 \ -x bt2_ce6 -q -U <your_reads.qfa.gz> 2> bowtie2.log \ | samtools view -hbuS - | samtools sort - test_vs_ce6 ``` put the unaligned reads aside and split good quality reads into anchors for independent mapping (used to identify splice junctions) ``` samtools view -hf 4 test_vs_ce6.bam | samtools view -Sb - > unmapped_ce6.bam ./unmapped2anchors.py unmapped_ce6.bam | gzip > ce6_anchors.qfa.gz ``` ### How to use find_circ.py ### Now we have everything to screen for spliced reads, from either linear or head-to-tail (circular) splicing: ``` mkdir <run_folder> bowtie2 -p 16 --reorder --mm -M20 --score-min=C,-15,0 -q -x bt2_ce6 \ -U ce6_anchors.qfa.gz | ./find_circ.py -G genome -p ce6_test_ -s <run_folder>/sites.log > <run_folder>/sites.bed 2> <run_folder>/sites.reads ``` The important ingredients here are a folder in which each chromosome can be found as a separate fasta file (genome/chrI.fa ....), and a prefix for naming the splice junctions ("ce6test"). The rest is about output -s ce6_test/sites.log will receive statistics of the run, stdout reports the encountered junctions, and stderr the reads supporting each junction. Note: the read sequence can also be reported as the reverse complement of the raw reads sequence. If you analyse reads from multiple sources (tissues, samples) in one run, you can ask find_circ.py to divvy up the contributions from each sample. For this, provide a two-column tab-separated flat file with a read-name prefix to match for and an associated tissue. Note that the matching is performed from top to bottom. so you want to match for "read_from_exp_1" BEFORE "read_from". See samples_example.txt to get an idea. The correspon- ding switch is find_circ.py -r samples_examples.txt ### How to filter the output ### The first 6 columns of 'sites.bed' are standard BED. The rest hold various quality metrics about the junction. (the 'score' field holds the number of distinct reads supporting the junction). Check the header line for a quick overview. It is usually a good idea to demand at least 2 reads supporting the junction, unambiguous breakpoint detection and some sane mapping quality: To get a reasonable set of circRNA candidates try: ``` grep circ <run_folder>/sites.bed | grep -v chrM | ./sum.py -2,3 | \ ./scorethresh.py -16 1 | ./scorethresh.py -15 2 | \ ./scorethresh.py -14 2 | ./scorethresh.py 7 2 | \ ./scorethresh.py 8,9 35 | ./scorethresh.py -17 100000 \ > <run_folder>/circ_candidates.bed ```