RNA Sequence Analysis for Non-Model Species (Red Spruce)
This repository is a usable, publicly available tutorial for analyzing differential expression data and creating topological gene networks. All steps have been provided for the UConn CBC Xanadu cluster with appropriate headers for the Slurm scheduler, and the scripts can be easily modified and run. Commands should never be executed on the submit nodes of any HPC machine. If working on the Xanadu cluster, use sbatch scriptname after modifying the script for each stage. Basic editing of all scripts can be performed on the server with tools such as nano, vim, or emacs. If you are new to Linux, please use this handy guide for the operating system commands. In this guide, you will be working with common bioinformatic file formats, such as FASTA, FASTQ, SAM/BAM, and GFF3/GTF. You can learn even more about each file format here. If you do not have a Xanadu account and are an affiliate of UConn/UCHC, please apply for one here.
Contents
In this tutorial we will be analyzing RNA-Seq data from needle samples of the red spruce. This data is not published and therefore can only be accessed through the Xanadu directory "/UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Reads/Illumina". We will be using the red spruce as a "non-model" organism. You may be quite familiar with "model" organisms, such as Drosophila melanogaster, Caenorhabditis elegans, or Arabidopsis thaliana. What is it that makes an organism a "model"? A few things, but most critically: the ease with which the organism may be grown in a controlled setting, how quickly it matures and reproduces, the cost of growing and maintaining a population, and the ease with which its genome may be manipulated. From this we see why the organisms listed above are good models: nematodes, fruit flies, and weeds are small, so they do not take much space and may be grown in a laboratory; they reach maturity and reproduce at a very early age (in some instances at a week old!); they are quite robust, needing few nutrients to survive; and they can be genetically modified quite easily (whether through artificial selection, transposable elements, or chemical mutagenesis). From all of this we can see why the red spruce is not a great model!
When an organism is called "model" there is an underlying assumption that very generous amounts of research have been performed on the species, resulting in large pools of publicly available data. In biology and bioinformatics this means there are reference transcriptomes, structural annotations, known variant genotypes, and a wealth of other information useful in computational research. By contrast, when an organism is called "non-model" there is the underlying assumption that the information described above will have to be generated by the researcher. This means that after extracting genetic data from a non-model organism, the researcher will then have to assemble the transcriptome, annotate the transcriptome, identify any pertinent genetic relationships, and so on. We can use this to develop a small map of our goals for analyzing our red spruce RNA samples.
The data consists of 3 libraries under three different treatments:
- Ambient carbon dioxide (CO2) condition
- Elevated carbon dioxide (CO2) condition
- Co-treated under both conditions
In this workflow we have separated each step into its own folder, where you can find the appropriate scripts for that step. When you clone the git repository, the directory structure below will be created in your working directory.
RNA-Sequence-Analysis-Non-Model-Species-Red-Spruce
├── assembly
├── clustering
├── images
├── quality_control
└── raw_reads
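If you have not yet cloned the repository, you can do so as shown below (the URL is a placeholder; substitute the actual address of this repository):

git clone https://github.com/<user>/RNA-Sequence-Analysis-Non-Model-Species-Red-Spruce.git
cd RNA-Sequence-Analysis-Non-Model-Species-Red-Spruce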
In this tutorial we will be using the Slurm scheduler to submit jobs to the Xanadu cluster. Each script we use will contain a header section that requests resources from the Slurm scheduler. The header section will contain:
#!/bin/bash
#SBATCH --job-name=JOBNAME
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=1G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=ALL
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
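Once a script is ready, you submit it from the submit node and can check on it with standard Slurm commands, for example (sickle.sh is the quality-control script introduced later):

sbatch sickle.sh      # submit the job; Slurm prints the assigned job ID
squeue -u $USER       # list your pending and running jobs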
Before beginning, we need to understand a few aspects of the Xanadu server. When first logging into Xanadu from your local terminal, you will be connected to the submit node. The submit node is the interface through which users on Xanadu submit their processes to the desired compute nodes, which will run them. Never, under any circumstance, run processes directly on the submit node. Your process will be killed and all of your work lost! This tutorial will not teach you how to configure shell scripts to submit your tasks on Xanadu, so before moving on, read and master the topics covered in the Xanadu tutorial.
Note:
If you are working on the Xanadu cluster, the raw reads do not need to be copied into your directory. In the raw_reads folder we have created a script called initialize.sh which, when executed, will create symbolic links (symlinks) to the actual read files. This way we avoid making unnecessary copies of the read data.
Once you run the script, the raw_reads folder will look like:
raw_reads/
├── ambient.fastq
├── cotreated.fastq
├── elevated.fastq
└── initialize.sh
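For reference, here is a minimal sketch of what initialize.sh might contain, assuming the shared files carry the same names as the links above:

#!/bin/bash
# Link the shared reads into the current directory instead of copying them.
READ_DIR=/UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Reads/Illumina
ln -s ${READ_DIR}/ambient.fastq .
ln -s ${READ_DIR}/elevated.fastq .
ln -s ${READ_DIR}/cotreated.fastq .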
The reads we will be working with were sequenced on an Illumina platform. We assume that you are familiar with the sequencing technology. Let's have a look at one of our read files, which are in FASTQ format:
head -n 4 ambient.fastq
which will show the first four lines in the fastq file:
@HWI-ST318_0118:3:1:1185:2192#0/1
CCCTTCCTGAGATGTAAGTCCTGATCTTGANNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+HWI-ST318_0118:3:1:1185:2192#0/1
cbabc`cY``UVVZT^\\^^b\bbb`Z_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Here the first line is the read identifier, recording the instrument and flow cell coordinates of the read; the second line contains the nucleotide sequence; the third line starts with a "+" and repeats the information in the first line; and the fourth line gives the quality score for each nucleotide in the second line.
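As a quick sanity check you can decode a quality character by hand. The sickle command below uses -t illumina, which implies a Phred+64 encoding (score = ASCII code minus 64); a small shell snippet to decode the first quality character of the first record:

# Pull the first character of the quality line (line 4) and convert
# ASCII -> Phred, assuming the Phred+64 (Illumina 1.3-1.7) offset.
q=$(sed -n '4p' ambient.fastq | cut -c1)
echo "char=$q phred=$(( $(printf '%d' "'$q") - 64 ))"   # 'c' (ASCII 99) -> 35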
Step one is to perform quality control on the reads; we will use Sickle on the Illumina reads. To start with, we have single-end reads:
module load sickle/1.33
sickle se -f ../raw_reads/ambient.fastq -t illumina -o ambient.trimmed.fastq -q 30 -l 35
sickle se -f ../raw_reads/elevated.fastq -t illumina -o elevated.trimmed.fastq -q 30 -l 35
sickle se -f ../raw_reads/cotreated.fastq -t illumina -o cotreated.trimmed.fastq -q 30 -l 35
Usage information for the sickle program:
Usage: sickle se [options] -f <fastq sequence file> -t <quality type> -o <trimmed fastq file>
Options:
-f, --fastq-file, Input fastq file (required)
-t, --qual-type, Type of quality values
solexa (CASAVA < 1.3)
illumina (CASAVA 1.3 to 1.7)
sanger (which is CASAVA >= 1.8)
-o, --output-file, Output trimmed fastq file (required)
-q, --qual-threshold, trimming based on average quality in a window. Default 20.
-l, --length-threshold, Threshold to keep a read based on length after trimming. Default 20.
The quality score may be any value from 0 to 40. The default threshold of 20 is much too low for a robust analysis; we want to keep only reads with a quality of 30 or better. Additionally, we require a minimum length of 35 bp per read after trimming; again, the default of 20 is too short for analysis confidence. Lastly, we must know the quality encoding. The encoding is often not documented with the reads; most recent reads use the "sanger" quality type. Unless explicitly stated otherwise, try running sickle with the sanger quality type. If an error is returned, try illumina. If another error is returned, lastly try solexa.
The full Slurm script, called sickle.sh, is stored in the quality_control folder.
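For orientation, here is one plausible way sickle.sh puts the pieces together, reusing the generic header shown earlier (the script in the repository is authoritative; resource requests may differ):

#!/bin/bash
#SBATCH --job-name=sickle
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=1G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

module load sickle/1.33
sickle se -f ../raw_reads/ambient.fastq -t illumina -o ambient.trimmed.fastq -q 30 -l 35
sickle se -f ../raw_reads/elevated.fastq -t illumina -o elevated.trimmed.fastq -q 30 -l 35
sickle se -f ../raw_reads/cotreated.fastq -t illumina -o cotreated.trimmed.fastq -q 30 -l 35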
At the end of the job, the following trimmed files will have been produced:
quality_control/
├── ambient.trimmed.fastq
├── cotreated.trimmed.fastq
├── elevated.trimmed.fastq
├── sickle_352.err
├── sickle_352.out
└── sickle.sh
A summary of the reads will be in the *.out file, which reports how many reads were kept and how many were discarded in each run.
SE input file: ../raw_reads/ambient.fastq
Total FastQ records: 103531742
FastQ records kept: 83066649
FastQ records discarded: 20465093
SE input file: ../raw_reads/elevated.fastq
Total FastQ records: 97712226
FastQ records kept: 81595012
FastQ records discarded: 16117214
SE input file: ../raw_reads/cotreated.fastq
Total FastQ records: 101490246
FastQ records kept: 81173217
FastQ records discarded: 20317029
Sample | Initial Records | Records Kept | Records Discarded | Kept (%) |
---|---|---|---|---|
ambient | 103531742 | 83066649 | 20465093 | 80.23 |
elevated | 97712226 | 81595012 | 16117214 | 83.50 |
cotreated | 101490246 | 81173217 | 20317029 | 79.98 |
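The kept percentages above can be recomputed straight from the summary file, for example with awk (assuming the output file name shown in the listing):

# For each sample, print reads kept as a percentage of the total records.
awk '/SE input file/          {f=$NF}
     /FastQ records kept/     {k=$NF}
     /FastQ records discarded/{printf "%s: %.2f%% kept\n", f, 100*k/(k+$NF)}' sickle_352.out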
Now that we've performed quality control we are ready to assemble our transcriptome using the RNA-Seq reads. We will be using the software Trinity. Nearly all transcriptome assembly software operates under the same premise. Consider the following:
Suppose we have the following reads:
A C G A C G T T T G A G A
T T G A G A T T A C C T A G
We notice that the end of the first read matches the beginning of the second, so we can assemble them into one sequence by aligning the overlap:
A C G A C G T T T G A G A
              T T G A G A T T A C C T A G
Which gives us:
A C G A C G T [T T G A G A] T T A C C T A G
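To make the overlap idea concrete, here is a toy shell illustration that merges the two reads above by finding their longest overlap. Real assemblers such as Trinity build de Bruijn graphs from millions of reads rather than comparing reads pairwise, so this is purely didactic:

r1=ACGACGTTTGAGA
r2=TTGAGATTACCTAG
# Find the longest suffix of r1 that is also a prefix of r2, then merge.
for (( k=${#r1}; k>0; k-- )); do
    if [[ ${r1: -k} == "${r2:0:k}" ]]; then
        echo "${r1}${r2:k}"   # prints ACGACGTTTGAGATTACCTAG
        break
    fi
done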
In the de novo assembly section, we will be working in the assembly directory. Here we will assemble the trimmed Illumina reads separately using the Trinity transcriptome assembler. Assembly requires a great deal of memory (RAM) and can take a few days if the read set is large. The following are the Trinity commands we use to assemble each transcriptome separately:
module load trinity/2.6.6
Trinity --seqType fq \
--max_memory 100G \
--single ../quality_control/ambient.trimmed.fastq \
--min_contig_length 300 \
--CPU 16 \
--normalize_reads \
--output trinity_ambient \
--full_cleanup
Trinity --seqType fq \
--max_memory 100G \
--single ../quality_control/cotreated.trimmed.fastq \
--min_contig_length 300 \
--CPU 16 \
--normalize_reads \
--output trinity_cotreated \
--full_cleanup
Trinity --seqType fq \
--max_memory 100G \
--single ../quality_control/elevated.trimmed.fastq \
--min_contig_length 300 \
--CPU 16 \
--normalize_reads \
--output trinity_elevated \
--full_cleanup
Usage information for the Trinity options we use:
Usage: Trinity [options]
Options (Required):
--seqType <string> : type of reads: ('fa' or 'fq')
--max_memory <string> : max memory to use by Trinity
--single <string> : unpaired/single reads, one or more file names can be included
Options (optional)
--CPU <int> : number of CPUs to use, default: 2
--min_contig_length <int>: minimum assembled contig length to report (def=200)
--output <string> : directory for output
--full_cleanup : only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
The full Slurm script is called trinity.sh and can be found in the assembly directory.
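Note that the Slurm header of trinity.sh must request resources matching the Trinity options above; a sketch of what such a header might look like, given --CPU 16 and --max_memory 100G (the values in the repository script may differ):

#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH -n 1
#SBATCH -N 1
# -c should match Trinity's --CPU; --mem should sit a little above --max_memory.
#SBATCH -c 16
#SBATCH --mem=120G
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err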
Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:
- Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
- Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity of a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
- Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms and teasing apart transcripts that correspond to paralogous genes.
During the Trinity run, many checkpoint files will be generated. These files allow the run to restart from a specific point if the program stops for some reason. Once the program ends successfully, all checkpoint files are removed because we requested a full cleanup with the --full_cleanup option. Clearing these files is important, as it removes unwanted intermediates and keeps both the storage footprint and the number of files to a minimum. At the end of a successful run we will end up with the following files:
assembly/
├── trinity_ambient.Trinity.fasta
├── trinity_ambient.Trinity.fasta.gene_trans_map
├── trinity_cotreated.Trinity.fasta
├── trinity_cotreated.Trinity.fasta.gene_trans_map
├── trinity_elevated.Trinity.fasta
├── trinity_elevated.Trinity.fasta.gene_trans_map
└── trinity.sh
So we will have three assembly files, one for each condition.
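A quick way to compare the three assemblies is to count the transcripts in each file (every FASTA record begins with a ">" header line):

grep -c '^>' trinity_ambient.Trinity.fasta trinity_cotreated.Trinity.fasta trinity_elevated.Trinity.fasta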
Because our transcriptome was assembled from RNA-seq reads, chances are that slightly varying copies of the same reads produced multiple, nearly identical assembled sequences. Under this assumption, many of the transcripts in our assembled transcriptome are effectively repeats: the results of assembling slightly different reads from the same gene. We want to remove these redundant transcripts to shorten our transcriptome and make future work more efficient. We can do this by partitioning and clustering the transcriptome, then keeping only one representative sequence from each cluster. There is a very convenient piece of software that performs all of this in exactly the way just described: vsearch.
To obtain a set of unique genes across all runs, we will cluster the three resulting assemblies together. First, the assemblies are combined into one file using the Unix command cat (short for concatenate).
cat ../assembly/trinity_ambient.Trinity.fasta \
../assembly/trinity_cotreated.Trinity.fasta \
../assembly/trinity_elevated.Trinity.fasta > combine.fasta
Once the files are combined, we will use vsearch to find redundancy between the assembled transcripts and create a single output known as a centroids file. The threshold for clustering in this example is set to 80% identity.
module load vsearch/2.4.3
vsearch --threads 8 --log LOGFile \
--cluster_fast combine.fasta \
--id 0.80 \
--centroids centroids.fasta \
--uc clusters.uc
Command options used in the vsearch program:
Usage: vsearch [OPTIONS]
--threads INT number of threads to use, zero for all cores (0)
--log FILENAME write messages, timing and memory info to file
--cluster_fast FILENAME cluster sequences after sorting by length
--id REAL reject if identity lower, accepted values: 0-1.0
--centroids FILENAME output centroid sequences to FASTA file
--uc FILENAME specify filename for UCLUST-like output
The full script is called clustering.sh, which can be found in the clustering folder. At the end of the run it will produce the following files:
clustering/
├── centroids.fasta
├── clusters.uc
├── combine.fasta
└── LOGFile
The centroids.fasta file will contain the unique genes from the three assemblies.
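To see how much redundancy clustering removed, compare the number of sequences before and after (grep prints one filename:count pair per file):

grep -c '^>' combine.fasta centroids.fasta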