This repository is a usable, publicly available tutorial for analyzing differential expression data and creating topological gene networks. All steps have been provided for the UConn CBC Xanadu cluster with appropriate headers for the Slurm scheduler; each script can be modified and run with minimal changes. Commands should never be executed on the submit nodes of any HPC machine. If working on the Xanadu cluster, you should use sbatch scriptname after modifying the script for each stage. Basic editing of all scripts can be performed on the server with tools such as nano, vim, or emacs. If you are new to Linux, please use this handy guide for the operating system commands. In this guide, you will be working with common bioinformatic file formats, such as FASTA, FASTQ, SAM/BAM, and GFF3/GTF. You can learn even more about each file format here. If you do not have a Xanadu account and are an affiliate of UConn/UCHC, please apply for one here.
Contents
In this tutorial, we will be analyzing thale cress (Arabidopsis thaliana) RNA-Seq data from various parts of the plant (roots, shoots). Perhaps one of the most common organisms for genetic study, the aggregate wealth of genetic information on the thale cress makes it ideal for newcomers to learn. Organisms such as this are called "model organisms". You may think of model organisms as a subset of living things which, under the normal conventions of analysis, behave nicely. The data we will be analyzing come from an experiment in which cellular RNA was collected from the roots and shoots of a single thale cress. The RNA profiles are archived in the SRA, and meta-information on each may be viewed through the SRA IDs: SRR3498212, SRR3498213, SRR3498215, SRR3498216.
The Sequence Read Archive, or SRA, is a publicly available database containing read sequences from a variety of experiments. Scientists who would like their read sequences present on the SRA submit a report containing the read sequences, experimental details, and any other accessory metadata.
Our data, SRR3498212, SRR3498213, SRR3498215, SRR3498216 come from root 1, root 2, shoot 1, and shoot 2 of a single thale cress, respectively. Our objective in this analysis is to determine which genes are expressed in all samples, quantify the expression of each common gene in each sample, identify genes which are lowly expressed in roots 1 and 2 but highly expressed in shoots 1 and 2, or vice versa, quantify the relative expression of such genes, and lastly to create a visual topological network of genes with similar expression profiles.
You may connect to Xanadu via SSH, which will place you in your home directory
-bash-4.2$ cd /home/CAM/$USER
Your home directory provides 10TB of storage, so your work will not encroach on the space available to other users on the cluster.
The workflow may be cloned into the appropriate directory using the terminal command:
-bash-4.2$ git clone https://github.com/wolf-adam-eily/rnaseq_for_model_plant.git
-bash-4.2$ cd rnaseq_for_model_plant/
-bash-4.2$ ls
all_clusters.csv.png            README.md
complete_edge_list.csv.png      Rplot.png
cytoscape1.png                  sam_sort_bam.sh
cytoscape2.png                  sickle_run.sh
cytoscape3.png                  transcript_assembly.sh
cytoscape4.png                  trimmed.html
cytoscape5.png                  trimmed_SRR3498212_fastqc.html
data_dump.sh                    trimmed_SRR3498213_fastqc.html
hisat2_run.sh                   trimmed_SRR3498215_fastqc.html
pcaplot_for_all_libraries.png   trimmed_SRR3498216_fastqc.html
quality_control.sh
All of the completed scripts for this tutorial are available for you to submit. However, before submitting, you may want to edit the scripts to include your email!
Before beginning, we need to understand a few aspects of the Xanadu server. When first logging into Xanadu from your local terminal, you will be connected to the submit node. The submit node is the interface through which users submit their processes to the compute nodes, which actually run them. Never, under any circumstances, run processes directly on the submit node: your process will be killed and all of your work lost! This tutorial will not teach you the shell-script configuration needed to submit your tasks on Xanadu. Therefore, before moving on, read and master the topics covered in the Xanadu tutorial.
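As a quick reference, these are the standard Slurm commands you will use throughout this tutorial (the job ID below is only a placeholder):

# submit a batch script to the scheduler
sbatch data_dump.sh

# check the status of your pending and running jobs
squeue -u $USER

# cancel a job (the job ID is printed by sbatch and shown by squeue)
scancel 123456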
Now that we have covered the introduction and objective of our analysis, we may begin!
We know that the SRA contains the read sequences and accessory metadata from experiments. Rather than downloading experimental data through a browser, we may use the sratoolkit's "fastq-dump" utility to dump raw read data directly into the current terminal directory. Let's have a look at this function (it is expected that you have read the Xanadu tutorial and are familiar with loading modules):
-bash-4.2$ module load sratoolkit
-bash-4.2$ fastq-dump

Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

Use option --help for more information

fastq-dump : 2.8.2
For our needs, we will simply be using the accession numbers to dump our experimental data into our directory. We know our accession numbers, so let's write a shell script to retrieve our raw reads. There are a variety of text editors available on Xanadu. My preferred text editor is "nano". Therefore, we will be using nano to write our shell script.
-bash-4.2$ nano data_dump.sh

  GNU nano 2.3.1                            File: data_dump.sh

#!/bin/bash
#SBATCH --job-name=data_dump
#SBATCH --mail-user=your.email@uconn.edu
#SBATCH --mail-type=ALL
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=120G
#SBATCH -o data_dump_%j.out
#SBATCH -e data_dump_%j.err
#SBATCH --partition=general

mkdir -p /home/CAM/$USER/tmp/
export TMPDIR=/home/CAM/$USER/tmp/

module load sratoolkit

fastq-dump SRR3498212
fastq-dump SRR3498213
fastq-dump SRR3498215
fastq-dump SRR3498216

                                   [ Read 20 lines ]
^G Get Help   ^O WriteOut   ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit       ^J Justify    ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
As a precautionary measure, always set your temporary directory in the environment. While not all programs require a temporary directory to work, it takes far less time to set one than to wait for an error! After typing our script, we press CTRL + X to exit, then 'y', and then enter to save.
Now that we have our script saved, we submit it to the compute nodes with the following command:
-bash-4.2$ sbatch data_dump.sh
Now we wait until we receive an email that our process has finished.
Let's take a look at one of our files:
-bash-4.2$ head SRR3498212.fastq
@SRR3498212.1 SN638:767:HC555BCXX:1:1108:2396:1996 length=50
NTCAATCGGTCAGAGCACCGCCCTGTCAAGGCGGAAGCAGATCGGAAGAG
+SRR3498212.1 SN638:767:HC555BCXX:1:1108:2396:1996 length=50
#<DDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR3498212.2 SN638:767:HC555BCXX:1:1108:2934:1998 length=50
NCAGTTTTGACCAGATAGGTCTCGCTAAGAAGATTGAGAAGATCGGAAGA
+SRR3498212.2 SN638:767:HC555BCXX:1:1108:2934:1998 length=50
#<<<BDFFEEHIIIHIHEHEHIIEHHHH?<GCCCHCCGICHH1<GHHHIC
@SRR3498212.3 SN638:767:HC555BCXX:1:1108:3860:2000 length=50
NTCGCTTCGTAAGCGAAAGGCCGCGAGTTCGAAGATCGGAAGAGCACACG
We see that each read record starts with a header line containing information about the read, including its length, followed by the nucleotide sequence and then a line beginning with a "+" sign. The "+" line marks the start of the per-base quality scores corresponding to the nucleotide sequence above it.
Sickle performs quality control on Illumina paired-end and single-end short read data using a sliding window. As the window slides along each read, the average quality of the bases inside the window is calculated. Should the average window quality fall beneath a set threshold, sickle trims the read at that point, discarding reads that become too short. After visiting the SRA pages for our data, we see that our data are single-end reads. Let's find out what sickle can do with these:
-bash-4.2$ module load sickle
-bash-4.2$ sickle

Usage: sickle <command> [options]

Command:
pe	paired-end sequence trimming
se	single-end sequence trimming

--help, display this help and exit
--version, output version information and exit
We have single-end sequences.
-bash-4.2$ sickle se

Usage: sickle se [options] -f <fastq sequence file> -t <quality type> -o <trimmed fastq file>

Options:
-f, --fastq-file, Input fastq file (required)
-t, --qual-type, Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)) (required)
-o, --output-file, Output trimmed fastq file (required)
-q, --qual-threshold, Threshold for trimming based on average quality in a window. Default 20.
-l, --length-threshold, Threshold to keep a read based on length after trimming. Default 20.
-x, --no-fiveprime, Don't do five prime trimming.
-n, --trunc-n, Truncate sequences at position of first N.
-g, --gzip-output, Output gzipped files.
--quiet, Don't print out any trimming information
--help, display this help and exit
--version, output version information and exit
The quality may be any score from 0 to 40. The default of 20 is much too low for a robust analysis, so we will keep only reads with an average window quality of 35 or better. Additionally, our reads are 50bp long, and the default length threshold of 20 is likewise too low for confident alignment; we will keep only reads whose post-trimming lengths exceed 45bp. Lastly, we must know the scoring type. While the quality type is not listed on the SRA pages, most SRA reads use the "sanger" quality type. Unless explicitly stated otherwise, try running sickle with the sanger quality type; if an error is returned, try illumina, and if that also fails, try solexa.
Let's put all of this together for our sickle script using our downloaded fastq files:
-bash-4.2$ nano sickle_run.sh

  GNU nano 2.3.1                            File: sickle_run.sh

#!/bin/bash
#SBATCH --job-name=sickle_run
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=120G
#SBATCH -o sickle_run_%j.out
#SBATCH -e sickle_run_%j.err
#SBATCH --partition=general

export TMPDIR=/home/CAM/$USER/tmp/

module load sickle

sickle se -f SRR3498212.fastq -t sanger -o trimmed_SRR3498212.fastq -q 35 -l 45
sickle se -f SRR3498213.fastq -t sanger -o trimmed_SRR3498213.fastq -q 35 -l 45
sickle se -f SRR3498215.fastq -t sanger -o trimmed_SRR3498215.fastq -q 35 -l 45
sickle se -f SRR3498216.fastq -t sanger -o trimmed_SRR3498216.fastq -q 35 -l 45

                                   [ Read 23 lines ]
^G Get Help   ^O WriteOut   ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit       ^J Justify    ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
-bash-4.2$ sbatch sickle_run.sh
It is helpful to see how the quality of the data has changed after using sickle. To do this, we will be using the command-line versions of fastqc and MultiQC. These two programs simply create reports of the average quality of our trimmed reads, with some graphs. Their use is quite simple: we run "fastqc <trimmed_fastq>" for each file and then "multiqc -f -n trimmed trimmed*" to aggregate the reports. Do not worry too much about the options for MultiQC! Let's write our script:
-bash-4.2$ nano quality_control.sh

  GNU nano 2.3.1                            File: quality_control.sh

#!/bin/bash
#SBATCH --job-name=quality_control
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=120G
#SBATCH -o quality_control_%j.out
#SBATCH -e quality_control_%j.err
#SBATCH --partition=general

export TMPDIR=/home/CAM/$USER/tmp/

module load fastqc
module load MultiQC

fastqc trimmed_SRR3498212.fastq
fastqc trimmed_SRR3498213.fastq
fastqc trimmed_SRR3498215.fastq
fastqc trimmed_SRR3498216.fastq

multiqc -f -n trimmed trimmed*

                                   [ Read 26 lines ]
^G Get Help   ^O WriteOut   ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit       ^J Justify    ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
-bash-4.2$ sbatch quality_control.sh
fastqc will create an HTML report for each trimmed file (e.g. "trimmed_SRR3498212_fastqc.html"). To have a look at one, we need to move all of the "*_fastqc.html" files into a single directory and then secure-copy that folder to our local machine, where we can open the reports in a browser. If that seems like too much work for you, you may open the files directly through this GitHub repository: simply click on any "html" file and you may view it in your browser immediately. For completeness, the move-and-copy approach is sketched below, but it is not built into the tutorial's scripts.
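If you do want to copy the reports yourself, something like the following works; the directory name is arbitrary and the username and path are placeholders you must replace with your own:

# on Xanadu: gather the fastqc reports into one folder
mkdir fastqc_reports
mv trimmed_*_fastqc.html fastqc_reports/

# on your local machine: secure-copy that folder down
scp -r YOUR.USER.NAME@xanadu-submit-ext.cam.uchc.edu:/home/CAM/YOUR.USER.NAME/rnaseq_for_model_plant/fastqc_reports .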
This script will also create a directory "trimmed_data". Let's look inside of that directory:
-bash-4.2$ cd trimmed_data
-bash-4.2$ ls
multiqc_fastqc.txt  multiqc.log  multiqc_general_stats.txt  multiqc_sources.txt
Let's have a look at the file format from fastqc and multiqc. When loading the fastqc file, you will be greeted with this screen:
There are some basic statistics which are all pretty self-explanatory. Notice that none of our sequences fail the quality report! It would be concerning if even one did, because this report was generated from our trimmed sequences. The same thinking applies to the sequence lengths: should the minimum sequence length fall below 45, we would know that sickle had not run properly. Let's look at the next index in the file:
This screen is simply a box-and-whiskers plot of our quality scores per base pair. Note that there is a large variance and a lower mean score (though still roughly within our desired range) for base pairs 1-5. These are the primer sequences! I will leave it to you to ponder the behavior of this graph. If you're stumped, you may want to learn how Illumina sequencing works.
Our next index is the per sequence quality scores:
This index shows the number of reads (y-axis) that have a given mean quality score (x-axis). The plot is discrete, and summing across it gives the total number of reads present in the file.
The last index at which we are going to look is the "Overrepresented Sequences" index. This is simply a list of sequences which appear disproportionately often in our reads file. fastqc checks each overrepresented sequence against a database of common contaminants, such as primer and adapter sequences, to determine its likely source. If a sequence does not match anything in that database, the source is reported as "No Hit". Sequences returned as "No Hit" are most likely highly expressed genes.
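If you are curious, you can also eyeball overrepresented sequences yourself from the command line. This quick check is not part of the tutorial's pipeline; it simply counts identical sequence lines (every 4th line of a FASTQ, starting at line 2) and prints the ten most common:

awk 'NR % 4 == 2' trimmed_SRR3498212.fastq | sort | uniq -c | sort -rn | head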
We see that our multiqc report has the same indices as our fastqc reports, but aggregates the statistics across all of our fastqc files:
HISAT2 is a fast and sensitive aligner for mapping next-generation sequencing reads against a reference genome. HISAT2 requires two arguments: the reads file being mapped and the indexed genome to which those reads are mapped. Typically, the hisat2-build command is used to make a HISAT2 index file for the genome; it creates a set of files with the suffix .ht2 which together make up the index. What is an index and why is it helpful? Genome indexing is the same as indexing a tome, like an encyclopedia. It is much easier to locate information in the vastness of an encyclopedia when you consult the index, which is ordered in an easily navigable way with pointers to the location of the information you seek within the encyclopedia. Genome indexing likewise structures the genome so that it is ordered in an easily navigable way, with pointers to where whichever sequence being aligned can be found. Let's have a look at how the hisat2-build command works:

-bash-4.2$ module load hisat2
-bash-4.2$ hisat2-build
No input sequence or sequence file specified!
HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, http://www.ccb.jhu.edu/people/infphilo)
Usage: hisat2-build [options]* <reference_in> <ht2_index_base>
    reference_in            comma-separated list of files with ref sequences
    hisat2_index_base       write ht2 data to files with this dir/basename
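For reference, building an index yourself would look something like the sketch below. We will not need to run this, since Xanadu already provides a pre-built Arabidopsis index, and the genome FASTA filename here is hypothetical:

module load hisat2
hisat2-build -p 16 Athaliana_genome.fasta athaliana_index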
As you can see, we simply enter our reference genome files and the desired prefix for our .ht2 files. Now, fortunately for us, Xanadu has many indexed genomes which we may use. To see if there is a hisat2 Arabidopsis thaliana indexed genome we need to look at the Xanadu databases page. We see that our desired indexed genome is in the location /isg/shared/databases/alignerIndex/plant/Arabidopsis/thaliana/Athaliana_HISAT2/. Now we are ready to align our reads using hisat2 (for hisat2, the script is going to be written first with an explanation of the options after).
nano hisat2_run.sh

  GNU nano 2.3.1                            File: hisat2_run.sh

#!/bin/bash
#SBATCH --job-name=hisat2_run
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=120G
#SBATCH -o hisat2_run_%j.out
#SBATCH -e hisat2_run_%j.err
#SBATCH --partition=general

export TMPDIR=/home/CAM/$USER/tmp/

module load hisat2

hisat2 -p 16 --dta -x /isg/shared/databases/alignerIndex/plant/Arabidopsis/thaliana/Athaliana_HISAT2/thaliana -q trimmed_SRR3498212.fastq -S athaliana_root_1.sam
hisat2 -p 16 --dta -x /isg/shared/databases/alignerIndex/plant/Arabidopsis/thaliana/Athaliana_HISAT2/thaliana -q trimmed_SRR3498213.fastq -S athaliana_root_2.sam
hisat2 -p 16 --dta -x /isg/shared/databases/alignerIndex/plant/Arabidopsis/thaliana/Athaliana_HISAT2/thaliana -q trimmed_SRR3498215.fastq -S athaliana_shoot_1.sam
hisat2 -p 16 --dta -x /isg/shared/databases/alignerIndex/plant/Arabidopsis/thaliana/Athaliana_HISAT2/thaliana -q trimmed_SRR3498216.fastq -S athaliana_shoot_2.sam

                                   [ Read 23 lines ]
^G Get Help   ^O WriteOut   ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit       ^J Justify    ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
-p : number of processors to be used
--dta : report alignments tailored for transcript assemblers
-x : path to the index (here, the pre-built index provided on Xanadu)
-q : query input files are in fastq format
-S : output SAM file
bash-4.2$ sbatch hisat2_run.sh
Once the mapping has been completed, the file structure is as follows:
bash-4.2$ ls
athaliana_root_1.sam  athaliana_root_2.sam  athaliana_shoot_1.sam  athaliana_shoot_2.sam
When HISAT2 completes its run, it summarizes each of its alignments; the summary is written to the standard error file, which can be found in the same folder once the run is completed.
bash-4.2$ nano hisat2_run*err

  GNU nano 2.3.1                            File: hisat2_run_507187.err

34475799 reads; of these:
  34475799 (100.00%) were unpaired; of these:
    33017550 (95.77%) aligned 0 times
    1065637 (3.09%) aligned exactly 1 time
    392612 (1.14%) aligned >1 times
4.23% overall alignment rate
42033973 reads; of these:
  42033973 (100.00%) were unpaired; of these:
    40774230 (97.00%) aligned 0 times
    931377 (2.22%) aligned exactly 1 time
    328366 (0.78%) aligned >1 times
3.00% overall alignment rate
31671127 reads; of these:
  31671127 (100.00%) were unpaired; of these:
    31103167 (98.21%) aligned 0 times
    465131 (1.47%) aligned exactly 1 time
    102829 (0.32%) aligned >1 times
1.79% overall alignment rate
49890217 reads; of these:
  49890217 (100.00%) were unpaired; of these:
    48622480 (97.46%) aligned 0 times
    1029943 (2.06%) aligned exactly 1 time
    237794 (0.48%) aligned >1 times
2.54% overall alignment rate
Let's have a look at a SAM file:
-bash-4.2$ head -n 20 rnaseq_athaliana_root_1.sam
@HD	VN:1.0	SO:unsorted
@SQ	SN:Chr1	LN:30427671
@SQ	SN:Chr2	LN:19698289
@SQ	SN:Chr3	LN:23459830
@SQ	SN:Chr4	LN:18585056
@SQ	SN:Chr5	LN:26975502
@SQ	SN:ChrM	LN:366924
@SQ	SN:ChrC	LN:154478
@PG	ID:hisat2	PN:hisat2	VN:2.1.0	CL:"/isg/shared/apps/hisat2/2.1.0/hisat2-align-s --wrapper basic-0 -p 16 --dta -x /isg/shared/databases/alignerIndex/plant/Arabidopsis/thaliana/Athaliana_HISAT2/thaliana -q trimmed_SRR3498212.fastq -S rnaseq_athaliana_root_1.sam"
SRR3498212.6	4	*	0	0	*	*	0	0	TTTCCAAGCCCTTTCTAGTCTGCGCTTGAGTTTGATTGCAGAGATCGGAA	DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.1	4	*	0	0	*	*	0	0	CAATCGGTCAGAGCACCGCCCTGTCAAGGCGGAAGCAGATCGGAAGAG	DDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.4	4	*	0	0	*	*	0	0	AAAGGGCGTGGGTTCAAATCCCACAGATGTCACCAGATCGGAAGAGC	DDHIIIIIIIIIEHHHIHIIIIHIIIIIIIIIIIIIIIIIIIIIIHH	YT:Z:UU
SRR3498212.8	4	*	0	0	*	*	0	0	TTAAGATTGCTGATTTTGGCCTGGCACGTGAGGTTAAGATCGGAAGAGCA	DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.19	4	*	0	0	*	*	0	0	TGGATGATGGAAAAACCAGCAAGCCCCTCTTCTTTCAAGATCGGAAGAGC	DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.23	4	*	0	0	*	*	0	0	TTTGCCTTCCAAGCAATAGACCCGGGTAGATCGGAAGAGCACACGTCTGA	DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.24	4	*	0	0	*	*	0	0	TGAAACTTCTTGGTTTTAAAGTGTGAATATAGCTGACAAAAGATTGGAAG	DDDDDIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.12	4	*	0	0	*	*	0	0	AAGGGTGTTCTCTGCTACGGACCTCCAGATCGGAAGAGCACACGTCTGAA	DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.27	4	*	0	0	*	*	0	0	ATTGTTCCGGGCTGCCCAGTCCAAGCTGAGAGTGAAGATCGGAAGAGCAC	DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.29	4	*	0	0	*	*	0	0	TATGTCTACGCTGGTTCAAATCCAGCTCGGCCCACCAAGATCGGAAGAGC	DDDDDIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	YT:Z:UU
SRR3498212.18	4	*	0	0	*	*	0	0	CGTGGGTTCGACTCCCACTGTGGTCGCCAAGATCGGAAGAGCACACGTC	DDDCHCCHHHEIHIGIIIEGHHIIIIGHHHIIIIIIIIIIIIIIIIIII	YT:Z:UU
All of the lines starting with an "@" symbol tell us something about the chromosomes or our input. For instance "@SQ SN:Chr1 LN:30427671" tells us that we have a sequence (@SQ) whose sequence name is Chr1 (SN:Chr1) and whose length is 30427671bp (LN:30427671). You may be wondering what the first line means. It is quite straightforward! The first line is simply the header (@HD), giving the SAM format version number (VN:1.0) and stating that the file is unsorted (SO:unsorted). Lastly we have the "@PG" line, which, in order, records an identifier for the program that wrote the file (ID:hisat2), the program name used to align the reads (PN:hisat2), the version of the program used (VN:2.1.0), and the command line which started the process (written in the form the program received it, not the form in which we wrote it).
The alignment portion of the SAM file is much more straight-forward and may be understood by reading the SAM output formatting guide linked in the beginning of this tutorial.
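Though the tutorial does not call for it, a quick sanity check on any of these SAM files can be done with samtools flagstat, which simply tallies how many records are mapped, unmapped, and so on (this is an optional aside, not one of the pipeline steps):

module load samtools
samtools flagstat athaliana_root_1.sam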
Because of the density of the SAM file, it is compressed to binary (BAM) to create a more easily tractable file for manipulation by downstream programs. We convert the SAM file to BAM and sort it so that the alignments are listed in the order of their coordinates along the genome. To do this we use the software samtools:
-bash-4.2$ module load samtools
bash-4.2$ samtools

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM
We are truly only interested in sorting our SAM files.
-bash-4.2$ samtools sort

Usage: samtools sort [options...] [in.bam]

Options:
  -l INT       Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT       Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n           Sort by read name
  -t TAG       Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE      Write final output to FILE rather than standard output
  -T PREFIX    Write temporary files to PREFIX.nnnn.bam
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
The sort function converts SAM files to BAM automatically. Therefore, we can cut through most of these options and do a simple "samtools sort -o <output.bam> <input.sam>". Let's write our script:
bash-4.2$ nano sam_sort_bam.sh

  GNU nano 2.3.1                            File: sam_sort_bam.sh

#!/bin/bash
#SBATCH --job-name=sam_sort_bam
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=120G
#SBATCH -o sam_sort_bam_%j.out
#SBATCH -e sam_sort_bam_%j.err
#SBATCH --partition=general

export TMPDIR=/home/CAM/$USER/tmp/

module load samtools

samtools sort -@ 16 -o athaliana_root_1.bam athaliana_root_1.sam
samtools sort -@ 16 -o athaliana_root_2.bam athaliana_root_2.sam
samtools sort -@ 16 -o athaliana_shoot_1.bam athaliana_shoot_1.sam
samtools sort -@ 16 -o athaliana_shoot_2.bam athaliana_shoot_2.sam

                                   [ Read 23 lines ]
^G Get Help   ^O WriteOut   ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit       ^J Justify    ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
bash-4.2$ sbatch sam_sort_bam.sh
This is the most intricate part of our analysis. Because of the many steps involved, this tutorial is going to walk through the steps before writing a final batch script to be submitted.
Analysis of RNA-seq experiments requires accurate reconstructions of all the isoforms expressed from each gene, as well as estimates of the relative abundance of those isoforms. Accurate quantification benefits from knowledge of precisely which reads originated from each isoform, which cannot be computed perfectly because reads are much shorter than transcripts. StringTie assembles transcripts from RNA-seq reads that have been aligned to the genome, first grouping the reads into distinct gene loci and then assembling each locus into as many isoforms as are needed to explain the data. To begin, it would be nice to have a source describing all genome features (simply a list of genes, exons, transcripts, etc., stating on which chromosome and over which base pairs each feature lies). The file format we're looking for is GFF, which contains exactly that information. We can download the GFF file for the thale cress with the following code:
bash-4.2$ wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
bash-4.2$ head TAIR10_GFF3_genes.gff
Chr1	TAIR10	chromosome	1	30427671	.	.	.	ID=Chr1;Name=Chr1
Chr1	TAIR10	gene	3631	5899	.	+	.	ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1	TAIR10	mRNA	3631	5899	.	+	.	ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1	TAIR10	protein	3760	5630	.	+	.	ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1	TAIR10	exon	3631	3913	.	+	.	Parent=AT1G01010.1
Chr1	TAIR10	five_prime_UTR	3631	3759	.	+	.	Parent=AT1G01010.1
Chr1	TAIR10	CDS	3760	3913	.	+	0	Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1	TAIR10	exon	3996	4276	.	+	.	Parent=AT1G01010.1
Chr1	TAIR10	CDS	3996	4276	.	+	2	Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1	TAIR10	exon	4486	4605	.	+	.	Parent=AT1G01010.1
The GFF file is quite self-explanatory. However, it'd be nice if we could combine all of the pieces of information from the GFF into something more useful. For instance, if there are multiple overlapping but distinct exons from a single gene, we could use that information to determine the isoforms of that gene. Then, we could make a file which gives each isoform its own track (there are other extrapolations to be made, but this is our most relevant example). Luckily for us, we can use the program "gffread" to transform our GFF file into the more useful form just described. The output of gffread --help is much too dense to go into here, but the necessary options will be explained. Do not run this code! We are compiling this code with various other chunks into one script, so be patient!
bash-4.2$ module load gffread
bash-4.2$ gffread TAIR10_GFF3_genes.gff -T -o athaliana_TAIR10_genes.gtf
The -T option tells gffread to convert our input into the GTF format, and the -o option names the output file. The GTF format is simply a transcript annotation file composed of exon and coding-sequence features. Let's have a look at the GTF file:
-bash-4.2$ head athaliana_TAIR10_genes.gtf
Chr1	TAIR10	exon	3631	3913	.	+	.	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	exon	3996	4276	.	+	.	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	exon	4486	4605	.	+	.	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	exon	4706	5095	.	+	.	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	exon	5174	5326	.	+	.	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	exon	5439	5899	.	+	.	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	CDS	3760	3913	.	+	0	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	CDS	3996	4276	.	+	2	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	CDS	4486	4605	.	+	0	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
Chr1	TAIR10	CDS	4706	5095	.	+	0	transcript_id "AT1G01010.1"; gene_id "AT1G01010"; gene_name "AT1G01010";
-bash-4.2$ tail athaliana_TAIR10_genes.gtf
ChrM	TAIR10	exon	349830	351413	.	-	.	transcript_id "ATMG01360.1"; gene_id "ATMG01360"; gene_name "ATMG01360";
ChrM	TAIR10	CDS	349830	351413	.	-	0	transcript_id "ATMG01360.1"; gene_id "ATMG01360"; gene_name "ATMG01360";
ChrM	TAIR10	exon	360717	361052	.	-	.	transcript_id "ATMG01370.1"; gene_id "ATMG01370"; gene_name "ATMG01370";
ChrM	TAIR10	CDS	360717	361052	.	-	0	transcript_id "ATMG01370.1"; gene_id "ATMG01370"; gene_name "ATMG01370";
ChrM	TAIR10	exon	361062	361179	.	-	.	transcript_id "ATMG01380.1"; gene_id "ATMG01380"; gene_name "ATMG01380";
ChrM	TAIR10	exon	361350	363284	.	-	.	transcript_id "ATMG01390.1"; gene_id "ATMG01390"; gene_name "ATMG01390";
ChrM	TAIR10	exon	363725	364042	.	+	.	transcript_id "ATMG01400.1"; gene_id "ATMG01400"; gene_name "ATMG01400";
ChrM	TAIR10	CDS	363725	364042	.	+	0	transcript_id "ATMG01400.1"; gene_id "ATMG01400"; gene_name "ATMG01400";
ChrM	TAIR10	exon	366086	366700	.	-	.	transcript_id "ATMG01410.1"; gene_id "ATMG01410"; gene_name "ATMG01410";
ChrM	TAIR10	CDS	366086	366700	.	-	0	transcript_id "ATMG01410.1"; gene_id "ATMG01410"; gene_name "ATMG01410";
We see that whereas our GFF file included untranslated regions and other annotations, the GTF format contains information only on the various transcripts for each gene. The "transcript_id" attribute in the last column tells us the gene and its isoform, and everything else about the GTF file is quite apparent!
Just as was stated for our conversion from gff to gtf, it would be helpful for us to perform the same operation on our aligned reads. That is, if there are multiple, overlapping but distinct reads from a single gene, we could combine these reads into one transcript isoform. Because we have the gene isoforms in the gtf file, we can re-map each assembled transcript to a gene isoform and then count how many mappings there are per isoform. This, in effect, allows us to quantify the expression rates of each isoform. We will be using the program StringTie to assemble the transcripts for each sample. StringTie requires three input arguments: the BAM alignment file, the genomic GTF file, and the desired output GTF filename. Thus, our code will look like (do not run this!):
bash-4.2$ module load stringtie
bash-4.2$ stringtie -p 16 athaliana_root_1.bam -G athaliana_TAIR10_genes.gtf -o athaliana_root_1.gtf
bash-4.2$ stringtie -p 16 athaliana_root_2.bam -G athaliana_TAIR10_genes.gtf -o athaliana_root_2.gtf
bash-4.2$ stringtie -p 16 athaliana_shoot_1.bam -G athaliana_TAIR10_genes.gtf -o athaliana_shoot_1.gtf
bash-4.2$ stringtie -p 16 athaliana_shoot_2.bam -G athaliana_TAIR10_genes.gtf -o athaliana_shoot_2.gtf
Let's have a look at the stringtie output:
-bash-4.2$ head athaliana_root_1.gtf
# stringtie -p 16 rnaseq_athaliana_root_1.bam -G athaliana_TAIR10_genes.gtf -o rnaseq_athaliana_root_1.gtf
# StringTie version 1.3.3b
Chr1	StringTie	transcript	28500	28706	1000	+	.	gene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "AT1G01046.1"; ref_gene_id "AT1G01046"; ref_gene_name "AT1G01046"; cov "0.241546"; FPKM "3.727008"; TPM "0.747930";
Chr1	StringTie	exon	28500	28706	1000	+	.	gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; reference_id "AT1G01046.1"; ref_gene_id "AT1G01046"; ref_gene_name "AT1G01046"; cov "0.241546";
Chr1	StringTie	transcript	47494	48839	1000	-	.	gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "3.928230"; FPKM "60.611832"; TPM "12.163484";
Chr1	StringTie	exon	47494	47982	1000	-	.	gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "1"; cov "4.529652";
Chr1	StringTie	exon	48075	48839	1000	-	.	gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "2"; cov "3.543791";
Chr1	StringTie	transcript	50075	51199	1000	-	.	gene_id "STRG.3"; transcript_id "STRG.3.1"; reference_id "AT1G01100.2"; ref_gene_id "AT1G01100"; ref_gene_name "AT1G01100"; cov "8.437494"; FPKM "130.188904"; TPM "26.126097";
Chr1	StringTie	exon	50075	50337	1000	-	.	gene_id "STRG.3"; transcript_id "STRG.3.1"; exon_number "1"; reference_id "AT1G01100.2"; ref_gene_id "AT1G01100"; ref_gene_name "AT1G01100"; cov "6.228601";
Chr1	StringTie	exon	50419	50631	1000	-	.	gene_id "STRG.3"; transcript_id "STRG.3.1"; exon_number "2"; reference_id "AT1G01100.2"; ref_gene_id "AT1G01100"; ref_gene_name "AT1G01100"; cov "9.487524";
While this is certainly dense, we can still understand it. Each row starts with the chromosome of the feature. The second column is the software used to assemble the transcript, in our case StringTie. The third column is the feature type, either a transcript or an exon. The next two columns are the start and end bp of the feature (assuming the chromosome starts at bp 1), followed by the score column. The column after the score is either "+" or "-", for the forward and reverse strand, respectively. Our last two columns are the frame ("." means the frame is not determined) and the feature attributes.
You may be wondering about the extra attributes, which are not standard GTF. The "cov" value is the average per-base read coverage of the feature -- roughly, how many reads cover each base of the transcript on average. The FPKM value is the number of fragments per kilobase of feature per million mapped fragments. Simply put, this is the number of times the specific feature was counted, divided by the total number of counts across all features in millions, and then divided by the length of the feature in kilobases. The reason for the length scaling is that longer features naturally accumulate higher counts. For instance, when we split our mRNA into sequences of 50bp or less for sequencing, one 5000 bp transcript will appear as 100 counts while one 500 bp transcript will appear as 10 counts. Now, let's divide each count by the transcript length in kilobases: we get 20 for the 5000 bp sequence (100/(5000/1000)) and 20 for the 500 bp sequence (10/(500/1000))! Now our quantification matches the actual expression profile -- that is, both features were transcribed the same amount.
The last value, TPM (transcripts per million), is the number of transcripts of this feature you would expect per million transcripts counted across all features, again after scaling for feature length. As the example above makes evident, without some sort of length scaling these values would be highly misleading.
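To make the FPKM arithmetic concrete, here is a tiny sketch with entirely hypothetical numbers (100 fragments on a 5000 bp transcript, out of 20 million mapped fragments in the library); it is only an illustration of the formula above, not part of the pipeline:

# hypothetical toy numbers, not taken from our data
counts=100; total=20000000; len=5000
echo "scale=4; ($counts / ($total / 1000000)) / ($len / 1000)" | bc
# prints 1.0000, i.e. an FPKM of 1 for this hypothetical transcript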
The assembled isoforms in each sample are most likely different from those of other samples. However, we may repeat the process of determining isoforms, this time using the GTF files from all four samples. That is, if there are isoforms across the sample GTFs which overlap but are distinct, we may merge those into a single isoform. To do this we will be using the --merge option of stringtie. The --merge option requires three inputs: the genomic GTF file, the desired output filename, and a plain-text file listing each GTF file to be merged, one per line. To begin, let's make our plain-text file:
bash-4.2$ nano mergelist.txt

  GNU nano 2.3.1                            File: mergelist.txt

athaliana_root_1.gtf
athaliana_root_2.gtf
athaliana_shoot_1.gtf
athaliana_shoot_2.gtf

                                   [ Read 4 lines ]
^G Get Help   ^O WriteOut   ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit       ^J Justify    ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
After saving our plain-text file, we can merge our samples using the following code (do not run this!):
bash-4.2$ module load stringtie
bash-4.2$ stringtie --merge -p 16 -G athaliana_TAIR10_genes.gtf -o stringtie_merged.gtf mergelist.txt
While the options are quite self-explanatory, note that "-G" takes the reference annotation and "-o" names the merged output GTF. The comparison files listed below (the merged.* files) come from running gffcompare on the merged GTF against the reference annotation; the exact gffcompare command line appears at the top of merged.stats. After running, you should see the following files in your directory (but hopefully you listened and did not run it yet!):
bash-4.2$ ls merged*
merged.annotated.gtf  merged.loci  merged.stats  merged.stringtie_merged.gtf.refmap  merged.stringtie_merged.gtf.tmap  merged.tracking
Although you have not run the code yet, let's have a look at some of the files we've generated (we will not be looking at the merged.annotated.gtf as we are already quite familiar with the gtf format!)
-bash-4.2$ head merged.loci
XLOC_000001	Chr1[+]3631-5899	AT1G01010|AT1G01010.1	AT1G01010.1
XLOC_000002	Chr1[+]23146-31227	AT1G01040|AT1G01040.1,AT1G01040|AT1G01040.2	AT1G01040.1,AT1G01040.2
XLOC_000003	Chr1[+]28500-28706	AT1G01046|AT1G01046.1	AT1G01046.1
XLOC_000004	Chr1[+]44677-44787	AT1G01073|AT1G01073.1	AT1G01073.1
XLOC_000005	Chr1[+]52239-54692	AT1G01110|AT1G01110.2,AT1G01110|AT1G01110.1	AT1G01110.2,AT1G01110.1
XLOC_000006	Chr1[+]56624-56740	AT1G01115|AT1G01115.1	AT1G01115.1
XLOC_000007	Chr1[+]72339-74096	AT1G01160|AT1G01160.1,AT1G01160|AT1G01160.2	AT1G01160.1,AT1G01160.2
XLOC_000008	Chr1[+]75583-76758	AT1G01180|AT1G01180.1	AT1G01180.1
XLOC_000009	Chr1[+]88898-89745	AT1G01210|AT1G01210.1	AT1G01210.1
XLOC_000010	Chr1[+]91376-95651	AT1G01220|AT1G01220.1	AT1G01220.1
We see we have a condensed form of our various loci. The locus name is in the first column, followed by the chromosome, strand, and bp range in the second column. The final columns are the reference gene and transcript IDs assigned to the locus, and the isoforms matching it.
-bash-4.2$ nano merged.stats

  GNU nano 2.3.1                            File: merged.stats

# gffcompare v0.10.4 | Command line was:
# gffcompare -r athaliana_TAIR10_genes.gtf -G -o merged stringtie_merged.gtf
#

# Summary for dataset: stringtie_merged.gtf
#     Query mRNAs :   41854 in   33403 loci  (30272 multi-exon transcripts)
#             (6013 multi-transcript loci, ~1.3 transcripts per locus)
# Reference mRNAs :   41607 in   33350 loci  (30127 multi-exon)
# Super-loci w/ reference transcripts:    33184
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.9    |
        Exon level:   100.0     |    99.9    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |    99.5    |
  Transcript level:   100.0     |    99.4    |
       Locus level:   100.0     |    99.8    |

     Matching intron chains:   30127
       Matching transcripts:   41607
              Matching loci:   33350

          Missed exons:       0/169264  (  0.0%)
           Novel exons:      90/169578  (  0.1%)
        Missed introns:       0/127896  (  0.0%)
         Novel introns:      13/127951  (  0.0%)
           Missed loci:       0/33350   (  0.0%)
            Novel loci:      75/33403   (  0.2%)

 Total union super-loci across all input datasets: 33403
41854 out of 41854 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)

                                   [ Read 30 lines ]
^G Get Help   ^O WriteOut   ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit       ^J Justify    ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
The information here is quite apparent.
-bash-4.2$ head merged.stringtie_merged.gtf.refmap
ref_gene_id	ref_id	class_code	qry_id_list
AT1G01010	AT1G01010.1	=	AT1G01010|AT1G01010.1
AT1G01040	AT1G01040.1	=	AT1G01040|AT1G01040.1
AT1G01040	AT1G01040.2	=	AT1G01040|AT1G01040.2
AT1G01046	AT1G01046.1	=	AT1G01046|AT1G01046.1
AT1G01073	AT1G01073.1	=	AT1G01073|AT1G01073.1
AT1G01110	AT1G01110.2	=	AT1G01110|AT1G01110.2
AT1G01110	AT1G01110.1	=	AT1G01110|AT1G01110.1
AT1G01115	AT1G01115.1	=	AT1G01115|AT1G01115.1
AT1G01160	AT1G01160.1	=	AT1G01160|AT1G01160.1
Here we have the gene IDs from the reference annotation, followed by the isoform IDs from the reference, and the "class_code", which tells you whether the query in the last column matched the reference fully ("=") or partially ("c"). Taking our first row, we see that isoform AT1G01010.1 of the gene AT1G01010 was matched fully.
-bash-4.2$ head merged.stringtie_merged.gtf.tmap
ref_gene_id	ref_id	class_code	qry_gene_id	qry_id	num_exons	FPKM	TPM	cov	len	major_iso_id	ref_match_len
AT1G01090	AT1G01090.1	=	AT1G01090	AT1G01090.1	3	0.000000	0.000000	0.000000	1627	AT1G01090.1	1627
AT1G01100	AT1G01100.2	=	AT1G01100	AT1G01100.2	4	0.000000	0.000000	0.000000	631	AT1G01100.2	631
AT1G01100	AT1G01100.1	=	AT1G01100	AT1G01100.1	4	0.000000	0.000000	0.000000	587	AT1G01100.2	587
AT1G01100	AT1G01100.4	=	AT1G01100	AT1G01100.4	4	0.000000	0.000000	0.000000	607	AT1G01100.2	607
AT1G01100	AT1G01100.3	=	AT1G01100	AT1G01100.3	5	0.000000	0.000000	0.000000	566	AT1G01100.2	566
AT1G01120	AT1G01120.1	=	AT1G01120	AT1G01120.1	1	0.000000	0.000000	0.000000	1899	AT1G01120.1	1899
AT1G01180	AT1G01180.1	=	AT1G01180	AT1G01180.1	1	0.000000	0.000000	0.000000	1176	AT1G01180.1	1176
AT1G01183	AT1G01183.1	=	AT1G01183	AT1G01183.1	1	0.000000	0.000000	0.000000	101	AT1G01183.1	101
AT1G01225	AT1G01225.1	=	AT1G01225	AT1G01225.1	2	0.000000	0.000000	0.000000	1025	AT1G01225.1	1025
All of the information in the .tmap file may be readily understood now, knowing that "len" and "ref_match_len" are the query sequence length and reference transcript length, respectively.
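If you would like a quick overview of how the assembled transcripts compare to the reference, you can tally the class codes directly from the .tmap file; this one-liner assumes the tab-delimited layout shown above, with the class code in the third column:

cut -f3 merged.stringtie_merged.gtf.tmap | tail -n +2 | sort | uniq -c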
Lastly,
-bash-4.2$ head merged.tracking
TCONS_00000001	XLOC_000001	AT1G01010|AT1G01010.1	=	q1:AT1G01010|AT1G01010.1|6|0.000000|0.000000|0.000000|1688
TCONS_00000002	XLOC_000002	AT1G01040|AT1G01040.1	=	q1:AT1G01040|AT1G01040.1|20|0.000000|0.000000|0.000000|6251
TCONS_00000003	XLOC_000002	AT1G01040|AT1G01040.2	=	q1:AT1G01040|AT1G01040.2|20|0.000000|0.000000|0.000000|5877
TCONS_00000004	XLOC_000003	AT1G01046|AT1G01046.1	=	q1:AT1G01046|AT1G01046.1|1|0.000000|0.000000|0.000000|207
TCONS_00000005	XLOC_000004	AT1G01073|AT1G01073.1	=	q1:AT1G01073|AT1G01073.1|1|0.000000|0.000000|0.000000|111
TCONS_00000006	XLOC_000005	AT1G01110|AT1G01110.2	=	q1:AT1G01110|AT1G01110.2|5|0.000000|0.000000|0.000000|1782
TCONS_00000007	XLOC_000005	AT1G01110|AT1G01110.1	=	q1:AT1G01110|AT1G01110.1|3|0.000000|0.000000|0.000000|1439
TCONS_00000008	XLOC_000006	AT1G01115|AT1G01115.1	=	q1:AT1G01115|AT1G01115.1|1|0.000000|0.000000|0.000000|117
TCONS_00000009	XLOC_000007	AT1G01160|AT1G01160.1	=	q1:AT1G01160|AT1G01160.1|5|0.000000|0.000000|0.000000|1045
TCONS_00000010	XLOC_000007	AT1G01160|AT1G01160.2	=	q1:AT1G01160|AT1G01160.2|6|0.000000|0.000000|0.000000|1129
The merged.tracking file is the compact form of the .tmap file combined with the .loci file. The last column contains the gene, the gene isoform, the number of exons, FPKM, TPM, cov, and length, in that order.
Our last step is to create count tables for the differential expression software "ballgown". A word of note about ballgown: it requires that all of the files it will analyze be under a single parent directory, "ballgown", with each sample's output in its own subdirectory, i.e. "ballgown/sample/sample_files". Knowing we have four samples, let's create the required structure for ballgown (run this):
-bash-4.2$ mkdir ballgown
-bash-4.2$ cd ballgown
-bash-4.2$ mkdir athaliana_root_1
-bash-4.2$ mkdir athaliana_root_2
-bash-4.2$ mkdir athaliana_shoot_1
-bash-4.2$ mkdir athaliana_shoot_2
-bash-4.2$ cd -
The reason for this will become obvious soon. Now we will use StringTie to make our count tables (do not run this!):
-bash-4.2$ module load stringtie
-bash-4.2$ stringtie -e -B -p 16 athaliana_root_1.bam -G stringtie_merged.gtf -o ballgown/athaliana_root_1/athaliana_root_1.count
-bash-4.2$ stringtie -e -B -p 16 athaliana_root_2.bam -G stringtie_merged.gtf -o ballgown/athaliana_root_2/athaliana_root_2.count
-bash-4.2$ stringtie -e -B -p 16 athaliana_shoot_1.bam -G stringtie_merged.gtf -o ballgown/athaliana_shoot_1/athaliana_shoot_1.count
-bash-4.2$ stringtie -e -B -p 16 athaliana_shoot_2.bam -G stringtie_merged.gtf -o ballgown/athaliana_shoot_2/athaliana_shoot_2.count
Now we are ready to compile all of our code into a single script:
-bash-4.2$ nano transcript_assembly.sh

  GNU nano 2.3.1                            File: transcript_assembly.sh

#!/bin/bash
#SBATCH --job-name=transcript_assembly
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 16
#SBATCH --mem=120G
#SBATCH -o transcript_assembly_%j.out
#SBATCH -e transcript_assembly_%j.err
#SBATCH --partition=general

export TMPDIR=/home/CAM/$USER/tmp/

module load gffread
module load stringtie

gffread TAIR10_GFF3_genes.gff -T -o athaliana_TAIR10_genes.gtf

stringtie -p 16 athaliana_root_1.bam -G athaliana_TAIR10_genes.gtf -o athaliana_root_1.gtf
stringtie -p 16 athaliana_root_2.bam -G athaliana_TAIR10_genes.gtf -o athaliana_root_2.gtf
stringtie -p 16 athaliana_shoot_1.bam -G athaliana_TAIR10_genes.gtf -o athaliana_shoot_1.gtf
stringtie -p 16 athaliana_shoot_2.bam -G athaliana_TAIR10_genes.gtf -o athaliana_shoot_2.gtf

stringtie --merge -p 16 -G athaliana_TAIR10_genes.gtf -o stringtie_merged.gtf mergelist.txt

stringtie -e -B -p 16 athaliana_root_1.bam -G stringtie_merged.gtf -o ballgown/athaliana_root_1/athaliana_root_1.count
stringtie -e -B -p 16 athaliana_root_2.bam -G stringtie_merged.gtf -o ballgown/athaliana_root_2/athaliana_root_2.count
stringtie -e -B -p 16 athaliana_shoot_1.bam -G stringtie_merged.gtf -o ballgown/athaliana_shoot_1/athaliana_shoot_1.count
stringtie -e -B -p 16 athaliana_shoot_2.bam -G stringtie_merged.gtf -o ballgown/athaliana_shoot_2/athaliana_shoot_2.count

                                   [ Read 36 lines ]
^G Get Help   ^O WriteOut   ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit       ^J Justify    ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
-bash-4.2$ sbatch transcript_assembly.sh

For many organisms, many of the same genes are expressed in separate cell types, with a variety of phenotype differences resulting from the specific isoforms a cell uses. Therefore, when performing a differential expression analysis on different parts of one organism (not one species, but a singular organism), it is wise to perform an isoform expression analysis alongside a standard differential expression analysis and combine the results. Here we will only be performing the isoform expression analysis. Ballgown is a differential expression package for R, available via Bioconductor, that is ideal for isoform expression analyses. Before beginning, we need to secure-copy our ballgown directory from Xanadu to our local machine:
-bash-4.2$ exit
logout
Connection to xanadu-submit-ext.cam.uchc.edu closed.
user:~$ scp -r YOUR.USER.NAME@xanadu-submit-ext.cam.uchc.edu:/home/CAM/YOUR.USER.NAME/rnaseq_for_model_plant/ballgown .
Now we load RStudio with administrator privileges (otherwise you cannot install packages!).
To begin we must download and load the proper packages:
install.packages("devtools") install.packages("RFLPtools") source("http://www.bioconductor.org/biocLite.R") biocLite(c("alyssafrazee/RSkittleBrewer","ballgown", "genefilter", "dplyr", "devtools")) library(ballgown) library(RSkittleBrewer) library(genefilter) library(dplyr) library(ggplot2) library(gplots) library(devtools) library(RFLPtools)
Now we need to set our working directory to the directory which contains our "ballgown" folder. For me, this is:
dir <- "C://Users/Wolf/Desktop/" setwd(dir) list.files()
You should see the "ballgown" folder after the list.files() command.
Let's have a look at the ballgown function:
help("ballgown") constructor function for ballgown objects Description constructor function for ballgown objects Usage ballgown(samples = NULL, dataDir = NULL, samplePattern = NULL, bamfiles = NULL, pData = NULL, verbose = TRUE, meas = "all") Arguments samples vector of file paths to folders containing sample-specific ballgown data (generated by tablemaker). If samples is provided, dataDir and samplePattern are not used. dataDir file path to top-level directory containing sample-specific folders with ballgown data in them. Only used if samples is NULL. samplePattern regular expression identifying the subdirectories of\ dataDir containing data to be loaded into the ballgown object (and only those subdirectories). Only used if samples is NULL. bamfiles optional vector of file paths to read alignment files for each sample. If provided, make sure to sort properly (e.g., in the same order as samples). Default NULL. pData optional data.frame with rows corresponding to samples and columns corresponding to phenotypic variables. verbose if TRUE, print status messages and timing Information as the object is constructed. meas character vector containing either "all" or one or more of: "rcount", "ucount", "mrcount", "cov", "cov_sd", "mcov", "mcov_sd", or "FPKM". The resulting ballgown object will only contain the specified expression measurements, for the appropriate features. See vignette for which expression measurements are available for which features. "all" creates the full object.
Because of the structure of our ballgown directory, we may use dataDir = "ballgown", samplePattern = "athaliana", meas = "FPKM", and pData = some_type_of_phenotype_matrix.
We want all of the objects in our arguments to be in the same order as they are present in the ballgown directory. Therefore, we want our pData matrix to have two columns -- the first column being the samples as they appear in the ballgown directory, and the second being the phenotype of each sample in the column before it (root or shoot). Let's see the order of our sample files:
list.files("ballgown/") [1] "athaliana_root_1" "athaliana_root_2" "athaliana_shoot_1" "athaliana_shoot_2"
Now we construct a 4x2 phenotype matrix with the first column being our samples in order and the second each sample's phenotype:
pheno_data = c("athaliana_root_1", "athaliana_root_2", "athaliana_shoot_1", "athaliana_shoot_2","root","root","shoot","shoot")
By default, matrix() fills column by column: the first half of our character vector becomes column 1 (the samples, in order) and the second half becomes column 2 (the phenotypes). The end result will be our matrix.
pheno_matrix = matrix(pheno_data, ncol=2)
pheno_matrix = as.data.frame(pheno_matrix)
colnames(pheno_matrix) <- c("sample", "part")
rownames(pheno_matrix) <- pheno_matrix[,1]
pheno_matrix

                             sample  part
athaliana_root_1   athaliana_root_1  root
athaliana_root_2   athaliana_root_2  root
athaliana_shoot_1 athaliana_shoot_1 shoot
athaliana_shoot_2 athaliana_shoot_2 shoot
We may now create our ballgown object:
bg <- ballgown(dataDir = "ballgown", pData=pheno_matrix, samplePattern = "athaliana") Wed Jun 13 11:12:33 2018 Wed Jun 13 11:12:33 2018: Reading linking tables Wed Jun 13 11:12:33 2018: Reading intron data files Wed Jun 13 11:12:35 2018: Merging intron data Wed Jun 13 11:12:36 2018: Reading exon data files Wed Jun 13 11:12:39 2018: Merging exon data Wed Jun 13 11:12:40 2018: Reading transcript data files Wed Jun 13 11:12:41 2018: Merging transcript data Wrapping up the results Wed Jun 13 11:12:42 2018
We filter our ballgown object to keep only transcripts whose expression variance across samples is above 1, using rowVars().
??ballgown::subset
subset ballgown objects to specific samples or genomic locations

Description

subset ballgown objects to specific samples or genomic locations

Usage

subset(x, ...)

## S4 method for signature 'ballgown'
subset(x, cond, genomesubset = TRUE)

Arguments

x             a ballgown object
...           further arguments to generic subset
cond          Condition on which to subset. See details.
genomesubset  if TRUE, subset x to a specific part of the genome. Otherwise, subset x to only include specific samples. TRUE by default.

Details

To use subset, you must provide the cond argument as a string representing a logical expression specifying your desired subset. The subset expression can either involve column names of texpr(x, "all") (if genomesubset is TRUE) or of pData(x) (if genomesubset is FALSE). For example, if you wanted a ballgown object for only chromosome 22, you might call subset(x, "chr == 'chr22'"). (Be sure to handle quotes within character strings appropriately).
bg_filt = subset(bg, "rowVars(texpr(bg))>1", genomesubset=TRUE)
We follow the guide and subset our ballgown object under the condition that the row-variances of the expression data are greater than one, keeping the gene names.
To perform the isoform differential expression analysis we use ballgown's "stattest" function. Let's have a look at it:
??ballgown::stattest
statistical tests for differential expression in ballgown

Description

Test each transcript, gene, exon, or intron in a ballgown object for differential expression, using comparisons of linear models.

Usage

stattest(gown = NULL, gowntable = NULL, pData = NULL, mod = NULL,
  mod0 = NULL, feature = c("gene", "exon", "intron", "transcript"),
  meas = c("cov", "FPKM", "rcount", "ucount", "mrcount", "mcov"),
  timecourse = FALSE, covariate = NULL, adjustvars = NULL, gexpr = NULL,
  df = 4, getFC = FALSE, libadjust = NULL, log = TRUE)

Arguments

gown        name of an object of class ballgown
gowntable   matrix or matrix-like object with rownames representing feature IDs and columns representing samples, with expression estimates in the cells. Provide the feature name with feature. You must provide exactly one of gown or gowntable. NB: gowntable is log-transformed within stattest if log is TRUE, so provide un-logged expression values in gowntable.
pData       Required if gowntable is provided: data frame giving phenotype data for the samples in the columns of gowntable. (Rows of pData correspond to columns of gowntable). If gown is used instead, it must have a non-null, valid pData slot (and the pData argument to stattest should be left NULL).
mod         object of class model.matrix representing the design matrix for the linear regression model including covariates of interest
mod0        object of class model.matrix representing the design matrix for the linear regression model without the covariates of interest.
feature     the type of genomic feature to be tested for differential expression. If gown is used, must be one of "gene", "transcript", "exon", or "intron". If gowntable is used, this is just used for labeling and can be whatever the rows of gowntable represent.
meas        the expression measurement to use for statistical tests. Must be one of "cov", "FPKM", "rcount", "ucount", "mrcount", or "mcov". Not all expression measurements are available for all features. Leave as default if gowntable is provided.
timecourse  if TRUE, tests whether or not the expression profiles of genomic features vary over time (or another continuous covariate) in the study. Default FALSE. Natural splines are used to fit time profiles, so you must have more timepoints than degrees of freedom used to fit the splines. The default df is 4.
covariate   string representing the name of the covariate of interest for the differential expression tests. Must correspond to the name of a column of pData(gown). If timecourse=TRUE, this should be the study's time variable.
adjustvars  optional vector of strings representing the names of potential confounders. Must correspond to names of columns of pData(gown).
gexpr       optional data frame that is the result of calling gexpr(gown). (You can speed this function up by pre-creating gexpr(gown).)
df          degrees of freedom used for modeling expression over time with natural cubic splines. Default 4. Only used if timecourse=TRUE.
getFC       if TRUE, also return estimated fold changes (adjusted for library size and confounders) between populations. Only available for 2-group comparisons at the moment. Default FALSE.
libadjust   library-size adjustment to use in linear models. By default, the adjustment is defined as the sum of the sample's log expression measurements below the 75th percentile of those measurements. To use a different library-size adjustment, provide a numeric vector of each sample's adjustment value. Entries of this vector correspond to samples in rows of pData. If no library size adjustment is desired, set to FALSE.
log         if TRUE, outcome variable in linear models is log(expression+1), otherwise it's expression. Default TRUE.
We see we can determine which transcripts and genes are differentially expressed in the roots or shoots, alongside the fold changes of each differentially expressed gene as measured in FPKM with the following code:
results_transcripts = stattest(bg_filt, feature = "transcript", covariate = "part", getFC = TRUE, meas = "FPKM")
results_genes = stattest(bg_filt, feature = "gene", covariate = "part", getFC = TRUE, meas = "FPKM")
Let's take a look at this object:
head(results_genes)
  feature         id           fc         pval      qval
1    gene MSTRG.4142 2.075827e-01 0.0000491845 0.5184498
2    gene  AT4G14220 2.103846e+01 0.0001882308 0.5184498
3    gene MSTRG.3788 5.031936e-03 0.0002087393 0.5184498
4    gene MSTRG.5824 2.867578e-03 0.0002187155 0.5184498
5    gene  MSTRG.568 3.186395e-01 0.0002764999 0.5184498
6    gene  MSTRG.811 3.236509e-04 0.0003448702 0.5184498
Each differentially expressed gene (or transcript) is listed alongside its ID, fold-change (the ratio of expression between the two groups, not a percent increase), p-value, and q-value.
Now we want to order our results by p-value, subset them to keep only results with p-values below 0.01, and write our findings to csv files:
results_genes = arrange(results_genes, pval)
results_genes = subset(results_genes, pval < 0.01)
results_transcripts = arrange(results_transcripts, pval)
results_transcripts = subset(results_transcripts, pval < 0.01)
write.csv(results_transcripts, "transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "results_genes.csv", row.names=FALSE)
##we use row.names=FALSE because currently the row names are just the numbers 1, 2, 3...
While we have this results_genes object, let's take the opportunity to annotate the genes we have deemed significant (p-values below 0.01, which is every gene now left in the object). To annotate the genes we will be using biomaRt and biomartr. You can install these with the following code:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
install.packages("biomartr")
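If you are on a newer release of R and Bioconductor (3.5 or later), biocLite() has been retired; a rough equivalent using BiocManager (an assumption about your local setup) would be:

##only needed on newer Bioconductor releases where biocLite() is no longer available
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")
install.packages("biomartr")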
The first step in annotating our genes of interest is to choose our database. We do this using the "useMart" function of biomaRt:
library(biomaRt) library(biomartr) ??biomaRt::useMart useMart {biomaRt} R Documentation Connects to the selected BioMart database and dataset Description A first step in using the biomaRt package is to select a BioMart database and dataset to use. The useMart function enables one to connect to a specified BioMart database and dataset within this database. To know which BioMart databases are available see the listMarts function. To know which datasets are available within a BioMart database, first select the BioMart database using useMart and then use the listDatasets function on the selected BioMart, see listDatasets function. Usage useMart(biomart, dataset, host="www.ensembl.org", path="/biomart/martservice", port=80, archive=FALSE, ssl.verifypeer = TRUE, ensemblRedirect = NULL, version, verbose = FALSE) Arguments biomart BioMart database name you want to connect to. Possible database names can be retrieved with the functio listMarts dataset Dataset you want to use. To see the different datasets available within a biomaRt you can e.g. do: mart = useMart('ensembl'), followed by listDatasets(mart). host Host to connect to. Defaults to www.ensembl.org path Path that should be pasted after to host to get access to the web service URL port port to connect to, will be pasted between host and path archive Boolean to indicate if you want to access archived versions of BioMart databases. Note that this argument is now deprecated and will be removed in the future. A better alternative is to leave archive = FALSE and to specify the url of the archived BioMart you want to access. For Ensembl you can view the list of archives using listEnsemblArchives ssl.verifypeer Set SSL peer verification on or off. By default ssl.verifypeer is set to TRUE ensemblRedirect This argument has now been deprecated. version Use version name instead of biomart name to specify which BioMart you want to use verbose Give detailed output of what the method is doing while in use, for debugging
A quick google search will show that the genome we used, TAIR10, is the Ensembl assembly of the thale cress. We are going to want to use the gene dataset. Let's verify that it is there, following the instructions provided:
mart = useMart("ensembl")
head(listDatasets(mart))
                      dataset                         description      version
1  acarolinensis_gene_ensembl      Anole lizard genes (AnoCar2.0)    AnoCar2.0
2   amelanoleuca_gene_ensembl               Panda genes (ailMel1)      ailMel1
3     amexicanus_gene_ensembl         Cave fish genes (AstMex102)    AstMex102
4     anancymaae_gene_ensembl  Ma's night monkey genes (Anan_2.0)     Anan_2.0
5 aplatyrhynchos_gene_ensembl           Duck genes (BGI_duck_1.0) BGI_duck_1.0
6        btaurus_gene_ensembl                   Cow genes (UMD3.1)      UMD3.1
We want to scan this for the thale cress. But first, let's make sure we can scan it, period:
listDatasets(mart)[grep("amelanoleuca",listDatasets(mart)[,1]),]
                    dataset           description version
2 amelanoleuca_gene_ensembl Panda genes (ailMel1) ailMel1
We subset the listDatasets(mart) dataframe to include all rows which have the substring "amelanoleuca" in them. The return is row 2, which we can easily verify matches the head of the dataframe. Now let's try it with our species, "thaliana":
listDatasets(mart)[grep("thaliana",listDatasets(mart)[,1]),]
[1] dataset     description version
<0 rows> (or 0-length row.names)
There is no match. The reason is that biomaRt connects to the main Ensembl host by default, which covers vertebrates and other animal model organisms. We need to access the Ensembl Plants database instead. Now let's try:
listMarts(host="plants.ensembl.org")
            biomart                   version
1       plants_mart      Ensembl Plants Genes 39
2 plants_variations Ensembl Plants Variations 39
##if you are confused by the use of the listMarts function, read the useMart guide above!
mart = useMart("plants_mart", host="plants.ensembl.org")
head(listDatasets(mart))
              dataset                             description          version
1     alyrata_eg_gene        Arabidopsis lyrata genes (v.1.0)            v.1.0
2   atauschii_eg_gene    Aegilops tauschii genes (ASM34733v1)       ASM34733v1
3   athaliana_eg_gene     Arabidopsis thaliana genes (TAIR10)           TAIR10
4 atrichopoda_eg_gene    Amborella trichopoda genes (AMTR1.0)          AMTR1.0
5 bdistachyon_eg_gene    Brachypodium distachyon genes (v1.0)             v1.0
6      bnapus_eg_gene Brassica napus genes (AST_PRJEB5043_v1) AST_PRJEB5043_v1
##we see the thale cress as row 3! now we may choose our dataset:
thale_cress_mart = useMart("plants_mart", host="plants.ensembl.org", dataset="athaliana_eg_gene")
head(thale_cress_mart)
Error in x[seq_len(n)] : object of type 'S4' is not subsettable
Our mart is an S4 object, so we cannot simply print or subset it like a data frame. We query it with the getBM function:
??biomaRt::getBM Retrieves Information from the BioMart database Description This function is the main biomaRt query function. Given a set of filters and corresponding values, it retrieves the user specified attributes from the BioMart database one is connected to. Usage getBM(attributes, filters = "", values = "", mart, curl = NULL, checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, bmHeader = FALSE, quote = "\"") Arguments attributes Attributes you want to retrieve. A possible list of attributes can be retrieved using the function listAttributes. filters Filters (one or more) that should be used in the query. A possible list of filters can be retrieved using the function listFilters. values Values of the filter, e.g. vector of affy IDs. If multiple filters are specified then the argument should be a list of vectors of which the position of each vector corresponds to the position of the filters in the filters argument. mart object of class Mart, created with the useMart function. curl An optional 'CURLHandle' object, that can be used to speed up getBM when used in a loop. checkFilters Sometimes attributes where a value needs to be specified, for example upstream\_flank with value 20 for obtaining upstream sequence flank regions of length 20bp, are treated as filters in BioMarts. To enable such a query to work, one must specify the attribute as a filter and set checkFilters = FALSE for the query to work. verbose When using biomaRt in webservice mode and setting verbose to TRUE, the XML query to the webservice will be printed. uniqueRows If the result of a query contains multiple identical rows, setting this argument to TRUE (default) will result in deleting the duplicated rows in the query result at the server side. bmHeader Boolean to indicate if the result retrieved from the BioMart server should include the data headers or not, defaults to FALSE. This should only be switched on if the default behavior results in errors, setting to on might still be able to retrieve your data in that case quote Sometimes parsing of the results fails due to errors in the Ensembl data fields such as containing a quote, in such cases you can try to change the value of quote to try to still parse the results. Value A data.frame. There is no implicit mapping between its rows and the function arguments (e.g. filters, values), therefore make sure to have the relevant identifier(s) returned by specifying them in attributes. See Examples.
Let's find out the attributes and filters by following the instructions in the vignette:
dim(listAttributes(thale_cress_mart))
[1] 1118    3
##1118 attributes is too many for us to look through. They are ordered somewhat in prevalence of use.
##let's look at the most commonly used attributes and see if they'll work for us
head(listAttributes(thale_cress_mart))
                   name              description         page
1       ensembl_gene_id           Gene stable ID feature_page
2 ensembl_transcript_id     Transcript stable ID feature_page
3    ensembl_peptide_id        Protein stable ID feature_page
4       ensembl_exon_id           Exon stable ID feature_page
5           description         Gene description feature_page
6       chromosome_name Chromosome/scaffold name feature_page
##we don't know the chromosome name, so we can just take attributes 1, 3, and 5
thale_cress_data_frame = getBM(attributes=c("ensembl_gene_id","ensembl_peptide_id","description"), mart=thale_cress_mart)
head(thale_cress_data_frame)
  ensembl_gene_id ensembl_peptide_id                               description
1       AT3G11415
2       AT1G31258                        other RNA [Source:TAIR;Acc:AT1G31258]
3       AT5G24735                        other RNA [Source:TAIR;Acc:AT5G24735]
4       AT2G45780                        other RNA [Source:TAIR;Acc:AT2G45780]
5       AT2G42425                     Unknown gene [Source:TAIR;Acc:AT2G42425]
6       AT4G01533                        other RNA [Source:TAIR;Acc:AT4G01533]
The default descriptions are certainly underwhelming. Let's see if there are any other types of descriptions we can get:
listAttributes(thale_cress_mart)[grep("descr",listAttributes(thale_cress_mart)[,1]),]
                            name                 description         page
5                    description            Gene description feature_page
34        goslim_goa_description      GOSlim GOA Description  feature_page
115  interpro_short_description  Interpro Short Description  feature_page
116         interpro_description        Interpro Description  feature_page
153                  description            Gene description    structure
180                  description            Gene description     homologs
1039                 description            Gene description          snp
1042          source_description  Variant source description          snp
1080                 description            Gene description    sequences
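If, for example, you wanted the GOSlim GOA descriptions listed above instead of the default gene descriptions, a query along these lines should work (a sketch):

##a sketch: pull the GOSlim GOA descriptions seen in the attribute listing above
goslim_frame = getBM(attributes=c("ensembl_gene_id","goslim_goa_description"), mart=thale_cress_mart)
head(goslim_frame)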
Using the other descriptions will take much, much longer, as the information is extracted from the appropriate databases over the internet connection. For this tutorial we will be sticking with our unimpressive default descriptions. However, you may choose the description best suited to you and your research. Before we move on to annotating, let's have a look at the filters:
head(listFilters(thale_cress_mart))
                name                            description
1    chromosome_name               Chromosome/scaffold name
2              start                                  Start
3                end                                    End
4             strand                                 Strand
5 chromosomal_region e.g. 1:100:10000:-1, 1:100000:200000:1
6        with_chembl                      With ChEMBL ID(s)
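For instance, to annotate only genes from a particular chromosome, the chromosome_name filter listed above could be supplied to getBM; a sketch (the chromosome label "1" is an assumption about how chromosomes are named in this dataset):

##a sketch: annotate only genes on chromosome 1 by supplying a filter and a value
chr1_genes = getBM(attributes=c("ensembl_gene_id","description"), filters="chromosome_name", values="1", mart=thale_cress_mart)
head(chr1_genes)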
As the sketch above illustrates, should we only want to annotate genes from a specific chromosome, or genes meeting any other criteria, we would use the filters argument of getBM to select just the subset of the genome we desire. We now have all of the pieces required to annotate our results. Let's have a look at our gene results object and our thale cress data frame one more time:
head(thale_cress_data_frame)
  ensembl_gene_id ensembl_peptide_id                               description
1       AT3G11415
2       AT1G31258                        other RNA [Source:TAIR;Acc:AT1G31258]
3       AT5G24735                        other RNA [Source:TAIR;Acc:AT5G24735]
4       AT2G45780                        other RNA [Source:TAIR;Acc:AT2G45780]
5       AT2G42425                     Unknown gene [Source:TAIR;Acc:AT2G42425]
6       AT4G01533                        other RNA [Source:TAIR;Acc:AT4G01533]
head(results_genes)
  feature         id           fc         pval      qval
1    gene MSTRG.4142 2.075827e-01 0.0000491845 0.5184498
2    gene  AT4G14220 2.103846e+01 0.0001882308 0.5184498
3    gene MSTRG.3788 5.031936e-03 0.0002087393 0.5184498
4    gene MSTRG.5824 2.867578e-03 0.0002187155 0.5184498
5    gene  MSTRG.568 3.186395e-01 0.0002764999 0.5184498
6    gene  MSTRG.811 3.236509e-04 0.0003448702 0.5184498
Funny enough, we do not actually use a biomaRt function to annotate our genes! We can simply subset the thale cress data frame to keep only the rows whose ensembl_gene_id matches an id in our results_genes object. Let's give it a try:
annotated_genes = subset(thale_cress_data_frame, ensembl_gene_id %in% results_genes$id) head(annotated_genes) ensembl_gene_id ensembl_peptide_id description 340 AT2G07754 pre-tRNA [Source:TAIR;Acc:AT2G07754] 5348 AT1G79830 AT1G79830.1 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] 5349 AT1G79830 AT1G79830.3 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] 5350 AT1G79830 AT1G79830.2 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] 5351 AT1G79830 AT1G79830.4 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] 5352 AT1G79830 AT1G79830.5 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] ##let's check our dimensions to ensure every gene was annotated dim(results_genes) [1] 216 5 dim(annotated_genes) [1] 233 3 ##our dimensions do not match! Let's investigate: head(annotated_genes) ensembl_gene_id ensembl_peptide_id description 340 AT2G07754 pre-tRNA [Source:TAIR;Acc:AT2G07754] 5348 AT1G79830 AT1G79830.1 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] 5349 AT1G79830 AT1G79830.3 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] 5350 AT1G79830 AT1G79830.2 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] 5351 AT1G79830 AT1G79830.4 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] 5352 AT1G79830 AT1G79830.5 Golgin Putative 5 [Source:UniProtKB/TrEMBL;Acc:F4HQB9] tail(annotated_genes) ensembl_gene_id ensembl_peptide_id 52741 AT1G52400 AT1G52400.1 52742 AT1G52400 AT1G52400.2 52861 AT5G10110 AT5G10110.1 53166 AT1G63360 AT1G63360.2 53167 AT1G63360 AT1G63360.1 53521 AT3G51870 AT3G51870.1 description 52741 Beta-D-glucopyranosyl abscisate beta-glucosidase [Source:UniProtKB/Swiss-Prot;Acc:Q9SE50] 52742 Beta-D-glucopyranosyl abscisate beta-glucosidase [Source:UniProtKB/Swiss-Prot;Acc:Q9SE50] 52861 DNA-directed RNA polymerase subunit beta [Source:UniProtKB/TrEMBL;Acc:Q8VZV8] 53166 Probable disease resistance protein At1g63360 [Source:UniProtKB/Swiss-Prot;Acc:Q9SH22] 53167 Probable disease resistance protein At1g63360 [Source:UniProtKB/Swiss-Prot;Acc:Q9SH22] 53521 Probable envelope ADP,ATP carrier protein, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:O65023]
A-ha. We see the mismatch in dimension length is due to some genes having multiple isoforms and therefore multiple peptide ids. Because we matched our data frames by gene id, for some genes we have extracted several peptides! Also, take notice of this:
annotated_genes$ensembl_gene_id [1] "AT2G07754" "AT1G79830" "AT1G79830" "AT1G79830" "AT1G79830" "AT1G79830" "AT3G01340" "AT3G01340" [9] "AT5G16110" "AT5G42680" "AT5G42680" "AT4G38160" "AT4G38160" "AT4G38160" "AT4G38160" "AT1G49380" [17] "AT4G24290" "AT4G24290" "AT4G24290" "AT4G25320" "AT2G22430" "AT4G14220" "AT2G17760" "AT3G20290" [25] "AT3G20290" "AT3G20290" "AT2G29360" "AT1G74560" "AT1G74560" "AT4G09150" "AT4G09150" "AT5G49840" [33] "AT5G49840" "AT5G49840" "AT5G49840" "AT1G29990" "AT5G65660" "AT1G70000" "AT1G70000" "AT4G18780" [41] "AT5G47490" "AT5G47490" "AT5G47490" "AT1G54080" "AT1G54080" "AT3G43740" "AT3G43740" "AT4G26900" [49] "AT5G53470" "AT1G61800" "AT1G61800" "AT5G27330" "AT3G20890" "AT3G20890" "AT5G10020" "AT5G10020" [57] "AT5G54830" "AT3G17430" "AT3G17430" "AT2G43970" "AT2G43970" "AT4G25960" "AT4G30990" "AT4G30990" [65] "AT4G30990" "AT1G60010" "AT5G45350" "AT5G45350" "AT5G45350" "AT5G45350" "AT5G45350" "AT5G45350" [73] "AT5G45350" "AT5G04510" "AT5G04510" "AT5G04510" "AT3G25430" "AT3G25430" "AT1G32210" "AT1G32210" [81] "AT3G58140" "AT5G58540" "AT5G58540" "AT5G58540" "AT2G22230" "AT3G46280" "AT5G45410" "AT5G45410" [89] "AT5G45410" "AT5G45410" "AT5G45410" "AT4G16695" "AT4G16695" "AT4G16695" "AT4G16695" "AT4G16695" [97] "AT1G20190" "AT1G20190" "AT1G23760" "AT5G09660" "AT5G09660" "AT5G09660" "AT5G09660" "AT5G09660" [105] "AT4G32760" "AT4G32760" "AT4G32760" "AT4G21460" "AT4G21460" "AT3G51740" "AT3G62930" "AT4G30080" [113] "AT2G41440" "AT2G41440" "AT2G41440" "AT2G41440" "AT1G06700" "AT1G06700" "AT1G06700" "AT4G06676" [121] "AT4G06676" "AT1G54920" "AT1G54920" "AT1G54920" "AT1G54920" "AT1G51710" "AT1G51710" "AT5G55790" [129] "AT5G55790" "AT5G55790" "AT5G03670" "AT5G03670" "AT5G48380" "AT3G32930" "AT3G32930" "AT5G18480" [137] "AT3G47290" "AT3G47290" "AT1G32100" "AT5G08420" "AT2G17530" "AT2G17530" "AT2G17530" "AT4G34000" [145] "AT4G34000" "AT4G34000" "AT4G34000" "AT3G48860" "AT3G48860" "AT3G48860" "AT5G18670" "AT1G65730" [153] "AT1G63770" "AT1G63770" "AT1G63770" "AT1G63770" "AT1G63770" "AT1G63770" "AT1G63770" "AT2G39890" [161] "AT2G39890" "AT2G39890" "AT4G10120" "AT4G10120" "AT4G10120" "AT4G10120" "AT4G10120" "AT4G36750" [169] "AT2G20670" "AT2G42620" "AT2G01735" "AT2G01735" "AT2G26340" "AT2G26340" "AT1G27020" "AT5G13680" [177] "AT3G27320" "AT3G27320" "AT1G24880" "AT1G24880" "AT5G62920" "AT4G17650" "AT1G64330" "AT5G63380" [185] "AT5G39670" "AT4G08510" "AT4G08510" "AT2G41960" "AT3G11030" "AT4G30100" "AT4G30100" "AT5G56290" [193] "AT3G57220" "AT3G09560" "AT3G09560" "AT3G09560" "AT3G09560" "AT5G57710" "AT5G25540" "AT5G25540" [201] "AT5G25540" "AT1G03080" "AT1G03080" "AT1G03080" "AT5G35690" "AT3G53510" "AT5G10240" "AT5G10240" [209] "AT5G16800" "AT5G16800" "AT5G16800" "AT5G13370" "AT1G01440" "AT5G16390" "AT5G16390" "AT5G63940" [217] "AT2G46520" "AT1G26740" "AT5G15570" "AT3G51150" "AT3G51150" "AT3G51150" "AT3G51150" "AT3G51150" [225] "AT1G19010" "AT1G19010" "AT1G52400" "AT1G52400" "AT1G52400" "AT5G10110" "AT1G63360" "AT1G63360" [233] "AT3G51870"
Not a single one of the "MSTRG" sequences has been annotated. It is time we address what the "MSTRG" ids are. You may find out by reading this FAQ. Please read the link to understand why we typically cannot use the "MSTRG" sequences in the differential expression analysis, and most certainly not in annotations. Because each MSTRG id represents an assembled isoform of a gene, but the id is assigned independently in each sample, we would expect all of the MSTRG isoforms to come up as differentially expressed: the computer cannot find a given id in any of the other samples! We avoid this by running StringTie with its "-e" and "-B" options to map all of our assembled transcripts back to the alignment GTF we created, thereby ensuring that the MSTRG ids in our final tables match across samples. We do not use the MSTRG ids in annotation because the names themselves are arbitrary. Our assembled transcripts in each sample are complete transcripts, which are then matched to our assembled reference; oddly, an assembled isoform may map to one gene even though its individual reads mapped to two or more! By forcing consistent nomenclature in this way we have more confidence that our gene counts and differential expression results are robust (which is why this process should be used with highly similar samples, i.e., samples from different parts of one organism), even if we cannot use that information in our annotations.
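As a quick sanity check before we drop them, we can count how many of our significant genes carry assembler-generated MSTRG ids versus reference AT ids (a small sketch):

##a small sketch: FALSE = reference (AT...) gene ids, TRUE = assembler-generated MSTRG ids
table(grepl("^MSTRG", results_genes$id))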
Lastly, let's remove the MSTRG sequences from our gene results object and write our files:
write.csv(file="annotated_genes.csv", annotated_genes, row.names=F)
results_genes_without_MSTRG = subset(results_genes, id %in% annotated_genes$ensembl_gene_id)
dim(results_genes_without_MSTRG)
[1] 116   5
head(results_genes_without_MSTRG)
   feature        id         fc         pval      qval
2     gene AT4G14220 21.0384627 0.0001882308 0.5184498
7     gene AT5G13680  0.0410302 0.0003455516 0.5184498
8     gene AT3G51870 19.8788224 0.0003618771 0.5184498
10    gene AT5G15570  7.1289982 0.0004775700 0.5184498
11    gene AT4G09150  0.3713041 0.0005426543 0.5184498
12    gene AT2G39890 28.5590198 0.0005462862 0.5184498
write.csv(file="results_genes_without_MSTRG.csv", results_genes_without_MSTRG, row.names=F)
Now we want to visualize our data:
##we want pretty colors for our visualization
tropical = c('red', 'dodgerblue', 'hotpink', 'limegreen', 'yellow')
palette(tropical)
We want to compare our genes based on their FPKM values. We know from reading ballgown's vignette that we can extract the expression data using texpr() and specifying a measure.
fpkm = texpr(bg, meas = "FPKM")
##let's look at the distribution
plot(density(fpkm), main="Density Plot of \nUntransformed FPKM")

We can see virtually nothing except that there are many, many genes that are lowly expressed. The reason for the sharp peak is that the density plot automatically scales its x-axis from the lowest value to the highest value. Let's see what those values are:
min(fpkm)
[1] 0
max(fpkm)
[1] 1179157
Due to the scaling, we cannot truly see the distribution. What we can do, however, is transform the data such that the variance is not so staggering, allowing us to see better. There are a few rules for this: all of the data must be transformed in a consistent and reversible manner; after transformation no value may be negative; and all data with a value of 0 must still be 0 after transformation. The reason for the second and third rules is more epistemological. For us, if a gene has an FPKM of 0, then for that sample the gene is unexpressed. Should we transform the data and that particular gene's FPKM is now above 0, we are fundamentally changing the nature of that sample -- we are now saying it is expressing a gene it actually is not! Additionally, there is no such thing as negative expression, so there is no physical reality in which we will have an FPKM beneath 0. With these three rules, we see that taking the log of all our data will be consistent and reversible and will scale down our variance. However, log(0) = -Inf! We have broken a cardinal rule (oddly enough, it is not the infinity that breaks the rule, but the fact that it is negative). Seeing this, we can simply add 1 to our data before log-transforming: log(0 + 1) = 0. Now we have fulfilled all three rules.
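A minimal sketch confirming that the log2(x + 1) transform we are about to apply obeys all three rules:

##a minimal check of the three rules on a few made-up values
x = c(0, 1, 10, 1000)
y = log2(x + 1)
y                      ##no negative values, and 0 stays 0
all.equal(2^y - 1, x)  ##reversible: returns TRUE

Now let's apply the transform to our actual FPKM matrix: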
fpkm = log2(fpkm + 1)
plot(density(fpkm), main="Density Plot of \nTransformed Data")
Now we see an actual distribution. Let's see the difference in distribution between each individual part. To do this we are going to plot the density for each part, one by one, and watch for any great changes.
dim(fpkm)
[1] 41854     4
colnames(fpkm)
[1] "FPKM.athaliana_root_1"  "FPKM.athaliana_root_2"  "FPKM.athaliana_shoot_1" "FPKM.athaliana_shoot_2"
plot(density(fpkm[,1]), main="Density Comparison")
lines(density(fpkm[,2]))
lines(density(fpkm[,3]))
lines(density(fpkm[,4]))
We see that overall the distributions are pretty similar, but not identical -- shoot 2 has the highest expression levels while root 2 has the lowest.
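If the four overlaid curves are hard to tell apart, a sketch like the following redraws the comparison with one color per library (using the tropical palette we loaded earlier) and a legend:

##a sketch: one color per library plus a legend
plot(density(fpkm[,1]), col=1, main="Density Comparison")
lines(density(fpkm[,2]), col=2)
lines(density(fpkm[,3]), col=3)
lines(density(fpkm[,4]), col=4)
legend("topright", legend=colnames(fpkm), col=1:4, lty=1, cex=0.7)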
Now we will generate a PCA plot. I strongly advise you read the PCA link before continuing if you are not familiar with Principal Component Analysis. It will not be explained in this tutorial.
Let's create a vector with our PCA point names
short_names = c("r1","r2","s1","s2")
We are going to be using the Pearson coefficient for our PCA plot. You may think of the Pearson coefficient simply as a measure of similarity. If two datasets are very similar, they will have a Pearson coefficient approaching 1 (any dataset compared with itself has a Pearson coefficient of exactly 1). If two datasets are very dissimilar, their Pearson coefficient approaches 0 (and becomes negative if they are anti-correlated). Let's calculate a matrix containing the correlation coefficients:
r = cor(fpkm, use="pairwise.complete.obs", method="pearson")
The cor() function takes our expression matrix as well as two other arguments: use="pairwise.complete.obs" tells R to compute each pairwise correlation (column 1 against 2, 3, and 4, then column 2 against 3 and 4, and lastly column 3 against 4) using only observations (genes) without missing values for that pair, and method="pearson" selects the Pearson correlation as our similarity metric. Let's have a look at r:
r FPKM.athaliana_root_1 FPKM.athaliana_root_2 FPKM.athaliana_shoot_1 FPKM.athaliana_shoot_2 FPKM.athaliana_root_1 1.0000000 0.9127575 0.7254296 0.7639201 FPKM.athaliana_root_2 0.9127575 1.0000000 0.7247305 0.7661130 FPKM.athaliana_shoot_1 0.7254296 0.7247305 1.0000000 0.8877891 FPKM.athaliana_shoot_2 0.7639201 0.7661130 0.8877891 1.0000000
Here we see each element of the diagonal is 1.0000000. Of course we knew this already, as each sample is perfectly correlated with itself! Then we have the similarity measures of each sample against each other sample.
Rather than calculate the similarity, it would be nicer to calculate the dissimilarity or distance between each sample. We know that if two samples are the same, their similarity measure is 1.000000. We also know that then their dissimilarity is 0%, or 0.000000. Here we see that if we subtract each element from 1, we get the dissimilarity matrix! Let's do it:
d = 1 - r d FPKM.athaliana_root_1 FPKM.athaliana_root_2 FPKM.athaliana_shoot_1 FPKM.athaliana_shoot_2 FPKM.athaliana_root_1 0.00000000 0.08724246 0.2745704 0.2360799 FPKM.athaliana_root_2 0.08724246 0.00000000 0.2752695 0.2338870 FPKM.athaliana_shoot_1 0.27457045 0.27526946 0.0000000 0.1122109 FPKM.athaliana_shoot_2 0.23607994 0.23388696 0.1122109 0.0000000
R has a function which will perform the principal component analysis for us when provided with a dissimilarity matrix, cmdscale. Let's have a look at it:
help(cmdscale)
Classical (Metric) Multidimensional Scaling Description Classical multidimensional scaling (MDS) of a data matrix. Also known as principal coordinates analysis (Gower, 1966). Usage cmdscale(d, k = 2, eig = FALSE, add = FALSE, x.ret = FALSE, list. = eig || add || x.ret) Arguments d a distance structure such as that returned by dist or a full symmetric matrix containing the dissimilarities. k the maximum dimension of the space which the data are to be represented in; must be in {1, 2, …, n-1}. eig indicates whether eigenvalues should be returned. add logical indicating if an additive constant c* should be computed, and added to the non-diagonal dissimilarities such that the modified dissimilarities are Euclidean. x.ret indicates whether the doubly centred symmetric distance matrix should be returned. list. logical indicating if a list should be returned or just the n * k matrix, see ‘Value:’. Details Multidimensional scaling takes a set of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities. (It is a major part of what ecologists call ‘ordination’.) A set of Euclidean distances on n points can be represented exactly in at most n - 1 dimensions. cmdscale follows the analysis of Mardia (1978), and returns the best-fitting k-dimensional representation, where k may be less than the argument k. The representation is only determined up to location (cmdscale takes the column means of the configuration to be at the origin), rotations and reflections. The configuration returned is given in principal-component axes, so the reflection chosen may differ between R platforms (see prcomp). When add = TRUE, a minimal additive constant c* is computed such that the dissimilarities d[i,j] + c* are Euclidean and hence can be represented in n - 1 dimensions. Whereas S (Becker et al, 1988) computes this constant using an approximation suggested by Torgerson, R uses the analytical solution of Cailliez (1983), see also Cox and Cox (2001). Note that because of numerical errors the computed eigenvalues need not all be non-negative, and even theoretically the representation could be in fewer than n - 1 dimensions.
Let's perform our principal component analysis:
pca = cmdscale(d, k=2)
We expect pca to have four rows, one per sample, and two columns, the first column representing our first coordinate axis and the second column representing our second coordinate axis. If we plot the first column against the second, the distances between points approximate the dissimilarities between samples.
pca [,1] [,2] FPKM.athaliana_root_1 -0.1229438 -0.016069664 FPKM.athaliana_root_2 -0.1225334 0.006751752 FPKM.athaliana_shoot_1 0.1454597 -0.045779068 FPKM.athaliana_shoot_2 0.1000175 0.055096980
For this next step it is assumed that you are familiar with plotting in R. If not you may look here.
plot.new()
par(mfrow=c(1,1))
##point colors
point_colors = c("red", "red", "blue", "blue")
plot(pca[,1], pca[,2], xlab="", ylab="", main="PCA plot for all libraries", xlim=c(-0.2,0.2), ylim=c(-0.2,0.2), col=point_colors)
text(pca[,1], pca[,2], pos=2, short_names, col=c("red", "red", "blue", "blue"))
Cytoscape is a desktop program which creates visual topological networks of data. The only requirement to create a topological network in Cytoscape is an "edge list". An edge list is a two-column list in which each row describes a single connection (the two columns together form one piece of information that Cytoscape uses). To begin, download and install Cytoscape. Now, let's go over the two columns of an edge list:
The two columns of an edge list represent the "sources" and the "targets" (Cytoscape calls them source nodes and target nodes). You may instead think of the two columns as representing an object and its group. For instance, the edge list:
A 1
B 1
C 4
D 3
A 4
C 3
means that object A is in both groups 1 and 4, object B is in only group 1, object C is in groups 3 and 4, and so forth. We could then say that our sources are the objects, and our targets are the groups to which each object belongs. To make our topological network we simply put one point for each object and one point for each group on a piece of paper and draw an arrow from each object-point to each group-point it belongs to. If two objects point to the same group, we place those objects closer to that group. If an object is part of two or more groups, it will have two or more arrows, one to each distinct group to which it belongs. The cardinal rule is that no object may have two arrows pointing to the same group. Now we need to begin thinking about our differential expression data. We could group the genes with the phenotype that expresses them most. . . but that would leave all of our genes (sources) flowing into only four sinks (targets), which does not communicate much. However, we have our differential gene expression results written to a csv, complete with the fold-change between phenotypes, so we could group our genes according to their fold-changes. Should we do only that, though, each of our sources would flow to exactly one sink, and we would have no information about the relationships between sinks. What we can do instead is perform a series of grouping exercises, successively merging our results into larger and larger groups. The initial grouping will be the most specific, composed of the most sinks. The second grouping will be less specific, consisting of fewer sinks but the same number of sources, so some of the sinks from the initial grouping will have converged. We can then determine which sinks converged and connect those sinks in the first grouping, repeating the process until it is of no more use to us.
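Before we build the real thing, here is the toy edge list above written out as the kind of two-column csv Cytoscape imports (a sketch; the object and file names are only illustrative):

##a sketch: the toy edge list above as a two-column csv
toy_edge_list = data.frame(source = c("A","B","C","D","A","C"), target = c(1, 1, 4, 3, 4, 3))
write.csv(toy_edge_list, file="toy_edge_list.csv", row.names=FALSE)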
To achieve this, we can create a series of family trees, or dendrograms. You have most likely encountered many dendrograms before in an evolutionary biology course. An important note about dendrograms is that only the height of the tree matters: it does not matter, laterally, how close together any two leaves are. Leaves and nodes separated by a small height are more closely related than leaves and nodes separated by a greater height, regardless of the lateral distance needed to reach one or the other. This will come in handy. We can create a dendrogram of our genes based on their fold changes and first form our groups by taking each node and its closest neighbor. Afterwards, we can form groups from the same dendrogram by taking each node and its two closest neighbors. The second grouping will be less specific, and each of its groups will be a combination of groups from the first. We continue this process until it is of no use, and then create our edge list as needed. To do this, we will be back in R, this time using the package "RFLPtools", which allows us to write our groups into a plain-text document. Let's load R and our gene results:
##re-set working directory if you have exited R, and re-load libraries
diffexp_genes = read.csv("results_genes.csv", header=T)
diffexp_genes = as.data.frame(diffexp_genes)
gene_ids = diffexp_genes[,2]
head(diffexp_genes)
  feature         id           fc         pval      qval
1    gene MSTRG.4142 2.075827e-01 0.0000491845 0.5184498
2    gene  AT4G14220 2.103846e+01 0.0001882308 0.5184498
3    gene MSTRG.3788 5.031936e-03 0.0002087393 0.5184498
4    gene MSTRG.5824 2.867578e-03 0.0002187155 0.5184498
5    gene  MSTRG.568 3.186395e-01 0.0002764999 0.5184498
6    gene  MSTRG.811 3.236509e-04 0.0003448702 0.5184498
We care only about the gene IDs and their fold-changes.
diffexp_genes = diffexp_genes[,3]
diffexp_genes = as.data.frame(diffexp_genes)
rownames(diffexp_genes) = gene_ids
colnames(diffexp_genes) = "fc"
head(diffexp_genes)
                     fc
MSTRG.4142 2.075827e-01
AT4G14220  2.103846e+01
MSTRG.3788 5.031936e-03
MSTRG.5824 2.867578e-03
MSTRG.568  3.186395e-01
MSTRG.811  3.236509e-04
Now we need to calculate the distance of each gene from every other gene based on fold change. This is easy: we simply make an nrow x nrow matrix where each cell is the absolute value of the difference between its row gene's and column gene's fold-changes. R has a very friendly function, dist(), which does this for us. dist() requires only the raw matrix and will do the rest of the work.
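To see exactly what dist() computes, here is a tiny sketch with three made-up fold changes; each entry of the result is just the absolute difference between two of the values:

##a tiny sketch: dist() on a one-column data frame gives pairwise absolute differences
toy_fc = data.frame(fc = c(2, 5, 10), row.names = c("geneA","geneB","geneC"))
dist(toy_fc)
##      geneA geneB
##geneB     3
##geneC     8     5

Now let's try it on our actual fold changes: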
dist_diffexp_genes = dist(diffexp_genes)
head(dist_diffexp_genes)
[1] 20.83087999  0.20255079  0.20471515  0.11105679  0.20725907  0.16655252 19.67123966 17.15730053
Do not worry that there are no names in this output! dist() has kept track of the names and you may trust this output. Now we may move on to creating our groups for the tree based on the distances. We will be using hclust(), which is a hierarchical clustering command. We won't go into the math involved, as your brain is probably quite tired already. However, hclust requires only one argument, the output of the dist() function. Let's try it:
clusters = hclust(dist_diffexp_genes)
The output of head(clusters) is much too large to include here. However, we can plot our dendrogram quite easily:
plot(clusters)
If you squint you can estimate how many groups are suitable to start with. I count 12. We can mark our dendrogram's twelve most closely related groups using the function rect.hclust().
rect.hclust() requires two arguments: the dendrogram object and the number of groups:
clusters_12 = clusters
plot(clusters_12, main="Fold change dendrogram in 12 parts")
rect.hclust(clusters_12, k = 12)
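If you would rather not count boxes on the plot, base R's cutree() cuts the same dendrogram into k groups, and table() shows how many genes land in each (a sketch):

##a sketch: tabulate the size of each of the 12 clusters
cluster_assignments = cutree(clusters, k = 12)
table(cluster_assignments)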
Each gene now belongs to one of clusters 1-12. We can write these assignments to a plain-text file using RFLPtools' function write.hclust().
write.hclust() takes the dendrogram object, the number of groups desired, the output file name, and a prefix string which simply describes the data. Because write.hclust() and rect.hclust() cut the same dendrogram, the groups written to file will match the boxes drawn on the plot as long as the number of groups is the same. Let's write our clusters into 12 groups:
write.hclust(clusters_12, k = 12, file="clusters_of_12.txt", prefix="genes")
clusters_of_12_csv = read.table("clusters_of_12.txt", sep="\t", header=T)
##we know that the file is tab delimited
head(clusters_of_12_csv)
      Sample Cluster Cluster.ID
1 MSTRG.4142       1 genes_H_01
2  AT4G14220       1 genes_H_01
3 MSTRG.3788       1 genes_H_01
4 MSTRG.5824       1 genes_H_01
5  MSTRG.568       1 genes_H_01
6  MSTRG.811       1 genes_H_01
##we only need columns 1 and 2
clusters_of_12_csv = clusters_of_12_csv[,1:2]
head(clusters_of_12_csv)
      Sample Cluster
1 MSTRG.4142       1
2  AT4G14220       1
3 MSTRG.3788       1
4 MSTRG.5824       1
5  MSTRG.568       1
6  MSTRG.811       1
##now let's write our csv
write.csv(clusters_of_12_csv, file = "clusters_of_12.csv", row.names=F)
If you briefly open the clusters_of_12.csv file on your desktop you'll see we have a perfect edge list! Now let's repeat twice more, but with 6 groups and then 3 groups, merging progressively more of the original clusters together:
write.hclust(clusters, k = 6, file = "clusters_of_6.txt", prefix="genes")
clusters_of_6_csv = read.table("clusters_of_6.txt", sep="\t", header=T)
clusters_of_6_csv = clusters_of_6_csv[,1:2]
write.csv(clusters_of_6_csv, file="clusters_of_6.csv", row.names=F)
write.hclust(clusters, k = 3, file = "clusters_of_3.txt", prefix="genes")
clusters_of_3_csv = read.table("clusters_of_3.txt", sep="\t", header=T)
clusters_of_3_csv = clusters_of_3_csv[,1:2]
write.csv(clusters_of_3_csv, file="clusters_of_3.csv", row.names=F)
Now that we have our clusters, we need to combine all of them into one edge list. This seems easy, but keep in mind that if we copy and paste them as-is we will have three different group 1s, 2s, and 3s, each of which is actually a different group! Let's start with the clusters of 3. We know that our largest clustering uses groups 1-12, so each group number in the clusters of 3 must be transformed so that it does not collide with 1-12. The easiest way to do this is to multiply each by 13, which gives us 13, 26, and 39 as our new groups; those are not present in any other clustering, so they remain unique across files. In Excel, simply multiply the entire Cluster column by 13 in the clusters-of-3 file. Now we repeat for the clusters of 6, multiplying each group instead by 14. Our groups are now 1-12, then 13, 26, 39, then 14, 28, 42, 56, 70, 84: no overlap! Lastly, copy the clusters of 6 and paste them beneath the clusters of 3, then copy the clusters of 12 and paste those beneath the other two. Save your file and you are ready to use Cytoscape!
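If you would rather stay in R than use Excel, the same renumbering and stacking can be sketched as follows (the combined file name is only illustrative, and this assumes the csv files still carry their Sample/Cluster headers):

##a sketch of the Excel steps done in R instead
c3 = read.csv("clusters_of_3.csv", header=TRUE)
c6 = read.csv("clusters_of_6.csv", header=TRUE)
c12 = read.csv("clusters_of_12.csv", header=TRUE)
c3$Cluster = c3$Cluster * 13   ##groups become 13, 26, 39
c6$Cluster = c6$Cluster * 14   ##groups become 14, 28, 42, 56, 70, 84
combined_edge_list = rbind(c3, c6, c12)
write.csv(combined_edge_list, file="combined_edge_list.csv", row.names=FALSE)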
First, load Cytoscape and you will be greeted with a screen like this: Click this icon:
and load your data. Your screen should now look like:
Cytoscape does not know which column holds the sources and which holds the targets. We know that our genes are our sources flowing into our target groups. Therefore, we click on "Sample" and the following window should appear: Click on "Source node".
Now click on "Cluster" and then "Target node".
Lastly, hit OK to import your data.
Currently our data is not in a very presentable format. Let's go over to the Style button and click the box which says "default". Click on "curved". Then click on Layout and choose "circular layout". The end result should be this:
Play around and familiarize yourself with Cytoscape. It is a very handy tool!
We see we have one large central network which connects to two small networks. Lastly, we have a peripheral network which seems to be distinct. There simply is too much going on here for us to extract much meaningful information, so let's pare it down. Instead of creating a network based on genes, let's create a network based on the groups. That is, should a gene appear in three distinct groups, say 1, 2, and 3, we can create a network simply with the nodes 1, 2, and 3. We have approximately 200 genes, but only 21 groups! We know all the groups to which each gene belongs, so no information will be lost. Let's combine all of our groupings from the clusters of 3, 6, and 12:
clusters_of_3 = read.csv("clusters_of_3.csv", header=F)
clusters_of_6 = read.csv("clusters_of_6.csv", header=F)
clusters_of_12 = read.csv("clusters_of_12.csv", header=F)
groupings = cbind(clusters_of_3, clusters_of_6)
groupings = cbind(groupings, clusters_of_12)
head(groupings)
          V1 V2         V1 V2         V1 V2
1 MSTRG.4142 13 MSTRG.4142 14 MSTRG.4142  1
2  AT4G14220 13  AT4G14220 14  AT4G14220  1
3 MSTRG.3788 13 MSTRG.3788 14 MSTRG.3788  1
4 MSTRG.5824 13 MSTRG.5824 14 MSTRG.5824  1
5  MSTRG.568 13  MSTRG.568 14  MSTRG.568  1
6  MSTRG.811 13  MSTRG.811 14  MSTRG.811  1
We see that our groupings for each gene correspond to columns 2, 4, and 6. Let's keep the first genes column and the groupings columns:
cluster_col = c("TRUE","TRUE","FALSE","TRUE","FALSE","TRUE")
groupings = groupings[,cluster_col==TRUE]
groupings_col_names = c("Gene","clusters_of_3","clusters_of_6","clusters_of_12")
colnames(groupings) = groupings_col_names
head(groupings)
        Gene clusters_of_3 clusters_of_6 clusters_of_12
1 MSTRG.4142            13            14              1
2  AT4G14220            13            14              1
3 MSTRG.3788            13            14              1
4 MSTRG.5824            13            14              1
5  MSTRG.568            13            14              1
6  MSTRG.811            13            14              1
As we can see, many genes have the same grouping patterns. However, we can extract all of the different grouping patterns using the "unique" function:
colnames(groupings) = groupings_col_names
set_of_uniques = unique(groupings[2:4])
set_of_uniques
    clusters_of_3 clusters_of_6 clusters_of_12
1              13            14              1
15             13            14              2
21             13            14              3
30             26            28              4
35             39            42              5
52             13            14              6
75             13            56              7
97             26            70              8
106            26            28              9
136            26            28             10
171            26            84             11
173            26            28             12
Now, we know that each cluster from our clusters of 3 will have a mate in the clusters of 6 and in the clusters of 12, and likewise for the clusters of 6, and so on. Because there are two sets containing mates for each cluster, let's create a matrix which contains each cluster vector (say, the clusters-of-3 groupings) twice in the first column, with the mates in the second column (i.e., for the first instance of clusters of 3 in the first column, the second column holds the mates from clusters of 6; for the second instance, the second column holds the mates from clusters of 12). Then let's create a complete edge list combining this information for all three clusterings:
clusters_of_3_network = c(set_of_uniques[,1], set_of_uniques[,1])
clusters_of_6_network = c(set_of_uniques[,2], set_of_uniques[,2])
clusters_of_12_network = c(set_of_uniques[,3], set_of_uniques[,3])
edge_list_3 = matrix(clusters_of_3_network, nrow=24, ncol=2)
edge_list_3[,2] = c(set_of_uniques[,2], set_of_uniques[,3])
edge_list_3
edge_list_6 = matrix(clusters_of_6_network, nrow=24, ncol=2)
edge_list_6[,2] = c(set_of_uniques[,1], set_of_uniques[,3])
edge_list_12 = matrix(clusters_of_12_network, nrow=24, ncol=2)
edge_list_12[,2] = c(set_of_uniques[,1], set_of_uniques[,2])
complete_edge_list = rbind(edge_list_3, edge_list_6)
complete_edge_list = rbind(complete_edge_list, edge_list_12)
head(complete_edge_list)
     [,1] [,2]
[1,]   13   14
[2,]   13   14
[3,]   13   14
[4,]   26   28
[5,]   39   42
[6,]   13   14
write.csv(complete_edge_list, file = "complete_edge_list.csv", row.names=F)
Let's now visualize this edge list in Cytoscape. The end result should look something like this:
While this is nice, we would like some computational proof of which groups have the highest connectivity and of the degree of relationship among groups. For this tutorial we will define "connectivity" simply as the number of neighbors a group has, and the "degree of relationship" between two groups as the number of intermediate groups one must pass through to get from one to the other, plus 1. Looking at our Cytoscape output we see that for connected groups the greatest degree of relationship is 2. For groups in separate clusters we will set the degree of relationship to Inf.
If you click on the export button in Cytoscape, you will have the option of exporting the network. Click on that and export the network as a .xml file.
Now, open the .xml file in Microsoft Excel, and you should see something like this:
While we will not go into great detail here, we are interested in the ns1:data4 column. Copy and paste that column into a new Excel workbook and save it under the name "interaction_list.csv". Now let's load this file into R:
interaction_list = read.csv("interaction_list.csv", header=F)
head(interaction_list)
  V1                     V2
1 NA                      x
2  1 12 (interacts with) 28
3  2 12 (interacts with) 26
4  3 11 (interacts with) 84
5  4 11 (interacts with) 26
6  5 10 (interacts with) 28
We see that all of the information has been stored in the second column and the first column is simply an index. Let's keep only the second column:
> interaction_list=interaction_list[,2] interaction_list [1] x 12 (interacts with) 28 12 (interacts with) 26 11 (interacts with) 84 [5] 11 (interacts with) 26 10 (interacts with) 28 10 (interacts with) 26 9 (interacts with) 28 [9] 9 (interacts with) 26 8 (interacts with) 70 8 (interacts with) 26 7 (interacts with) 56 [13] 7 (interacts with) 13 6 (interacts with) 14 6 (interacts with) 13 5 (interacts with) 42 [17] 5 (interacts with) 39 4 (interacts with) 28 4 (interacts with) 26 3 (interacts with) 14 [21] 3 (interacts with) 13 2 (interacts with) 14 2 (interacts with) 13 1 (interacts with) 14 [25] 1 (interacts with) 13 84 (interacts with) 11 84 (interacts with) 26 70 (interacts with) 8 [29] 70 (interacts with) 26 56 (interacts with) 7 56 (interacts with) 13 42 (interacts with) 5 [33] 42 (interacts with) 39 39 (interacts with) 5 39 (interacts with) 42 28 (interacts with) 12 [37] 28 (interacts with) 10 28 (interacts with) 9 28 (interacts with) 4 28 (interacts with) 26 [41] 28 (interacts with) 26 28 (interacts with) 26 28 (interacts with) 26 26 (interacts with) 12 [45] 26 (interacts with) 11 26 (interacts with) 10 26 (interacts with) 9 26 (interacts with) 8 [49] 26 (interacts with) 4 26 (interacts with) 28 26 (interacts with) 84 26 (interacts with) 28 [53] 26 (interacts with) 28 26 (interacts with) 70 26 (interacts with) 28 14 (interacts with) 6 [57] 14 (interacts with) 3 14 (interacts with) 2 14 (interacts with) 1 14 (interacts with) 13 [61] 14 (interacts with) 13 14 (interacts with) 13 14 (interacts with) 13 13 (interacts with) 7 [65] 13 (interacts with) 6 13 (interacts with) 3 13 (interacts with) 2 13 (interacts with) 1 [69] 13 (interacts with) 56 13 (interacts with) 14 13 (interacts with) 14 13 (interacts with) 14 [73] 13 (interacts with) 14 61 Levels: 1 (interacts with) 13 1 (interacts with) 14 10 (interacts with) 26 ... x dim(interaction_list) NULL ##we see that our interaction list is just a vector now length(interaction_list) 73 interaction_list[1] x ##this value is not needed, let's remove it from our interaction list interaction_list = interaction_list[2:73]
Now we simply need to create a matrix which contains all of our sources (the group before "interacts with") in the first column and all of their sinks (the group after "interacts with") in the second column. First, it will be useful to remove "(interacts with)" from all of the elements. We can do this using gsub:
interaction_list = gsub("[[:punct:]]interacts with[[:punct:]] ","",interaction_list) interaction_list [1] "12 28" "12 26" "11 84" "11 26" "10 28" "10 26" "9 28" "9 26" "8 70" "8 26" "7 56" "7 13" "6 14" [14] "6 13" "5 42" "5 39" "4 28" "4 26" "3 14" "3 13" "2 14" "2 13" "1 14" "1 13" "84 11" "84 26" [27] "70 8" "70 26" "56 7" "56 13" "42 5" "42 39" "39 5" "39 42" "28 12" "28 10" "28 9" "28 4" "28 26" [40] "28 26" "28 26" "28 26" "26 12" "26 11" "26 10" "26 9" "26 8" "26 4" "26 28" "26 84" "26 28" "26 28" [53] "26 70" "26 28" "14 6" "14 3" "14 2" "14 1" "14 13" "14 13" "14 13" "14 13" "13 7" "13 6" "13 3" [66] "13 2" "13 1" "13 56" "13 14" "13 14" "13 14" "13 14"
We can now create our sources and sinks by taking the first two characters and the last three characters, respectively, of each element using substr and nchar.
sources_and_sinks = matrix(0, nrow=72, ncol=2) sources = vector(length=72) sinks = vector(length=72) for (i in 1:72){sources[i] = substr(interaction_list[i],1,2)} sources [1] "12" "12" "11" "11" "10" "10" "9 " "9 " "8 " "8 " "7 " "7 " "6 " "6 " "5 " "5 " "4 " "4 " "3 " "3 " "2 " "2 " [23] "1 " "1 " "84" "84" "70" "70" "56" "56" "42" "42" "39" "39" "28" "28" "28" "28" "28" "28" "28" "28" "26" "26" [45] "26" "26" "26" "26" "26" "26" "26" "26" "26" "26" "14" "14" "14" "14" "14" "14" "14" "14" "13" "13" "13" "13" [67] "13" "13" "13" "13" "13" "13" for (i in 1:72){sinks[i] = substr(interaction_list[i],nchar(interaction_list[i])-2,nchar(interaction_list[i]))} sinks [1] " 28" " 26" " 84" " 26" " 28" " 26" " 28" " 26" " 70" " 26" " 56" " 13" " 14" " 13" " 42" " 39" " 28" " 26" [19] " 14" " 13" " 14" " 13" " 14" " 13" " 11" " 26" "0 8" " 26" "6 7" " 13" "2 5" " 39" "9 5" " 42" " 12" " 10" [37] "8 9" "8 4" " 26" " 26" " 26" " 26" " 12" " 11" " 10" "6 9" "6 8" "6 4" " 28" " 84" " 28" " 28" " 70" " 28" [55] "4 6" "4 3" "4 2" "4 1" " 13" " 13" " 13" " 13" "3 7" "3 6" "3 3" "3 2" "3 1" " 56" " 14" " 14" " 14" " 14"
We see that for interactions whose sink has only one digit, the last three characters reach back and grab the trailing digit of the source as well (substr counts the space as a character like any other, so on these shorter strings the window overlaps the source). We can remove these stray leading digits, along with the space, using gsub:
sinks = gsub("[0-9] ", "", sinks) sinks [1] " 28" " 26" " 84" " 26" " 28" " 26" " 28" " 26" " 70" " 26" " 56" " 13" " 14" " 13" " 42" " 39" " 28" " 26" [19] " 14" " 13" " 14" " 13" " 14" " 13" " 11" " 26" "8" " 26" "7" " 13" "5" " 39" "5" " 42" " 12" " 10" [37] "9" "4" " 26" " 26" " 26" " 26" " 12" " 11" " 10" "9" "8" "4" " 28" " 84" " 28" " 28" " 70" " 28" [55] "6" "3" "2" "1" " 13" " 13" " 13" " 13" "7" "6" "3" "2" "1" " 56" " 14" " 14" " 14" " 14"
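As an aside, a less fiddly way to split these strings is strsplit(), which avoids the character counting altogether; a sketch:

##a sketch of a simpler alternative: split each "source sink" string on whitespace
pieces = strsplit(trimws(interaction_list), "\\s+")
sources = sapply(pieces, "[", 1)
sinks   = sapply(pieces, "[", 2)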
If we compare element 55 across all of our vectors, we see that we have the correct format and information. Now, let's create our sources and sinks matrix. A word of note: these are not truly sources and sinks! They connect in both directions. Because of this, we will make a sources-and-sinks matrix and a sinks-and-sources matrix, and then adjoin the two by rows:
sources_and_sinks = matrix(0,nrow=72,ncol=2) sinks_and_sources = matrix(0, nrow=72, ncol=2) sources_and_sinks[,1] = sources sources_and_sinks[,2] = sinks sinks_and_sources[,1] = sinks sinks_and_sources[,2] = sources sources_and_sinks_complete = rbind(sources_and_sinks,sinks_and_sources) head(sources_and_sinks_complete) [,1] [,2] [1,] "12" " 28" [2,] "12" " 26" [3,] "11" " 84" [4,] "11" " 26" [5,] "10" " 28" [6,] "10" " 26" ## because "sources" and "sinks" is a bit of a misnomer here, we can name our columns sources and sinks colnames(sources_and_sinks_complete) = c("sources","sinks") ##the easiest way to convert these character vectors into integers is simply to save the object as a csv and load it write.csv(file="sources_and_sinks_complete.csv",sources_and_sinks_complete,row.names=F) sources_and_sinks_complete=read.csv("sources_and_sinks_complete.csv",header=T) head(sources_and_sinks_complete) sources sinks 1 12 28 2 12 26 3 11 84 4 11 26 5 10 28 6 10 26 ##let's determine all of our unique sources and sinks pairs unique_interactions = unique(sources_and_sinks_complete) head(unique_interactions) sources sinks 1 12 28 2 12 26 3 11 84 4 11 26 5 10 28 6 10 26 nrow(unique_interactions) 60
We have 60 unique interactions. Our first job is to determine our degree of relations. To do this, we first need a starting degree of relations matrix. We know that we have 21 groups. Let's make a 21 by 21 matrix and place its column and row names as our groups:
degree_of_relation_matrix = matrix(0, nrow=21, ncol=21) ##we know that each group appears in either of the columns, so we can simply grab all of our groups from the sources columns groups = unique(unique_interactions[,1]) ##we order our columns to make them easier to read groups = sort(groups,decreasing=FALSE) #now we name our degree_of_relation_matrix rownames(degree_of_relation_matrix) = groups colnames(degree_of_relation_matrix) = groups degree_of_relation_matrix 1 2 3 4 5 6 7 8 9 10 11 12 13 14 26 28 39 42 56 70 84 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 56 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 84 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
We know a few things which will help us fill in this matrix. Remember that the degree of relationship is simply the number of neighbors through which one group must pass to get to the other, plus 1. That means the most closely related groups will have a degree of 1. What about each group with itself? As a conventional rule, we will define each group as having a 0th degree of relationship with itself. So groups connected with no neighbors in between have a degree of 1, with one neighbor in between a degree of 2, and so forth. We are lucky in that we know from our Cytoscape output that our highest degree of relationship is 2. Our final convention is that should two groups be disconnected, we give them a degree of relationship of Inf (the higher the degree, the less connected the groups are). While most of this code is straightforward, let's take a minute to go over how we will determine which pairs have a second-degree relationship. Suppose groups A and B have a second-degree relationship. Then A and B must share a neighbor. But how do we determine that? If we take the set of A's neighbors and the set of B's neighbors, at least one group must appear in both. Therefore, if the intersection of A's neighbors and B's neighbors contains at least one group, A and B have a second-degree relationship! Because we know that our highest degree of relationship is 2, if that intersection is empty then A and B are not connected and receive a degree of relationship of Inf. Let's now write our code:
for (i in 1:nrow(degree_of_relation_matrix)){ for (j in 1:ncol(degree_of_relation_matrix)){ source = rownames(degree_of_relation_matrix)[i] sink = colnames(degree_of_relation_matrix)[j] if (source == sink){ degree_of_relation_matrix[i,j]=0} else { interaction = c(source, sink) interaction = matrix(as.integer(interaction), ncol = 2, byrow=T) success = FALSE if (success==FALSE){ for (x in 1:nrow(unique_interactions)){ if(sum(interaction == unique_interactions[x,])==2){ degree_of_relation_matrix[i,j] = 1 success = TRUE } } } if (success==FALSE){ extended_source_interaction = unique_interactions[unique_interactions$source == source,] extended_sink_interaction = unique_interactions[unique_interactions$sink == sink,] if (length(intersect(extended_source_interaction[,2],extended_sink_interaction[,1]))>=1){ degree_of_relation_matrix[i,j] = 2 } else{degree_of_relation_matrix[i,j] = Inf } } } } } degree_of_relation_matrix 1 2 3 4 5 6 7 8 9 10 11 12 13 14 26 28 39 42 56 70 84 1 0 2 2 Inf Inf 2 2 Inf Inf Inf Inf Inf 1 1 Inf Inf Inf Inf 2 Inf Inf 2 2 0 2 Inf Inf 2 2 Inf Inf Inf Inf Inf 1 1 Inf Inf Inf Inf 2 Inf Inf 3 2 2 0 Inf Inf 2 2 Inf Inf Inf Inf Inf 1 1 Inf Inf Inf Inf 2 Inf Inf 4 Inf Inf Inf 0 Inf Inf Inf 2 2 2 2 2 Inf Inf 1 1 Inf Inf Inf 2 2 5 Inf Inf Inf Inf 0 Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf 1 1 Inf Inf Inf 6 2 2 2 Inf Inf 0 2 Inf Inf Inf Inf Inf 1 1 Inf Inf Inf Inf 2 Inf Inf 7 2 2 2 Inf Inf 2 0 Inf Inf Inf Inf Inf 1 2 Inf Inf Inf Inf 1 Inf Inf 8 Inf Inf Inf 2 Inf Inf Inf 0 2 2 2 2 Inf Inf 1 2 Inf Inf Inf 1 2 9 Inf Inf Inf 2 Inf Inf Inf 2 0 2 2 2 Inf Inf 1 1 Inf Inf Inf 2 2 10 Inf Inf Inf 2 Inf Inf Inf 2 2 0 2 2 Inf Inf 1 1 Inf Inf Inf 2 2 11 Inf Inf Inf 2 Inf Inf Inf 2 2 2 0 2 Inf Inf 1 2 Inf Inf Inf 2 1 12 Inf Inf Inf 2 Inf Inf Inf 2 2 2 2 0 Inf Inf 1 1 Inf Inf Inf 2 2 13 1 1 1 Inf Inf 1 1 Inf Inf Inf Inf Inf 0 1 Inf Inf Inf Inf 1 Inf Inf 14 1 1 1 Inf Inf 1 2 Inf Inf Inf Inf Inf 1 0 Inf Inf Inf Inf 2 Inf Inf 26 Inf Inf Inf 1 Inf Inf Inf 1 1 1 1 1 Inf Inf 0 1 Inf Inf Inf 1 1 28 Inf Inf Inf 1 Inf Inf Inf 2 1 1 2 1 Inf Inf 1 0 Inf Inf Inf 2 2 39 Inf Inf Inf Inf 1 Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf 0 1 Inf Inf Inf 42 Inf Inf Inf Inf 1 Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf 1 0 Inf Inf Inf 56 2 2 2 Inf Inf 2 1 Inf Inf Inf Inf Inf 1 2 Inf Inf Inf Inf 0 Inf Inf 70 Inf Inf Inf 2 Inf Inf Inf 1 2 2 2 2 Inf Inf 1 2 Inf Inf Inf 0 2 84 Inf Inf Inf 2 Inf Inf Inf 2 2 2 1 2 Inf Inf 1 2 Inf Inf Inf 2 0 write.csv(file="degree_of_relation_matrix.csv",degree_of_relation_matrix,row.names=T)
Notice two things: the diagonal is 0, and the matrix is symmetric across its diagonal. I will leave those two points for you to ponder. Our final task is much easier. We simply want to determine the connectivity of each group, so we only need a matrix with one column and one row per group. Because our unique_interactions object is non-redundant, every time a group appears in the first column (or, symmetrically, the second) it represents one unique connection, and counting those appearances per group gives the total connectivity of the group. Let's make the matrix:
total_connectivity = matrix(1:nrow(degree_of_relation_matrix), ncol=1)
rownames(total_connectivity) = rownames(degree_of_relation_matrix)
for (i in 1:nrow(total_connectivity)){
  total_connectivity[i,] = sum(rownames(total_connectivity)[i] == unique_interactions[,1])
}
total_connectivity
   [,1]
1     2
2     2
3     2
4     2
5     2
6     2
7     2
8     2
9     2
10    2
11    2
12    2
13    7
14    5
26    9
28    5
39    2
42    2
56    2
70    2
84    2
write.csv(file="total_connectivity_by_group.csv",total_connectivity,row.names=T)

You may find yourself wondering exactly what it is that we accomplished by the end of this analysis. First, to recap, we:
Downloaded experimental data
Performed quality control on the experimental data
Aligned the experimental data to the reference genome, allowing us to determine which genes are active in each sample
Determined and quantified the gene isoforms present in each sample
Determined which gene isoforms had different general expression patterns between phenotypes
Created visual gene topological networks
Extracted computational information about those networks
While our work may have seemed complete after creating the visualization, the picture itself is not of much use to other scientists! However, all of the csv files we created are. A scientist may have an interest in one particular gene we studied. With our differential expression results she will be able to determine how the gene's behavior varied between phenotypes. Perhaps she wants to investigate whether one gene codes for a transcription factor of another: she can consult our complete clusters file and see if the two genes belong to the same cluster (as the activation of one will coincide with the activation of the other). She may be interested in the overall behavior of one cluster in activating or suppressing another cluster; she can locate her base and target clusters in the complete clusters file, extract all genes from those clusters, and then explore how far downstream each cluster's effect may reach using the degree-of-relationship csv. Bioinformatics is a collaborative field. We always depend on the work of others to solve the questions we are most passionate about. Because of this, it is important to dig deep in your own analysis and to create data that is as readable and handy as possible: not only so that another scientist is not lost in your files, but also so that they remain readable by a computer. It may sometimes have felt like we were going down the rabbit hole in this tutorial, but the information we compiled is immediately useful to fellow scientists. Congratulations on finishing this tutorial successfully!