RNA-seq data analysis

Create a directory dedicated to the analyses:

mkdir analysis

And move into it:

cd analysis

Load the environment:

. ~/rnaseq/.rnaseqenv

RPKM distribution

Have a look at the distribution of RPKM values:

rpkm_distribution.R -i ../quantifications/encode.mouse.gene.FPKM.idr_NA.tsv -l -p 0 -m ../data/quantifications.index.tsv -f age

To look at the plot:

evince boxplot.log_T.psd_0.out.pdf

Clustering analysis

Perform hierarchical clustering to check replicability:

matrix_to_dist.R -i ../quantifications/encode.mouse.gene.FPKM.idr_NA.tsv --log10 -c pearson -o stdout | ggheatmap.R -i stdin --row_metadata ../data/quantifications.index.tsv --col_dendro --row_dendro -B 10 --matrix_palette=~/rnaseq/bin/terrain.colors.3.txt --rowSide_by age --matrix_fill_limits 0.85,1 -o cns.heatmap.pdf

Look at the clustering.

Differential gene expression

Run the DE with the edgeR package (be careful takes read counts and not rpkm values as input):

edgeR.analysis.R -i ../quantifications/encode.mouse.gene.expected_count.idr_NA.tsv -m ../data/quantifications.index.tsv -f age

Write a list of the genes overexpressed after 18 days, according to edgeR analysis:

awk '$NF<0.01 && $4>2{print $1"\tover18"}' edgeR.cpm1.n2.out.tsv > edgeR.0.01.overE18.txt

Write a list of the genes overexpressed after 14 days, according to edgeR analysis:

awk '$NF<0.01 && $4<-2 {print $1"\tover14"}' edgeR.cpm1.n2.out.tsv > edgeR.0.01.overE14.txt

Count how many overexpressed genes there are in each stage:

wc -l edgeR.0.01.over*.txt

Show the results in a heatmap:

(echo -e "gene\tedgeR"; cat edgeR.0.01.over*.txt) > gene.edgeR.tsv
cut -f1 gene.edgeR.tsv | tail -n+2 | selectMatrixRows.sh - ../quantifications/encode.mouse.gene.FPKM.idr_NA.tsv | ggheatmap.R -W 5 -H 9 --col_metadata ../data/quantifications.index.tsv --colSide_by age --col_labels labExpId --row_metadata gene.edgeR.tsv --merge_row_mdata_on gene --rowSide_by edgeR --row_labels none -l -p 0.1 --col_dendro --row_dendro -o heatmap.edgeR.pdf

GO enrichment

Prepare a file with the list of all the genes in the annotation:

awk '{split($10,a,"\""); print a[2]}' ~/rnaseq/refs/mm65.long.ok.gtf | sort -u > universe.txt

Launch the GO enrichment script for the Biological Processes, Molecular Function and Cellular Components in the set of genes overexpressed in E14:

cut -f1 edgeR.0.01.overE14.txt | GO_enrichment.R -u universe.txt -G stdin -c BP -o edgeR.overE14 -s mouse
cut -f1 edgeR.0.01.overE14.txt | GO_enrichment.R -u universe.txt -G stdin -c MF -o edgeR.overE14 -s mouse
cut -f1 edgeR.0.01.overE14.txt | GO_enrichment.R -u universe.txt -G stdin -c CC -o edgeR.overE14 -s mouse

The results can be visualized in the browser, pasting the following paths in the search line:

firefox ~/rnaseq/analysis/edgeR.overE14.BP.html
firefox ~/rnaseq/analysis/edgeR.overE14.MF.html
firefox ~/rnaseq/analysis/edgeR.overE14.CC.html

You can repeat the same for the genes overexpressed in E18:

cut -f1 edgeR.0.01.overE18.txt | GO_enrichment.R -u universe.txt -G stdin -c BP -o edgeR.overE18 -s mouse
cut -f1 edgeR.0.01.overE18.txt | GO_enrichment.R -u universe.txt -G stdin -c MF -o edgeR.overE18 -s mouse
cut -f1 edgeR.0.01.overE18.txt | GO_enrichment.R -u universe.txt -G stdin -c CC -o edgeR.overE18 -s mouse

Make bigWig file with RNAseq signal

Create a bash script called run_bigwig.sh with the following:

#!/bin/bash -e

# load env
. ~/rnaseq/.rnaseqenv

# load module
module load BEDTools/2.21.0-goolf-1.4.10-no-OFED
module load KentUtils/308-goolf-1.4.10-no-OFED


# create bedgraph from mappings
genomeCoverageBed -split -bg -ibam alignments/mouse_cns_E18_rep1_Aligned.sortedByCoord.out.bam > alignments/mouse_cns_E18_rep1_bedGraph.bed
# generate bigwig from bedgraph
bedGraphToBigWig alignments/mouse_cns_E18_rep1_bedGraph.bed ~/rnaseq/refs/mouse_genome_mm9.fa.fai alignments/mouse_cns_E18_rep1.bw

Submit the job to the cluster:

qsub -cwd -q RNAseq -N bigwig_rnaseq_course -e logs -o logs ./run_bigwig.sh

Visualize your results in the UCSC genome browser

Add the gene expression track to the genome browser in bigWig format. The bigWig files must be either uploaded or linked (if they are present somewhere online)

Go to the USCS genome browser web page:

http://genome.ucsc.edu

On the main menu, click on Genomes. Click on Add custom track. Make sure the assembly information is as follows:

clade: Mammal, genome: Mouse, assembly: July 2007 (NCBI/mm9)

Paste the track specifications for each file in the box Paste URLs or data:

track name=mouse_cns_E14_rep1.bw type=bigWig visibility=2 autoScale=off maxHeightPixels=30 color=0,149,347 viewLimits=0:30 bigDataUrl=http://genome.crg.es/~epalumbo/rnaseq/2015nov/mouse_cns_E14_rep1_Aligned.sortedByCoord.out.bw
track name=mouse_cns_E14_rep2.bw type=bigWig visibility=2 autoScale=off maxHeightPixels=30 color=0,149,347 viewLimits=0:30 bigDataUrl=http://genome.crg.es/~epalumbo/rnaseq/2015nov/mouse_cns_E14_rep2_Aligned.sortedByCoord.out.bw
track name=mouse_cns_E18_rep1.bw type=bigWig visibility=2 autoScale=off maxHeightPixels=30 color=69,139,0 viewLimits=0:30 bigDataUrl=http://genome.crg.es/~epalumbo/rnaseq/2015nov/mouse_cns_E18_rep1_Aligned.sortedByCoord.out.bw
track name=mouse_cns_E18_rep2.bw type=bigWig visibility=2 autoScale=off maxHeightPixels=30 color=69,139,0 viewLimits=0:30 bigDataUrl=http://genome.crg.es/~epalumbo/rnaseq/2015nov/mouse_cns_E18_rep2_Aligned.sortedByCoord.out.bw

Click Submit

Go to the genome browser to look at some genes and their RNA-seq signal

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rna-seq-analyses.adoc

rna-seq-analyses.adoc

RNA-seq data analysis

RPKM distribution

Clustering analysis

Differential gene expression

GO enrichment

Make bigWig file with RNAseq signal

Visualize your results in the UCSC genome browser

Files

rna-seq-analyses.adoc

Latest commit

History

rna-seq-analyses.adoc

File metadata and controls

RNA-seq data analysis

RPKM distribution

Clustering analysis

Differential gene expression

GO enrichment

Make bigWig file with RNAseq signal

Visualize your results in the UCSC genome browser