RiboseQC.Rmd

---
title: "Ribo-seQC vignette"
author: "Lorenzo Calviello"
date: '`r format(Sys.time(), "%d %B, %Y")`'
output:
  html_document: 
    toc: true
    toc_float:
      collapsed: false
    toc_depth: 2
---

```{r setup_vig, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

Ribo-seQC is a package that performs quality control analysis of small RNA-seq data (in .bam format), with a focus on Ribo-seq [1] and related techniques. Thanks to syntax and functions present in packages like *GenomicFeatures*, *rtracklayer* or *BSgenome*, this package can perform comprehensive analyses of alignments on a variety of genomic regions.
This vignette illustrates the usage of the package using example data and annotation from three different experiments performed in three different organisms.

**Warning**: While we encourage users to get familiar with each outlined step, executing all the code on your machine will generate a lot of data (~550 Mb) and it might take some time (5-10 minutes for each analysis and report creation). Please make sure to have enough disk space if you decide to execute code from all the sections.

As a first step, let's load the package:

```{r load_lib}
suppressPackageStartupMessages(library("RiboseQC"))
```


## Exploring the annotation

Let's now download annotation for *Homo sapiens*: a subset of the GENCODE 25 annotation (https://www.gencodegenes.org/human/release_25.html) and the corresponding genome sequences.

```{r create_dirs_hum}
download_testfile = function(filename){
    download.file(paste0("http://bimsbstatic.mdc-berlin.de/ohler/dharnet/riboseqc_testfiles/",filename),destfile = filename)
}
download_testfile(filename = "test_human.2bit")

download_testfile(filename = "test_human.gtf")
```


To parse the rich information present in our *.gtf* file we use the *prepare_annotation_files* function. Such a function creates a *TxDb*  and a compressed *Rdata* file containing several regions of interest and additional information.
Moreover, such function reads a genome file in *.2bit* to forge a *BSgenome* package for fast and efficient query of genomic sequences. A *.2bit* file can be obtained from a *.fasta* file using the *faToTwoBit* software from UCSC: https://genome.ucsc.edu/goldenpath/help/twoBit.html - http://hgdownload.soe.ucsc.edu/admin/exe/ )


```{r create_annot_hum}
prepare_annotation_files(annotation_directory = ".",
                         twobit_file = "test_human.2bit",
                         gtf_file = "test_human.gtf",scientific_name = "Human.test",
                         annotation_name = "genc25_22M",export_bed_tables_TxDb = F,forge_BSgenome = T,create_TxDb = T)
```


We can now read such information using the *load_annotation* function:

```{r load_hum}
load_annotation("test_human.gtf_Rannot")
```

Two objects have now been created: a *genome_seq* object links to the *BSgenome* package we just created and loaded (containing genome sequences), and a *GTF_annotation* object containing important information present in our *.gtf* file.
For instance, we can access genomic sequences using commands as:

```{r gen_hum}
genome_seq[["chr22"]]
genome_seq[["chrM"]]
```

Transcript annotation and CDS annotations can be accessed as follows:

```{r gtf_hum_general_1}
GTF_annotation$exons_txs
```

```{r gtf_hum_general_2}
GTF_annotation$cds_txs
```


The genomic sequences corresponding to such genomic regions can be easily extracted:

```{r gen_hum_cds}
getSeq(genome_seq,GTF_annotation$cds_txs[[4]])
```

A list of annotated start and stop codons, including the transcripts they map to, can be accessed using:

```{r gtf_hum_general_3}
GTF_annotation$start_stop_codons
```

CDS annotation in transcript-level coordinates is also reported:

```{r gtf_hum_general_4}
GTF_annotation$cds_txs_coords
```

A list of gene ids, transcript ids, together with their symbols and biotypes, can be accessed with:

```{r gtf_hum_general_5}
GTF_annotation$trann
```

The genetic codes used for each chromosomes are accessed using:

```{r gtf_hum_general_6}
GTF_annotation$genetic_codes
getGeneticCode(GTF_annotation$genetic_codes["chr22","genetic_code"])
getGeneticCode(GTF_annotation$genetic_codes["chrM","genetic_code"])
```

Annotation and genome sequences are linked together in the annotation creation step.
The BSgenome package corresponding to the .gtf file used is reported in the *GTF_annotation* object:

```{r gtf_hum_general_7}
GTF_annotation$genome_package
```


## Create the html report (human)

Let's now download a subset of a Ribo-seq dataset in HEK293 cells [2].


```{r human_report_1}
download_testfile(filename = "test_human_hek.bam")
```

We now perform different sets of calculations on our data (including read-length and organelle-specific metagene analyses), using the annotation we previously created. Several files (including automatically generated P_sites positions, and other output from the analysis pipeline) will be automatically created. 
Furthermore, an html dynamic report which illustrated the analysis results is created:

```{r human_report_2}
RiboseQC_analysis(annotation_file="test_human.gtf_Rannot",bam_files = "test_human_hek.bam",report_file = "test_human_hek.html",write_tmp_files = F)
```

The html report can be opened by different browsers such as firefox, chrome etc...
Moreover, all the generated plots are available in *.pdf* format and as *RDS* files in the same folder.
For example, the profile of P-site positions along the CDS can be visualized using:

```{r human_report_3,warning=FALSE, results='asis', fig.width=12, fig.height=10, dpi=120}
readRDS("test_human_hek.html_plots/rds/sample1_nucl_4_profiles_P_sites_metagene_subcodon_all")[[1]]
```

Let's now use Ribo-seQC to analyze different samples at once.


## Analyze multiple samples (yeast)

Let's now create annotation for *Saccharomyces cerevisiae*. We will use a custom gtf file coming from the annotation in Ensembl 91, supplied with transcript boundaries from [3].

```{r yeast_1}

download_testfile('test_yeast.2bit')
download_testfile('test_yeast.gtf')


prepare_annotation_files(annotation_directory = ".",
                         twobit_file = "test_yeast.2bit",
                         gtf_file = "test_yeast.gtf",scientific_name = "yeast.test",
                         annotation_name = "yeast_custom",export_bed_tables_TxDb = T,forge_BSgenome = T,create_TxDb = T)
```

As before, we can explore different sequences and regions extracted from annotation.

```{r yeast_2}
load_annotation("test_yeast.gtf_Rannot")

genome_seq[["II"]]
genome_seq[["Mito"]]
GTF_annotation$cds_txs
GTF_annotation$trann
```

We will now download three files corresponding to sample data from a TCP-seq experiment [4], where the three experiments aim at extracting RNA fragments covered by scanning ribosomes (SSU), elongating ribosomes (RS) and input material (input). Ribo-seQC analysis is followed by the creation of a report containing results for all the three experiments.

```{r yeast_3}
download_testfile(filename = "test_yeast_TCP_input.bam")

download_testfile(filename = "test_yeast_TCP_RS.bam")

download_testfile(filename = "test_yeast_TCP_SSU.bam")

bam_filepath_y<-c("test_yeast_TCP_input.bam","test_yeast_TCP_RS.bam","test_yeast_TCP_SSU.bam")

RiboseQC_analysis(annotation_file="test_yeast.gtf_Rannot",bam_files = bam_filepath_y,fast_mode = T,report_file = "test_yeast_TCP.html",sample_names = c("input","RS","SSU"),dest_names = c("input","RS","SSU"),write_tmp_files = F)
```

The report allows to compare different samples side-by-side using tabs for different samples.
For example, the 5'end profiles of different read lengths around start and stop codons can be visualized for different experiments. For the single plots:


```{r yeast_4, warning=FALSE, results='asis', fig.width=12, fig.height=10, dpi=120}
as_ggplot(readRDS("test_yeast_TCP.html_plots/rds/SSU_nucl_4_profiles_fivepr_metagene_subcodon_log2")[[1]])

as_ggplot(readRDS("test_yeast_TCP.html_plots/rds/RS_nucl_4_profiles_fivepr_metagene_subcodon_log2")[[1]])
```


## Analyze multiple samples (Arabidopsis)

We now create the annotation files for *Arabidopsis thaliana*. We will use a custom gtf file containing annotation of the Araport11 project (https://www.araport.org/), and non-coding RNA species from TAIR10.


```{r arab_1}

download_testfile("test_arabidopsis.2bit")
download_testfile("test_arabidopsis.gtf.gz")

prepare_annotation_files(annotation_directory = ".",
                         twobit_file = "test_arabidopsis.2bit",
                         gtf_file = "test_arabidopsis.gtf.gz",scientific_name = "arabidopsis.test",
                         annotation_name = "araport11_custom",export_bed_tables_TxDb = T,forge_BSgenome = T,create_TxDb = T)
```

We will now download two sample datasets from a Ribo-seq experiments in Arabidopsis roots and shoots [5].

```{r arab_2}

annotation="test_arabidopsis.gtf.gz_Rannot"


download_testfile(filename = "test_arabidopsis_root.bam")
download_testfile(filename = "test_arabidopsis_shoot.bam")

bam_filepath=c("test_arabidopsis_root.bam","test_arabidopsis_shoot.bam")

RiboseQC_analysis(annotation_file=annotation,bam_files = bam_filepath,fast_mode = T,report_file = "test_root_shoots.html",dest_names = c("root","shoots"),sample_names = c("root","shoots"),write_tmp_files = F)
```

Despite the lack of a dedicated treatment to purify chloroplast ribosomes in these Ribo-seq experiments (e.g. using chloramphenicol), it is possible to detect footprints from distinct organelles, enriched in the shoots sample rather than roots one.

For instance, we can visualize the amount of footprints mapping to different organelles and biotypes in the two samples:

```{r arab_3}
readRDS("test_root_shoots.html_plots/rds/all_samples_1_readlocdist")[[1]]
```

or visualize z-scored positional codon usage values, calculated using A-site positions for chloroplast ribosomes in the shoots sample:

```{r arab_4, warning=FALSE, results='asis', fig.width=12, fig.height=16, dpi=150}
as_ggplot(readRDS("test_root_shoots.html_plots/rds/shoots_ChrC_7_codonusage_positional_A-sites_per_codon_all_zscore")[[1]])
```

Much more information is available in the html reports, including read length distributions per biotype and compartment, statistics on codon usage, and analysis of the highest mapping positions in the genome.

## References

1) Ingolia, N.T. et al. (2009) Genome-wide analysis in vivo of resolution using ribosome profiling. Science 324, 218–223

2) Calviello, L. et al. (2015) Detecting actively translated open reading frames in ribosome profiling data. Nat. Methods 13, 1–9

3) Daechan P. et al. (2014) Simultaneous mapping of transcript ends at single-nucleotide resolution and identification of widespread promoter-associated non-coding RNA governed by TATA elements, Nucleic Acids Research, Volume 42, Issue 6, 1 April 2014, Pages 3736-3749

4) Archer, S.K. et al. (2016) Dynamics of ribosome scanning and recycling revealed by translation complex profiling. Nature 535, 570–574

5) Hsu, P.Y. et al. (2016) Super-resolution ribosome profiling reveals novel translation events in Arabidopsis. PNAS, 113, E7126–E7135


## Session info

```{r end_sessinf}
session_info()
```