-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathRiboseQC.Rmd
270 lines (170 loc) · 11 KB
/
RiboseQC.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
---
title: "Ribo-seQC vignette"
author: "Lorenzo Calviello"
date: '`r format(Sys.time(), "%d %B, %Y")`'
output:
html_document:
toc: true
toc_float:
collapsed: false
toc_depth: 2
---
```{r setup_vig, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Ribo-seQC is a package that performs quality control analysis of small RNA-seq data (in .bam format), with a focus on Ribo-seq [1] and related techniques. Thanks to syntax and functions present in packages like *GenomicFeatures*, *rtracklayer* or *BSgenome*, this package can perform comprehensive analyses of alignments on a variety of genomic regions.
This vignette illustrates the usage of the package using example data and annotation from three different experiments performed in three different organisms.
**Warning**: While we encourage users to get familiar with each outlined step, executing all the code on your machine will generate a lot of data (~550 Mb) and it might take some time (5-10 minutes for each analysis and report creation). Please make sure to have enough disk space if you decide to execute code from all the sections.
As a first step, let's load the package:
```{r load_lib}
suppressPackageStartupMessages(library("RiboseQC"))
```
## Exploring the annotation
Let's now download annotation for *Homo sapiens*: a subset of the GENCODE 25 annotation (https://www.gencodegenes.org/human/release_25.html) and the corresponding genome sequences.
```{r create_dirs_hum}
download_testfile = function(filename){
download.file(paste0("http://bimsbstatic.mdc-berlin.de/ohler/dharnet/riboseqc_testfiles/",filename),destfile = filename)
}
download_testfile(filename = "test_human.2bit")
download_testfile(filename = "test_human.gtf")
```
To parse the rich information present in our *.gtf* file we use the *prepare_annotation_files* function. Such a function creates a *TxDb* and a compressed *Rdata* file containing several regions of interest and additional information.
Moreover, such function reads a genome file in *.2bit* to forge a *BSgenome* package for fast and efficient query of genomic sequences. A *.2bit* file can be obtained from a *.fasta* file using the *faToTwoBit* software from UCSC: https://genome.ucsc.edu/goldenpath/help/twoBit.html - http://hgdownload.soe.ucsc.edu/admin/exe/ )
```{r create_annot_hum}
prepare_annotation_files(annotation_directory = ".",
twobit_file = "test_human.2bit",
gtf_file = "test_human.gtf",scientific_name = "Human.test",
annotation_name = "genc25_22M",export_bed_tables_TxDb = F,forge_BSgenome = T,create_TxDb = T)
```
We can now read such information using the *load_annotation* function:
```{r load_hum}
load_annotation("test_human.gtf_Rannot")
```
Two objects have now been created: a *genome_seq* object links to the *BSgenome* package we just created and loaded (containing genome sequences), and a *GTF_annotation* object containing important information present in our *.gtf* file.
For instance, we can access genomic sequences using commands as:
```{r gen_hum}
genome_seq[["chr22"]]
genome_seq[["chrM"]]
```
Transcript annotation and CDS annotations can be accessed as follows:
```{r gtf_hum_general_1}
GTF_annotation$exons_txs
```
```{r gtf_hum_general_2}
GTF_annotation$cds_txs
```
The genomic sequences corresponding to such genomic regions can be easily extracted:
```{r gen_hum_cds}
getSeq(genome_seq,GTF_annotation$cds_txs[[4]])
```
A list of annotated start and stop codons, including the transcripts they map to, can be accessed using:
```{r gtf_hum_general_3}
GTF_annotation$start_stop_codons
```
CDS annotation in transcript-level coordinates is also reported:
```{r gtf_hum_general_4}
GTF_annotation$cds_txs_coords
```
A list of gene ids, transcript ids, together with their symbols and biotypes, can be accessed with:
```{r gtf_hum_general_5}
GTF_annotation$trann
```
The genetic codes used for each chromosomes are accessed using:
```{r gtf_hum_general_6}
GTF_annotation$genetic_codes
getGeneticCode(GTF_annotation$genetic_codes["chr22","genetic_code"])
getGeneticCode(GTF_annotation$genetic_codes["chrM","genetic_code"])
```
Annotation and genome sequences are linked together in the annotation creation step.
The BSgenome package corresponding to the .gtf file used is reported in the *GTF_annotation* object:
```{r gtf_hum_general_7}
GTF_annotation$genome_package
```
## Create the html report (human)
Let's now download a subset of a Ribo-seq dataset in HEK293 cells [2].
```{r human_report_1}
download_testfile(filename = "test_human_hek.bam")
```
We now perform different sets of calculations on our data (including read-length and organelle-specific metagene analyses), using the annotation we previously created. Several files (including automatically generated P_sites positions, and other output from the analysis pipeline) will be automatically created.
Furthermore, an html dynamic report which illustrated the analysis results is created:
```{r human_report_2}
RiboseQC_analysis(annotation_file="test_human.gtf_Rannot",bam_files = "test_human_hek.bam",report_file = "test_human_hek.html",write_tmp_files = F)
```
The html report can be opened by different browsers such as firefox, chrome etc...
Moreover, all the generated plots are available in *.pdf* format and as *RDS* files in the same folder.
For example, the profile of P-site positions along the CDS can be visualized using:
```{r human_report_3,warning=FALSE, results='asis', fig.width=12, fig.height=10, dpi=120}
readRDS("test_human_hek.html_plots/rds/sample1_nucl_4_profiles_P_sites_metagene_subcodon_all")[[1]]
```
Let's now use Ribo-seQC to analyze different samples at once.
## Analyze multiple samples (yeast)
Let's now create annotation for *Saccharomyces cerevisiae*. We will use a custom gtf file coming from the annotation in Ensembl 91, supplied with transcript boundaries from [3].
```{r yeast_1}
download_testfile('test_yeast.2bit')
download_testfile('test_yeast.gtf')
prepare_annotation_files(annotation_directory = ".",
twobit_file = "test_yeast.2bit",
gtf_file = "test_yeast.gtf",scientific_name = "yeast.test",
annotation_name = "yeast_custom",export_bed_tables_TxDb = T,forge_BSgenome = T,create_TxDb = T)
```
As before, we can explore different sequences and regions extracted from annotation.
```{r yeast_2}
load_annotation("test_yeast.gtf_Rannot")
genome_seq[["II"]]
genome_seq[["Mito"]]
GTF_annotation$cds_txs
GTF_annotation$trann
```
We will now download three files corresponding to sample data from a TCP-seq experiment [4], where the three experiments aim at extracting RNA fragments covered by scanning ribosomes (SSU), elongating ribosomes (RS) and input material (input). Ribo-seQC analysis is followed by the creation of a report containing results for all the three experiments.
```{r yeast_3}
download_testfile(filename = "test_yeast_TCP_input.bam")
download_testfile(filename = "test_yeast_TCP_RS.bam")
download_testfile(filename = "test_yeast_TCP_SSU.bam")
bam_filepath_y<-c("test_yeast_TCP_input.bam","test_yeast_TCP_RS.bam","test_yeast_TCP_SSU.bam")
RiboseQC_analysis(annotation_file="test_yeast.gtf_Rannot",bam_files = bam_filepath_y,fast_mode = T,report_file = "test_yeast_TCP.html",sample_names = c("input","RS","SSU"),dest_names = c("input","RS","SSU"),write_tmp_files = F)
```
The report allows to compare different samples side-by-side using tabs for different samples.
For example, the 5'end profiles of different read lengths around start and stop codons can be visualized for different experiments. For the single plots:
```{r yeast_4, warning=FALSE, results='asis', fig.width=12, fig.height=10, dpi=120}
as_ggplot(readRDS("test_yeast_TCP.html_plots/rds/SSU_nucl_4_profiles_fivepr_metagene_subcodon_log2")[[1]])
as_ggplot(readRDS("test_yeast_TCP.html_plots/rds/RS_nucl_4_profiles_fivepr_metagene_subcodon_log2")[[1]])
```
## Analyze multiple samples (Arabidopsis)
We now create the annotation files for *Arabidopsis thaliana*. We will use a custom gtf file containing annotation of the Araport11 project (https://www.araport.org/), and non-coding RNA species from TAIR10.
```{r arab_1}
download_testfile("test_arabidopsis.2bit")
download_testfile("test_arabidopsis.gtf.gz")
prepare_annotation_files(annotation_directory = ".",
twobit_file = "test_arabidopsis.2bit",
gtf_file = "test_arabidopsis.gtf.gz",scientific_name = "arabidopsis.test",
annotation_name = "araport11_custom",export_bed_tables_TxDb = T,forge_BSgenome = T,create_TxDb = T)
```
We will now download two sample datasets from a Ribo-seq experiments in Arabidopsis roots and shoots [5].
```{r arab_2}
annotation="test_arabidopsis.gtf.gz_Rannot"
download_testfile(filename = "test_arabidopsis_root.bam")
download_testfile(filename = "test_arabidopsis_shoot.bam")
bam_filepath=c("test_arabidopsis_root.bam","test_arabidopsis_shoot.bam")
RiboseQC_analysis(annotation_file=annotation,bam_files = bam_filepath,fast_mode = T,report_file = "test_root_shoots.html",dest_names = c("root","shoots"),sample_names = c("root","shoots"),write_tmp_files = F)
```
Despite the lack of a dedicated treatment to purify chloroplast ribosomes in these Ribo-seq experiments (e.g. using chloramphenicol), it is possible to detect footprints from distinct organelles, enriched in the shoots sample rather than roots one.
For instance, we can visualize the amount of footprints mapping to different organelles and biotypes in the two samples:
```{r arab_3}
readRDS("test_root_shoots.html_plots/rds/all_samples_1_readlocdist")[[1]]
```
or visualize z-scored positional codon usage values, calculated using A-site positions for chloroplast ribosomes in the shoots sample:
```{r arab_4, warning=FALSE, results='asis', fig.width=12, fig.height=16, dpi=150}
as_ggplot(readRDS("test_root_shoots.html_plots/rds/shoots_ChrC_7_codonusage_positional_A-sites_per_codon_all_zscore")[[1]])
```
Much more information is available in the html reports, including read length distributions per biotype and compartment, statistics on codon usage, and analysis of the highest mapping positions in the genome.
## References
1) Ingolia, N.T. et al. (2009) Genome-wide analysis in vivo of resolution using ribosome profiling. Science 324, 218–223
2) Calviello, L. et al. (2015) Detecting actively translated open reading frames in ribosome profiling data. Nat. Methods 13, 1–9
3) Daechan P. et al. (2014) Simultaneous mapping of transcript ends at single-nucleotide resolution and identification of widespread promoter-associated non-coding RNA governed by TATA elements, Nucleic Acids Research, Volume 42, Issue 6, 1 April 2014, Pages 3736-3749
4) Archer, S.K. et al. (2016) Dynamics of ribosome scanning and recycling revealed by translation complex profiling. Nature 535, 570–574
5) Hsu, P.Y. et al. (2016) Super-resolution ribosome profiling reveals novel translation events in Arabidopsis. PNAS, 113, E7126–E7135
## Session info
```{r end_sessinf}
session_info()
```