-
Notifications
You must be signed in to change notification settings - Fork 1
/
OmicsPLS_shortCourse_ANSWERS.Rmd
416 lines (298 loc) · 15.1 KB
/
OmicsPLS_shortCourse_ANSWERS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
---
title: "Integrating multi-omics underlying Down syndrome with OmicsPLS"
subtitle: "Answer key"
author: "Said el Bouhaddani and Jeanine Houwing-Duistermaat"
output:
rmdformats::material:
highlight: tango
fig_width: 9
fig_height: 7
self_contained: yes
keep_md: yes
code_folding: show
---
```{css CSS stuff, echo=FALSE}
.Routp {
background-color: "#D1F2EB";
border: 2px solid grey;
font-weight: bold;
}
.bluebox {
padding: 1em 1em 1em 3.5em;
background: white;
border-left: 10px solid green;
}
.blueboxx {
padding: 0.5em;
background: white;
border-left: 10px solid blue;
color: black;
}
.question {
background-image: url("https://github.com/selbouhaddani/Contents/raw/main/QuestionTiny.png");
background-repeat: no-repeat;
background-size: 3em 3em;
background-position: left center;
}
```
```{r global_options, include=FALSE, eval=TRUE}
library(knitr)
library(rmdformats)
opts_chunk$set(fig.path='Figs/', eval=TRUE, results = "hold", fig.align = 'center',
class.source="Routp", echo=TRUE, warning=TRUE, message=TRUE,
dev='png', dpi=200, cache = FALSE, collapse=TRUE)
```
<script type="text/javascript">
function toggle_visibility(id) {
var e = document.getElementById(id);
if(e.style.display == 'block')
e.style.display = 'none';
else
e.style.display = 'block';
}
</script>
# Introduction and description of the data
These are exercises for the OmicsPLS short course. There are several questions throughout the text, and the corresponding R code to answer the question is given after the question. The answer key contains all output of each code block as well as brief answers to the questions. Note that some questions don't have a unique answer, but the idea should be clear.
## Integrative analysis
In this part, we consider a data integration approach to link the information in methylation and glycomics data. This 'joint information' is then related to Down syndrome. In the exercises, we work with a subset of methylation only on chromosome 21.
A flexible data integration approach for two heterogeneous datasets is O2PLS and is available in the OmicsPLS package. We will use OmicsPLS to select the most important genes corresponding with the methylation sites, this mapping can be found in the `CpG_groups` object where each CpG (methylation) site has one or more associated genes. The number of methylation sites is found by running `length(CpG_groups)`, and the number of genes is found with `length(unique(CpG_groups))`.
## Load packages and datasets
We need several packages for data handling, fitting and visualizing the results. Run this code to see which are not yet installed. All packages can be installed with `install.packages`, except disgenet2r, which has a separate install command shown below.
```{r Required packages, eval=F}
req_pack <- c("MASS", "parallel", "tidyverse", "magrittr",
"OmicsPLS", "httr", "disgenet2r", "GGally")
if(sum(!(req_pack %in% installed.packages()[,1])) > 0){
cat("\nThe following packages are missing:\n")
req_pack[which(!(req_pack %in% installed.packages()[,1]))]
} else cat("\nNo packages missing.\n")
```
```{r Load packages, message=FALSE}
library(MASS) # statistical tools, such as lda
library(parallel) # parallel computing
library(tidyverse) # dataset & viz tools
library(magrittr) # pipe %>% operators
library(OmicsPLS) # data integration toolkit
## Also needed but not loaded
# install.packages("httr")
# install.packages("GGally")
# remotes::install_bitbucket("ibi_group/disgenet2r")
```
The datasets are found in the `DownSyndrome.RData` file. We work with a subset of the methylation data measured only on chromosome 21. A simple `load` statement should load them in your workspace. The `str` function can be used to get a first impression of the data objects.
```{r Load and inspect datasets}
load("DownSyndrome.RData")
## str gives an overview of all kinds of objects
cat("Methylation data:\n")
str(methylation)
cat("\nGlycomics data:\n")
str(glycomics)
cat("\nCpG mapping to gene:\n")
str(CpG_groups)
```
## Descriptive analysis
Before any analysis can be performed, you should consider calculating some descriptives about the data.
:::: {.bluebox .question data-latex=""}
**_Exercises._** Plot boxplots of (a subset of) the data columns. Also describe the demographics: case-controls, age and sex distributions. Are there any remarkable observations?
::::
\
:::: {.blueboxx data-latex=""}
<a href="javascript:void()" onclick="toggle_visibility('foo0');">
**_Answers (Click here)_**
</a>
<div id="foo0" style="display:none">
The variables appear to be symmetric, without large outliers. There are some large variance CpG variables. There are 29 DS patients, 29 mothers, and 27 siblings. You can further inspect age and sex distributions using `hist` and `table`, e.g. `table(ClinicalVars$sex, ClinicalVars$group)`.
</div>
::::
\
```{r Descriptive plots}
boxplot(glycomics)
boxplot(methylation[,1:100])
print(ClinicalVars)
table(ClinicalVars$group)
```
# Cross-validation to choose number of OmicsPLS components
## Choosing the number of components
To apply OmicsPLS, we first need to decide on the number of components to retain. A cross-validation is usually performed. In the cross-validation, a grid is specified as well as the number of folds. If desired, you can use multiple cores to speed up the calculations. On a Windows machine, this requires copying the data matrices to each parallel process, so keep an eye on memory usage.
Note that there are other ways besides cross-validation, such as the scree plot ([click here for more info](https://selbouhaddani.eu/2020-10-11-Introduction-OmicsPLS-scree/)).
:::: {.bluebox .question data-latex=""}
Perform a cross-validation for the number of joint and specific components. What is the optimal number of components for each part?
::::
\
:::: {.blueboxx data-latex=""}
<a href="javascript:void()" onclick="toggle_visibility('foo1');">
**_Answers (Click here)_**
</a>
<div id="foo1" style="display:none">
The optimal number of components changes a bit between runs, but 2, 4 and 6 (or 4,2,6) seems fine. One can also run the commented out lines to make a scree plot, as mentioned above.
</div>
::::
\
```{r Decide on number of components, cache = TRUE}
set.seed(83654)
crossval_o2m_adjR2(X = methylation, Y = glycomics,
a = 1:5, ax = 0:10, ay = 0:9, nr_folds = 10, nr_cores = 1)
## -> 4 2 6
## Code to run a scree plot is
# par(mfrow=c(1,3))
# plot(svd(crossprod(methylation,glycomics),0,0)$d^2 %>%
# (function(e) e/sum(e)), main='Joint Scree plot')
# plot(svd(tcrossprod(methylation),0,0)$d %>% (function(e) e/sum(e)),
# main="Methylation Scree plot")
# plot(svd(crossprod(glycomics),0,0)$d %>% (function(e) e/sum(e)),
# main="Glycomics Scree plot")
# par(mfrow=c(1,1))
# ## -> 3 5 1
r <- 4; rx <- 2; ry <- 6
```
# Apply OmicsPLS to methylation and glycomics
## Fit O2PLS
We fit O2PLS to the methylation and glycomics data, and calculate the variance explained by the joint and specific parts.
```{r Fit OmicsPLS and summary}
fit <- o2m(methylation, glycomics, r, rx, ry)
summary(fit)
```
## Visualize the results: loadings
Next, we inspect the loadings. Each of the 3322 loading values in the methylation parts represents a CpG site indicated by a cg ID. For the glycomics parts, we have 10 glycan peaks/IDs.
Each label is a cg ID or glycan ID, and the axes represent the respective components.
:::: {.bluebox .question data-latex=""}
Give an interpretation of these results. Which features have highest loadings, in which components? Which glycan and methylation features have the highest covariance according to the plot? Click "zoom" in RStudio if the labels don't fit on the screen.
::::
\
:::: {.blueboxx data-latex=""}
<a href="javascript:void()" onclick="toggle_visibility('foo5');">
**_Answers (Click here)_**
</a>
<div id="foo5" style="display:none">
For the glycan data: in component 1, P1 and P7 have a high loading value. In component 2, P4, P10 and P8 are relatively high. For the methylation data component 1: cg11866463, component 2: cg02464073.
</div>
::::
\
```{r Plot the loadings}
plot(fit,loading_name = "Yj",i=1,j=2,label = "col") + theme_bw()
plot(fit,loading_name = "Xj",i=1,j=2,label = "col") + theme_bw()
```
## Visualize the results: scores
Next, we investigate if these joint components are associated with Down syndrome. We first look at the scatterplot of the scores, colored by DS.
:::: {.bluebox .question data-latex=""}
Give an interpretation. Are the joint scores able to separate Down syndrome?
::::
\
:::: {.blueboxx data-latex=""}
<a href="javascript:void()" onclick="toggle_visibility('foo6');">
**_Answers (Click here)_**
</a>
<div id="foo6" style="display:none">
There is no perfect separation, but in the first component there seems to be a difference in means between Down syndrome and the siblings. The mothers have similar mean as DS, indicating that the first joint component may represent "biological age".
</div>
::::
\
```{r Plot the scores}
data.frame(Group = ClinicalVars$group, JPC = scores(fit, "Xjoint")) %>%
pivot_longer(-Group,
names_to = "Comp",
values_to = "Scores") %>%
ggplot(aes(x=Comp, y=Scores, col=Group)) +
geom_boxplot() + xlab("Component") + ylab("Methylation scores") +
theme_bw()
data.frame(Group = ClinicalVars$group, JPC = scores(fit, "Yjoint")) %>%
pivot_longer(-Group,
names_to = "Comp",
values_to = "Scores") %>%
ggplot(aes(x=Comp, y=Scores, col=Group)) +
geom_boxplot() + xlab("Component") + ylab("Glycomics scores") +
theme_bw()
## Another (fancy) approach is to run the following for multiple plots in one go
GGally::ggpairs(data.frame(Tt=fit$Tt,U=fit$U),
aes(col = ClinicalVars$group), progress = FALSE,
title = "Joint X and Y components against each other") + theme_bw()
```
We perform a logistic regression with the Down syndrome status as outcome, and the joint methylation scores as covariates. We exclude the mothers for now.
:::: {.bluebox .question data-latex=""}
Which group category is the reference? Are there joint scores that are significantly associated with Down syndrome? Are the p-values correctly interpretable in this case? Why (not)?
::::
\
:::: {.blueboxx data-latex=""}
<a href="javascript:void()" onclick="toggle_visibility('foo7');">
**_Answers (Click here)_**
</a>
<div id="foo7" style="display:none">
DS is the reference (the first level when running `levels(glm_datmat$outc)`). The first joint component is significant. However, the components themselves are estimated, this extra uncertainty is not incorporated in the p-value.
</div>
::::
\
```{r Logistic regression with JPC}
glm_datmat <- data.frame(JPC=scores(fit, "Xjoint"),
outc = ClinicalVars$group, age=ClinicalVars$age, sex=ClinicalVars$sex)
glm(outc ~ ., data = glm_datmat%>% filter(outc != "MA"), family = "binomial") %>%
summary()
```
# Interpretation of the top genes
We saw that joint methylation component one seemed to be significantly associated with Down syndrome: the mean scores differed significantly between DS and SB. Of interest is the genes corresponding with the top CpG sites, are their target genes representing some biological pathway? To this end, we use String-DB to cluster the top genes. Although there is an R package for String-DB, we are going to use [the String-DB website](https://string-db.org/). On the website, click "multiple proteins". The input there is the list of top genes.
Although determining a threshold to select the number of 'top' CpG sites is not straightforward, we are going to select 200 based on earlier analysis of these data. We also need to map from cg ID to gene ID.
```{r Interpretation based on top genes}
top_cg <- order(loadings(fit, subset=1)^2,decreasing = TRUE)
gene_list <- CpG_groups[top_cg[1:200]]
gene_list %<>% paste0(collapse = ";") %>%
str_split(";") %>% unlist %>% unique
```
## Using string-DB
Copy-paste the top genes in the String-DB website.
:::: {.bluebox .question data-latex=""}
Is there any remarkable clustering visible? Go to the analysis tab, is there any significant enrichment?
::::
\
:::: {.blueboxx data-latex=""}
<a href="javascript:void()" onclick="toggle_visibility('foo8');">
**_Answers (Click here)_**
</a>
<div id="foo8" style="display:none">
From the [https://string-db.org/](https://string-db.org/) website, choose homo sapiens as organism and paste the gene list. There is one clear cluster with many KRTAP genes, the other genes seem less clustered. Note that KRTAP (keratine associated proteins) are all located on chromosome 21. Enrichment analysis shows the complexity of the genetic architecture of DS, it is somewhat dominated by keratin related terms. The PubMed enrichment shows multiple terms investigating Down syndrome.
</div>
::::
\
```{r Paste the result in e.g. string}
# gene_list %>% cat(sep="\n")
```
## Using DisGeNet
You can also use the DisGeNet R package to perform Disease-gene enrichment. If you cannot install the package, the output is given below.
:::: {.bluebox .question data-latex=""}
Which disease clusters are most significant? What does this say about the top genes based on integrating methylation and glycomics data?
::::
\
:::: {.blueboxx data-latex=""}
<a href="javascript:void()" onclick="toggle_visibility('foo9');">
**_Answers (Click here)_**
</a>
<div id="foo9" style="display:none">
The top hit is Down syndrome. The top genes apparently contain more genes that are associated with Down syndrome than you would expect by chance.
</div>
::::
\
```{r Run DisGeNet analysis, message=FALSE}
### Bonus, run this code to perform DisGeNet enrichment
# disgenet_api_key <- "271e054761763b144a97872b059fd573186bdd9f"
options(timeout=4e9)
httr::timeout(4e9) # if needed to give curl more time
DGN_DE <- disgenet2r::disease_enrichment(gene_list, database = "ALL")
DGN_DE@qresult[1:10,
c("Description", "FDR", "Ratio", "BgRatio")]
```
```
Description FDR Ratio BgRatio
236 Down Syndrome 4.983930e-35 38/65 766/21666
2736 Complete Trisomy 21 Syndrome 2.709284e-34 36/65 669/21666
2320 DOWN SYNDROME CRITICAL REGION 9.293346e-19 13/65 57/21666
1853 Chromosome 21 monosomy 1.849628e-07 5/65 13/21666
2468 Alzheimer disease type 1 1.452115e-05 3/65 3/21666
169 Cognition Disorders 8.819130e-05 12/65 607/21666
375 Hirschsprung Disease 8.819130e-05 10/65 384/21666
548 Mental Retardation 1.006005e-04 11/65 505/21666
207 Presenile dementia 2.703029e-03 11/65 718/21666
303 Fragile X Syndrome 4.152165e-03 6/65 194/21666
```
BONUS: You can also combine the String-DB and DisGeNet analyses by making an interaction netwerk of the genes in a particular disease term. Below is the code to print the genes that are in the Down Syndrome disease term. This list can be copy-pasted into String-DB and analyzed.
```{r Subset DisGeNet DS}
DGN_DownS <- DGN_DE@qresult %>%
filter(Description == "Down Syndrome") %>%
pull(shared_symbol) %>% str_split(";") %>% unlist
# cat(DGN_DownS, sep="\n")
```