---
title: "How to create a statistical report with R Markdown"
author: "Anja Eggert"
date: "`r Sys.Date()`"
output: html_document
---
<style>
p.comment {
background-color: #DBDBDB;
padding: 10px;
border: 1px solid black;
margin-left: 0px;
border-radius: 5px;
}
</style>
```{r load.libraries, message = FALSE, warning = FALSE}
library(tidyverse) # tibble, pipe, etc.
library(readxl) # read in Excel data directly, in tibble format
library(factoextra) # beautiful cluster graphs for kmeans()
library(knitr) # R Markdown and html output
library(kableExtra) # pretty html tables
```
\
# Chunk options
The initial line of a code chunk may include various options. For example, `echo = FALSE` indicates that the code will not be shown in the final document (though any results/output would still be displayed). Use `results = "hide"` to hide the results/output (the code would still be displayed), and `include = FALSE` to have the chunk evaluated but neither the code nor its output displayed. Many more options exist, e.g. `eval`, `message`, `warning`, and the figure dimensions `fig.width` and `fig.height`.
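As a minimal illustration (the chunk label `chunk.options.demo` and its contents are just placeholders), the following chunk is evaluated when the report is knitted, but its code is hidden and only the output appears:
```{r chunk.options.demo, echo = FALSE}
# echo = FALSE: this code is hidden in the rendered report, only the output shows
summary(rnorm(10)) # placeholder data
```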
\
# Example data set
Shotgun lipidomics was applied and 335 single metabolites were measured in two experimental groups with 14 samples in the control group and 14 samples in the treatment group. The single metabolites can be assigned to 15 lipid classes.
Lipid concentrations are given normalized to total lipids: nmol/mg lipid.
**Aim:** Detect significant differences between the two groups for single fatty acid metabolites (or for sums of metabolites).
\
# Read in data
Typically, data sets are saved in Excel files. Here we read in the data using the function `read_excel()` from the **readxl-R-package**.
Then we apply the following steps:
* We have a quick look at the data frame, e.g. using `str()`.
* We re-code the group labels with clearer labels.
* We replace zeros and missing values with the detection limit of the lab method.
* We exclude the columns with sums of lipid classes; these can be calculated in R.
* We reshape the data frame from wide to long format.
* We export the data as a csv-file.
\
```{r read.data, message = FALSE, warning = FALSE}
dat <- read_excel("./data/data-lipidomics-28samples.xlsx")
dat[, 1:6]
# replace "k" and "v" in group with "control" and "treatment"
dat <- dat %>%
  mutate(group = case_when(
    group == "k" ~ "control",
    group == "v" ~ "treatment"))
# replace all NAs and zeros with the detection limit (= 0.001)
dat <- dat %>%
  replace(., . == 0 | is.na(.), 0.001)
# reshape the data frame to long format,
# keeping only the columns with single lipids
dat <- dat %>%
  dplyr::select(sample, group, `LEG in kg`, `ADFI in g`, contains(":")) %>%
  gather(key = "lipid", value = "concentration", `PC 14:0_16:0`:`LPG 18:1`)
# export as .csv (open data format), only if this was not done yet
# if-statement: if (condition) {TRUE-block}
if (!file.exists("./data/data-lipidomics-28samples.csv")) {
  write.csv(dat, file = "./data/data-lipidomics-28samples.csv", row.names = FALSE)
}
```
\
# Look at the data
Here we show the data in an awesome HTML-style table using the **kableExtra-R-package**.
To learn how to generate these tables, you can visit: [link](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html)
\
```{r data, message = FALSE, warning = FALSE}
dat %>%
  arrange(sample) %>%
  tibble::rowid_to_column("#") %>%
  kable(caption = "Data set - long format") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), font_size = 10) %>%
  scroll_box(width = "600px", height = "500px")
```
\
# Create summary statistics and plots
First, we look at the differences in the single-lipid concentrations between the two groups:
```{r data.summary, message = FALSE, warning = FALSE}
dat %>%
  group_by(lipid, group) %>%
  summarize(mean_conc = mean(concentration),
            sd_conc   = sd(concentration))
```
\
We plot the results, here only the seven 22:6 lipids. We use `facet_wrap()` to combine all individual plots in one figure.
```{r boxplots, message = FALSE, warning = FALSE, fig.width=12, fig.height=10}
dat %>%
  dplyr::filter(grepl("22:6", lipid)) %>%
  mutate(lipid = as.factor(lipid)) %>%
  mutate(lipid = reorder(lipid, -concentration)) %>%
  ggplot(aes(x = group, y = concentration, fill = group)) +
  geom_boxplot() +
  facet_wrap(~lipid, ncol = 3, scales = "free_y") +
  scale_fill_manual(values = c("cornflowerblue", "orange")) +
  xlab(label = "") +
  ylab(label = expression(paste("Concentration (nmol ", mg^-1, " lipid)"))) +
  theme_bw() +
  theme(legend.position = "none")
# You can also save the plot as a png-file:
# ggsave(filename = "lipids-plot.png", width = 20, height = 24, units = "cm")
```
\
# Multiple *t*-tests
Metabolomics data often follow a log-normal distribution, i.e. the log-transformed data are normally distributed. We simply assume this here (without testing it).
We extract the *p*-values from the test results. Use `?t.test` to look at the structure of the returned list.
```{r mult.test, message = FALSE, warning = FALSE}
p.val <- dat %>%
  group_by(lipid) %>%
  summarise(p.wil       = wilcox.test(concentration ~ group)$p.value,
            p.ttest.log = t.test(log(concentration) ~ group)$p.value) %>%
  arrange(p.ttest.log)
# quick look at the generated p-values
p.val
```
\
# False Discovery Rate (FDR)
<div class="alert alert-info">
<strong>Attention!</strong> Modern high-throughput technologies (micro-array data, gene expression, metabolomics data, *etc.*) generate high-dimensional data: $p \gg n$, i.e. many more measured features than samples.
</div>
## Some theory
<p class="comment">
How do we control the error rates in multiple comparisons?
One way is to control the **familywise error rate (FWER)**, i.e. the probability of making at least one false rejection of a true null hypothesis (type I error). The simplest adjustment is the Bonferroni correction, where we call a particular comparison statistically significant only when its *p*-value is less than $\alpha$ (often 0.05) divided by the number of comparisons $m$:
$$p<\frac{\alpha}{m}$$
This approach is very conservative, as the threshold becomes very small when many comparisons are made.
**False discovery rate (FDR)** controlling procedures (e.g. Benjamini-Hochberg, BH) instead control the expected proportion of wrongly rejected null hypotheses among those that are rejected. FDR methods are more powerful than FWER methods and are the method of choice for high-dimensional data.
</p>
\
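A minimal sketch of the BH step-up rule (with made-up *p*-values, purely for illustration): sort the *p*-values, find the largest rank $k$ with $p_{(k)} \le \frac{k}{m}\,\alpha$, and reject the hypotheses with the $k$ smallest *p*-values. Base R's `p.adjust()` yields the same decisions:
```{r bh.sketch, message = FALSE, warning = FALSE}
# made-up p-values, purely for illustration
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
m <- length(p)
alpha <- 0.05
# BH step-up: largest k with p_(k) <= k/m * alpha (0 if none)
k <- max(c(0, which(sort(p) <= (1:m) / m * alpha)))
k
# same number of rejections via base R's p.adjust()
sum(p.adjust(p, method = "BH") <= alpha)
```
\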
## Our example
The data set obtained here using shotgun lipidomics is a typical high-dimensional data set, i.e. hundreds of lipid concentrations were measured in far fewer independent replicates: 335 lipid species in a control group with 14 samples and a treatment group with 14 samples.
Accordingly, we made 335 comparisons, i.e. 335 individual *t*-tests were performed. Have a quick look at the distribution of the 335 *p*-values.
\
```{r, message=FALSE, warning=FALSE}
sum(p.val$p.ttest.log < 0.05)
hist(p.val$p.ttest.log, breaks = seq(0, 1, 0.05))
```
\
*p*-values are random variables. Mathematically, one can show that under the null hypothesis (provided the test's assumptions are met, so that the test statistic follows its null distribution), *p*-values follow a Uniform(0,1) distribution.
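We can check this with a quick simulation (simulated null data, not our measurements; the 14-vs-14 sample sizes simply mirror our design):
```{r p.uniform.sketch, message = FALSE, warning = FALSE}
# 1000 t-tests on data where H0 is true: p-values are approximately Uniform(0,1)
set.seed(123)
p.null <- replicate(1000, t.test(rnorm(14), rnorm(14))$p.value)
hist(p.null, breaks = seq(0, 1, 0.05))
```
\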
For 222 of the 335 lipids the *p*-value is < 0.05. Do all of them have statistically different concentrations in treatment vs. control? **NO!** We need to correct for multiple comparisons.
One way is to limit the FWER, e.g. by applying the Bonferroni adjustment, i.e. to decrease the individual error rate to:
$$p < \frac{\alpha}{m} = \frac{0.05}{335} = 0.000149$$
As the Bonferroni approach is very conservative, we instead apply a method controlling the **False Discovery Rate (FDR)**, limiting the expected proportion of false positives among all significant features, here to 5%.
We can use a function from base R `p.adjust()`:
\
```{r pvals, message=FALSE, warning=FALSE}
p.adj.BH <- p.adjust(p.val$p.ttest.log, method = "fdr")
# unadjusted p-values
sum(p.val$p.ttest.log < 0.05)
# p-values with Bonferroni adjustment
sum(p.val$p.ttest.log < 0.05/335)
# p-values with FDR (Benjamini-Hochberg) adjustment
sum(p.adj.BH < 0.05)
```
\
# K-means clustering
## Some theory
<p class="comment">
*K-means clustering* is a popular **unsupervised machine learning algorithm** for partitioning a given data set into a set of *k* groups. The number *k* needs to be pre-specified. The objects within the same cluster are as similar as possible, whereas objects from different clusters are as dissimilar as possible. Each cluster is represented by its centroid, i.e. the mean of the points assigned to the cluster.
</p>
\
The k-means algorithm can be summarized as follows:
\
<div class="alert alert-info">
1. Specify the number of clusters K.
2. Initialize the centroids by first shuffling the data set and then **randomly selecting** K data points as centroids, without replacement.
3. Compute the squared distance between each data point and all centroids, and assign each data point to the closest centroid (cluster).
4. Compute the centroids of the new clusters by taking the average of all data points that belong to each cluster.
5. Iterate steps 3 and 4 until the assignment of data points to clusters no longer changes (see the sketch below).
</div>
\
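A minimal, educational sketch of these steps (a Lloyd-style iteration loop; `X` and `k` are placeholder inputs, there is no handling of empty clusters, and this is *not* the algorithm `kmeans()` uses by default):
```{r kmeans.sketch, eval = FALSE}
# educational sketch of the k-means steps listed above
naive_kmeans <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # step 2: random initialization
  assign.old <- rep(0, nrow(X))
  for (i in seq_len(max.iter)) {
    # step 3: squared distance of every point to every centroid,
    # then assign each point to its closest centroid
    d2 <- sapply(seq_len(k), function(j)
      rowSums((X - matrix(centroids[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
    assign.new <- max.col(-d2)
    if (all(assign.new == assign.old)) break            # step 5: converged
    assign.old <- assign.new
    # step 4: new centroid = mean of the points in each cluster
    for (j in seq_len(k))
      centroids[j, ] <- colMeans(X[assign.new == j, , drop = FALSE])
  }
  list(cluster = assign.new, centers = centroids)
}
```
\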
The standard R function for k-means clustering is `kmeans()` from the **stats-R-package** (already part of base R):
\
```{r kmeans.func, eval = FALSE}
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                     "MacQueen"), trace = FALSE)
```
\
**Arguments:**
* `x`: numeric matrix, numeric data frame or a numeric vector
* `centers`: Possible values are the number of clusters (k) or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
* `iter.max`: The maximum number of iterations allowed. Default value is 10.
* `nstart`: The number of random starting partitions when centers is a number. Trying nstart = 25 is often recommended.
<div class="alert alert-info">
As the final result of k-means clustering is sensitive to the random starting assignments, we specify nstart = 25. This means that R tries 25 different random starting assignments and then selects the best result, i.e. the one with the lowest within-cluster variation. The default value of nstart in R is one, but it is strongly recommended to compute k-means clustering with a larger value of nstart, such as 25 or 50, to obtain a more stable result.
</div>
\
## Computing k-means clustering
The `kmeans()` function requires a matrix of variables, i.e. a wide data table, which we obtain with the function `spread()`. Here we also scale the data.
\
```{r, message=FALSE, warning=FALSE}
dat.wide <- dat %>%
  spread(lipid, concentration) %>%
  dplyr::select(contains(":"))
# scaling the data (zero mean, unit variance per lipid)
dat.wide.scaled <- scale(dat.wide)
```
\
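The number of clusters *k* must be pre-specified. One common heuristic (optional here, as we expect the two experimental groups) is the elbow plot of the total within-cluster sum of squares, e.g. with `fviz_nbclust()` from the **factoextra-R-package**:
```{r elbow.sketch, message = FALSE, warning = FALSE}
set.seed(42)
# total within-cluster sum of squares for k = 1..10; look for the "elbow"
fviz_nbclust(dat.wide.scaled, kmeans, method = "wss")
```
\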
<div class="alert alert-info">
As k-means clustering algorithm starts with k randomly selected centroids, it’s always recommended to use the `set.seed()` function in order to set a seed for R’s random number generator. The aim is to make the results **reproducible !!**
</div>
\
```{r}
# Compute k-means with k = 2
set.seed(42)
km.res <- kmeans(dat.wide.scaled, centers = 2, nstart = 25)
```
\
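To see why `nstart` matters for our data, here is a quick (illustrative) comparison of the total within-cluster sum of squares obtained with one vs. 25 random starts:
```{r nstart.sketch, message = FALSE, warning = FALSE}
set.seed(42)
# a single random start may end in a worse local optimum ...
kmeans(dat.wide.scaled, centers = 2, nstart = 1)$tot.withinss
# ... than the best of 25 random starts (the two values may also coincide)
kmeans(dat.wide.scaled, centers = 2, nstart = 25)$tot.withinss
```
\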
The `kmeans()` function returns a **list** of components, including:
* `cluster`: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
* `centers`: A matrix of cluster centers (cluster means)
* `totss`: The total sum of squares (TSS); it measures the total variance in the data.
* `withinss`: Vector of within-cluster sum of squares, one component per cluster
* `tot.withinss`: Total within-cluster sum of squares, i.e. sum of withinss
* `betweenss`: The between-cluster sum of squares, i.e. totss minus tot.withinss
* `size`: The number of observations in each cluster
These components can be accessed as follows:
```{r}
# structure of the results (a list)
str(km.res)
# Cluster number for each of the observations
km.res$cluster
# Cluster size
km.res$size
# Cluster means
km.res$centers[, 1:5]
```
\
## Visualizing k-means clusters
We can visualize the data in a scatter plot, coloring each data point according to its cluster assignment.
<div class="alert alert-info">
<strong>Attention!!</strong> We have a multi-dimensional data set; a solution is to reduce dimensionality with Principal Component Analysis (PCA) and to plot the data points according to the first two principal component coordinates.
</div>
The **factoextra-R-package** provides elegant ggplot2-based visualizations of partitioning methods, including k-means. Here we use the function `fviz_cluster()`.
\
```{r}
fviz_cluster(km.res, data = dat.wide.scaled,
             palette = c("firebrick", "cornflowerblue"),
             ggtheme = theme_minimal(),
             main = "Partitioning Clustering Plot")
```
\
**THE END**
```{r session.info}
sessionInfo()
```