forked from nathaliagg/docker_xcms
-
Notifications
You must be signed in to change notification settings - Fork 0
/
template_pre-processing.Rmd
275 lines (179 loc) · 6.43 KB
/
template_pre-processing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
---
output: html_document
editor_options:
chunk_output_type: console
---
## PROJECT NAME
## Load necessary libraries
```{r}
library(tidyverse)
library(xcms)
## Modify appropriately
doParallel::registerDoParallel(10)
## customFunnctions.R is from "https://raw.githubusercontent.com/jorainer/xcms-gnps-tools/master/customFunctions.R"
source("https://raw.githubusercontent.com/jorainer/xcms-gnps-tools/master/customFunctions.R")
```
## 0 - Import data
**Important** Data was already in centroid mode.
```{r}
# define metadata, ie, phenotypic data
pd <- read.csv("path/to/metadata.csv")
# get filenames from directory
# remember that pd$FileName is a column in pd
files <- as.vector(paste0("path/to/data_mzxml/", pd$FileName))
# Read the data:
data_cent <- readMSData(files,
pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk",
verbose = TRUE)
```
The MS experiment is now represented as an `OnDiskMSnExp` object.
Information stored in the `OnDiskMSnExp` object after import:
```{r}
show(data_cent)
```
Save R object as `.RData`
Save as an R object so later, you don't need to re-import the data
```{r}
## Save data object as rds
save(data_cent, pd, file = "data_centroided.RData")
```
## 1 - Load data
```{r}
# optional
# load("data_centroided.RData")
```
## 2 - Pre-processing
#### 2.1 - Chromatographic peak detection
```{r}
## Set the CentWaveParam object
cwp <- CentWaveParam(
peakwidth = c(10, 80),
ppm = 25, # default
snthresh = 10, # default
noise = 1e3,
mzdiff = -0.001
)
## peak detection
data_cent <- findChromPeaks(data_cent, param = cwp)
```
Extract information about number of peak detected per sample.
```{r}
chrom_peaks_df <- as.data.frame(chromPeaks(data_cent))
n.peaks.sample <- chrom_peaks_df %>% count(sample)
colnames(n.peaks.sample) <- c('sampleIndex', 'totalPeaksDetected')
n.peaks.sample <- merge(n.peaks.sample, pData(data_cent), by.x = "sampleIndex", by.y = 0)
n.peaks.sample[,c(1,4,2)]
write.csv(x = n.peaks.sample, file = "NumberDetectedPeaks_per_sample.csv", row.names = F)
```
#### 2.2 - Alignment
```{r}
## Group peaks (these parameters will be used in correspondence too)
mfraction = 0.3
bandwith = 30 # default
size_overlap_slices = 0.25 # default
## a - define PeakDensityParam
pdp <- PeakDensityParam(sampleGroups = data_cent$LeafType,
bw = bandwith, # default
minFraction = mfraction,
binSize = size_overlap_slices # default
)
## b - Group peaks
data_cent <- groupChromPeaks(data_cent, pdp)
## Retention time correction using defaut parameters
## a - define PeakGroupsParam
pgp <- PeakGroupsParam(minFraction = mfraction)
## b - alignment
data_cent <- adjustRtime(data_cent, param = pgp)
```
```{r}
## test if object has adjusted retention time
hasAdjustedRtime(data_cent)
```
Retention time correction with either *obiwarp* or *peakGroups* is performed on all spectra including MS>1 levels if present in the [data](https://rdrr.io/bioc/xcms/man/adjustRtime-peakGroups.html)
#### 2.3 - Correspondence
```{r}
## define PeakDensityParam
pdp <- PeakDensityParam(sampleGroups= data_cent$LeafType,
bw = bandwith,
minFraction = mfraction,
binSize = size_overlap_slices
)
## perform correspondence analysis
data_cent <- groupChromPeaks(data_cent, param=pdp)
```
#### 2.4 - Fill-in missing values
Missing values occur if no chromatographic peak was assigned to a feature either because peak detection failed, or because the corresponding ion is absent in the respective sample.
```{r}
## determine the number of missing values
number_na_i = sum(is.na(featureValues(data_cent)))
```
Fill-in missing peak data by a specified ppm (5) and expanding the mz range by mz width/4 (0.25)
```{r warning=False}
## a - define parameter
fpp <- FillChromPeaksParam(ppm = 5, expandMz = 0.25)
## b - fill in
data_cent <- fillChromPeaks(data_cent, param=fpp)
```
Determine the number of missing values after filling in:
```{r}
number_na_f = sum(is.na(featureValues(data_cent)))
## remaining number of na values
number_na_f
## determine number of filled peaks
# number_na_i - number_na_f
```
--> End of preprocessing
## 3 - Feature Correspondence
```{r}
## extract feature values after filling in
fmat_fld <- featureValues(data_cent, value="into", method="maxint")
## replace NA with zero
fmat_fld[is.na(fmat_fld)] <- 0
## replace colnames with samplecode
fmat_fld <- as.data.frame(fmat_fld)
colnames(fmat_fld) <- as.vector(pd$SampleCode)
write.csv(file='featureCorrespondence.csv', fmat_fld)
```
## 4 - Feature Definitions
```{r}
## get feature definitions and intensities
featuresDef <- featureDefinitions(data_cent)
## merge feature definitions with correspondencce
dataTable <- merge(featuresDef, fmat_fld, by = 0, all = TRUE)
dataTable <- dataTable[, !(colnames(dataTable) %in% c("peakidx"))]
colnames(dataTable)[1] <- "Features"
write.csv(dataTable, "featureDefinitions.csv", quote = FALSE, row.names = FALSE)
```
## 5 - Saving Spectra information
```{r}
## spectra information of pre-processed data
## these data are useful for cloud plots
fdata <- fData(data_cent)
write.csv(fdata, "spectraInformation.csv")
```
## 6 - Extracting MS2
```{r}
# I modified the source function to write the feature name as the title of each MS2
source("modified_writeMgfData.R")
```
```{r warning=FALSE}
## export the individual spectra into a .mgf file
filteredMs2Spectra <- featureSpectra(data_cent, return.type = "Spectra", msLevel = 2)
filteredMs2Spectra <- clean(filteredMs2Spectra, all = TRUE)
filteredMs2Spectra <- formatSpectraForGNPS(filteredMs2Spectra) # this is one of the custom funtions
filteredMs2Spectra_consensus <- combineSpectra(filteredMs2Spectra,
fcol = "feature_id",
method = consensusSpectrum,
mzd = 0,
minProp = 0.5,
ppm = 25,
intensityFun = median,
mzFun = median)
mod_writeMgfDataFile(filteredMs2Spectra_consensus,
"ms2spectra_consensus.mgf")
```
## 7 - Save pre-processed RData
```{r}
save.image(file = "xcms_pre-processed.RData")
```