-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathFeature_annotation_private_dataset.qmd
388 lines (309 loc) · 14 KB
/
Feature_annotation_private_dataset.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
---
title: "Predict formula and structure of feature from an XcmsExperiment object using Sirius through the RuSrius package."
format: html
editor: visual
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(Spectra)
library(MsExperiment)
library(xcms)
library(Rsirius)
library(MetaboAnnotation)
library(RuSirius)
library(MetaboCoreUtils)
```
## Introduction
This vignette demonstrate a basic workflow to import detected feature of an
`XcmsExperiement` object into *Sirius* and run the main *Sirius* tool such as
formula identification, structure annotation, compound class prediction and
spectral library matching and finally retrieve the results.
This is a foundational example and does not cover all the possible parameters
for each Sirius tool. For detailed parameter information, consult the `run()`
function documentation. More information can be found in the [Sirius
documentation online](https://v6.docs.sirius-ms.io/).
The data used in this vignette is not yet publicly available. If you do not have
it please use the other vignette in this package named "Annotation of
chromatographic peaks from an xcmsExperiment object Sirius."
**IMPORTANT:** This is a work in progress. Feedback is highly valued, especially
regarding enhancements or additions that could simplify your workflow. Your
input as a user is essential.
## Load data
We import mzML files of a mixture of standards spiked in blood (or plasma ?).
```{r, warning=FALSE}
pth <- file.path( "~/GitHub/RuSirius_tmp/vignettes/doc/")
fls <- list.files(pth, pattern = "mzML", full.names = TRUE)
fls <- fls[1:5]
mse <- readMsExperiment(spectraFiles = fls)
```
## Processing example dataset
Here, we apply pre-optimized parameters for processing the example dataset.
```{r message=FALSE, warning=FALSE}
cwp <- CentWaveParam(ppm = 50,
peakwidth = c(2, 18),
snthresh = 5,
mzdiff = 0.001,
prefilter = c(4, 300),
noise = 100,
integrate = 2)
data <- findChromPeaks(mse, param = cwp)
#' Peak refinement
mnp <- MergeNeighboringPeaksParam(expandRt = 3.5, expandMz = 0.001,
minProp = 3/4)
data <- refineChromPeaks(data, param = mnp)
#' Alignment
sampleData(data)$highRes <- TRUE
pdp1 <- PeakDensityParam(sampleGroups = sampleData(data)$highRes, bw = 3,
minFraction = 0.7, binSize = 0.015)
data <- groupChromPeaks(data, param = pdp1)
pgp <- PeakGroupsParam(minFraction = 0.8, extraPeaks = 100, span = 0.8)
data <- adjustRtime(data, param = pgp)
#' Correspondence analysis
pdp2 <- PeakDensityParam(sampleGroups =sampleData(data)$highRes, bw = 3,
minFraction = 0.3, binSize = 0.015) #change 0.3 to higher if find psectra
data <- groupChromPeaks(data, param = pdp2)
#' Gap-filling
data <- fillChromPeaks(data, param = ChromPeakAreaParam())
nrow(featureDefinitions(data))
```
## Match features to known spiked compounds
Below we load our list of spiked standard. Using `matchMz()` we match the
theoretical m/z of the standards to the m/z of the features in the data. We
filter the matches to keep only the ones that are within 10 seconds of the
target retention time.
```{r}
std <- read.table("standards_dilution.txt",
sep = "\t", header = TRUE)
std$exactmass <- calculateMass(std$formula)
adducts <- c("[M+H]+", "[M+2H]2+", "[M+Na]+", "[M+K]+", "[M+NH4]+",
"[M+H-H2O]+", "[M+H+Na]2+", "[M+2Na]2+", "[M+H-NH3]+",
"[M+2Na-H]+", "[M+2K-H]+",
"[2M+H]+", "[M+H-H4O2]+", "[M+H-Hexose-H2O]+", "[M+H-CH2O2]+")
query <- featureDefinitions(data)
query$fts_id <- rownames(featureDefinitions(data))
prm <- Mass2MzParam(adducts = adducts, ppm = 30)
mtchs <- matchMz(query, std, param = prm, mzColname = "mzmed")
mtchs_sub <- mtchs[whichQuery(mtchs)]
mD <- matchedData(
mtchs_sub, columns = c("fts_id", "mzmed", "ppm_error", "rtmed",
"target_RT", "target_name", "adduct",
"target_formula"))
mD <- mD[which(abs(mD$rtmed - mD$target_RT) < 10), ]
mD <- mD[order(mD$target_name), ]
```
## Select features for annotation
There are 55 features that we could import into Sirius.
## Spectra extraction
To prepare the data for *Sirius*, we extract the MS2 and MS1 spectra for the
features we selected. It is important to ensure that all the feature that we
want to import have both MS1 and MS2 spectra.
We also extract the necessary metadata for these features thought the parameter
`featureColumns`.
```{r warning=FALSE}
ms2_spectra <- featureSpectra(data, expandRt = 3, return.type = "Spectra",
ppm = 20, expandMz = 0,
featureColumns = c("mzmed", "rtmed", "rtmin", "rtmax"))
## filter ms2
low_int <- function(x, ...) x > max(x, na.rm = TRUE) * 0.05
ms2_spectra <- filterIntensity(ms2_spectra, intensity = low_int)
ms2_spectra <- ms2_spectra[lengths(ms2_spectra) > 1]
features <- unique(spectraData(ms2_spectra)$feature_id)[1:10] #reduce to make import faster.
features
## now we extract the ms1 spectra
ms1_spectra <- featureSpectra(data, return.type = "Spectra",
features = features,
msLevel = 1L,
featureColumns = c("mzmed", "rtmed",
"rtmin", "rtmax"),
chromPeakColumns = c("maxo"),
method = "closest_rt")
## filter ms1
ms1_spectra <- filterIntensity(ms1_spectra, intensity = low_int)
ms1_spectra <- ms1_spectra[lengths(ms1_spectra) > 1]
```
As of now, the process to import data into Sirius is extremely slow, with the
MS1 spectra being a bottle neck. As of now we advise to only import one spectra
per feature to facilitate this. The user can combine the MS1 spectra of all
chrompeaks for a feature using `combinePsectra()` or (and we will do this below)
we select the MS1 spectra with the highest intensity. Therefore for each feature
we only keep the MS1 spectra that has the highest "maxo".
```{r}
sp <- split(spectraData(ms1_spectra), spectraData(ms1_spectra)$feature_id)
cp <- vapply(sp, function(x){
x$chrom_peak_id[which.max(x$chrom_peak_maxo)]
}, character(1))
ms1_spectra <- ms1_spectra[which(spectraData(ms1_spectra)$chrom_peak_id %in% cp)]
```
## Open Sirius and project set up
The Sirius application is initialized via the API, requiring only a project ID.
If the project exists, it is opened; otherwise, a new project is created. The
`srs` object acts as the connection to Sirius and holds project details.
Properly shut down the connection with `shutdown(srs)` after completing your
work.
This `srs` variable is needed for any task that necessitate to communicate with
the application. You can learn more about this object class by running `?Sirius`
in the console. Below I do not precise the `path` parameter, by default Sirius
will save your project in the `sirius_projects` folder in your user directory.
If you want to save it somewhere else you can specify the `path =` parameter.
```{r}
# Initialize Sirius connection
srs <- Sirius(projectId = "standards", path = "C:/Users/phili/sp")
srs
```
You can find all the utility functions of this package by running `?Utils` in
the console.
**NOTE** if you have any idea of utility functions that could be implemented do
not hesitate to ask.
## Feature import
Preprocessed `xcms` data is imported into Sirius, and a summary `data.frame` is
returned with feature information. This information can also be retrieved using
the utility function `featuresInfo()`.
```{r message=FALSE, warning=FALSE}
## Import data into Sirius
srs <- import(sirius = srs, ms1Spectra = ms1_spectra, ms2Spectra = ms2_spectra)
## See information about the features
featuresInfo(srs) |> head()
```
Notes:
- It could also be discussed that this `data.frame` could be stored direction
into the `srs` object
- When running `import()` i automatically create a mapping data.frame between
the *xcms* feature ID and the *Sirius* feature ID. It is stored in the `srs`
object, the `featureMap` slot. This can be used in the future so the user
never need to interact with the *Sirius* IDs.
Below is an example of how to extract features ID, the utility function
`featuresId()` quickly extract all available ID either `sirius` or `xcms`.
```{r}
fts_id <- featuresId(srs, type = "sirius")
```
## Searchable database
Whether it is for structure prediction or spectral library matching, users can
upload their custom databases into Sirius. In this vignette, we demonstrate how
to test spectral library matching by creating and loading a custom database into
Sirius. This process can also be completed easily via the Sirius graphical user
interface (GUI). If you prefer an interactive approach, you can use the
`openGUI(srs)` command to open the Sirius app and manage your database directly.
In this example, we download the MassBank library from GNPS, which needs to be
loaded into Sirius to generate a `.sirius.db` file. Below we will download in
our current directory but you can precise where you want to save it using
`location =` parameter.
```{r}
## Download the MassBank EU library
download.file("https://external.gnps2.org/gnpslibrary/MASSBANKEU.mgf",
destfile = "MASSBANKEU.mgf")
createDB(srs, databaseId = "massbankeuCustom", files = "MASSBANKEU.mgf")
```
NOTE: THis takes quite a while, will change to a smaller database later. Once
the database is created and loaded, you can verify its successful import by
running the following command:
```{r}
listDBs(srs)
```
Find more on how to handle databases in Sirius by typing `?siriusDBs` in the
console.
## Submit job to Sirius
Annotation and prediction begin after data import. The `run()` function accepts
parameters for each Sirius tool, such as formula identification, structure
database search, and compound class prediction. Parameters can also specify
adducts or custom databases. Detailed documentation for these parameters is
available in the `run()` function's help file.
The `wait` parameter ensures the function waits for job completion before
proceeding. If set to `FALSE`, the job ID is returned, and the user must check
the status using `jobInfo()`.
```{r warning=FALSE}
## Stat computation
job_id <- run(srs,
detectableAdducts = c("[M+H]+", "[M+Na]+", "[M+K]+",
"[M+H-H2O]+", "[M+2Na-H]+", "[M+2K-H]+",
"[M+H-H4O2]+"),
spectraSearchParams = spectraMatchingParam(
spectraSearchDBs = c("massbankeuCustom")),
formulaIdParams = formulaIdParam(numberOfCandidates = 5,
numberOfCandidatesPerIonization = 2,
isotopeMs2Settings = c("IGNORE"),
minPeaksToInjectSpecLibMatch = 4),
predictParams = predictParam(),
structureDbSearchParams = structureDbSearchParam(),
recompute = TRUE
)
srs
```
```{r}
## Get info for the job
jobInfo(srs, job_id)
```
## Retrieve results
To obtain a summary of all results, including the top formulas, structures, and
compound class predictions, use the following code. This summary table provides
a quick overview to evaluate whether the results align with expectations.
However, we recommend not relying on this table as-is for detailed analysis.
Instead, use the functions described later in this vignette to explore the
results in greater depth.
An important aspect of the summary table is the confidence-related columns,
which provide insight into the reliability of the predictions.
```{r, warning=FALSE}
summarytb <- summary(sirius = srs, result.type = "structure")
head(summarytb)
```
## Formula identification results:
For detailed results, the results() function can be used with the `result.type`
parameter set to `"formulaId"`, `"structureDb"`, `"compoundClass"`, or
`"deNovo"`. Note that all results are linked to a predicted formula.
The parameters `topFormula` and `topStructure` allow users to specify how many
formulas or structures should be included in the output. The results can be
returned either as a list or a data.frame, based on the return.type parameter.
Note: Suggestions for renaming the `results()` function or feedback on this
implementation are welcome. We aim to adapt based on user needs.
```{r get-res}
finaltb <- results(srs,
return.type = "data.frame",
result.type = "formulaId",
topFormula = 5)
finaltb
```
### Structure DBs search results
The following example shows the top two structure annotations for the top five
formulas of each feature. This can provide an insightful view into the
structural predictions.
```{r, warning=FALSE}
finalstructredb <- results(srs,
return.type = "data.frame",
result.type = "structureDb",
topFormula = 5,
topStructure = 2)
finalstructredb
```
For a more visual exploration of the results, you can open the Sirius GUI with
the commands below:
```{r}
# openGUI(srs)
# closeGUI(srs)
```
### Compound class prediction results
To retrieve compound class predictions, use the following code. Below is an
example showing all compound annotations with confidence scores above 50% for
the top two formulas of each feature.
```{r, warning=FALSE}
finalcomp <- results(srs,
return.type = "data.frame",
result.type = "compoundClass",
topFormula = 5,
topStructure = 2)
head(finalcomp)
```
We are not showing *de novo* structure prediction here but the other vignette
discuss it.
Future developments ideas:
- Return fragmentation tree/single peak annotation
- Implement spectra merging within xcms
- docker image
- filtering based on.. (adduct,..)
- use formula results for feature grouping in xcms
## Clean up
```{r}
#delete results
file.remove(projectInfo(srs)$location)
# Close the Sirius session
shutdown(srs)
```