-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy path05-workflow.Rmd
188 lines (108 loc) · 15.3 KB
/
05-workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
---
bibliography: references.bib
---
```{r}
dist_loc <- list.files(
find.package("DiagrammeR"),
recursive = TRUE,
pattern = "mermaid.*js",
full.names = TRUE
)
js_cdn_url <- "https://cdnjs.cloudflare.com/ajax/libs/mermaid/9.0.1/mermaid.min.js"
download.file(js_cdn_url, dist_loc)
```
# Workflow
You could check this book for metabolomics data analysis [@li2020a].
```{r}
DiagrammeR::mermaid("
flowchart TB
I(peak-picking) --> C
C(visulization) --> D(normalization/batch correction)
D --> A(annotation/identification)
A --> H(statistical analysis)
C --> A --> B(omics analysis)
D --> H
B --> H
H --> E(experimental validation)
A --> E
H --> A
B --> E
C --> H
")
```
## Platform for metabolomics data analysis
Here is a list for related open source [projects](http://strimmerlab.org/notes/mass-spectrometry.html)
### XCMS & XCMS online
[XCMS online](https://xcmsonline.scripps.edu/landing_page.php?pgcontent=mainPage) is hosted by Scripps Institute. If your datasets are not large, XCMS online would be the best option for you. Recently they updated the online version to support more functions for systems biology. They use metlin and iso metlin to annotate the MS/MS data. Pathway analysis is also supported. Besides, to accelerate the process, xcms online employed stream (windows only). You could use stream to connect your instrument workstation to their server and process the data along with the data acquisition automate. They also developed apps for xcms online, but I think apps for slack would be even cooler to control the data processing.
[xcms](https://bioconductor.org/packages/release/bioc/html/xcms.html) is different from xcms online while they might share the same code. I used it almost every data to run local metabolomics data analysis. Recently, they will change their version to xcms 3 with major update for object class. Their data format would integrate into the MSnbase package and the parameters would be easy to set up for each step. Normally, I will use msconvert-IPO-xcms-xMSannotator-metaboanalyst as workflow to process the offline data. It could accelerate the process by parallel processing. However, if you are not familiar with R, you would better to choose some software below. For xcms, 1000 files will need around 5 hours to generate the peaks list on a regular workstation.
[IPO](https://github.com/rietho/IPO) A Tool for automated Optimization of XCMS Parameters [@libiseller2015] and [Warpgroup](https://github.com/nathaniel-mahieu/warpgroup) is used for chromatogram subregion detection, consensus integration bound determination and accurate missing value integration[@mahieu2016]. A case study to compare different xcms parameters with IPO can be found for GC-MS [@dossantos2023]. Another option is AutoTuner, which are much faster than IPO[@mclean2020]. Recently, MetaboAnalystR 3.0 could also optimize the parameters for xcms while you need to perform the following analysis within this software[@pang2020]. For IPO, ten files will need \~12 hours to generate the optimized results on a regular workstation. [Paramounter](https://github.com/HuanLab/Paramounter) is a direct measurement of universal parameters to process metabolomics data in a “White Box”[@guo2022]. Another research use machine learning method to compare different optimization methods and they are all better than the default setting of xcms[@lassen2021]. It could be extended to include ion mobility[@dodds2022].
Check those papers for the XCMS based workflow[@forsberg2018; @huan2017; @mahieu2016a; @montenegro-burke2017; @domingo-almenara2020; @stancliffe2022]. For metlin related annotation, check those papers[@guijas2018; @tautenhahn2012; @xue2020; @domingo-almenara2018a].
[MAIT](https://www.bioconductor.org/packages/release/bioc/html/MAIT.html) based on xcms and you could find source code [here](https://github.com/jpgroup/MAIT)[@fernandez-albert2014a].
[iMet-Q](http://ms.iis.sinica.edu.tw/comics/Software_iMet-Q.html) is an automated tool with friendly user interfaces for quantifying metabolites in full-scan liquid chromatography-mass spectrometry (LC-MS) data [@chang2016]
compMS2Miner is an Automatable Metabolite Identification, Visualization, and Data-Sharing R Package for High-Resolution LC--MS Data Sets. Here is related papers [@edmands2017; @edmands2018; @edmands2015].
mzMatch is a modular, open source and platform independent data processing pipeline for metabolomics LC/MS data written in the Java language, which could be coupled with xcms [@scheltema2011; @creek2012]. It also could be used for annotation with MetAssign[@daly2014].
### PRIMe
[PRIMe](http://prime.psc.riken.jp/Metabolomics_Software/) is from RIKEN and UC Davis. They update their database frequently[@tsugawa2016]. It supports mzML and major MS vendor formats. They defined own file format ABF and eco-system for omics studies. The software are updated almost everyday. You could use MS-DIAL for untargeted analysis and MRMOROBS for targeted analysis. For annotation, they developed MS-FINDER and statistic tools with excel. This platform could replaced the dear software from company and well prepared for MS/MS data analysis and lipidomics. They are open source, work on Windows and also could run within mathmamtics. However, they don't cover pathway analysis. Another feature is they always show the most recently spectral records from public repositories. You could always get the updated MSP spectra files for your own data analysis.
For PRIMe based workflow, check those papers[@lai2018; @matsuo2017; @treutler2016; @tsugawa2015; @tsugawa2016; @kind2018]. There are also extensions for their workflow[@uchino2022] and workflow for environmental science[@bonnefille2023].
### GNPS
[GNPS](http://gnps.%20ucsd.edu) is an open-access knowledge base for community-wide organization and sharing of raw, processed or identified tandem mass (MS/MS) spectrometry data. It's a straight forward annotation methods for MS/MS data. Feature-based molecular networking (FBMN) within GNPS could be coupled with xcms, openMS, MS-DIAL, MZmine2, and other popular software. GNPS also have a dashboard for online mass spectrometery data analysis[@petras2021a].
Check those papers for GNPS and related projects[@aron2020; @nothias2020; @scheubert2017; @silva2018; @wang2016b;@bittremieux2023].
### OpenMS & SIRIUS
[OpenMS](https://www.openms.de/) is another good platform for mass spectrum data analysis developed with C++. You could use them as plugin of [KNIME](https://www.knime.org/). I suggest anyone who want to be a data scientist to get familiar with platform like KNIME because they supplied various API for different programme language, which is easy to use and show every steps for others. Also TOPPView in OpenMS could be the best software to visualize the MS data. You could always use the metabolomics workflow to train starter about details in data processing. pyOpenMS and OpenSWATH are also used in this platform. If you want to turn into industry, this platform fit you best because you might get a clear idea about solution and workflow.
Check those paper for OpenMS based workflow[@bertsch2011; @pfeuffer2017; @rost2014; @rost2016; @rurik2020; @alka2020;@pfeuffer2024].
OpenMS could be coupled to SIRIUS 4 for annotation. [Sirius](https://bio.informatik.uni-jena.de/software/sirius/) is a new java-based software framework for discovering a landscape of de-novo identification of metabolites using single and tandem mass spectrometry. SIRIUS 4 project integrates a collection of our tools, including [CSI:FingerID](https://www.csi-fingerid.uni-jena.de/), [ZODIAC](https://bio.informatik.uni-jena.de/software/zodiac/) and [CANOPUS](https://bio.informatik.uni-jena.de/software/canopus/). Check those papers for SIRIUS based workflow[@duhrkop2019; @duhrkop2020a; @alka2020; @ludwig2020].
### MZmine 2
[MZmine 2](http://mzmine.github.io/) has three version developed on Java platform and the lastest version is included into [MSDK](https://msdk.github.io/). Similar function could be found from MZmine 2 as shown in XCMS online. However, MZmine 2 do not have pathway analysis. You could use metaboanalyst for that purpose. Actually, you could go into MSDK to find similar function supplied by [ProteoSuite](http://www.proteosuite.org) and [Openchrom](https://www.openchrom.net/). If you are a experienced coder for Java, you should start here.
Check those papers for MZmine based workflow[@pluskal2010; @pluskal2020].
### Emory MaHPIC
This platform is composed by several R packages from Emory University including [apLCMS](https://sourceforge.net/projects/aplcms/) to collect the data, [xMSanalyzer](https://sourceforge.net/projects/xmsanalyzer/) to handle automated pipeline for large-scale, non-targeted metabolomics data, [xMSannotator](https://sourceforge.net/projects/xmsannotator/) for annotation of LC-MS data and [Mummichog](https://code.google.com/archive/p/atcg/wikis/mummichog_for_metabolomics.wiki) for pathway and network analysis for high-throughput metabolomics. This platform would be preferred by someone from environmental science to study exposome.
You could check those papers for Emory workflow[@uppal2013; @uppal2017; @yu2009b; @li2013; @liu2020].
### Others
- [PMDDA](https://yufree.github.io/pmdda/script.html) is a reproducible workflow for exhaustive MS2 data acquisition of MS1 features[@yu2022b] will data and script available online.
- [tidymass](https://www.tidymass.org/) is an object-oriented reproducible analysis framework for LC–MS data[@shen2022].
- [R for mass spectrometry](https://www.rformassspectrometry.org/) is a R software collection for the analysis and interpretation of high throughput mass spectrometry assays.
- [MAVEN](http://genomics-pubs.princeton.edu/mzroll/index.php?show=index) from Princeton University [@melamud2010; @clasquin2012].
- [metabolomics](https://github.com/cran/metabolomics) is a CRAN package for analysis of metabolomics data.
- [autoGCMSDataAnal](http://software.tobaccodb.org/software/autogcmsdataanal) is a Matlab based comprehensive data analysis strategy for GC-MS-based untargeted metabolomics and [AntDAS2](http://software.tobaccodb.org/software/antdas2) provided An automatic data analysis strategy for UPLC-HRMS-based metabolomics[@yu2019; @zhang2020].
- [enviGCMS](https://github.com/yufree/enviGCMS) from environmental non-targeted analysis and [rmwf](https://github.com/yufree/rmwf) for reproducible metabolomics workflow [@yu2020; @yu2019a].
- Pseudotargeted metabolomics method [@zheng2020; @wang2016a].
- [pySM](https://github.com/alexandrovteam/pySM) provides a reference implementation of our pipeline for False Discovery Rate-controlled metabolite annotation of high-resolution imaging mass spectrometry data [@palmer2017].
- [TinyMS](https://github.com/griquelme/tidyms) is a Python-Based Pipeline for Preprocessing LC--MS Data for Untargeted Metabolomics Workflows [@riquelme2020]
- [MetaboliteDetector](https://md.tu-bs.de/) is a QT4 based software package for the analysis of GC/MS based metabolomics data [@hiller2009].
- [W4M](http://workflow4metabolomics.org/) and [metaX](http://metax.genomics.cn/) could analysis data online [@giacomoni2015; @wen2017; @jalili2020].
- [FTMSVisualization](https://github.com/wkew/FTMSVisualization) is a suite of tools for visualizing complex mixture FT-MS data [@kew2017]
- [magma](http://www.emetabolomics.org/magma) could predict and match MS/MS files.
- [metabCombiner](https://github.com/hhabra/metabCombiner) Paired Untargeted LC-HRMS Metabolomics Feature Matching and Concatenation of Disparately Acquired Data Sets[@habra2021]
- [SLAW](https://github.com/zamboni-lab/SLAW) is a scalable and self-Optimizing processing workflow for Untargeted LC-MS with a docker image [@delabriere2021].
- [patRoon](https://github.com/rickhelmus/patRoon): open source software platform for environmental mass spectrometry based non-target screening [@helmus2021].
- 'shape-orientated' algorithm: A new 'shape-orientated' continuous wavelet transform (CWT)-based algorithm employing an adapted Marr wavelet (AMW) with a shape matching index (SMI), defined as peak height normalized wavelet coefficient for feature filtering, was developed for chromatographic peak detection and quantification. [@bai2022]
- automRm An R Package for Fully Automatic LC-QQQ-MS Data Preprocessing Powered by Machine Learning. [@eilertz2022]
- [IDSL.UFA](https://ufa.idsl.me/)Intrinsic Peak Analysis (IPA) for HRMS Data. [@baygi2022]
- [DEIMoS](https://github.com/pnnl/deimos): An Open-Source Tool for Processing High-Dimensional Mass Spectrometry Data [@colby2022]
- Omics Untargeted Key Script is a tools to make untargeted LC-MS metabolomic profiling with the latest computational features readily accessible in a ready-to-use unified manner to a research community[@plyushchenko2022].
- [MetEx](http://www.metaboex.cn/MetEx) is a targeted extraction strategy for improving the coverage and accuracy of metabolite annotation[@zheng2022a].
- Asari:Trackable and scalable LC-MS metabolomics data processing software in Python[@li2023a]
- NOMspectra: An Open-Source Python Package for Processing High Resolution Mass Spectrometry Data on Natural Organic Matter[@volikov2023]
- MARS:A Multipurpose Software for Untargeted LC−MS-Based Metabolomics and Exposomics with GUI in C++ [@goracci2024]
- MeRgeION: a Multifunctional R Pipeline for Small Molecule LC-MS/MS Data Processing, Searching, and Organizing [@liu2023a]
### Workflow Comparison
Here are some comparisons for different workflow and you could make selection based on their works[@myers2017; @weber2017; @li2018a;@liao2023].
[xcmsrocker](https://github.com/yufree/xcmsrocker) is a docker image for metabolomics to compare R based software with template[@yu2022b].
## Project Setup
I suggest building your data analysis projects in RStudio (Click File - New project - New dictionary - Empty project). Then assign a name for your project. I also recommend the following tips if you are familiar with it.
- Use [git](https://git-scm.com/)/[github](https://github.com/) to make version control of your code and sync your project online.
- Don't use your name for your project because other peoples might cooperate with you and someone might check your data when you publish your papers. Each project should be a work for one paper or one chapter in your thesis.
- Use **workflow** document(txt or doc) in your project to record all of the steps and code you performed for this project. Treat this document as digital version of your experiment notebook
- Use **data** folder in your project folder for the raw data and the results you get in data analysis
- Use **figure** folder in your project folder for the figure
- Use **munuscript** folder in your project folder for the manuscript (you could write paper in rstudio with the help of template in [Rmarkdown](https://github.com/rstudio/rticles))
- Just double click $$yourprojectname$$.Rproj to start your project
## Data sharing
See this paper[@haug2017]:
- [MetaboLights](http://www.ebi.ac.uk/metabolights/) EU based
- [The Metabolomics Workbench](http://www.metabolomicsworkbench.org/) US based
- [MetaboBank](https://mb2.ddbj.nig.ac.jp/) Japan based
- [MetabolomeXchange](http://www.metabolomexchange.org/site/) search engine
- [MetabolomeExpress](https://www.metabolome-express.org/) a public place to process, interpret and share GC/MS metabolomics datasets[@carroll2010].
## Contest
- [CASMI](http://www.casmi-contest.org/) predict small molecular contest[@blazenovic2017]