Replies: 3 comments 4 replies
-
@mneinast @jcmatese @fkang-pu I just threw out a bunch of questions off the top of my head. Feel free to discuss whatever aspect you feel we should prioritize.
-
I think it's worth defining two "modes" of automated peak extraction:

1) Untargeted: find every peak in every mzXML.
2) Targeted: find peaks for every known compound.

Xi's near-term focus is on (2), but (1) is close behind, and Mike Skinnider has immediate use cases for it (ie searching for all instances of a specific m/z at some RT). EDIT: (1) would only generate a list of m/z, retention time, and intensity. This would not include formula. See my reply below. But I'll focus on (2)... These are some key points about auto-generated peaks (mode 2):
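To make mode (1) concrete, here is a minimal sketch of what an untargeted peak record and the "search for a specific m/z at some RT" use case might look like. All names and tolerance values are hypothetical illustrations, not part of TraceBase or Xi's pipeline:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode1Peak:
    """One untargeted peak: m/z, RT, and intensity only -- no formula
    or compound assignment. Field names are illustrative."""
    mz: float               # observed mass-to-charge ratio
    retention_time: float   # minutes
    intensity: float        # raw peak intensity

def find_peaks(peaks, mz, rt, mz_tol=0.005, rt_tol=0.5):
    """Return every peak within a small window around (mz, rt).
    Tolerance defaults are placeholders, not recommended values."""
    return [
        p for p in peaks
        if abs(p.mz - mz) <= mz_tol and abs(p.retention_time - rt) <= rt_tol
    ]
```

Because mode-1 records carry no compound identity, a query like this is the natural access pattern: the user supplies the m/z and RT themselves.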
I think integrating with curated data is ideal. The basic structure of data will match PeakData / PeakGroups, so it makes sense to integrate it there. If automated data becomes the norm, then integrating now will reduce complications later. It would definitely be important to include a "distinguishing display" (ie marker that this data is from an automated process).
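As a rough illustration of the "distinguishing display" idea, automated provenance could be stored as an explicit source field and surfaced wherever peak groups are rendered. This is a hypothetical sketch; `PeakGroupRecord`, `PeakSource`, and the `[auto]` marker are placeholders, not actual TraceBase models:

```python
from dataclasses import dataclass
from enum import Enum

class PeakSource(Enum):
    MANUAL = "manual"        # researcher-curated (eg an El-Maven export)
    AUTOMATED = "automated"  # produced by the automated peak-picking pipeline

@dataclass
class PeakGroupRecord:
    """Hypothetical mirror of a PeakGroups row, with provenance attached."""
    compound: str
    source: PeakSource

    def display_name(self) -> str:
        """Append a distinguishing marker when the data is automated."""
        if self.source is PeakSource.AUTOMATED:
            return f"{self.compound} [auto]"
        return self.compound
```

Storing provenance as its own field (rather than, say, a naming convention) keeps the automated rows queryable alongside curated ones while making the distinction impossible to lose.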
Yes, calculate Fcirc from automated data if the manual representation of the tracer compound is missing. Otherwise, default to the manual representation. We probably need a distinguishing display.
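The fallback rule above might be sketched like this. The helper name and the dict-of-records shape are assumptions for illustration; the returned flag is what would drive the distinguishing display:

```python
def tracer_record_for_fcirc(manual_records, automated_records, tracer):
    """Pick which representation of the tracer compound feeds the Fcirc
    calculation: prefer the manual (curated) record, and fall back to the
    automated one only when no manual record exists.

    Hypothetical helper; both inputs are assumed to be dicts keyed by
    compound name. Returns (record, is_automated) so callers can show a
    distinguishing marker when the automated fallback was used.
    """
    if tracer in manual_records:
        return manual_records[tracer], False
    if tracer in automated_records:
        return automated_records[tracer], True
    return None, False
```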
I think not, to ensure long-term compatibility (and also because sometimes we'll have no measurement of tracer compounds).
I think no filter at all - not even for multiple representations (ie negative mode and positive mode). We will need to see whether the algorithm made an attempt to find the compound, and we would be interested in checking consistency for a compound across every available mzXML. You have a nice idea about detection thresholds. For a moment, I wondered if this could be a flag/warning to the user that the automated compound is under a low detection threshold. But I think we should not include any "judgement" of the automated peaks at all: I don't want to flag some peaks for one issue and then create false trust in a peak because we failed to check some other aspect of quality. You also have an interesting thought about considering labeled elements. To implement it, we would keep only the peaks associated with the tracer labeled elements defined in the metadata. This could be useful to the end-user analyzing data outside of TraceBase, but it does violate the "no judgement, no filter" rule I just described...
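For completeness, the labeled-element idea (which, as noted, would violate the "no judgement, no filter" rule if applied at storage time) could instead be an export-time convenience along these lines. The dict shape and `labeled_element` key are assumptions for illustration:

```python
def filter_by_tracer_elements(peaks, tracer_elements):
    """Keep only peaks whose isotope label involves an element present in
    the study's tracers (eg {"C"} for a 13C-tracer study).

    Hypothetical sketch: each peak is assumed to be a dict carrying a
    "labeled_element" key. Intended as an optional convenience when
    exporting for outside analysis, not as a storage-time filter, so the
    full unfiltered data is always preserved.
    """
    return [p for p in peaks if p["labeled_element"] in tracer_elements]
```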
From the lab's perspective, this is TBD. Josh wants a centralized source of automated peaks. It may be most logical to do this directly on msdata as mzXML files are generated. Many users will use automated peaks for experiments that do not include tracers, which is not an explicitly supported experiment type for TraceBase. Where the correction should run is another question; it typically requires user knowledge of the instrumentation and tracers used. I think Xi may be automating this as the final part of his pipeline, though.
I'm conflicted. If we add this, it would create a huge new incentive for users to upload mzXML, and increase the likelihood that TraceBase becomes an integral part of everyone's workflow. It is also a very clean way to implement a centralized algorithm (or multiple) and ensure the results are organized. One of TraceBase's biggest strengths is the enforced metadata (ie detailed but comparable information about tissue, tracer, treatment, etc). If you combine this with access to every known compound (automated mode 2) or even every peak in every mzXML (automated mode 1), then suddenly TraceBase offers a whole new level of opportunity. On the other hand, doing this could subtly discourage users from using alternative algorithms and loading those results (unless we decide to support/implement multiple algorithms).
If we don't integrate automated peak picking, then we just accept all algorithms but require a specific, defined format for input data. If we do integrate automated peak picking, then I'm imagining every mzXML is pushed through parallel pipelines of automated peak picking (each with different parameters), and we'd keep all of them.
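A sketch of the "parallel pipelines, keep everything" idea: enumerate parameter sets, run each mzXML through all of them, and tag every result set with the parameters that produced it so all representations coexist. Parameter names and values here are purely illustrative, and `pick_peaks` stands in for whatever peak-picking implementation is plugged in:

```python
from itertools import product

def pipeline_configs():
    """Enumerate hypothetical parameter sets; every mzXML would be run
    through all of them and every result kept (no judgement, no filter)."""
    mz_tolerances = [0.002, 0.005]  # Da; illustrative values only
    min_scans = [3, 5]              # minimum consecutive scans per peak
    return [
        {"mz_tol": mz_tol, "min_scans": n}
        for mz_tol, n in product(mz_tolerances, min_scans)
    ]

def run_all(mzxml_path, pick_peaks):
    """Run one mzXML through every configuration, pairing each result set
    with the parameters that produced it."""
    return [
        {"params": cfg, "peaks": pick_peaks(mzxml_path, **cfg)}
        for cfg in pipeline_configs()
    ]
```

Keeping the parameters alongside each result set is what lets multiple representations of the same file coexist without ambiguity.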
This is possible but I worry that it is not robust to changes in LCMS methods. Specifically, I worry that the choice of best representation could change over time (ie different between samples run in 2024 on instrument A vs 2028 on instrument B). I think the "no judgement, no filter" mantra can apply here too.
The current files generated by the automated tool are xlsx and average 117 KB per sample (with about 550 compounds each). This will increase as the number of known compounds increases. I don't know the size of "mode 1" data per sample, but I believe Xi is finding 2000-5000 putative compounds (ie real PeakGroups, not LCMS artifacts) per sample.
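A back-of-the-envelope sizing from the numbers above (117 KB per mode-2 sample with ~550 compounds), assuming file size scales roughly linearly with compound count and using the midpoint of the 2000-5000 range for mode 1. Illustrative only:

```python
def storage_estimate_mb(n_samples, kb_per_sample=117,
                        compounds_per_sample=550, mode1_compounds=3500):
    """Rough storage estimate (MB) for mode-2 and mode-1 outputs.

    Assumes linear scaling of file size with compound count; the
    mode-1 compound count is the midpoint of the 2000-5000 range.
    """
    mode2_mb = n_samples * kb_per_sample / 1024
    kb_mode1 = kb_per_sample * mode1_compounds / compounds_per_sample
    mode1_mb = n_samples * kb_mode1 / 1024
    return round(mode2_mb, 1), round(mode1_mb, 1)
```

Under these assumptions, 10,000 samples comes to roughly 1.1 GB of mode-2 files and several GB of mode-1 files, ie storage is unlikely to be the binding constraint at current scales.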
Y'all might know better than me.
-
Here are increasing levels of support for auto-generated peaks:
1b) Same as 1, except we add classification of peaks as automated or not, and we keep every automated representation of the data.
The order of (2) or (3) could be switched.
-
TraceBase's focus up to now has been on providing access to high-quality, researcher-curated peak data. Curation is a manual, labor-intensive endeavor, and since researchers focus on the data relevant to their immediate efforts, a lot of valuable data goes unanalyzed. We are on the verge of providing access to the raw files containing that data, but raw files are not very accessible. They're a "black box": you don't know what's there until you download the files, import them into El-Maven, and start the analysis process.
Xi has been working on an automated peak-picking algorithm that, at the very least, has the potential to tell us what compounds are in the "unanalyzed" data, and at best, has the potential to eliminate all manual peak picking efforts.
We want to integrate this output into TraceBase in some form or fashion that makes the unanalyzed data more accessible. Some of the questions we need to answer are: