Replies: 3 comments 4 replies
-
@mneinast @jcmatese @fkang-pu I just threw out a bunch of questions off the top of my head. Feel free to discuss whatever aspect you feel we should prioritize.
-
I think it's worth defining two "modes" of automated peak extraction:

1) Untargeted: find every peak in every mzXML.
2) Targeted: find peaks for every known compound.

Xi's near-term focus is on (2), but (1) is close behind, and Mike Skinnider has immediate use cases for it (ie searching for all instances of a specific m/z at some RT). EDIT: (1) would only generate a list of m/z, retention time, and intensity. This would not include formula. See my reply below. But I'll focus on (2)... These are some key points about auto-generated peaks (mode 2):
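To make mode (1) concrete, here is a minimal sketch of what an untargeted peak record and the "search for a specific m/z at some RT" use case might look like. All names and tolerance values are hypothetical illustrations, not part of TraceBase or Xi's pipeline:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode1Peak:
    """One untargeted peak: m/z, RT, and intensity only -- no formula
    or compound assignment. Field names are illustrative."""
    mz: float               # observed mass-to-charge ratio
    retention_time: float   # minutes
    intensity: float        # raw peak intensity

def find_peaks(peaks, mz, rt, mz_tol=0.005, rt_tol=0.5):
    """Return every peak within a small window around (mz, rt).
    Tolerance defaults are placeholders, not recommended values."""
    return [
        p for p in peaks
        if abs(p.mz - mz) <= mz_tol and abs(p.retention_time - rt) <= rt_tol
    ]
```

Because mode-1 records carry no compound identity, a query like this is the natural access pattern: the user supplies the m/z and RT themselves.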
I think integrating with curated data is ideal. The basic structure of data will match PeakData / PeakGroups, so it makes sense to integrate it there. If automated data becomes the norm, then integrating now will reduce complications later. It would definitely be important to include a "distinguishing display" (ie marker that this data is from an automated process).
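As a rough illustration of the "distinguishing display" idea, automated provenance could be stored as an explicit source field and surfaced wherever peak groups are rendered. This is a hypothetical sketch; `PeakGroupRecord`, `PeakSource`, and the `[auto]` marker are placeholders, not actual TraceBase models:

```python
from dataclasses import dataclass
from enum import Enum

class PeakSource(Enum):
    MANUAL = "manual"        # researcher-curated (eg an El-Maven export)
    AUTOMATED = "automated"  # produced by the automated peak-picking pipeline

@dataclass
class PeakGroupRecord:
    """Hypothetical mirror of a PeakGroups row, with provenance attached."""
    compound: str
    source: PeakSource

    def display_name(self) -> str:
        """Append a distinguishing marker when the data is automated."""
        if self.source is PeakSource.AUTOMATED:
            return f"{self.compound} [auto]"
        return self.compound
```

Storing provenance as its own field (rather than, say, a naming convention) keeps the automated rows queryable alongside curated ones while making the distinction impossible to lose.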
Yes, calculate Fcirc from automated data if the manual representation of the tracer compound is missing. Otherwise, default to the manual representation. We probably need a distinguishing display.
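The fallback rule above might be sketched like this. The helper name and the dict-of-records shape are assumptions for illustration; the returned flag is what would drive the distinguishing display:

```python
def tracer_record_for_fcirc(manual_records, automated_records, tracer):
    """Pick which representation of the tracer compound feeds the Fcirc
    calculation: prefer the manual (curated) record, and fall back to the
    automated one only when no manual record exists.

    Hypothetical helper; both inputs are assumed to be dicts keyed by
    compound name. Returns (record, is_automated) so callers can show a
    distinguishing marker when the automated fallback was used.
    """
    if tracer in manual_records:
        return manual_records[tracer], False
    if tracer in automated_records:
        return automated_records[tracer], True
    return None, False
```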
I think not, to ensure long-term compatibility (and also because sometimes we'll have no measurement of tracer compounds).
I think no filter at all - not even for multiple representations (ie negative mode and positive mode). We will need to see whether the algorithm made an attempt to find the compound, and we would be interested in checking consistency for a compound across every available mzXML. You have a nice idea about detection thresholds. For a moment, I wondered if this could be a flag/warning to the user that the automated compound is under a low detection threshold. But I think we should not include any "judgement" of the automated peaks at all: I don't want to flag some peaks for one issue and then create false trust in a peak because we failed to check some other aspect of quality. You also have an interesting thought about considering labeled elements. To implement it, we would keep only the peaks associated with the tracer labeled elements defined in the metadata. This could be useful to the end-user analyzing data outside of TraceBase, but it does violate the "no judgement, no filter" rule I just described...
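For completeness, the labeled-element idea (which, as noted, would violate the "no judgement, no filter" rule if applied at storage time) could instead be an export-time convenience along these lines. The dict shape and `labeled_element` key are assumptions for illustration:

```python
def filter_by_tracer_elements(peaks, tracer_elements):
    """Keep only peaks whose isotope label involves an element present in
    the study's tracers (eg {"C"} for a 13C-tracer study).

    Hypothetical sketch: each peak is assumed to be a dict carrying a
    "labeled_element" key. Intended as an optional convenience when
    exporting for outside analysis, not as a storage-time filter, so the
    full unfiltered data is always preserved.
    """
    return [p for p in peaks if p["labeled_element"] in tracer_elements]
```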
From the lab's perspective, this is TBD. Josh wants a centralized source of automated peaks. It may be most logical to do this directly on msdata as mzXML files are generated. Many users will use automated peaks for experiments that do not include tracers, which is not an explicitly supported experiment type for TraceBase. Where the correction should run is another question; it typically requires user knowledge of the instrumentation and tracers used. I think Xi may be automating this as the final part of his pipeline, though.
I'm conflicted. If we add this, it would create a huge new incentive for users to upload mzXML, and increase the likelihood that TraceBase becomes an integral part of everyone's workflow. It is also a very clean way to implement a centralized algorithm (or multiple) and ensure the results are organized. One of TraceBase's biggest strengths is the enforced metadata (ie detailed but comparable information about tissue, tracer, treatment, etc). If you combine this with access to every known compound (automated mode 2) or even every peak in every mzXML (automated mode 1), then suddenly TraceBase offers a whole new level of opportunity. On the other hand, doing this could subtly discourage users from using alternative algorithms and loading those results (unless we decide to support/implement multiple algorithms).
If we don't integrate automated peak picking, then we just accept all algorithms but require a specific, defined format for input data. If we do integrate automated peak picking, then I'm imagining every mzXML is pushed through parallel pipelines of automated peak picking (each with different parameters), and we'd keep all of them.
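A sketch of the "parallel pipelines, keep everything" idea: enumerate parameter sets, run each mzXML through all of them, and tag every result set with the parameters that produced it so all representations coexist. Parameter names and values here are purely illustrative, and `pick_peaks` stands in for whatever peak-picking implementation is plugged in:

```python
from itertools import product

def pipeline_configs():
    """Enumerate hypothetical parameter sets; every mzXML would be run
    through all of them and every result kept (no judgement, no filter)."""
    mz_tolerances = [0.002, 0.005]  # Da; illustrative values only
    min_scans = [3, 5]              # minimum consecutive scans per peak
    return [
        {"mz_tol": mz_tol, "min_scans": n}
        for mz_tol, n in product(mz_tolerances, min_scans)
    ]

def run_all(mzxml_path, pick_peaks):
    """Run one mzXML through every configuration, pairing each result set
    with the parameters that produced it."""
    return [
        {"params": cfg, "peaks": pick_peaks(mzxml_path, **cfg)}
        for cfg in pipeline_configs()
    ]
```

Keeping the parameters alongside each result set is what lets multiple representations of the same file coexist without ambiguity.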
This is possible but I worry that it is not robust to changes in LCMS methods. Specifically, I worry that the choice of best representation could change over time (ie different between samples run in 2024 on instrument A vs 2028 on instrument B). I think the "no judgement, no filter" mantra can apply here too.
The current files generated by the automated tool are xlsx and average 117 KB per sample (with about 550 compounds each). This will increase as the number of known compounds increases. I don't know the size of "mode 1" data per sample, but I believe Xi is finding 2000-5000 putative compounds (ie real PeakGroups, not LCMS artifacts) per sample.
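A back-of-the-envelope sizing from the numbers above (117 KB per mode-2 sample with ~550 compounds), assuming file size scales roughly linearly with compound count and using the midpoint of the 2000-5000 range for mode 1. Illustrative only:

```python
def storage_estimate_mb(n_samples, kb_per_sample=117,
                        compounds_per_sample=550, mode1_compounds=3500):
    """Rough storage estimate (MB) for mode-2 and mode-1 outputs.

    Assumes linear scaling of file size with compound count; the
    mode-1 compound count is the midpoint of the 2000-5000 range.
    """
    mode2_mb = n_samples * kb_per_sample / 1024
    kb_mode1 = kb_per_sample * mode1_compounds / compounds_per_sample
    mode1_mb = n_samples * kb_mode1 / 1024
    return round(mode2_mb, 1), round(mode1_mb, 1)
```

Under these assumptions, 10,000 samples comes to roughly 1.1 GB of mode-2 files and several GB of mode-1 files, ie storage is unlikely to be the binding constraint at current scales.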
Y'all might know better than me.
-
Here are increasing levels of support for auto-generated peaks:
1b) Same as 1, except we add classification of peaks as automated or not, and we keep every automated representation of the data.
The order of (2) or (3) could be switched.
-
TraceBase's focus up to now has been on providing access to high-quality, researcher-curated peak data. Curation is a manual, labor-intensive endeavor, and since researchers focus on the data relevant to their immediate efforts, a lot of valuable data goes unanalyzed. We are on the verge of providing access to the raw files containing that data, but raw files are not very accessible. They're a "black box": you don't know what's there until you download the files, import them into El-Maven, and start the analysis process.
Xi has been working on an automated peak-picking algorithm that, at the very least, has the potential to tell us what compounds are in the "unanalyzed" data, and at best, has the potential to eliminate all manual peak picking efforts.
We want to integrate this output into TraceBase in some form or fashion that makes the unanalyzed data more accessible. Some of the questions we need to answer are: