Improve the identification of Compounds and Compound Synonyms #1186

mneinast · 2024-08-22T18:07:18Z


We previously tried to construct a master list of compound names, but it still did not cover a number of the Compounds / Synonyms encountered in a recent submission. I had to spend a couple of hours manually looking up HMDB IDs. Here I'll summarize my process, then describe ideas from Rob, and then add my own discussion.

how I manually searched for HMDB IDs to identify compounds from a new submission

The Study Doc contained a Compounds Sheet which indicated every unique Compound Name + Formula in the Annotated Files which I had submitted. If an entry was found in Tracebase, then the Tracebase synonyms and corresponding HMDB ID were also entered into the same Compounds sheet. Unidentified compounds had an empty HMDB ID.

For each compound, I first checked whether it was just a new synonym for an existing Tracebase compound. Specifically, I searched for the formula and/or name in the Compounds table on the live Tracebase site. If the correct HMDB ID was found, I copied it to the sheet. If the compound was not in Tracebase, I searched HMDB using the compound name.

Eventually, I filled the missing HMDB IDs with the correct value (or a not-available when I could not find an HMDB ID).
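The manual steps above could be sketched roughly as follows. This is only an illustration: the `Compound` class, the `lookup_hmdb_id` function, the `hmdb_names` dictionary, and the placeholder IDs are all hypothetical and are not Tracebase code. Formula-only hits are returned as candidates for review rather than accepted, since different compounds can share a formula.

```python
from dataclasses import dataclass, field

NOT_AVAILABLE = "not-available"  # hypothetical placeholder for "no HMDB ID found"


@dataclass
class Compound:
    # Hypothetical stand-in for a Tracebase compound record.
    name: str
    formula: str
    hmdb_id: str
    synonyms: list = field(default_factory=list)


def lookup_hmdb_id(name, formula, tracebase, hmdb_names):
    """Return (hmdb_id, source, formula_candidates) for one Compounds sheet row."""
    key = name.strip().lower()

    # Step 1: is the name already a Tracebase compound name or synonym?
    for c in tracebase:
        if key in [c.name.lower()] + [s.lower() for s in c.synonyms]:
            return c.hmdb_id, "tracebase", []

    # Formula-only hits are ambiguous, so they are only offered for review.
    candidates = [c.hmdb_id for c in tracebase if c.formula == formula]

    # Step 2: fall back to a lookup by name in HMDB (here a plain dict).
    if key in hmdb_names:
        return hmdb_names[key], "hmdb", candidates

    # Step 3: nothing found, record the placeholder.
    return NOT_AVAILABLE, "unresolved", candidates


# Made-up sample data; the IDs are placeholders, not real HMDB accessions.
tracebase = [Compound("glucose", "C6H12O6", "HMDB-EXAMPLE-A", ["D-glucose", "dextrose"])]
hmdb_names = {"citric acid": "HMDB-EXAMPLE-B"}

print(lookup_hmdb_id("Dextrose", "C6H12O6", tracebase, hmdb_names))
print(lookup_hmdb_id("fructose", "C6H12O6", tracebase, hmdb_names))
```

The second call finds no name match, so it returns the placeholder together with the formula-based candidate for a curator to check.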

Ideas to handle this

In validation, attempt to handle unidentified compounds:

- Retrieve and consult the HMDB synonym list
- Check formulas (including for ionization differences)
- HMDB could be searched and possible matches proposed
- Compound entries in the Tracebase interface could retrieve, check, and display HMDB info, with possible warnings displayed
- The validation interface could link searches of the compound name to HMDB
- The Excel cells could link to an HMDB search of the compound name
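As a rough illustration of the "possible matches proposed" idea, here is a minimal sketch that uses fuzzy string matching from Python's standard `difflib` module against a synonym list. The `propose_matches` function, the `synonym_to_id` mapping, the cutoff value, and the placeholder IDs are assumptions for the example; a real version would need a locally available copy of the HMDB synonyms.

```python
import difflib


def propose_matches(name, synonym_to_id, limit=3, cutoff=0.8):
    """Return up to `limit` (synonym, hmdb_id) pairs that resemble `name`."""
    # Compare case-insensitively by lowercasing both sides.
    lowered = {syn.lower(): hmdb_id for syn, hmdb_id in synonym_to_id.items()}
    close = difflib.get_close_matches(
        name.strip().lower(), list(lowered), n=limit, cutoff=cutoff
    )
    # get_close_matches returns the best-scoring candidates first.
    return [(syn, lowered[syn]) for syn in close]


# Made-up synonym list with placeholder IDs (not real HMDB accessions).
synonyms = {
    "citric acid": "HMDB-EXAMPLE-A",
    "lactic acid": "HMDB-EXAMPLE-B",
    "glucose": "HMDB-EXAMPLE-C",
}

for syn, hmdb_id in propose_matches("citric-acid", synonyms):
    print(syn, hmdb_id)
```

Proposals like these would only be suggestions; someone would still have to confirm that the proposed compound is the right one.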

Alternatively, could we just dump all of the HMDB IDs, names, formulas, and synonyms into Tracebase? This would avoid building a lookup feature into Tracebase, but it might require periodic updates as the HMDB list grows. It's also decently large (I think about 200k HMDB compounds).

hepcat72 · 2024-08-26T16:20:43Z

Maintainer

I'm still on vacation, but there's a quiet time while I wait for my parents to return from a dentist visit, so I thought I would throw out some ideas...

I'm thinking about this in terms of barriers to submission. One barrier is whether or not the researcher can submit. The other is whether we can load. Right now, we allow submission without the researcher needing to fill out all the compound info, but even though we allow this, we must interact with them to confirm compound info, so whether or not they can submit is not the real barrier to focus on. In fact, letting the researcher handle the compound info speeds up the process, so the question is (IMO): how do we streamline that process?

One thought I've had is that perhaps (hear me out here) relying on (or requiring) a compound record to exist is something we can eliminate. After all, it's a "guess" by the researcher, and we never change the peak group name that's submitted, regardless of the assigned compound record(s) linked to the PeakGroup. The formula stored in the PeakGroup record is pretty much all that we need to make Tracebase work, so I propose the following. We may not do this, but I think it's at least valuable as a perspective through which to view this issue:

- Don't require compounds to be linked to PeakGroups
- Any code that performs calculations using formulas stored in the compound record should instead use the formula in the PeakGroup record
  - Note, the hydrogen content may differ from compounds due to ionization
  - At the least, we can employ a strategy to "fall back" to the formula in the peak group if there is no linked compound present
  - This would necessitate considerations for being able to compare equivalent formulas that differ by ionization and/or formatting (e.g. order of elements and/or duplicate element entries; chempy has utilities to accomplish this)
- We create an interface that allows curators to link peak groups and compounds
  - We can even have an interface to allow researchers to propose links (/edits) that a curator must approve/commit (this is an interface I implemented for the Toxin and Virulence Factor database at Los Alamos)
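To make the "fall back" and formula-comparison points concrete, here is a minimal sketch. It uses a small regex parser from the standard library rather than chempy, so it only handles flat formulas (no parentheses, charges, or isotope labels). The function names and the `max_h_diff` parameter are hypothetical, not existing Tracebase code.

```python
import re
from collections import Counter

# An element symbol (e.g. "C", "Na") followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C6H12O6'."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        # Repeated elements (e.g. 'CH3COOH') are summed.
        counts[element] += int(number) if number else 1
    return counts


def same_formula(a, b, max_h_diff=0):
    """Compare formulas ignoring element order and duplicate entries.

    Hydrogen counts may differ by up to max_h_diff to allow for ionization.
    """
    ca, cb = parse_formula(a), parse_formula(b)
    h_diff = abs(ca.pop("H", 0) - cb.pop("H", 0))
    return ca == cb and h_diff <= max_h_diff


def formula_for_calculation(peak_group_formula, compound_formula=None):
    """Use the linked compound's formula, else fall back to the peak group's."""
    return compound_formula if compound_formula else peak_group_formula


print(same_formula("C6H12O6", "O6C6H12"))                # same atoms, different order
print(same_formula("C2H4O2", "CH3COOH"))                 # duplicate element entries
print(same_formula("C6H12O6", "C6H11O6"))                # one hydrogen apart
print(same_formula("C6H12O6", "C6H11O6", max_h_diff=1))  # tolerated as ionization
print(formula_for_calculation("C6H11O6"))                # no linked compound
```

How many hydrogens to tolerate, and whether other adducts should count as equivalent, would be a design decision for the loader.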

This would allow Tracebase to work without all of the compound lookup requirements, and provide an avenue to make those edits after loading (at the researcher's leisure).

3 replies

mneinast Sep 4, 2024
Author

I really like this idea. It may be a stretch too far to completely stop using the primary compound's formula defined in tracebase, but it would be huge to just accept new data even if it does not (yet) match a compound in the database. So I like the idea to "fall back" to the formula in the peak group if there is no linked compound present.

That said, one of the major benefits of tracebase is that it squashes all these synonyms to a single record. So if we go with the "fall back" method, we would need to implement a method for users to update synonyms / add compounds anyway. I like your point, though, that maybe this doesn't have to block submission.

hepcat72 Sep 4, 2024
Maintainer

Yeah, thanks. And re-reading my comment, I realize that I should distill it a bit. So what I mean by TraceBase being able to work without the compound records is... all the calculations just use the formula. The only other reason we need the compounds is for searching, but we can search using the peak group name. The problem that the compound records solve is, like you said: linking synonyms. That is something that could be an overhead process that happens after load...? Anyway, like I said, I'm just throwing out crazy ideas here - trying to think outside the box.

mneinast Sep 4, 2024
Author

makes sense! I like the idea for sure!
