Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: new sets of id for reactions and metabolites #174

Closed
2 tasks done
haowang-bioinfo opened this issue Jun 1, 2020 · 45 comments
Closed
2 tasks done

feat: new sets of id for reactions and metabolites #174

haowang-bioinfo opened this issue Jun 1, 2020 · 45 comments

Comments

@haowang-bioinfo
Copy link
Member

haowang-bioinfo commented Jun 1, 2020

Description of the issue:

  • Here a new set of reaction id is proposed, to promote the application of Human-GEM as a template for generic usage
  • This new set of rxn ids will be added as a new field to the annotation file humanGEMRxnAssoc.JSON, without affecting the model

Planned implementation:

  • Append a new field rxnMAID to humanGEMRxnAssoc.JSONreactions.tsv
  • The element ids in rxnMAID will be the same as those in rxns, except that the prefix HMR_ will be systematically replaced with MAR (abbreviation of Metabolic Atlas reactions)
  • Append a new field metMAID to metabolites.tsv

I hereby confirm that I have:

  • Done this analysis in the master branch of the repository
  • Checked that a similar issue does not exist already
@JonathanRob
Copy link
Collaborator

Is the plan to eventually assign an MA_ ID for non-HMR_ reactions as well?

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Jun 2, 2020

Yes @JonathanRob, this plan aims toward assigning every rxns id with a unique MA_ identifier eventually.

Given that the whole process may take some time, it might be better to begin with a simple implementation, just replacing HMR_ with MA_ at first.

@mihai-sysbio
Copy link
Member

Seeing that this change has been in (slow) progress for some time, I'm wondering when it would be a good time to make this change in the entire model. I think the merging of PR #203 would make it easier to implement the adoption of new identifiers.
Another question outside the scope of this issue is if metabolites should also use MA_ identifiers.

@haowang-bioinfo
Copy link
Member Author

@mihai-sysbio honestly, myself look forward to a quick adaption to entire model.

@mihai-sysbio
Copy link
Member

I can see some of the benefits of this transition, but I'm having a hard time spotting downsides besides the obvious backwards compatibility, which can be preserved by moving previous identifiers to the annotation. What other downsides are there?

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Nov 25, 2020

After discussing with @mihai-sysbio and @JonathanRob, the following consensus was reached regarding this transition:

  • Expand the scope of MA ids from reactions to metabolites, and probably also compartments
  • The MA id is formatted as concatenation of two parts: prefix letters and suffix digits
  • The prefix are: MAR for reactions, MAM for metabolites, MAC for compartments
  • The suffix digits are numbers, with fixed width, in sorted (or random?) order

Any comments/suggestions to this proposed naming convention are welcome.

@mihai-sysbio
Copy link
Member

mihai-sysbio commented Nov 25, 2020

The letters M and R are in line with the BiGG recommendations, which is good. I think it's preferable to not have any _ in the identifier though. Should we ignore the special naming scheme for pseudoreactions?
Are there any thoughts of the compartmentalized metabolites, which is the pairing of a compartment and a metabolite?

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Nov 25, 2020

@mihai-sysbio agree to take away underscore _.

Now it's probably better to ignore the special naming for pseudoreactions.

I'm not certain about the exact usage of compartment id, which can be implemented "slowly" than the other two.

@haowang-bioinfo
Copy link
Member Author

Due to the breaking of backwards compatibility from this transition, its implementation should be well-planned, i.e. probably associating with a major release.

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Feb 6, 2021

The current implementation of rxnMAID is incomplete - not all reactions have such a id. This caused confusion during the transition and alredy impeded the downstream development in Metabolic Atlas.

Here is a proposal to systemically assign an rxnMAID for each reaction in Human-GEM:

  1. Format change: from MA_DDDDD to MARDDDDD, where D represents one digit
  2. Retain the existing ids and follow the new format (replacing underscore with R)
  3. Add values to empty ones by picking ids in the range from MAR00001 to MAR09999
  4. An easy implementation is to make an orderly filling and avoid duplication

@JonathanRob
Copy link
Collaborator

@Hao-Chalmers From you comment above, I thought we had agreed to MARDDDDD for reactions?

The prefix are: MAR for reactions, MAM for metabolites, MAC for compartments

@haowang-bioinfo
Copy link
Member Author

@JonathanRob yes, the proposal was updated as agreed.

@mihai-sysbio
Copy link
Member

mihai-sysbio commented Feb 6, 2021

Thinking about these IDs, they appear to be encouraging a cross-model approach (ie the same identifier in multiple models). However, this creates the difficulty of maintaining consistency of these IDs between models, which shouldn't be taken lightly.
Another approach would be to create IDs within the scope of a model (ie MA:Human-GEM.R0001). The downside here is that mapping across models is more difficult.
I have some thoughts on how we can ensure the consistency so that we can pursue the first option with confidence, but I would value a discussion with you before starting anything.

@mihai-sysbio
Copy link
Member

Specifically about the ID format, I would encourage the separation of the MA part from the rest of the ID. Also, having looked around on identifiers.org, the _ separator seems a bit uncommon. My preference at the moment is a ., which would result in something like MA.R00001.

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Feb 13, 2021

IMHO, the cross-model approach (using the same identifiers in multiple models) would actually facilitate in maintaining consistency between models. While the approach of creating IDs within the scope of a model does make the mapping a difficult task.

@haowang-bioinfo
Copy link
Member Author

It seems we've had a consensus on the MA rxn id format as MARDDDDD. I plan to implement as such. @mihai-sysbio @JonathanRob what do you think?

@mihai-sysbio
Copy link
Member

I've had another look on identifiers.org at BiGG reactions, VMH reactions, MNX reactions and MetaCyc reactions. It seems okay, and very much in line with MNX, to follow the format MARDDDDD proposed above by @Hao-Chalmers, even though my preference is still to separate the domain space like MA.RDDDDD.

@haowang-bioinfo
Copy link
Member Author

@mihai-sysbio what would be the usage of separating an id into domains?

@mihai-sysbio
Copy link
Member

To me, separating the domain space makes for an easier recognition of what the letters mean, especially down the line with MAM for metabolites and who knows what else (say MAG for gene reaction rules or MAC for complexes).

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Feb 22, 2021

The advantage of an additional dot is not so clear, because they are mainly recognised by code. In addition, this additional delimiter may cause unexpected effect since these ids could be processed by various packages that treat ids with different regular expression patterns (. is a special character).

@mihai-sysbio
Copy link
Member

The other option is to completely skip the MA letters in the identifier name, since identifiers.org has a dedicated prefix. The result would be something like https://identifiers.org/metabolic.atlas:R00001 and other website would refer to it as Metabolic Atlas ID: R00001.

@haowang-bioinfo
Copy link
Member Author

it seems KEGG took this R00001 format already.

@JonathanRob
Copy link
Collaborator

Yes, to avoid confusion with other database IDs, I would keep the MA prefix within the identifier.

As for the period (MA.R vs. MAR), I think that it makes it more clear by separating the MA prefix from the rest of the identifier, but to be honest I don't have a strong preference either way. I don't believe it should cause parsing issues to have the . present, since many ID types already include special characters.

@cfrainay
Copy link

Dear all,

Thanks for bringing this up, having a model-specific prefix to avoid confusion and conflicts between models built from same sources sounds like a very good idea.

Regarding the special characters, since the "." is interpreted in regex and sh commands and can add confusion when present in a URI, I guess it might have more inconvenient side effects than _ or nothing.

Sure, it can easily be escaped, but doesn't seem that necessary. The alphabetic characters prefix followed by digit only suffix might already be enough to separate both, if that is ever needed.

@mihai-sysbio
Copy link
Member

mihai-sysbio commented Feb 28, 2021

Thank you all for the valuable input. For the looks I still prefer . but even though, like @JonathanRob said, with proper escaping it shouldn't be a problem, let's remove the possibility of situations as described by @cfrainay.

Let's go for MA-R0001.

The separation is meant increased legibility and to avoid potential confusion (if we imagine a contrived example of a website called Metabolic Atlas Reactants then R in a reaction with id MAR0001 can have double meaning: is it Reactants or reactions?).

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Mar 2, 2021

Okay, we end up with the format of MA-RDDDDD, where D represents a digit.

@haowang-bioinfo haowang-bioinfo changed the title feat: a new set of id for reactions feat: new sets of id for reactions and metabolites Mar 30, 2021
@mihai-sysbio
Copy link
Member

Similarly, I would propose MA-MDDDDD for the metabolite IDs.

@haowang-bioinfo
Copy link
Member Author

It does appear to be reasonable to standardize MA met ids as such.

To make a convenient transition from the previous HMR format, I would recommend to continue appending compartment id as suffix: MA-MDDDDDC, where C represent the abbreviation of corresponding compartment. @mihai-sysbio @JonathanRob what do you think?

@mihai-sysbio
Copy link
Member

appending compartment id as suffix: MA-MDDDDDC, where C represent the abbreviation of corresponding compartment

I see that for reactions there is no such last letter C for the implemented reaction IDs:

rxns rxnKEGGID rxnBiGGID rxnEHMNID rxnHepatoNET1ID rxnREACTOMEID rxnRecon3DID rxnMetaNetXID rxnHMR2ID rxnRatconID rxnTCDBID spontaneous rxnMAID
"FAOXC101C8m" "" "FAOXC101C8m" "" "" "" "FAOXC101C8m" "" "" "" "" 0 "MA-R04967"
"FAOXC101C8x" "" "FAOXC101C8x" "" "" "" "FAOXC101C8x" "" "" "" "" 0 "MA-R04968"

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Apr 6, 2021

@mihai-sysbio having compartment id as suffix is a common practice for naming met ids, such as in BiGG (ATP). Another advantage is to retain the consistency and inheritance (at least to some extent) with HMR ids as below.

rxns rxnMAID
"HMR_1967" "MA-R01967"
mets metMAID
"m01967c" "MA-M01967c"

@mihai-sysbio
Copy link
Member

True!
The advantage of the other approach is to enable a section like Present in compartments as here https://identifiers.org/vmhmetabolite:h2o .

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Apr 11, 2021

Implementation of MA met ids (cdd63f3):

  • Convert HMR ids to MA format as proposed (e.g. from m01967c to MA-M01967c).
  • Replace non-HMR ids with unused values in the range of MA-M00001 to MA-M09999, appending with compartment id.
  • Ensure an ordered filling and all ids are unique.

@mihai-sysbio @JonathanRob what do you think about this plan?

@JonathanRob
Copy link
Collaborator

@mihai-sysbio I just wanted to comment on your note that compartment abbreviations are used in metabolite IDs but not reaction IDs.

Ideally, the reaction and metabolite IDs should not be embedded with compartment information because it implicitly enforces a specific set of compartment abbreviations (i.e., "s" is extracellular, which differs from the COBRA style of using "e" for extracellular). In practice, however, it's more complicated.

I think it makes sense to use different IDs for the same reaction in different compartments (rather than using the same base ID with different compartment abbreviations) because some of the reaction properties may differ. For example, the same reaction may be catalyzed by a different enzyme depending on the compartment, meaning that the gene-reaction rule will differ between the two cases. Similarly, the same reaction taking place in a different species will be catalyzed by a different enzyme (unless it's spontaneous). Finally, the environment of a different compartment (or species) may be such that the reaction reversibility is affected, so the associated lower and/or upper bounds would be affected.

For a metabolite, however, it is the same compound with the same annotations, properties, and associations regardless of what compartment or species it's in. It may change its protonation state due to a pH difference, but then it's treated as a different metabolite with a different ID. So unlike a reaction, a metabolite can be treated as (effectively) identical everywhere it exists. In fact, a metabolite should have the same exact ID even in different compartments, but since GEM standards require that all metabolites in a GEM must have a unique ID, we are forced to differentiate them somehow.

So this is my long way of saying that I'm not really happy with the idea of embedding the compartment abbreviation within the metabolite ID, but to me it still seems (very, very slightly) better than using a completely different ID for different compartment versions of a metabolite.

@mihai-sysbio
Copy link
Member

mihai-sysbio commented Apr 11, 2021

Thank you both for the ideas and explanations.

In Metabolic Atlas we do distinguish between a Metabolite and a CompartmentalizedMetabolite. The ID format suggested by @Hao-Chalmers would then map directly to the CompartmentalizedMetabolite. I'm still thinking of ways to associate the two.

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Apr 12, 2021

@JonathanRob nice comments that are very good considerations in identifier formatting, which has been and will be a long-term process.

On the other hand, to have a short-term progress, my feeling is we've reached a consensus with the proposed plan. Am I right? @mihai-sysbio

@mihai-sysbio
Copy link
Member

a consensus with proposed plan

@Hao-Chalmers indeed!

I was wondering if it would make sense to have another column in the metabolites.tsv that maps to the Metabolite, so that each CompartmentalizedMetabolite row is obviously mapped to the Metabolite. In our case, the compartmentalized ID will be with the compartment letter as you said @Hao-Chalmers (even though it is non-standard as @JonathanRob mentioned), and the non-compartmentalized one will simply be missing that letter. It looks really redundant, but it's also very clear and would require no custom regex parsing.

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Apr 12, 2021

if it would make sense to have another column with non-compartmentalized ids that will simply be missing that compartment letter

@mihai-sysbio in metabolites.tsv there is an existing column "metsNoComp" for serving this purpose.

@mihai-sysbio
Copy link
Member

@Hao-Chalmers is the plan to place these new IDs in:

  • mets and metsNoComp columns, or
  • just the metMAID column, or
  • metMAID and metNoCompMAID columns?

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented Apr 12, 2021

for now, let's focus on adding metMAID column first.

@mihai-sysbio
Copy link
Member

I see, so the metsNoComp then isn't directly useful for the purpose of mapping non-compartmentalized MA ids.

@mihai-sysbio
Copy link
Member

@Hao-Chalmers what are your thoughts on another column metsNoCompMAID?

@haowang-bioinfo
Copy link
Member Author

@Hao-Chalmers what are your thoughts on another column metsNoCompMAID?

My thoughts are:

  • avoid implementations for uncertain requirements (Occam's Razor)
  • prioritize to certain things
  • metsNoCompMAID is super easy to obtain (just trimming the last character)
  • metsNoCompMAIDwould end up as identical column to metsNoComp once MA ids are used

@mihai-sysbio
Copy link
Member

avoid implementations for uncertain requirements (Occam's Razor)

The requirement is for a person to link to all compartmentalized metabolites. This cannot be done at the moment, and it would need a metabolite id, similar to metsNoComp.

prioritize to certain things

I'm not sure what this means.

metsNoCompMAID is super easy to obtain (just trimming the last character)

Parsing/regex on identifiers is not a good practice, even if it's easy.

metsNoCompMAID would end up as identical column to metsNoComp once MA ids are used

Totally agree. Let's then implement this in 2 stages - only mets now, and automatically obtain metsNoComp later.

@haowang-bioinfo
Copy link
Member Author

prioritize to certain things

I'm not sure what this means.

This means that the consensus to metMAID is certain and can be prioritized.

@mihai-sysbio
Copy link
Member

I believe this issue can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants