BDNF has no gene or protein interactions #150

Closed
Genomewide opened this issue Mar 22, 2023 · 16 comments
Labels
  • duplicate names: Sometimes the name resolver/node normalizer needs more mappings in order to condense
  • Hammerhead (Sprint 6) - due Oct 4 in CI: This ticket will be fixed in CI by the end of Hammerhead (Sprint 6) (Oct 4)
  • needs review: this ticket needs a broad group of people to review and assign next steps because it crosses teams

Comments

@Genomewide

Genomewide commented Mar 22, 2023

Searching for any edge with BDNF to another gene or protein returns zero results from the ARAX UI. This is a well-studied gene and just looking at SPOKE shows a number of connected proteins.

This is the query that failed to return results:

{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    }
  },
  "nodes": {
    "n0": {
      "ids": [
        "CHEMBL.COMPOUND:CHEMBL2108230"
      ],
      "categories": [
        "biolink:SmallMolecule"
      ],
      "is_set": false,
      "name": "ABRINEURIN"
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "is_set": false
    }
  }
}

image

image

BTE returns results when I use HGNC:1033. It appears that BTE adjusts the categories from SmallMolecule alone to SmallMolecule and Gene.

{
  "edges": {
    "e0": {
      "object": "n1",
      "subject": "n0"
    }
  },
  "nodes": {
    "n0": {
      "categories": [
        "biolink:SmallMolecule",
        "biolink:Gene"
      ],
      "ids": [
        "HGNC:1033"
      ],
      "is_set": true,
      "name": "ABRINEURIN"
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "is_set": false
    }
  }
}

ARAX returned results without modifying the categories, so HGNC:1033 still works there. However, the query returns nothing when it is built with the ARAX query builder, which resolves BDNF to CHEMBL.COMPOUND:CHEMBL2108230.

{
  "edges": {
    "N1": {
      "attribute_constraints": [],
      "object": "n1",
      "predicates": [
        "biolink:has_normalized_google_distance_with"
      ],
      "qualifier_constraints": [],
      "subject": "n0"
    },
    "e0": {
      "attribute_constraints": [],
      "object": "n1",
      "qualifier_constraints": [],
      "subject": "n0"
    }
  },
  "nodes": {
    "n0": {
      "categories": [
        "biolink:SmallMolecule"
      ],
      "constraints": [],
      "ids": [
        "HGNC:1033"
      ],
      "is_set": false
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "constraints": [],
      "is_set": false
    }
  }
}
@cbizon
Collaborator

cbizon commented Apr 12, 2023

This looks to me like a normalization issue: the CHEMBL identifier doesn't merge with the UniProt IDs. The CHEMBL one is considered a chemical entity while the UniProt one is a Protein. I guess it makes sense to merge these? Do we have some other examples of CHEMBL identifiers for proteins? @gaurav what do you think?

@sierra-moxon sierra-moxon assigned gaurav and gprice1129 and unassigned gprice1129 Apr 21, 2023
@sierra-moxon
Member

from TAQA:
we'll push to the NodeNorm folks for more investigation, probably not a data/UI question - SPOKE has plenty of interactions :)

@Genomewide
Author

Genomewide commented Apr 24, 2023

Repeated the process.
It looks like BDNF gets returned as a small molecule called "ABRINEURIN" when using the node selection tool in the ARAX UI.
Manually changing the query to the gene CURIE HGNC:1033 gave results.

https://arax.ncats.io/?r=351c7c8b-9bbf-4818-bff5-22e6e2b09529

image

image

{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    }
  },
  "nodes": {
    "n0": {
      "ids": [
        "CHEMBL.COMPOUND:CHEMBL2108230"
      ],
      "categories": [
        "biolink:SmallMolecule"
      ],
      "is_set": false,
      "name": "ABRINEURIN"
    },
    "n1": {
      "categories": [
        "biolink:Gene",
        "biolink:Protein"
      ],
      "is_set": false
    }
  }
}

Changing to the HGNC CURIE and removing the small molecule category results in:
image

But even looking up the synonyms of the HGNC CURIE returns the small molecule:

image

How would this impact queries?
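
The manual workaround described in this comment (pinning n0 to the gene CURIE and dropping the SmallMolecule category) can be sketched in Python. This is only an illustration: `repin_node` is a hypothetical helper, not part of ARAX or BTE, and it only edits the query graph shown above without submitting it anywhere.

```python
import copy


def repin_node(query_graph, node_key, curie, categories=None):
    """Return a copy of a query graph with one node pinned to a different CURIE.

    Hypothetical helper for illustration only. If `categories` is None, the
    node's category constraint is removed, which mirrors "remove the small
    molecule category" in the comment above.
    """
    updated = copy.deepcopy(query_graph)
    node = updated["nodes"][node_key]
    node["ids"] = [curie]
    if categories is None:
        node.pop("categories", None)
    else:
        node["categories"] = list(categories)
    # The old display name described the old CURIE, so drop it.
    node.pop("name", None)
    return updated


# The failing query from this thread.
failing_query = {
    "edges": {"e0": {"subject": "n0", "object": "n1"}},
    "nodes": {
        "n0": {
            "ids": ["CHEMBL.COMPOUND:CHEMBL2108230"],
            "categories": ["biolink:SmallMolecule"],
            "is_set": False,
            "name": "ABRINEURIN",
        },
        "n1": {
            "categories": ["biolink:Gene", "biolink:Protein"],
            "is_set": False,
        },
    },
}

# Same one-hop query, pinned to the gene CURIE that did return results.
working_query = repin_node(failing_query, "n0", "HGNC:1033")
```

The helper returns a deep copy so the failing and working queries can be compared side by side.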

@sierra-moxon sierra-moxon added this to the June 30 milestone Jun 1, 2023
@sierra-moxon
Member

sierra-moxon commented Jun 2, 2023

from TAQA:
ABRINEURIN and BDNF are two names for the same thing and give different answers depending on which you select. Lots of results are tied to BDNF as a gene/protein with an HGNC id (also NCBI, OMIM, UniProtKB, etc., but no ChEMBL identifier).
Any non-ChEMBL id will get back all these good results.

Is ABRINEURIN really a SmallMolecule? It does not look like one.
Is there a modeling issue here? Maybe in the ARAX normalizer?
Maybe CHEMBL brings in gene identifiers? Should CHEMBL be in the id_prefixes section of 'biolink:Gene'?

@sierra-moxon sierra-moxon self-assigned this Jun 2, 2023
@andrewsu

andrewsu commented Jun 2, 2023

@Genomewide in the last screenshot in your most recent post, you showed the ARAX UI mapping HGNC:1033 to CHEMBL.COMPOUND:CHEMBL2108230. Can you tell me how you got that? When I plug HGNC:1033 into the ARAXI synonym mapper, I get this:

image

I ask because I can't find a good database that maps between gene/protein IDs and chemical/drug IDs in cases where the therapeutic is a protein. If ARAX has found a source for that, I'm guessing the NodeNorm folks would be very interested in that mapping, and it could help solve this issue...

@edeutsch

edeutsch commented Jun 3, 2023

Note that our production maturity is still using an older database and old Node Synonymizer system, while everything else (testing, staging, dev) is using a newer database and newer Node Synonymizer. So you will get different answers at these two sites:

Old system (production): https://arax.transltr.io/?term=HGNC:1033

New system (staging): https://arax.ci.transltr.io/?term=HGNC:1033

@sierra-moxon
Member

@andrewsu, @Genomewide - What can we do here for BDNF having no gene or protein interactions? @gaurav - can node normalizer help in any way here?

@sierra-moxon sierra-moxon modified the milestones: A: June 30 , B: July 31 Jul 3, 2023
@andrewsu

andrewsu commented Jul 6, 2023

Two thoughts from me:

  • @edeutsch: do you know the resource that your prod system uses to equate HGNC:1033 with CHEMBL.COMPOUND:CHEMBL2108230? If yes, I think NN could consider importing that (though we would need to be cautious about over-merging, which presumably is the reason why CI ARAX does not merge those two IDs)
  • Do we think this is a priority for our pre-Fall milestones? My guess is that this issue primarily affects our ability to link biologic therapeutics to the rest of our knowledge graph. Biologics are important of course, but not sure how we decide whether the effort is worth the benefit

@Genomewide
Author

I agree with Andrew about the likely group that is affected by this. If there were a way to identify the genes that need this, by looking at the list of biologics or at some low-hanging fruit, and a somewhat targeted way to get 50-80% of them linked up, I think it would be worth a little time.

I don't know how insulin does not suffer from the same problem. Is there a fix that was done before?

@edeutsch

edeutsch commented Jul 6, 2023

It's not easy anymore to be sure, but I think it was probably this record:
https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL2108230/
which says that CHEMBL.COMPOUND:CHEMBL2108230 is called abrineurin, together with this UniProtKB entry:
https://www.uniprot.org/uniprotkb/P23560/entry
which also gives the protein an alternate name of abrineurin and is equivalent to HGNC:1033.

So the link was based on ChEMBL and UniProtKB both listing the name abrineurin. The old NodeSynonymizer made lots of links that way, but it often overmerged. Name-based merging is often very good, but it generated a lot of tickets when it went bad.
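
To make the name-based linking concrete, here is a toy Python sketch. It is not the NodeSynonymizer's actual code: `merge_by_name` is a hypothetical function that puts identifiers in the same clique whenever they share a case-insensitive name. The name sets are simplified from this thread, and linking HGNC:1033 through the shared name "BDNF" is an assumption made for the example.

```python
from collections import defaultdict


def merge_by_name(names_by_curie):
    """Group CURIEs into cliques whenever they share a case-insensitive name.

    Toy illustration of name-based merging (union-find over shared names).
    """
    parent = {curie: curie for curie in names_by_curie}

    def find(curie):
        # Follow parent links to the clique representative (path halving).
        while parent[curie] != curie:
            parent[curie] = parent[parent[curie]]
            curie = parent[curie]
        return curie

    first_seen = {}  # normalized name -> first CURIE that carried it
    for curie, names in names_by_curie.items():
        for name in names:
            key = name.strip().lower()
            if key in first_seen:
                parent[find(curie)] = find(first_seen[key])
            else:
                first_seen[key] = curie

    cliques = defaultdict(set)
    for curie in names_by_curie:
        cliques[find(curie)].add(curie)
    return sorted(sorted(members) for members in cliques.values())


# Simplified name sets for illustration, not pulled from the live databases.
names_by_curie = {
    "CHEMBL.COMPOUND:CHEMBL2108230": {"abrineurin"},
    "UniProtKB:P23560": {"BDNF", "Abrineurin"},
    "HGNC:1033": {"BDNF"},
}
cliques = merge_by_name(names_by_curie)
```

With these inputs all three identifiers land in one clique, which is the behaviour that linked the drug record to the gene. The same mechanism overmerges as soon as two unrelated entities happen to share any one name.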

@andrewsu

andrewsu commented Jul 6, 2023

If name merging is the only way to bridge between the compound/drug identifiers and the gene/protein identifiers, and we've already established that name merging comes with some undesirable properties, then I suspect that there is not "low lying fruit" to be harvested here. And given that this type of drug-protein edge is not directly the subject of one of our MVP queries, my vote is to table this issue until later (post-fall).

Quoting @Genomewide: "I don't know how insulin does not suffer from the same problem."

I think insulin will have exactly the same problem. The ARAX synonymizer merges the compound/drug IDs and the gene/protein IDs on prod (https://arax.transltr.io/?r=2d241a3c-25f1-498a-9eab-ff618f65b68c), but not on CI (https://arax.ci.transltr.io/?r=2d241a3c-25f1-498a-9eab-ff618f65b68c)

@cbizon
Collaborator

cbizon commented Jul 6, 2023

I agree with @andrewsu that this is a tough problem that we will probably not solve by Sept. I think (?) that the correct response might be to allow a conflation between chemical/drug and protein but it's going to take some work to implement that and test it out. FWIW, I am not a big fan of name merging for the reason that Eric mentions - I think that there is plenty of structured equivalence or other relationship information that we should try to take advantage of first.

@andrewsu

Since there are two votes in favor of tabling this until the fall (and none opposed), I'm going to adjust the milestone now...

@gaurav

gaurav commented Oct 19, 2023

NodeNorm seems to have split abrineurin into multiple cliques: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=HGNC%3A1033&curie=UniProtKB%3AP23560&curie=CHEMBL.COMPOUND%3ACHEMBL2108230&curie=MESH:C415772&curie=PR:000004716&curie=UNII:A1ED6W905I&curie=UNII:86ZE5V51WT&conflate=false&drug_chemical_conflate=false&description=false

Some of these might be genuine splits, but I think something is going wrong in protein conflation here. I'm tracking this at TranslatorSRI/NodeNormalization#224. Is it correct to assume that this is at the same level of priority as other cliquing issues, or is there something particularly bad about this issue?
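
A small Python sketch of how the clique split above could be inspected. The endpoint and query parameters are taken from the URL in this comment. The response shape, a JSON object keyed by the queried CURIE whose values carry the preferred identifier under `id.identifier` (or are null for unknown CURIEs), is an assumption here and should be checked against the NodeNorm documentation. `build_url` and `group_by_clique` are hypothetical helper names.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Endpoint copied from the URL in the comment above.
NODENORM_URL = "https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes"


def build_url(curies, conflate=False, drug_chemical_conflate=False):
    """Build a get_normalized_nodes URL with one `curie` parameter per CURIE."""
    params = [("curie", curie) for curie in curies]
    params.append(("conflate", str(conflate).lower()))
    params.append(("drug_chemical_conflate", str(drug_chemical_conflate).lower()))
    return f"{NODENORM_URL}?{urlencode(params)}"


def group_by_clique(response):
    """Map each preferred identifier to the queried CURIEs that normalize to it.

    Assumes `response` is the parsed JSON object described in the lead-in.
    CURIEs with a null entry are grouped under the key None.
    """
    cliques = defaultdict(list)
    for curie, entry in response.items():
        preferred = entry["id"]["identifier"] if entry else None
        cliques[preferred].append(curie)
    return dict(cliques)


url = build_url([
    "HGNC:1033",
    "UniProtKB:P23560",
    "CHEMBL.COMPOUND:CHEMBL2108230",
])
```

Fetching `url` with any HTTP client and passing the parsed JSON to `group_by_clique` gives one key per clique. More than one key for identifiers that should be the same entity indicates a split.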

@sierra-moxon sierra-moxon added the duplicate names Sometimes the name resolver/node normalizer needs more mappings in order to condense label Nov 3, 2023
@sierra-moxon sierra-moxon added the needs review this ticket needs a broad group of people to review and assign next steps because it crosses teams label May 17, 2024
@sstemann

@Genomewide, can you please retest? I'm not clear on what we're looking for.

@Genomewide
Author

Synonyms from ARAX production don't work at all now.
image

Appears to be connected on CI though and returns results.
image


8 participants