An ID can be present in multiple cliques (CURIE duplication) #276

gaurav · 2024-05-17T16:24:15Z

A good example is UMLS:C0006050, which appears both as its own clique and as a member of a completely different clique, because the two cliques are a chemical and a protein respectively: https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=UMLS%3AC0006050&curie=UNII%3AE211KPY694&conflate=true&drug_chemical_conflate=true
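To make the example above easier to reproduce, here is a minimal sketch that rebuilds the query URL and checks a response for a CURIE listed in more than one clique. The helper names are hypothetical, and the response shape (a JSON object keyed by the queried CURIE, each value carrying `id.identifier` and an `equivalent_identifiers` list) is an assumption about the NodeNorm output rather than something confirmed in this issue.

```python
from urllib.parse import urlencode

NODENORM_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"


def build_query_url(curies, conflate=True, drug_chemical_conflate=True):
    """Build a get_normalized_nodes URL with one `curie` parameter per CURIE."""
    params = [("curie", curie) for curie in curies]
    params.append(("conflate", str(conflate).lower()))
    params.append(("drug_chemical_conflate", str(drug_chemical_conflate).lower()))
    return f"{NODENORM_URL}?{urlencode(params)}"


def cliques_containing(response, curie):
    """Return the preferred IDs of every clique in `response` that lists `curie`.

    Assumes `response` is the parsed JSON object described above; entries that
    are null (CURIEs that could not be normalized) are skipped.
    """
    found = set()
    for result in response.values():
        if not result:
            continue
        members = {e["identifier"] for e in result.get("equivalent_identifiers", [])}
        if curie in members:
            found.add(result["id"]["identifier"])
    return sorted(found)
```

Fetching the URL from `build_query_url(["UMLS:C0006050", "UNII:E211KPY694"])`, parsing the JSON, and passing it to `cliques_containing(response, "UMLS:C0006050")` should return more than one preferred ID if the duplication is still present.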

This is possible because db1 contains the JSON of all the identifiers in each clique, so as long as the two cliques have different preferred IDs, both can be stored without conflict.

It's probably not possible to fix this without overhauling how the Redis databases in NodeNorm work, but it would be nice to have some sort of index-wide test (#225) to catch when this happens.
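As a rough idea of what such an index-wide test could look like, here is a sketch that scans every clique and reports CURIEs that appear under more than one preferred ID. It assumes db1 can be treated as a mapping from preferred ID to a JSON list of member CURIEs, which is a simplification; the function name and the `EXAMPLE:` clique IDs are made up for illustration.

```python
import json
from collections import defaultdict


def find_duplicated_curies(db1):
    """Return {curie: [preferred IDs]} for every CURIE found in more than one clique.

    `db1` is assumed to map preferred ID -> JSON-encoded list of member CURIEs.
    """
    cliques_by_curie = defaultdict(set)
    for preferred_id, members_json in db1.items():
        for curie in json.loads(members_json):
            cliques_by_curie[curie].add(preferred_id)
    return {
        curie: sorted(preferred_ids)
        for curie, preferred_ids in cliques_by_curie.items()
        if len(preferred_ids) > 1
    }


# Toy data: two cliques with different preferred IDs that share one member.
toy_db1 = {
    "EXAMPLE:chemical-clique": json.dumps(["EXAMPLE:chemical-clique", "UMLS:C0006050"]),
    "EXAMPLE:protein-clique": json.dumps(["EXAMPLE:protein-clique", "UMLS:C0006050"]),
}
duplicates = find_duplicated_curies(toy_db1)
```

A test built on this would simply assert that the returned dictionary is empty for the whole index.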

gaurav · 2024-09-09T13:35:47Z

As of PR #342, we can use the DuckDB database to generate a list of the 74,886 duplicated CURIEs:
duplicate_curies-2024sep9.csv. Some of these don't look right just yet, but I figured it was worth sharing as a first pass, knowing that the results might change later.
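For anyone who wants to see the shape of such a check, here is a sketch of a duplicate-CURIE query. It is not the query from PR #342: the table and column names (`clique_members`, `clique_leader`, `curie`) are hypothetical, and the example runs against Python's built-in `sqlite3` as a stand-in for DuckDB so that it needs no extra dependencies. The `GROUP BY ... HAVING` statement itself is plain SQL that DuckDB should also accept.

```python
import sqlite3

DUPLICATE_QUERY = """
    SELECT curie, COUNT(DISTINCT clique_leader) AS n_cliques
    FROM clique_members
    GROUP BY curie
    HAVING COUNT(DISTINCT clique_leader) > 1
    ORDER BY curie
"""


def duplicated_curies(conn):
    """Return (curie, n_cliques) rows for CURIEs that belong to more than one clique."""
    return conn.execute(DUPLICATE_QUERY).fetchall()


# Toy table with one row per (clique, member CURIE) pair; all IDs are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clique_members (clique_leader TEXT, curie TEXT)")
conn.executemany(
    "INSERT INTO clique_members VALUES (?, ?)",
    [
        ("EXAMPLE:chem", "EXAMPLE:chem"),
        ("EXAMPLE:chem", "UMLS:C0006050"),
        ("EXAMPLE:prot", "EXAMPLE:prot"),
        ("EXAMPLE:prot", "UMLS:C0006050"),
    ],
)
rows = duplicated_curies(conn)
```

Counting distinct clique leaders rather than raw rows keeps a CURIE that is listed twice within the same clique from being reported as a cross-clique duplicate.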

Here's the distribution by the types of Biolink entities being combined:

   1 biolink:CellularComponent||biolink:ComplexMolecularMixture
   1 biolink:ChemicalEntity||biolink:GrossAnatomicalStructure
   1 biolink:ChemicalEntity||biolink:Polypeptide||biolink:Protein
   1 biolink:Disease||biolink:MolecularMixture
   1 biolink:Disease||biolink:OrganismTaxon
   1 biolink:GrossAnatomicalStructure||biolink:SmallMolecule
   1 biolink:MolecularMixture||biolink:OrganismTaxon
   1 biolink:OrganismTaxon||biolink:SmallMolecule
   1 biolink:PhenotypicFeature||biolink:Protein
   2 biolink:AnatomicalEntity||biolink:ChemicalEntity||biolink:Protein
   2 biolink:Cell||biolink:Disease
   2 biolink:ChemicalEntity||biolink:MolecularMixture||biolink:Protein
   3 biolink:CellularComponent||biolink:ChemicalEntity
   3 biolink:Disease||biolink:SmallMolecule
   3 biolink:GrossAnatomicalStructure||biolink:OrganismTaxon
   4 biolink:Cell||biolink:PhenotypicFeature
   4 biolink:ChemicalEntity||biolink:PhenotypicFeature
   4 biolink:GrossAnatomicalStructure||biolink:PhenotypicFeature
   6 biolink:AnatomicalEntity||biolink:ComplexMolecularMixture
   6 biolink:CellularComponent||biolink:Disease
   9 biolink:ChemicalEntity||biolink:Disease
  12 biolink:Cell||biolink:OrganismTaxon
  13 biolink:CellularComponent||biolink:PhenotypicFeature
  13 biolink:Cell||biolink:ChemicalEntity
  14 biolink:Disease||biolink:GrossAnatomicalStructure
  15 biolink:AnatomicalEntity||biolink:ChemicalEntity
  15 biolink:AnatomicalEntity||biolink:Disease
  17 biolink:ComplexMolecularMixture||biolink:Protein
  25 biolink:ChemicalEntity||biolink:Protein||biolink:SmallMolecule
  28 biolink:ChemicalEntity||biolink:OrganismTaxon
  34 biolink:Polypeptide||biolink:Protein
  37 biolink:AnatomicalEntity||biolink:OrganismTaxon
  55 biolink:Disease||biolink:Gene
  56 biolink:AnatomicalEntity||biolink:PhenotypicFeature
  57 biolink:ChemicalEntity||biolink:Drug||biolink:Protein
  81 biolink:MolecularMixture||biolink:Protein
 123 biolink:Drug
 264 biolink:Drug||biolink:Protein
 628 biolink:MacromolecularComplex
 764 biolink:Protein||biolink:SmallMolecule
6600 biolink:Gene||biolink:Protein
65977 biolink:ChemicalEntity||biolink:Protein

Some of these (MacromolecularComplex) are duplicated only within the DuckDB database and aren't actually duplicated in the compendia files. Others are duplicated in the compendia files but are returned correctly by NodeNorm, which I think is because the system is picking the correct clique to merge on by chance. However, they all represent something wrong that needs to be fixed somewhere.
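The type distribution above could be tabulated along these lines. This is a sketch, not the script that produced the numbers: it assumes the input is one `(curie, biolink_type)` pair per clique that a duplicated CURIE appears in, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def type_distribution(rows):
    """Count duplicated CURIEs by the sorted, '||'-joined set of their clique types.

    `rows` is an iterable of (curie, biolink_type) pairs, one per clique membership.
    """
    types_by_curie = defaultdict(set)
    for curie, biolink_type in rows:
        types_by_curie[curie].add(biolink_type)
    return Counter("||".join(sorted(types)) for types in types_by_curie.values())


def format_distribution(distribution):
    """Render counts in ascending order, one 'count combination' line per entry."""
    ordered = sorted(distribution.items(), key=lambda item: (item[1], item[0]))
    return "\n".join(f"{count:>5} {combo}" for combo, count in ordered)
```

Because the types are collected into a set, a CURIE that sits in two cliques of the same type shows up under a single type name, which would explain rows such as `biolink:Drug` that list only one type.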

gaurav · 2024-09-09T13:59:35Z

I took out some duplicates and ended up with 33,557 results instead:
duplicate_curies-2024sep9-2.csv

Biolink type distribution:

   1 biolink:AnatomicalEntity||biolink:OrganismTaxon
   1 biolink:ChemicalEntity||biolink:GrossAnatomicalStructure
   1 biolink:ChemicalEntity||biolink:Polypeptide||biolink:Protein
   1 biolink:Disease||biolink:MolecularMixture
   1 biolink:Disease||biolink:OrganismTaxon
   1 biolink:GrossAnatomicalStructure||biolink:SmallMolecule
   1 biolink:MolecularMixture||biolink:OrganismTaxon
   1 biolink:OrganismTaxon||biolink:SmallMolecule
   1 biolink:PhenotypicFeature||biolink:Protein
   1 biolink:Polypeptide||biolink:Protein
   2 biolink:AnatomicalEntity||biolink:ChemicalEntity||biolink:Protein
   2 biolink:Cell||biolink:Disease
   2 biolink:ChemicalEntity||biolink:MolecularMixture||biolink:Protein
   3 biolink:Cell||biolink:OrganismTaxon
   3 biolink:Disease||biolink:SmallMolecule
   3 biolink:GrossAnatomicalStructure||biolink:OrganismTaxon
   4 biolink:Cell||biolink:PhenotypicFeature
   4 biolink:ChemicalEntity||biolink:PhenotypicFeature
   4 biolink:GrossAnatomicalStructure||biolink:PhenotypicFeature
   6 biolink:AnatomicalEntity||biolink:ComplexMolecularMixture
   6 biolink:CellularComponent||biolink:Disease
   8 biolink:ChemicalEntity||biolink:OrganismTaxon
   9 biolink:ChemicalEntity||biolink:Disease
  11 biolink:Cell||biolink:ChemicalEntity
  12 biolink:AnatomicalEntity||biolink:ChemicalEntity
  12 biolink:CellularComponent||biolink:PhenotypicFeature
  14 biolink:Disease||biolink:GrossAnatomicalStructure
  15 biolink:AnatomicalEntity||biolink:Disease
  17 biolink:ComplexMolecularMixture||biolink:Protein
  25 biolink:ChemicalEntity||biolink:Protein||biolink:SmallMolecule
  55 biolink:Disease||biolink:Gene
  56 biolink:AnatomicalEntity||biolink:PhenotypicFeature
  57 biolink:ChemicalEntity||biolink:Drug||biolink:Protein
  81 biolink:MolecularMixture||biolink:Protein
 263 biolink:Drug||biolink:Protein
 764 biolink:Protein||biolink:SmallMolecule
32109 biolink:ChemicalEntity||biolink:Protein

gaurav · 2024-10-22T20:36:08Z

Next steps:

- Get some examples and talk to the Biolink people about what exactly we want this kind of thing to be: proteins or chemicals.
- The easiest fix would be to remove something from one processing pipeline or the other (i.e. exclude proteins when processing chemicals, or exclude chemicals when processing proteins).
  - Note that there might be e.g. PubChem identifiers that aren't connected to other proteins; we would need to get that information into the protein processing, either directly from the PubChem download or indirectly via the chemical code.
- The quick-and-dirty approach would be to come up with some sort of post-processing on the compendium files (there are, after all, only 33k cliques that need figuring out), but that would be inelegant and possibly vulnerable to other changes down the line. More investigation up front would be useful.
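To make the post-processing option concrete, here is a sketch of what it might look like. It assumes each compendium file is JSON Lines with one clique per line and an `identifiers` list whose entries carry the CURIE under the key `i`; that layout and both function names are assumptions for illustration, not a description of the actual Babel code.

```python
import json


def identifiers_in(compendium_lines):
    """Collect every CURIE mentioned in a compendium (an iterable of JSON lines)."""
    curies = set()
    for line in compendium_lines:
        clique = json.loads(line)
        curies.update(entry["i"] for entry in clique["identifiers"])
    return curies


def exclude_identifiers(compendium_lines, excluded):
    """Yield each clique with the `excluded` CURIEs removed.

    Cliques left with no identifiers are dropped entirely.
    """
    for line in compendium_lines:
        clique = json.loads(line)
        kept = [entry for entry in clique["identifiers"] if entry["i"] not in excluded]
        if kept:
            yield {**clique, "identifiers": kept}
```

With the chemical and protein compendia loaded as lists of lines, `identifiers_in(chemicals) & identifiers_in(proteins)` gives the shared CURIEs, and `exclude_identifiers(chemicals, shared)` removes them from the chemical side. One caveat: if the first identifier in a clique determines its preferred ID, removing it would silently change that ID, which is part of why this approach is fragile.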

gaurav modified the milestones: Issues needing investigation, Babel September 2024 May 17, 2024

gaurav mentioned this issue May 17, 2024

merge Botulinum toxin A nodes NCATSTranslator/Feedback#395

Open

gaurav mentioned this issue Jun 12, 2024

Replace the in-memory glomming code with DuckDB databases #291

Open

gaurav mentioned this issue Jul 19, 2024

Some entities are showing up in both proteins and chemicals #310

Open

gaurav changed the title ~~An ID can be present in multiple cliques~~ An ID can be present in multiple cliques (CURIE duplication) Sep 9, 2024

gaurav mentioned this issue Oct 15, 2024

Added a check for duplicate CURIEs #342

Merged

gaurav modified the milestones: Babel September 2024, Babel November 2024 Oct 21, 2024

gaurav added a commit that referenced this issue Nov 2, 2024

Merge pull request #342 from TranslatorSRI/check-duplicate-ids

321490a

This PR adds a DuckDB-based check for duplicate CURIEs (see latest report #276 (comment)).

gaurav mentioned this issue Nov 5, 2024

2024oct24 defaults UMLS IDs to proteins rather than chemical entities #370

Open

gaurav modified the milestones: Babel November 2024, Babel January 2025 Nov 9, 2024
