An ID can be present in multiple cliques (CURIE duplication) #276
As of PR #342, we can use the DuckDB database to generate a list of the 74,886 duplicated CURIEs. Here's the distribution by the types of Biolink entities being combined:
Some of these are duplicated only within the DuckDB database (MacromolecularComplex) and aren't actually duplicated in the compendia files. Others are duplicated in the compendia files but are returned correctly by NodeNorm, which I think is because the system happens to choose the correct clique to merge on. Either way, they all represent something wrong that needs to be fixed somewhere.
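A distribution like the one described above could be tallied roughly as follows. This is only a sketch: the `(curie, preferred_id, biolink_type)` row layout, the function name `type_combination_counts`, and the `EX:` identifiers are all assumptions made for illustration, not the actual schema or data from the DuckDB database.

```python
from collections import Counter, defaultdict


def type_combination_counts(rows):
    """Count duplicated CURIEs by the set of Biolink types they appear under.

    `rows` is an iterable of (curie, preferred_id, biolink_type) tuples.
    This layout is hypothetical; it is not the real DuckDB schema.
    """
    cliques_by_curie = defaultdict(set)
    types_by_curie = defaultdict(set)
    for curie, preferred_id, biolink_type in rows:
        cliques_by_curie[curie].add(preferred_id)
        types_by_curie[curie].add(biolink_type)

    counts = Counter()
    for curie, cliques in cliques_by_curie.items():
        # A CURIE is "duplicated" when it belongs to more than one clique.
        if len(cliques) > 1:
            counts[tuple(sorted(types_by_curie[curie]))] += 1
    return counts


# Made-up example rows (the EX: identifiers are not real CURIEs).
EXAMPLE_ROWS = [
    ("EX:1", "EX:chem-a", "biolink:ChemicalEntity"),
    ("EX:1", "EX:prot-a", "biolink:Protein"),
    ("EX:2", "EX:chem-b", "biolink:ChemicalEntity"),
    ("EX:2", "EX:chem-c", "biolink:ChemicalEntity"),
    ("EX:3", "EX:chem-b", "biolink:ChemicalEntity"),
]

if __name__ == "__main__":
    for combo, n in type_combination_counts(EXAMPLE_ROWS).most_common():
        print(n, " + ".join(combo))
```

Grouping by the sorted tuple of types separates cross-type duplicates (for example chemical plus protein) from duplicates within a single type, which probably have different causes.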
After taking out some duplicates, I ended up with 33,557 results instead. Biolink type distribution:
Next step:
This PR adds a DuckDB-based check for duplicate CURIEs (see latest report #276 (comment)).
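For readers unfamiliar with this kind of check, the core query might look something like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs without extra dependencies. The `clique_members` table, its columns, and the `EX:` identifiers are hypothetical and are not taken from the PR.

```python
import sqlite3

# Made-up rows: (curie, preferred_id of the clique it appears in).
EXAMPLE_MEMBERS = [
    ("EX:shared", "EX:chem-a"),
    ("EX:chem-a", "EX:chem-a"),
    ("EX:shared", "EX:prot-a"),
    ("EX:prot-a", "EX:prot-a"),
]


def find_duplicated_curies(members):
    """Return (curie, clique_count) for CURIEs present in more than one clique.

    sqlite3 stands in for DuckDB here; the table layout is an assumption.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE clique_members (curie TEXT, preferred_id TEXT)")
        conn.executemany("INSERT INTO clique_members VALUES (?, ?)", members)
        return conn.execute(
            """
            SELECT curie, COUNT(DISTINCT preferred_id) AS clique_count
            FROM clique_members
            GROUP BY curie
            HAVING COUNT(DISTINCT preferred_id) > 1
            ORDER BY curie
            """
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(find_duplicated_curies(EXAMPLE_MEMBERS))
```

Counting distinct preferred IDs rather than rows avoids flagging a CURIE that is merely listed twice within the same clique.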
A good example is UMLS:C0006050, which appears both as its own clique and in a completely different clique, since the two cliques are a chemical and a protein respectively: https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=UMLS%3AC0006050&curie=UNII%3AE211KPY694&conflate=true&drug_chemical_conflate=true
This is because db1 contains the JSON of all the identifiers, so as long as the cliques have different preferred IDs this is fine.
It's probably not possible to fix this without overhauling how the Redis databases in NodeNorm work, but it would be nice to have some sort of index-wide test (#225) to catch when this happens.
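An index-wide test along these lines might look like the sketch below. It assumes each compendium line is a JSON object with an `identifiers` list of `{"i": ...}` entries whose first entry is the clique's preferred ID. That layout, the function name, and the `EX:` identifiers are assumptions for illustration and should be checked against the real compendia files.

```python
import json
from collections import defaultdict


def find_cross_clique_curies(lines):
    """Map each CURIE found in more than one clique to its preferred IDs.

    `lines` is an iterable of JSON strings, one clique per line. The record
    layout ("identifiers" -> [{"i": curie}, ...], first entry preferred) is
    an assumption, not a confirmed description of the compendia format.
    """
    preferred_ids_by_curie = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        clique = json.loads(line)
        identifiers = [entry["i"] for entry in clique["identifiers"]]
        if not identifiers:
            continue
        preferred_id = identifiers[0]
        for curie in identifiers:
            preferred_ids_by_curie[curie].add(preferred_id)
    return {
        curie: sorted(preferred_ids)
        for curie, preferred_ids in preferred_ids_by_curie.items()
        if len(preferred_ids) > 1
    }


# Made-up compendium lines (the EX: identifiers are not real CURIEs).
EXAMPLE_LINES = [
    '{"type": "biolink:ChemicalEntity", "identifiers": [{"i": "EX:chem-a"}, {"i": "EX:shared"}]}',
    '{"type": "biolink:Protein", "identifiers": [{"i": "EX:prot-a"}, {"i": "EX:shared"}]}',
    '{"type": "biolink:Protein", "identifiers": [{"i": "EX:prot-b"}]}',
]

if __name__ == "__main__":
    print(find_cross_clique_curies(EXAMPLE_LINES))
```

A test suite could assert that this function returns an empty dict when run over all compendia together. Note that the index is held in memory, so for the full set of files a database-backed approach is likely more practical.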