update new semmeddb api #63

Closed
colleenXu opened this issue Apr 1, 2022 · 9 comments

@colleenXu

Follow-up to #30. SEMMEDDB recently released new data files (publications up to Feb 2022). Together with the "outdated identifiers in associations" issue (also pasted below), this suggests we may want to look at updating this pending API.


Note that we're encountering issues with outdated UMLS identifiers in the semmed data; see TranslatorSRI/NodeNormalization#119 (comment).

Things to consider:

  • keeping only 1 record for each pipe-delimited value, and using the ID that isn't outdated (i.e. NCBIGene)
  • checking whether UMLS IDs are outdated during ingestion

Outdated IDs cause issues for BTE since there's no cross-mapping / label retrieval for them during ID resolution.

@newgene newgene added the data source Data source pending to create a new API label Apr 1, 2022
@erikyao
Contributor

erikyao commented Apr 11, 2022

Hi @r76941156, I've forked your semmed_parser repo into https://github.com/biothings/semmeddb and renamed it to match the API name.

Could you spare some time to update the API to the latest release? I can help if you have other priorities. Thank you!

@colleenXu
Author

colleenXu commented Apr 15, 2022

Thoughts from discussion with @erikyao today: @newgene @andrewsu

There may be a way to find and replace outdated UMLS IDs, if we can find a table of retired IDs in a recent UMLS release. The SEMMEDDB data files don't seem to include info on which UMLS IDs are outdated. Note that I've seen the "outdated UMLS ID issue" with "genes", but it's not clear to me whether only "genes" are affected.
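
A minimal sketch of what the find-and-replace step could look like during ingestion. It assumes a retired-ID table has already been loaded into a dict; the `RETIRED_CUI_MAP` contents and the `resolve_cui` helper are hypothetical and the CUIs in the map are made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical mapping from retired UMLS CUIs to their replacements.
# These CUIs are invented for illustration. None means the CUI was
# retired without a replacement.
RETIRED_CUI_MAP: Dict[str, Optional[str]] = {
    "C0000001": "C9999999",
    "C0000002": None,
}


def resolve_cui(cui: str, retired_map: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the CUI to use in an association record.

    - CUIs that are not in the retired map are returned unchanged.
    - Retired CUIs are swapped for their replacement.
    - Retired CUIs without a replacement yield None, so the caller can
      decide whether to drop the record.
    """
    if cui not in retired_map:
        return cui
    return retired_map[cui]
```

The parser could call `resolve_cui` on every subject and object CUI and skip records where it returns `None`.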

The pipe-delimited values are confusing:

  • In some cases, the list is 2 elements long: one ID is a UMLS ID (the first one?) and the other is an Entrez/NCBIGene ID. In this case, it seems like we could pick just 1 ID to make the association record, and it'd be better to pick the Entrez/NCBIGene ID since the UMLS ID might be outdated.
  • In other cases, the list is > 2 elements long. It looks like the first element is a UMLS or an Entrez/NCBIGene ID, and the other elements are Entrez/NCBIGene IDs. From the names, it looks like this is a set of related concepts, where we may want to "enumerate" and make association records for each ID.

Yao said he will look into both of these issues and return with more information.

@erikyao
Contributor

erikyao commented Apr 15, 2022

  • In some cases, the list is 2 elements long: one ID is a UMLS ID (the first one?) and the other is an Entrez/NCBIGene ID. In this case, it seems like we could pick just 1 ID to make the association record, and it'd be better to pick the Entrez/NCBIGene ID since the UMLS ID might be outdated.

E.g. "C1451465|1398", "CRK protein, human|CRK"

  • In other cases, the list is > 2 elements long. It looks like the first element is a UMLS or an Entrez/NCBIGene ID, and the other elements are Entrez/NCBIGene IDs. From the names, it looks like this is a set of related concepts, where we may want to "enumerate" and make association records for each ID.

E.g. "C0007082|1048|1084|5670", "Carcinoembryonic Antigen|CEACAM5|CEACAM3|PSG2"
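
The two cases above could be handled roughly as follows. This is only a sketch, not the parser's actual code: the `expand_concept` name is hypothetical, and it assumes that any ID not shaped like a UMLS CUI (`C` plus 7 digits) is an Entrez/NCBIGene ID.

```python
import re
from typing import List, Tuple

# Assumption: UMLS CUIs look like "C" followed by 7 digits; anything
# else in the pipe-delimited list is treated as an NCBIGene ID.
CUI_PATTERN = re.compile(r"C\d{7}")


def expand_concept(ids: str, names: str) -> List[Tuple[str, str, str]]:
    """Split pipe-delimited IDs/names into (namespace, id, name) tuples.

    - 2 elements, exactly one of them an NCBIGene ID: keep only the
      NCBIGene entry, since the UMLS ID might be outdated.
    - Otherwise: enumerate every ID so that each one can get its own
      association record.
    """
    id_list = ids.split("|")
    name_list = names.split("|")
    if len(id_list) != len(name_list):
        raise ValueError(f"ID/name length mismatch: {ids!r} vs {names!r}")

    pairs = []
    for identifier, name in zip(id_list, name_list):
        namespace = "UMLS" if CUI_PATTERN.fullmatch(identifier) else "NCBIGene"
        pairs.append((namespace, identifier, name))

    if len(pairs) == 2:
        genes = [p for p in pairs if p[0] == "NCBIGene"]
        if len(genes) == 1:
            return genes
    return pairs
```

With the examples above, `expand_concept("C1451465|1398", "CRK protein, human|CRK")` keeps only the NCBIGene entry for CRK, while the Carcinoembryonic Antigen example is enumerated into four entries. Whether the UMLS entry should be kept in the longer lists is still an open question.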

@colleenXu
Author

colleenXu commented Apr 15, 2022

Also discussed with @erikyao (@newgene @andrewsu):

It would be useful if the parser generated a file of "combinations" for use in auto-generating x-bte annotations based on what records will exist in the API. Right now, I analyze the semmeddb predications file separately to generate the x-bte annotations.

I'm imagining something like this:

  • remove all rows / records where the novelty for the subject OR the novelty for the object is 0. This is because we only make x-bte annotations from data with novelty = 1 (@andrewsu's previous decision)
  • from the remaining data, generate a table of unique combos of subject ID namespace + subject semantic type + predicate + object ID namespace + object semantic type. A count of how many records exist for each combo would be awesome too.

For example:

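| subject ID | subject type | predicate | object ID | object type | count  |
|------------|--------------|-----------|-----------|-------------|--------|
| UMLS       | aapp         | DISRUPTS  | UMLS      | aapp        | 123456 |
| UMLS       | aapp         | DISRUPTS  | NCBIGene  | aapp        | 222    |
| NCBIGene   | aapp         | DISRUPTS  | UMLS      | aapp        | 4567   |
| NCBIGene   | aapp         | DISRUPTS  | NCBIGene  | aapp        | 123456 |
| UMLS       | gngm         | CAUSES    | UMLS      | dsyn        | 987654 |
| NCBIGene   | gngm         | CAUSES    | UMLS      | dsyn        | 6543   |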
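
A rough sketch of how the parser could produce such a table. The record layout is an assumption: the dict keys (`subject_novelty`, `subject_namespace`, and so on) and the `count_combos` name are hypothetical, not the parser's actual schema.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Combo = Tuple[str, str, str, str, str]


def count_combos(records: Iterable[Dict]) -> "Counter[Combo]":
    """Count records per (subject namespace, subject semantic type,
    predicate, object namespace, object semantic type) combination.

    Records where the subject or object novelty is 0 are skipped,
    because x-bte annotations are only built from novelty = 1 data.
    """
    counts: "Counter[Combo]" = Counter()
    for rec in records:
        if rec["subject_novelty"] == 0 or rec["object_novelty"] == 0:
            continue
        combo = (
            rec["subject_namespace"],
            rec["subject_semtype"],
            rec["predicate"],
            rec["object_namespace"],
            rec["object_semtype"],
        )
        counts[combo] += 1
    return counts
```

Calling `count_combos(records).most_common()` would give the rows sorted by count, ready to be written out as the "combinations" file.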

@colleenXu
Author

Related to the issues here: https://github.com/biothings/semmeddb/issues

@colleenXu
Author

And also some notes about semmeddb -> x-bte operations:

  • right now, one operation = 1 output semmed-semantic type. If the pending hub is at biothings version >= 0.10, we can use OR to query for multiple output semmed-semantic types. However, we already sometimes cut off the number of records returned in a sub-query, and that will get worse if these sub-queries become more expansive (internal slack links)
  • Already mentioned above:
    • dealing with retired UMLS IDs + pipes in semmeddb data (original issue, Yao's issues)
    • adding operations for the ncbigene field. The problem is how to do this when it's not clear which operations have many underlying records with ncbigene IDs.

@colleenXu
Author

@andrewsu @erikyao do we want to close this issue? I thought the outdated / retired IDs and pipe-delimited fields had been dealt with?

And then are there other issues we can close?

@erikyao
Contributor

erikyao commented May 24, 2023

@colleenXu no issue on my end
