update new semmeddb api #63

colleenXu · 2022-04-01T19:55:39Z

Follow-up to #30. Noting that SEMMEDDB recently released new data files (publications up to Feb 2022). Along with the "outdated identifiers in associations" issue (also pasted below), we may want to look at updating this pending API...

Note that we're encountering issues with outdated UMLS identifiers in the semmed data...TranslatorSRI/NodeNormalization#119 (comment).

Things to consider:

- having only 1 record for the pipe-delimited stuff - and using the ID that isn't outdated (aka ncbigene)...
- checking if UMLS IDs are outdated or not during ingestion

This causes issues for BTE since there's no cross-mapping / label retrieval during ID resolution...

erikyao · 2022-04-11T19:22:32Z

Hi @r76941156, I've forked your semmed_parser repo into https://github.com/biothings/semmeddb. And I've renamed it to the API name.

Could you spare some time to update the API to the latest release? I can help if you have other priorities. Thank you!

colleenXu · 2022-04-15T23:09:18Z

Thoughts from discussion with @erikyao today: @newgene @andrewsu

There may be a way to find and replace outdated UMLS IDs...if we can find this table in a recent UMLS release. The SEMMEDDB data files don't seem to include info on what UMLS IDs are outdated. Note that I've seen the "outdated UMLS ID issue" with "genes" but it's not clear to me if only "genes" are affected.

The pipe-delimited values are confusing:

- In some cases, the list is 2 elements long and 1 ID is a UMLS one (the first one?) and 1 ID is an entrez/NCBIGene one. In this case, it seems like we could pick just 1 ID to make the association record - and that it'd be better to pick the entrez/NCBIGene ID since the UMLS ID might be outdated.
- In other cases, the list is > 2 elements long. It looks like the first element is a UMLS or an entrez/NCBIGene ID, and the other elements are entrez/NCBIGene IDs. From the names, it looks like this is a set of related concepts - where we may want to "enumerate" and make association records for each ID.

Yao said he will look into both of these issues and return with more information.
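For the find-and-replace idea above, here is a minimal sketch of how a retired-ID table could be applied once one is found. It assumes the table is the UMLS retired-CUI file (MRCUI.RRF) with the pipe-delimited column order CUI1|VER|REL|RELA|MAPREASON|CUI2|MAPIN, and that only "SY" (synonymous) rows are safe one-to-one replacements. Both points are assumptions to check against the UMLS release actually used. The function names and the sample CUIs are made up for illustration.

```python
def load_retired_cui_map(lines):
    """Build {retired CUI: replacement CUI} from MRCUI-style rows.

    Assumed column order: CUI1|VER|REL|RELA|MAPREASON|CUI2|MAPIN
    Only rows with REL == "SY" are kept (assumption: these are
    one-to-one replacements); deleted or ambiguous rows are skipped.
    """
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 6:
            continue  # malformed row, ignore
        old_cui, rel, new_cui = fields[0], fields[2], fields[5]
        if rel == "SY" and new_cui:
            mapping[old_cui] = new_cui
    return mapping


def resolve_cui(cui, retired_map):
    """Return the replacement for a retired CUI, or the CUI unchanged."""
    return retired_map.get(cui, cui)


# Illustrative rows only; these CUIs are not real data.
sample_rows = [
    "C0000001|2021AA|SY|||C0000002|Y|",
    "C0000003|2021AA|DEL|||||",
]
retired = load_retired_cui_map(sample_rows)
print(resolve_cui("C0000001", retired))
print(resolve_cui("C0000003", retired))
```

Retired IDs without a synonymous replacement are left as-is here; whether to drop those records instead is a separate decision.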

erikyao · 2022-04-15T23:14:37Z

> In some cases, the list is 2 elements long and 1 ID is a UMLS (the first one?) and 1 ID is an entrez/NCBIGene one. In this case, it seems like we could pick just 1 ID to make the association record - and that it'd be better to pick the entrez/NCBIGene ID since the UMLS ID might be outdated

E.g. "C1451465|1398", "CRK protein, human|CRK"

> In other cases, the list is > 2 elements long. It looks like the first element is a UMLS or an entrez/NCBIGene ID, and the other elements are entrez/NCBIGene IDs. From the names, it looks like this is a set of related concepts - where we may want to "enumerate" and make association records for each ID.

E.g. "C0007082|1048|1084|5670", "Carcinoembryonic Antigen|CEACAM5|CEACAM3|PSG2"
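The two cases above could be handled roughly as in the sketch below. It is not the parser's actual code. It assumes that an ID matching `C` plus 7 digits is a UMLS CUI and that anything else in the pipe-delimited list is an NCBIGene ID, and it assumes the ID and name fields always split into the same number of elements. The function names are hypothetical.

```python
import re

# Assumption: UMLS CUIs look like "C" followed by 7 digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def split_entities(id_field, name_field):
    """Pair up pipe-delimited IDs and names as (namespace, id, name)."""
    ids = id_field.split("|")
    names = name_field.split("|")
    if len(ids) != len(names):
        raise ValueError("ID and name lists differ in length")
    entities = []
    for ident, name in zip(ids, names):
        namespace = "UMLS" if CUI_PATTERN.fullmatch(ident) else "NCBIGene"
        entities.append((namespace, ident, name))
    return entities


def choose_entities(id_field, name_field):
    """Decide which entities should get association records."""
    entities = split_entities(id_field, name_field)
    if len(entities) == 1:
        return entities
    if len(entities) == 2:
        # Two-element case: prefer NCBIGene, since the UMLS ID may be outdated.
        genes = [e for e in entities if e[0] == "NCBIGene"]
        return genes if genes else entities[:1]
    # Longer lists: enumerate, one record per ID.
    return entities


print(choose_entities("C1451465|1398", "CRK protein, human|CRK"))
print(choose_entities("C0007082|1048|1084|5670",
                      "Carcinoembryonic Antigen|CEACAM5|CEACAM3|PSG2"))
```

For the longer lists this keeps the leading UMLS ID as one of the enumerated records; dropping it in favor of the NCBIGene IDs only would be an equally plausible reading.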

colleenXu · 2022-04-15T23:29:21Z

Also discussed with @erikyao (@newgene @andrewsu):

It would be useful if the parser generated a file of "combinations" for use in auto-generating x-bte annotations based on what records will exist in the API. Right now, I analyze the semmeddb predications file separately to generate the x-bte annotations.

I'm imagining something like this:

- Remove all rows / records where the novelty for subject OR novelty for object is 0. This is because we only make x-bte annotations off of data with novelty = 1 (@andrewsu 's previous decision).
- Based off the data, generate a table of unique combos of subject ID namespace + subject semantic type + predicate + object ID namespace + object semantic type. A count of how many records exist for each combo would be awesome too.

For example:

subject ID	subject type	predicate	object ID	object type	count
UMLS	aapp	DISRUPTS	UMLS	aapp	123456
UMLS	aapp	DISRUPTS	NCBIGene	aapp	222
NCBIGene	aapp	DISRUPTS	UMLS	aapp	4567
NCBIGene	aapp	DISRUPTS	NCBIGene	aapp	123456
UMLS	gngm	CAUSES	UMLS	dsyn	987654
NCBIGene	gngm	CAUSES	UMLS	dsyn	6543
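A table like the one above could be produced with a small counting pass, sketched below. The record layout (`subject` / `object` dicts with `namespace`, `semtype`, and `novelty` keys) is a hypothetical shape chosen for illustration, not the parser's real output format, and the sample records are invented.

```python
from collections import Counter


def count_combos(records):
    """Count records per (subject namespace, subject semantic type,
    predicate, object namespace, object semantic type) combination,
    skipping any record where either novelty value is 0."""
    counts = Counter()
    for rec in records:
        subj, obj = rec["subject"], rec["object"]
        if subj["novelty"] == 0 or obj["novelty"] == 0:
            continue
        combo = (subj["namespace"], subj["semtype"], rec["predicate"],
                 obj["namespace"], obj["semtype"])
        counts[combo] += 1
    return counts


def to_tsv(counts):
    """Render the counts as tab-separated text with a header row."""
    header = "subject ID\tsubject type\tpredicate\tobject ID\tobject type\tcount"
    rows = ["\t".join([*combo, str(n)]) for combo, n in sorted(counts.items())]
    return "\n".join([header, *rows])


# Invented sample records, for illustration only.
sample = [
    {"subject": {"namespace": "UMLS", "semtype": "aapp", "novelty": 1},
     "predicate": "DISRUPTS",
     "object": {"namespace": "NCBIGene", "semtype": "aapp", "novelty": 1}},
    {"subject": {"namespace": "UMLS", "semtype": "gngm", "novelty": 0},
     "predicate": "CAUSES",
     "object": {"namespace": "UMLS", "semtype": "dsyn", "novelty": 1}},
]
print(to_tsv(count_combos(sample)))
```

Writing the result of `to_tsv` to a file alongside the parsed data would give the "combinations" file that the x-bte annotations could be generated from.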

colleenXu · 2022-05-23T22:26:11Z

Related to the issues here: https://github.com/biothings/semmeddb/issues

colleenXu · 2022-05-23T22:41:28Z

And also some notes about semmeddb -> x-bte operations:

- Right now, one operation = 1 output semmed semantic type. If the pending hub is at biothings version >= 0.10, we can use OR to query for multiple output semmed semantic types. However, there are already times when we cut off the number of records returned in a sub-query, and that will get worse if these sub-queries are more expansive (internal slack links).
- Already mentioned above:
  - dealing with retired UMLS IDs + pipes in semmeddb data (original issue, yao's issues)
  - adding operations for the ncbigene field. The problem is how to do this when it may not be clear which operations have many underlying records with ncbigene IDs.

colleenXu · 2022-06-06T20:19:06Z

See some notes in the semmed session of Translator's June 2022 Relay led by Andrew here: https://docs.google.com/document/d/1EpX-gGjYFlMCL6FXpRdeoTJV4TrOdz5EOjGjg7a4iqE/edit#

More links from rtx-kg2 stuff:

- https://github.com/RTXteam/RTX-KG2/blob/1d3cb6da97a16647de5b5ccf67b1ffcb2f4c3c73/predicate-remap.yaml#L4226-L4361 (vs other lines https://github.com/RTXteam/RTX-KG2/blob/1d3cb6da97a16647de5b5ccf67b1ffcb2f4c3c73/predicate-remap.yaml#L4700?)
- mapping of semmed semantic type for nodes/categories to biolink model: https://github.com/RTXteam/RTX-KG2/blob/1d3cb6da97a16647de5b5ccf67b1ffcb2f4c3c73/curies-to-categories.yaml#L286-L412
- https://github.com/callahantiff/SemRepRDF

colleenXu · 2023-05-24T17:02:21Z

@andrewsu @erikyao do we want to close this issue? I thought outdated / retired IDs and pipe-delimited fields have been dealt with?

And then are there other issues we can close?

erikyao · 2023-05-24T17:32:02Z

@colleenXu no issue on my end

newgene added the "data source" label (Data source pending to create a new API) Apr 1, 2022

newgene assigned erikyao Apr 1, 2022

erikyao assigned r76941156 Apr 11, 2022

colleenXu mentioned this issue May 12, 2022

BTE did not find a subset of DMDB edges biothings/biothings_explorer#212

Closed

andrewsu unassigned r76941156 Sep 1, 2022

colleenXu mentioned this issue Sep 21, 2022

Use biothings with_total to efficiently batch query biothings/call-apis.js#59

Merged

erikyao mentioned this issue Jan 11, 2023

SemMedDB V43 update biothings/semmeddb#8

Merged

colleenXu mentioned this issue May 24, 2023

semmeddb: generating x-bte annotations using metatriple file biothings/biothings_explorer#644

Closed

colleenXu mentioned this issue May 24, 2023

semmeddb: update notebook to generate metatriples file biothings/biothings_explorer#645

Closed

andrewsu closed this as completed May 24, 2023

colleenXu mentioned this issue Dec 17, 2024

update BioThings semmeddb: download files ASAP? #265

Open
