update SemMedDB APIs #30
Comments
@andrewsu The parsers are almost finished. I need some suggestions on the item below. Last time we added a new item for the reverse direction, which is not supported by the original SEMMED predications (e.g., treated_by for treats). I assume this task can't be done without using the inverse predication from the biolink model. So I am thinking of using the approach below: Current: Future: The forward direction uses the original predication, but the reverse direction uses the predication from biolink. Or please let me know if you have better suggestions. Thanks. |
Hmm, good question. I'm not in love with the result (because |
Per our discussion, the current plan is to create a single SemMedDB API with each JSON object as a triple. Like this existing API: https://biothings.ncats.io/text_mining_targeted_association/query, where each object looks like this:
Basically this is planned right now:
While we are implementing the |
@andrewsu @newgene @colleenXu I prepared one example file for your reference (should I follow the format in the Word doc or the format above?). Please let me know if you want to adjust the format, especially for the 2 items below:
"affects": { In addition, since we process the SEMMED DB independently, some data checks will be handled by the translator. For example,
I am still doing some tests with DrugMechDB and issue #227 and may adjust something later. Will keep you posted. |
A couple quick thoughts on your questions above.
And a few additional notes:
|
The current structure is one (subject CUI) to many (object CUIs). Thus, the semantic type of the subject may be different when dealing with different object CUIs (example).
No problem. BTE should also take care of this while doing the mapping process.
I referred to the doc used last time for this example. I will change this as suggested.
I used the latest version (07/2021). One column (either subject CUI or object CUI) may contain multiple IDs (example). They are not CUIs, so they have to be processed first (e.g., Entrez ID: 1 <-> C1412045). So, you will see different results. |
My understanding was that we were moving to an association-based format (see the comment above from Chunlei #30 (comment)), not an entity-based format (which is where there's 1 subject to many objects). In the association-based format, there's just "triples": one set of subject-predicate-object. Is that not correct? It's interesting that one CUI may be assigned to a different semantic type, depending on which association it's in. The association-based format should work well with that though. |
I'm also not sure what from Sander means...:
It seems that with association-based modeling, each triple is an object that holds publication info under the association field. I think it's alright if multiple objects have the same PMID under their association field... Also, I believe there's no need to create "reverse" triples, so I'm not sure what this comment from Sander means...
Finally, what does it mean when the subject or object CUI field has multiple IDs? This is from Sander's most recent comment:
Are they equivalent IDs for the same entity or IDs for different entities that are supposed to be considered as some kind of "group" for this association? That should help determine if they should be included as equivalent IDs in one association/triple, or whether this has to be split into multiple triples... |
@colleenXu yes, thank you, you've pointed out some key points. I think the fundamental issue is that each document should be a single row in the predications document, not an entity-centric collection of multiple rows. So @r76941156 your example file needs to be pretty fundamentally reorganized. See the example (from a different resource) that Chunlei posted in this comment #30 (comment) I think there is a secondary question of what it means when the IDs have |
Please see the new example file and let me know if you want to change anything. All information is from SEMMED DB. The current '_id' field is unique (predication_id), so it should be fine for our use. For some CUI ID columns that have '|' between IDs, I did not find any relevant information in SEMMED's documents. Based on comments from Stuppie and Mike, these non-CUI IDs should be gene IDs. You are free to browse the relevant articles based on their PMIDs to see if you want a better presentation format. They used to process them into CUIs and use them in the '_id' columns. |
Super, thanks... So just to confirm, when I propose the following tweaks to the structure:
The modified structure would look something like this:
Let's give @newgene and @colleenXu one day to chime in with any additional feedback. If none, go ahead and push forward with preparing that modified file please... |
(FYI: I'm going to use my understanding of the columns of the predication table, but perhaps there's already been discussions of what columns to keep and what to exclude...My comments below are not intended to be a final word or override previous decisions...) What about:
Like the example below (taken from the example Sander provided)
|
@colleenXu @andrewsu |
After a quick check in with @newgene, let's go with the following:
Example record below:
And a bit of a more detailed note on the cases where the
would get converted to three separate records that look something like this:
Please post if anything isn't clear or if there are other edge cases that need to be considered. Thanks! |
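For anyone following along, here is a minimal Python sketch of the splitting rule described above: a row whose subject or object ID column holds several pipe-separated IDs becomes one record per ID combination. The field names (`subject_id`, `object_id`, `_id`), the `-1`, `-2`, ... suffix used to keep `_id` unique, and the sample values are all assumptions made for illustration; this is not the actual parser code or its output format.

```python
from itertools import product


def split_multi_id_row(row, subject_key="subject_id", object_key="object_id", sep="|"):
    """Expand one row into one record per (subject ID, object ID) pair.

    All key names and the _id suffix scheme are hypothetical.
    """
    subject_ids = row[subject_key].split(sep)
    object_ids = row[object_key].split(sep)
    combos = list(product(subject_ids, object_ids))

    records = []
    for n, (subj, obj) in enumerate(combos, start=1):
        record = dict(row)  # shallow copy; all other fields are kept unchanged
        record[subject_key] = subj
        record[object_key] = obj
        if len(combos) > 1:
            # Assumed scheme: append a counter so each split record keeps a unique _id.
            record["_id"] = f"{row['_id']}-{n}"
        records.append(record)
    return records


# Made-up placeholder row: one CUI followed by two non-CUI (gene) IDs.
example_row = {
    "_id": "123",
    "subject_id": "C0000001|111|222",
    "object_id": "C0000002",
    "predicate": "AFFECTS",
    "pmid": "1",
}

for rec in split_multi_id_row(example_row):
    print(rec["_id"], rec["subject_id"], rec["predicate"], rec["object_id"])
```

A row with single IDs on both sides passes through unchanged, so the same function can be applied to every row of the file.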
I'd like clarification on why we wouldn't merge predications that differ only by PMID (and their unique ID). I believe this is the approach taken by the previous SEMMED APIs, and I think it makes sense from a BTE perspective (having 1 edge with multiple supporting publications). (but side note: BTE's response-transform could potentially merge records that only differ by PMID too...) Is it to keep this data close to its original format? Or perhaps it is related to the "multiple ID" issues described above (where the predication gets split into 3 objects)? |
The primary motivation is to keep the format/structure as close to the raw data as possible. When we consume it in BTE (or at very least before we send the data back out), we'll be transforming it to "1 edge with multiple supporting publications"... |
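To make the "1 edge with multiple supporting publications" step concrete, here is a rough Python sketch of how records that differ only by PMID could be grouped on the consuming side. BTE itself is not written in Python, and the field names (`subject_id`, `predicate`, `object_id`, `pmid`, `pmids`) and sample values are assumptions for illustration only, not the API's actual schema.

```python
def merge_by_triple(records):
    """Group per-predication records into one edge per subject-predicate-object triple.

    Field names are hypothetical; each input record is assumed to carry one PMID.
    """
    edges = {}
    for rec in records:
        key = (rec["subject_id"], rec["predicate"], rec["object_id"])
        edge = edges.setdefault(
            key,
            {
                "subject_id": key[0],
                "predicate": key[1],
                "object_id": key[2],
                "pmids": [],
            },
        )
        # Keep each supporting publication once, in first-seen order.
        if rec["pmid"] not in edge["pmids"]:
            edge["pmids"].append(rec["pmid"])
    return list(edges.values())


# Made-up placeholder records: two share a triple, one does not.
sample_records = [
    {"subject_id": "C0000001", "predicate": "TREATS", "object_id": "C0000002", "pmid": "100"},
    {"subject_id": "C0000001", "predicate": "TREATS", "object_id": "C0000002", "pmid": "200"},
    {"subject_id": "C0000001", "predicate": "AFFECTS", "object_id": "C0000003", "pmid": "100"},
]

for edge in merge_by_triple(sample_records):
    print(edge["subject_id"], edge["predicate"], edge["object_id"], edge["pmids"])
```

Keeping the API at one document per predication and merging only at query time means the stored data stays close to the raw file, which is the motivation given above.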
For now, I will not add the 'semantic_type_name' for them. Or please let me know your preferred format. Thanks. |
Ahh right, great point. The pipe-separated gene IDs were the one thing I didn't discuss with @newgene. I suggest the following then:
right, that is what I am proposing
It looks like the 2018 version only deleted a few semantic types. In that case, let's just use the 2013 version. If there are any semantic types that are even older that don't exist in the 2013 file, then yes, I think not adding the
|
This sounds good to me. In this Semmed case, |
Every row in the csv file should be one document. (CLARIFICATION: When multiple IDs are present in a single row, break that row up so that it is treated as multiple separate rows.) There should be no nesting of multiple rows in a single document (so no introduction of the
|
@andrewsu @newgene @colleenXu @erikyao I have done some tests with limited records on my local machine and su08, and the results look good to me. The GitHub repo for SEMMED DB is now ready to be deployed on the pending hub. After deployment, I will do more tests on the whole dataset, and BTE can also test the related new features. If there is anything I can help with while implementing the new logic in BTE, please let me know. Thanks. |
@erikyao Let's use |
Existing SemMedDB APIs are called |
Published to https://pending.biothings.io/semmeddb Please let me know if I can be of any further help. |
same as https://biothings.ncats.io/semmeddb A few example queries:
where |
Next step is to create the API metadata and test the new features from biothings/call-apis.js#30, @colleenXu? I can review and merge it to |
x-bte annotation has been done for the latest release, and BTE has been switched to use semmeddb instead of the old semmed apis... |
Note that we're encountering issues with outdated UMLS identifiers in the semmed data...TranslatorSRI/NodeNormalization#119 (comment). Things to consider:
This causes issues for BTE since there's no cross-mapping / label retrieval during ID resolution... |
Noting that SEMMEDDB recently released new data files (publications up to Feb 2022). Along with the "outdated identifiers in associations" issue, we may want to look at updating this pending API... |
splitting out SemMedDB from the broader issue of updating APIs on pending #25
In addition to updating to the latest data files, we will remove the logic to convert to biolink model from the parser, leaving that to be done in the smartAPI mapping.