Create API for SuppKG (Dietary Supplements) #55

Closed
andrewsu opened this issue Feb 10, 2022 · 7 comments
Labels: data source (Data source pending to create a new API)

Comments

@andrewsu
Member

andrewsu commented Feb 10, 2022

SuppKG contains a variety of edges for Dietary Supplements.

Publication: https://pubmed.ncbi.nlm.nih.gov/35709900/
Preprint: https://arxiv.org/abs/2106.12741
Download link: https://github.com/zhang-informatics/SemRep_DS/tree/main/SuppKG

There are 595,222 entries under the links section of the JSON. Here is one example record:

        {
            "relations": [
                {
                    "pmid": 1394115,
                    "sentence": "Turmeric and curcumin were also found to reverse the aflatoxin induced liver damage produced by feeding aflatoxin B1 (AFB1) (5 micrograms/day per 14 days) to ducklings.",
                    "conf": 0.9303833842,
                    "tuid": 0
                },
                {
                    "pmid": 1394115,
                    "sentence": "Reversal of aflatoxin induced liver damage by turmeric and curcumin.",
                    "conf": 0.9396179318000001,
                    "tuid": 0
                }
            ],
            "source": "C0001734",
            "target": "C0151763",
            "key": "CAUSES"
        },

I believe we want to create a record like this (the name for each node can be found in the nodes section of the JSON).

{
    "_id": "C0001734_C0151763_CAUSES",
    "subject": {
        "umls": "C0001734",
        "name": "aflatoxin",
        "semtypes": [ "bacs", "hops"]
    },
    "relation": [
        {
            "pmid": 1394115,
            "sentence": "Turmeric and curcumin were also found to reverse the aflatoxin induced liver damage produced by feeding aflatoxin B1 (AFB1) (5 micrograms/day per 14 days) to ducklings.",
            "conf": 0.9303833842,
            "tuid": 0
        },
        {
            "pmid": 1394115,
            "sentence": "Reversal of aflatoxin induced liver damage by turmeric and curcumin.",
            "conf": 0.9396179318000001,
            "tuid": 0
        }
    ],
    "object": {
        "umls": "C0151763",
        "name": "damage liver",
        "semtypes": [ "patf" ]
    },
    "predicate": "CAUSES"
}
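
The mapping from a raw link to the proposed record can be sketched in Python. This is only a minimal sketch, not the contributed parser: `build_record` and `nodes_by_id` are hypothetical names, and it assumes each node's `name` and `semtypes` are available from the nodes section under those keys.

```python
def build_record(link, nodes_by_id):
    """Merge one SuppKG link with its subject and object node info."""

    def describe(cui):
        # Assumed node layout: {"name": ..., "semtypes": [...]}.
        # A CUI missing from the lookup yields name=None and no semtypes.
        node = nodes_by_id.get(cui, {})
        return {
            "umls": cui,
            "name": node.get("name"),
            "semtypes": node.get("semtypes", []),
        }

    return {
        "_id": f"{link['source']}_{link['target']}_{link['key']}",
        "subject": describe(link["source"]),
        "relation": link["relations"],
        "object": describe(link["target"]),
        "predicate": link["key"],
    }


# Example built from the link shown above (only one of its two relations kept).
example_link = {
    "relations": [
        {
            "pmid": 1394115,
            "sentence": "Reversal of aflatoxin induced liver damage by turmeric and curcumin.",
            "conf": 0.9396179318000001,
            "tuid": 0,
        }
    ],
    "source": "C0001734",
    "target": "C0151763",
    "key": "CAUSES",
}
example_nodes = {
    "C0001734": {"name": "aflatoxin", "semtypes": ["bacs", "hops"]},
    "C0151763": {"name": "damage liver", "semtypes": ["patf"]},
}
record = build_record(example_link, example_nodes)
```

Building `_id` from source, target, and predicate keeps one document per edge, with every supporting sentence collected in the `relation` array.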
@colleenXu

colleenXu commented May 24, 2022

Pasted from Slack: my notes after reviewing the parser's sample output file.

An open-source contributor created this parser (https://github.com/mnarayan1/suppkg-data/blob/main/parser.py) to address this ticket (#55). The sample output file is at https://drive.google.com/file/d/1qsPvQre8E4Cz0JqvLR44A8vMuJz57VAq/view?usp=sharing

I think the structure is okay for writing queries with x-bte annotation.
But...

Point 0: I wonder whether the relation array ever gets a LOT of elements.

Point 1: Looking at the output file, some umls IDs seem to start with "DC", which seems incorrect. It looks like the "D" should be removed, so the ID starts with "C". Examples: DC0028908, DC0016163, DC1140671 (discussed in Points 2 and 3).

Point 2: Looking at the output file, some IDs don't seem to match their names. Examples:

  • the idx 2 record has object.name as "aceite niauli". I'm not sure what that means. The object.umls is DC0028908. After removing the "D" (see point 1), this ID corresponds with "oils".
  • the idx 6 record has object.name as "genotoxins". However, the corresponding ID's official name is mutagens (genotoxins does show up as an "atom" underneath, likely a cross-mapped ID).

Point 3: Looking at the output file, some semantic types don't exist or don't seem to match the ID given.
I'm seeing "dsp" in object.semtypes (idx 2, idx 7 records), and this isn't a UMLS abbreviation (they're always 4 letters).

  • the idx 7 record has object.umls: DC0016163, which seems to refer to "fishes". However, the object.semtypes are imft (Immunologic Factor) and "dsp". Which seems odd.
  • the idx 9 record has DC1140671, which seems to refer to "rice / Oryza sativa". However, its semtypes are orch (organic chemical) and phsu (pharmacological substance). Again, odd.
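
Points 0 and 3 could be checked across the whole output file with a small audit. The following is a sketch under assumptions: `audit_records` is a hypothetical helper, the records are assumed to follow the proposed structure (`relation`, `subject.semtypes`, `object.semtypes`), and the two sample records are made up for illustration rather than taken from the output file.

```python
from collections import Counter


def audit_records(records):
    """Return the largest relation-array size and a count of odd semtypes."""
    max_relations = 0
    odd_semtypes = Counter()
    for rec in records:
        # Point 0: track how large the relation array can get.
        max_relations = max(max_relations, len(rec.get("relation", [])))
        # Point 3: UMLS semantic type abbreviations are 4 letters long.
        for side in ("subject", "object"):
            for semtype in rec.get(side, {}).get("semtypes", []):
                if len(semtype) != 4:
                    odd_semtypes[semtype] += 1
    return max_relations, odd_semtypes


# Illustrative records only (not real rows from the output file).
sample = [
    {
        "relation": [{}, {}],
        "subject": {"semtypes": ["bacs", "hops"]},
        "object": {"semtypes": ["patf"]},
    },
    {
        "relation": [{}],
        "subject": {"semtypes": ["phsu"]},
        "object": {"semtypes": ["imft", "dsp"]},
    },
]
max_relations, odd_semtypes = audit_records(sample)
```

On the sample above, `max_relations` is 2 and `odd_semtypes` counts "dsp" once; running the same function over the real output would show how widespread either issue is.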

@andrewsu
Member Author

I think this is because SuppKG is apparently using a (very) old version of UMLS. From their preprint:

[image: excerpt from the preprint]

This may mean that we should perform some of the same analyses/filtering as we did for semmeddb, as described in biothings/semmeddb#2.

@colleenXu

colleenXu commented May 24, 2022

(from looking at the materials + method section of the preprint)

It sounds like the authors made some pseudo-UMLS IDs from "iDISK terms" that didn't map to an existing CUI... is that right? And that some of these "iDISK terms" were drug supplement ingredients... This makes me wonder about the MRCONSO.RRF file that they mention, which sounds like it may have mappings from the original "iDISK terms" to the pseudo-UMLS IDs used in their KG...

Also, it sounds like they put "phsu" as the semantic type for all drug supplements for their work, even if the original UMLS ID isn't considered a Pharmacological Substance. This makes me think of the plant terms (Point 3 / bullet 2 in my above post).

They also mention a networkx file and I wonder if that's useful...

@erikyao
Contributor

erikyao commented Jun 1, 2022

> (from looking at the materials + method section of the preprint)
>
> It sounds like the authors made some pseudo-UMLS IDs from "iDISK terms" that didn't map to an existing CUI... is that right? And that some of these "iDISK terms" were drug supplement ingredients... This makes me wonder about the MRCONSO.RRF file that they mention, which sounds like it may have mappings from the original "iDISK terms" to the pseudo-UMLS IDs used in their KG...
>
> Also, it sounds like they put "phsu" as the semantic type for all drug supplements for their work, even if the original UMLS ID isn't considered a Pharmacological Substance. This makes me think of the plant terms (Point 3 / bullet 2 in my above post).
>
> They also mention a networkx file and I wonder if that's useful...

Hi @colleenXu, this is from SemRep_DS/docs/SemRep_full_fielded_output.txt:

*_CUI: The CUI of the subject/object entity. If a CUI starts with
'DC' instead of just 'C' it is an iDISK CUI and is not present in the UMLS.
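
Given that rule, a parser could separate iDISK identifiers from regular UMLS CUIs with a simple prefix check. This is a minimal sketch; `is_idisk_cui` is a hypothetical helper name, and the IDs are the ones mentioned earlier in this thread.

```python
def is_idisk_cui(cui: str) -> bool:
    """True if the ID starts with 'DC', i.e. it is an iDISK CUI, not a UMLS one."""
    return cui.startswith("DC")


ids = ["C0001734", "DC0028908", "DC0016163", "C0151763"]
idisk_ids = [cui for cui in ids if is_idisk_cui(cui)]
umls_ids = [cui for cui in ids if not is_idisk_cui(cui)]
```

Under this reading, stripping the leading "D" would likely turn an iDISK identifier into an unrelated UMLS CUI, so it is probably safer to keep the prefix and handle these IDs separately.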

@andrewsu
Member Author

andrewsu commented Jun 1, 2022

@erikyao deployed the API at https://biothings.ncats.io/suppkg based on the parser written by @mnarayan1 (https://github.com/biothings/SuppKG). @colleenXu can you add creation of the smartAPI annotation to your to-do list please? ("Normal" priority -- no special urgency here...)

Let's also leave this ticket open for the moment so we can contemplate enhancements to the parser (for example, to handle retired UMLS IDs, get more current human-readable names and semtypes, etc)...

@colleenXu

colleenXu commented Jun 16, 2023

@colleenXu

colleenXu commented Aug 25, 2023

Closing because the API has been made. The rest of the work and discussion can be moved to biothings/biothings_explorer#706
