SuppKG API YAML #122

Merged
merged 6 commits into from
Aug 25, 2023


mnarayan1
Contributor

YAML for the SuppKG API. The API is located here.

Notes:

  • There currently isn't an infores for SuppKG, so I've left that field blank for now
  • Since the subjects and objects take on various semantic types, I used NamedThing for the semantic field of the x-bte-operations section. Is there something more specific I should use instead?

I've been trying to test my yaml file with this query:

curl --request POST \
  --url http://localhost:3000/v1/smartapi/suppkg/query \
  --header 'Content-Type: application/json' \
  --data '{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["UMLS:C0008780"]
                },
                "n1": {
                    "categories": ["biolink:NamedThing"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}'
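
For testing outside the terminal, the same request can be sketched in Python. This is only a minimal sketch: it assumes a local BTE instance at the URL used in the cURL command, and `build_query` and `post_query` are hypothetical helper names, not part of BTE.

```python
import json
import urllib.request

# URL taken from the cURL command; assumes BTE is running locally on port 3000.
BTE_URL = "http://localhost:3000/v1/smartapi/suppkg/query"


def build_query(input_id: str, output_category: str) -> dict:
    """Build a one-hop TRAPI query graph from a single ID to a category."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [input_id]},
                    "n1": {"categories": [output_category]},
                },
                "edges": {"e01": {"subject": "n0", "object": "n1"}},
            }
        }
    }


def post_query(query: dict, url: str = BTE_URL) -> dict:
    """POST the query as JSON and return the decoded response (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_query("UMLS:C0008780", "biolink:NamedThing")
# post_query(query)  # uncomment once a local BTE instance is running
```

Building the payload as a dict and serializing it with `json.dumps` avoids shell-quoting problems that can creep into a hand-written `--data` string.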

Here is my smartapi_overrides.json file:

{
  "conf": {
    "only_overrides": true
  },
  "apis": {
    "suppkg": "https://raw.githubusercontent.com/mnarayan1/translator-api-registry/master/suppkg/suppkg.yaml"  
  }
}

However, I'm getting this error: {"error":"Your input query graph is invalid","more_info":"Your Input Query Graph is invalid."}

Are there any issues with my annotations? Should I format my query differently?

@colleenXu
Collaborator

colleenXu commented Jun 17, 2023

I know this post is long and kinda intimidating >.<. I think you've done a good job overall (great attention to detail!).

I'll summarize the feedback as:

  • we most likely want to do a deeper dive into the data to write more specific operations @andrewsu. See the "But regarding operations" section below.
    • If we're interested in only "supplement treats disease" edges, we could narrow our scope to only the combos that seem to represent this info. There seem to be associations in this resource that aren't really related to supplements (example)
    • however, this resource seems to have many of the same issues as semmeddb. So does the parser need a similar amount of effort to remove entities with novelty 0, handle retired IDs, get a publication count for filtering, etc?
  • trying to troubleshoot the issues you're having querying BTE (first point of "Addressing the issues you raised")
  • some minor adjustments to the yaml, giving feedback on small parts of the x-bte annotation

Addressing the issues you raised
  1. I'm not sure why you're getting that error when testing locally. I can paste the cURL you provided into my terminal and BTE will execute without errors
    • Postman converts my queries to the cURL snippet below, which is a bit different from what you provided. Maybe try that format?
    • Maybe run git pull and npm run pull to see if BTE updates. If it does, then run npm run compile to make sure any changes are incorporated.
    • If you still have issues, please post to our lab's #ncats-translator channel so multiple people can look into what's going on...
    • potentially useful reminder: you can test a local file, but it'll need 3 slashes (like I'm using file:///Users/colleenxu/Desktop/translator-api-registry/_temp_testing/suppkg.yaml). And if you adjust your yaml and then want to test it, you'll want to save your yaml file, stop/quit BTE, run the API_OVERRIDE=true npm run smartapi_sync command again, and then start BTE again. BTE won't automatically pull in the changes
cURL from Postman
curl --location 'http://localhost:3000/v1/smartapi/suppkg/query' \
--header 'Content-Type: application/json' \
--data '{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["UMLS:C0008780"]
                },
                "n1": {
                    "categories": ["biolink:NamedThing"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
'
  2. use the not-created-yet infores:biothings-suppkg in the x-translator section (infores for the BioThings API). Then use infores:suppkg in the operations. Once we're close to registering this yaml, I'll make a PR to biolink-model with these new infores IDs.

  3. yeah, there seem to be issues with writing operations as general as NamedThing - NamedThing. This is related to the "But regarding the operations" section below.

    • from my testing, it looks like BTE won't use these operations because every starting/input ID is found to be something more specific than NamedThing (the queryEdge is more specific than the operation so BTE doesn't match this operation with the queryEdge). To get your example query working, once I made the changes to the operations/response-mapping (mentioned in the next section), I had to change the inputs.semantic to a more specific category that matched that input ID (Disease).
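
The local-file reminder in item 1 can be sketched as a small script that writes the overrides content with a three-slash `file:///` URL. This is a sketch under assumptions: `to_file_uri` and `build_overrides` are hypothetical helpers, and the JSON layout simply mirrors the smartapi_overrides.json shown earlier in this thread.

```python
import json
import urllib.parse


def to_file_uri(absolute_path: str) -> str:
    """Turn an absolute POSIX path into a file:/// URI."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    # "file://" plus a path starting with "/" gives the three slashes.
    return "file://" + urllib.parse.quote(absolute_path)


def build_overrides(api_name: str, yaml_path: str) -> dict:
    """Return overrides content pointing one API at a local yaml file."""
    return {
        "conf": {"only_overrides": True},
        "apis": {api_name: to_file_uri(yaml_path)},
    }


overrides = build_overrides(
    "suppkg",
    "/Users/colleenxu/Desktop/translator-api-registry/_temp_testing/suppkg.yaml",
)
print(json.dumps(overrides, indent=2))
```

After regenerating the file, the sync command and BTE restart described in item 1 are still needed, since BTE won't pick up the change on its own.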
Minor yaml suggestions
  • add a sentence in the info.description of the API: about the publication / what suppKG has inside, with a link to the publication
  • change version to what the BioThings API version is. It looks like it might be 2021?
  • tag the endpoints with the path /association/ as association. Right now they're tagged as interaction. My understanding is that this tag on endpoints is to group endpoints by path
  • in the currently commented-out testExamples, I provide the IDs in the format that they would be when querying in TRAPI format. So they have the biolink-model prefix (specified in inputs.id and outputs.id).
Feedback on the current operations
  • the parameter.fields section needs to list / encompass all the fields in the response-mapping because that part of the query tells the BioThings API what fields you want in the response.
  • also I suggest adding more fields: relation.conf and relation.sentence seem useful.
  • in the response-mapping, input_name and output_name are special keywords that tell BTE that these aren't edge-attributes - their values should actually be used to replace the node names when SRI Node Normalizer doesn't provide a human-readable name for those nodes' IDs. They should match the inputs/outputs of the operation:
    • subject-object's response-mapping object should have input_name: subject.name
    • object-subject's response-mapping subject should have input_name: object.name
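
The rule that parameter.fields must encompass every field in the response-mapping can be checked mechanically. The sketch below is illustrative only: `missing_fields` is a hypothetical helper, the mapping keys are examples, and it assumes that requesting a parent field such as `relation` also returns its children such as `relation.pmid`.

```python
def missing_fields(fields: str, response_mapping: dict) -> list:
    """Return response-mapping paths not covered by the comma-separated fields."""
    requested = [f.strip() for f in fields.split(",") if f.strip()]

    def covered(path: str) -> bool:
        # Assumption: a parent field ("relation") covers its dotted children.
        return any(path == f or path.startswith(f + ".") for f in requested)

    return sorted({p for p in response_mapping.values() if not covered(p)})


# Illustrative mapping; key names are examples, not a final annotation.
mapping = {
    "UMLS": "object.umls",
    "suppkg_confidence_score": "relation.conf",
    "pubmed": "relation.pmid",
    "biolink:supporting_text": "relation.sentence",
    "input_name": "subject.name",
    "output_name": "object.name",
}

print(missing_fields("object.umls,relation,subject.name,object.name", mapping))
print(missing_fields("object.umls,subject.name,object.name", mapping))
```

The first call reports nothing missing; the second lists the three `relation.*` paths that the fields string no longer covers.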

But regarding the operations

EDIT: @andrewsu and I have decided that this is a good next step.

With this resource, I think we'll need to write more specific operations, based on the set of unique combos of subject.semtypes,predicate,object.semtypes values (meta-triples). This probably involves analyzing the data underlying this API.

  • counts and example triples would be nice
  • It may also help to get a sense of how many subjects and objects have multiple values in semtypes vs just 1 value. For the operations, we'd want to set one of the semtypes fields, and I'm not sure if providing 1 value will do vs multiple values (using AND or OR?)
  • The stuff I wrote in the summary at the top of this comment is still relevant (on narrower scope + semmeddb issues). We likely only want information on supplements.

Then, depending on how many unique combos there are, we could then decide whether we want to map to biolink-model / write operations manually or through code (like what we do with semmeddb).

Here's an example of what I think the format for operations would be (I've worked through it and tested it):

the x-bte operations and response-mapping section
    SmallMolecule-treats-Disease:
    ## 595,222 records 
      - supportBatch: true
        useTemplating: true ## flag to say templating is being used below
        inputs:
          - id: UMLS
            semantic: SmallMolecule
        requestBodyType: object
        requestBody:
          body: >-
            {"q": {{ queryInputs | replPrefix('predicate:TREATS AND object.semtypes:((dsyn) OR (neop))
            AND subject.umls')| dump }}, "scopes": []}
        outputs:
          - id: UMLS
            semantic: Disease
        parameters:
          fields: object.umls,relation,subject.name,object.name
          size: 1000
        predicate: treats
        source: "infores:suppkg" # no infores for suppkg yet
        response_mapping:
          "$ref": "#/components/x-bte-response-mapping/object"
        # testExamples:
        #   - qInput: "UMLS:C0062737"      ## histaglobin
        #     oneOutput: "UMLS:C0002103"   ## allergic rhinitis
    SmallMolecule-treats-Disease-rev:
      - supportBatch: true
        useTemplating: true
        inputs:
          - id: UMLS
            semantic: Disease
        requestBodyType: object
        requestBody:
          body: >-
            {"q": {{ queryInputs | replPrefix('predicate:TREATS AND subject.semtypes:((phsu) OR (orch))
            AND object.umls')| dump }}, "scopes": []}
        outputs:
          - id: UMLS
            semantic: SmallMolecule
        parameters:
          fields: subject.umls,relation,subject.name,object.name
          size: 1000
        predicate: treated_by
        source: "infores:suppkg" # no infores for suppkg yet
        response_mapping:
          "$ref": "#/components/x-bte-response-mapping/subject"
        # testExamples:
        #   - qInput: "UMLS:C0263338"      ## urticaria, chronic
        #     oneOutput: "UMLS:C0062737"   ## histaglobin
  x-bte-response-mapping:
    object:
      UMLS: object.umls
      suppkg_confidence_score: relation.conf  ## not sure what to name this...you may know better?
      pubmed: relation.pmid
      "biolink:supporting_text": relation.sentence
      input_name: subject.name 
      output_name: object.name
    subject:
      UMLS: subject.umls
      suppkg_confidence_score: relation.conf  ## not sure what to name this...you may know better?
      pubmed: relation.pmid
      "biolink:supporting_text": relation.sentence
      input_name: object.name
      output_name: subject.name 

Example response from testing: suppkg.txt

notes:

  • the q using OR is optional, and something I haven't used in x-bte annotations anywhere yet
  • I changed the examples because the "adrenal cortex hormones" UMLS ID expands to >1000 IDs, so BTE won't actually execute the hop.

@colleenXu
Collaborator

colleenXu commented Jun 17, 2023

well...now I'm done editing my comment >.<. Hopefully this makes it easier to digest

@mnarayan1
Contributor Author

@colleenXu Thank you for the feedback! I've updated the yaml with your suggestions, and replaced the operations section with what you wrote.

With this resource, I think we'll need to write more specific operations, based on the set of unique combos of subject.semtypes,predicate,object.semtypes values (meta-triples). This probably involves analyzing the data underlying this API.

Regarding the above, I can get counts for the predicates and how many subjects/objects have multiple semtypes.

@colleenXu
Collaborator

colleenXu commented Jul 11, 2023

@mnarayan1 (CC @andrewsu )

I'd like to check in: how is the analysis of the data's predicates/semtypes going? or being able to test YAMLs locally?

@mnarayan1
Contributor Author

@colleenXu Sorry for the late response, I was out of town. I fixed the issue with my local installation of BTE, and I am able to test the yaml now.

Here is the analysis I've gotten on the data.

Number of records with only one semtype: 190314

Occurrences of each predicate:
CAUSES: 28792
COEXISTS_WITH: 73720
COMPARED_WITH: 12826
PREDISPOSES: 4647
AUGMENTS: 17074
STIMULATES: 14759
ASSOCIATED_WITH: 17417
ISA: 11234
AFFECTS: 49248
INTERACTS_WITH: 43273
PART_OF: 40920
ADMINISTERED_TO: 10329
PROCESS_OF: 54557
PRODUCES: 8031
PRECEDES: 2453
USES: 25120
LOCATION_OF: 77989
DIAGNOSES: 4895
DISRUPTS: 14084
COMPLICATES: 443
INHIBITS: 16856
TREATS: 43353
PREVENTS: 10247
CONVERTS_TO: 896
SAME_AS: 142
HIGHER_THAN: 1411
LOWER_THAN: 93
METHOD_OF: 5588
MEASURES: 3449
OCCURS_IN: 1139
MANIFESTATION_OF: 237

Is there any other information I should get?

@colleenXu
Collaborator

colleenXu commented Jul 17, 2023

@mnarayan1

Based on your info, it sounds like:

  • a LOT of data has >1 semtype for either the subject or object (68%)
  • this data contains a LOT of different meta-triples / kinds of relationships

I think it would be helpful to have more specific info:

A) Do you know what exact semtypes field values correspond to supplements? If you don't, is there a way to analyze the data and figure this out?

B) Is it possible to generate a table containing counts of how many records there are for each unique combo of subject.semtypes, predicate, object.semtypes values (meta-triples)? Something like this:

subject semtype | predicate | object semtype | count
phsu,orch | TREATS | dsyn | 4000
phsu | TREATS | dsyn | 6000
orch | TREATS | dsyn | 300

What would be most helpful are exact matches: so phsu,orch represents just that, and not stuff that's an inexact match like phsu or phsu,orch,bacs.
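
A minimal sketch of how such exact-match counts could be produced is below. It is not the notebook's code: the nested record layout is assumed from the dotted field names (subject.semtypes, predicate, object.semtypes), the sample records are made up, and sorting each semtypes list assumes that order doesn't matter.

```python
from collections import Counter


def meta_triple(record: dict) -> tuple:
    """Reduce a record to its exact (subject semtypes, predicate, object semtypes)."""
    # Assumption: semtype order is irrelevant, so each list is sorted first.
    subject_types = tuple(sorted(record["subject"]["semtypes"]))
    object_types = tuple(sorted(record["object"]["semtypes"]))
    return subject_types, record["predicate"], object_types


def count_meta_triples(records) -> Counter:
    """Count records per exact meta-triple (no partial or superset matches)."""
    return Counter(meta_triple(r) for r in records)


# Made-up records for illustration only; not real SuppKG data.
records = [
    {"subject": {"semtypes": ["phsu", "orch"]}, "predicate": "TREATS",
     "object": {"semtypes": ["dsyn"]}},
    {"subject": {"semtypes": ["orch", "phsu"]}, "predicate": "TREATS",
     "object": {"semtypes": ["dsyn"]}},
    {"subject": {"semtypes": ["phsu"]}, "predicate": "TREATS",
     "object": {"semtypes": ["dsyn"]}},
    {"subject": {"semtypes": ["phsu", "orch", "bacs"]}, "predicate": "TREATS",
     "object": {"semtypes": ["dsyn"]}},
]

counts = count_meta_triples(records)
for (subj, pred, obj), n in counts.most_common():
    print(",".join(subj), pred, ",".join(obj), n)
```

Using the whole sorted tuple as the key is what keeps phsu,orch separate from phsu alone and from phsu,orch,bacs.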

C) I see a relation.conf field in the records. Do we have a sense of the distribution of this value? A range would be helpful, or something like this

My brainstorming

This KP is very similar to semmeddb...which is problematic because semmeddb has thousands of operations and requires a TON of special processing (pmid count, semtype/domain-predicate/range-predicate exclusions, novelty, etc.).

My tentative ideas are:

  • figure out what meta-triples cover useful info on supplements, and only make x-bte annotation for those.
  • my guess on supplement semtypes:
  • Double-check against the semmeddb exclusions (the type/domain/range stuff) to make sure they're allowed.
  • can we filter by relation.conf? I think we can only query "have at least 1 of the relation.conf values for this record be > X" but that may still be helpful....or we could adjust the parser to only include data with a relation.conf value > X...
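
The parser-side variant of the relation.conf idea could look roughly like the sketch below. It rests on assumptions: `passes_conf` is a hypothetical helper, `relation` is assumed to be either one object or a list of objects each carrying a `conf` value, and the sample records and threshold are made up.

```python
def passes_conf(record: dict, threshold: float) -> bool:
    """True when at least one relation entry has conf above the threshold."""
    relations = record.get("relation", [])
    # Assumption: "relation" may be a single object or a list of objects.
    if isinstance(relations, dict):
        relations = [relations]
    return any(rel.get("conf", 0.0) > threshold for rel in relations)


# Made-up records for illustration only; not real SuppKG data.
records = [
    {"_id": "a", "relation": [{"conf": 0.55}, {"conf": 0.91}]},
    {"_id": "b", "relation": {"conf": 0.60}},
    {"_id": "c", "relation": []},
]

kept = [r["_id"] for r in records if passes_conf(r, 0.8)]
print(kept)
```

This keeps a record if any one of its relation entries clears the threshold, which matches the "at least 1 of the relation.conf values" behavior described above.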

@colleenXu
Collaborator

Err...and the table from B) may be way too large for a github comment. A csv / tsv file may be the best way to share this table (along with a jupyter notebook or google colab notebook of the data analysis you're doing and how you're generating the table).

@mnarayan1
Contributor Author

@colleenXu

Here is the notebook where I've done my work. It has a list of semtypes that could correspond to supplements, distribution of relation.conf values, and code used to generate the table of meta-triples.

A) There doesn't seem to be anywhere in SuppKG that explicitly states whether or not something is a dietary supplement. However, I looked through this list (containing all 133 UMLS semantic types) and compiled a list of semtypes that could possibly correspond to a supplement (excluding objects, body parts, diseases, etc.)

B) Here is the csv file with unique triples and their counts.

C) The distribution of relation.conf values is in the notebook. All relation.conf values are between 0.5 and 0.968.

@andrewsu
Contributor

So while there are many metatriples in suppkg, we are really only interested in the ones that directly relate to supplements. So if you took your list of possible semantic types associated with supplements from your notebook, can you redo the analysis showing the counts of each metatriple in this csv?

@mnarayan1
Contributor Author

Here are the counts of metatriples with only supplements.

@andrewsu
Contributor

Hmm, that still results in a huge list of metatriples. So let's change gears a little bit. Rather than trying to come up with exclusion filters to remove what we don't want, let's instead focus on defining a small set of inclusion filters for triples that we do want. For this resource, the most unique thing we get is [supplements] - TREATS - [disease]. So, if I restrict your CSV to rows where the predicate is TREATS, the object is "dsyn", and the count is > 100, I get this list:

subject | predicate | object | count
['orch', 'phsu'] | TREATS | ['dsyn'] | 2180
['phsu'] | TREATS | ['dsyn'] | 2066
['phsu', 'plnt'] | TREATS | ['dsyn'] | 1307
['orch', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 746
['orch', 'phsu', 'vita', 'dsp'] | TREATS | ['dsyn'] | 301
['phsu', 'plnt', 'dsp'] | TREATS | ['dsyn'] | 299
['food', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 297
['bacs', 'orch', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 281
['antb', 'orch'] | TREATS | ['dsyn'] | 236
['bact', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 218
['antb'] | TREATS | ['dsyn'] | 202
['aapp', 'gngm', 'bacs', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 176
['bact', 'phsu'] | TREATS | ['dsyn'] | 167
['bacs', 'phsu'] | TREATS | ['dsyn'] | 150
['aapp', 'gngm', 'phsu'] | TREATS | ['dsyn'] | 132
['bacs', 'orch', 'phsu'] | TREATS | ['dsyn'] | 128
['inch', 'phsu'] | TREATS | ['dsyn'] | 119
['phsu', 'dsp'] | TREATS | ['dsyn'] | 106

I would take the union of all the subject types, and see if you can create a smartAPI operation (or a set of operations) to retrieve those triples specifically. Does that make sense?
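
Taking the union of the subject types in the list above can be sketched as follows. The rows are copied from that list (predicate TREATS and object ['dsyn'] in every row); `union_of_subject_types` is a hypothetical helper name.

```python
# (subject semtypes, count) pairs copied from the list above.
ROWS = [
    (["orch", "phsu"], 2180),
    (["phsu"], 2066),
    (["phsu", "plnt"], 1307),
    (["orch", "phsu", "dsp"], 746),
    (["orch", "phsu", "vita", "dsp"], 301),
    (["phsu", "plnt", "dsp"], 299),
    (["food", "phsu", "dsp"], 297),
    (["bacs", "orch", "phsu", "dsp"], 281),
    (["antb", "orch"], 236),
    (["bact", "phsu", "dsp"], 218),
    (["antb"], 202),
    (["aapp", "gngm", "bacs", "phsu", "dsp"], 176),
    (["bact", "phsu"], 167),
    (["bacs", "phsu"], 150),
    (["aapp", "gngm", "phsu"], 132),
    (["bacs", "orch", "phsu"], 128),
    (["inch", "phsu"], 119),
    (["phsu", "dsp"], 106),
]


def union_of_subject_types(rows) -> list:
    """Collect every semtype that appears in any subject list, sorted."""
    types = set()
    for semtypes, _count in rows:
        types.update(semtypes)
    return sorted(types)


subject_types = union_of_subject_types(ROWS)
# Joined with OR, the union can be dropped into a semtypes query clause.
print(" OR ".join(subject_types))
```

The result is a flat list of semtypes rather than the original combinations, so a query built from it will also match subject combinations that are not in the list.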

@mnarayan1
Contributor Author

@andrewsu @colleenXu I've finished writing the operations to retrieve the above triples. I've tested them out on my local BTE instance, and the queries for each triple type seem to work (I included the testExamples in the yaml). Is there anything else I should add?

@colleenXu
Collaborator

colleenXu commented Aug 16, 2023

@mnarayan1

Suggested major edits:

I think it'll be simpler and more elegant to have 2 operations

One for supplement-treats-disease. It would be very similar to the current SmallMolecule-treats-Disease, but the object.semtypes would be set to all SEMMED semantic types that are mapped to Disease: (acab OR anab OR cgab OR comd OR dsyn OR mobd OR neop). So the requestBody would be like the code chunk below
  • During my testing, the nested parentheses weren't needed.
  • This list of semtypes comes from my analysis of the valid SEMMEDDB metatriples after Translator-curated exclusions were applied.

        requestBody:
          body: >-
            {"q": {{ queryInputs | replPrefix('predicate:TREATS AND object.semtypes:(acab OR anab OR cgab OR comd OR dsyn OR mobd OR neop)
            AND subject.umls')| dump }}, "scopes": []}

The other for disease-treated_by-supplement. It would be similar to one of the rev operations, but subject.semtypes would be required to match some of the SEMMED semantic types for supplements: (aapp OR antb OR bacs OR dsp OR food OR inch OR orch OR phsu OR vita), and required NOT to match the other supplement semantic types: (bact OR gngm OR plnt). So the requestBody would be like the code chunk below
  • excluding bact, gngm, plnt because there are Translator-curated exclusions (domain-predicate) that say it isn't valid to have these as the subject for a TREATS statement

        requestBody:
          body: >-
            {"q": {{ queryInputs | replPrefix('predicate:TREATS AND (NOT subject.semtypes:(bact OR gngm OR plnt))
            AND subject.semtypes:(aapp OR antb OR bacs OR dsp OR food OR inch OR orch OR phsu OR vita)
            AND object.umls')| dump }}, "scopes": []}
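
Since both query prefixes are long OR lists, it may help to generate them from the semtype lists rather than typing them by hand. The sketch below only builds the string passed to replPrefix; `semtype_clause` and `build_prefix` are hypothetical helpers, and nothing here reproduces how BTE evaluates the template.

```python
DISEASE_TYPES = ["acab", "anab", "cgab", "comd", "dsyn", "mobd", "neop"]
SUPPLEMENT_TYPES = ["aapp", "antb", "bacs", "dsp", "food", "inch", "orch", "phsu", "vita"]
EXCLUDED_SUBJECT_TYPES = ["bact", "gngm", "plnt"]


def semtype_clause(field: str, semtypes) -> str:
    """Render e.g. object.semtypes:(dsyn OR neop)."""
    return f"{field}:({' OR '.join(semtypes)})"


def build_prefix(semtype_field: str, include, id_field: str, exclude=()) -> str:
    """Join the TREATS filter, optional NOT clause, include clause and ID field."""
    parts = ["predicate:TREATS"]
    if exclude:
        parts.append(f"(NOT {semtype_clause(semtype_field, exclude)})")
    parts.append(semtype_clause(semtype_field, include))
    parts.append(id_field)
    return " AND ".join(parts)


forward = build_prefix("object.semtypes", DISEASE_TYPES, "subject.umls")
reverse = build_prefix(
    "subject.semtypes", SUPPLEMENT_TYPES, "object.umls", exclude=EXCLUDED_SUBJECT_TYPES
)
print(forward)
print(reverse)
```

The generated strings match the two prefixes above once the folded YAML line breaks are read as single spaces.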
adjust response-mapping

The final response-mapping may look something like this:

    object:
      UMLS: object.umls
      ref_pmid: relation.pmid
      "biolink:supporting_text": relation.sentence
      input_name: subject.name
      output_name: object.name
      ## not including these fields due to data-processing / biolink-modeling issues
      # suppkg_confidence_score: relation.conf
change the parameter.fields to match the response-mapping

For the two operations, the parameter.fields can be changed since we'll only need the fields that are referenced in the response-mapping. So something like this could work for the supplement-treats-disease operation (object.umls contains the output): object.umls,relation.pmid,relation.sentence,subject.name,object.name

Minor edits

  • set info.x-translator.biolink-version to "3.5.3" instead
  • in servers, you can probably remove the Production server object since it's identical to the Encrypted Production server object
  • in description, also include the link to the suppKG paper (ref: Andrew's post here)

@colleenXu
Collaborator

@andrewsu

This API still seems to have "fake" UMLS:DC IDs, and I suggest discussing this (parser enhancements?) before registering the SmartAPI yaml, which would make it accessible via the API-specific endpoints (v1/smartapi/).

This was previously brought up starting here and the comments below it all seem relevant.

@andrewsu
Contributor

@colleenXu Let's go ahead and allow these "fake UMLS IDs" to be returned. Presumably, NodeNormalizer will fail to resolve these, and BTE will use the original names from SuppKG as the human-readable names for presentation in the ARAX UI and Translator UI. At least that's how I think it will work -- let's see how it works in practice...

@mnarayan1 let us know when you have the updates done from @colleenXu's suggestions above...

@mnarayan1
Contributor Author

@andrewsu @colleenXu I've finished with the edits, and the testing is still working for me.

colleenXu merged commit 1cbc8ac into NCATS-Tangerine:master on Aug 25, 2023
@colleenXu
Collaborator

I'm going to merge this PR, since the yaml looks ready. Good job @mnarayan1!

We'll continue discussion and next steps in biothings/biothings_explorer#706
