
expanded functionality for biolink:publications edge-attribute #677

Closed
colleenXu opened this issue Jul 26, 2023 · 9 comments
Labels
enhancement New feature or request On Test Related changes are deployed to Test server

Comments

@colleenXu
Collaborator

colleenXu commented Jul 26, 2023

The Translator UI is supposed to be able to handle more kinds of "references" (publications) for an edge - not just the PMIDs we provide in the biolink:publications edge-attribute right now. In Translator Slack comms, the UI team has confirmed that they plan to support the specification here.

For now, we don't have to worry about "free-text description"-style references (we don't really have any of these).

And I'll explain the spec below...

Implementation

We'd like to adjust / expand our behavior to match this spec and provide more reference info to users. We'd do this by taking the values from one or more fields, adding or replacing prefixes so they're correct, and putting the results into 1 edge-attribute.

Here's what's involved:

  1. This is the set of response-mapping keys for fields that we want to use as input for the biolink:publications's value, and how they should be processed:
    1. ref_pmid (previously pubmed): we want the output-strings to have the prefix PMID
    2. ref_url (previously biolink:source_web_page): no processing needed. The strings are urls
    3. ref_pmcid: we want the output-strings to have the prefix PMCID (however, it was confusing which prefix to use, so I opened a biolink-model issue: "Confusion on PMC vs PMCID", biolink/biolink-model#1366)
    4. ref_clinicaltrials: we want the output-strings to have the prefix clinicaltrials. However, the spec said putting this data in this edge-attribute was temporary / in-flux...
    5. ref_doi: we want the output-strings to have the prefix doi (biolink-model spelling ref)
    6. ref_isbn: we want the output-strings to have the prefix isbn (biolink-model spelling ref)
  2. I've updated the x-bte annotation to use these special response-mapping keys here. To use these SmartAPI yamls for dev work, put the content below into your local BTE smartapi_overrides file:
SmartAPI overrides

Note: PharmGKB is excluded here because it hasn't been added to the config file yet. It can be added here if you want, but the APIs listed here should be plenty to test the 6 response-mapping keys above...

{
  "conf": {
    "only_overrides": false
  },
  "apis": {
    "0212611d1c670f9107baf00b77f0889a": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/CTD/smartapi.yaml",
    "1f47552dabd67351d4c625adb0a10d00": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/EBIgene2phenotype/smartapi.yaml",
    "77ed27f111262d0289ed4f4071faa619": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/MGIgene2phenotype/smartapi.yaml",
    "38e9e5169a72aee3659c9ddba956790d": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/bindingdb/smartapi.yaml",
    "e3edd325c76f2992a111b43a907a4870": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/dgidb/openapi.yml",
    "316eab811fd9ef1097df98bcaa9f7361": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/gtrx/gtrx.yaml",
    "dca415f2d792976af9d642b7e73f7a41": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/litvar/smartapi.yaml",
    "8f08d1446e0bb9c2b323713ce83e2bd3": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mychem.info/openapi_full.yml",
    "671b45c0301c8624abbd26ae78449ca2": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mydisease.info/smartapi.yaml",
    "59dce17363dce279d389100834e43648": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mygene.info/openapi_full.yml",
    "09c8782d9f4027712e65b95424adba79": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/myvariant.info/openapi_full.yml",
    "b772ebfbfa536bba37764d7fddb11d6f": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/ncats_rare_source/smartapi.yaml",
    "edeb26858bd27d0322af93e7a9e08761": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/pfocr/smartapi.yaml",
    "03283cc2b21c077be6794e1704b1d230": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/rhea/smartapi.yaml",
    "1d288b3a3caf75d541ffaae3aab386c8": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/semmeddb/smartapi.yaml",
    "d22b657426375a5295e7da8a303b9893": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/biolink/openapi.yml"
  }
}
  3. For each KG edge, we'll want to take these 6 response-mapping keys, process the fields' contents, and stuff those output-strings together into an array of strings that's 1 biolink:publications edge-attribute (many-to-1). It should have this format:
{
    "attribute_type_id": "biolink:publications",
    "value": [
        "PMID:1234",
        "http://hello_world.com",
        "PMCID:PMC1234",
        "clinicaltrials:NCT1234",
        "doi:1234/1234",
        "isbn:1234-1234-1"
    ],
    "value_type_id": "linkml:Uriorcurie"
}

Potentially-helpful implementation notes:

  • we want the code to handle when the field's value is null or an empty string (ignore, don't add to output?)
  • we want the code to remove duplicates in the value array, after the array has been assembled (and after records are merged into edges...)
  • sometimes a field's value will already have a prefix (it may or may not be formatted exactly the way it should be for biolink-model) and sometimes it won't. So sometimes we'll be adding a prefix, sometimes replacing it, and sometimes we may not need to do anything (it's already formatted the way we want)
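
A minimal Python sketch of the processing described above, for illustration only (BTE's real implementation lives in the linked PRs and may differ). It uses the 6 response-mapping keys and the prefixes listed in this issue; the function names and the shape of the input record (a dict mapping each ref_ key to one value or a list of values) are assumptions.

```python
import re

# Response-mapping key -> CURIE prefix, as listed in this issue.
# ref_url maps to None because urls are passed through unchanged.
PREFIXES = {
    "ref_pmid": "PMID",
    "ref_url": None,
    "ref_pmcid": "PMCID",
    "ref_clinicaltrials": "clinicaltrials",
    "ref_doi": "doi",
    "ref_isbn": "isbn",
}


def normalize(value, prefix):
    """Return the value with exactly one correctly-cased prefix (urls unchanged)."""
    text = str(value).strip()
    if prefix is None:
        return text
    # Strip the prefix if it is already there (any casing), then re-add it.
    bare = re.sub(rf"^{re.escape(prefix)}:", "", text, flags=re.IGNORECASE)
    return f"{prefix}:{bare}"


def build_publications_attribute(record):
    """Hypothetical helper: merge all ref_* fields into 1 edge-attribute."""
    values = []
    for key, prefix in PREFIXES.items():
        raw = record.get(key)
        items = raw if isinstance(raw, list) else [raw]
        for item in items:
            if item is None or str(item).strip() == "":
                continue  # ignore null / empty-string values
            values.append(normalize(item, prefix))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    values = list(dict.fromkeys(values))
    if not values:
        return None  # nothing to report, so no edge-attribute
    return {
        "attribute_type_id": "biolink:publications",
        "value": values,
        "value_type_id": "linkml:Uriorcurie",
    }
```

In this sketch, duplicates are removed within one record; when several records are merged into one edge, the same dict.fromkeys step would have to run again on the combined array.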
@colleenXu colleenXu added the enhancement New feature or request label Jul 26, 2023
@colleenXu
Collaborator Author

Issues we'll want to deal with at some point:

  • I didn't modify biolink / monarch API but I'll need to (it still uses the pubmed response-mapping key). I think this is tricky to work with because of the special api-response-transform that happens to it. It's not clear to me if the transformed publications field holds only PMIDs or if it can sometimes have non-PMID publications the way the raw API responses do...
  • is the value_type_id above correct? I'll need to check

Known issue, set-aside and out-of-scope for now:
We won't fully follow the spec because of some limitations with the current x-bte annotation (this may improve with JQ-related processing):


The documentation says we "MUST report only one identifier per publication" and must report the CURIE (not the full url) whenever we can.

But right now, we won't follow this in these cases:

  • MyChem:
    • Chembl drug mechanisms: both the CURIE and the expanded url will be present for PMID / clinicaltrials / doi / PMC. This is because each reference object has a full url, but only some kinds also had the ID, and I included response-mapping keys for both CURIE stuff and url stuff...
    • Chembl drug indications: both the CURIE and the expanded url will be present for clinicaltrials. Same reasoning as above
  • Bindingdb: in some cases, an edge will have a doi + a pmid that actually refer to the same publication (they're in the same relation object of the original API response). I annotated both fields because there are plenty of cases where relationships have only 1 of the fields
  • PharmGKB (not added to config list yet): ref_url: data.literature._sameAs seems to be the best way to get 1 value per unique reference. However, sometimes this field's value is an expanded url for a PMID/PMCID (clinical guidelines, variant annotation). There shouldn't be any issues with "reporting the same publication more than once"

Finally: the spec says a second biolink:publications edge-attribute can be made when the reference info is a free-text string. I don't think we need to do that in this issue, because I didn't notice any strong examples of these in the SmartAPI yamls...

@rjawesome
Contributor

rjawesome commented Jul 26, 2023

sometimes a field's value will already have a prefix (it may or may not be formatted exactly the way it should be for biolink-model) and sometimes it won't. So sometimes we'll be adding a prefix, sometimes replacing it, and sometimes we may not need to do anything (it's already formatted the way we want)

My current plan is to check if the prefix [with ":" so like "PMID:"] is there (in any casing), and if so, strip the prefix. Then just add the prefix. Are there any other cases that should be handled?

I didn't modify biolink / monarch API but I'll need to (it still uses the pubmed response-mapping key). I think this is tricky to work with because of the special api-response-transform that happens to it. It's not clear to me if the transformed publications field holds only PMIDs or if it can sometimes have non-PMID publications the way the raw API responses do

It seems like the current code is attempting to filter out only PMID IDs. But if they are using the same/known prefixes for the other ID types, then we have two options

  • this transformer could be modified to put different ID types into different properties based on prefix (ie. publicationsPMID, publicationsPMCID, etc.) which could then be referenced in the smartapi yaml using ref_pmid, ref_pmcid, etc.
  • if the publication ids from biolink are trusted to be formatted exactly as we want them, we could modify the transformer to directly take the raw API publications field and put it into the publications field of the record
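
To make the first option concrete, here is a rough Python sketch of splitting a publications list into per-prefix properties. It is not the actual api-response-transform code. The property names publicationsPMID and publicationsPMCID come from the suggestion above; publicationsDOI, publicationsOther, and the function name are made up for illustration, and it assumes the raw ids use "prefix:id" formatting.

```python
# Assumed mapping from upper-cased prefix to record property name.
# publicationsDOI and publicationsOther are hypothetical names.
KNOWN_PREFIXES = {
    "PMID": "publicationsPMID",
    "PMCID": "publicationsPMCID",
    "DOI": "publicationsDOI",
}


def split_publications_by_prefix(publications):
    """Group publication ids by prefix so each group can get its own ref_* key."""
    grouped = {}
    for pub in publications or []:
        # partition() returns the whole string as the prefix when ":" is missing,
        # so unprefixed values fall through to publicationsOther.
        prefix, _, _ = str(pub).partition(":")
        prop = KNOWN_PREFIXES.get(prefix.upper(), "publicationsOther")
        grouped.setdefault(prop, []).append(pub)
    return grouped
```

Each resulting property could then be referenced from the SmartAPI yaml with the matching ref_pmid / ref_pmcid / ref_doi key, while anything in publicationsOther (for example a full url) would need a separate decision.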

PharmGKB (not added to config list yet): ref_url: data.literature._sameAs seems to be the best way to get 1 value per unique reference. However, sometimes this field's value is an expanded url for a PMID/PMCID (clinical guidelines, variant annotation). There shouldn't be any issues with "reporting the same publication more than once"

If we wanted to use the CURIEs when possible, we could parse the URL looking for the base URLs that identify a PMID/PMCID (i.e. http://www.ncbi.nlm.nih.gov/pmc/, http://www.ncbi.nlm.nih.gov/pubmed), and then translate it to a PMID/PMCID/etc.

@colleenXu
Collaborator Author

Remember, you can search the PR / branch for SmartAPI yamls that contain the keywords you're testing

Test for ref_isbn, ref_pmid, ref_url
  • Query only MyDisease through BTE with the TRAPI query below.
  • The sub-query to MyDisease will retrieve this data
    • phenotype_related_to_disease objects have diff mixes of ref_ fields:
      • HP:0002762 has both a pmid and a website field (representing two different references)
      • many have only the isbn field or only the website field
    • ref_ field output has prefixes on it (isbn prefix needs replacing)
  • Note: ref_isbn is only used in the MyDisease Disease -> PhenotypicFeature operations (response-mapping here), so that's what this test is based on.
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["OMIM:127300"],
                    "categories":["biolink:Disease"]
                },
                "n1": {
                    "categories":["biolink:PhenotypicFeature"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
Test for ref_doi, ref_clinicaltrials (and ref_pmid and ref_url)
  • This test is based on MyChem's drug-mechanism operations (response-mapping here and here)
  • Query only MyChem through BTE with the TRAPI query below.
  • The sub-query to MyChem will retrieve this data
  • drug_mechanisms objects have diff mixes of ref_ fields:
    • the target_components.uniprot P10721 has mechanism_refs for doi, clinicaltrials, pmid, and url
  • notice that for this KP, no ref_ field output has prefixes
  • Note that there are other operations with keywords ref_doi and ref_clinicaltrials
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["CHEMBL.COMPOUND:CHEMBL1278146"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Gene"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:interacts_with"]
                }
            }
        }
    }
}
Test for ref_pmcid (and ref_url)
  • This test is based on MyChem's drug-mechanism operations (response-mapping here and here)
  • Query only MyChem through BTE with the TRAPI query below.
  • The sub-query to MyChem will retrieve this data
  • the target_components.uniprot O14763 has mechanism_refs for pmc and url (the object that's type: "Other")
  • Note that there are other operations that use the keyword ref_pmcid
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["CHEMBL.COMPOUND:CHEMBL1743036"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Gene"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:interacts_with"]
                }
            }
        }
    }
}
Test using bindingdb: ref_pmid, ref_url, ref_doi
  • This test is based on BindingDB operations (response-mapping here)
  • Query only BindingDB through BTE with the TRAPI query below.
  • The sub-query to BindingDB will retrieve this data
  • relation field holds 3 objects:
    • 2 have all 3 ref_ fields (pmid, doi, url). pmid + doi refer to the same publication
    • 1 has only the url field
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["UniProtKB:Q9NWZ3"],
                    "categories":["biolink:Gene"]
                },
                "n1": {
                    "ids":["PUBCHEM.COMPOUND:90134669"],
                    "categories":["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
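
When checking the responses to the test queries above, a small helper like this Python sketch can pull out just the biolink:publications values per KG edge. It only assumes the standard TRAPI layout (message.knowledge_graph.edges, each edge with an attributes list); the function name is made up, and how you obtain the response dict (for example, by POSTing the query to your local BTE instance) is up to you.

```python
def collect_publications(trapi_response):
    """Map each KG edge id to the values of its biolink:publications attributes."""
    message = trapi_response.get("message") or {}
    knowledge_graph = message.get("knowledge_graph") or {}
    edges = knowledge_graph.get("edges") or {}

    found = {}
    for edge_id, edge in edges.items():
        for attribute in edge.get("attributes") or []:
            if attribute.get("attribute_type_id") != "biolink:publications":
                continue
            value = attribute.get("value")
            # The value should be an array of strings, but tolerate a bare string.
            values = value if isinstance(value, list) else [value]
            found.setdefault(edge_id, []).extend(values)
    return found
```

An edge that shows up more than once in a list here, or with an unexpected prefix spelling, points at the dedup or prefix handling described in the implementation notes.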

@colleenXu
Collaborator Author

colleenXu commented Jul 27, 2023

@rjawesome

  1. On checking / stripping / replacing prefixes: I think that's fine. I was wondering if it'd be any faster to keep the prefix and not strip it out, when it's already in the correct format (exactly correct case/spelling).
  2. On biolink/monarch. Hmm:
  • for now, I've updated the branch's yaml to use ref_pmid and updated the SmartAPI overrides above to include biolink/monarch API
  • I'm not sure what other kinds of references are in their output...we could ask them...or do you have an idea of how we could figure this out?
  • I lean towards the "put into different properties" option since that would give us finer control...
  3. On converting urls to CURIEs: If this is possible, this is a GREAT idea that could help with most of the cases (convert to CURIEs, then remove duplicates). However, I think this is extra functionality (not critical).

Converting urls to CURIEs may also not be possible in all cases:

  • I think cases with DOI are tricky because there are a lot of possible base URLs + sometimes the DOI ID doesn't quite match up with the url (ex: doi is 10.1158/1538-7445.AM2013-DDT02-01 vs url is http://cancerres.aacrjournals.org/content/73/8_Supplement/DDT02-01).
  • MyChem drug_indications clinicaltrial IDs aren't a perfect match to their url. ex: url is https://clinicaltrials.gov/search?id=%22NCT00485888%22 and ID is NCT00485888. The problem is the trailing %22
  • I dunno if we want to handle cases like http vs https or having the www. vs not
Here's some lists of base URLs that we'd want to turn into CURIE prefixes

The stuff after the url should be the ID. These are the exact base urls I've found.

Turn into PMID:

  • http://europepmc.org/abstract/MED/ example
  • https://www.ncbi.nlm.nih.gov/pubmed/ example

Turn into clinicaltrials:

  • https://clinicaltrials.gov/ct2/show/ example

Turn into PMCID:

  • http://europepmc.org/articles/ example
  • https://www.ncbi.nlm.nih.gov/pmc/articles/ example

Turn into doi: this may be too difficult to do since there's a lot of different possible base urls

  • http://onlinelibrary.wiley.com/doi/ example
  • http://www.nejm.org/doi/full/ example
  • https://www.tandfonline.com/doi/abs/ example
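
A rough Python sketch of the url-to-CURIE idea, using only the PMID, clinicaltrials, and PMCID base urls listed above. It is an illustration, not the code in the PRs: the function name is made up, doi urls are deliberately left alone because of the problems described above, and ignoring http vs https and the www. part is a design choice assumed here rather than something we've agreed on.

```python
import re

# Base urls from the lists above, without scheme or "www.", paired with the
# CURIE prefix to use. doi base urls are intentionally not included.
BASE_URLS = [
    ("europepmc.org/abstract/MED/", "PMID"),
    ("ncbi.nlm.nih.gov/pubmed/", "PMID"),
    ("clinicaltrials.gov/ct2/show/", "clinicaltrials"),
    ("europepmc.org/articles/", "PMCID"),
    ("ncbi.nlm.nih.gov/pmc/articles/", "PMCID"),
]


def url_to_curie(url):
    """Return a CURIE for a recognized base url, else the url unchanged."""
    # Drop the scheme and an optional "www." so both variants match.
    stripped = re.sub(r"^https?://(www\.)?", "", url.strip(), flags=re.IGNORECASE)
    for base, prefix in BASE_URLS:
        if stripped.lower().startswith(base.lower()):
            local_id = stripped[len(base):].strip("/")
            # Only convert when what's left looks like a bare ID.
            if local_id and "/" not in local_id and "?" not in local_id:
                return f"{prefix}:{local_id}"
    return url
```

Running this before duplicate removal would let a url and a CURIE for the same PMID / PMCID / clinical trial collapse into one value. Urls that don't start with one of the base urls, such as the clinicaltrials.gov search url mentioned above, come back unchanged.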

@rjawesome
Contributor

See PRs (description of behavior on api-response-transform PR)

@tokebe tokebe added this to the 2023-08-18 Code Freeze milestone Aug 3, 2023
@tokebe tokebe added On CI Related changes are deployed to CI server On Test Related changes are deployed to Test server and removed On CI Related changes are deployed to CI server labels Aug 11, 2023
@colleenXu
Collaborator Author

colleenXu commented Aug 14, 2023

The corresponding SmartAPI updates have been done, and the registrations have been refreshed. NCATS-Tangerine/translator-api-registry#128 This means all instances with this code deployed (dev/ci/test) should begin working with this feature within minutes (after they pull the latest registry info).

This update was needed for this code to work properly. The code isn't backward-compatible, so the old behavior (using the pubmed keyword in response-mapping) wasn't working on the instances that had a deployment with this code.

EDIT: until the code from this issue is deployed on Prod, Prod will have wonkiness with how it handles publication info - since it doesn't have the code to process the new response-mapping keywords. Jackson has already made a post in Translator Slack (general channel) informing the consortium of this.

@colleenXu
Collaborator Author

And info from Aug 9-10th from UI team (Translator slack links):

@tokebe
Member

tokebe commented Aug 23, 2023

@colleenXu can this be closed as completed?

@colleenXu
Collaborator Author

colleenXu commented Aug 24, 2023

Yep let's close this as complete since it's been deployed.

The limitations are:

  • "Converting urls to CURIEs may also not be possible in all cases", listed here
  • the https PMC urls for PFOCR figure urls vs PharmGKB articles with follow-up discussion here and here. Not addressed yet.
  • in some cases, spec says to add urls to the source's entry in the sources part of the TRAPI edge, rather than putting it into the publications edge-attribute. We haven't implemented this at all.
