expanded functionality for `biolink:publications` edge-attribute #677

colleenXu · 2023-07-26T06:29:28Z

The Translator UI is supposed to be able to handle more kinds of "references" (publications) for an edge - not just the PMIDs we provide in the biolink:publications edge-attribute right now. In Translator Slack comms, the UI team has confirmed that they plan to support the specification here.

For now, we don't have to worry about "free-text description"-style references (we don't really have any of these).

And I'll explain the spec below...

Implementation

We'd like to adjust / expand our behavior to match this spec and provide more reference info to users. This means taking the values from (sometimes multiple) fields, replacing/appending proper prefixes, and putting them into 1 edge-attribute.

Here's what's involved:

This is the set of response-mapping keys for fields that we want to use as input for the biolink:publications's value, and how they should be processed:
1. ref_pmid (previously pubmed): we want the output-strings to have the prefix PMID
2. ref_url (previously biolink:source_web_page): no processing needed. The strings are urls
3. ref_pmcid: we want the output-strings to have the prefix PMCID (however, I opened a biolink-model issue because it was confusing which prefix to use: "Confusion on PMC vs PMCID", biolink/biolink-model#1366)
4. ref_clinicaltrials: we want the output-strings to have the prefix clinicaltrials. However, the spec said putting this data in this edge-attribute was temporary / in-flux...
5. ref_doi: we want the output-strings to have the prefix doi (biolink-model spelling ref)
6. ref_isbn: we want the output-strings to have the prefix isbn (biolink-model spelling ref)
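
A minimal sketch of the key-to-prefix mapping listed above, written in Python purely for illustration (the real behavior is whatever the linked PRs implement). The names `REF_PREFIXES` and `format_ref` are hypothetical, and the `PMCID` prefix is an assumption pending the biolink-model issue.

```python
from typing import Optional

# Hypothetical lookup table: response-mapping key -> CURIE prefix.
# None means "no prefix" (ref_url values are already complete urls).
REF_PREFIXES: dict[str, Optional[str]] = {
    "ref_pmid": "PMID",
    "ref_url": None,
    "ref_pmcid": "PMCID",  # assumption: see biolink/biolink-model#1366
    "ref_clinicaltrials": "clinicaltrials",
    "ref_doi": "doi",
    "ref_isbn": "isbn",
}


def format_ref(key: str, value: str) -> str:
    """Prepend the prefix for `key` to a bare (unprefixed) value."""
    prefix = REF_PREFIXES[key]
    return value if prefix is None else f"{prefix}:{value}"
```

This only covers values that arrive without a prefix; values that already carry one need extra handling.
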
I've updated the x-bte annotation to use these special response-mapping keys here. To use these SmartAPI yamls for dev work, put the content below into your local BTE smartapi_overrides file:

SmartAPI overrides

Note: PharmGKB is excluded here because it isn't added to the config file yet. It can be added here if you want, but the stuff listed here should be plenty to test the 6 response-mapping keys above...

{
  "conf": {
    "only_overrides": false
  },
  "apis": {
    "0212611d1c670f9107baf00b77f0889a": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/CTD/smartapi.yaml",
    "1f47552dabd67351d4c625adb0a10d00": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/EBIgene2phenotype/smartapi.yaml",
    "77ed27f111262d0289ed4f4071faa619": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/MGIgene2phenotype/smartapi.yaml",
    "38e9e5169a72aee3659c9ddba956790d": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/bindingdb/smartapi.yaml",
    "e3edd325c76f2992a111b43a907a4870": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/dgidb/openapi.yml",
    "316eab811fd9ef1097df98bcaa9f7361": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/gtrx/gtrx.yaml",
    "dca415f2d792976af9d642b7e73f7a41": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/litvar/smartapi.yaml",
    "8f08d1446e0bb9c2b323713ce83e2bd3": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mychem.info/openapi_full.yml",
    "671b45c0301c8624abbd26ae78449ca2": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mydisease.info/smartapi.yaml",
    "59dce17363dce279d389100834e43648": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mygene.info/openapi_full.yml",
    "09c8782d9f4027712e65b95424adba79": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/myvariant.info/openapi_full.yml",
    "b772ebfbfa536bba37764d7fddb11d6f": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/ncats_rare_source/smartapi.yaml",
    "edeb26858bd27d0322af93e7a9e08761": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/pfocr/smartapi.yaml",
    "03283cc2b21c077be6794e1704b1d230": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/rhea/smartapi.yaml",
    "1d288b3a3caf75d541ffaae3aab386c8": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/semmeddb/smartapi.yaml",
    "d22b657426375a5295e7da8a303b9893": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/biolink/openapi.yml"
  }
}

For each KG edge, we'll want to take these 6 response-mapping keys, process the fields' contents, and stuff those output-strings together into an array of strings that's 1 biolink:publications edge-attribute (many-to-1). It should have this format:

{
    "attribute_type_id": "biolink:publications",
    "value": [
        "PMID:1234",
        "http://hello_world.com",
        "PMCID:PMC1234",
        "clinicaltrials:NCT1234",
        "doi:1234/1234",
        "isbn:1234-1234-1"
    ],
    "value_type_id": "linkml:Uriorcurie"
}
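
The many-to-1 step could be sketched roughly as follows, in Python for illustration only. It assumes the record already holds prefixed strings under each `ref_` key; `REF_KEYS` and `build_publications_attribute` are hypothetical names, and the key order is an arbitrary choice.

```python
from typing import Any

# Order in which the six response-mapping keys are read (an arbitrary choice).
REF_KEYS = [
    "ref_pmid",
    "ref_url",
    "ref_pmcid",
    "ref_clinicaltrials",
    "ref_doi",
    "ref_isbn",
]


def build_publications_attribute(record: dict[str, list[str]]) -> dict[str, Any]:
    """Merge already-prefixed values from all ref_* fields into one attribute."""
    values: list[str] = []
    for key in REF_KEYS:
        # Keys missing from the record simply contribute nothing.
        values.extend(record.get(key, []))
    return {
        "attribute_type_id": "biolink:publications",
        "value": values,
        "value_type_id": "linkml:Uriorcurie",
    }
```

For example, a record with `ref_pmid: ["PMID:1234"]` and `ref_doi: ["doi:1234/1234"]` would yield a single attribute whose value is `["PMID:1234", "doi:1234/1234"]`.
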

Potentially-helpful implementation notes:

we want the code to handle when the field's value is null or an empty string (ignore, don't add to output?)
we want the code to remove duplicates in the value array, after the array has been assembled (and after records are merged into edges...)
sometimes a field's value will already have a prefix (it may or may not be formatted exactly the way it should be for biolink-model) and sometimes it won't. So sometimes we'll be adding a prefix, sometimes replacing it, and sometimes we may not need to do anything (it's already formatted the way we want)
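
The first two notes could look something like this sketch (Python for illustration; `clean_and_dedupe` is a hypothetical name). Trimming surrounding whitespace is my own assumption, not something the spec asks for.

```python
from typing import Iterable, Optional


def clean_and_dedupe(values: Iterable[Optional[str]]) -> list[str]:
    """Drop null/empty values and duplicates, keeping first-seen order."""
    seen: set[str] = set()
    result: list[str] = []
    for value in values:
        if value is None:
            continue
        value = value.strip()  # assumption: whitespace-only counts as empty
        if not value or value in seen:
            continue
        seen.add(value)
        result.append(value)
    return result
```

Running this after records are merged into edges would catch duplicates that come from different records.
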

colleenXu · 2023-07-26T06:44:20Z

Issues we'll want to deal with at some point:

I didn't modify biolink / monarch API but I'll need to (it still uses the pubmed response-mapping key). I think this is tricky to work with because of the special api-response-transform that happens to it. It's not clear to me if the transformed publications field holds only PMIDs or if it can sometimes have non-PMID publications the way the raw API responses do...
is the value_type_id above correct? I'll need to check

Known issue, set-aside and out-of-scope for now:
We won't fully follow the spec because of some limitations with the current x-bte annotation (this may improve with JQ-related processing):

The documentation says we "MUST report only one identifier per publication" and must report the CURIE (not the full url) whenever we can.

But right now, we won't follow this in these cases:

MyChem:
- Chembl drug mechanisms: both the CURIE and the expanded url will be present for PMID / clinicaltrials / doi / PMC. This is because each reference object has a full url, but only some kinds also had the ID, and I included response-mapping keys for both CURIE stuff and url stuff...
- Chembl drug indications: both the CURIE and the expanded url will be present for clinicaltrials. Same reasoning as above
Bindingdb: in some cases, an edge will have a doi + a pmid that actually refer to the same publication (they're in the same relation object of the original API response). I annotated both fields because there are plenty of cases where relationships have only 1 of the fields
PharmGKB (not added to config list yet): ref_url: data.literature._sameAs seems to be the best way to get 1 value per unique reference. However, sometimes this field's value is an expanded url for a PMID/PMCID (clinical guidelines, variant annotation). There shouldn't be any issues with "reporting the same publication more than once"

Finally: the spec says a second biolink:publications edge-attribute can be made when the reference info is a free-text string. I don't think we need to do that in this issue, because I didn't notice any strong examples of these in the SmartAPI yamls...

rjawesome · 2023-07-26T18:32:20Z

> sometimes a field's value will already have a prefix (it may or may not be formatted exactly the way it should be for biolink-model) and sometimes it won't. So sometimes we'll be adding a prefix, sometimes replacing it, and sometimes we may not need to do anything (it's already formatted the way we want)

My current plan is to check if the prefix [with ":" so like "PMID:"] is there (in any casing), and if so, strip the prefix. Then just add the prefix. Are there any other cases that should be handled?
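
A rough sketch of that plan, in Python for illustration only (the actual code is in the PRs and may differ; `normalize_prefix` is a hypothetical name):

```python
def normalize_prefix(value: str, prefix: str) -> str:
    """Strip `prefix:` from the front of `value` (any casing), then re-add it."""
    marker = prefix + ":"
    if value.lower().startswith(marker.lower()):
        value = value[len(marker):]
    return marker + value
```

With this, `pmid:1234`, `PMID:1234`, and `1234` all come out as `PMID:1234`. It does not handle a value carrying some other prefix than the expected one.
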

> I didn't modify biolink / monarch API but I'll need to (it still uses the pubmed response-mapping key). I think this is tricky to work with because of the special api-response-transform that happens to it. It's not clear to me if the transformed publications field holds only PMIDs or if it can sometimes have non-PMID publications the way the raw API responses do

It seems like the current code is attempting to keep only PMIDs (filtering out the rest). But if they are using the same/known prefixes for the other ID types, then we have two options:

- This transformer could be modified to put different ID types into different properties based on prefix (e.g. publicationsPMID, publicationsPMCID, etc.), which could then be referenced in the SmartAPI yaml using ref_pmid, ref_pmcid, etc.
- If the publication IDs from biolink are trusted to be formatted exactly as we want, we could modify the transformer to directly take the raw API publications field and put it into the publications field of the record.
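
The first option could be sketched like this (Python for illustration; not the transformer's actual code). `publicationsPMID` and `publicationsPMCID` follow the naming above, while `publicationsDOI`, `publicationsOther`, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical prefix -> property-name table, following the naming in option 1.
PROPERTY_BY_PREFIX = {
    "PMID": "publicationsPMID",
    "PMCID": "publicationsPMCID",
    "DOI": "publicationsDOI",
}


def split_by_prefix(publications: list[str]) -> dict[str, list[str]]:
    """Group CURIE-like strings by prefix; the rest go to 'publicationsOther'."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for pub in publications:
        prefix, sep, _ = pub.partition(":")
        # Only look up a property when the string actually contains a ":".
        prop = PROPERTY_BY_PREFIX.get(prefix.upper()) if sep else None
        grouped[prop or "publicationsOther"].append(pub)
    return dict(grouped)
```

Each resulting property could then be pointed at by its own `ref_` key, which keeps the prefix handling per ID type.
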

> PharmGKB (not added to config list yet): ref_url: data.literature._sameAs seems to be the best way to get 1 value per unique reference. However, sometimes this field's value is an expanded url for a PMID/PMCID (clinical guidelines, variant annotation). There shouldn't be any issues with "reporting the same publication more than once"

If we wanted to use the CURIEs when possible, we could parse the URL, looking for base URLs that identify a PMID/PMCID (e.g. http://www.ncbi.nlm.nih.gov/pmc/, http://www.ncbi.nlm.nih.gov/pubmed), and then translate it to a PMID/PMCID/etc.

colleenXu · 2023-07-27T00:08:29Z

Remember, you can search the PR / branch for SmartAPI yamls that contain the keywords you're testing

Test for ref_isbn, ref_pmid, ref_url

Query only MyDisease through BTE with the TRAPI query below.
The sub-query to MyDisease will retrieve this data
- phenotype_related_to_disease objects have diff mixes of ref_ fields:
  - HP:0002762 has both a pmid and a website field (representing two different references)
  - many have only the isbn field or only the website field
- ref_ field output has prefixes on it (isbn prefix needs replacing)
Note: ref_isbn is only used in the MyDisease Disease -> PhenotypicFeature operations (response-mapping here), so that's what this test is based on.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["OMIM:127300"],
                    "categories":["biolink:Disease"]
                },
                "n1": {
                    "categories":["biolink:PhenotypicFeature"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

Test for ref_doi, ref_clinicaltrials (and ref_pmid and ref_url)

This test is based on MyChem's drug-mechanism operations (response-mapping here and here)
Query only MyChem through BTE with the TRAPI query below.
The sub-query to MyChem will retrieve this data
drug_mechanisms objects have diff mixes of ref_ fields:
- the target_components.uniprotP10721 has mechanism_refs for doi, clinicaltrials, pmid, and url
notice that for this KP, no ref_ field output has prefixes
Note that there are other operations with keywords ref_doi and ref_clinicaltrials

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["CHEMBL.COMPOUND:CHEMBL1278146"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Gene"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:interacts_with"]
                }
            }
        }
    }
}

Test for ref_pmcid (and ref_url)

This test is based on MyChem's drug-mechanism operations (response-mapping here and here)
Query only MyChem through BTE with the TRAPI query below.
The sub-query to MyChem will retrieve this data
the target_components.uniprotO14763 has mechanism_refs for pmc and url (the object that's type: "Other")
Note that there are other operations that use the keyword ref_pmcid

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["CHEMBL.COMPOUND:CHEMBL1743036"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Gene"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:interacts_with"]
                }
            }
        }
    }
}

Test using bindingdb: ref_pmid, ref_url, ref_doi

This test is based on BindingDB operations (response-mapping here)
Query only BindingDB through BTE with the TRAPI query below.
The sub-query to BindingDB will retrieve this data
relation field holds 3 objects:
- 2 have all 3 ref_ fields (pmid, doi, url). pmid + doi refer to the same publication
- 1 has only the url field

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["UniProtKB:Q9NWZ3"],
                    "categories":["biolink:Gene"]
                },
                "n1": {
                    "ids":["PUBCHEM.COMPOUND:90134669"],
                    "categories":["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

colleenXu · 2023-07-27T00:48:50Z

@rjawesome

On checking / stripping / replacing prefixes: I think that's fine. I was wondering if it'd be any faster to keep the prefix and not strip it out, when it's already in the correct format (exactly correct case/spelling).
On biolink/monarch. Hmm:

for now, I've updated the branch's yaml to use ref_pmid and updated the SmartAPI overrides above to include biolink/monarch API
I'm not sure what other kinds of references are in their output...we could ask them...or do you have an idea of how we could figure this out?
- I have a note saying there was WormBase:WBPaper. But I don't think we'd recognize this in Translator...
I lean towards the "put into different properties" since that would give us finer control...

On converting urls to CURIEs: If this is possible, this is a GREAT idea that could help with most of the cases (convert to CURIEs, then remove duplicates). However, I think this is extra functionality (not critical).

Converting urls to CURIEs may also not be possible in all cases:

I think cases with DOI are tricky because there are a lot of possible base URLs + sometimes the DOI ID doesn't quite match up with the url (ex: doi is 10.1158/1538-7445.AM2013-DDT02-01 vs url is http://cancerres.aacrjournals.org/content/73/8_Supplement/DDT02-01).
MyChem drug_indications clinicaltrial IDs aren't a perfect match to their url. ex: url is https://clinicaltrials.gov/search?id=%22NCT00485888%22 and ID is NCT00485888. The problem is the trailing %22
I dunno if we want to handle cases like http vs https or having the www. vs not
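
For the clinicaltrials search-url case, one possible approach is to percent-decode the query string and strip the quotes. A minimal Python sketch follows (illustration only; `nct_id_from_search_url` is a hypothetical name, and it assumes the ID always sits in the `id` query parameter, as in the example url).

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def nct_id_from_search_url(url: str) -> Optional[str]:
    """Pull the trial ID out of a clinicaltrials.gov search url's `id` parameter."""
    # parse_qs percent-decodes the value, so %22 becomes a double quote.
    query = parse_qs(urlparse(url).query)
    ids = query.get("id")
    if not ids:
        return None
    trial_id = ids[0].strip('"')
    return trial_id or None
```

Applied to the example url above, this should give back `NCT00485888`, which could then be compared against the ID field to spot duplicates.
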

Here's some lists of base URLs that we'd want to turn into CURIE prefixes

The stuff after the url should be the ID. These are the exact base urls I've found.

Turn into PMID:

http://europepmc.org/abstract/MED/ example
https://www.ncbi.nlm.nih.gov/pubmed/ example

Turn into clinicaltrials:

https://clinicaltrials.gov/ct2/show/ example

Turn into PMCID:

http://europepmc.org/articles/ example
https://www.ncbi.nlm.nih.gov/pmc/articles/ example

Turn into doi: this may be too difficult to do since there's a lot of different possible base urls

http://onlinelibrary.wiley.com/doi/ example
http://www.nejm.org/doi/full/ example
https://www.tandfonline.com/doi/abs/ example
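
A table-driven sketch of the conversion, using only the base urls listed above (Python for illustration; `BASE_URL_PREFIXES` and `url_to_curie` are hypothetical names). It assumes http/https and www./no-www. should be treated the same, and it leaves doi out because of the many possible base urls.

```python
import re
from typing import Optional

# Base urls copied from the lists above (scheme and "www." removed), mapped to
# the CURIE prefix they should become. doi is left out: too many base urls.
BASE_URL_PREFIXES = {
    "europepmc.org/abstract/MED/": "PMID",
    "ncbi.nlm.nih.gov/pubmed/": "PMID",
    "clinicaltrials.gov/ct2/show/": "clinicaltrials",
    "europepmc.org/articles/": "PMCID",
    "ncbi.nlm.nih.gov/pmc/articles/": "PMCID",
}


def url_to_curie(url: str) -> Optional[str]:
    """Return a CURIE if `url` starts with a known base url, else None."""
    # Assumption: treat http/https and www./no-www. as equivalent.
    bare = re.sub(r"^https?://(www\.)?", "", url.strip(), flags=re.IGNORECASE)
    for base, prefix in BASE_URL_PREFIXES.items():
        if bare.startswith(base):
            # Everything after the base url is taken to be the ID.
            identifier = bare[len(base):].strip("/")
            return f"{prefix}:{identifier}" if identifier else None
    return None
```

Since everything after the base url is treated as the ID, urls with extra path segments or query strings would need additional handling. Urls that return None would stay in the output as plain urls.
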

rjawesome · 2023-07-28T20:59:31Z

See PRs (description of behavior on api-response-transform PR)

colleenXu · 2023-08-14T23:18:50Z

The corresponding SmartAPI updates have been done, and the registrations have been refreshed (NCATS-Tangerine/translator-api-registry#128). This means all instances with this code deployed (dev/ci/test) should begin working with this feature within minutes (after they pull the latest registry info).

This update was needed for this code to work properly. The code isn't backward-compatible, so the old behavior (using the pubmed keyword in response-mapping) wasn't working on the instances that had a deployment with this code.

EDIT: until the code from this issue is deployed on Prod, Prod will have wonkiness with how it handles publication info - since it doesn't have the code to process the new response-mapping keywords. Jackson has already made a post in Translator Slack (general channel) informing the consortium of this.

colleenXu · 2023-08-14T23:37:00Z

And info from Aug 9-10th from UI team (Translator slack links):

they will support this enhanced publication info
they probably won't support isbn. In that case, their UI will silently ignore it when it's there, so we don't need to remove it for now

tokebe · 2023-08-23T16:14:11Z

@colleenXu can this be closed as completed?

colleenXu · 2023-08-24T08:04:19Z

Yep let's close this as complete since it's been deployed.

The limitations are:

"Converting urls to CURIEs may also not be possible in all cases", listed here
the https PMC urls for PFOCR figure urls vs PharmGKB articles with follow-up discussion here and here. Not addressed yet.
in some cases, spec says to add urls to the source's entry in the sources part of the TRAPI edge, rather than putting it into the publications edge-attribute. We haven't implemented this at all.

colleenXu added the "enhancement" (new feature or request) label Jul 26, 2023

colleenXu mentioned this issue Jul 26, 2023

Publication keywords NCATS-Tangerine/translator-api-registry#128

Merged

rjawesome mentioned this issue Jul 28, 2023

Publication refactor biothings/api-respone-transform.js#53

Merged

rjawesome self-assigned this Jul 28, 2023

rjawesome mentioned this issue Jul 28, 2023

change value_type_id for publications biothings/bte_trapi_query_graph_handler#162

Merged

tokebe added this to the 2023-08-18 Code Freeze milestone Aug 3, 2023

tokebe added the "On CI" (related changes are deployed to CI server) and "On Test" (related changes are deployed to Test server) labels and removed the "On CI" label Aug 11, 2023

colleenXu mentioned this issue Aug 16, 2023

SuppKG API YAML NCATS-Tangerine/translator-api-registry#122

Merged

colleenXu closed this as completed Aug 24, 2023

colleenXu mentioned this issue Mar 26, 2024

TRAPI 1.5: support source_record_urls #803

Closed

colleenXu mentioned this issue Aug 28, 2024

semmed sentence edge attributes biothings/api-respone-transform.js#68

Merged
