Publication refactor #53

Merged
merged 8 commits into from
Aug 11, 2023

rjawesome
Contributor

@rjawesome rjawesome commented Jul 28, 2023

biothings/biothings_explorer#677. Handles different types of publications. If a value starts with one of the base URLs listed for a prefix, it is turned into a CURIE by stripping the base URL and then prepending that prefix.

For publication strings that already carry a CURIE prefix, the code checks whether the prefix (including the ":", e.g. "PMID:") is present in any casing and, if so, strips it. The canonical prefix is then added back to the beginning of the string.

Current info that is being used for the prefixes:

  • {prop: "ref_pmid", prefix: "PMID:", urls: ["http://www.ncbi.nlm.nih.gov/pubmed/", "http://europepmc.org/abstract/MED/"]},
  • {prop: "ref_pmcid", prefix: "PMCID:", urls: ["http://www.ncbi.nlm.nih.gov/pmc/articles/", "http://europepmc.org/articles/"]},
  • {prop: "ref_clinicaltrials", prefix: "clinicaltrials:", urls: ["https://clinicaltrials.gov/ct2/show/"]},
  • {prop: "ref_doi", prefix: "doi:", urls: ["https://doi.org/", "http://www.nejm.org/doi/full/", "https://www.tandfonline.com/doi/abs/", "http://onlinelibrary.wiley.com/doi/"]},
  • {prop: "ref_isbn", prefix: "isbn:", urls: ["https://www.isbn-international.org/identifier/"]}
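The two rules described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the table entries are copied from the list, while `PublicationType`, `startsWithIgnoreCase`, and `normalizePublication` are hypothetical names. Prepending the prefix to a value that matches neither rule is one reading of the description and should be treated as an assumption.

```typescript
// Sketch only: table entries come from the list above; all names are hypothetical.
interface PublicationType {
  prop: string;
  prefix: string;
  urls: string[];
}

const PUBLICATION_TYPES: PublicationType[] = [
  { prop: "ref_pmid", prefix: "PMID:", urls: ["http://www.ncbi.nlm.nih.gov/pubmed/", "http://europepmc.org/abstract/MED/"] },
  { prop: "ref_pmcid", prefix: "PMCID:", urls: ["http://www.ncbi.nlm.nih.gov/pmc/articles/", "http://europepmc.org/articles/"] },
  { prop: "ref_clinicaltrials", prefix: "clinicaltrials:", urls: ["https://clinicaltrials.gov/ct2/show/"] },
  { prop: "ref_doi", prefix: "doi:", urls: ["https://doi.org/", "http://www.nejm.org/doi/full/", "https://www.tandfonline.com/doi/abs/", "http://onlinelibrary.wiley.com/doi/"] },
  { prop: "ref_isbn", prefix: "isbn:", urls: ["https://www.isbn-international.org/identifier/"] },
];

// Case-insensitive "starts with" check, used for prefixes such as "pmid:" vs "PMID:".
function startsWithIgnoreCase(value: string, start: string): boolean {
  return value.slice(0, start.length).toLowerCase() === start.toLowerCase();
}

// Normalize one publication string for the given response-mapping property.
function normalizePublication(prop: string, value: string): string {
  for (const entry of PUBLICATION_TYPES) {
    if (entry.prop !== prop) {
      continue;
    }
    const trimmed: string = value.trim();
    // Rule 1: a known base URL is stripped and replaced with the prefix.
    for (const url of entry.urls) {
      if (trimmed.slice(0, url.length) === url) {
        return entry.prefix + trimmed.slice(url.length);
      }
    }
    // Rule 2: strip an existing prefix in any casing, then add the canonical one.
    const id: string = startsWithIgnoreCase(trimmed, entry.prefix)
      ? trimmed.slice(entry.prefix.length)
      : trimmed;
    return entry.prefix + id;
  }
  // Unknown property: leave the value untouched (assumption).
  return value;
}
```

For example, `normalizePublication("ref_pmid", "pmid:15629191")` and `normalizePublication("ref_pmid", "http://www.ncbi.nlm.nih.gov/pubmed/15629191")` both yield `PMID:15629191` in this sketch.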

@colleenXu
Contributor

colleenXu commented Aug 10, 2023

(MY LAST POST HERE IS A TL;DR, MAYBE READ THAT FIRST)


Hmmm, we are producing the desired behavior for PFOCR (mix of PMCID CURIEs for articles and urls for figures). But I'm a little concerned:

  • the figure urls start with https://www.ncbi.nlm.nih.gov/pmc/articles/, which is the https version of what we do check for and replace with the CURIE prefix...
  • I found one future case where we'd want to recognize the https version and turn it into a CURIE: PharmGKB (has url field output like https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3994233). I'm not including it in my list of base urls to add in the next comment...because I don't want to mess up what is working right now for PFOCR (which BTE does currently use).

Example of what BTE currently generates for PFOCR, which is what I wanted...:

                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2577715/bin/fig-15.jpg",
                                "PMCID:PMC2577715",
                                "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994616/bin/JCI0731809.f2.jpg",
                                "PMCID:PMC1994616",
                                "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218933/bin/bcr2876-2.jpg",
                                "PMCID:PMC3218933",
                                "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3669657/bin/nihms461261f6.jpg",
                                "PMCID:PMC3669657",
                                "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5962346/bin/nihms923919f1.jpg",
                                "PMCID:PMC5962346"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        }

@colleenXu
Contributor

colleenXu commented Aug 10, 2023

@rjawesome

Can you ADD some base urls for replacing with prefix?

  • clinicaltrials: https://www.clinicaltrials.gov/ct2/show/.
    In my test query for "Test for ref_doi, ref_clinicaltrials (and ref_pmid and ref_url)", I'm seeing a url that is equivalent to clinicaltrials:NCT02390947, which is already present in that edge-attribute.
  • PMID: https://www.ncbi.nlm.nih.gov/pubmed/. (found during some PharmGKB testing)

@colleenXu
Contributor

colleenXu commented Aug 10, 2023

@rjawesome Also is it possible to do a quick-fix for CTD processing, to handle when the data.PubMedIds field returns pipe-delimited text? And put it into this feature for now?

(If you can't, it's okay. I can remove the ref_pmid annotation for now...)

I know you handled it in the JQ-branches (discussed here)...but I'm not confident this will get in, in time for the upcoming "freeze".

@colleenXu
Contributor

Okay, done testing. Overall, it seems to be working well!

To summarize my posts above:

@rjawesome
Contributor Author

rjawesome commented Aug 10, 2023

> Also is it possible to do a quick-fix for CTD processing, to handle when the data.PubMedIds field returns pipe-delimited text? And put it into this feature for now?

From what I understand it's already handled. The CTD transformer converts pipe delimited to array, and the publication code is designed to handle an array if it is passed in. Let me know if there are any issues with this.
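A rough sketch of the pipe-delimited-to-array step mentioned here, under the assumption that the input may be either a single string or an array of strings. `splitPipeDelimited` and `toPmidCuries` are hypothetical helpers, not the CTD transformer's actual functions.

```typescript
// Hypothetical helper: split "a|b|c" (or an array of such strings) into separate IDs.
function splitPipeDelimited(value: string | string[]): string[] {
  const items: string[] = typeof value === "string" ? [value] : value;
  const ids: string[] = [];
  for (const item of items) {
    for (const piece of item.split("|")) {
      const trimmed: string = piece.trim();
      // Skip empty pieces such as those produced by a trailing "|".
      if (trimmed.length > 0) {
        ids.push(trimmed);
      }
    }
  }
  return ids;
}

// Hypothetical helper: turn the separated IDs into PMID CURIEs.
function toPmidCuries(value: string | string[]): string[] {
  return splitPipeDelimited(value).map((id: string): string => "PMID:" + id);
}
```

With this sketch, `toPmidCuries("15629191|19419996")` produces two separate CURIEs, `PMID:15629191` and `PMID:19419996`, rather than one CURIE containing a pipe.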

> Can you ADD some base urls for replacing with prefix?

Done.

> Hmmm, we are producing the desired behavior for PFOCR (mix of PMCID CURIEs for articles and urls for figures). But I'm a little concerned:

There's probably a few ways to go about this:

  1. I originally thought to look for "/" after the URL, but it seems that DOI IDs contain these
  2. look for image extensions and then exclude them
  3. maybe write a regex/desired format for each ID and make sure the part after the URL matches
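Option 3 could look roughly like the sketch below for the PMC case. It is only an illustration: the https base URL is the one raised in the PFOCR/PharmGKB concern (it is not in the current list), the `PMC` + digits pattern is an assumed ID format, and `pmcUrlToCurie` is a hypothetical name.

```typescript
// Assumption: https variant of the PMC base URL discussed in this thread.
const HTTPS_PMC_BASE: string = "https://www.ncbi.nlm.nih.gov/pmc/articles/";
// Assumption: a PMC ID is "PMC" followed only by digits.
const PMCID_PATTERN: RegExp = /^PMC\d+$/;

// Convert a PMC article URL to a CURIE only when the remainder is a bare PMC ID.
function pmcUrlToCurie(url: string): string {
  if (url.slice(0, HTTPS_PMC_BASE.length) !== HTTPS_PMC_BASE) {
    return url;
  }
  let rest: string = url.slice(HTTPS_PMC_BASE.length);
  // Tolerate a single trailing slash after the ID.
  if (rest.charAt(rest.length - 1) === "/") {
    rest = rest.slice(0, rest.length - 1);
  }
  // Figure URLs such as ".../PMC2577715/bin/fig-15.jpg" fail the pattern and stay as URLs.
  return PMCID_PATTERN.test(rest) ? "PMCID:" + rest : url;
}
```

Under these assumptions the PharmGKB-style URL `https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3994233` becomes `PMCID:PMC3994233`, while the PFOCR figure URLs are returned unchanged.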

EDIT:
The issue with CTD might be because of the ID vs Id casing (i.e. PubMedIds vs PubMedIDs). I can probably edit the transformer to handle both of these cases.
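One way to handle both spellings is a case-insensitive key lookup, sketched below. `getPubMedIds` is a hypothetical helper and the record shape is assumed from the raw CTD responses quoted in this thread; the actual transformer change may differ.

```typescript
// Hypothetical helper: find the PubMed ID field regardless of "Ids" vs "IDs" casing.
function getPubMedIds(record: { [key: string]: unknown }): string | undefined {
  for (const key of Object.keys(record)) {
    if (key.toLowerCase() === "pubmedids") {
      const value: unknown = record[key];
      // Only string values are handled in this sketch.
      if (typeof value === "string") {
        return value;
      }
    }
  }
  return undefined;
}
```

Both `{ PubMedIds: "15629191|19419996" }` and `{ PubMedIDs: "15629191|19419996" }` return the same string here, so a later split on "|" would see either spelling.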

@colleenXu
Contributor

colleenXu commented Aug 11, 2023

@rjawesome

On the https PMC urls for PFOCR figure urls vs PharmGKB articles:

  • I imagine the "look for image extensions" or "regex" may be the solution. Regex in particular may also help with the clinicaltrials urls https://clinicaltrials.gov/search?id=%22NCT00485888%22 which we currently don't replace with the CURIE (since the ID is between the two %22 rather than at the end). EDIT: the clinicaltrials bit was noted here for MyChem drug_indications
    • However, I think this is some work for not much gain at the moment. It'll be more important later when we want to add PharmGKB to BTE's config list (noted in the PharmGKB issue here)
    • So if you don't have the time to get this done today / tomorrow / early next week, I think it's okay to drop this topic until later...
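For the clinicaltrials case, a regex that searches anywhere in the URL (rather than only at the end) might look like the sketch below. It assumes an NCT identifier is "NCT" followed by exactly eight digits; `clinicalTrialsUrlToCurie` is a hypothetical name and this is not part of the PR.

```typescript
// Matches http(s) URLs on clinicaltrials.gov, with or without "www.".
const CT_HOST_PATTERN: RegExp = /^https?:\/\/(www\.)?clinicaltrials\.gov\//i;
// Assumption: an NCT ID is "NCT" plus exactly eight digits.
const NCT_PATTERN: RegExp = /NCT\d{8}(?!\d)/;

// Pull the NCT ID out of any clinicaltrials.gov URL, wherever it appears.
function clinicalTrialsUrlToCurie(url: string): string {
  if (!CT_HOST_PATTERN.test(url)) {
    return url;
  }
  // The ID is matched directly, so the surrounding %22 needs no decoding.
  const match: RegExpMatchArray | null = url.match(NCT_PATTERN);
  return match !== null ? "clinicaltrials:" + match[0] : url;
}
```

In this sketch `https://clinicaltrials.gov/search?id=%22NCT00485888%22` maps to `clinicaltrials:NCT00485888`, and URLs on other hosts are left as they are.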

@colleenXu
Contributor

colleenXu commented Aug 11, 2023

On adding base urls to replace with prefix:

After retesting, it looks like this has been successfully addressed. So the basic requirements of this feature have been met.

@tokebe I think you could deploy this to Test when you're ready. However, you could also choose to wait until @rjawesome responds on the CTD issue below...

@colleenXu
Contributor

colleenXu commented Aug 11, 2023

@rjawesome

On quick fix for CTD pmids:

  • this is definitely happening, see my examples below. It is happening with operations that have data.PubMedIds fields and not data.PubMedIDs (used for other operations like the disease2chemical and disease2gene ones)
  • I would like this addressed if possible, before the code freeze. But if not, let me know and I can adjust the x-bte annotation (remove the response-mappings for pmid fields).

Example query for operation chemical2gene

This operation has the field spelled data.PubMedIds in its response.

Send this query to CTD thru BTE:

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MESH:C006303"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}

The equivalent sub-query in BTE is the GET request http://ctdbase.org/tools/batchQuery.go?inputType=chem&inputTerms=C006303&inputTermSearchType=directAssociations&report=genes_curated&format=json

This part of the raw response is turned into the problematic edge in BTE:

    {
        "CasRN": "52583-41-2",
        "ChemicalId": "C006303",
        "ChemicalName": "acivicin",
        "GeneId": "2678",
        "GeneSymbol": "GGT1",
        "Input": "c006303",
        "Organism": "Bos taurus",
        "OrganismId": "9913",
        "PubMedIds": "15629191"
    },
    {
        "CasRN": "52583-41-2",
        "ChemicalId": "C006303",
        "ChemicalName": "acivicin",
        "GeneId": "2678",
        "GeneSymbol": "GGT1",
        "Input": "c006303",
        "Organism": "Mus musculus",
        "OrganismId": "10090",
        "PubMedIds": "15629191|19419996"
    },
    {
        "CasRN": "52583-41-2",
        "ChemicalId": "C006303",
        "ChemicalName": "acivicin",
        "GeneId": "2678",
        "GeneSymbol": "GGT1",
        "Input": "c006303",
        "PubMedIds": "1673268|1678558|19419996"
    }

The problematic edge with pipe-delimited strings in the biolink:publications edge-attribute:

                "006f724ebdc9389dc3429ffb181e2150": {
                    "predicate": "biolink:related_to",
                    "subject": "PUBCHEM.COMPOUND:294641",
                    "object": "NCBIGene:2678",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:15629191",
                                "PMID:15629191|19419996",
                                "PMID:1673268|1678558|19419996"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        }
                    ],

Example query for operation gene2chemical

This operation also has the field spelled data.PubMedIds in its response.

Send this query to CTD thru BTE:

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["NCBIGene:10800"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories": ["biolink:SmallMolecule"]
                }
            }
        }
    }
}

The equivalent sub-query in BTE is the GET request http://ctdbase.org/tools/batchQuery.go?inputType=chem&inputTerms=C006303&inputTermSearchType=directAssociations&report=genes_curated&format=json

BTE is creating responses like these, which are basically the pipe-delimited strings from the raw GET request's response with PMID: added to the front:

                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:16630147|17641958|18379861|19767084|20485159"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        }

                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:18651869|9797038",
                                "PMID:10825389|10851239|11093801|14718577|18651869",
                                "PMID:15888248"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        }

@rjawesome
Contributor Author

Done.
