Standard/conventions for representing supporting publications #398

Closed
mbrush opened this issue Feb 28, 2023 · 11 comments

@mbrush
Collaborator

mbrush commented Feb 28, 2023

Testing efforts have identified variability w.r.t. how publications and other documents are represented as support for Edges and Study Result objects in TRAPI messages. The Biolink Model provides two edge properties (supporting_document and its child publications, which are the topic of the Biolink ticket here), as well as a Publication class.

Here, I would like to finalize and document conventions for how these elements are applied in Attributes in TRAPI messages to structure this type of information.

Questions/Issues:

  1. When to use these different properties? (ignore for now - as we are voting here on whether to have two properties at all)

  2. How to represent the value of these properties?
    a. Current convention is to use established document identifiers such as pmids, pmc ids, dois, etc. Should we set an order of preference here, or a requirement to use one over another (e.g. if a pub has a PMC id and PMID, always use the PMID)? Do we have a resource that maps between different types of identifiers for documents/publications?

  3. In cases where there are multiple documents/pubs supporting a given Statement or Study Result, should these be captured as a list/array of document identifiers in the Attribute.value field, or one at a time in separate Attribute objects?
    a. It was decided earlier that separate attributes were best - as this would allow for additional metadata about each publication to be captured in other Attribute fields (e.g. url, description/title, etc). But it wasn't clear if this was a requirement or recommendation?
    b. In practice, most KPs are providing lists of pmids in a single Attribute - esp in cases where there are very many supporting pubs (can be tens to hundreds in some cases). But the format/syntax used to represent such lists is variable (e.g. a single string vs a formal array of object references? if a string, use of comma vs pipe delimiters?)

  4. Where/how should we capture additional metadata about a publication (e.g. title, year, url, MeSH terms, journal, authors, etc)?
    a. Some of the use cases for applying such additional publication info are outlined in the ticket here - most relate to supporting O&O for ordering results, and which pubs to show first for a given result.
    b. Some of this info could be captured using Attribute fields such as value_url or description (see json example below) - but this doesn't work if we allow multiple pubs as a list in a single Attribute object.
    c. But TMKP already provides a publication metadata API that provides the types of info listed here - so representing any of this in the publications Attribute seems duplicative.
    d. @bill-baumgartner I assume here that Publication/Documents are represented as instances of Biolink:Publication, which has a collection of node properties for describing things like title, authors, etc?

{
  "attribute_type_id": "biolink:publications",    
  "value":  "PMID:32961480",                                  
  "value_type_id": "biolink:Publication",    
  "value_url":  "https://pubmed.ncbi.nlm.nih.gov/32961480/",
  "description": "Review article, 'Discovery of novel liver X receptor inverse agonists as …'",
  "attribute_source": "infores:molecular-data-provider"
}
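
On question 2a (order of preference among identifier types), a minimal sketch of how a KP might pick one identifier when a pub has several. The order PMID > PMC > DOI is only an assumption taken from the "always use the PMID" example above; no order has been agreed on in this ticket, and the names `PREFERRED_PREFIXES` and `pick_preferred_id` are hypothetical.

```python
# Hypothetical preference order; the thread has not fixed one.
PREFERRED_PREFIXES = ["PMID", "PMC", "DOI"]


def pick_preferred_id(curies):
    """Return the CURIE whose prefix ranks highest in PREFERRED_PREFIXES.

    Unknown prefixes rank last. Ties keep the first one seen.
    Raises ValueError if `curies` is empty.
    """
    def rank(curie):
        prefix = curie.split(":", 1)[0].upper()
        if prefix in PREFERRED_PREFIXES:
            return PREFERRED_PREFIXES.index(prefix)
        return len(PREFERRED_PREFIXES)

    return min(curies, key=rank)


if __name__ == "__main__":
    print(pick_preferred_id(["PMC:6815562", "PMID:31737390"]))
```

This only chooses among identifiers that are already known for a pub; mapping between identifier types (pmc - pmid - doi) would still need a separate resource.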
@mbrush
Collaborator Author

mbrush commented Feb 28, 2023

Proposal:

Given the state of affairs described above (i.e. the current preference of KPs to capture lists of pub ids in a single Attribute object, and the availability of TMKP's publication metadata endpoint to provision publication info on demand), the simplest path forward may be to:

  • Establish the current practice of capturing multiple supporting publication ids as a list in the value field of a single Attribute object, as the required standard/convention. e.g.
{
  "attribute_type_id": "biolink:publications",    
  "value":   ["PMID:31737390", "PMC:6815562"] ,                                  
  "value_type_id": "biolink:Publication",    
  "attribute_source": "infores:text-mining-provider-targeted"
}
  • Define a specific syntax for capturing these lists, as an array of Publication object references:
  "value":  ["PMID:31737390", "PMC:6815562"]   #  formal array of Publication object references
    vs.
  "value":  "PMID:31737390|PMC:6815562"         #  single string, pipe delimiters
  "value":  "PMID:31737390, PMC:6815562"        #  single string, comma delimiters
  • Any standard / general metadata about a pub (title, year, url, MeSH terms, journal, authors, ...) can be looked up using a publication id from the TMKP publication metadata service as needed - so no need to duplicate this in the publications Attribute object.
    • make sure that all pub identifier types in the data can be used to lookup pub metadata in the TMKP service (e.g. may require a mapping/translation across pub identifier types, pmc - pmid - doi - . . . )
  • Any additional / specific info about a pub that a KP wants to return that is not in TMKP's metadata service can be put into the description or value_url field of the Attribute object as desired by the KP (but note that there will be no structured way to link this info to a specific pub in the list).
  • For the specific use case of text-mined edges, we consider the standard for capturing and linking source sentence text and other NLP metadata to its source pub in #399.
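
For KPs or consumers that currently see the string forms, here is a minimal sketch of coercing any of the three syntaxes shown above into the proposed array form. The function name `normalize_publication_value` is hypothetical, and the sketch assumes the values are identifiers (CURIEs or URLs) that do not themselves contain commas or pipes; free-text citations would be split incorrectly.

```python
import re


def normalize_publication_value(value):
    """Coerce a publications Attribute value into a list of identifier strings.

    Accepts an array, a pipe-delimited string, or a comma-delimited string.
    Assumes identifiers contain no commas or pipes themselves.
    """
    if isinstance(value, str):
        parts = re.split(r"[|,]", value)
    else:
        parts = list(value)
    # Trim whitespace and drop empty entries (e.g. from a trailing delimiter).
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    print(normalize_publication_value("PMID:31737390|PMC:6815562"))
    print(normalize_publication_value("PMID:31737390, PMC:6815562"))
    print(normalize_publication_value(["PMID:31737390", "PMC:6815562"]))
```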

@mbrush
Collaborator Author

mbrush commented Mar 2, 2023

Related issue: how to represent Clinical Trials that support an Edge/Statement. At present, some KPs are using the publications slot to capture NCT clinical trial ids - but these ids represent trials, not publications:

image
(note that this KP is actually using the biolink:Publication class as the attribute_type_id instead of the publications slot).

We need to decide if we want to allow this (which means considering something like NCT00494715 to be a document or publication describing a clinical trial, rather than the clinical trial itself), or define a different model/pattern to link an edge to a supporting clinical trial.

To me, these NCT records are descriptions of a clinical study/trial that was performed and the results that were generated - and the most accurate and useful representation would be to use a property like supporting_studies here, and treat the NCT record as an instance of a Study instead of a Publication. This would also allow us to directly capture information about the study like its dates, status, size, methods, design, site, etc. We would just need to think about how it would work to represent Studies as nodes in a KG, and flesh out our modeling of 'Study' in Biolink a bit. Here, we could work with the Multiomics Team - who is working on a Clinical Trials KP where I think this info would live. An Edge supported by the results of a clinical trial would reference the identifier/curie for this study as a supporting_study, and more info about this study could be retrieved/looked up from this Clinical Trials KP, just like the paradigm for Pubs with the Publication Metadata API.

I think it would be quite analogous to the publications modeling pattern we are putting in place, but with a supporting_studies property pointing at a Study instance, instead of a publications property pointing to a Publication instance. In both cases, we can represent the Pub or Study as a node in a graph in which we can capture its characteristics. The TMKP Team's Publication Metadata KP is working on this for publications, and I believe the Multiomics Team is working on a Clinical Trials KP that would do the same for trials.

That said, an argument could be made that NCT records are descriptions of clinical studies and their results, just like most journal articles are descriptions of studies that were performed and the results they generated. But treating an NCT record like a publication makes it harder to capture the rich details of the trial itself - and if this is important, we should represent them as Studies, not Publications.

@edeutsch
Collaborator

Is it the mere existence of the clinical trial that is providing the evidence or is it the published outcome/report of the clinical trial that is providing the evidence? I would imagine the latter? Perhaps it is the final report (I assume there always is one? I don't actually know) from the trial that should be cited as a publication? Perhaps these identifiers identify studies, yes, but most crucially identify the published outcome?

@mbrush
Collaborator Author

mbrush commented Apr 25, 2023

There are also considerations about the form/syntax of the value for a CT.gov record. Currently referenced as a string representing the identifier, e.g. NCT00222573. Should we require a CURIE or URL form of this be captured? Bioregistry.io already defines one we could use, and an expansion. See https://bioregistry.io/registry/clinicaltrials.

Assuming this means we can require CURIE/URL-based representation of clinical trials, consider the concrete proposals below for how we reference supporting clinical trials. Examples below show supporting clinical trials alongside references to publications referenced as CURIEs/URLs vs free text, per the decision to split these into separate Attribute objects here.

Option 1: Consider clinical trial record ids to be Publications, and capture them using the publications edge property.

# Attribute containing pubs referenced by CURIE or URL (which here include NCT records)
{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",
           "PMID:6815562",     
           "http://info.gov.hk/gia/general/201011/02/P201011020204.htm",
           "clinicaltrials:NCT00222573",
           "clinicaltrials:NCT00503152"
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:hmdb"
}, 

# Attribute containing pubs referenced as free text
{
  "attribute_type_id": "biolink:publications",           
  "value": [
            "Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids",  
            "Toranosuke Saito, Takashi Ishibashi, Tomoharu Shiozaki, Tetsuo Shiraishi, 'Developer for pressure-sensitive recording sheets, aqueous dispersion of the developer and method for preparing the developer.' U.S. Patent US5118443, issued September, 1986.: http://www.google.ca/patents/US5118443"
           ],                                  
  "value_type_id": "string",    
  "attribute_source": "infores:hmdb"
}

Option 2: Consider clinical trial record ids to represent the study/trial itself, and capture using a supporting_studies edge property.

# Attribute containing pubs referenced by CURIE or URL (which here do not include clinical trial records)
{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",
           "PMID:6815562",     
           "http://info.gov.hk/gia/general/201011/02/P201011020204.htm"
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:hmdb"
}, 

# Attribute containing pubs referenced as free text
{
  "attribute_type_id": "biolink:publications",           
  "value": [
            "Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids",  
            "Toranosuke Saito, Takashi Ishibashi, Tomoharu Shiozaki, Tetsuo Shiraishi, 'Developer for pressure-sensitive recording sheets, aqueous dispersion of the developer and method for preparing the developer.' U.S. Patent US5118443, issued September, 1986.: http://www.google.ca/patents/US5118443"
           ],                                  
  "value_type_id": "string",    
  "attribute_source": "infores:hmdb"
},

# Attribute containing supporting clinical trials
{
  "attribute_type_id": "biolink:supporting_studies",           
  "value": [
           "clinicaltrials:NCT00222573",
           "clinicaltrials:NCT00503152"
           ],                                  
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:hmdb"
}

@edeutsch
Collaborator

I don't have a strong opinion, but I think I'm liking option 1. It depends a little on my abstract question above to which there was no answer provided:

Is it the mere existence of the clinical trial that is providing the evidence or is it the published outcome/report of the clinical trial that is providing the evidence? I would imagine the latter? Perhaps it is the final report (I assume there always is one? I don't actually know) from the trial that should be cited as a publication? Perhaps these identifiers identify studies, yes, but most crucially identify the published outcome? Again, I don't know myself. Can someone shed light on this?

@mbrush
Collaborator Author

mbrush commented May 23, 2023

May 22 Update:

Clinical trial records in clinicaltrials.gov are often referenced as support for edges reporting that a drug treats a condition, e.g. "NCT00222573". In recent discussions we settled on a specific CURIE syntax with which to reference clinical trials as reported in registries like clinicaltrials.gov, in the Attribute.value field:

"value": "clinicaltrials:NCT00222573"
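
A minimal sketch of converting the bare identifiers some KPs send today into this CURIE form. The function name `to_clinicaltrials_curie` is hypothetical, and the check that an NCT id is "NCT" followed by eight digits is an assumption based on the ids shown in this thread.

```python
import re

# Assumed shape of an NCT id: "NCT" plus eight digits.
NCT_PATTERN = re.compile(r"^NCT\d{8}$")


def to_clinicaltrials_curie(identifier):
    """Return 'clinicaltrials:NCT########' for a bare or already-prefixed id."""
    bare = identifier.strip()
    if bare.lower().startswith("clinicaltrials:"):
        bare = bare.split(":", 1)[1]
    bare = bare.upper()
    if not NCT_PATTERN.match(bare):
        raise ValueError(f"Not a recognizable NCT id: {identifier!r}")
    return f"clinicaltrials:{bare}"


if __name__ == "__main__":
    print(to_clinicaltrials_curie("NCT00222573"))
```

Calling it on a value that is already a CURIE returns the same CURIE, so it can be applied to mixed data safely.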

We also decided that any URLs that allow users to link directly to the ct.gov site to explore these clinical trial records should be captured in the Attribute.value_url field:

"value": "clinicaltrials:NCT00222573"
"value_url": "https://clinicaltrials.gov/search?id=%22NCT03074773%22"

We have yet to settle on what edge property will be the attribute_type_id for the Attribute object. We could choose to treat these NCT ids as publications and use the existing biolink:publications edge property, or we could use a more specific edge property like biolink:supporting_studies that distinguishes trials from pubs (and lets us point to an instance of a performed study/trial whose results support the 'treats' edge).

Concrete examples of these competing proposals are presented below - showing how supporting trials would be represented alongside supporting publications for both approaches:

Option 1: Use the publications edge property and capture trials in same Attribute as pubs.

# Attribute containing pubs referenced by CURIE or URL (which here include NCT records)
{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",          # a publication reporting the results of one supporting trial
           "PMID:6815562",           # a publication reporting the results of another supporting trial
           "clinicaltrials:NCT00222573",
           "clinicaltrials:NCT00503152",
           "clinicaltrials:NCT00634963"
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "value_url":  "https://clinicaltrials.gov/search?id=%22NCT02658760%22OR%22NCT02679560%22OR%22NCT05084573%22",
  "attribute_source": "infores:chembl"
}, 

Option 2: Use a supporting_studies edge property and create a separate Attribute object.

# Attribute containing pubs referenced by CURIE or URL (which here do not include clinical trial records)
{
  "attribute_type_id": "biolink:publications",            
  "value": [
           "PMID:31737390",          # a publication reporting the results of one supporting trial
           "PMID:6815562"            # a publication reporting the results of another supporting trial
           ],
  "value_type_id": "biolink:Uriorcurie",    
  "attribute_source": "infores:chembl"
}, 

# Attribute containing supporting clinical trials
{
  "attribute_type_id": "biolink:supporting_studies",           
  "value": [
           "clinicaltrials:NCT02658760",
           "clinicaltrials:NCT02679560",
           "clinicaltrials:NCT05084573"
           ],                                  
  "value_type_id": "biolink:Uriorcurie",    
  "value_url":  "https://clinicaltrials.gov/search?id=%22NCT02658760%22OR%22NCT02679560%22OR%22NCT05084573%22",
  "attribute_source": "infores:chembl"
}

In making a decision, consider how the Clinical Trials KP being developed by the Multiomics team will represent these clinical trial records in their data (as information entities (pubs / records), or as research activities (studies/trials)) - and aim to apply consistent semantics across these representations.

Also, consider that using a separate supporting_studies property makes it easier for the UI to find / count supporting pubs vs trials - esp if we start using other designators for studies besides NCT ids.
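
To make the difference between the two options concrete, here is a minimal sketch of turning an Option 1 style Attribute (trials mixed in with pubs) into the two Attributes of Option 2. The function name is hypothetical, `biolink:supporting_studies` is the property proposed in this ticket rather than something confirmed in Biolink, and trials are recognized only by the `clinicaltrials:` prefix.

```python
TRIAL_PREFIX = "clinicaltrials:"


def split_trials_from_publications(attribute):
    """Split one mixed publications Attribute into pubs and supporting studies.

    Returns a list of one or two Attribute dicts. Only value_type_id and
    attribute_source are carried over; other fields are left out on purpose.
    """
    values = attribute["value"]
    if isinstance(values, str):
        values = [values]
    pubs = [v for v in values if not v.startswith(TRIAL_PREFIX)]
    trials = [v for v in values if v.startswith(TRIAL_PREFIX)]

    shared = {k: attribute[k] for k in ("value_type_id", "attribute_source") if k in attribute}
    result = []
    if pubs:
        result.append({"attribute_type_id": "biolink:publications", "value": pubs, **shared})
    if trials:
        # Proposed property name from this ticket, not yet a confirmed Biolink slot.
        result.append({"attribute_type_id": "biolink:supporting_studies", "value": trials, **shared})
    return result


if __name__ == "__main__":
    mixed = {
        "attribute_type_id": "biolink:publications",
        "value": ["PMID:31737390", "clinicaltrials:NCT00222573"],
        "value_type_id": "biolink:Uriorcurie",
        "attribute_source": "infores:chembl",
    }
    for attr in split_trials_from_publications(mixed):
        print(attr)
```

With Option 2 a UI can count pubs and trials by attribute_type_id alone; with Option 1 it has to do this kind of prefix inspection itself.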

@gglusman

gglusman commented May 23, 2023

Note the 'value_url' as shown ( https://clinicaltrials.gov/search?id=%22NCT03074773%22 ) doesn't work.
The correct syntax for the search seems to be: https://clinicaltrials.gov/ct2/results?term=NCT03074773

To get to the trial directly, for the current version of the system, the URL would be: https://clinicaltrials.gov/ct2/show/NCT03074773

Looks like they're developing a new version, and under it, the URL seems to be: https://beta.clinicaltrials.gov/study/NCT03074773

...and the search syntax: https://beta.clinicaltrials.gov/search?id=NCT03074773
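
A minimal sketch of building the direct trial-page URL from a CURIE, using only the two URL patterns quoted above. The function name `trial_page_url` is hypothetical, and these patterns reflect the site as observed in this comment; they may change when the new version replaces the current one.

```python
def trial_page_url(curie, beta=False):
    """Build a direct trial-page URL from a 'clinicaltrials:NCT...' CURIE.

    URL patterns are copied from the comment above and may change over time.
    """
    nct_id = curie.split(":", 1)[-1]
    if beta:
        return f"https://beta.clinicaltrials.gov/study/{nct_id}"
    return f"https://clinicaltrials.gov/ct2/show/{nct_id}"


if __name__ == "__main__":
    print(trial_page_url("clinicaltrials:NCT03074773"))
    print(trial_page_url("clinicaltrials:NCT03074773", beta=True))
```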

@mbrush
Collaborator Author

mbrush commented May 23, 2023

Thanks Gustavo - I used the search syntax in my examples only because this is the format of the links we get from sources like chembl as provided by MolePro. Ideally we would want to point people directly to individual trial pages. And good to know about the new version in development. I'd like to find out more about this from you or Kamileh.

@gglusman

To clarify, the new version appears to be just for their UI.
You get prompted to try it when accessing the old one.

@mbrush
Collaborator Author

mbrush commented May 23, 2023

Outcome of 5-23-23 EPC Call - general preference for 'supporting_study' as the long term solution, but we may have to wait to implement until after September, as KPs may not be able to regenerate data to be compliant with this specification before then. In the meantime, the modeling team will get the required edge property into Biolink, and draft an initial specification for representing supporting clinical trials / studies in TRAPI.

@mbrush
Collaborator Author

mbrush commented Jul 22, 2023

Closing with the creation of the supporting_publications_specification - but as noted in the spec, we need to return to this and modify the specification for how to reference supporting clinical trials. I created #447 as a separate ticket for this issue.

@mbrush mbrush closed this as completed Jul 22, 2023