Tiffany J. Callahan edited this page Jan 10, 2018 · 19 revisions

SemRepRDF

This wiki documents our process for transforming the National Library of Medicine's Semantic Knowledge Representation predications into an open linked data resource. The work was developed during, and presented at, the 4th Annual Biomedical Linked Annotation Hackathon (BLAH), held in Kashiwa, Japan (January 2018).

Proposal presentation

Motivation and Background

Sources of “big” biomedical data such as electronic health records (EHRs), high-throughput experiments, and Internet of Things devices give researchers and clinicians unprecedented opportunities for scientific advancement (Piai et al., 2013). To make full use of these data, however, researchers face the formidable challenge of synthesizing relevant information from an exponentially expanding body of scientific literature (Sinoara et al., 2017; Simmons et al., 2017). To help address this problem, the natural language processing and biomedical research communities have developed rigorous algorithms that have produced impressive collections of annotated text corpora. While the breadth of concept annotations in existing corpora is extensive, large-scale annotation of the relations between annotated concepts is often limited or incomplete (Neves et al., 2014). With this in mind, we propose to extend the coverage of existing annotations in PubAnnotation by transforming the National Library of Medicine’s Semantic Representation (SemRep) predications into semantically-linked annotations.

Proposed Work

We will use this year’s hackathon to complete initial work started this summer in collaboration with Dr. Jin-Dong Kim (see draft of schema in Figure 1). The specific goals we would like to complete during BLAH include:

  • Refine, extend, and implement the schema for representing SemRep predications (including the representation of annotation source provenance and/or metadata). This representation will be developed to ensure compatibility with existing PubAnnotation projects. Once we have a finalized version of the schema we will ask for feedback from other hackathon attendees.
  • Use the finalized schema to generate semantically-linked SemRep annotations. If we do not have enough time to generate representations for the full set of predications, we will generate them for a small subset. We will use this subset to obtain feedback from hackathon attendees.
  • Document the transformation process, including all discussions on GitHub.
  • Elicit feedback and ideas from hackathon attendees on ways to make SemRep annotations available to the community without violating UMLS licensing or Terms and Conditions of Use.
  • Create an RDF version of the transformed SemRep annotations that can be made publicly available for download.

Figure 1. Draft of the proposed schema for making SemRep predications compatible with PubAnnotation
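
To make the target representation more concrete, the following is a minimal sketch of how a pair of SemRep predications might be expressed in PubAnnotation's JSON annotation format (`text`, `denotations`, `relations`). It is not the schema from Figure 1: the helper names `find_span` and `build_annotation` are hypothetical, the sentence is the example from the online description of SemRep, and the `obj` values are plain concept names standing in for whatever identifiers the finalized schema would use.

```python
import json


def find_span(text, mention):
    """Return a character span for the first occurrence of `mention` in `text`."""
    begin = text.index(mention)
    return {"begin": begin, "end": begin + len(mention)}


def build_annotation(text, mentions, relations):
    """Assemble a dict with PubAnnotation-style keys.

    mentions:  list of (id, surface string, concept label)
    relations: list of (id, subject id, predicate, object id)
    """
    denotations = [
        {"id": mid, "span": find_span(text, surface), "obj": label}
        for mid, surface, label in mentions
    ]
    rels = [
        {"id": rid, "subj": subj, "pred": pred, "obj": obj}
        for rid, subj, pred, obj in relations
    ]
    return {"text": text, "denotations": denotations, "relations": rels}


text = (
    "We used hemofiltration to treat a patient with digoxin overdose "
    "that was complicated by refractory hyperkalemia"
)
# Concept labels are placeholders (assumption); real data would carry identifiers.
mentions = [
    ("T1", "hemofiltration", "Hemofiltration"),
    ("T2", "patient", "Patients"),
    ("T3", "digoxin overdose", "Digoxin overdose"),
]
relations = [
    ("R1", "T1", "TREATS", "T2"),
    ("R2", "T3", "PROCESS_OF", "T2"),
]

annotation = build_annotation(text, mentions, relations)
print(json.dumps(annotation, indent=2))
```

Keeping the concepts as span-anchored denotations and the predicates as relations between denotation ids is one way to let the SemRep relations sit alongside concept annotations from other PubAnnotation projects on the same text.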

Prior Work

There have been a few prior efforts to convert SemRep predications into linked data, but most were intentionally designed to convert only small subsets of the full database. To the best of our knowledge, no existing effort has converted the full set of predications with the intention of integrating the resulting annotations into an existing public repository.

Future Work and Planned Projects

In the spirit of linked data, our future work will focus on ways to map semantically-linked SemRep annotation concepts and relations to Open Biomedical Ontology (OBO) concepts.

SemRep Project Details

Specific information regarding SemRep is detailed below:

  • The SemRep program is managed under the Semantic Knowledge Representation (SKR) project and is maintained by research staff at the National Library of Medicine (NLM).
  • SemRep predications are generated using MetaMap. When building the predications, subject and object concepts are taken from the UMLS Metathesaurus, while relations are taken from the UMLS Semantic Network.
  • We will create annotations using a downloaded version of the Semantic MEDLINE Database (SemMedDB). Currently, the database contains over 89 million predications generated from 26.7 million PubMed citations.
  • An example predication (taken from the online description of SemRep) is shown below:
    • Sentence: We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia
    • SemRep Generated Predications:
      • Hemofiltration-TREATS-Patients
      • Digoxin overdose-PROCESS_OF-Patients
      • hyperkalemia-COMPLICATES-Digoxin overdose
      • Hemofiltration-TREATS(INFER)-Digoxin overdose
  • As briefly mentioned above, the use of SemRep requires obtaining a free UMLS license. The decision to work with licensed data should be weighed against the potential impact or contribution that the data source may have on the field. Given its size (~89 million subject-predicate-object predications) and coverage (26.7 million citations), SemRep has great potential to be a very valuable resource to the community when integrated with existing projects in PubAnnotation (especially those that include concepts not represented in the UMLS).
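
The example predications above follow a subject-PREDICATE-object pattern, which maps naturally onto RDF triples. The sketch below parses the display strings shown in the example and prints them as N-Triples. It is an illustration only: the `http://example.org/semrep/` namespace, the URI layout, and the names `parse_predication` and `to_ntriple` are assumptions, and the `(INFER)` marker is simply kept as a flag rather than modeled in RDF.

```python
import re
from urllib.parse import quote

# Hypothetical namespace, used only for illustration.
BASE = "http://example.org/semrep/"

# subject - PREDICATE[(INFER)] - object; the predicate is upper case with underscores.
PATTERN = re.compile(r"^(.+?)-([A-Z_]+)(\(INFER\))?-(.+)$")


def parse_predication(line):
    """Split a display string into (subject, predicate, inferred, object)."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognizable predication: {line!r}")
    subj, pred, infer, obj = match.groups()
    return subj, pred, infer is not None, obj


def to_ntriple(subj, pred, obj):
    """Render one triple in N-Triples syntax, percent-encoding concept names."""
    return (
        f"<{BASE}concept/{quote(subj)}> "
        f"<{BASE}predicate/{pred}> "
        f"<{BASE}concept/{quote(obj)}> ."
    )


examples = [
    "Hemofiltration-TREATS-Patients",
    "Digoxin overdose-PROCESS_OF-Patients",
    "hyperkalemia-COMPLICATES-Digoxin overdose",
    "Hemofiltration-TREATS(INFER)-Digoxin overdose",
]

parsed = [parse_predication(line) for line in examples]
for subj, pred, inferred, obj in parsed:
    print(to_ntriple(subj, pred, obj))
```

In a real conversion the subject and object URIs would more plausibly be built from concept identifiers than from concept names, and provenance (source sentence, citation, inferred status) would need its own representation.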