Home
Here, we document the transformation of the National Library of Medicine's Semantic Knowledge Representation predications into RDF. This work was developed as part of the 4th Annual Biomedical Linked Annotation Hackathon (BLAH4), held in Kashiwa, Japan (January 2018).
Sources of “big” biomedical data like electronic health records (EHRs), high-throughput experiments, and Internet of Things devices provide researchers and clinicians with unprecedented opportunities for scientific advancement (Piai et al., 2013). Unfortunately, to fully utilize these data, researchers must face the formidable challenge of synthesizing relevant information from an exponentially expanding body of scientific literature (Sinoara et al., 2017; Simmons et al., 2017). To help solve this problem, the natural language processing and biomedical research communities have developed rigorous algorithms that have produced impressive collections of annotated text corpora. While the breadth of concept annotations in existing corpora is extensive, large-scale annotation of relations between annotated concepts is often limited or incomplete (Neves et al., 2014). With this in mind, we propose to extend the coverage of existing annotations in PubAnnotation by transforming the National Library of Medicine’s Semantic Representation (SemRep) predications into semantically linked annotations.
Given its size (~91 million subject-predicate-object predications) and coverage (26.7 million citations), SemRep has great potential to become a valuable community resource once it is mapped to ontologies and integrated with existing projects in PubAnnotation (especially those that include concepts not represented in the UMLS). We will generate two versions of the SemRep predications in RDF:
- SemRepRDF-UMLS: A UMLS-only version containing UMLS-licensed vocabularies.
- SemRepRDF-LOD: A Linked Open Data (LOD) version of the annotations that does not include any licensed vocabularies/terminologies. To create this version, we will leverage the UMLS concepts to map to other resources that are not subject to licensing restrictions.
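The idea behind the LOD version can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Predication` type, the `CUI_TO_OPEN` table, the placeholder identifiers, and the choice to drop predications whose concepts have no open mapping are all hypothetical and do not describe the project's actual mapping rules.

```python
from typing import Dict, NamedTuple, Optional


class Predication(NamedTuple):
    """One subject-predicate-object predication tied to a citation."""
    pmid: str
    subject: str    # UMLS CUI in the UMLS version, open IRI in the LOD version
    predicate: str  # relation name
    object: str     # UMLS CUI in the UMLS version, open IRI in the LOD version


# Hypothetical CUI -> open-resource table. The CUIs and IRIs are placeholders,
# not real mappings.
CUI_TO_OPEN: Dict[str, str] = {
    "C0000001": "http://example.org/open/CONCEPT_1",
    "C0000002": "http://example.org/open/CONCEPT_2",
}


def to_lod(pred: Predication, mapping: Dict[str, str]) -> Optional[Predication]:
    """Rewrite both concepts to open identifiers.

    Returns None when either concept has no open mapping (an assumption of
    this sketch; other policies are possible).
    """
    subject = mapping.get(pred.subject)
    obj = mapping.get(pred.object)
    if subject is None or obj is None:
        return None
    return pred._replace(subject=subject, object=obj)


mapped = to_lod(Predication("12345", "C0000001", "TREATS", "C0000002"), CUI_TO_OPEN)
unmapped = to_lod(Predication("12345", "C0000001", "TREATS", "C9999999"), CUI_TO_OPEN)
```

In this sketch `mapped` carries open IRIs in place of the CUIs, while `unmapped` is `None` because one concept has no entry in the table.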
To accomplish this goal, we used the hackathon to do the following:
- Finalize schema for representing SemRep predications (Finalized schema shown in Figure 1 below).
- Review UMLS licensing and Terms of Use (Licensing).
- Identify open resources and ontologies to map to existing annotations (UMLS Concept and Relation Mapping).
- Document the transformation process, including all discussions on GitHub (Hackathon Blog).
- Create an RDF version of the transformed SemRep annotations that can be made publicly available for download.
Figure 1. Draft of proposed schema to make SemRep predications compatible with PubAnnotation
There have been a few prior efforts aimed at converting SemRep predications into linked data, but most of these efforts were intentionally designed to convert only small subsets of the full database (OMOP RDF; Zhang et al., 2014; Zhang et al., 2013). To the best of our knowledge, no existing efforts have converted the full set of predications with the intention of integrating the resulting annotations into an existing public repository.
Specific information regarding SemRep is detailed below:
- The SemRep program is managed under the Semantic Knowledge Representation (SKR) project and is maintained by research staff at the National Library of Medicine (NLM).
- SemRep predications are generated using MetaMap. When building the predications, subject and object concepts are taken from the UMLS Metathesaurus, while relations are taken from the UMLS Semantic Network.
- Annotations are stored in the Semantic MEDLINE Database (SemMedDB).
- An example predication (taken from the online description of SemRep), combined with PubAnnotation, is shown below in Figure 2.
Figure 2. Example of PubAnnotation and SemRep annotations (prior to mapping UMLS identifiers to open resources and ontologies)
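To make the subject-predicate-object structure described above concrete, here is a minimal Python sketch that serializes one predication as N-Triples. It is not the schema from Figure 1. It assumes a plain RDF reification pattern (`rdf:Statement`) with a `dcterms:source` link to the citation, and the `http://example.org/semrep/` namespace, the function name, and the sample identifiers are all hypothetical.

```python
from typing import List

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"
EX = "http://example.org/semrep/"  # hypothetical namespace, for illustration only


def predication_to_ntriples(pred_id: str, pmid: str, subject_cui: str,
                            relation: str, object_cui: str) -> List[str]:
    """Return N-Triples lines for one predication.

    The first line asserts the relation directly; the remaining lines reify
    that statement so it can be linked to the citation it was extracted from.
    """
    stmt = f"<{EX}predication/{pred_id}>"
    subj = f"<{EX}umls/{subject_cui}>"
    rel = f"<{EX}relation/{relation}>"
    obj = f"<{EX}umls/{object_cui}>"
    src = f"<{EX}pubmed/{pmid}>"
    return [
        f"{subj} {rel} {obj} .",
        f"{stmt} <{RDF}type> <{RDF}Statement> .",
        f"{stmt} <{RDF}subject> {subj} .",
        f"{stmt} <{RDF}predicate> {rel} .",
        f"{stmt} <{RDF}object> {obj} .",
        f"{stmt} <{DCTERMS}source> {src} .",
    ]


# Placeholder identifiers, not taken from SemMedDB.
lines = predication_to_ntriples("1", "12345", "C0000001", "TREATS", "C0000002")
print("\n".join(lines))
```

Reification is only one way to attach provenance to a triple; named graphs or an annotation-centred model would be reasonable alternatives, depending on what PubAnnotation expects.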