Tiffany J. Callahan edited this page Sep 21, 2020 · 17 revisions

Release: v1.0 (first official release)


TODO: ADD WORKFLOW PICTURE HERE



Release Updates

New Data Sources:

New Functionality:

  • Added build options (see here for more details): full, partial, and post-closure
  • Added options to construct the knowledge graph using a subclass-based or instance-based approach (see here for more information)
  • Added Data_Preparation.ipynb Jupyter Notebook to aid in creation of mapping, filtering, and labeling datasets
  • Added Ontology_Cleaning.ipynb Jupyter Notebook to aid in cleaning and preprocessing ontology data
  • Added generates_dependency_documents.py to assist users with the creation of required input documents
  • Knowledge graph can be constructed using primary edges or primary and inverse edges
  • Added metadata for instance data nodes to knowledge graph (see here for details)
  • Improved reproducibility by providing detailed metadata on all downloaded data sources
  • Removed redundant resource download
  • Added explicit typing to all functions and class attributes
  • Modified OWL-NETS to decode OWL-encoded classes and triples (see here for more information)
  • Added a NetworkX MultiDiGraph (.gpickle) for each full build
  • Added Docker containers to build pkt_kg and to host Neo4j instances for easy querying of pkt_kg knowledge graphs


Jupyter Notebooks


Ontologies

Downloaded Resource Information:


Ontologies: The specific ontologies used in the knowledge graph, including class and axiom counts, are shown in the table below (links and counts refer to the cleaned ontology files).

| Ontology | Class Count | Axiom Count |
| --- | --- | --- |
| Cell Line Ontology (cell) | 44,858 | 512,471 |
| Chemical Entities of Biological Interest - Lite (chemical) | 116,167 | 970,288 |
| Gene Ontology (gobp, gomf, gocc) | 44,579 | 517,187 |
| Human Disease Ontology (disease) | 16,150 | 157,349 |
| Human Phenotype Ontology (phenotype) | 25,904 | 335,752 |
| Human Protein Ontology (protein) | 108,408 | 1,273,437 |
| Pathway Ontology (pathway) | 2,600 | 21,867 |
| Relation Ontology (relations) | 82 | 5,595 |
| Sequence Ontology (genes, variants) | 2,237 | 21,421 |
| Uber-Anatomy Ontology (anatomy) | 18,567 | 258,060 |
| The Vaccine Ontology (vaccine) | 6,527 | 55,663 |

NOTE. Please see the Ontology_Cleaning.ipynb Jupyter Notebook for details on how the ontologies were preprocessed prior to being added to the knowledge graph.



Creation of Mapping, Filtering, and Labeling Data

To create the edge types listed in the tables above, several additional files were needed. For more details on these data sources, please see the Data_Preparation.ipynb Jupyter Notebook.

Ontology Class-Instance/Subclass Node Mapping: subclass_construction_map.pkl


Relations Data:
See the relations_directory README.md for more information.


Node Metadata:
See the node_directory README.md for more information.

| GENE | RNA | PATHWAY | VARIANT |
| --- | --- | --- | --- |
| chemical-gene | chemical-rna | chemical-pathway | variant-disease |
| gene-disease | rna-anatomy | gobp-pathway | variant-gene |
| gene-gene | rna-cell | pathway-gocc | variant-phenotype |
| gene-pathway | rna-protein | pathway-gomf | |
| gene-phenotype | | protein-pathway | |
| gene-protein | | | |
| gene-rna | | | |


Edge Data

Data Download and Creation Dates: April-May 2020

Master Edge Lists: Master_Edge_List_Dict.json


Edge List: Whenever possible, we limit edges to Homo sapiens concepts that are supported by some type of evidence. For the exact specifications, please see resource_info.txt. The counts shown in the table below reflect only those edges that had valid node metadata.

| Edge | Edge Relation (Inverse Relation) | Subject Count | Edge Count (Rels / Rels+InvRels) | Object Count |
| --- | --- | --- | --- | --- |
| chemical-disease | substance that treats (is treated by substance) | 3,006 | 67,822 / 135,644 | 2,334 |
| chemical-gene | interacts with | 447 | 17,352 / 34,704 | 12,224 |
| chemical-gobp | molecularly interacts with | 3,008 | 1,204,086 / 2,408,172 | 6,452 |
| chemical-gocc | molecularly interacts with | 2,083 | 108,132 / 216,264 | 737 |
| chemical-gomf | molecularly interacts with | 2,595 | 101,632 / 203,264 | 1,189 |
| chemical-pathway | participates in (has participant) | 2,114 | 27,891 / 55,782 | 2,098 |
| chemical-phenotype | substance that treats (is treated by substance) | 2,976 | 57,462 / 114,924 | 1,450 |
| chemical-protein | interacts with | 3,309 | 99,482 / 198,964 | 13,067 |
| chemical-rna | interacts with | 1,701 | 1,904,370 / 3,808,740 | 176,266 |
| disease-phenotype | has phenotype (phenotype of) | 3,707 | 163,422 / 326,844 | 7,246 |
| gene-disease | causes or contributes to | 8,154 | 65,242 / 65,242 | 3,902 |
| gene-gene | genetically interacts with | 18,207 | 1,694,441 / 3,388,882 | 18,886 |
| gene-pathway | participates in (has participant) | 10,370 | 105,200 / 210,400 | 1,811 |
| gene-phenotype | causes or contributes to | 6,867 | 27,014 / 27,014 | 1,630 |
| gene-protein | has gene product (gene product of) | 19,388 | 38,316 / 76,632 | 37,422 |
| gene-rna | transcribed to (transcribed from) | 38,886 | 216,077 / 432,154 | 211,456 |
| gobp-pathway | realized in response to | 696 | 1,128 / 1,128 | 1,128 |
| pathway-gocc | has component | 10,485 | 14,823 / 14,823 | 99 |
| pathway-gomf | has function (function of) | 1,690 | 1,690 / 3,380 | 578 |
| protein-anatomy | located in (location of) | 21,117 | 60,595 / 121,190 | 68 |
| protein-catalyst | molecularly interacts with | 2,848 | 19,018 / 38,036 | 2,658 |
| protein-cell | located in (location of) | 20,005 | 148,267 / 296,534 | 125 |
| protein-cofactor | molecularly interacts with | 1,540 | 1,904 / 3,808 | 43 |
| protein-gobp | participates in (has participant) | 34,336 | 279,197 / 558,394 | 12,353 |
| protein-gocc | located in (location of) | 35,941 | 165,353 / 330,706 | 1,765 |
| protein-gomf | has function (function of) | 34,070 | 131,073 / 262,146 | 4,271 |
| protein-pathway | participates in (has participant) | 21,373 | 226,092 / 452,184 | 2,322 |
| protein-protein | molecularly interacts with | 32,781 | 3,251,279 / 3,251,279 | 32,781 |
| rna-anatomy | located in (location of) | 29,484 | 401,703 / 803,406 | 102 |
| rna-cell | located in (location of) | 14,345 | 69,796 / 139,592 | 127 |
| rna-protein | ribosomally translates to (ribosomal translation of) | 167,943 | 338,100 / 676,200 | 37,859 |
| variant-disease | causes or contributes to | 23,850 | 52,214 / 52,214 | 2,346 |
| variant-gene | causally influences (causally influenced by) | 355,582 | 355,582 / 711,164 | 4,755 |
| variant-phenotype | causes or contributes to | 2,895 | 3,510 / 3,510 | 548 |

Rels: Relations Only; Rels+InvRels: Relations and Inverse Relations.
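
In the rows of the table above, the Rels+InvRels count appears to be either equal to the Rels count or exactly twice it. The following is a minimal sketch of a sanity check built on that observation; the rule is an assumption drawn from the table, not a documented guarantee of the build, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def inverse_count_is_consistent(rels: int, rels_plus_inv: int) -> bool:
    """Return True when the Rels+InvRels count equals Rels or exactly 2 * Rels.

    Assumption: adding inverse (or mirrored interaction) edges either leaves
    the edge count unchanged or doubles it.
    """
    return rels_plus_inv in (rels, 2 * rels)


# A few (Rels, Rels+InvRels) pairs copied from the edge table.
rows: Dict[str, Tuple[int, int]] = {
    "chemical-disease": (67_822, 135_644),
    "gene-disease": (65_242, 65_242),
    "protein-protein": (3_251_279, 3_251_279),
    "variant-gene": (355_582, 711_164),
}

# Collect any edge type whose two counts do not fit the assumed rule.
inconsistent: List[str] = [
    edge for edge, (rels, rels_inv) in rows.items()
    if not inverse_count_is_consistent(rels, rels_inv)
]
print(inconsistent)
```

A check like this is mainly useful for catching transcription errors when the counts are copied by hand.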




Knowledge Graph Construction Approach



Subclass-Based Construction

Contents


There are several options for generating knowledge graphs:

  • Relations:

    • Standard Relations: The knowledge graph has been built with a single set of edge relations.
    • Relations + Inverse Relations: The knowledge graph has been built with the standard set of relations and, where available, the InverseObjectProperties of the standard relations. One caveat: if the standard relation is a type of interaction (e.g. interacts_with, molecularly interacts with) and the provided edge list is not symmetric (i.e. it does not include both directions of each interaction), the interaction relation itself is reused to represent the missing direction.
  • Closure:

    • Not Closed: The knowledge graph is not closed and thus has not been checked for consistency.
    • Closed: The knowledge graph has been deductively closed.
  • OWL Semantics:
    Required Input Document: OWL_NETS_Property_Types.txt

    • OWL: The knowledge graph has not been filtered.
    • OWL Decoded: The knowledge graph has been filtered to decode OWL-encoded classes and triples. For information on how we process OWL semantics, please see the OWL-NETS 2.0 wiki.
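
The Relations + Inverse Relations option described above can be sketched roughly as follows. This is an illustration of the stated rule only, not the pkt_kg implementation; the function name, the `inverse_of` and `interaction_relations` arguments, and all identifiers in the example are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def add_inverse_edges(
    edges: List[Triple],
    inverse_of: Dict[str, str],
    interaction_relations: Set[str],
) -> List[Triple]:
    """Return the edges plus their inverse (or mirrored interaction) edges.

    - If a relation has a known inverse, add (object, inverse, subject).
    - Otherwise, if it is an interaction relation, reuse the same relation
      for the missing direction: (object, relation, subject).
    - Edges with neither are left as they are.
    """
    seen = set(edges)
    result = list(edges)
    for subj, rel, obj in edges:
        if rel in inverse_of:
            candidate = (obj, inverse_of[rel], subj)
        elif rel in interaction_relations:
            candidate = (obj, rel, subj)
        else:
            continue
        # Skip edges that already exist, e.g. in a symmetric edge list.
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


# Hypothetical identifiers, used only to exercise the three cases.
edges = [
    ("CHEM_1", "participates in", "PATHWAY_1"),
    ("GENE_1", "interacts with", "GENE_2"),
    ("GENE_3", "causes or contributes to", "DISEASE_1"),
]
expanded = add_inverse_edges(
    edges,
    inverse_of={"participates in": "has participant"},
    interaction_relations={"interacts with"},
)
for triple in expanded:
    print(triple)
```

Under these assumptions, an edge type with an inverse or an asymmetric interaction doubles in size, while an edge type with neither keeps its original count.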

Knowledge Graphs

The following types of files are provided:

  • Knowledge Graphs: The knowledge graph can be downloaded in two different formats, listed in the Files column of the table below (.owl or .nt, and .gpickle).
  • OWL-NETS Results: A pickled (.pickle) nested dictionary where each outer key is an anonymous node and the two inner keys contain: (1) a dictionary of owl-encoded triples and (2) a set of owl-decoded triples.

| Details | Files | Classes | Axioms | Individuals | Object Properties | Triples |
| --- | --- | --- | --- | --- | --- | --- |
| Merged Ontology Data | MergedOntologies.owl | 366,846 | 3,923,625 | 123 | 825 | 7,403,065 |
| STANDARD RELATIONS | | | | | | |
| Not Closed, OWL | PheKnowLator.owl, PheKnowLator.gpickle | 966,951 | 16,754,185 | 123 | 825 | 54,495,953 |
| Not Closed, OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Closed, OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Closed, OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| STANDARD RELATIONS + INVERSE RELATIONS | | | | | | |
| Not Closed, OWL | PheKnowLator.owl, PheKnowLator.gpickle | 966,951 | 24,747,664 | 123 | 825 | 86,512,173 |
| Not Closed, OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | 966,943 | 21,151,866 | --- | 283 | 21,151,866 |
| Closed, OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Closed, OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
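
The OWL-NETS Results file is described above as a pickled nested dictionary keyed by anonymous node. A minimal sketch of reading and summarizing such a structure follows. The two inner key names (`owl-encoded`, `owl-decoded`) and the example content are assumptions made for illustration; the actual key names in the released files may differ.

```python
import pickle
from typing import Dict, Tuple

# Hypothetical inner key names; check the real file before relying on them.
ENCODED_KEY = "owl-encoded"
DECODED_KEY = "owl-decoded"


def summarize_owl_nets(results: Dict[str, dict]) -> Dict[str, Tuple[int, int]]:
    """Map each anonymous node to (number of encoded entries, number of decoded triples)."""
    return {
        node: (len(entry[ENCODED_KEY]), len(entry[DECODED_KEY]))
        for node, entry in results.items()
    }


# Made-up example with the shape described in the text: the outer key is an
# anonymous node, the inner values are a dict of OWL-encoded triples and a
# set of OWL-decoded triples.
example = {
    "_:anon1": {
        ENCODED_KEY: {"restriction": ("_:anon1", "owl:onProperty", "REL_EXAMPLE")},
        DECODED_KEY: {("CLASS_A", "REL_EXAMPLE", "CLASS_B")},
    }
}

# Round-trip through pickle to mimic loading the .pickle file from disk.
restored = pickle.loads(pickle.dumps(example))
summary = summarize_owl_nets(restored)
print(summary)
```

With a downloaded file, `pickle.load` on the opened file would replace the in-memory round trip; only unpickle files from a source you trust.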

Edge Lists

Three different types of files are included in the table below:

  • Integer Labels: a tab-delimited .txt file containing three columns, one for each part of a triple (i.e. subject, predicate, object). The subject, predicate, and object identifiers have been mapped to integers.
  • Identifier Labels: a tab-delimited .txt file containing three columns, one for each part of a triple (i.e. subject, predicate, object). Neither the subject nor the object identifiers have been mapped to integers.
  • Identifier-Integer Map: a .json file containing a dictionary where the keys are node identifiers and the values are integers.

| Details | Not Closed | Closed |
| --- | --- | --- |
| STANDARD RELATIONS | | |
| OWL Semantics | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) |
| No OWL Semantics | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) |
| RELATIONS + INVERSE RELATIONS | | |
| OWL Semantics | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) |
| No OWL Semantics | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) |
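
The relationship between the three edge list files can be sketched as follows: starting from identifier-labeled triples, each identifier is assigned an integer, which yields both the integer-labeled triples and the identifier-integer map. This is an illustrative sketch rather than the project's code. It assumes files have no header row, that subjects, predicates, and objects share one map, and that integers are assigned in order of first appearance; the identifiers in the example are placeholders.

```python
import csv
import io
import json
from typing import Dict, List, Tuple


def map_triples_to_integers(
    identifier_text: str,
) -> Tuple[List[Tuple[int, int, int]], Dict[str, int]]:
    """Convert tab-delimited identifier triples into integer triples plus a map."""
    identifier_to_int: Dict[str, int] = {}
    integer_triples: List[Tuple[int, int, int]] = []
    reader = csv.reader(io.StringIO(identifier_text), delimiter="\t")
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        ids = []
        for term in row:
            # Assign the next unused integer the first time a term is seen.
            if term not in identifier_to_int:
                identifier_to_int[term] = len(identifier_to_int)
            ids.append(identifier_to_int[term])
        integer_triples.append((ids[0], ids[1], ids[2]))
    return integer_triples, identifier_to_int


# Placeholder identifiers standing in for the contents of an identifier file.
sample = "NODE_A\tREL_1\tNODE_B\nNODE_B\tREL_1\tNODE_C\n"
integer_triples, identifier_to_int = map_triples_to_integers(sample)

print(integer_triples)
print(json.dumps(identifier_to_int))
```

Inverting `identifier_to_int` gives the lookup needed to translate integer-labeled results back to identifiers.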

Node Metadata

Each file is a tab-delimited .txt file that contains the following columns:

  • node_id (e.g. "GO_0048252")
  • label (e.g. "lauric acid metabolic process")
  • description/definition (e.g. "The chemical reactions and pathways involving lauric acid, a fatty acid with the formula CH3(CH2)10COOH. Derived from vegetable sources.")
  • synonym (e.g. "lauric acid metabolism|n-dodecanoic acid metabolic process|n-dodecanoic acid metabolism")

| Detail | Not Closed | Closed |
| --- | --- | --- |
| STANDARD RELATIONS | | |
| OWL Semantics | Node Data (txt) | Node Data (txt) |
| No OWL Semantics | Node Data (txt) | Node Data (txt) |
| RELATIONS + INVERSE RELATIONS | | |
| OWL Semantics | Node Data (txt) | Node Data (txt) |
| No OWL Semantics | Node Data (txt) | Node Data (txt) |
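
A node metadata file with the four columns listed above can be read with the standard library alone. The sketch below assumes one record per line, no header row, the column order given above, and a pipe (`|`) as the synonym separator, as suggested by the example values; the class and function names are hypothetical.

```python
import csv
import io
from typing import Dict, List, NamedTuple


class NodeMetadata(NamedTuple):
    node_id: str
    label: str
    description: str
    synonyms: List[str]


def parse_node_metadata(text: str) -> Dict[str, NodeMetadata]:
    """Parse tab-delimited node metadata into a dictionary keyed by node_id."""
    # QUOTE_NONE keeps any quotation marks inside definitions as literal text.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    nodes: Dict[str, NodeMetadata] = {}
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        node_id, label, description, synonym = row
        synonyms = synonym.split("|") if synonym else []
        nodes[node_id] = NodeMetadata(node_id, label, description, synonyms)
    return nodes


# One record built from the example values given in the column list.
sample = (
    "GO_0048252\tlauric acid metabolic process\t"
    "The chemical reactions and pathways involving lauric acid, a fatty acid "
    "with the formula CH3(CH2)10COOH. Derived from vegetable sources.\t"
    "lauric acid metabolism|n-dodecanoic acid metabolic process|"
    "n-dodecanoic acid metabolism\n"
)
nodes = parse_node_metadata(sample)
print(nodes["GO_0048252"].label)
print(nodes["GO_0048252"].synonyms)
```

If the released files do include a header row, it should be skipped before the loop.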




Instance-Based Construction

Contents


There are several options for generating knowledge graphs:

  • Relations:

    • Standard Relations: The knowledge graph has been built with a single set of edge relations.
    • Relations + Inverse Relations: The knowledge graph has been built with the standard set of relations and, where available, the InverseObjectProperties of the standard relations. One caveat: if the standard relation is a type of interaction (e.g. interacts_with, molecularly interacts with) and the provided edge list is not symmetric (i.e. it does not include both directions of each interaction), the interaction relation itself is reused to represent the missing direction.
  • Closure:

    • Not Closed: The knowledge graph is not closed and thus has not been checked for consistency.
    • Closed: The knowledge graph has been deductively closed.
  • OWL Semantics:
    Required Input Document: OWL_NETS_Property_Types.txt

    • OWL: The knowledge graph has not been filtered.
    • OWL Decoded: The knowledge graph has been filtered to decode OWL-encoded classes and triples. For information on how we process OWL semantics, please see the OWL-NETS 2.0 wiki.

Knowledge Graphs

Three different types of files are included in the table below:

  • Knowledge Graphs: The knowledge graph can be downloaded in two different formats, listed in the Files column of the table below (.owl or .nt, and .gpickle).
  • Class Instance IRI-UUID Map: A dictionary that maps each original ontology class international resource identifier (keys) to its instance referenced by a universally unique identifier (values) saved as a .json file.
  • OWL-NETS Results: A pickled (.pickle) nested dictionary where each outer key is an anonymous node and the two inner keys contain: (1) a dictionary of owl-encoded triples and (2) a set of owl-decoded triples.

| Details | Files | Classes | Axioms | Individuals | Object Properties | Triples |
| --- | --- | --- | --- | --- | --- | --- |
| Merged Ontology Data | MergedOntologies.owl | 366,846 | 3,923,625 | 123 | 825 | 7,403,065 |
| STANDARD RELATIONS | | | | | | |
| Not Closed, OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Not Closed, OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Closed, OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Closed, OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| STANDARD RELATIONS + INVERSE RELATIONS | | | | | | |
| Not Closed, OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Not Closed, OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Closed, OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
| Closed, OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
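
The Class Instance IRI-UUID Map described above pairs each ontology class IRI with a UUID-identified instance and is stored as JSON. A minimal sketch of building and serializing such a map follows. The instance namespace and the second class IRI are hypothetical, and using random `uuid4` values is an assumption; the project may generate its identifiers differently.

```python
import json
import uuid
from typing import Dict, Iterable

# Hypothetical namespace for instance IRIs, used only for illustration.
INSTANCE_NAMESPACE = "https://example.org/instance/"


def build_class_instance_map(class_iris: Iterable[str]) -> Dict[str, str]:
    """Map each class IRI (key) to a new UUID-based instance IRI (value)."""
    mapping: Dict[str, str] = {}
    for iri in class_iris:
        # Reuse the existing instance if a class IRI appears more than once.
        if iri not in mapping:
            mapping[iri] = INSTANCE_NAMESPACE + str(uuid.uuid4())
    return mapping


class_iris = [
    "http://purl.obolibrary.org/obo/GO_0048252",
    "https://example.org/class/B",  # placeholder class IRI
    "http://purl.obolibrary.org/obo/GO_0048252",  # duplicate on purpose
]
class_instance_map = build_class_instance_map(class_iris)

# Serialize to JSON text, as the map would be saved in a .json file.
serialized = json.dumps(class_instance_map, indent=2)
print(serialized)
```

Because the UUIDs here are random, rebuilding the map yields different instance identifiers, so the saved .json file is what ties a given build's instances back to their classes.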

Edge Lists

Three different types of files are included in the table below:

  • Integer Labels: a tab-delimited .txt file containing three columns, one for each part of a triple (i.e. subject, predicate, object). The subject, predicate, and object identifiers have been mapped to integers.
  • Identifier Labels: a tab-delimited .txt file containing three columns, one for each part of a triple (i.e. subject, predicate, object). Neither the subject nor the object identifiers have been mapped to integers.
  • Identifier-Integer Map: a .json file containing a dictionary where the keys are node identifiers and the values are integers.

| Details | Not Closed | Closed |
| --- | --- | --- |
| STANDARD RELATIONS | | |
| OWL Semantics | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) |
| No OWL Semantics | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) |
| RELATIONS + INVERSE RELATIONS | | |
| OWL Semantics | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) |
| No OWL Semantics | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) | Triple List - Integer (txt); Triple List - Identifier (txt); Node Integer-Identifier Map (json) |

Node Metadata

Each file is a tab-delimited .txt file that contains the following columns:

  • node_id (e.g. "GO_0048252")
  • label (e.g. "lauric acid metabolic process")
  • description/definition (e.g. "The chemical reactions and pathways involving lauric acid, a fatty acid with the formula CH3(CH2)10COOH. Derived from vegetable sources.")
  • synonym (e.g. "lauric acid metabolism|n-dodecanoic acid metabolic process|n-dodecanoic acid metabolism")

| Detail | Not Closed | Closed |
| --- | --- | --- |
| STANDARD RELATIONS | | |
| OWL Semantics | Node Data (txt) | Node Data (txt) |
| No OWL Semantics | Node Data (txt) | Node Data (txt) |
| RELATIONS + INVERSE RELATIONS | | |
| OWL Semantics | Node Data (txt) | Node Data (txt) |
| No OWL Semantics | Node Data (txt) | Node Data (txt) |







This project is licensed under Apache License 2.0 - see the LICENSE.md file for details. If you intend to use any of the information on this Wiki, please provide the appropriate attribution by citing this repository:

@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}