V1.0
TODO: ADD WORKFLOW PICTURE HERE
New Data Sources:
- The Protein Ontology (created a new human version of the Protein Ontology)
- The Uber-Anatomy Ontology
- The Sequence Ontology
- Human Protein Atlas data including transcript expression in cell and tissue types
- Variant data from ClinVar
- Gene-gene interactions from GeneMania
- Included additional data from Reactome and UniProt
New Functionality:
- Added build options (see here for more details): `full`, `partial`, and `post-closure`
- Added options to construct the knowledge graph using a subclass-based or instance-based approach (see here for more information)
- Added `Data_Preparation.ipynb` Jupyter Notebook to aid in creation of mapping, filtering, and labeling datasets
- Added `Ontology_Cleaning.ipynb` Jupyter Notebook to aid in cleaning and preprocessing ontology data
- Added `generates_dependency_documents.py` to assist users with the creation of required input documents
- Knowledge graph can be constructed using primary edges or primary and inverse edges
- Added metadata for instance data nodes to knowledge graph (see here for details)
- Improved reproducibility by providing detailed metadata on all downloaded data sources
- Removed redundant resource download
- Added explicit typing to all functions and class attributes
- Modified OWL-NETS to decode OWL-encoded classes and triples (see here for more information)
- Provided a NetworkX MultiDiGraph for each `full` build (`.gpickle`)
- Added Docker containers to build `pkt_kg` and to host Neo4J instances for easy querying of `pkt_kg` knowledge graphs
Downloaded Resource Information:
Ontologies: The specific ontologies used in the knowledge graph, including class and axiom counts, are shown in the table below (links and counts refer to the cleaned ontology files).
Ontology | Class Count | Axiom Count |
---|---|---|
Cell Line Ontology (cell) | 44,858 | 512,471 |
Chemical Entities of Biological Interest - Lite (chemical) | 116,167 | 970,288 |
Gene Ontology (gobp, gomf, gocc) | 44,579 | 517,187 |
Human Disease Ontology (disease) | 16,150 | 157,349 |
Human Phenotype Ontology (phenotype) | 25,904 | 335,752 |
Human Protein Ontology (protein) | 108,408 | 1,273,437 |
Pathway Ontology (pathway) | 2,600 | 21,867 |
Relation Ontology (relations) | 82 | 5,595 |
Sequence Ontology (genes, variants) | 2,237 | 21,421 |
Uber-Anatomy Ontology (anatomy) | 18,567 | 258,060 |
The Vaccine Ontology (vaccine) | 6,527 | 55,663 |

NOTE. Please see the `Ontology_Cleaning.ipynb` Jupyter Notebook for details on how the ontologies were preprocessed prior to being added to the knowledge graph.

To create the edge types listed in the edge list table below, several additional files were needed. For more details on these data sources, please see the `Data_Preparation.ipynb` Jupyter Notebook.

Ontology Class-Instance/Subclass Node Mapping: `subclass_construction_map.pkl`

Relations Data: See the `relations_directory` `README.md` for more information.
- Relations and Inverse Relations ➞ `INVERSE_RELATIONS.txt`
- Relations and Labels ➞ `RELATIONS_LABELS.txt`

Node Metadata: See the `node_directory` `README.md` for more information.

Data Download and Creation Dates: April-May 2020

Master Edge Lists: `Master_Edge_List_Dict.json`

Edge List: Whenever possible, we limit edges to Homo sapiens concepts that are supported by some type of evidence. For the exact specifications, please see `resource_info.txt`. The counts shown in the table below reflect only those edges that had valid node metadata.

Edge | Edge Relation (Inverse Relation) | Subject Count | Edge Count (Rels / Rels+InvRels) | Object Count |
---|---|---|---|---|
chemical-disease | substance that treats (is treated by substance) | 3,006 | 67,822 / 135,644 | 2,334 |
chemical-gene | interacts with | 447 | 17,352 / 34,704 | 12,224 |
chemical-gobp | molecularly interacts with | 3,008 | 1,204,086 / 2,408,172 | 6,452 |
chemical-gocc | molecularly interacts with | 2,083 | 108,132 / 216,264 | 737 |
chemical-gomf | molecularly interacts with | 2,595 | 101,632 / 203,264 | 1,189 |
chemical-pathway | participates in (has participant) | 2,114 | 27,891 / 55,782 | 2,098 |
chemical-phenotype | substance that treats (is treated by substance) | 2,976 | 57,462 / 114,924 | 1,450 |
chemical-protein | interacts with | 3,309 | 99,482 / 198,964 | 13,067 |
chemical-rna | interacts with | 1,701 | 1,904,370 / 3,808,740 | 176,266 |
disease-phenotype | has phenotype (phenotype of) | 3,707 | 163,422 / 326,844 | 7,246 |
gene-disease | causes or contributes to | 8,154 | 65,242 / 65,242 | 3,902 |
gene-gene | genetically interacts with | 18,207 | 1,694,441 / 3,388,882 | 18,886 |
gene-pathway | participates in (has participant) | 10,370 | 105,200 / 210,400 | 1,811 |
gene-phenotype | causes or contributes to | 6,867 | 27,014 / 27,014 | 1,630 |
gene-protein | has gene product (gene product of) | 19,388 | 38,316 / 76,632 | 37,422 |
gene-rna | transcribed to (transcribed from) | 38,886 | 216,077 / 432,154 | 211,456 |
gobp-pathway | realized in response to | 696 | 1,128 / 1,128 | 1,128 |
pathway-gocc | has component | 10,485 | 14,823 / 14,823 | 99 |
pathway-gomf | has function (function of) | 1,690 | 1,690 / 3,380 | 578 |
protein-anatomy | located in (location of) | 21,117 | 60,595 / 121,190 | 68 |
protein-catalyst | molecularly interacts with | 2,848 | 19,018 / 38,036 | 2,658 |
protein-cell | located in (location of) | 20,005 | 148,267 / 296,534 | 125 |
protein-cofactor | molecularly interacts with | 1,540 | 1,904 / 3,808 | 43 |
protein-gobp | participates in (has participant) | 34,336 | 279,197 / 558,394 | 12,353 |
protein-gocc | located in (location of) | 35,941 | 165,353 / 330,706 | 1,765 |
protein-gomf | has function (function of) | 34,070 | 131,073 / 262,146 | 4,271 |
protein-pathway | participates in (has participant) | 21,373 | 226,092 / 452,184 | 2,322 |
protein-protein | molecularly interacts with | 32,781 | 3,251,279 / 3,251,279 | 32,781 |
rna-anatomy | located in (location of) | 29,484 | 401,703 / 803,406 | 102 |
rna-cell | located in (location of) | 14,345 | 69,796 / 139,592 | 127 |
rna-protein | ribosomally translates to (ribosomal translation of) | 167,943 | 338,100 / 676,200 | 37,859 |
variant-disease | causes or contributes to | 23,850 | 52,214 / 52,214 | 2,346 |
variant-gene | causally influences (causally influenced by) | 355,582 | 355,582 / 711,164 | 4,755 |
variant-phenotype | causes or contributes to | 2,895 | 3,510 / 3,510 | 548 |

Rels: Relations Only; Rels+InvRels: Relations and Inverse Relations.

There are several options for generating knowledge graphs:
- Relations:
  - Standard Relations: The knowledge graph has been built with a single set of edge relations.
  - Relations + Inverse Relations: The knowledge graph has been built with the standard set of relations and, if available, the `InverseObjectProperties` of the standard relations. There is one caveat: if the original standard relation is a type of interaction (e.g. `interacts_with`, `molecularly interacts with`) and the provided edge list is not symmetric (meaning both sides of the interaction are not included in the provided edge list), then the interaction-related relation will be reused to represent the missing interactions.
- Closure:
  - Not Closed: The knowledge graph is not closed and thus has not been checked for consistency.
  - Closed: The knowledge graph has been deductively closed.
- OWL Semantics (Required Input Document: `OWL_NETS_Property_Types.txt`):
  - OWL: The knowledge graph has not been filtered.
  - OWL Decoded: The knowledge graph has been filtered to decode OWL-encoded classes and triples. For information on how we process OWL semantics, please see the OWL-NETS 2.0 wiki.

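The inverse-relation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the `pkt_kg` implementation: the helper `add_inverse_edges`, the `inverse_of` dictionary, the `symmetric` set, and the node identifiers are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject, relation, object)


def add_inverse_edges(edges: List[Edge],
                      inverse_of: Dict[str, str],
                      symmetric: Set[str]) -> List[Edge]:
    """Return the input edges plus inverse edges (hypothetical helper)."""
    seen = set(edges)
    result = list(edges)
    for subj, rel, obj in edges:
        if rel in inverse_of:
            # The relation has a declared inverse: add (object, inverse, subject).
            new_edge = (obj, inverse_of[rel], subj)
        elif rel in symmetric:
            # Interaction-type relation: reuse the same relation for the
            # direction that is missing from the edge list.
            new_edge = (obj, rel, subj)
        else:
            # No inverse available: the edge is left as-is.
            continue
        if new_edge not in seen:
            seen.add(new_edge)
            result.append(new_edge)
    return result


# Placeholder identifiers; relation labels are taken from the edge table.
edges = [
    ("chem_1", "substance that treats", "disease_1"),
    ("chem_1", "interacts with", "gene_1"),
    ("prot_1", "molecularly interacts with", "prot_2"),
    ("prot_2", "molecularly interacts with", "prot_1"),
    ("gene_1", "causes or contributes to", "disease_1"),
]
inverse_of = {"substance that treats": "is treated by substance"}
symmetric = {"interacts with", "molecularly interacts with"}

expanded = add_inverse_edges(edges, inverse_of, symmetric)
print(len(edges), len(expanded))
```

In this toy input, the already-symmetric protein pair and the relation without an inverse add nothing, while the other two edges each gain a reverse edge. That pattern is consistent with the edge counts reported above, where some edge types double under Rels+InvRels and others (e.g. gene-disease, protein-protein) stay the same.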
Two different types of files are included in the table below:
- Knowledge Graphs: The knowledge graph can be downloaded in two different formats:
  - RDFLib Graph serialized and saved as an `.owl` file
  - Networkx MultiDiGraph saved as a `.gpickle` file
- OWL-NETS Results: A pickled (`.pickle`) nested dictionary where each outer key is an anonymous node and the two inner keys contain: (1) a dictionary of `owl-encoded` triples and (2) a set of `owl-decoded` triples.

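As a rough illustration of the OWL-NETS results structure described above, the sketch below builds a toy nested dictionary, round-trips it through `pickle` in memory, and collects the decoded triples. The inner key names (`owl-encoded`, `owl-decoded`), the layout of the encoded dictionary, and every identifier are assumptions made for the example; check the actual `.pickle` file for the real keys.

```python
import io
import pickle

# Toy stand-in for the OWL-NETS results. Key names and identifiers are
# assumptions for illustration only.
results = {
    "anonymous_node_1": {
        # (1) a dictionary of OWL-encoded triples
        "owl-encoded": {
            "restriction": [
                ("anonymous_node_1", "owl:onProperty", "REL_1"),
                ("anonymous_node_1", "owl:someValuesFrom", "CLASS_B"),
            ],
        },
        # (2) a set of OWL-decoded triples
        "owl-decoded": {("CLASS_A", "REL_1", "CLASS_B")},
    }
}

# Round-trip through pickle in memory, standing in for reading the file.
buffer = io.BytesIO()
pickle.dump(results, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

# Gather every decoded triple across all anonymous nodes.
decoded_triples = set()
for node, parts in loaded.items():
    decoded_triples |= parts["owl-decoded"]

print(decoded_triples)
```

Only unpickle files obtained from a source you trust, since `pickle.load` can execute arbitrary code.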
Details | OWL Semantics | Files | Classes | Axioms | Individuals | Object Properties | Triples |
---|---|---|---|---|---|---|---|
Merged Ontology Data | | MergedOntologies.owl | 366,846 | 3,923,625 | 123 | 825 | 7,403,065 |
STANDARD RELATIONS | | | | | | | |
Not Closed | OWL | PheKnowLator.owl, PheKnowLator.gpickle | 966,951 | 16,754,185 | 123 | 825 | 54,495,953 |
Not Closed | OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Closed | OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Closed | OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
STANDARD RELATIONS + INVERSE RELATIONS | | | | | | | |
Not Closed | OWL | PheKnowLator.owl, PheKnowLator.gpickle | 966,951 | 24,747,664 | 123 | 825 | 86,512,173 |
Not Closed | OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | 966,943 | 21,151,866 | --- | 283 | 21,151,866 |
Closed | OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Closed | OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |

Three different types of files are included in the table below:
- Integer Labels: a tab-delimited `.txt` file containing three columns, one for each part of a triple (i.e. subject, predicate, object). The subject, predicate, and object identifiers have been mapped to integers.
- Identifier Labels: a tab-delimited `.txt` file containing three columns, one for each part of a triple (i.e. subject, predicate, object). Neither the subject nor the object identifiers have been mapped to integers.
- Identifier-Integer Map: a `.json` file containing a dictionary where the keys are node identifiers and the values are integers.

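The relationship between the three edge list files can be sketched as follows. This is a minimal example, not the project's code: the helper `build_integer_map`, the numbering scheme (integers assigned in order of first appearance), and the placeholder identifiers are assumptions, and the map here also covers predicates because the Integer Labels file maps them too.

```python
import csv
import io
import json
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_integer_map(triples: List[Triple]) -> Dict[str, int]:
    """Assign each distinct identifier an integer in order of first appearance."""
    id_map: Dict[str, int] = {}
    for triple in triples:
        for identifier in triple:
            if identifier not in id_map:
                id_map[identifier] = len(id_map)
    return id_map


# Placeholder triples; only GO_0048252 appears in this page.
triples = [
    ("GO_0048252", "REL_1", "NODE_A"),
    ("NODE_A", "REL_1", "NODE_B"),
]
id_map = build_integer_map(triples)

# Identifier Labels: one triple per line, tab-delimited.
identifier_out = io.StringIO()
csv.writer(identifier_out, delimiter="\t", lineterminator="\n").writerows(triples)

# Integer Labels: the same triples with each identifier replaced by its integer.
integer_out = io.StringIO()
csv.writer(integer_out, delimiter="\t", lineterminator="\n").writerows(
    [tuple(id_map[x] for x in t) for t in triples]
)

# Identifier-Integer Map: the dictionary serialised as JSON.
map_json = json.dumps(id_map, indent=2)

print(integer_out.getvalue())
```

Inverting `id_map` recovers identifiers from an Integer Labels file, which is useful when inspecting results from tools that only accept integer-encoded edge lists.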
Each file is a tab-delimited `.txt` file that contains the following columns:
- `node_id` (e.g. "GO_0048252")
- `label` (e.g. "lauric acid metabolic process")
- `description/definition` (e.g. "The chemical reactions and pathways involving lauric acid, a fatty acid with the formula CH3(CH2)10COOH. Derived from vegetable sources.")
- `synonym` (e.g. "lauric acid metabolism|n-dodecanoic acid metabolic process|n-dodecanoic acid metabolism")

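A node metadata file with the columns listed above could be read along these lines. The sketch assumes the file starts with a header row using exactly those column names and that synonyms are pipe-delimited, as the example value suggests; neither detail is confirmed here, and the sample is built in memory rather than read from a real download.

```python
import csv
import io

# In-memory sample built from the example values above. The header row is an
# assumption; adjust the field names to match the actual file.
sample = (
    "node_id\tlabel\tdescription/definition\tsynonym\n"
    "GO_0048252\tlauric acid metabolic process\t"
    "The chemical reactions and pathways involving lauric acid, a fatty acid "
    "with the formula CH3(CH2)10COOH. Derived from vegetable sources.\t"
    "lauric acid metabolism|n-dodecanoic acid metabolic process|"
    "n-dodecanoic acid metabolism\n"
)

nodes = {}
# QUOTE_NONE keeps any quotation marks inside definitions as literal text.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
for row in reader:
    nodes[row["node_id"]] = {
        "label": row["label"],
        "definition": row["description/definition"],
        # Split the pipe-delimited synonym string into a list.
        "synonyms": row["synonym"].split("|") if row["synonym"] else [],
    }

print(nodes["GO_0048252"]["synonyms"])
```

To read a downloaded file instead, replace `io.StringIO(sample)` with an open file handle, e.g. `open(path, newline="", encoding="utf-8")`.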
Detail | Not Closed | Closed |
---|---|---|
STANDARD RELATIONS | ||
OWL Semantics | Node Data (txt) | Node Data (txt) |
No OWL Semantics | Node Data (txt) | Node Data (txt) |
RELATIONS + INVERSE RELATIONS | ||
OWL Semantics | Node Data (txt) | Node Data (txt) |
No OWL Semantics | Node Data (txt) | Node Data (txt) |

There are several options for generating knowledge graphs:
- Relations:
  - Standard Relations: The knowledge graph has been built with a single set of edge relations.
  - Relations + Inverse Relations: The knowledge graph has been built with the standard set of relations and, if available, the `InverseObjectProperties` of the standard relations. There is one caveat: if the original standard relation is a type of interaction (e.g. `interacts_with`, `molecularly interacts with`) and the provided edge list is not symmetric (meaning both sides of the interaction are not included in the provided edge list), then the interaction-related relation will be reused to represent the missing interactions.
- Closure:
  - Not Closed: The knowledge graph is not closed and thus has not been checked for consistency.
  - Closed: The knowledge graph has been deductively closed.
- OWL Semantics (Required Input Document: `OWL_NETS_Property_Types.txt`):
  - OWL: The knowledge graph has not been filtered.
  - OWL Decoded: The knowledge graph has been filtered to decode OWL-encoded classes and triples. For information on how we process OWL semantics, please see the OWL-NETS 2.0 wiki.

Three different types of files are included in the table below:
- Knowledge Graphs: The knowledge graph can be downloaded in two different formats:
  - RDFLib Graph serialized and saved as an `.owl` file
  - Networkx MultiDiGraph saved as a `.gpickle` file
- Class Instance IRI-UUID Map: A dictionary that maps each original ontology class Internationalized Resource Identifier (`keys`) to its instance, referenced by a universally unique identifier (`values`), saved as a `.json` file.
- OWL-NETS Results: A pickled (`.pickle`) nested dictionary where each outer key is an anonymous node and the two inner keys contain: (1) a dictionary of `owl-encoded` triples and (2) a set of `owl-decoded` triples.

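To make the Class Instance IRI-UUID Map concrete, the sketch below builds a small map of that shape and round-trips it through JSON. How the project actually generates its identifiers is not stated here, so the use of `uuid.uuid4()` and the sample class IRIs are assumptions for illustration.

```python
import json
import uuid

# Sample ontology class IRIs (keys of the map).
class_iris = [
    "http://purl.obolibrary.org/obo/GO_0048252",
    "http://purl.obolibrary.org/obo/GO_0008150",
]

# Assumption: one random UUID per class stands in for its instance identifier.
iri_to_uuid = {iri: str(uuid.uuid4()) for iri in class_iris}

# Serialise to JSON and read it back, as one would with the .json file.
serialised = json.dumps(iri_to_uuid, indent=2)
restored = json.loads(serialised)

# Reverse lookup: from an instance UUID back to the original class IRI.
uuid_to_iri = {value: key for key, value in restored.items()}

print(len(restored), "classes mapped")
```

The reverse lookup is what lets an instance node in the graph be traced back to the ontology class it represents.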
Details | OWL Semantics | Files | Classes | Axioms | Individuals | Object Properties | Triples |
---|---|---|---|---|---|---|---|
Merged Ontology Data | | MergedOntologies.owl | 366,846 | 3,923,625 | 123 | 825 | 7,403,065 |
STANDARD RELATIONS | | | | | | | |
Not Closed | OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Not Closed | OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Closed | OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Closed | OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
STANDARD RELATIONS + INVERSE RELATIONS | | | | | | | |
Not Closed | OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Not Closed | OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Closed | OWL | PheKnowLator.owl, PheKnowLator.gpickle | --- | --- | --- | --- | --- |
Closed | OWL Decoded | PheKnowLator.nt, PheKnowLator.gpickle | --- | --- | --- | --- | --- |

Three different types of files are included in the table below:
- Integer Labels: a tab-delimited `.txt` file containing three columns, one for each part of a triple (i.e. subject, predicate, object). The subject, predicate, and object identifiers have been mapped to integers.
- Identifier Labels: a tab-delimited `.txt` file containing three columns, one for each part of a triple (i.e. subject, predicate, object). Neither the subject nor the object identifiers have been mapped to integers.
- Identifier-Integer Map: a `.json` file containing a dictionary where the keys are node identifiers and the values are integers.

Each file is a tab-delimited `.txt` file that contains the following columns:
- `node_id` (e.g. "GO_0048252")
- `label` (e.g. "lauric acid metabolic process")
- `description/definition` (e.g. "The chemical reactions and pathways involving lauric acid, a fatty acid with the formula CH3(CH2)10COOH. Derived from vegetable sources.")
- `synonym` (e.g. "lauric acid metabolism|n-dodecanoic acid metabolic process|n-dodecanoic acid metabolism")

Detail | Not Closed | Closed |
---|---|---|
STANDARD RELATIONS | ||
OWL Semantics | Node Data (txt) | Node Data (txt) |
No OWL Semantics | Node Data (txt) | Node Data (txt) |
RELATIONS + INVERSE RELATIONS | ||
OWL Semantics | Node Data (txt) | Node Data (txt) |
No OWL Semantics | Node Data (txt) | Node Data (txt) |

This project is licensed under Apache License 2.0 - see the `LICENSE.md` file for details. If you intend to use any of the information on this Wiki, please provide the appropriate attribution by citing this repository:

@misc{callahan_tj_2019_3401437,
author = {Callahan, TJ},
title = {PheKnowLator},
month = mar,
year = 2019,
doi = {10.5281/zenodo.3401437},
url = {https://doi.org/10.5281/zenodo.3401437}
}