v4 Data Sources

Release v4.0.0 Knowledge Graph Data Sources

Release: v4.0.0

Data Access: https://console.cloud.google.com/storage/browser/pheknowlator/archived_builds/release_v4.0.0

Dependencies:

Data_Preparation.ipynb documents the creation of all generated data
Ontology_Cleaning.ipynb documents all ontology cleaning and preprocessing

Other Relevant Output:

Node, relation, and edge metadata extracted from data sources that are not ontologies: pheknowlator_source_metadata.xlsx

Rationale: The goal of this build is to create a knowledge graph that represents human disease mechanisms.

Ontologies
Data Sources

ONTOLOGIES

Cell Ontology
Cell Line Ontology
Chemical Entities of Biological Interest Ontology
Gene Ontology
Human Phenotype Ontology
Mondo Disease Ontology
Pathway Ontology
Protein Ontology
Relations Ontology
Sequence Ontology
Uber-Anatomy Ontology
Vaccine Ontology

Cell Ontology

Homepage: GitHub
Citation:

Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biology. 2005;6(2):R21

Usage: The Cell Ontology (CL) was utilized to connect transcripts and proteins to cells. Additionally, the edges between this ontology and its dependencies are utilized:

ChEBI
GO
PATO
PRO
RO
UBERON

Cell Line Ontology

Homepage: http://www.clo-ontology.org/
Citation:

Sarntivijai S, Lin Y, Xiang Z, Meehan TF, Diehl AD, Vempati UD, Schürer SC, Pang C, Malone J, Parkinson H, Liu Y. CLO: the cell line ontology. Journal of Biomedical Semantics. 2014;5(1):37

Usage: The Cell Line Ontology (CLO) was utilized to map cell lines to transcripts and proteins. Additionally, the edges between this ontology and its dependencies are utilized:

CL
DOID
NCBITaxon
UBERON

Chemical Entities of Biological Interest Ontology

Homepage: https://www.ebi.ac.uk/chebi/
Citation:

Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research. 2015;44(D1):D1214-9

Usage: The Chemical Entities of Biological Interest (ChEBI) Ontology was utilized to connect chemicals to complexes, diseases, genes, GO biological processes, GO cellular components, GO molecular functions, pathways, phenotypes, reactions, and transcripts.

Gene Ontology

Homepage: http://geneontology.org/
Citations:

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25

The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research. 2018;47(D1):D330-8

Usage: The Gene Ontology (GO) was utilized to connect biological processes, cellular components, and molecular functions to chemicals, pathways, and proteins. Additionally, the edges between this ontology and its dependencies are utilized:

CL
NCBITaxon
RO
UBERON

Other Gene Ontology Data Used: goa_human.gaf.gz

Usage: Utilized to create protein-gobp, protein-gocc, and protein-gomf edges. The original data is filtered such that only records meeting the following criteria were included:
- Protein-GO Biological Process: column[3] not in ["NOT"] and column[8] == "P" and column[11] == "protein" and column[12] == "taxon:9606"
- Protein-GO Cellular Component: column[3] not in ["NOT"] and column[8] == "C" and column[11] == "protein" and column[12] == "taxon:9606"
- Protein-GO Molecular Function: column[3] not in ["NOT"] and column[8] == "F" and column[11] == "protein" and column[12] == "taxon:9606"
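
The three criteria above can be sketched as a single pass over the annotation rows. This is a minimal illustration, not the code from Data_Preparation.ipynb: it assumes the column[...] indices are zero-based positions in a tab-delimited GAF file, it reads the qualifier criterion literally (a row is dropped only when the qualifier is exactly "NOT"), and the function name, helper, and sample rows are hypothetical.

```python
import csv

# Zero-based column positions, matching the column[...] indices above.
QUALIFIER, ASPECT, OBJECT_TYPE, TAXON = 3, 8, 11, 12

# Aspect code -> edge type named in this section.
ASPECT_TO_EDGE = {"P": "protein-gobp", "C": "protein-gocc", "F": "protein-gomf"}


def filter_goa_rows(lines):
    """Yield (edge_type, row) for every row that passes the listed criteria."""
    # Header and comment lines in a GAF file begin with "!".
    data_lines = (line for line in lines if not line.startswith("!"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= TAXON:
            continue  # skip empty or truncated rows
        if row[QUALIFIER] == "NOT":
            continue  # literal reading of: column[3] not in ["NOT"]
        if row[OBJECT_TYPE] != "protein" or row[TAXON] != "taxon:9606":
            continue
        edge_type = ASPECT_TO_EDGE.get(row[ASPECT])
        if edge_type is not None:
            yield edge_type, row


def make_row(qualifier, aspect, object_type, taxon):
    """Build a made-up 15-column row with only the tested fields filled in."""
    fields = [""] * 15
    fields[QUALIFIER], fields[ASPECT] = qualifier, aspect
    fields[OBJECT_TYPE], fields[TAXON] = object_type, taxon
    return "\t".join(fields) + "\n"


# Invented sample rows, used only to exercise the filter.
sample_lines = [
    "!gaf-version: 2.1\n",
    make_row("", "P", "protein", "taxon:9606"),
    make_row("NOT", "F", "protein", "taxon:9606"),
    make_row("", "C", "protein", "taxon:10090"),
]
kept_edge_types = [edge_type for edge_type, _ in filter_goa_rows(sample_lines)]
```

To run the same filter over the downloaded file, the lines could be supplied by `gzip.open("goa_human.gaf.gz", "rt")`, assuming the file is stored locally under that name.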

Human Phenotype Ontology

Homepage: https://hpo.jax.org/
Citation:

Köhler S, Carmody L, Vasilevsky N, Jacobsen JO, Danis D, Gourdine JP, Gargano M, Harris NL, Matentzoglu N, McMurry JA, Osumi-Sutherland D. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Research. 2018;47(D1):D1018-27

Usage: The Human Phenotype Ontology (HP) was utilized to connect phenotypes to chemicals, diseases, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:

CL
ChEBI
GO
UBERON

Files

Other Human Phenotype Ontology Data Used: phenotype.hpoa

Usage: Utilized to create disease-phenotype edges. The original data is filtered such that only records meeting the following criteria were included:
- Qualifier != "NOT"

Mondo Disease Ontology

Homepage: https://mondo.monarchinitiative.org/
Citation:

Mungall CJ, McMurry JA, Köhler S, Balhoff JP, Borromeo C, Brush M, Carbon S, Conlin T, Dunn N, Engelstad M, Foster E. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Research. 2017;45(D1):D712-22

Usage: The Mondo Disease Ontology (Mondo) was utilized to connect diseases to chemicals, phenotypes, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:

CL
NCBITaxon
GO
HPO
UBERON

Pathway Ontology

Homepage: rgd.mcw.edu
Citation:

Petri V, Jayaraman P, Tutaj M, Hayman GT, Smith JR, De Pons J, Laulederkind SJ, Lowry TF, Nigam R, Wang SJ, Shimoyama M. The pathway ontology–updates and applications. Journal of Biomedical Semantics. 2014;5(1):7.

Usage: The Pathway Ontology (PW) was utilized to connect pathways to GO biological processes, GO cellular components, GO molecular functions, and Reactome pathways. Several steps are taken in order to connect Pathway Ontology identifiers to Reactome pathways and GO biological processes. To connect Pathway Ontology identifiers to Reactome pathways, we use ComPath Pathway Database Mappings developed by Daniel Domingo-Fernández (PMID:30564458).

Files

Downloaded Mapping Data
- curated_mappings.txt
- kegg_reactome.csv
Generated Mapping Data
- REACTOME_PW_GO_MAPPINGS.txt

Protein Ontology

Homepage: https://proconsortium.org/
Citation:

Natale DA, Arighi CN, Barker WC, Blake JA, Bult CJ, Caudy M, Drabkin HJ, D’Eustachio P, Evsikov AV, Huang H, Nchoutmboube J. The Protein Ontology: a structured representation of protein forms and complexes. Nucleic Acids Research. 2010;39(suppl_1):D539-45

Usage: The Protein Ontology (PR) was utilized to connect proteins to chemicals, genes, anatomy, catalysts, cell lines, cofactors, complexes, GO biological processes, GO cellular components, GO molecular functions, pathways, proteins, reactions, and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:

ChEBI
DOID
GO

Notes: A partial, human-only version of this ontology was used. Details on how this version of the ontology was generated can be found under the Protein Ontology section of the Data_Preparation.ipynb Jupyter Notebook.

Files

Generated Human Version Protein Ontology (PRO)
- human_pro.owl (closed with hermit reasoner)
Other PRO Data Used: promapping.txt
Generated Mapping Data
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt
- Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
- UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt
- STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt

Relations Ontology

Homepage: GitHub
Citation:

Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C. Relations in biomedical ontologies. Genome Biology. 2005;6(5):R46.

Usage: The Relations Ontology (RO) was utilized to connect all data sources in the knowledge graph. Additionally, the ontology is queried prior to building the knowledge graph to identify all relations, their inverse properties, and their labels.

Files

Generated RO Data
- INVERSE_RELATIONS.txt
- RELATIONS_LABELS.txt
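
The kind of lookup described above (relations, their inverses, and their labels) can be sketched over plain (subject, predicate, object) triples. This is only an illustration under stated assumptions: the actual build presumably queries the OWL file with an RDF library, whereas this sketch takes already-extracted triples as Python tuples, and the function name is hypothetical. The two predicate IRIs are the standard owl:inverseOf and rdfs:label IRIs.

```python
OWL_INVERSE_OF = "http://www.w3.org/2002/07/owl#inverseOf"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def collect_relation_metadata(triples):
    """Return (inverses, labels) dictionaries built from an iterable of triples."""
    inverses, labels = {}, {}
    for subject, predicate, obj in triples:
        if predicate == OWL_INVERSE_OF:
            # owl:inverseOf is symmetric, so record both directions.
            inverses[subject] = obj
            inverses.setdefault(obj, subject)
        elif predicate == RDFS_LABEL:
            labels[subject] = obj
    return inverses, labels


# Small hand-written input using the "part of" / "has part" pair.
PART_OF = "http://purl.obolibrary.org/obo/BFO_0000050"
HAS_PART = "http://purl.obolibrary.org/obo/BFO_0000051"
example_triples = [
    (PART_OF, OWL_INVERSE_OF, HAS_PART),
    (PART_OF, RDFS_LABEL, "part of"),
    (HAS_PART, RDFS_LABEL, "has part"),
]
inverse_map, label_map = collect_relation_metadata(example_triples)
```

The two dictionaries correspond loosely to the contents one would expect in INVERSE_RELATIONS.txt and RELATIONS_LABELS.txt; the exact layout of those files is not described here.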

Sequence Ontology

Homepage: GitHub
Citation:

Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology. 2005;6(5):R44

Usage: The Sequence Ontology (SO) was utilized to connect transcripts and other genomic material like genes and variants.

Files

Generated Mapping Data
- genomic_sequence_ontology_mappings.xlsx
- SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt

Uber-Anatomy Ontology

Homepage: GitHub
Citation:

Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biology. 2012;13(1):R5

Usage: The Uber-Anatomy Ontology (UBERON) was utilized to connect tissues, fluids, and cells to proteins and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:

ChEBI
CL
GO
PRO

Vaccine Ontology

Homepage: http://www.violinet.org/vaccineontology/
Citations:

He Y, Racz R, Sayers S, Lin Y, Todd T, Hur J, Li X, Patel M, Zhao B, Chung M, Ostrow J. Updates on the web-based VIOLIN vaccine database and analysis system. Nucleic Acids Research. 2013;42(D1):D1124-32

Xiang Z, Todd T, Ku KP, Kovacic BL, Larson CB, Chen F, Hodges AP, Tian Y, Olenzek EA, Zhao B, Colby LA. VIOLIN: vaccine investigation and online information network. Nucleic Acids Research. 2007;36(suppl_1):D923-8

Usage: The Vaccine Ontology (VO) was utilized for the edges between it and its dependencies:

ChEBI
DOID
GO
PRO
UBERON

DATA SOURCES

BioPortal
ClinVar
Comparative Toxicogenomics Database
DisGeNET
Ensembl
GeneMANIA
Genotype-Tissue Expression Project
Human Genome Organisation Gene Nomenclature Committee
Human Protein Atlas
National Center for Biotechnology Information Gene
National Center for Biotechnology Information MedGen
Reactome Pathway Database
Search Tool for Recurring Instances of Neighbouring Genes Database
Universal Protein Resource Knowledgebase

BioPortal

Homepage: BioPortal
Citation:

BioPortal. Lexical OWL Ontology Matcher (LOOM)

Ghazvinian A, Noy NF, Musen MA. Creating mappings for ontologies in biomedicine: simple methods work. In AMIA Annual Symposium Proceedings 2009 (Vol. 2009, p. 198). American Medical Informatics Association

Usage: BioPortal was utilized to obtain mappings between MeSH identifiers and ChEBI identifiers for chemicals-diseases, chemicals-genes, chemicals-GO biological processes, chemicals-GO cellular components, chemicals-GO molecular functions, chemicals-phenotypes, chemicals-proteins, and chemicals-transcripts. Additional information on how this data was processed can be obtained from the NCBO_rest_api.py GitHub Gist script.

⭐ ALTERNATIVE METHOD⭐ Since the above approach can take over two days to process, we have developed an alternative solution which downloads the mesh2021.nt data file directly from MeSH and the Flat_file_tab_delimited/names.tsv.gz file directly from ChEBI. Using these files, we have recapitulated the LOOM algorithm implemented by BioPortal when creating mappings between these resources. The procedure is relatively straightforward and utilizes the following information from each resource:

For all MeSH SCR Chemicals, obtain the following information:
- Identifiers: MeSH identifiers
- Labels: string labels using the RDFS:label object property
- Synonyms: track down all synonyms using the vocab:concept and vocab:preferredConcept object properties
For all ChEBI classes, obtain the following information:
- Labels: string labels using the RDFS:label object property
- Synonyms: track down all synonyms using all synonym object properties
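
The label and synonym comparison described above can be sketched as a normalized exact-string match. This is an approximation for illustration only, not the LOOM implementation or the project's code: the normalization rule (lowercase, drop every non-alphanumeric character) is an assumption, and the function names and identifiers in the example are hypothetical.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and drop every non-alphanumeric character (an assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_index(terms):
    """Map each normalized label or synonym to the identifiers that use it.

    `terms` maps an identifier to a list of its label and synonym strings.
    """
    index = defaultdict(set)
    for identifier, names in terms.items():
        for name in names:
            key = normalize(name)
            if key:  # ignore strings that normalize to nothing
                index[key].add(identifier)
    return index


def match_terms(mesh_terms, chebi_terms):
    """Return sorted (mesh_id, chebi_id) pairs that share a normalized string."""
    chebi_index = build_index(chebi_terms)
    pairs = set()
    for mesh_id, names in mesh_terms.items():
        for name in names:
            for chebi_id in chebi_index.get(normalize(name), ()):
                pairs.add((mesh_id, chebi_id))
    return sorted(pairs)


# Placeholder identifiers; the strings only demonstrate the normalization.
mesh_terms = {
    "MESH:EXAMPLE1": ["Acetyl-salicylic acid", "Aspirin"],
    "MESH:EXAMPLE2": ["Compound Q"],
}
chebi_terms = {
    "CHEBI:EXAMPLE1": ["acetylsalicylic acid"],
    "CHEBI:EXAMPLE2": ["unrelated substance"],
}
mesh_chebi_pairs = match_terms(mesh_terms, chebi_terms)
```

Indexing one side first keeps the comparison close to linear in the number of strings, which matters when both vocabularies hold hundreds of thousands of labels and synonyms.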

Files

Generated Data: MESH_CHEBI_MAP.txt

ClinVar

Homepage: https://www.ncbi.nlm.nih.gov/clinvar/
Citation:

Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, Karapetyan K. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2017;46(D1):D1062-7

Usage: ClinVar was utilized to create variant-gene, variant-disease, and variant-phenotype edges. The original data files (listed under the Downloaded Data heading below) are combined and filtered to create the most robust file of variants. Detailed explanations of the steps performed can be found in the Clinvar Variant-Diseases and Phenotypes section of the Data_Preparation.ipynb notebook. The original data are not filtered prior to constructing edges, but it is advised to apply filters from the available metadata prior to performing any downstream analyses.

Files

Downloaded Data
Generated Edge Data: CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt

Comparative Toxicogenomics Database

Homepage: http://ctdbase.org/
Citations:

Curated [chemical–gene interactions|chemical-go interactions|chemical–disease interactions|gene–pathway interactions] data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. World Wide Web (URL: http://ctdbase.org/)

Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ. The comparative toxicogenomics database: update 2019. Nucleic Acids Research. 2018;47(D1):D948-54

Usage: The Comparative Toxicogenomics Database (CTD) was utilized to create chemical-disease, chemical-gene, chemical-GO biological process, chemical-GO cellular components, chemical-GO molecular functions, chemical-phenotype, chemical-protein, chemical-rna, and gene-pathway edges. The original data is filtered such that only records meeting the following criteria were included:

Chemical-Disease/Phenotype Relations
- chemical-disease: PubMedIDs != ""
- chemical-phenotype: PubMedIDs != ""
Chemical-Gene Relations
- chemical-gene: Organism == "Homo sapiens", GeneForms == "gene", and PubMedIDs != ""
- chemical-protein: Organism == "Homo sapiens", GeneForms == "protein", and PubMedIDs != ""
- chemical-rna: Organism == "Homo sapiens", GeneForms == "mRNA", and affects and PubMedIDs != ""
Chemical-GO Relations
- chemical-GO biological process: PhenotypeName == "Biological Process"
- chemical-GO cellular components: PhenotypeName == "Cellular Component"
- chemical-GO molecular functions: PhenotypeName == "Molecular Function"
Gene-Pathway Relations
- gene-pathway: column[5] == "Homo sapiens"
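
As an illustration of the Chemical-Gene criteria above, the sketch below classifies one record at a time. It assumes each record has already been parsed into a dictionary keyed by the column names used in the criteria (Organism, GeneForms, PubMedIDs), and it compares GeneForms by exact equality as the criteria are written. The additional "affects" condition on chemical-rna is not specified in enough detail above and is therefore not implemented. The function name and sample records are hypothetical.

```python
# GeneForms value -> edge type, following the criteria listed above.
GENE_FORM_TO_EDGE = {
    "gene": "chemical-gene",
    "protein": "chemical-protein",
    "mRNA": "chemical-rna",
}


def classify_chem_gene_record(record):
    """Return the edge type for one record, or None if it is filtered out."""
    if record.get("Organism") != "Homo sapiens":
        return None
    if not record.get("PubMedIDs"):
        return None  # PubMedIDs != "" is required for every edge type
    return GENE_FORM_TO_EDGE.get(record.get("GeneForms"))


# Invented records, used only to exercise the classifier.
example_records = [
    {"Organism": "Homo sapiens", "GeneForms": "protein", "PubMedIDs": "EXAMPLE_ID"},
    {"Organism": "Homo sapiens", "GeneForms": "gene", "PubMedIDs": ""},
    {"Organism": "Mus musculus", "GeneForms": "mRNA", "PubMedIDs": "EXAMPLE_ID"},
]
example_edge_types = [classify_chem_gene_record(r) for r in example_records]
```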

Files

Downloaded Data
- Chemical-Gene Relations: CTD_chem_gene_ixns.tsv.gz
- Chemical-Disease/Phenotype Relations: CTD_chemicals_diseases.tsv.gz
- Chemical-GO Relations: CTD_chem_go_enriched.tsv.gz
- Gene-Pathway Relations: CTD_genes_pathways.tsv.gz

DisGeNET

Homepage: https://www.disgenet.org/
Citation:

Gene-disease association data retrieved from DisGeNET v6.0 (http://www.disgenet.org/), Integrative Biomedical Informatics Group GRIB/IMIM/UPF. [December, 2019].

Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research. 2019.

Usage: DisGeNET was utilized to create gene-disease and gene-phenotype edges. The original data is filtered such that only records meeting the following criteria were included:

gene-disease: diseaseType == "disease"
gene-phenotype: diseaseType == "phenotype"

Additionally, data from this source was used to create mappings between different types of disease and phenotype identifiers, including:

OMIM, ORPHA, UMLS, ICD ➞ DOID
OMIM, ORPHA, UMLS, ICD ➞ HPO

Files

Downloaded Data
- Disease/Phenotype-Gene Relations: curated_gene_disease_associations.tsv.gz
- Disease Identifier Mapping: disease_mappings.tsv.gz
Generated Mapping Data
- Phenotype Identifier Mapping: PHENOTPYE_HPO_MAP.txt
- Disease Identifier Mapping: DISEASE_DOID_MAP.txt

Ensembl

Homepage: https://uswest.ensembl.org/index.html
Citation:

Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Girón CG, Gil L. Ensembl 2018. Nucleic Acids Research. 2017;46(D1):D754-61

Usage: Ensembl data was utilized to create mappings between Ensembl genes, transcripts, and proteins with NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers in the knowledge graph (for additional details on the processing of these data, see Data_Preparation.ipynb):

Ensembl Transcript IDs ➞ PRO IDs
Gene Ensembl IDs ➞ Entrez Gene IDs
Gene Ensembl IDs ➞ PRO IDs
Gene Symbols ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ PRO IDs
Protein Ensembl IDs ➞ UniProt Protein Accession
STRING IDs ➞ PRO IDs
UniProt Protein Accession ➞ Entrez Gene IDs
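
Several of the mappings above share an intermediate identifier type (for example, Gene Ensembl IDs ➞ Entrez Gene IDs and Entrez Gene IDs ➞ PRO IDs). One plausible way to relate such mappings is to chain two one-to-many maps. The sketch below shows that idea only; whether Data_Preparation.ipynb derives any mapping this way is not stated here, and the function name and all identifiers are placeholders.

```python
def compose_mappings(first, second):
    """Chain two one-to-many maps: A -> B and B -> C become A -> C.

    Both arguments map an identifier to a set of identifiers. Sources with
    no reachable target are left out of the result.
    """
    composed = {}
    for source, intermediates in first.items():
        targets = set()
        for intermediate in intermediates:
            targets.update(second.get(intermediate, ()))
        if targets:
            composed[source] = targets
    return composed


# Placeholder identifiers, not real Ensembl, Entrez, or PRO entries.
ensembl_to_entrez = {
    "ENSG_EXAMPLE_1": {"ENTREZ_1"},
    "ENSG_EXAMPLE_2": {"ENTREZ_2", "ENTREZ_3"},
    "ENSG_EXAMPLE_3": {"ENTREZ_UNMAPPED"},
}
entrez_to_pro = {
    "ENTREZ_1": {"PR_EXAMPLE_A"},
    "ENTREZ_2": {"PR_EXAMPLE_B"},
    "ENTREZ_3": {"PR_EXAMPLE_B", "PR_EXAMPLE_C"},
}
ensembl_to_pro = compose_mappings(ensembl_to_entrez, entrez_to_pro)
```

Using sets on both sides keeps one-to-many relationships intact and removes duplicates that arise when two intermediates lead to the same target.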

Files

Downloaded Data
Generated Mapping Data
- Cleaned Ensembl Gene Set: ensembl_identifier_data_cleaned.txt
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt
- Gene Symbol-Ensembl Transcript Identifier Mapping: GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt
- Entrez Gene-Ensembl Transcript Identifier Mapping: ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt
- Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
- Ensembl Gene-Entrez Gene Identifier Mapping: ENSEMBL_GENE_ENTREZ_GENE_MAP.txt

GeneMANIA

Homepage: https://genemania.org/
Citation:

Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research. 2010;38(suppl_2):W214-20

Usage: GeneMANIA was utilized to create gene-gene edges.

Files

Downloaded Data: COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt

Genotype-Tissue Expression Project

Homepage: https://gtexportal.org/home/
Citation:

Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, Foster B. The genotype-tissue expression (GTEx) project. Nature Genetics. 2013;45(6):580

Usage: The Genotype-Tissue Expression (GTEx) Project was utilized to create edges between protein-cell, protein-anatomy, rna-cell and rna-anatomy entities.

We chose to use the RNASeQC file over the RSEM file, as advised by the GTEx website:

The RSEM estimates are based on combining isoform-level estimates, which adds uncertainty to the resulting gene-level values (the isoform-level estimates are highly inaccurate in some cases).

For analyses, we recommend filtering the metadata to only keep ExpressionValues >= 1.0.

Zooma was utilized to automatically annotate the 154 unique tissues and cell types. GTEx provides mappings from tissue types to UBERON and EFO. These provided mappings were verified and extended, such that all samples which referenced a cell type were also mapped to the Cell and the Cell Line ontologies. This resulted in a total of 56 mappings (1.04 mappings/concepts).
The original data is filtered such that only records meeting the following criteria were included:
- Protein-Anatomy/Cell Relations
  - protein-anatomy: column[3] == "Evidence at protein level" and column[4] == "anatomy"
  - protein-cell: column[3] == "Evidence at protein level" and column[4] == "cell line"
- RNA-Anatomy/Cell Relations
  - rna-anatomy: column[3] == "Evidence at protein level" and column[4] == "anatomy"
  - rna-cell: column[3] == "Evidence at protein level" and column[4] == "cell line"

Files

Downloaded Data: GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct
Mapping Results: zooma_tissue_cell_mapping_04JAN2020.xlsx
Generated Data
The final mapping set was combined with terms from the Human Protein Atlas; see the Human Protein Atlas section for more information.
- All HPA tissue and cell type strings: HPA_tissues.txt
- Final Term Mapping: HPA_GTEx_TISSUE_CELL_MAP.txt
- Final RNA, Gene, Protein-Tissues and Cell Types Relations: HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt

Human Genome Organisation Gene Nomenclature Committee

Homepage: https://www.genenames.org/
Citations:

HGNC Database, HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom www.genenames.org

Yates B, Braschi B, Gray K, Seal R, Tweedie S, Bruford E. Genenames.org: the HGNC and VGNC Resources in 2017. Nucleic Acids Research. 2017;45(D1):D619-625

Usage: The Human Genome Organisation (HUGO) data was utilized to obtain mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

Ensembl Transcript IDs ➞ PRO IDs
Gene Ensembl IDs ➞ Entrez Gene IDs
Gene Ensembl IDs ➞ PRO IDs
Gene Symbols ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ PRO IDs
Protein Ensembl IDs ➞ UniProt Protein Accession
STRING IDs ➞ PRO IDs
UniProt Protein Accession ➞ Entrez Gene IDs

Files

Downloaded Data: hgnc_complete_set.txt
Generated Data
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Gene Symbol-Ensembl Transcript Identifier Mapping: GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt

Human Protein Atlas

Homepage: https://www.proteinatlas.org/
Citation:

Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A, Olsson I. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419

Usage: The Human Protein Atlas (HPA) was utilized to create rna-cell, rna-anatomy, protein-cell, and protein-anatomy edges.

Evidence between gene and RNA expression in specific tissue types was derived by HPA. For analyses, we recommend filtering the metadata to only keep ExpressionValues >= 1.0.
Zooma was utilized to automatically annotate the 153 unique tissues and cell types from the Human Protein Atlas for all human protein-coding genes in the Human Proteome to the Cell Ontology, Cell Line Ontology, and the Uber-Anatomy Ontology. To best represent each concept, the automatic mappings from Zooma were extended through manual mapping efforts to ensure each cell type concept was matched to a Cell Ontology, Cell Line Ontology, and UBERON term. This resulted in a total of 281 mappings (1.84 mappings/concepts).
The original data is filtered such that only records meeting the following criteria were included:
- Protein-Anatomy/Cell Relations
  - protein-anatomy: column[3] == "Evidence at protein level" and column[4] == "anatomy"
  - protein-cell: column[3] == "Evidence at protein level" and column[4] == "cell line"
- RNA-Anatomy/Cell Relations
  - rna-anatomy: column[3] == "Evidence at protein level" and column[4] == "anatomy"
  - rna-cell: column[3] == "Evidence at protein level" and column[4] == "cell line"

Files

Downloaded Data: proteinatlas_search.tsv
Mapping Results: zooma_tissue_cell_mapping_04JAN2020.xlsx
Generated Data
- Final Term Mapping: HPA_GTEx_TISSUE_CELL_MAP.txt
- Final RNA, Gene, Protein-Tissues and Cell Types Relations: HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt

National Center for Biotechnology Information Gene

Homepage: https://www.ncbi.nlm.nih.gov/gene/
Citation:

Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research. 2005;33(suppl_1):D54-8.

Usage: The National Center for Biotechnology Information (NCBI) Gene data was utilized to obtain mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

Ensembl Transcript IDs ➞ PRO IDs
Gene Ensembl IDs ➞ Entrez Gene IDs
Gene Ensembl IDs ➞ PRO IDs
Gene Symbols ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ PRO IDs
Protein Ensembl IDs ➞ UniProt Protein Accession
STRING IDs ➞ PRO IDs
UniProt Protein Accession ➞ Entrez Gene IDs

Files

Downloaded Data: Homo_sapiens.gene_info.gz
Generated Data
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Entrez Gene-Ensembl Transcript Identifier Mapping: ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt
- Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
- Ensembl Gene-Entrez Gene Identifier Mapping: ENSEMBL_GENE_ENTREZ_GENE_MAP.txt
- Uniprot Accession-Entrez Gene Identifier Mapping: UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt

National Center for Biotechnology Information MedGen

Homepage: https://www.ncbi.nlm.nih.gov/medgen/
Citation:

Louden DN. Medgen: NCBI’s Portal to Information on Medical Conditions with a Genetic Component. Medical Reference Services Quarterly. 2020 Apr 2;39(2):183-91.

Usage: The National Center for Biotechnology Information MedGen was utilized to obtain mappings between MedGen identifiers and Human Phenotype Ontology identifiers and Mondo Disease Ontology identifiers. This file is combined with data from the PHENOTPYE_HPO_MAP.txt and DISEASE_DOID_MAP.txt files. For additional details on the processing of these data, see Data_Preparation.ipynb:

MedGen IDs ➞ HP/MONDO IDs
UMLS CUI IDs ➞ HP/MONDO IDs
OMIM IDs ➞ HP/MONDO IDs
Orphanet IDs ➞ HP/MONDO IDs

Files

Downloaded Data: MGCONSO.RRF.gz
Generated Data
- Disease and Phenotype Identifier Map: MedGen_Disease_Phenotype_identifiers.txt

Reactome Pathway Database

Homepage: https://reactome.org/
Citation:

Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, Haw R, Jassal B, Korninger F, May B, Milacic M. The reactome pathway knowledgebase. Nucleic Acids Research. 2017;46(D1):D649-55

Usage: The Reactome Pathway Database was utilized to create chemical-pathway, GO Biological process-pathway, pathway-GO Cellular component, GO Molecular function-pathway, and protein-pathway edges. The original data is filtered such that only records meeting the following criteria were included:

chemical-pathway: column[5] == "Homo sapiens"
GO Biological Process-pathway: column[5] startswith "REACTOME", column[8] == "P", column[12] == "taxon:9606", and column[3] not in ["NOT"]
pathway-GO Cellular Component: column[5] startswith "REACTOME", column[8] == "C", column[12] == "taxon:9606", and column[3] not in ["NOT"]
GO Molecular Function-Pathway: column[5] startswith "REACTOME", column[8] == "F", column[12] == "taxon:9606", and column[3] not in ["NOT"]
protein-pathway: column[5] == "Homo sapiens"
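
The three Pathway-GO criteria above can be sketched as a per-row classifier. This is a minimal illustration under stated assumptions: the column[...] indices are treated as zero-based positions in a tab-split row of gene_association.reactome, the qualifier criterion is read literally (exactly "NOT"), and the short edge-type names and the function name are hypothetical labels, not names used by the project.

```python
# Zero-based column positions, matching the column[...] indices above.
QUALIFIER, REFERENCE, ASPECT, TAXON = 3, 5, 8, 12

# Aspect code -> hypothetical short name for the edge type.
ASPECT_TO_EDGE = {
    "P": "gobp-pathway",
    "C": "pathway-gocc",
    "F": "gomf-pathway",
}


def classify_reactome_go_row(row):
    """Return the edge type for one tab-split row, or None if it is filtered out."""
    if len(row) <= TAXON:
        return None  # empty or truncated row
    if not row[REFERENCE].startswith("REACTOME"):
        return None
    if row[TAXON] != "taxon:9606" or row[QUALIFIER] == "NOT":
        return None
    return ASPECT_TO_EDGE.get(row[ASPECT])


def make_row(qualifier, reference, aspect, taxon):
    """Build a made-up 15-column row with only the tested fields filled in."""
    fields = [""] * 15
    fields[QUALIFIER], fields[REFERENCE] = qualifier, reference
    fields[ASPECT], fields[TAXON] = aspect, taxon
    return fields


# Invented rows, used only to exercise the classifier.
example_rows = [
    make_row("", "REACTOME:EXAMPLE", "C", "taxon:9606"),
    make_row("NOT", "REACTOME:EXAMPLE", "P", "taxon:9606"),
    make_row("", "PMID:EXAMPLE", "F", "taxon:9606"),
]
example_edge_types = [classify_reactome_go_row(row) for row in example_rows]
```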

Files

Downloaded Data
- Chemical-Pathway Relations: ChEBI2Reactome_All_Levels.txt
- Pathway-GO Relations: gene_association.reactome
- Protein-Pathway Relations: UniProt2Reactome_All_Levels.txt

Search Tool for Recurring Instances of Neighbouring Genes Database

Homepage: string-db.org
Citation:

Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research. 2018;47(D1):D607-13

Usage: The Search Tool for Recurring Instances of Neighbouring Genes (STRING) Database was utilized to create protein-protein edges. For analyses, we recommend filtering these edges such that only records with a combined_score >= "700" (>90th percentile) are included.
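
The recommended threshold can be applied with a short filter such as the one below. It is a sketch, assuming the links file is whitespace-delimited with a header row that contains a combined_score column and that the first two columns hold the two protein identifiers; the function name and the sample lines are hypothetical.

```python
def filter_string_links(lines, min_score=700):
    """Yield (protein_a, protein_b, score) for links scoring at least min_score."""
    rows = iter(lines)
    header_line = next(rows, None)
    if header_line is None:
        return  # empty input
    header = header_line.split()
    score_index = header.index("combined_score")
    for line in rows:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        score = int(fields[score_index])
        if score >= min_score:
            yield fields[0], fields[1], score


# Invented lines with placeholder protein identifiers.
sample_lines = [
    "protein1 protein2 combined_score\n",
    "9606.EXAMPLE_A 9606.EXAMPLE_B 912\n",
    "9606.EXAMPLE_A 9606.EXAMPLE_C 455\n",
    "9606.EXAMPLE_B 9606.EXAMPLE_C 700\n",
]
high_confidence_links = list(filter_string_links(sample_lines))
```

For the downloaded file, the lines could be read with `gzip.open("9606.protein.links.v11.0.txt.gz", "rt")`, assuming the file is stored locally under that name.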

Files

Downloaded Data: 9606.protein.links.v11.0.txt.gz
Generated Data: STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt

Universal Protein Resource Knowledgebase

Homepage: https://www.uniprot.org/
Citation:

UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research. 2018;47(D1):D506-15

Usage: The Universal Protein Resource (UniProt) Knowledgebase was utilized to obtain cofactor/catalyst-protein and protein-coding gene-protein edges as well as mappings between NCBI Gene identifiers, HUGO gene symbols, Universal Protein Resource (UniProt) Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

Ensembl Transcript IDs ➞ PRO IDs
Gene Ensembl IDs ➞ Entrez Gene IDs
Gene Ensembl IDs ➞ PRO IDs
Gene Symbols ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ PRO IDs
Protein Ensembl IDs ➞ UniProt Protein Accession
STRING IDs ➞ PRO IDs
UniProt Protein Accession ➞ Entrez Gene IDs

Files

Downloaded Data
- Cofactor and Catalyst relations: Cofactor/Catalyst Query Results
- UniProt Identifier Mapping: UniProt Identifier Query Results
Generated Data
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Protein-Cofactor Relations: UNIPROT_PROTEIN_COFACTOR.txt
- Protein-Catalyst Relations: UNIPROT_PROTEIN_CATALYST.txt
- UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt
- UniProt Accession-Entrez Gene Identifier Mapping: UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt

This project is licensed under Apache License 2.0 - see the LICENSE.md file for details. If you intend to use any of the information on this Wiki, please provide the appropriate attribution by citing this repository:

@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
