This guide will teach you the complete process of building the Alzheimer’s Knowledge Base (AlzKB). It’s not a quick process, but it generalizes to other knowledge engineering applications. We use the same process for our other knowledge bases (such as ComptoxAI), so this guide can also teach you how to build your own.
The following diagram gives an overview of the build process:
- First, you use domain knowledge to create the ontology
- Then, you collect the data sources and use them to populate the ontology
- Finally, you convert the ontology into a graph database
Important note: Most users don’t need to follow these steps, since they have already been done! Unless you want to extend AlzKB or make major modifications to its node/edge types, you should skip to the next section. If you DO want to do those things, then keep reading.
AlzKB uses an OWL 2 ontology to act something like a ‘template’ for the nodes and relationships in the final knowledge graph. While the actual nodes and relationships are added automatically according to the ‘rules’ defined in the ontology, the ontology itself is constructed manually, using domain knowledge about AD. We do this using the Protégé ontology editor. If you don’t already have it, download and install Protégé Desktop on your computer.
The next step is to collect the source data files that will eventually become the nodes, relationships, and properties in AlzKB’s knowledge graph. Since these databases are distributed in a variety of formats and modalities, you will have to work with a mix of plain-text “flat” files and relational (SQL) databases. All of the SQL databases parsed to build AlzKB are distributed for MySQL (as opposed to some other flavor of SQL).
Source | Directory name | Entity type(s) | URL | Extra instructions |
---|---|---|---|---|
Hetionet | hetionet | Many - see populate_ontology.py | GitHub | Hetionet |
NCBI Gene | ncbigene | Genes | Homo_sapiens.gene_info.gz | NCBI Gene |
Drugbank | drugbank | Drugs / drug candidates | DrugBank website | Drugbank |
DisGeNET | disgenet | Diseases and disease-gene edges | DisGeNET | DisGeNET |
Download the hetionet-v1.0-edges.sif.gz (extract it using gunzip) and hetionet-v1.0-nodes.tsv files from the Hetionet GitHub repository. Both of them are, essentially, TSV files, even though one has the .sif extension.
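If you prefer not to install gunzip, Python's standard-library gzip module can do the decompression instead. This is a minimal sketch, assuming the archive has already been downloaded into the hetionet/ subdirectory of your data directory (adjust the paths to your own layout):

import gzip
import shutil

# Assumed locations -- change these to match where you keep the raw data.
src = r"D:\data\hetionet\hetionet-v1.0-edges.sif.gz"
dst = r"D:\data\hetionet\hetionet-v1.0-edges.sif"

with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)  # stream-copy so large files are not read into memory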
Hetionet is, itself, a knowledge base, and contains many of the core biological entities used in AlzKB. Accordingly, it contains data derived from many other third-party sources.
Download the Homo_sapiens.gene_info.gz file from the NCBI FTP page and extract it (e.g., using gunzip).
Create a CUSTOM subdirectory inside the ncbigene directory. Inside that subdirectory, place the following two files:
- alzkb_parse_ncbigene.py
- Homo_sapiens_expr_advanced.tsv (from the Bgee database)
Then, run alzkb_parse_ncbigene.py (no external Python packages should be needed). You’ll notice that it creates two output files that are used while populating the ontology.
In order to download the Academic DrugBank datasets, you first need to create a free DrugBank account and verify your email address. After verification, DrugBank may request additional information about your account, such as how you plan to use DrugBank, a description of your organization, who is sponsoring the research, and what its end goal is. In our experience, account approval can take anywhere from several business days to a few weeks.
After your access has been approved, navigate to the Academic Download page on the DrugBank website (linked above) by selecting the “Download” tab and then “Academic Download”. Select the “External Links” tab. In the table titled “External Drug Links”, click the “Download” button on the row labeled “All”. This will download a zip file. Extract the contents of that zip file, and make sure the extracted file is named drug_links.csv (some versions use a space instead of an underscore in the filename).
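If your copy did come with the space-separated name, a quick way to normalize it is shown below; this is just a convenience sketch, and the drugbank directory path is an assumption you should adjust:

import os

drugbank_dir = r"D:\data\drugbank"                          # assumed data subdirectory
old_name = os.path.join(drugbank_dir, "drug links.csv")     # name used by some DrugBank releases
new_name = os.path.join(drugbank_dir, "drug_links.csv")     # name expected later in the build

# Rename only if the space-separated variant is present.
if os.path.exists(old_name):
    os.rename(old_name, new_name)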
Although DisGeNET is available under a Creative Commons license, the database requires users to create a free account to download the tab-delimited data files. Therefore, you should create a user account and log in. Then, navigate to the Downloads page on the DisGeNET website. Now, download the two necessary files by clicking on the corresponding links:
- “UMLS CUI to several disease vocabularies” (under the “UMLS CUI to several disease vocabularies” section heading; the resulting file will be named disease_mappings.tsv.gz)
- “UMLS CUI to top disease classes” (the resulting file will be named disease_mappings_to_attributes.tar.gz)
Next, download curated_gene_disease_associations.tsv.gz directly by copying the following URL into your web browser:
https://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_associations.tsv.gz
All three files are gzipped, so extract them into the disgenet/
directory using your favorite method (e.g., gunzip from the command
line, 7zip from within Windows, etc.).
Now that you have the three necessary data files, you should run the AlzKB script we wrote to filter those files for rows corresponding to Alzheimer’s disease, named alzkb_parse_disgenet.py. This script is in the scripts/ directory of the AlzKB repository, so either find it on your local filesystem if you already have a copy of the repository, or find it on the AlzKB GitHub repository in your web browser.
You can then run the Python script from within the disgenet/ directory, which should deposit two filtered data files in the disgenet/CUSTOM/ subdirectory. These will be automatically detected and used when you run the ontology population script, along with the unmodified curated_gene_disease_associations.tsv file.
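If you are curious what this filtering step involves, the idea is simply to keep rows whose disease name mentions Alzheimer’s. The snippet below is an illustration only, not the actual script; the input column name and exact output filename are assumptions based on the mapping configuration shown later in this guide:

import pandas as pd

# Illustrative only -- the real logic lives in alzkb_parse_disgenet.py.
df = pd.read_csv("disease_mappings_to_attributes.tsv", sep="\t")

# Keep rows whose disease name mentions Alzheimer (the "name" column is an assumption).
alz = df[df["name"].str.contains("Alzheimer", case=False, na=False)]

alz.to_csv("CUSTOM/disease_mappings_to_attributes_alzheimer.tsv", sep="\t", index=False)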
Next, create a directory that will hold all of the raw data files. It can be D:\data or any other location you prefer. Within it, create one folder for each third-party database, and place each database's individual CSV/TSV/TXT files in its folder, as sketched below.
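For example, a short script can set up the expected layout; the subdirectory names below match the parser registrations shown later in the build script, and the base path is just an example:

from pathlib import Path

data_dir = Path(r"D:\data")  # or wherever you chose to keep the raw data

# One subdirectory per third-party source (names match the build script's parsers).
for source in ["hetionet", "ncbigene", "drugbank", "disgenet", "epa", "aopwiki", "tox21"]:
    (data_dir / source).mkdir(parents=True, exist_ok=True)

# CUSTOM subdirectories hold files generated by the AlzKB parsing scripts.
for source in ["ncbigene", "disgenet"]:
    (data_dir / source / "CUSTOM").mkdir(exist_ok=True)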
If you don’t already have MySQL installed, install it. We recommend using either a package manager (if one is available on your OS), or installing MySQL Community Server from the mysql.com website (e.g., by visiting https://dev.mysql.com/downloads/mysql/). Make sure it’s running and you have the ability to create and modify new databases.
The Adverse Outcome Pathway Database (AOP-DB) is the only MySQL database you need to install to build the current version of AlzKB. It can be downloaded at: https://gaftp.epa.gov/EPADataCommons/ORD/AOP-DB/
WARNING: This is a big download (7.2 GB compressed)! Make sure you have enough disk space before proceeding.
You’ll have to extract two archives. First, unzip the AOP-DB_v2.zip archive, which should contain two *.tar.gz archives and another .zip archive. Now, extract the *.tar.gz archive containing nogi in its name (the smaller of the two). Windows doesn’t natively support extracting .tar.gz archives, so you’ll either have to download another program that does this (e.g., 7-zip) or extract it in a Unix-based environment (Linux, macOS, Windows Subsystem for Linux, Cygwin, etc.) that has the tar program available on the command line. Once you’ve extracted it, you should have a file named something like aopdb_no-orthoscores.sql.
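A third option is Python's standard-library tarfile module, which handles .tar.gz archives on any platform. This is a sketch; the archive filename below is a placeholder, so substitute the actual archive whose name contains "nogi":

import tarfile

archive = "AOP-DB_nogi.tar.gz"  # placeholder name -- use the real archive containing "nogi"

# Extract into the current directory; should produce a file like aopdb_no-orthoscores.sql.
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall()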
Now, create an empty database in MySQL and name it aopdb. Make sure you have full admin privileges on the database. Then, load the (newly extracted) .sql file into the empty database. I always find this easiest from the command line, by running a command such as:
$ mysql -u username -p database_name < aopdb_no-orthoscores.sql
Substitute your username after the -u option and enter your password when prompted. If you prefer to import it from a GUI, you can use a tool like MySQL Workbench or DataGrip.
WARNING: It can take a while to import, so be ready to take a break or do something else while you wait.
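Once the import finishes, you can run a quick sanity check from Python using the mysqlclient package (which you will install for ista anyway). The credentials below are placeholders:

import MySQLdb  # provided by the mysqlclient package

# Placeholder credentials -- substitute your own.
conn = MySQLdb.connect(host="localhost", user="username", passwd="password", db="aopdb")
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())  # should include tables such as chemical_info
conn.close()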
Now that we have an ontology (currently ‘unpopulated’, consisting of a class hierarchy, object property types, data property types, and possibly annotations), we can populate it with records from the third-party databases we collected in the previous step. Fortunately, this is a largely automated process, facilitated by a tool we call ista (ista is the Sindarin word for knowledge). With ista, you write a Python script that first tells ista where to find the third-party data sources, and then maps each of those data sources to one or more node or edge types defined in the ontology (as classes or object properties, respectively). Here, we’ll walk through the different parts of AlzKB’s ista build script and discuss what each component does. If you are reading this guide to modify or extend AlzKB, you should be able to use the information in the following few sections to write your own build script.
For reference, an up-to-date, complete copy of this build file can be found in the AlzKB source repository at alzkb/populate_ontology.py.
- Keep MySQL Server running.
- Install mysqlclient (e.g., via Anaconda Navigator).
- Clone the ista repository onto your computer (git clone https://github.com/RomanoLab/ista), then cd ista and pip install .
At the top of the file, we import the necessary Python packages. First comes ista. We don’t import the whole package, just the classes and functions that we actually interact with.
from ista import FlatFileDatabaseParser, MySQLDatabaseParser
from ista.util import print_onto_stats
In order to interact with OWL 2 ontology files, we bring in the owlready2 library.
import owlready2
We put private data for our local MySQL databases (hostname, username, and password) in a file named secrets.py, and then make sure that file is added to our .gitignore so it isn’t checked into version control. You’ll have to create that file yourself and define the variables MYSQL_HOSTNAME, MYSQL_USERNAME, and MYSQL_PASSWORD. Then, in the build script, you’ll import the file containing those variables and wrap them into a configuration dict.
import secrets
mysql_config = {
'host': secrets.MYSQL_HOSTNAME,
'user': secrets.MYSQL_USERNAME,
'passwd': secrets.MYSQL_PASSWORD
}
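For reference, secrets.py only needs to define those three variables; the values below are placeholders:

# secrets.py -- keep this file out of version control (add it to .gitignore)
MYSQL_HOSTNAME = "localhost"
MYSQL_USERNAME = "your_mysql_username"
MYSQL_PASSWORD = "your_mysql_password"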
Since we are populating an ontology, we need to load the ontology into owlready2. Make sure to modify this path to fit the location of the AlzKB ontology file on your system! Future versions of AlzKB will source the path dynamically. Also note the file:// prefix, which tells owlready2 to look on the local file system rather than load a web URL. Since this guide was made on a Windows desktop, you’ll notice that we have to use escaped backslashes to specify file paths that the Python interpreter will parse correctly.
onto = owlready2.get_ontology("file://D:\\projects\\ista\\tests\\projects\\alzkb\\alzkb.rdf").load()
We also set the ‘base’ directory for all of the flat files that ista will be loading. You will have determined this location already (see Obtaining the third-party data sources).
data_dir = "D:\\data\\"
Now, we can actually register the source databases with ista’s parser classes. We use FlatFileDatabaseParser for data sources stored as one or more delimited flat files, and MySQLDatabaseParser for data sources in a MySQL database. For flat file-based sources, the first argument given to the parser’s constructor MUST be the subdirectory (within data_dir) where that source’s data files are contained, and for MySQL sources it MUST be the name of the MySQL database; otherwise, ista won’t know where to find the files. The second argument is always the ontology object loaded using owlready2, and the third is either the base data directory or the MySQL config dictionary, both of which were defined above.
epa = FlatFileDatabaseParser("epa", onto, data_dir)
ncbigene = FlatFileDatabaseParser("ncbigene", onto, data_dir)
drugbank = FlatFileDatabaseParser("drugbank", onto, data_dir)
hetionet = FlatFileDatabaseParser("hetionet", onto, data_dir)
aopdb = MySQLDatabaseParser("aopdb", onto, mysql_config)
aopwiki = FlatFileDatabaseParser("aopwiki", onto, data_dir)
tox21 = FlatFileDatabaseParser("tox21", onto, data_dir)
disgenet = FlatFileDatabaseParser("disgenet", onto, data_dir)
In the following two sections, we’ll go over a few examples of how to define mappings using these parser objects. We won’t replicate every mapping in this guide for brevity, but you can see all of them in the full AlzKB build script.
hetionet.parse_node_type(
node_type="Symptom",
source_filename="hetionet-v1.0-nodes.tsv",
fmt="tsv",
parse_config={
"iri_column_name": "name",
"headers": True,
"filter_column": "kind",
"filter_value": "Symptom",
"data_transforms": {
"id": lambda x: x.split("::")[-1]
},
"data_property_map": {
"id": onto.xrefMeSH,
"name": onto.commonName
}
},
merge=False,
skip=False
)
This block indicates that the third-party database is Hetionet and the source file is hetionet-v1.0-nodes.tsv, so the file it will look for is D:\data\hetionet\hetionet-v1.0-nodes.tsv.
Some of the configuration blocks will have a CUSTOM\ prefix to the filename. This means that the file was created by us manually and will need to be stored in a CUSTOM subdirectory of the database folder. For example:
disgenet.parse_node_type(
node_type="Disease",
source_filename="CUSTOM/disease_mappings_to_attributes_alzheimer.tsv", # Filtered for just Alzheimer disease
fmt="tsv-pandas",
parse_config={
"iri_column_name": "diseaseId",
"headers": True,
"data_property_map": {
"diseaseId": onto.xrefUmlsCUI,
"name": onto.commonName,
}
},
merge=False,
skip=False
)
This file will be D:\data\disgenet\CUSTOM\disease_mappings_to_attributes_alzheimer.tsv.
aopdb.parse_node_type(
node_type="Drug",
source_table="chemical_info",
parse_config={
"iri_column_name": "DTX_id",
"data_property_map": {"ChemicalID": onto.xrefMeSH},
"merge_column": {
"source_column_name": "DTX_id",
"data_property": onto.xrefDTXSID
}
},
merge=True,
skip=False
)
This block indicates the third-party database is AOP-DB, and the source table is chemical_info.
Every flat file or SQL table from a third-party data source can be mapped to a single node or relationship type. For example, a file describing diseases can be mapped to the Disease node type, where each line in the file corresponds to a disease to be inserted (or ‘merged’; see below) into the knowledge graph. If the source is being mapped to a node type (rather than a relationship type), ista can additionally populate one or more node properties from the feature columns in the source file.
Each mapping is defined using a method call in the ista Python script.
Now that you have set the locations of the data resources and the ontology and defined the mapping methods, run populate_ontology.py. Its output, alzkb-populated.rdf, will be used in the next step to set up the Neo4j graph database.
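As a quick sanity check before moving on, you can load the populated file back into owlready2 and confirm it now contains individuals. Adjust the path to wherever the file was written on your system:

import owlready2

# Path is an example -- point it at your own copy of alzkb-populated.rdf.
populated = owlready2.get_ontology("file://D:\\data\\alzkb-populated.rdf").load()
print(len(list(populated.individuals())), "individuals in the populated ontology")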
If you haven’t done so already, download Neo4j from the Neo4j Download Center. Most users should select Neo4j Desktop, but advanced users can instead opt for Community Server (the instructions for which are well outside of the scope of this guide).
You should now create a new graph database that will be populated with the contents of AlzKB. In Neo4j Desktop, this can be done as follows:
- Create a new project by clicking the “New” button in the upper left, then selecting “Create project”.
- In the project panel (on the right of the screen), you will see the default name “Project” populate automatically. Hover over this name and click the edit icon, then change the name to AlzKB.
- To the right of the project name, click “Add”, and select “Local DBMS”. Change the Name to AlzKB DBMS, specify a password that you will remember, and use the Version dropdown to select “4.4.0” (if it is not already selected). Click “Create” and wait for the operation to finish.
- Install plugins:
- Click the name of the DBMS (“AlzKB DBMS”, if you have followed the guide), and in the new panel to the right click the “Plugins” tab.
- Expand the “APOC” option, click “Install”, and wait for the operation to complete.
- Do the same for the “Graph Data Science Library” and “Neosemantics (n10s)” plugins.
- Before starting the DBMS, click the ellipsis immediately to the right of the “Open” button, and then click “Settings…”. Make the following changes to the configuration file:
  - Set dbms.memory.heap.initial_size to 2048m.
  - Set dbms.memory.heap.max_size to 4G.
  - Set dbms.memory.pagecache.size to 2048m.
  - Uncomment the line containing dbms.security.procedures.allowlist=apoc.coll.*,apoc.load.*,gds.* to activate it.
  - Add n10s.*,apoc.cypher.*,apoc.help to that allowlist, so it reads dbms.security.procedures.allowlist=apoc.coll.*,apoc.load.*,gds.*,n10s.*,apoc.cypher.*,apoc.help
  - Click the “Apply” button, then “Close”.
- Click “Start” to start the graph database.
- Open Neo4j Browser and run the following Cypher to import the RDF data:
// Clean out existing nodes
MATCH (n) DETACH DELETE n
// Constraint creation
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE
// Create a graph configuration
CALL n10s.graphconfig.init()
CALL n10s.graphconfig.set({applyNeo4jNaming: true, handleVocabUris: 'IGNORE'})
// Import the RDF
CALL n10s.rdf.import.fetch( "file://D:\\data\\alzkb-populated.rdf", "RDF/XML")
- Run the Cypher queries below to clean up the imported nodes:
MATCH (n:Resource) REMOVE n:Resource;
MATCH (n:NamedIndividual) REMOVE n:NamedIndividual;
MATCH (n:AllDisjointClasses) REMOVE n:AllDisjointClasses;
MATCH (n:AllDisjointProperties) REMOVE n:AllDisjointProperties;
MATCH (n:DatatypeProperty) REMOVE n:DatatypeProperty;
MATCH (n:FunctionalProperty) REMOVE n:FunctionalProperty;
MATCH (n:ObjectProperty) REMOVE n:ObjectProperty;
MATCH (n:AnnotationProperty) REMOVE n:AnnotationProperty;
MATCH (n:SymmetricProperty) REMOVE n:SymmetricProperty;
MATCH (n:_GraphConfig) REMOVE n:_GraphConfig;
MATCH (n:Ontology) REMOVE n:Ontology;
MATCH (n:Restriction) REMOVE n:Restriction;
MATCH (n:Class) REMOVE n:Class;
MATCH (n) WHERE size(labels(n)) = 0 DETACH DELETE n; // Removes nodes without labels
Now you have built AlzKB from scratch. You can find the number of nodes and relationships with:
CALL db.labels() YIELD label
CALL apoc.cypher.run('MATCH (:`'+label+'`) RETURN count(*) as count',{}) YIELD value
RETURN label, value.count ORDER BY label
CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as count',{}) YIELD value
RETURN type, value.count ORDER BY type
In version 2.0, we added “TranscriptionFactor” nodes, “TRANSCRIPTIONFACTORINTERACTSWITHGENE” relationships, the node properties “chromosome” (chromosome number) and “sourcedatabase”, and the relationship properties “correlation”, “score”, “p_fisher”, “z_score”, “affinity_nm”, “confidence”, “sourcedatabase”, and “unbiased”.
To achieve this, we added the above entities to the ontology RDF, which is now named alzkb_v2.rdf and located in the alzkb\data directory. Then, collect the additional source data files detailed in the table below.
Source | Directory name | Entity type(s) | URL | Extra instructions |
---|---|---|---|---|
TRRUST | dorothea | Transcription factors (TF) and TF-gene edges | TRRUST Download | TRRUST |
DoRothEA | dorothea | Transcription factors (TF) and TF-gene edges | DoRothEA Installation | DoRothEA RScript |
Download trrust_rawdata.human.tsv from TRRUST Download. Install DoRothEA by following the DoRothEA Installation instructions within R. Place trrust_rawdata.human.tsv and alzkb_parse_dorothea.py inside the Dorothea/ subdirectory, which should be within your raw data directory (e.g., D:\data). Run alzkb_parse_dorothea.py. You’ll notice that it creates a tf.tsv file that is used while populating the ontology.
Since Hetionet has no plan for ongoing updates, we have replicated its contents using the Rephetio paper and source code to ensure AlzKB has current data. Follow the steps in the AlzKB-updates GitHub repository to create hetionet-custom-nodes.tsv and hetionet-custom-edges.tsv. Place these files in the hetionet/ subdirectory.
Place the updated alzkb_parse_ncbigene.py, alzkb_parse_drugbank.py, and alzkb_parse_disgenet.py from the scripts/ directory into their respective raw data subdirectories. Run each script to process the data for the next step.
Now that we have the updated ontology and updated data files, run the updated alzkb/populate_ontology.py to populate the records. It creates an alzkb_v2-populated.rdf file that will be used in the next step.
If you haven’t done so already, download Memgraph from the Install Memgraph page. Most users install Memgraph using a pre-prepared docker-compose.yml file by executing:
- for Linux and macOS:
curl https://install.memgraph.com | sh
- for Windows:
iwr https://windows.memgraph.com | iex
More details can be found in Install Memgraph with Docker.
Before uploading the file to Memgraph, run alzkb/rdf_to_memgraph_csv.py on the alzkb_v2-populated.rdf file to generate alzkb-populated.csv.
Then, if you want to add edge properties to the knowledge graph, run populate_edge_weights.py to create the alzkb_with_edge_properties.csv file.
Follow the instructions in importing-data-into-memgraph (Step 1: Starting Memgraph with Docker) to upload the alzkb-populated.csv or alzkb_with_edge_properties.csv file to the container.
Open Memgraph Lab, which is available at http://localhost:3000. Click Query Execution in the MENU on the left bar. Then, you can type a Cypher query in the Cypher Editor.
- To create indexes, run the following Cypher queries:
CREATE INDEX ON :Drug(nodeID);
CREATE INDEX ON :Gene(nodeID);
CREATE INDEX ON :BiologicalProcess(nodeID);
CREATE INDEX ON :Pathway(nodeID);
CREATE INDEX ON :MolecularFunction(nodeID);
CREATE INDEX ON :CellularComponent(nodeID);
CREATE INDEX ON :Symptom(nodeID);
CREATE INDEX ON :BodyPart(nodeID);
CREATE INDEX ON :DrugClass(nodeID);
CREATE INDEX ON :Disease(nodeID);
CREATE INDEX ON :TranscriptionFactor(nodeID);
- To check the current storage mode, run:
SHOW STORAGE INFO;
- Change the storage mode to analytical before import:
STORAGE MODE IN_MEMORY_ANALYTICAL;
- Drug nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Drug' AND row.commonName <> ''
CREATE (d:Drug {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase,
xrefCasRN: row.xrefCasRN, xrefDrugbank: row.xrefDrugbank});
MATCH (d:Drug)
RETURN count(d);
- Gene nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Gene'
CREATE (g:Gene {nodeID: row._id, commonName: row.commonName, geneSymbol: row.geneSymbol, sourceDatabase: row.sourceDatabase,
typeOfGene: row.typeOfGene, chromosome: row.chromosome, xrefEnsembl: row.xrefEnsembl,
xrefHGNC: row.xrefHGNC, xrefNcbiGene: toInteger(row.xrefNcbiGene), xrefOMIM: row.xrefOMIM});
MATCH (g:Gene)
RETURN count(g);
- BiologicalProcess nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':BiologicalProcess'
CREATE (b:BiologicalProcess {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase,
xrefGeneOntology: row.xrefGeneOntology});
MATCH (b:BiologicalProcess)
RETURN count(b)
- Pathway nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Pathway'
CREATE (p:Pathway {nodeID: row._id, pathwayId: row.pathwayId, pathwayName: row.pathwayName, sourceDatabase: row.sourceDatabase});
MATCH (p:Pathway)
RETURN count(p)
- MolecularFunction nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':MolecularFunction'
CREATE (m:MolecularFunction {nodeID: row._id, commonName: row.commonName, xrefGeneOntology: row.xrefGeneOntology});
MATCH (m:MolecularFunction)
RETURN count(m)
- CellularComponent nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':CellularComponent'
CREATE (c:CellularComponent {nodeID: row._id, commonName: row.commonName, xrefGeneOntology: row.xrefGeneOntology});
MATCH (c:CellularComponent)
RETURN count(c)
- Symptom nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Symptom'
CREATE (s:Symptom {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase, xrefMeSH: row.xrefMeSH});
MATCH (s:Symptom)
RETURN count(s)
- BodyPart nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':BodyPart'
CREATE (b:BodyPart {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase, xrefUberon: row.xrefUberon});
MATCH (b:BodyPart)
RETURN count(b)
- DrugClass nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':DrugClass'
CREATE (d:DrugClass {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase, xrefNciThesaurus: row.xrefNciThesaurus});
MATCH (d:DrugClass)
RETURN count(d)
- Disease nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Disease'
CREATE (d:Disease {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase,
xrefDiseaseOntology: row.xrefDiseaseOntology, xrefUmlsCUI: row.xrefUmlsCUI});
MATCH (d:Disease)
RETURN count(d)
- Transcription Factor nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':TranscriptionFactor'
CREATE (t:TranscriptionFactor {nodeID: row._id, sourceDatabase: row.sourceDatabase, TF: row.TF});
MATCH (t:TranscriptionFactor)
RETURN count(t)
- GENEPARTICIPATESINBIOLOGICALPROCESS relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEPARTICIPATESINBIOLOGICALPROCESS'
MATCH (g:Gene {nodeID: row._start}) MATCH (b:BiologicalProcess {nodeID: row._end})
MERGE (g)-[rel:GENEPARTICIPATESINBIOLOGICALPROCESS]->(b)
RETURN count(rel)
- GENEREGULATESGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEREGULATESGENE'
MATCH (g:Gene {nodeID: row._start}) MATCH (g2:Gene {nodeID: row._end})
MERGE (g)-[rel:GENEREGULATESGENE]->(g2)
RETURN count(rel)
- GENEINPATHWAY relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEINPATHWAY'
MATCH (g:Gene {nodeID: row._start}) MATCH (p:Pathway {nodeID: row._end})
MERGE (g)-[rel:GENEINPATHWAY]->(p)
RETURN count(rel)
- GENEINTERACTSWITHGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEINTERACTSWITHGENE'
MATCH (g:Gene {nodeID: row._start}) MATCH (g2:Gene {nodeID: row._end})
MERGE (g)-[rel:GENEINTERACTSWITHGENE]->(g2)
RETURN count(rel)
- BODYPARTUNDEREXPRESSESGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'BODYPARTUNDEREXPRESSESGENE'
MATCH (b:BodyPart {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end})
MERGE (b)-[rel:BODYPARTUNDEREXPRESSESGENE]->(g)
RETURN count(rel)
- BODYPARTOVEREXPRESSESGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'BODYPARTOVEREXPRESSESGENE'
MATCH (b:BodyPart {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end})
MERGE (b)-[rel:BODYPARTOVEREXPRESSESGENE]->(g)
RETURN count(rel)
- GENEHASMOLECULARFUNCTION relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEHASMOLECULARFUNCTION'
MATCH (g:Gene {nodeID: row._start}) MATCH (m:MolecularFunction {nodeID: row._end})
MERGE (g)-[rel:GENEHASMOLECULARFUNCTION]->(m)
RETURN count(rel)
- GENEASSOCIATEDWITHCELLULARCOMPONENT relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEASSOCIATEDWITHCELLULARCOMPONENT'
MATCH (g:Gene {nodeID: row._start}) MATCH (c:CellularComponent {nodeID: row._end})
MERGE (g)-[rel:GENEASSOCIATEDWITHCELLULARCOMPONENT]->(c)
RETURN count(rel)
- GENECOVARIESWITHGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENECOVARIESWITHGENE'
MATCH (g:Gene {nodeID: row._start}) MATCH (g2:Gene {nodeID: row._end})
MERGE (g)-[rel:GENECOVARIESWITHGENE {sourceDB: row.sourceDB, unbiased: row.unbiased, correlation: row.correlation}]->(g2)
RETURN count(rel)
- CHEMICALDECREASESEXPRESSION relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'CHEMICALDECREASESEXPRESSION'
MATCH (d:Drug {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end})
MERGE (d)-[rel:CHEMICALDECREASESEXPRESSION {sourceDB: row.sourceDB, unbiased: row.unbiased, z_score: row.z_score}]->(g)
RETURN count(rel)
- CHEMICALINCREASESEXPRESSION relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'CHEMICALINCREASESEXPRESSION'
MATCH (d:Drug {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end})
MERGE (d)-[rel:CHEMICALINCREASESEXPRESSION {sourceDB: row.sourceDB, unbiased: row.unbiased, z_score: row.z_score}]->(g)
RETURN count(rel)
- CHEMICALBINDSGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'CHEMICALBINDSGENE'
MATCH (d:Drug {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end})
MERGE (d)-[rel:CHEMICALBINDSGENE {sourceDB: row.sourceDB, unbiased: row.unbiased, affinity_nM: row.affinity_nM}]->(g)
RETURN count(rel)
- DRUGINCLASS relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'DRUGINCLASS'
MATCH (d:Drug {nodeID: row._start}) MATCH (d2:DrugClass {nodeID: row._end})
MERGE (d)-[rel:DRUGINCLASS]->(d2)
RETURN count(rel)
- GENEASSOCIATESWITHDISEASE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEASSOCIATESWITHDISEASE'
MATCH (g:Gene {nodeID: row._start}) MATCH (d:Disease {nodeID: row._end})
MERGE (g)-[rel:GENEASSOCIATESWITHDISEASE {sourceDB: row.sourceDB, score: row.score}]->(d)
RETURN count(rel)
- SYMPTOMMANIFESTATIONOFDISEASE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'SYMPTOMMANIFESTATIONOFDISEASE'
MATCH (s:Symptom {nodeID: row._start}) MATCH (d:Disease {nodeID: row._end})
MERGE (s)-[rel:SYMPTOMMANIFESTATIONOFDISEASE {sourceDB: row.sourceDB, unbiased: row.unbiased, p_fisher: row.p_fisher}]->(d)
RETURN count(rel)
- DISEASELOCALIZESTOANATOMY relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'DISEASELOCALIZESTOANATOMY'
MATCH (d:Disease {nodeID: row._start}) MATCH (b:BodyPart {nodeID: row._end})
MERGE (d)-[rel:DISEASELOCALIZESTOANATOMY {sourceDB: row.sourceDB, unbiased: row.unbiased, p_fisher: row.p_fisher}]->(b)
RETURN count(rel)
- DRUGTREATSDISEASE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'DRUGTREATSDISEASE'
MATCH (d:Drug {nodeID: row._start}) MATCH (d2:Disease {nodeID: row._end})
MERGE (d)-[rel:DRUGTREATSDISEASE]->(d2)
RETURN count(rel)
- DRUGCAUSESEFFECT relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'DRUGCAUSESEFFECT'
MATCH (d:Drug {nodeID: row._start}) MATCH (d2:Disease {nodeID: row._end})
MERGE (d)-[rel:DRUGCAUSESEFFECT]->(d2)
RETURN count(rel)
- TRANSCRIPTIONFACTORINTERACTSWITHGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'TRANSCRIPTIONFACTORINTERACTSWITHGENE'
MATCH (t:TranscriptionFactor {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end})
MERGE (t)-[rel:TRANSCRIPTIONFACTORINTERACTSWITHGENE {sourceDB: row.sourceDB, confidence: row.confidence}]->(g)
RETURN count(rel)
After importing the data, follow these steps to switch back to the transactional storage mode:
- Switch to Transactional Storage Mode:
STORAGE MODE IN_MEMORY_TRANSACTIONAL;
- Verify the Storage Mode Switch:
SHOW STORAGE INFO;