Generating and running the graph database involves several processes. Each process can be run standalone if needed, but two Maven profiles are provided that call the underlying Java wrappers for convenience:
build-graph
- Takes the input data and generates the CSV files used to build the graph database with neo4j's org.neo4j.tooling.ImportTool
run-neo4j
- Starts up the neo4j database, based on the graph built previously
The project consists of 6 parsers. Each one takes two parameters: an input (a specific file or a folder, depending on the parser) and an output directory. These are specified in config.properties.
Run mvn clean package; this will generate graph-bundle.zip in the /target folder.
Below is the list of the Parsers, and the Nodes and Relationships that each one is responsible for producing.
The java command is executed from the lib folder of the extracted zip file. Substitute ${PARSER_NAME} with the fully qualified name of the parser, e.g. com.graph.db.file.vcf.VcfParser:
java -classpath '*:../conf/' ${PARSER_NAME}
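
If you want to script the parser runs from Java rather than a shell, the command above can be assembled as an argument list. This is only a sketch: the class name `ParserCommand` is made up for illustration and is not part of the project, and the working directory in the comment is a hypothetical path.

```java
import java.util.Arrays;
import java.util.List;

final class ParserCommand {

    // Builds the argument list for one parser run. Because the arguments are
    // passed as a list (no shell involved), the '*' classpath wildcard reaches
    // the JVM unexpanded, which is what the single quotes achieve in a shell.
    static List<String> forParser(String parserClassName) {
        return Arrays.asList("java", "-classpath", "*:../conf/", parserClassName);
    }

    public static void main(String[] args) {
        List<String> command = forParser("com.graph.db.file.vcf.VcfParser");
        System.out.println(String.join(" ", command));
        // To launch it, run from the lib folder of the extracted zip, e.g.
        // (hypothetical path):
        // new ProcessBuilder(command)
        //         .directory(new java.io.File("graph-bundle/lib"))
        //         .inheritIO()
        //         .start();
    }
}
```

Note that the `:` classpath separator matches the command shown above; on Windows the separator is `;`.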
This parses the genotype VCF file containing the variant to individual relationships.
| Nodes | Relationships |
|---|---|
| Person | HetVariantToPerson |
| | HomVariantToPerson |
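
The two relationship types separate heterozygous from homozygous calls. The parser's actual rules are not shown in this README, so the following is a rough sketch of that distinction only, assuming the standard diploid GT encoding from the VCF format (`0` is the reference allele, `.` is missing). The class and enum names are hypothetical.

```java
final class GenotypeClassifier {

    enum Zygosity { HET, HOM, NONE }

    // Classifies a diploid VCF GT value such as "0/1" or "1|1".
    // HOM:  both alleles are the same non-reference allele.
    // HET:  the two alleles differ, so at least one is non-reference.
    // NONE: homozygous reference, missing ("./."), or not diploid.
    static Zygosity classify(String gt) {
        String[] alleles = gt.split("[/|]");
        if (alleles.length != 2) {
            return Zygosity.NONE;
        }
        String a = alleles[0];
        String b = alleles[1];
        if (a.equals(".") || b.equals(".")) {
            return Zygosity.NONE;
        }
        if (a.equals(b)) {
            return a.equals("0") ? Zygosity.NONE : Zygosity.HOM;
        }
        return Zygosity.HET;
    }

    public static void main(String[] args) {
        for (String gt : new String[] {"0/1", "1/1", "0/0", "./."}) {
            System.out.println(gt + " -> " + classify(gt));
        }
    }
}
```

Under this assumption a `HET` call would correspond to HetVariantToPerson and a `HOM` call to HomVariantToPerson, while `NONE` would produce no relationship.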
This parses the annotation file produced by the Variant Effect Predictor in the JSON format (VCF format to be supported soon).
| Nodes | Relationships |
|---|---|
| GeneticVariant | GeneToGeneticVariant |
| TranscriptVariant | GeneticVariantToTranscriptVariant |
| ConsequenceTerm | TranscriptToTranscriptVariant |
| Transcript | TranscriptVariantToConsequenceTerm |
| Gene | |
This parses the OMIM-HPO file, which links genes to the HPO terms with which they are associated.
| Nodes | Relationships |
|---|---|
| | GeneToTerm |
This parses the phenotype file which links individuals to their HPO terms.
| Nodes | Relationships |
|---|---|
| Person | PersonToObservedTerm |
| | PersonToNonObservedTerm |
This loads the HPO ontology, which links HPO terms to other HPO terms. Two relationships are produced: TermToParentTerm links a Term to its parent Term, and TermToDescendantTerms links a Term to all of its descendant Terms. For example, querying the descendants of HP:0000001 will output every term, since it is the root node.
| Nodes | Relationships |
|---|---|
| Term | TermToParentTerm |
| | TermToDescendantTerms |
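
To illustrate how descendant links relate to parent links, here is a minimal sketch that derives all descendants of a term from child-to-parent pairs with a breadth-first walk. It is not the project's implementation. The class name and the `TOY:` term IDs are made up, and the sketch assumes a term is not counted as its own descendant.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TermDescendants {

    // Inverts child -> parents links (TermToParentTerm) into parent -> children.
    static Map<String, Set<String>> childrenOf(Map<String, List<String>> parentsOf) {
        Map<String, Set<String>> children = new HashMap<>();
        for (Map.Entry<String, List<String>> e : parentsOf.entrySet()) {
            for (String parent : e.getValue()) {
                children.computeIfAbsent(parent, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return children;
    }

    // Collects every term reachable from root through child links.
    // The seen-check matters because the ontology is a DAG: a term may have
    // several parents and would otherwise be visited more than once.
    static Set<String> descendants(String root, Map<String, Set<String>> children) {
        Set<String> seen = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String term = queue.poll();
            for (String child : children.getOrDefault(term, Collections.emptySet())) {
                if (seen.add(child)) {
                    queue.add(child);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Toy ontology: TOY:A and TOY:B sit under the root, TOY:C under both.
        Map<String, List<String>> parentsOf = new HashMap<>();
        parentsOf.put("TOY:A", Arrays.asList("HP:0000001"));
        parentsOf.put("TOY:B", Arrays.asList("HP:0000001"));
        parentsOf.put("TOY:C", Arrays.asList("TOY:A", "TOY:B"));
        Map<String, Set<String>> children = childrenOf(parentsOf);
        System.out.println(descendants("HP:0000001", children));
        System.out.println(descendants("TOY:A", children));
    }
}
```

Storing the descendants as explicit relationships trades database size for query speed: a query for everything under a term becomes a single hop rather than a variable-length traversal.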
This loads the Gencode gene-to-transcript file.
| Nodes | Relationships |
|---|---|
| Transcript | TranscriptToGene |
| Gene | |
To ensure optimum performance for queries, the following constraints and indexes are created by com.graph.db.DatabaseIndexCreator:
- Term - termId
- Person - personId
- GeneticVariant - variantId
- Gene - gene_id
- TranscriptVariant - hgvsc
- Transcript - transcript_id
- ConsequenceTerm - consequenceTerm
- GeneticVariant - allele_freq
- GeneticVariant - cadd
- GeneticVariant - exac_AF
- Gene - gene_name
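
For reference, label/property pairs like these map onto Cypher schema statements as sketched below. This README does not say how DatabaseIndexCreator creates them, nor which entries are uniqueness constraints and which are plain indexes, so the pairing in `main` is purely illustrative. The statements use the legacy Neo4j 2.x/3.x syntax (newer Neo4j versions use a different form), and the class name is hypothetical.

```java
final class SchemaStatements {

    // Legacy Cypher for a uniqueness constraint on label.property.
    static String uniqueConstraint(String label, String property) {
        return String.format(
                "CREATE CONSTRAINT ON (n:%s) ASSERT n.%s IS UNIQUE", label, property);
    }

    // Legacy Cypher for a plain (non-unique) index on label.property.
    static String index(String label, String property) {
        return String.format("CREATE INDEX ON :%s(%s)", label, property);
    }

    public static void main(String[] args) {
        // Illustrative pairing only: an ID-like property as a constraint,
        // a filterable property as an index.
        System.out.println(uniqueConstraint("Term", "termId"));
        System.out.println(index("Gene", "gene_name"));
    }
}
```

A uniqueness constraint also backs the property with an index, so ID lookups on constrained properties need no separate index.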
Pheno4J creates the nodes first and then loads the relationships.
In config.properties you can specify any additional files you would like loaded into neo4j as part of the bulk upload.
For the relevant Node or Relationship (e.g. GeneticVariant, GeneToTerm), specify the full path to the file. If there are multiple files, separate them by commas. The files should not have a header, and the order of the columns must match the headers produced by HeaderGenerator.
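
Before adding extra files, it can help to sanity-check them. The sketch below splits a comma-separated property value into paths and checks that a headerless data row has the same number of columns as a header line. Everything specific in it is an assumption for illustration: the property key `GeneticVariant`, the file paths, the sample header and the sample row are all made up, and the column check assumes comma-delimited rows without quoted commas.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class AdditionalFiles {

    // Splits a comma-separated property value into trimmed, non-empty paths.
    static List<String> parsePaths(String propertyValue) {
        List<String> paths = new ArrayList<>();
        if (propertyValue == null) {
            return paths;
        }
        for (String part : propertyValue.split(",")) {
            String path = part.trim();
            if (!path.isEmpty()) {
                paths.add(path);
            }
        }
        return paths;
    }

    // True when the data row has as many columns as the header line.
    // The -1 limit keeps trailing empty columns in the count.
    static boolean matchesHeader(String headerLine, String dataRow) {
        return headerLine.split(",", -1).length == dataRow.split(",", -1).length;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical config.properties content, inlined for the example.
        Properties props = new Properties();
        props.load(new StringReader("GeneticVariant=/data/extra1.csv, /data/extra2.csv\n"));
        System.out.println(parsePaths(props.getProperty("GeneticVariant")));

        // Hypothetical header and row; the real headers come from HeaderGenerator.
        System.out.println(matchesHeader("variantId,cadd", "1-100-A-T,12.5"));
    }
}
```

This only checks the column count; it cannot tell whether the columns are in the right order, so the header output of HeaderGenerator remains the reference.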
Number of Individuals | Number of Variants | Total Number of Nodes | Total Number of Relationships | Total Number of Properties | Database Size (MB) |
---|---|---|---|---|---|
1,000 | 1,876,797 | 4,223,968 | 103,155,817 | 153,133,671 | 5,901 |
2,000 | 2,673,978 | 5,911,219 | 199,069,603 | 205,735,931 | 9,897 |
3,000 | 3,218,287 | 7,062,972 | 293,919,001 | 243,309,397 | 13,613 |
4,000 | 3,653,139 | 7,982,961 | 389,593,314 | 271,785,796 | 17,205 |
5,000 | 4,008,807 | 8,736,155 | 484,182,286 | 294,170,309 | 20,653 |
5,025 | 4,086,921 | 8,832,245 | 486,827,321 | 296,355,931 | 20,781 |