Redis Schema
GCV Microservices use Redis equipped with the RediSearch module as their datastore. Though scripts are provided to load data from a Chado database, here we describe the schema so that developers may write their own loading scripts or build services on the GCV database.
It is assumed that every chromosome and gene in the database has a unique name.
This is to keep service APIs and code simple by allowing names to be used as unique identifiers.
However, to make the code more verbose, names are not used verbatim as unique identifiers.
Instead, they are given a prefix designating the type of entity they are.
Specifically, each chromosome is given the chromosome: prefix and each gene is given the gene: prefix.
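The prefixing rule can be sketched as a pair of small helpers. This is only an illustration: the function names and the example names Chr1 and geneA are hypothetical and are not taken from the GCV code base.

```python
# Prefixes described above; the helper names below are hypothetical.
CHROMOSOME_PREFIX = "chromosome:"
GENE_PREFIX = "gene:"


def chromosome_id(name: str) -> str:
    """Return the database ID for a chromosome with the given unique name."""
    return CHROMOSOME_PREFIX + name


def gene_id(name: str) -> str:
    """Return the database ID for a gene with the given unique name."""
    return GENE_PREFIX + name


def strip_prefix(entity_id: str) -> str:
    """Recover the bare name from a prefixed chromosome or gene ID."""
    for prefix in (CHROMOSOME_PREFIX, GENE_PREFIX):
        if entity_id.startswith(prefix):
            return entity_id[len(prefix):]
    raise ValueError(f"not a chromosome or gene ID: {entity_id!r}")


if __name__ == "__main__":
    print(chromosome_id("Chr1"))
    print(strip_prefix(gene_id("geneA")))
```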
Currently, only chromosomes are stored in vanilla Redis. Each chromosome has four ordered lists: one for gene names, one for functional gene annotations, one for gene fmin values, and one for gene fmax values. The elements of these lists are in the same order their genes appear on the chromosome. The key for each list is the chromosome's ID with a suffix describing the values in the list. As with the ID prefix, the intention with the suffix is to make the code more verbose.
chromosome:<chromosome name>:genes → [<gene name>, ...]
chromosome:<chromosome name>:families → [<annotation>, ...]
chromosome:<chromosome name>:fmins → [<fmin>, ...]
chromosome:<chromosome name>:fmaxs → [<fmax>, ...]
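A loading script might populate and read these four lists roughly as follows. This is a minimal sketch, not the provided Chado loader: the Gene tuple and the function names are hypothetical, and client is assumed to be an object with the delete, rpush, and lrange methods of redis-py's redis.Redis. The caller is assumed to pass genes already in chromosome order.

```python
from typing import Dict, Iterable, List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical record holding the per-gene values kept in the lists."""
    name: str
    family: str
    fmin: int
    fmax: int


LIST_SUFFIXES = ("genes", "families", "fmins", "fmaxs")


def list_key(chromosome_name: str, suffix: str) -> str:
    """Build a list key: the chromosome's ID followed by a suffix."""
    return f"chromosome:{chromosome_name}:{suffix}"


def load_chromosome_lists(client, chromosome_name: str, genes: Iterable[Gene]) -> None:
    """Write the four parallel lists; genes must already be in chromosome order."""
    genes = list(genes)
    columns = {
        "genes": [g.name for g in genes],
        "families": [g.family for g in genes],
        "fmins": [g.fmin for g in genes],
        "fmaxs": [g.fmax for g in genes],
    }
    for suffix, values in columns.items():
        key = list_key(chromosome_name, suffix)
        client.delete(key)  # start from an empty list so reloads do not append
        if values:
            client.rpush(key, *values)  # RPUSH preserves the given order


def read_chromosome_lists(client, chromosome_name: str) -> Dict[str, List]:
    """Read all four lists in full (LRANGE 0 -1)."""
    return {
        suffix: client.lrange(list_key(chromosome_name, suffix), 0, -1)
        for suffix in LIST_SUFFIXES
    }
```

With a real redis-py client, the values read back are bytes unless the client was created with decode_responses=True, so numeric lists would need converting after reading.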
Chromosome and gene entities are stored in RediSearch. This is to allow them to be easily searched, sorted, and filtered directly in Redis. Each chromosome/gene is given its own document and the ID is as previously described. For ease of use, chromosomes and genes are stored in separate indexes.
The chromosome index is named chromosomeIdx and has the following schema:
{
name: TextField,
length: NumericField,
genus: TextField,
species: TextField,
}
name is the chromosome's unique name, length is the number of nucleotides in the chromosome's sequence, genus is the genus of the chromosome's organism, and species is the species of the chromosome's organism.
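As an illustration, the chromosome index and one of its documents could be expressed as below. This sketch assumes RediSearch 2.x, where documents are Redis hashes and an index is bound to keys by prefix; the ON HASH and PREFIX clauses, the helper names, and the example values are assumptions rather than part of the schema description.

```python
from typing import Dict, List, Tuple, Union


def chromosome_index_args() -> List[str]:
    """Arguments of an FT.CREATE command matching the chromosomeIdx schema."""
    return [
        "FT.CREATE", "chromosomeIdx",
        "ON", "HASH",                    # assumption: documents are hashes
        "PREFIX", "1", "chromosome:",    # assumption: index keys by ID prefix
        "SCHEMA",
        "name", "TEXT",
        "length", "NUMERIC",
        "genus", "TEXT",
        "species", "TEXT",
    ]


def chromosome_document(
    name: str, length: int, genus: str, species: str
) -> Tuple[str, Dict[str, Union[str, int]]]:
    """Return the document ID and field mapping for one chromosome."""
    if length < 0:
        raise ValueError("length is a nucleotide count and cannot be negative")
    key = f"chromosome:{name}"
    fields = {"name": name, "length": length, "genus": genus, "species": species}
    return key, fields


if __name__ == "__main__":
    # Illustrative values only.
    key, fields = chromosome_document("Chr1", 1000, "Genus", "species")
    print(" ".join(chromosome_index_args()))
    print(key, fields)
```

With a redis-py client, the index could then be created with client.execute_command(*chromosome_index_args()) and the document stored with client.hset(key, mapping=fields). The per-chromosome list keys share the chromosome: prefix, but an index declared ON HASH only picks up hash keys.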
The gene index is named geneIdx and has the following schema:
{
chromosome: TextField,
name: TextField,
fmin: NumericField,
fmax: NumericField,
family: TextField,
strand: NumericField,
index: sortable NumericField,
}
chromosome is the unique name of the gene's chromosome, name is the unique name of the gene, fmin is the inter-base coordinate of the leftmost/minimal boundary of the gene, fmax is the inter-base coordinate of the rightmost/maximal boundary of the gene, family is the functional annotation of the gene, strand has a value of -1, 0, or 1 designating whether the gene is on the reverse, unknown, or forward strand of its chromosome, respectively, and index is the index of the gene in its chromosome's genes, families, fmins, and fmaxs lists, as previously described.
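The gene index can be sketched in the same spirit, together with a query that uses the sortable index field. As before, this assumes RediSearch 2.x with hash documents bound by key prefix. The helper names are hypothetical, and the query helper only accepts chromosome names that are a single alphanumeric token, because other names would need escaping to survive RediSearch's text tokenization.

```python
from typing import Dict, List, Tuple, Union


def gene_index_args() -> List[str]:
    """Arguments of an FT.CREATE command matching the geneIdx schema."""
    return [
        "FT.CREATE", "geneIdx",
        "ON", "HASH", "PREFIX", "1", "gene:",  # assumptions, see lead-in
        "SCHEMA",
        "chromosome", "TEXT",
        "name", "TEXT",
        "fmin", "NUMERIC",
        "fmax", "NUMERIC",
        "family", "TEXT",
        "strand", "NUMERIC",
        "index", "NUMERIC", "SORTABLE",
    ]


def gene_document(
    chromosome: str, name: str, fmin: int, fmax: int,
    family: str, strand: int, index: int,
) -> Tuple[str, Dict[str, Union[str, int]]]:
    """Return the document ID and field mapping for one gene."""
    if strand not in (-1, 0, 1):
        raise ValueError("strand must be -1, 0, or 1")
    if fmin > fmax:
        raise ValueError("fmin must not exceed fmax")
    fields = {
        "chromosome": chromosome, "name": name, "fmin": fmin, "fmax": fmax,
        "family": family, "strand": strand, "index": index,
    }
    return f"gene:{name}", fields


def genes_overlapping_args(chromosome: str, start: int, stop: int) -> List[str]:
    """FT.SEARCH arguments for genes on a chromosome whose span meets
    [start, stop] (inclusive bounds), ordered by position on the chromosome."""
    if not chromosome.isalnum():
        raise ValueError("chromosome name would need escaping in a query")
    query = f"@chromosome:{chromosome} @fmin:[-inf {stop}] @fmax:[{start} +inf]"
    return ["FT.SEARCH", "geneIdx", query, "SORTBY", "index", "ASC"]
```

With a redis-py client these argument lists can be sent with client.execute_command(*args). FT.SEARCH returns only the first 10 results unless a LIMIT clause is added.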