
Redis Schema

Alan Cleary edited this page Sep 18, 2020 · 5 revisions

GCV Microservices use Redis equipped with the RediSearch module as their datastore. Though scripts are provided to load data from a Chado database, here we describe the schema so that developers may write their own loading scripts or build services on the GCV database.

Unique names and IDs

It is assumed that every chromosome and gene in the database has a unique name. This keeps service APIs and code simple by allowing names to be used as unique identifiers. However, to make the code more explicit, names are not used verbatim as unique identifiers. Instead, each name is given a prefix designating the type of entity it identifies. Specifically, each chromosome is given the chromosome: prefix and each gene is given the gene: prefix.
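As a minimal sketch of this naming convention, the helpers below build and parse prefixed IDs. The function and constant names are hypothetical and chosen for illustration; only the chromosome: and gene: prefixes come from the schema.

```python
# Prefixes taken from the schema; everything else here is illustrative.
CHROMOSOME_PREFIX = "chromosome:"
GENE_PREFIX = "gene:"


def chromosome_id(name: str) -> str:
    """Return the database ID for a chromosome with the given unique name."""
    return CHROMOSOME_PREFIX + name


def gene_id(name: str) -> str:
    """Return the database ID for a gene with the given unique name."""
    return GENE_PREFIX + name


def name_from_id(entity_id: str) -> str:
    """Strip the type prefix from an ID to recover the entity's name."""
    for prefix in (CHROMOSOME_PREFIX, GENE_PREFIX):
        if entity_id.startswith(prefix):
            return entity_id[len(prefix):]
    raise ValueError(f"unrecognized ID prefix: {entity_id!r}")
```

Because the prefix encodes the entity type, a chromosome and a gene may share a name without their IDs colliding.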

Redis

Currently, only chromosomes are stored in vanilla Redis. Each chromosome has four ordered lists: one for gene names, one for functional gene annotations, one for gene fmin values, and one for gene fmax values. The elements of these lists are in the same order as their genes appear on the chromosome, so the values at a given position in the four lists all describe the same gene. The key for each list is the chromosome's ID with a suffix describing the values in the list. As with the ID prefix, the suffix is intended to make the code more explicit.

chromosome:<chromosome name>:genes -> [<gene name>, ...]
chromosome:<chromosome name>:families -> [<annotation>, ...]
chromosome:<chromosome name>:fmins -> [<fmin>, ...]
chromosome:<chromosome name>:fmaxs -> [<fmax>, ...]
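The following sketch shows one way a loading script might populate these four lists. It assumes the genes are already supplied in chromosome order as (name, family, fmin, fmax) tuples, which is a hypothetical record shape, and that the client exposes redis-py style delete and rpush methods. The function names are illustrative and not part of GCV.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical record shape: (name, family, fmin, fmax), in chromosome order.
Gene = Tuple[str, str, int, int]


def chromosome_lists(chromosome_name: str, genes: Sequence[Gene]) -> Dict[str, List]:
    """Build the four list keys and their parallel values for one chromosome."""
    base = f"chromosome:{chromosome_name}"
    return {
        f"{base}:genes": [g[0] for g in genes],
        f"{base}:families": [g[1] for g in genes],
        f"{base}:fmins": [g[2] for g in genes],
        f"{base}:fmaxs": [g[3] for g in genes],
    }


def load_chromosome_lists(client, chromosome_name: str, genes: Sequence[Gene]) -> None:
    """Write the lists using any client with redis-py style delete/rpush."""
    for key, values in chromosome_lists(chromosome_name, genes).items():
        client.delete(key)  # avoid appending to a list left by an earlier load
        if values:  # rpush needs at least one value
            client.rpush(key, *values)
```

With redis-py installed, a client created as `redis.Redis(decode_responses=True)` could be passed to `load_chromosome_lists`. Since rpush appends to the tail, list order matches the input order.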

RediSearch

Chromosome and gene entities are stored in RediSearch. This allows them to be easily searched, sorted, and filtered directly in Redis. Each chromosome/gene is given its own document, and its ID is as previously described. For ease of use, chromosomes and genes are stored in separate indexes.

Chromosome index

The chromosome index is named chromosomeIdx and has the following schema:

{
  name: TextField,
  length: NumericField,
  genus: TextField,
  species: TextField,
}

name is the chromosome's unique name, length is the number of nucleotides in the chromosome's sequence, genus is the genus of the chromosome's organism, and species is the species of the chromosome's organism.
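As a rough illustration, the chromosome schema could be expressed as a raw FT.CREATE command. This sketch assumes a RediSearch 2.x server, where documents are ordinary Redis hashes that the index picks up by key prefix; older RediSearch releases add documents differently, and the GCV loading scripts may not work this way. The function names are hypothetical.

```python
from typing import Dict, List, Tuple, Union


def chromosome_index_command() -> List[str]:
    """Arguments for an FT.CREATE call matching the chromosomeIdx schema."""
    return [
        "FT.CREATE", "chromosomeIdx",
        "ON", "HASH",                    # assumption: documents stored as hashes
        "PREFIX", "1", "chromosome:",    # index hashes whose key has this prefix
        "SCHEMA",
        "name", "TEXT",
        "length", "NUMERIC",
        "genus", "TEXT",
        "species", "TEXT",
    ]


def chromosome_document(
    name: str, length: int, genus: str, species: str
) -> Tuple[str, Dict[str, Union[str, int]]]:
    """Return the document ID and field mapping for one chromosome."""
    fields = {"name": name, "length": length, "genus": genus, "species": species}
    return f"chromosome:{name}", fields
```

With a redis-py client, the command could be sent via `client.execute_command(*chromosome_index_command())` and a document stored via `client.hset(doc_id, mapping=fields)`.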

Gene index

The gene index is named geneIdx and has the following schema:

{
  chromosome: TextField,
  name: TextField,
  fmin: NumericField,
  fmax: NumericField,
  family: TextField,
  strand: NumericField,
  index: sortable NumericField,
}

chromosome is the unique name of the gene's chromosome, name is the unique name of the gene, fmin is the inter-base coordinate of the leftmost/minimal boundary of the gene, fmax is the inter-base coordinate of the rightmost/maximal boundary of the gene, and family is the functional annotation of the gene. strand has a value of -1, 0, or 1 designating whether the gene is on the reverse, unknown, or forward strand of its chromosome, respectively. index is the position of the gene in its chromosome's genes, families, fmins, and fmaxs lists, as previously described.
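To show how the index field ties a gene document back to its chromosome's lists, the sketch below builds gene documents from genes supplied in chromosome order. The (name, family, fmin, fmax, strand) tuple shape and the function name are hypothetical, and zero-based positions are an assumption made here because Redis list commands such as LINDEX are zero-based.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Hypothetical record shape: (name, family, fmin, fmax, strand), in chromosome order.
GeneRecord = Tuple[str, str, int, int, int]
Fields = Dict[str, Union[str, int]]


def gene_documents(
    chromosome_name: str, genes: Sequence[GeneRecord]
) -> List[Tuple[str, Fields]]:
    """Return (document ID, fields) pairs matching the geneIdx schema."""
    docs = []
    for index, (name, family, fmin, fmax, strand) in enumerate(genes):
        if strand not in (-1, 0, 1):
            raise ValueError(f"invalid strand for {name!r}: {strand}")
        fields: Fields = {
            "chromosome": chromosome_name,
            "name": name,
            "fmin": fmin,
            "fmax": fmax,
            "family": family,
            "strand": strand,
            "index": index,  # assumption: zero-based position in the lists
        }
        docs.append((f"gene:{name}", fields))
    return docs
```

Under these assumptions, the gene whose document has index i is the same gene found at position i of each of the chromosome's four lists, so a service can move between a search result and its neighbors on the chromosome.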
