
Genotype Data Format

MrFlick edited this page Jul 30, 2021 · 2 revisions

Genotype Data Configuration

Encore uses many pieces of data to perform a genome-wide association analysis. The server administrator must configure the application so that it can find the genotype data. This page describes those configuration options.

GUID

A genotype data source (there can be multiple data freezes) is identified in the system by a GUID of the form `00000000-0000-0000-0000-000000000000`, where each 0 can be a digit 0-9 or a lowercase letter a-f. This GUID can be generated by any means.
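As one possible way to produce such an identifier, the sketch below uses Python's standard uuid module and checks the result against the layout described above. The function names new_genotype_guid and is_valid_guid are hypothetical helpers for illustration; they are not part of Encore.

```python
import re
import uuid

# The 8-4-4-4-12 layout of lowercase hexadecimal characters described above.
GUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def new_genotype_guid() -> str:
    """Return a random GUID as a lowercase, hyphenated string (hypothetical helper)."""
    return str(uuid.uuid4())


def is_valid_guid(value: str) -> bool:
    """Check that a string has the expected GUID layout (hypothetical helper)."""
    return GUID_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    print(new_genotype_guid())
```

Any other generator that yields the same layout should work equally well, since the GUID only has to match between the database record and the folder name.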

Database

The application uses the database to store some metadata about each genotype freeze (such as the display name and build information). This is stored in the genotypes table. New data can be added to this table via the admin area of the application. No file paths are tracked in the database, though.

Meta File For File Paths

The paths to all the files are tracked in a JSON file in each freeze directory. Encore looks for this directory starting at the path defined by the GENO_DATA_FOLDER variable in the flask_config.py configuration file. Inside GENO_DATA_FOLDER, Encore looks for a folder named with the GUID stored in the corresponding database record, and in that folder it looks for a meta.json file. File paths in this directory are assumed to be relative to the Encore genotype directory.
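The lookup order described above (GENO_DATA_FOLDER, then the GUID folder, then meta.json) can be sketched as follows. This is a minimal illustration, not Encore's own code: meta_file_path and load_meta are hypothetical names, and the folder value is passed in as a plain argument rather than read from flask_config.py.

```python
import json
from pathlib import Path


def meta_file_path(geno_data_folder, guid):
    """Build <GENO_DATA_FOLDER>/<GUID>/meta.json (hypothetical helper)."""
    return Path(geno_data_folder) / guid / "meta.json"


def load_meta(geno_data_folder, guid):
    """Read and parse the meta.json file for one genotype freeze."""
    with open(meta_file_path(geno_data_folder, guid), encoding="utf-8") as handle:
        return json.load(handle)
```

A call such as load_meta("/data/genotypes", guid) would raise FileNotFoundError if the GUID folder or its meta.json is missing, which can be a quick way to check a new freeze directory before registering it. The path "/data/genotypes" is only an example value.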

This JSON file contains the paths to the files needed for analysis for this freeze. Here's a basic example:

{
  "savs": "savs/chr*.rehead.sav",
  "annovcfs": "savs/chr*.anno.vcf.gz",
  "groups": "groups/nonsyn.grp",
  "pca_genotypes_path": "pcas/1pct.mergedthin.rehead.bed",
  "phenotypes": {
    "file": "pcas/1pct.proj.rehead.sscore",
    "meta": "pcas/meta.json",
    "name": "PCA Data"
  },
  "samples_path": "samples.txt",
  "stats": {
    "genotype_count": 19706819536,
    "record_count": 877506482,
    "sample_count": 140306
  }
} 

Actual genotype data (savs or vcfs)

You can store your data in either vcf format or sav format (the latter is preferred because it is much faster to read). The value of the savs or vcfs key can be a string with a path to the file that contains all the data. If the data is split by chromosome, use a "*" to indicate where the chromosome number should go. Alternatively, you can use an object with a key for each chromosome name and a value of the path to that chromosome. For example:

{
  ...
  "savs": {
     "1": "savs/chr1.rehead.sav",
     "2": "savs/chr2.rehead.sav",
     ...
   },
  ...
}

If unspecified, the default value for vcfs is vcfs/chr*.vcf.gz and for savs is savs/chr*.sav.
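To illustrate how the three forms (a single path, a "*" pattern, or a per-chromosome object) and the defaults could be resolved to a concrete file, here is a small sketch. The function genotype_path and the DEFAULT_PATTERNS table are hypothetical; the way Encore itself chooses between savs and vcfs is not shown here.

```python
# Default patterns as stated above; used when the key is absent from meta.json.
DEFAULT_PATTERNS = {
    "vcfs": "vcfs/chr*.vcf.gz",
    "savs": "savs/chr*.sav",
}


def genotype_path(meta, key, chrom):
    """Resolve the path for one chromosome from a parsed meta.json dict.

    meta  -- dict loaded from meta.json
    key   -- "savs" or "vcfs"
    chrom -- chromosome name, e.g. 1, "22" or "X"
    """
    spec = meta.get(key, DEFAULT_PATTERNS[key])
    if isinstance(spec, dict):
        # Object form: one explicit path per chromosome name.
        return spec[str(chrom)]
    # String form: substitute the chromosome for "*"; a path without "*"
    # is returned unchanged because it holds all chromosomes.
    return spec.replace("*", str(chrom))
```

With the basic example above, genotype_path(meta, "savs", 7) would give savs/chr7.rehead.sav. Looking up a chromosome that is missing from the object form raises KeyError in this sketch.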

Annotation information (annovcf)

Encore needs to know where to look for the INFO data for the variants. Typically this is kept separate from the genotype data in order to make the genotype-only files smaller and easier to read. Here we need a path to a tabix-indexed vcf site list. Again, if data for different chromosomes is spread across different files, you can use a * to indicate where the replacement should happen. There is no default value for this property. If it is not present, no additional information will be shown on the variant detail page.

Sample names (samples_path)

Encore needs a quick way to look up which samples are included in the data set in order to help identify things like which column in a phenotype file contains the sample IDs. We assume that there is a text file with one sample ID per row. The path to this file should be given in the samples_path key.
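As an illustration of why a flat list of sample IDs is useful, the sketch below reads such a list and guesses which column of a phenotype table holds the sample IDs by counting matches. This is an assumed approach for demonstration only, not necessarily how Encore performs the matching; read_sample_ids and guess_id_column are hypothetical names.

```python
def read_sample_ids(lines):
    """Collect sample IDs from an iterable of lines, one ID per row.

    Blank lines are skipped and surrounding whitespace is removed.
    """
    return {line.strip() for line in lines if line.strip()}


def guess_id_column(rows, sample_ids):
    """Return the index of the column with the most values found in sample_ids.

    rows is an iterable of lists of strings (already split phenotype rows).
    Returns None when no value in any column matches a known sample ID.
    """
    counts = {}
    for row in rows:
        for index, value in enumerate(row):
            if value in sample_ids:
                counts[index] = counts.get(index, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)
```

Because read_sample_ids accepts any iterable of lines, an open file handle for samples.txt can be passed to it directly.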

Basic Stats (stats, stats_path)

Encore provides quick summaries of the genotype freezes by showing the number of samples, markers, and total records (samples * markers). You can either include this information as a dictionary in the meta file via stats (as above), or you can specify a stats_path, which is a path to a JSON file containing an object with the same three values: genotype_count, record_count, and sample_count.
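The two ways of supplying the counts can be handled with a short fallback, sketched below. Note the assumptions: load_stats is a hypothetical helper, giving an inline stats object priority over stats_path is a choice made for this example, and base_dir stands for whatever directory the stats_path is relative to.

```python
import json
from pathlib import Path

# The three values named above.
STAT_KEYS = ("genotype_count", "record_count", "sample_count")


def load_stats(meta, base_dir):
    """Return the three counts from a parsed meta.json dict, or None if absent.

    Assumption: an inline "stats" object is preferred over "stats_path".
    """
    if "stats" in meta:
        stats = meta["stats"]
    elif "stats_path" in meta:
        with open(Path(base_dir) / meta["stats_path"], encoding="utf-8") as handle:
            stats = json.load(handle)
    else:
        return None
    missing = [key for key in STAT_KEYS if key not in stats]
    if missing:
        raise KeyError("missing stats values: " + ", ".join(missing))
    return {key: stats[key] for key in STAT_KEYS}
```

Raising on missing values is also a choice made for this sketch; it makes an incomplete stats file visible early rather than showing partial summaries.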