Genotype Data Format
Encore uses many pieces of data to perform a genome-wide association analysis. The server administrator must configure the application so that it can find the genotype data. This page describes those configuration options.
A genotype data source (of which there can be several, one per data freeze) is identified by a GUID in the system (`00000000-0000-0000-0000-000000000000`, where each 0 can be a digit 0-9 or a lowercase letter a-f). This GUID can be generated by any means.
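Since the GUID can come from anywhere, one convenient option is Python's standard `uuid` module. The sketch below is only an illustration: the helper names `new_genotype_guid` and `looks_like_guid` are hypothetical and are not part of Encore.

```python
import re
import uuid

# Lowercase, hyphenated 8-4-4-4-12 hexadecimal form described above.
GUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def new_genotype_guid():
    """Return a random GUID as a lowercase, hyphenated string."""
    return str(uuid.uuid4())


def looks_like_guid(value):
    """Check that a string has the expected lowercase GUID shape."""
    return GUID_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    print(new_genotype_guid())
```

The printed value can then be used both as the GUID in the database record and as the name of the freeze folder.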
The application uses the database to store some metadata about each genotype freeze (such as the display name and build information). This is stored in the genotypes table. New data can be added to this table via the admin area of the application. No file paths are tracked in the database, though.
The paths to all the files are tracked in a JSON file in each freeze directory. Encore looks for this directory starting from the path defined in the GENO_DATA_FOLDER variable in the flask_config.py configuration file. Inside GENO_DATA_FOLDER, Encore looks for a folder named after the GUID stored in the corresponding database record, and in that folder it looks for a meta.json file. File paths in meta.json are assumed to be relative to the Encore genotype directory.
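The lookup described above amounts to joining three path components. The following is a minimal sketch of that idea, not Encore's actual implementation; the function names `meta_json_path` and `load_freeze_meta` are hypothetical.

```python
import json
import os


def meta_json_path(geno_data_folder, geno_id):
    """Build <GENO_DATA_FOLDER>/<GUID>/meta.json."""
    return os.path.join(geno_data_folder, geno_id, "meta.json")


def load_freeze_meta(geno_data_folder, geno_id):
    """Read and parse the meta.json file for one genotype freeze."""
    with open(meta_json_path(geno_data_folder, geno_id)) as handle:
        return json.load(handle)
```

Here `geno_data_folder` stands for the value of GENO_DATA_FOLDER and `geno_id` for the GUID stored in the genotypes table.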
This JSON file contains the paths to the files needed to analyze this freeze. Here is a basic example:
{
  "savs": "savs/chr*.rehead.sav",
  "annovcfs": "savs/chr*.anno.vcf.gz",
  "groups": "groups/nonsyn.grp",
  "pca_genotypes_path": "pcas/1pct.mergedthin.rehead.bed",
  "phenotypes": {
    "file": "pcas/1pct.proj.rehead.sscore",
    "meta": "pcas/meta.json",
    "name": "PCA Data"
  },
  "samples_path": "samples.txt",
  "stats": {
    "genotype_count": 19706819536,
    "record_count": 877506482,
    "sample_count": 140306
  }
}
You can store your data in either vcf format or sav format (the latter is preferred because it is much faster to read). The value of the vcfs or savs key can be a string with the path to a single file that contains all the data. If the data is split by chromosome, use a "*" to indicate where the chromosome number should go. Alternatively, you can use an object with a key for each chromosome name and a value giving the path to that chromosome's file. For example:
{
  ...
  "savs": {
    "1": "savs/chr1.rehead.sav",
    "2": "savs/chr2.rehead.sav",
    ...
  },
  ...
}
If unspecified, the default value for vcfs is vcfs/chr*.vcf.gz and for savs is savs/chr*.sav.
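To make the three accepted forms concrete (a single path, a "*" pattern, or a per-chromosome object), here is a small sketch of how a path for one chromosome could be resolved from a parsed meta.json. It is an illustration under assumptions, not Encore's own code: the names `DEFAULT_PATTERNS` and `resolve_genotype_path` are hypothetical, and the choice to raise `KeyError` for a chromosome missing from the object form is made up for the example.

```python
# Defaults as documented above.
DEFAULT_PATTERNS = {
    "vcfs": "vcfs/chr*.vcf.gz",
    "savs": "savs/chr*.sav",
}


def resolve_genotype_path(meta, key, chrom):
    """Return the relative path for one chromosome.

    meta  -- dict parsed from meta.json
    key   -- "vcfs" or "savs"
    chrom -- chromosome name, e.g. "1" or 22
    """
    spec = meta.get(key, DEFAULT_PATTERNS[key])
    chrom = str(chrom)
    if isinstance(spec, dict):
        # Object form: one explicit path per chromosome name.
        return spec[chrom]
    # String form: substitute the chromosome for "*". A string without
    # "*" is returned unchanged (one file holding all chromosomes).
    return spec.replace("*", chrom)
```

For instance, with the basic example file shown earlier, `resolve_genotype_path(meta, "savs", 7)` would give `savs/chr7.rehead.sav`.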
Encore needs to know where to look for the INFO data for the variants. Typically this is kept separate from the genotype data to keep the genotype-only files smaller and easier to read. This property (the annovcfs key in the example above) takes a path to a tabix-indexed vcf site list. Again, if data for different chromosomes is spread across different files, you can use a * to indicate where the chromosome name should be substituted. There is no default value for this property. If it is not provided, no additional information will be shown on the variant detail page.
Encore needs a quick way to look up which samples are included in the data set, for example to help identify which column in a phenotype file contains the sample IDs. We assume that there is a text file with one sample ID per row. The path to this file should be given in the samples_path key.
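As an illustration of why a plain one-ID-per-row file is useful, the sketch below reads such a file into a set and then picks the phenotype column with the most matching values. This is only one possible heuristic, assumed for the example; the functions `read_sample_ids` and `guess_sample_id_column` are hypothetical and do not describe how Encore itself detects the column.

```python
def read_sample_ids(samples_path):
    """Read a text file with one sample ID per row into a set."""
    with open(samples_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def guess_sample_id_column(header, rows, sample_ids):
    """Return the header name of the column whose values most often
    appear in sample_ids, or None if no column has any match."""
    best_name, best_hits = None, 0
    for index, name in enumerate(header):
        hits = sum(
            1 for row in rows
            if index < len(row) and row[index] in sample_ids
        )
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name
```

Using a set keeps each membership test cheap, which matters when the freeze contains a large number of samples.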
Encore provides quick summaries of the genotype freezes by showing the number of samples, markers, and total records (samples * markers). You can either include this information as a dictionary in the meta file via stats (as above) or specify a stats_path, which is a path to a JSON file containing an object with the same three values: genotype_count, record_count, and sample_count.
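The two ways of supplying the summary counts could be handled along the following lines. This is a hedged sketch rather than Encore's implementation: the function name `load_stats` is hypothetical, and it assumes that an inline stats object takes precedence over stats_path and that stats_path is resolved relative to the freeze directory, neither of which is stated above.

```python
import json
import os

STAT_KEYS = ("genotype_count", "record_count", "sample_count")


def load_stats(meta, freeze_dir):
    """Return the three summary counts, or None if none are configured.

    meta       -- dict parsed from meta.json
    freeze_dir -- directory that stats_path is resolved against (assumed)
    """
    if "stats" in meta:
        # Inline form: counts stored directly in meta.json.
        stats = meta["stats"]
    elif "stats_path" in meta:
        # External form: counts stored in a separate JSON file.
        with open(os.path.join(freeze_dir, meta["stats_path"])) as handle:
            stats = json.load(handle)
    else:
        return None
    missing = [key for key in STAT_KEYS if key not in stats]
    if missing:
        raise ValueError("stats is missing: " + ", ".join(missing))
    return {key: stats[key] for key in STAT_KEYS}
```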