STRUCTURE and ROLLOFF reimplementations for CS 4775 (Computational Genetics and Genomics) at Cornell
Original STRUCTURE paper: "Inference of Population Structure Using Multilocus Genotype Data"
Original ROLLOFF paper: "The History of African Gene Flow into Southern Europeans, Levantines, and Jews"
To install in a virtual environment:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
structure.py infers admixture proportions given a dataset of genetic variants in Variant Call Format. The output is written to an HDF5 file, which is used by visualize.py and rolloff.py.
python structure.py [-h] [-k num_populations] [-o output.hdf5]
[--profile] [-d drop_frac] [-m num_burn_in_rounds]
[-s num_samples] [-c num_rounds_btwn_samples] data_file
Command-line arguments:
data_file: the input file, either a VCF file or an Eigenstrat (.phgeno) file
-k: the number of populations
-o or --out: the HDF5 file to which the output will be written
-d or --drop-frac: the fraction of loci to drop
-m or --burn-in: the burn-in period
-s or --num-samples: the number of samples to collect
-c or --sample-interval: the number of rounds between samples
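As a rough illustration of how the flags listed above fit together, here is a minimal argparse sketch. It is not taken from structure.py: the function name build_parser is hypothetical, and defaults are left unset because this README does not state them.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the usage line; not the actual structure.py code."""
    parser = argparse.ArgumentParser(prog="structure.py")
    parser.add_argument("data_file", help="VCF or Eigenstrat (.phgeno) input file")
    parser.add_argument("-k", type=int, metavar="num_populations",
                        help="number of populations")
    parser.add_argument("-o", "--out", metavar="output.hdf5",
                        help="HDF5 file to which the output will be written")
    parser.add_argument("--profile", action="store_true")
    parser.add_argument("-d", "--drop-frac", type=float, metavar="drop_frac",
                        help="fraction of loci to drop")
    parser.add_argument("-m", "--burn-in", type=int, metavar="num_burn_in_rounds",
                        help="burn-in period")
    parser.add_argument("-s", "--num-samples", type=int, metavar="num_samples",
                        help="number of samples to collect")
    parser.add_argument("-c", "--sample-interval", type=int,
                        metavar="num_rounds_btwn_samples",
                        help="number of rounds between samples")
    return parser


# Parse an example command line (the file names are placeholders).
args = build_parser().parse_args(
    ["-k", "3", "-o", "out.hdf5", "-m", "100", "-s", "50", "-c", "10", "data.vcf.gz"]
)
print(args.k, args.out, args.burn_in, args.num_samples, args.sample_interval)
```

argparse turns the long option names into attributes with underscores, so --burn-in is read back as args.burn_in.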
structure.py can also take a .phgeno data file as its primary argument.
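For readers unfamiliar with the Eigenstrat genotype format, the sketch below shows one way such a file could be read. It assumes the standard layout (one line per SNP, one character per individual, 0/1/2 for the allele count and 9 for missing data). The function name parse_phgeno is hypothetical, and this is not necessarily how structure.py reads its input.

```python
def parse_phgeno(lines):
    """Return one list of genotypes per SNP line; missing calls (9) become None."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append([None if ch == "9" else int(ch) for ch in line])
    return rows


# Two SNPs typed in four individuals (made-up example data).
example = ["0129", "2210"]
genotypes = parse_phgeno(example)
print(genotypes)
```

Passing an open file handle instead of a list works the same way, since the function only iterates over lines.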
visualize.py creates charts to display the output from structure.py. It takes one command-line argument, the location of the HDF5 file.
rolloff.py estimates the ROLLOFF statistic for two populations. It takes two command-line arguments:
rolloff.py [-h] [--profile] [-m centimorgans] data_file.hdf5
data_file.hdf5: the file generated by structure.py
-m or --min-bin-size: the minimum bin size to use, in centimorgans
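The ROLLOFF paper models the statistic as decaying roughly exponentially with genetic distance, with a decay rate equal to the number of generations since admixture when distance is measured in Morgans. The sketch below illustrates that idea with a simple log-linear least-squares fit, which is a stand-in for the nonlinear fit used in the paper and is not the code in rolloff.py. The names fit_decay and min_cm are hypothetical; in particular, min_cm simply drops bins closer than a given distance and may not match what -m does.

```python
import math


def fit_decay(dist_cm, stat, min_cm=0.0):
    """Fit stat ~ amp * exp(-n * d), d in Morgans, by least squares on log(stat).

    Returns (n, amp). Bins closer than min_cm or with non-positive values are skipped.
    """
    pts = [(d / 100.0, math.log(a))          # centimorgans -> Morgans
           for d, a in zip(dist_cm, stat)
           if d >= min_cm and a > 0]
    if len(pts) < 2:
        raise ValueError("need at least two usable bins")
    count = len(pts)
    mean_x = sum(x for x, _ in pts) / count
    mean_y = sum(y for _, y in pts) / count
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic, noise-free curve for admixture 70 generations ago (made-up data).
dist = [0.5 * i for i in range(1, 41)]                  # 0.5 cM to 20 cM
stat = [0.2 * math.exp(-70 * d / 100.0) for d in dist]
generations, amplitude = fit_decay(dist, stat, min_cm=1.0)
print(round(generations, 3), round(amplitude, 3))
```

With real, noisy data a log-linear fit weights small values heavily, so a nonlinear least-squares fit is usually preferable.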
The data files are compressed and stored in the data/ folder. They are from the 1000 Genomes repository. Use of these datasets is subject to these terms.
data/1.1-200000.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz (the 200K dataset) contains variant calls from the first 200,000 positions in chromosome 1.
data/1.1-2500000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz (the 2.5M dataset) contains variant calls from the first 2.5 million positions in chromosome 1.
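Because the datasets are gzip-compressed VCF files, they can be inspected without decompressing them first. The following is a minimal sketch using only the standard library. It assumes the GT field comes first in each sample column, as the VCF specification requires when GT is present. The function name read_vcf_genotypes is hypothetical and this is not the parser used by structure.py.

```python
import gzip


def read_vcf_genotypes(path):
    """Return (sample_names, records) from a gzipped VCF.

    Each record is (chrom, pos, genotypes), where a genotype is the number of
    non-reference alleles, or None if any allele is missing.
    """
    samples = []
    records = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("##"):
                continue  # skip blank lines and meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the 9 fixed columns
                continue
            genotypes = []
            for call in fields[9:]:
                alleles = call.split(":")[0].replace("|", "/").split("/")
                if "." in alleles:
                    genotypes.append(None)
                else:
                    genotypes.append(sum(1 for a in alleles if a != "0"))
            records.append((fields[0], int(fields[1]), genotypes))
    return samples, records
```

For example, read_vcf_genotypes("data/1.1-200000.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz") would return the sample names and one record per variant line in the 200K dataset.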