STRUCTURE and ROLLOFF reimplementations for CS 4775 (Computational Genetics and Genomics)

veeara282/cs4775-structure-rolloff

STRUCTURE and ROLLOFF

STRUCTURE and ROLLOFF reimplementations for CS 4775 (Computational Genetics and Genomics) at Cornell

Original STRUCTURE paper: "Inference of Population Structure Using Multilocus Genotype Data"

Original ROLLOFF paper: "The History of African Gene Flow into Southern Europeans, Levantines, and Jews"

Installation

To install in a virtual environment:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

Usage

structure.py infers admixture proportions from a dataset of genetic variants in Variant Call Format (VCF). It writes its output to an HDF5 file, which visualize.py and rolloff.py then read.

python structure.py [-h] [-k num_populations] [-o output.hdf5]
                    [--profile] [-d drop_frac] [-m num_burn_in_rounds]
                    [-s num_samples] [-c num_rounds_btwn_samples] data_file

Command-line arguments:

  • data_file: the input file, either a VCF file or an Eigenstrat (.phgeno) file
  • -k: the number of populations
  • -o or --out: the HDF5 file to which the output will be written
  • -d or --drop-frac: the fraction of loci to drop
  • -m or --burn-in: the number of burn-in rounds to run before sampling begins
  • -s or --num-samples: the number of samples to collect
  • -c or --sample-interval: the number of rounds between samples
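
The README does not spell out how -m, -s, and -c combine. The sketch below shows one common sampling schedule. It assumes the sampler runs all burn-in rounds first and then keeps one sample every sample_interval rounds. The function name sampled_rounds is hypothetical and is not taken from structure.py.

```python
def sampled_rounds(burn_in, num_samples, sample_interval):
    """Return the 1-based round numbers at which a sample would be kept.

    Assumption (not confirmed by the README): burn-in rounds come first,
    then one sample is kept every `sample_interval` rounds.
    """
    return [burn_in + (i + 1) * sample_interval for i in range(num_samples)]


rounds = sampled_rounds(burn_in=100, num_samples=5, sample_interval=10)
print(rounds)      # [110, 120, 130, 140, 150]
print(rounds[-1])  # 150 rounds in total under this assumption
```

Under this reading, the total run length is burn_in + num_samples * sample_interval rounds, which gives a rough way to budget running time before starting a long job.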

structure.py can also take a .phgeno data file as its primary argument.
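
As an illustration of the interface described above, here is a minimal argparse sketch that accepts the same flags. It is not the repository's actual parser. The argument types, the dest name num_populations, and the reading of --profile as a boolean switch are assumptions, and no default values are implied.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented structure.py flags (sketch only)."""
    parser = argparse.ArgumentParser(prog="structure.py")
    parser.add_argument("data_file",
                        help="VCF file or Eigenstrat (.phgeno) file")
    parser.add_argument("-k", dest="num_populations", type=int,
                        help="number of populations")
    parser.add_argument("-o", "--out",
                        help="HDF5 file to which the output is written")
    # The README lists --profile without describing it; a boolean switch is assumed.
    parser.add_argument("--profile", action="store_true")
    parser.add_argument("-d", "--drop-frac", type=float,
                        help="fraction of loci to drop")
    parser.add_argument("-m", "--burn-in", type=int,
                        help="number of burn-in rounds")
    parser.add_argument("-s", "--num-samples", type=int,
                        help="number of samples to collect")
    parser.add_argument("-c", "--sample-interval", type=int,
                        help="number of rounds between samples")
    return parser


# Example values are placeholders chosen for illustration.
args = build_parser().parse_args(
    ["-k", "3", "-o", "out.hdf5", "-m", "100", "-s", "50", "-c", "10",
     "data/example.vcf.gz"]
)
print(args.num_populations, args.out, args.burn_in, args.data_file)
```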

visualize.py creates charts to display the output from structure.py. It takes one command-line argument, the location of the HDF5 file.

rolloff.py estimates the ROLLOFF statistic for two populations:

python rolloff.py [-h] [--profile] [-m centimorgans] data_file.hdf5

Command-line arguments:

  • data_file.hdf5: the file generated by structure.py
  • -m or --min-bin-size: the minimum bin size to use, in centimorgans
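
To make the bin-size option more concrete, the sketch below groups pairs of loci by the genetic distance between them, so that a statistic can then be computed per distance bin. It assumes fixed-width bins, which may differ from how rolloff.py treats its minimum bin size. The function name bin_pairs_by_distance and the example positions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bin_pairs_by_distance(positions_cm, bin_size_cm):
    """Group all pairs of loci into bins of width `bin_size_cm` centimorgans.

    Returns a dict mapping bin index -> list of (i, j) locus index pairs,
    where bin index b covers distances in [b * bin_size_cm, (b + 1) * bin_size_cm).
    """
    if bin_size_cm <= 0:
        raise ValueError("bin_size_cm must be positive")
    bins = defaultdict(list)
    for (i, a), (j, b) in combinations(enumerate(positions_cm), 2):
        bins[int(abs(b - a) // bin_size_cm)].append((i, j))
    return dict(bins)


# Hypothetical genetic map positions, in centimorgans.
positions = [0.0, 0.4, 1.1, 2.5]
for index, pairs in sorted(bin_pairs_by_distance(positions, 1.0).items()):
    print(index, pairs)
```

Smaller bins give finer resolution along the distance axis but leave fewer pairs in each bin, so per-bin estimates become noisier.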

Data files

The data files are compressed and stored in the data/ folder. They are from the 1000 Genomes repository. Use of these datasets is subject to these terms.

  • data/1.1-200000.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz (the 200K dataset) contains variant calls from the first 200,000 positions in chromosome 1.
  • data/1.1-2500000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz (the 2.5M dataset) contains variant calls from the first 2.5 million positions in chromosome 1.
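
Because the datasets are gzip-compressed VCF files, a quick sanity check is possible with the standard library alone. The sketch below counts variant records and reads the sample names from the #CHROM header line. It relies on the standard VCF layout, with nine fixed columns before the per-sample columns. The function name read_vcf_summary is hypothetical and is not part of this repository.

```python
import gzip


def read_vcf_summary(path):
    """Return (sample_names, num_records) for a gzip-compressed VCF file."""
    samples = []
    num_records = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Columns 0-8 are CHROM..FORMAT; sample names follow.
                samples = line.rstrip("\n").split("\t")[9:]
                continue
            if line.strip():
                num_records += 1  # one line per variant record
    return samples, num_records
```

For example, calling read_vcf_summary("data/1.1-200000.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz") from the repository root would report how many individuals and variant records the 200K dataset contains.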
