Linkage disequilibrium (LD) refers to the nonrandom association of alleles at two different locations, or loci, in the genome. In other words, two variants in LD with one another are inherited together more often than one would expect by random chance. This is a collection of scripts for calculating population-specific LD using variant calls from the 1000 Genomes Project. Each script can be executed on its own or submitted to an HPC cluster for faster processing.
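As a toy illustration of what "more often than expected by chance" means (not part of these scripts), the basic disequilibrium statistic D can be computed from haplotype counts; the counts below are hypothetical:

```shell
# D = p_AB - p_A * p_B: the observed frequency of the A-B haplotype minus
# the frequency expected if the two loci assorted independently.
# Hypothetical haplotype counts for two biallelic SNPs:
awk 'BEGIN {
    AB = 40; Ab = 10; aB = 10; ab = 40
    n = AB + Ab + aB + ab
    pAB = AB / n          # observed A-B haplotype frequency
    pA  = (AB + Ab) / n   # allele frequency at locus 1
    pB  = (AB + aB) / n   # allele frequency at locus 2
    D = pAB - pA * pB     # deviation from independence
    printf "D = %.4f\n", D
}'
# prints "D = 0.1500"
```

A value of D = 0 would mean the two loci are inherited independently; here the A and B alleles co-occur well in excess of the 0.25 expected by chance.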
Using datasets from the 1000 Genomes Project requires a lot of disk space: the uncompressed variant calls are roughly 1TB in size.
Start by retrieving variant calls for all 22 autosomes and both sex chromosomes. By default, all data (raw and processed) is saved to the data/ directory.
$ ./src/retrieve-1k-genomes.sh
We may also wish to update old SNP identifiers prior to calculating LD. If so, also retrieve the latest identifier merge table (currently dbSNP v. 150) from NCBI:
$ ./src/retrieve-merged-snps.sh
Next, we filter the variant call samples based on population structure. For example, to include only chromosome 21 samples from the European superpopulation, use the -s option along with the population code EUR:
$ ./src/filter-samples.sh -s EUR data/chr21.vcf data/chr21-filtered.vcf
We could also select individual populations. For example, to include only individuals of Han Chinese descent:
$ ./src/filter-samples.sh CHB,CHS data/chr21.vcf data/chr21-filtered.vcf
The sample-population map can be found at data/1k-genomes-sample-populations.tsv. A list of populations, population codes, and metadata is available on the 1K Genomes Project website.
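The sample-population map is what drives this filtering. As a minimal sketch of the sample-selection step (hypothetical, not the actual contents of filter-samples.sh; the column layout is assumed to be sample, population, superpopulation, tab-separated):

```shell
# Build a tiny example map in place of data/1k-genomes-sample-populations.tsv
# (assumed layout: sample <TAB> population <TAB> superpopulation):
printf 'HG00096\tGBR\tEUR\nNA18525\tCHB\tEAS\nHG00171\tFIN\tEUR\n' > example-map.tsv

# Select sample IDs belonging to the EUR superpopulation
awk -F'\t' '$3 == "EUR" { print $1 }' example-map.tsv
# prints:
# HG00096
# HG00171
```

The resulting ID list could then be handed to a VCF subsetting tool (e.g. bcftools view -S) to produce the filtered calls.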
SNP identifiers in the filtered variant calls can now be updated to their more recent versions:
$ ./src/update-snp-identifiers.sh data/merged-snps.tsv data/chr21-filtered.vcf data/chr21-merged.vcf
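Conceptually, this step rewrites merged rsIDs in the VCF's ID column using the merge table. A minimal sketch (hypothetical, not the actual contents of update-snp-identifiers.sh; the merge table is assumed to be two columns, old ID then new ID, without the "rs" prefix, as in NCBI's RsMergeArch dump):

```shell
# Tiny example inputs standing in for the real merge table and VCF:
printf '123\t456\n' > example-merge.tsv
printf '#CHROM\tPOS\tID\tREF\tALT\n21\t100\trs123\tA\tG\n21\t200\trs789\tC\tT\n' > example.vcf

awk -F'\t' 'BEGIN { OFS = FS }
    NR == FNR { merged["rs" $1] = "rs" $2; next }  # load the merge table
    /^#/      { print; next }                      # pass VCF headers through
    $3 in merged { $3 = merged[$3] }               # rewrite merged rsIDs
    { print }
' example-merge.tsv example.vcf
# rs123 is rewritten to rs456; rs789 is left unchanged
```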
Finally, given a list of SNPs, we can calculate LD scores (D') between all pairwise combinations of SNPs on that list and SNPs residing on the same chromosome.
$ ./src/calculate-ld.sh snp-list.txt data/chr21-merged.vcf data/chr21-ld
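For reference, D' is the raw disequilibrium D normalized by its maximum attainable magnitude given the allele frequencies, so it ranges from -1 to 1. This is the standard formula, not the implementation inside calculate-ld.sh; the haplotype counts are hypothetical:

```shell
awk '
function min(a, b) { return a < b ? a : b }
BEGIN {
    # Hypothetical haplotype counts for a pair of biallelic SNPs:
    AB = 474; Ab = 26; aB = 26; ab = 474
    n  = AB + Ab + aB + ab
    pA = (AB + Ab) / n        # allele frequency at SNP 1
    pB = (AB + aB) / n        # allele frequency at SNP 2
    D  = AB / n - pA * pB     # raw disequilibrium
    # The normalizing bound depends on the sign of D
    if (D >= 0) Dmax = min(pA * (1 - pB), (1 - pA) * pB)
    else        Dmax = min(pA * pB, (1 - pA) * (1 - pB))
    printf "D-prime = %.4f\n", D / Dmax
}'
# prints "D-prime = 0.8960"
```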
All scripts can be submitted to an HPC cluster running PBS/TORQUE.
Submission scripts can be found in the hpc/ directory; they process all chromosomes simultaneously. Their performance (number of nodes and processes per node) can be tuned by editing the submission scripts or the src/ scripts.
The HPC versions of the previous usage examples:
$ ./hpc/submit-retrieve-1k-genomes.sh
## Only keep samples from the European superpop...
$ ./hpc/submit-filter-samples.sh -s EUR
## ...or only keep samples of Han Chinese descent
$ ./hpc/submit-filter-samples.sh CHB,CHS
$ ./hpc/submit-calculate-ld.sh snp-list.txt
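For orientation, a PBS/TORQUE submission script for this kind of per-chromosome work might look roughly like the following. This is a hypothetical sketch, not the contents of the actual hpc/ scripts; node count, ppn, and walltime are placeholder values:

```shell
#!/bin/bash
## Hypothetical PBS/TORQUE array-job sketch; the real hpc/ scripts may differ.
#PBS -N calculate-ld
#PBS -l nodes=1:ppn=8       # one node, eight processes per node
#PBS -l walltime=24:00:00
#PBS -t 1-22                # one array task per autosome

cd "$PBS_O_WORKDIR"
CHR="chr${PBS_ARRAYID}"
./src/calculate-ld.sh snp-list.txt "data/${CHR}-merged.vcf" "data/${CHR}-ld"
```

Tweaking the #PBS -l resource line is where the node and processes-per-node tuning mentioned above would happen.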
The following dependencies are required:
Make sure all dependencies are available on your $PATH.
As mentioned in the introduction, the 1000 Genomes Project data requires a large amount of disk space. These scripts could easily be modified to operate on gzipped variant calls, but there are currently no plans to do so. Uncompressed, the raw 1000 Genomes Project variants take up almost 1TB of space:
$ du -h -c data/chr+([0-9]|X|Y).vcf
44G data/chr10.vcf
49G data/chr11.vcf
42G data/chr12.vcf
34G data/chr13.vcf
32G data/chr14.vcf
27G data/chr15.vcf
30G data/chr16.vcf
26G data/chr17.vcf
27G data/chr18.vcf
20G data/chr19.vcf
77G data/chr1.vcf
20G data/chr20.vcf
12G data/chr21.vcf
14G data/chr22.vcf
77G data/chr2.vcf
70G data/chr3.vcf
69G data/chr4.vcf
58G data/chr5.vcf
55G data/chr6.vcf
52G data/chr7.vcf
50G data/chr8.vcf
43G data/chr9.vcf
32G data/chrX.vcf
188M data/chrY.vcf
949G total