Skip to content
/ linkd Public

Calculate population-specific linkage disequilibrium for individual genomic variants

License

Notifications You must be signed in to change notification settings

treynr/linkd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

linkd

Linkage disequilibrium (LD) refers to the nonrandom assortment of alleles at two different locations, or loci, in the genome. In other words, if two alleles (or genomic variants in this case) are in LD with one another, they are inherited together more than one would expect by random chance. This is a collection of scripts for calculating population-specific LD using variant calls from the 1000 Genomes Project. Each script can be executed by itself or submitted to an HPC cluster for faster processing.

Usage

Using datasets from the 1K Genome Project requires a lot of disk space. Uncompressed variant calls are roughly 1TB in size.

Start by retrieving variant calls for all 22 autosomes and both sex chromosomes. By default, all data (raw and processed) is saved to the data/ directory.

$ ./src/retrieve-1k-genomes.sh

We may also wish to update old SNP identifiers prior to calculating LD. If this is the case, also retrieve the latest identifier merge table (currently set to dbSNP v. 150) from NCBI:

$ ./src/retrieve-merged-snps.sh

Next we filter out variant call samples based on population structure. For example, if we wanted to only include chromosome 21 samples from the European superpopulation we would use the -s option along with the population code EUR:

$ ./src/filter-samples.sh -s EUR data/chr21.vcf data/chr21-filtered.vcf

We could also select individual populations. For example, if we wanted to only include individuals of Han Chinese descent:

$ ./src/filter-samples.sh CHB,CHS data/chr21.vcf data/chr21-filtered.vcf

The sample-population map can be found at data/1k-genomes-sample-populations.tsv. A list of populations, population codes, and metadata is available on the 1K Genomes Project website.

Updating SNP identifiers to their more recent versions can now be performed on filtered variant calls:

$ ./src/update-snp-identifiers.sh data/merged-snps.tsv data/chr21-filtered.vcf data/chr21-merged.vcf

Finally, given a list of SNPs, we can calculate LD scores (D') between all pairwise combinations of SNPs on that list and SNPs residing on the same chromosome.

$ ./src/calculate-ld.sh snp-list.txt data/chr21-merged.vcf data/chr21-ld

HPC usage

All scripts can be submitted to an HPC cluster running PBS/TORQUE. Submission scripts can be found in the hpc/ directory. These scripts will process all chromosomes simultaneously. Their performance (node and processes per node utilization) can be tweaked by editing the submission or src/ scripts. The HPC version of the previous usage examples:

$ ./hpc/submit-retrieve-1k-genomes.sh

## Only keep samples from the European superpop...
$ ./hpc/submit-filter-samples.sh -s EUR

## ...or only keep samples of Han Chinese descent
$ ./hpc/submit-filter-samples.sh CHB,CHS

$ ./hpc/submit-calculate-ld.sh snp-list.txt

Requirements and installation

The following dependencies are required:

Make sure all dependencies are available on your $PATH.

Disk space requirements

As mentioned in the introduction, the 1K Genome Project data requires a large amount of disk space. These scripts can be easily modified to operate over gzipped variant calls but there are currently no plans to do so. Unzipped, the raw 1K Genome Project variants take up almost 1TB of space:

$ du -h -c data/chr+([0-9]|X|Y).vcf
44G     data/chr10.vcf
49G     data/chr11.vcf
42G     data/chr12.vcf
34G     data/chr13.vcf
32G     data/chr14.vcf
27G     data/chr15.vcf
30G     data/chr16.vcf
26G     data/chr17.vcf
27G     data/chr18.vcf
20G     data/chr19.vcf
77G     data/chr1.vcf
20G     data/chr20.vcf
12G     data/chr21.vcf
14G     data/chr22.vcf
77G     data/chr2.vcf
70G     data/chr3.vcf
69G     data/chr4.vcf
58G     data/chr5.vcf
55G     data/chr6.vcf
52G     data/chr7.vcf
50G     data/chr8.vcf
43G     data/chr9.vcf
32G     data/chrX.vcf
188M    data/chrY.vcf
949G    total

About

Calculate population-specific linkage disequilibrium for individual genomic variants

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published