HaploNet is a framework for inferring fine-scale population structure from phased haplotypes of whole-genome sequencing (WGS) data using neural networks in an unsupervised approach. We use a variational autoencoder (VAE) framework to learn mappings to and from a low-dimensional latent space, in which we perform indirect clustering of haplotypes with a Gaussian mixture prior (Gaussian Mixture Variational Autoencoder, GMVAE).
Please cite our paper in Genome Research: Haplotype and Population Structure Inference using Neural Networks in Whole-Genome Sequencing Data.
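As a rough illustration of the GMVAE idea only (this is not HaploNet's actual architecture; all layer sizes, cluster counts and names below are illustrative assumptions), a minimal sketch in PyTorch could look like this:

import torch
import torch.nn as nn

class GMVAESketch(nn.Module):
    # Minimal Gaussian mixture VAE sketch; sizes and names are illustrative only.
    def __init__(self, n_sites, n_latent=2, n_comp=8, n_hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_sites, n_hidden), nn.ReLU())
        self.mu = nn.Linear(n_hidden, n_latent)
        self.logvar = nn.Linear(n_hidden, n_latent)
        self.dec = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_sites))  # logits for per-site allele probabilities
        # Learnable means of the Gaussian mixture prior (one component per haplotype
        # cluster); the KL term against this mixture prior is omitted for brevity.
        self.prior_mu = nn.Parameter(torch.randn(n_comp, n_latent))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

# Toy usage: a batch of 32 binary haplotypes over 1000 sites
x = torch.bernoulli(torch.full((32, 1000), 0.5))
logits, mu, logvar = GMVAESketch(1000)(x)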
The HaploNet framework relies on the following Python packages, which you can install through conda (recommended, see below) or pip:
- pytorch
- numpy
- cython
- scipy
- cyvcf2
See the PyTorch website (https://pytorch.org/) for more information on how to install PyTorch for your setup (GPU/CPU). You can easily create an environment through conda as follows:
# GPU setup
conda env create -f environment_gpu.yml
# CPU setup
conda env create -f environment_cpu.yml
git clone https://github.com/Rosemeis/HaploNet.git
cd HaploNet
pip3 install -e .
You can now run HaploNet with the haplonet command.
HaploNet can now be trained on a phased genotype file in VCF/BCF format as follows (using default parameters and on GPU):
haplonet train --bcf chr1.bcf --cuda --out haplonet.chr1
# Saves log-likelihoods in binary NumPy matrix (haplonet.chr1.loglike.npy)
# and a log-file with parameters used in the training (haplonet.chr1.log)
# Run for 22 chromosomes and save output path in a filelist (needed in downstream analyses)
for c in {1..22}
do
    haplonet train --bcf chr${c}.bcf --cuda --out haplonet.chr${c}
    realpath haplonet.chr${c}.loglike.npy >> haplonet.filelist
done
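Each output file can be inspected directly with NumPy. The axis layout noted in the comment is an assumption about the output, not a documented guarantee:

import numpy as np

L = np.load("haplonet.chr1.loglike.npy")
print(L.shape, L.dtype)  # assumed layout: windows x haplotypes x haplotype clusters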
HaploNet outputs the neural network log-likelihoods by default, which are used to infer global population structure (PCA and admixture). With the '--latent' argument, the parameters of the learnt latent spaces of the GMVAE can be saved as well. See all available options in HaploNet with the following commands:
haplonet -h
haplonet train -h  # train HaploNet
haplonet admix -h  # estimate ancestry proportions
haplonet pca -h    # perform PCA
All the following analyses assume that HaploNet has been run for all chromosomes and that a filelist has been created containing the log-likelihood output path for each chromosome (e.g. haplonet.filelist). The argument "--like" can be used instead if you only have a single chromosome or a merged file.
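If you want to work with the log-likelihoods yourself, the filelist is plain text with one path per line, so the per-chromosome arrays can be loaded in Python like this (a sketch using only NumPy):

import numpy as np

with open("haplonet.filelist") as f:
    paths = [line.strip() for line in f if line.strip()]
loglikes = [np.load(p) for p in paths]  # one log-likelihood array per chromosome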
As an example, the EM algorithm in HaploNet can be run with K=2 and 64 threads (CPU based):
haplonet admix --filelist haplonet.filelist --K 2 --threads 64 --seed 0 --out haplonet.admixture.k2
# Saves ancestry proportions in a text-file (haplonet.admixture.k2.q)
# and ancestral cluster frequencies in a binary NumPy matrix (haplonet.admixture.k2.f.npy)
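A quick sanity check of the output in Python; treating the Q matrix as individuals-by-K is an assumption about its orientation:

import numpy as np

Q = np.loadtxt("haplonet.admixture.k2.q")   # assumed N x K ancestry proportions
F = np.load("haplonet.admixture.k2.f.npy")  # ancestral cluster frequencies
print(Q.shape, Q.sum(axis=1)[:5])           # each row should sum to ~1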
As an example, the admixture proportions can be plotted in R as follows:
q <- read.table("haplonet.admixture.k2.q")
barplot(t(q), space=0, border=NA, col=c("dodgerblue3", "firebrick2"), xlab="Individuals", ylab="Proportions", main="HaploNet - Admixture")
Estimate eigenvectors directly using SVD (recommended for big datasets):
haplonet pca --filelist haplonet.filelist --threads 64 --out haplonet.pca
The estimated eigenvectors can then be plotted in R:
e <- as.matrix(read.table("haplonet.pca.eigenvecs"))
plot(e[,1:2], main="HaploNet - PCA", xlab="PC1", ylab="PC2")
Alternatively, compute the covariance matrix and perform the eigendecomposition in R:
haplonet pca --filelist haplonet.filelist --cov --threads 64 --out haplonet.pca
C <- as.matrix(read.table("haplonet.pca.cov"))
e <- eigen(C)
plot(e$vectors[,1:2], main="HaploNet - PCA", xlab="PC1", ylab="PC2")
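The same decomposition can also be done in Python with NumPy, assuming the covariance matrix is the whitespace-delimited text file read by read.table above:

import numpy as np

C = np.loadtxt("haplonet.pca.cov")
vals, vecs = np.linalg.eigh(C)  # eigh, since the covariance matrix is symmetric
pcs = vecs[:, ::-1][:, :2]      # eigh sorts eigenvalues ascending; flip to get the top PCs
print(pcs[:5])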