GitHub - u-brite/HeriVar: Quantifying the combined heritability of a trait based on a multi-ethnic LD panel with equal distribution of samples among each ancestry group.

HeriVar

Quantifying the combined heritability of a trait based on a multi-ethnic LD panel with equal distribution of samples among each ancestry group.

Background

Heritability of a trait is often estimated and reported separately for each ancestry group. This limits the ability to estimate and report the combined heritability in a multi-ethnic population. Although several recently demonstrated methods calculate heritability robustly, with or without individual-level data, they are limited to ancestry-specific groups. In this project, we propose a way to calculate combined heritability using a multi-ethnic reference linkage-disequilibrium (LD) panel with equal proportions of samples from each ancestry group. We will use existing tools to simulate and calculate heritability and report the process as a framework that can be implemented and explored further. This will lead to the development of a novel approach to estimating the heritability of particular traits in multi-ethnic populations. As a part of Team HeriVar, you will contribute to demonstrating the methodology, calculating heritability, and working as a team to promote the method.

With the increasing availability of multi-ethnic whole genome sequence datasets, there is still no established approach for estimating the heritability of a particular phenotypic trait that accounts for multi-ethnic genetic architecture. This approach of calculating the combined multi-ethnic heritability has not been pursued previously. This project helps us understand the challenges involved and generates a framework that uses existing tools to calculate and assess the heritability of a trait in multi-ethnic populations.

Data

High Coverage 1000g dataset downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/
GWAS summary statistics for NTproBNP (in-house) and for BP (downloaded from the Pan-UK Biobank analysis, https://pan.ukbb.broadinstitute.org/phenotypes).

Tools

R. ( module load R )
Python. ( module load Anaconda3 )
PLINK (https://www.cog-genomics.org/plink/2.0/ or module load PLINK in Cheaha).
LDAK/SUMHER (https://dougspeed.com/sumher/).
LDSC (https://github.com/bulik/ldsc).
LiftOver ( https://genome.ucsc.edu/cgi-bin/hgLiftOver )

Process

Dependencies

LDSC requires Anaconda3 or Python 2.7 and packages such as bitarray, nose, pybedtools, scipy, numpy, pandas, bioconda (these are installed when the conda environment is generated).
SumHer uses Intel MKL Libraries as dependencies. ( module load imkl/2020.1.217-iimpi-2020a )

Installation

LDSC ( Everyone needs to install it in their own home directory to use it )
- Clone the LDSC GitHub repository ( git clone https://github.com/bulik/ldsc.git ) and cd into the folder
- Load the Anaconda3 module ( module load Anaconda3 )
- Install the dependencies using conda as suggested by the repository ( conda env create --file environment.yml )
- Activate the ldsc environment ( source activate ldsc )
- Test the installation by running the Python scripts shipped as part of the repo ( ./ldsc.py -h )
SumHer
- Download the LDAK Linux executable by submitting a request with your name and email ( first-time users will receive an email from the developer with the download links )
- Unzip the executable file and use it. ( /data/project/ubrite/hackathon2022/staging_area_teams/HeriVar/Tools/ldak5.2.linux - accessible to everyone )
- An executable is also available for Mac users. Note: Please check Dependencies before installing the tools.
LiftOver
- Download the file from https://genome.ucsc.edu/cgi-bin/hgLiftOver
- Download the chain file needed for the conversion; it is available from the link above.
- Run liftOver -h

Results

Datasets
- We downloaded 1000g high coverage reference dataset from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/.
- We then extracted the individual-level files and randomly chose 489 unrelated individuals from each ancestry group.
- The rationale for taking an equal number of individuals from each ancestry group is that each group's LD patterns are then equally represented in the panel.
- Admixed populations and related individuals were excluded from the analysis, which left 1956 individuals.
- We removed variants with less than 1% minor allele frequency and variants with more than 5% missing data.
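
The equal-sized sampling described in this list can be sketched as follows. This is a minimal illustration and not the team's actual script: the function name `balanced_sample`, the toy sample IDs, and the group labels are assumptions, and the sketch presumes that related and admixed individuals were already removed from the input.

```python
import random
from collections import defaultdict

def balanced_sample(sample_to_group, n_per_group, seed=0):
    """Randomly pick the same number of samples from every ancestry group."""
    by_group = defaultdict(list)
    for sample_id, group in sample_to_group.items():
        by_group[group].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    chosen = {}
    for group in sorted(by_group):
        members = sorted(by_group[group])
        if len(members) < n_per_group:
            raise ValueError(f"{group} has only {len(members)} samples")
        chosen[group] = rng.sample(members, n_per_group)
    return chosen

# Toy input: hypothetical sample IDs and group labels, not the real 1000G panel.
toy = {f"S{i:03d}": ("G1", "G2", "G3", "G4")[i % 4] for i in range(40)}
picked = balanced_sample(toy, n_per_group=5)
total = sum(len(ids) for ids in picked.values())
print(total)
```

If the panel contains four groups, calling the function with `n_per_group=489` would give 4 x 489 = 1956 individuals, which matches the count reported above.
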
Figure: Allele frequency distribution among each ancestry and overall.
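
To make the two variant filters concrete, here is a small sketch of how minor allele frequency and missingness can be computed per variant. It is an assumption-laden illustration: genotypes are coded as 0/1/2 alternate-allele counts with `None` for a missing call, and the function names are invented for this example. In PLINK the same thresholds are normally applied with `--maf 0.01` and `--geno 0.05`.

```python
def maf_and_missingness(genotypes):
    """genotypes: list of 0/1/2 alt-allele counts, with None for a missing call."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate

def passes_qc(genotypes, min_maf=0.01, max_missing=0.05):
    """Keep a variant only if MAF >= 1% and missingness <= 5%."""
    maf, missing_rate = maf_and_missingness(genotypes)
    return maf >= min_maf and missing_rate <= max_missing

# Toy variants (made-up data for illustration only).
toy_variants = {
    "common": [0, 1, 2, 1, 0, 1, 1, 0, 2, 1] * 10,  # alt frequency 0.45
    "rare": [0] * 199 + [1],                         # 1 alt allele out of 400
    "gappy": [1, None] * 50,                         # half of the calls missing
}
kept = [name for name, g in toy_variants.items() if passes_qc(g)]
print(kept)
```
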

PCA Analysis
- We used PLINK to run a principal component analysis to test whether we have equal distributions of samples per ancestry group.
Figure: PC distributions stratified by ancestry.
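
One way to check the balance numerically is to count samples per ancestry group and summarize a principal component per group. The sketch below assumes a PLINK 2-style `.eigenvec` table whose header line starts with `#` and contains an `IID` column; the sample IDs, group labels, and PC values are made up, and `summarize_pcs` is a hypothetical helper rather than part of any tool.

```python
from collections import defaultdict
from statistics import mean

def summarize_pcs(eigenvec_text, sample_to_group, pc="PC1"):
    """Return {group: (sample count, mean of the chosen PC)}."""
    lines = [ln.split() for ln in eigenvec_text.strip().splitlines()]
    header = [h.lstrip("#") for h in lines[0]]
    iid_col, pc_col = header.index("IID"), header.index(pc)
    values = defaultdict(list)
    for row in lines[1:]:
        values[sample_to_group[row[iid_col]]].append(float(row[pc_col]))
    return {g: (len(v), mean(v)) for g, v in values.items()}

# Toy eigenvector table and group labels (illustrative values only).
toy_eigenvec = """#IID PC1 PC2
A1 -0.10 0.02
A2 -0.12 0.01
B1 0.11 -0.03
B2 0.09 -0.02
"""
toy_groups = {"A1": "G1", "A2": "G1", "B1": "G2", "B2": "G2"}
summary = summarize_pcs(toy_eigenvec, toy_groups)
for group, (count, pc_mean) in summary.items():
    print(group, count, round(pc_mean, 3))
```

Equal counts per group with clearly separated PC means would be consistent with a balanced, well-stratified panel.
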

Pruning & Thresholding
- After subsetting to the samples of interest, we performed pruning and thresholding (P + T) with different cutoffs.
- PLINK was used to generate the files needed.
- We varied the R-squared cutoff and the window size:
  - R-squared cutoff of 0.2, 0.4, 0.6, 0.8.
  - Window size of 250kb, 500kb, 1Mb, 10Mb.
Figure: Distribution of variants after P + T.
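
The parameter grid above gives 4 x 4 = 16 combinations per input dataset. The sketch below only builds the command lines for that grid; it assumes the pruning was done with PLINK 2's `--indep-pairwise` (the exact PLINK command used is not stated here), and the file prefix and output directory are placeholders. PLINK 1.9 expects an additional step-size argument, so check the syntax against the installed version before running anything.

```python
from itertools import product

R2_CUTOFFS = [0.2, 0.4, 0.6, 0.8]
WINDOWS_KB = {"250kb": 250, "500kb": 500, "1Mb": 1000, "10Mb": 10000}

def pruning_commands(bfile_prefix, out_dir):
    """Build one PLINK command (as an argument list) per window/r2 combination."""
    commands = []
    for (label, kb), r2 in product(WINDOWS_KB.items(), R2_CUTOFFS):
        out = f"{out_dir}/prune_{label}_r2_{r2}"
        commands.append([
            "plink2", "--bfile", bfile_prefix,
            "--indep-pairwise", f"{kb}kb", str(r2),  # assumed pruning flag
            "--out", out,
        ])
    return commands

# "panel_chr22" and "pruned" are hypothetical names used for illustration.
commands = pruning_commands("panel_chr22", "pruned")
print(len(commands))
print(" ".join(commands[0]))
```

Each argument list could be written into a job script or passed to `subprocess.run`, one job per combination and chromosome.
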

We ran nearly 1000 jobs on Cheaha to generate these datasets.
We decided to exclude high-LD regions, as recommended by the tools.
We split the datasets into two categories.
- Before high-LD region removal.
- After high-LD region removal.
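
The split into the two categories amounts to an interval filter on variant positions. The following sketch shows the idea with placeholder coordinates; the region list, variant names, and function names are invented for illustration and are not the high-LD regions actually used.

```python
def in_any_region(chrom, pos, regions):
    """True if (chrom, pos) falls inside any (chrom, start, end) region."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)

def split_by_high_ld(variants, regions):
    """Separate variants into those outside and inside the listed regions."""
    kept, removed = [], []
    for variant in variants:
        chrom, pos = variant[0], variant[1]
        (removed if in_any_region(chrom, pos, regions) else kept).append(variant)
    return kept, removed

# Placeholder region and variants: coordinates are NOT real high-LD regions.
toy_regions = [("6", 1000, 2000)]
toy_variants = [("6", 500, "rsA"), ("6", 1500, "rsB"), ("7", 1500, "rsC")]
kept, removed = split_by_high_ld(toy_variants, toy_regions)
print([v[2] for v in kept], [v[2] for v in removed])
```

The region coordinates must be in the same genome build as the variants, which matters here because the panel is on GRCh38.
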
Reference panel generation
- We used the two categories mentioned above and two tools to calculate reference LD panels.
- We used LDSC to generate LD scores for all the categories we have.
Figure: LD score distribution for chromosome 22.
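
For readers unfamiliar with the quantity plotted here, the LD score of a variant is the sum of its squared correlations (r-squared) with the variants in a surrounding window, including itself. The sketch below computes that on a toy genotype matrix. It is a simplification: LDSC itself uses a bias-adjusted r-squared estimator and its own windowing options, and the data and function names here are invented.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant carries no LD information
    return sxy * sxy / (sxx * syy)

def ld_scores(genotypes, positions, window_bp):
    """LD score of variant j = sum of r^2 with all variants within window_bp."""
    scores = []
    for j, pos_j in enumerate(positions):
        score = 0.0
        for k, pos_k in enumerate(positions):
            if abs(pos_j - pos_k) <= window_bp:
                score += r_squared(genotypes[j], genotypes[k])
        scores.append(score)
    return scores

# Toy data: variants 0 and 1 are in perfect LD, variant 2 is uncorrelated.
toy_genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [0, 2, 0, 2, 0, 2],
]
toy_positions = [100, 200, 300]
scores = ld_scores(toy_genotypes, toy_positions, window_bp=1000)
print(scores)
```
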

For LDAK annotations, we used LiftOver to convert blk annotations from GRCh37 to GRCh38 and started generating tagging files.
- We had an issue generating the LDAK annotation files and decided to pursue this analysis after the hackathon.
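
The coordinate conversion step can be sketched as below. The UCSC `liftOver` tool takes four positional arguments (input file, chain file, output file, unmapped file) and by default reads BED input, which is 0-based and half-open. The file names are placeholders, `hg19ToHg38.over.chain.gz` is the standard UCSC chain for GRCh37 to GRCh38, and the helper names are assumptions. The sketch only builds and prints the command; it does not run it.

```python
import shlex

def to_bed_line(chrom, pos, name):
    """Convert a 1-based position to a 0-based, half-open BED record."""
    return f"chr{chrom}\t{pos - 1}\t{pos}\t{name}"

def liftover_command(in_bed, chain, out_bed, unmapped_bed):
    """Argument list for: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", in_bed, chain, out_bed, unmapped_bed]

# Hypothetical file names used only for illustration.
cmd = liftover_command(
    "annotations.hg19.bed",
    "hg19ToHg38.over.chain.gz",
    "annotations.hg38.bed",
    "annotations.unmapped.bed",
)
print(to_bed_line("22", 100, "rs1"))
print(shlex.join(cmd))
```

Records that cannot be mapped are written to the unmapped file, so it is worth checking how many annotations are lost before building tagging files.
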
Phenotype Processing
- We also worked on processing the phenotype summary statistics into the formats suggested by the tools.
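
As an illustration of this kind of processing, the sketch below converts one summary-statistics record into the SNP, A1, A2, N, Z layout that LDSC works with, computing Z as effect size divided by standard error. The input column names (`snp`, `effect_allele`, `other_allele`, `beta`, `se`) and the sample size are hypothetical; the real in-house and Pan-UKBB files use their own headers, and LDSC provides `munge_sumstats.py` for this purpose.

```python
def to_ldsc_row(row, n_samples):
    """Map one summary-statistics record to SNP/A1/A2/N/Z, or None if unusable."""
    beta, se = float(row["beta"]), float(row["se"])
    if se <= 0:
        return None  # a Z-score cannot be computed without a positive SE
    return {
        "SNP": row["snp"],
        "A1": row["effect_allele"],
        "A2": row["other_allele"],
        "N": n_samples,
        "Z": beta / se,
    }

# Made-up records with hypothetical column names.
toy_rows = [
    {"snp": "rs1", "effect_allele": "A", "other_allele": "G", "beta": "0.5", "se": "0.25"},
    {"snp": "rs2", "effect_allele": "C", "other_allele": "T", "beta": "0.1", "se": "0"},
]
converted = [to_ldsc_row(r, n_samples=1000) for r in toy_rows]
usable = [r for r in converted if r is not None]
print(usable)
```
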
Heritability
- We tried to generate h2 values using LDAK and LDSC but could not complete this step because of last-minute issues.
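
For context on what this step would compute, LD score regression relates each variant's GWAS chi-square statistic to its LD score: the expected chi-square is approximately N * h2 * l_j / M + intercept, where N is the GWAS sample size and M the number of variants. The sketch below fits that line with plain unweighted least squares on synthetic numbers. It is not a result from this project and not LDSC's actual estimator, which uses weighted regression and a block jackknife for standard errors.

```python
def ldsc_h2(chi2, ld_scores, n_samples, n_variants):
    """Unweighted fit of chi2 = intercept + slope * ld_score; h2 = slope * M / N."""
    m = len(chi2)
    mean_l = sum(ld_scores) / m
    mean_c = sum(chi2) / m
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    return slope * n_variants / n_samples, intercept

# Synthetic example: chi-square values generated exactly from the model
# with h2 = 0.3, N = 10000, M = 1000 and intercept 1 (no noise).
toy_ld = [1.0, 2.0, 5.0, 10.0, 20.0]
toy_chi2 = [1 + (10000 * 0.3 / 1000) * l for l in toy_ld]
h2, intercept = ldsc_h2(toy_chi2, toy_ld, n_samples=10000, n_variants=1000)
print(round(h2, 6), round(intercept, 6))
```
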

Team Members

Akhil Pampana | pampana.akhil@gmail.com | apampana@uabmc.edu | Team Leader
Nick Sumpter | nicks95@uab.edu | Team Member
Yongyu (Frank) Qiang | frankqiang5040@gmail.com | Team Member
