Calculate Heterozygosity with Pileup
- Introduction
- Installation
- Usage
- License
- Citation
This program is useful to:
- calculate and visualize the heterozygosity of a genome assembly
- to visualize ploidy
- and to determine if the genome is sufficiently haplotype-collapsed.
It works by scanning a bam file with mpileup
and asking the questions, "What is the read depth at this position, and how many bases match the reference base?" This information is turned into a 3D histogram, where:
- Column 0 is the read depth
- Column 1 is how many reads had the reference allele at that position
- Column 3 is the number of positions in the genome that match Columns 0 and 1
This 3D histogram is then used to generate plots of the marginal histogram of read depths across the whole genome (top panel), the heterozygosity calculated at every specific read depth (middle panel), and a plot of the underlying 3D histogram (bottom panel).
Starting with the bottom panel - this is the plot of the 3D histogram. The purpose of this plot is to visualize where the large smudge is that falls at approximately 1/2 of the read depth coverage at that position. This smudge is a histogram of the heterozygous positions in the genome, supported by sequencing reads aligned in a bam file. The red dashed lines are the bounding lines for calculating heterozygosity later on. In later calculations chep
considers everything between the top-most and bottom-most red dashed lines to be heterozygous sites. Everything above the top-most red dashed line is considered to be a homozygous site. The blue vertical line is drawn through the region of the plot that will most accurately estimate the SNP/short-indel heterozygosity.
The middle panel shows the calculated heterozygosity for that depth. For each read depth the total number of heterozygous sites is divided by the total number of bases. To be more explicit, to calculate the heterozygosity at the blue vertical line (x=107), we count the number of heterozygous positions at read depth 107 between the dashed red lines, then divide that number by the total number of bases with read depth=107. The heterozygosity to the left of the blue line is overestimated (lots of noise, good signal), the heterozygosity ON the blue line is the most accurate (little noise, lots of signal), and the heterozygosity to the right of the blue line is underestimated (little noise, little signal).
The top panel, the histogram of read depth across the entire genome, is good for QC'ing (A) whether or not the genome assembly has uncollapsed haplotigs, and for determining the mode of the coverage depth for the reads (lambda). In the example plot above, we can see that there are two peaks. The peak with a mode around 55 is from uncollapsed haplotigs, and the peak with the mode around 107 is from properly collapsed regions of the genome. From this plot we learn that mode of coverage for the genome (lambda) is 107 (blue vertical line), and therefore this is the best place to measure the heterozygosity.
Summary - with these three plots, you can diagnose many basic properties of a genome assembly!
NB - this program is written with the assumption that the reads being mapped against the genome are from the same individual that was used to generate the genome assembly. Otherwise, the estimates of heterozygosity will be wrong, and there will be many positions that do not have the reference base whatsoever (black streak along the bottom of the lower panel).
To install, go to a directory where you want to install the program. Execute:
git clone https://github.com/conchoecia/chep.git
cd chep
make
Currently, chep
doesn't automatically install the scripts into a system-wide location. As a result you must add the following line to your ~/.bash_profile
or ~/.bashrc
file.
export PATH=$PATH:/home/dschultz/chep/bin
First, generate a bam file of shotgun long- or short reads to a reference genome. Then, run mpileup
on the bam and pipe the output into chep
.
samtools mpileup -f ref.fa shot_to_ref.sorted.bam | chep_pileup_to_array > chep_3D.txt
Then plot the results. Add your own parameters for -x x_min
and -X x_max
, or -d
to make the plot on a back background.
chep_plot -f chep_3D.txt -x 0 -X 200 -d
Chep is free software and is licensed under GPLv3.
Currently there is no way to cite this software aside from citing this github page.