MitoHPC: Calling Mitochondrial Homoplasmies and Heteroplasmies
A bioinformatics pipeline for estimating mitochondrial DNA copy number and heteroplasmy levels from whole-genome sequencing data; Battle et al., NAR 2022, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9112767/
* Illumina paired-end reads pre-aligned to a reference: .bam or .cram files
* Alignment indices: .bai or .crai files
* Alignment index stats: .idxstats files
* OS: UNIX/LINUX
* SOFTWARE PACKAGES: git,wget,tar,unzip,make,autoconf,gcc,java,perl,python
* SOFTWARE PACKAGES: bwa,samtools,bedtools,fastp,samblaster,bcftools,htslib,freebayes
* JAVA JARS: gatk,mutserve,haplogrep,haplocheck
* HUMAN ASSEMBLY: hs38DH
$ git clone https://github.com/dpuiu/MitoHPC.git
$ cd MitoHPC/
$ git pull
or
$ git checkout .
$ cd MitoHPC/scripts
$ export HP_SDIR=`pwd` # set script directory variable
$ echo "export HP_SDIR=`pwd`" >> ~/.bashrc # add variable to your ~/.bashrc so it gets initialized automatically
$ . ./init.sh # or one of init.{hs38DH,hg19,mm39}.sh corresponding to a different reference
# needs admin/sudo permissions
$ sudo $HP_SDIR/install_sysprerequisites.sh # install perl, python, java, wget ...
# (unless already installed)
$ $HP_SDIR/install_prerequisites.sh # install bwa,samtools,bedtools ...
# (unless already installed)
or
$ $HP_SDIR/install_prerequisites.sh -f # force install bwa,samtools,bedtools ...
# (latest versions)
$ $HP_SDIR/checkInstall.sh
# if successful => "Success message!"
$ cat checkInstall.log # records software version/path info
# go to your working directory
# copy over the init.sh file
$ cp -i $HP_SDIR/init.sh . # copy init to working dir.
# view/edit init.sh file
$ cat init.sh # check/edit local init file
...
export HP_RNAME=hs38DH # human reference used for alignment (Ex: hs38DH, hg19)
export HP_IN=$PWD/in.txt # input TSV file (automatically generated by running ". ./init.sh")
export HP_ADIR=$PWD/bams/ # pre-existing alignment dir; .bam, .bai, [.idxstats] files
# or
export HP_ADIR=$PWD/crams/ # .cram, .crai, [.idxstats] files
export HP_ODIR=$PWD/out/ # output dir
export HP_L=222000 # Use at most 222K reads (random subsampling); if empty, use all the reads
export HP_M=mutect2 # SNV caller: mutect2, mutserve, or freebayes
export HP_I=2 # number of SNV calling iterations
# 1: one iteration; uses rCRS or RSRS as reference
# 2: two iterations; the 2nd uses the sample consensus
# computed in the 1st iteration; higher accuracy
export HP_FNAME=filter # VCF filter name
export HP_FRULE="egrep -v 'strict_strand|strand_bias|base_qual|map_qual|weak_evidence|slippage|position|HP'" # VCF filtering rule
export HP_SH=bash # job scheduler: bash, qsub, sbatch, ...
$ rm -f $HP_IN # delete the previous $HP_IN ("in.txt") file if necessary
$ . ./init.sh # source init file
$ printenv | grep HP_ | sort # check HP_ variables
$ nano $HP_IN # check input file
$ $HP_SDIR/run.sh > run.all.sh # create command file in working dir.
$ bash ./run.all.sh # execute command file
# or run in parallel
$ grep filter.sh ./run.all.sh | parallel ; $HP_SDIR/getSummary.sh
$ nano init.sh # update parameters
$ . ./init.sh # source init file
$ $HP_SDIR/run.sh > run.all.sh # recreate command file
$ bash ./run.all.sh
# available at
https://hub.docker.com/r/dpuiu1/mitohpc/tags
# note: not fully tested; uses the newest version of prerequisite software + a few other updates
# pull image (if necessary)
$ docker pull dpuiu1/mitohpc
# cd work_directory
# run interactively; use the "PIPELINE USAGE" instructions
$ docker run -e HP_SDIR="/MitoHPC/scripts/" -w $PWD -v $HOME:$HOME -it dpuiu1/mitohpc
# run using the "mitohpc.sh" script
$ docker run -e HP_ADIR="bams" -e HP_ODIR="out" -e HP_IN="in.txt" -w $PWD -v $HOME:$HOME -u "$(id -u):$(id -g)" dpuiu1/mitohpc mitohpc.sh
# or
$ docker run -e HP_ADIR="bams" -e HP_ODIR="out" -e HP_IN="in.txt" -w $PWD -v $HOME:$HOME -u "$(id -u):$(id -g)" dpuiu1/mitohpc cat /MitoHPC/scripts/mitohpc.sh > ./mitohpc.sh
# edit ./mitohpc.sh ; chmod u+x ./mitohpc.sh
$ docker run -e HP_ADIR="bams" -e HP_ODIR="out" -e HP_IN="in.txt" -w $PWD -v $HOME:$HOME -u "$(id -u):$(id -g)" dpuiu1/mitohpc ./mitohpc.sh
# available at
https://cloud.sylabs.io/library/dpuiu/project-dir/mitohpc
# note: not fully tested; uses the newest version of prerequisite software + a few other updates
# pull image (if necessary)
$ singularity pull mitohpc.sif library://dpuiu/project-dir/mitohpc:latest
$ SINGULARITY_PATH=`pwd -P`
# cd work_directory
# run interactively; use the "PIPELINE USAGE" instructions
$ singularity shell --env HP_SDIR="/MitoHPC/scripts/" --pwd "$PWD" $SINGULARITY_PATH/mitohpc.sif
# under $HP_ODIR: TAB/SUMMARY/VCF/FASTA Files:
count.tab # total reads & mtDNA-CN counts
subsample.tab # subsampled read counts
cvg.tab # subsampled coverage stats
# 1st ITERATION
{mutect2,mutserve,freebayes}.{03,05,10}.vcf # SNVs; 3, 5, 10% heteroplasmy threshold; one SNV from one sample per line; last column = sample name
{mutect2,mutserve,freebayes}.{03,05,10}.concat.vcf # SNVs; 3, 5, 10% heteroplasmy threshold; one SNV from one sample per line; GT/AF/SF
{mutect2,mutserve,freebayes}.{03,05,10}.merge.vcf # SNVs; 3, 5, 10% heteroplasmy threshold; one SNV from multiple samples per line
{mutect2,mutserve,freebayes}.{03,05,10}.merge.sitesOnly.vcf
{mutect2,mutserve,freebayes}.{03,05,10}.$HP_FNAME.* # SNVs filtered according to $HP_FRULE
{mutect2,mutserve,freebayes}.{03,05,10}.tab # SNV counts
{mutect2,mutserve,freebayes}.{03,05,10}.summary # SNV count summaries
{mutect2,mutserve,freebayes}.{03,05,10}.pos # position summaries; AC and AF values for both HOM and HET
{mutect2,mutserve,freebayes}.{03,05,10}.suspicious.{tab,ids} # samples with low mtDNA-CN, low coverage, failed haplocheck, or multiple NUMT/HG labels (AF<1)
{mutect2,mutserve,freebayes}.fa # consensus sequence
{mutect2,mutserve,freebayes}.haplogroup?.tab # haplogroup
{mutect2,mutserve,freebayes}.haplocheck.tab # contamination screen
# 2nd ITERATION
{mutect2.mutect2,freebayes.freebayes}.*
Run # Sample name
all # total number of reads in the sample
mapped # number of aligned reads in the sample
MT # number of reads aligned to chrM
M # mtDNA copy number
haplogroup # sample haplogroup
H # number of homozygous SNVs (multiallelic SNVs counted once)
h # number of heterozygous SNVs
S # number of homozygous SNVs (excluding INDELs)
s # number of heterozygous SNVs (excluding INDELs)
I # number of homozygous INDELs
i # number of heterozygous INDELs
...
?p # "p" suffix stands for non homopolimeric
...
A # total number of SNVs (H+h)
CDS.bed.gz : coding regions
COMPLEX.bed.gz : coding regions
RNR.bed.gz : rRNA
TRN.bed.gz : tRNA
DLOOP.bed.gz : D-loop
HV.bed.gz : hypervariable regions
HP.bed.gz : homopolymers
TAG.bed.gz : DLOOP,CDS,RNR,TRN,GAP or NON(coding)
MITIMPACT.vcf.gz : MITIMPACT APOGEE scores
HG.vcf.gz : haplogroup-specific SNVs
dbSNP.vcf.gz : dbSNP
NUMT.vcf.gz : NUMT SNVs
HS.bed.gz : hotspots
NONSYN.vcf.gz : NONSYNONYMOUS codon mutations in CDS regions
STOP.vcf.gz : STOP codon mutations
. #regions min q1 q2 q3 max mean sum
CDS 13 207 525 784 1141 1812 876.54 11395
RNR 2 954 954 1559 1559 1559 1256.50 2513
TRN 22 59 68 69 70 75 68.54 1508
DLOOP 2 576 576 746 746 746 661.00 1322
HV 3 136 136 315 359 359 270.00 810
HP 10 5 9 13 15 22 12.60 126
. #SNVs #POSITIONS
NONSYN 25890 9314
MITIMPACT 24190 9103
STOP 1615 1425
HG 1098 1098
dbSNP 388 359
NUMT 382 381
HS 13 13
3 simulated samples, 100x chrM coverage
$ head $HP_IN
#sampleName inputFile outputPath/prefix
sample_A bams/sample_A.bam out/sample_A/sample_A
sample_B bams/sample_B.bam out/sample_B/sample_B
sample_C bams/sample_C.bam out/sample_C/sample_C
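If the input list ever needs to be rebuilt by hand, a three-column TSV in the format above can be derived from the alignment directory. A minimal sketch (not part of the pipeline), assuming one .bam per sample under bams/:

```shell
# Sketch: build sampleName / inputFile / outputPath-prefix rows from bams/.
for b in bams/*.bam; do
  s=$(basename "$b" .bam)                       # sample name from the file name
  printf '%s\t%s\t%s\n' "$s" "$b" "out/$s/$s"   # name, input file, output prefix
done > in.txt
```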
$ ls out/
sample_A/
sample_B/
sample_C/
$ ls $HP_ADIR/*
bams/sample_A.bam
bams/sample_A.bam.bai
bams/sample_A.idxstats
bams/sample_A.count
$ cat $HP_ODIR/count.tab
Run all_reads mapped_reads MT_reads mtDNA-CN
sample_A 851537886 848029490 396766 182
sample_B 884383716 882213718 506597 223
sample_C 786560467 785208588 503241 249
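The mtDNA-CN column can be approximated from an .idxstats file as twice the ratio of chrM coverage to autosomal coverage. A hedged sketch only (the pipeline's own computation may differ in detail):

```shell
# Sketch: CN ~ 2 * (chrM reads / chrM length) / (autosomal reads / autosomal length)
# Assumes samtools idxstats columns: name, length, mapped reads, unmapped reads.
awk '$1 == "chrM"       { mt_r = $3; mt_l = $2 }
     $1 ~ /^chr[0-9]+$/ { a_r += $3; a_l += $2 }
     END { printf "%.0f\n", 2 * (mt_r / mt_l) / (a_r / a_l) }' sample_A.idxstats
```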
# 1st iteration : homoplasmies (AF=1) and heteroplasmies (AF<1)
$ cat mutect2.03.concat.vcf
...
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="INDEL">
##INFO=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=HP,Number=0,Type=Flag,Description="Homopolymer">
##INFO=<ID=HS,Number=0,Type=Flag,Description="Hot spot">
##INFO=<ID=CDS,Number=0,Type=Flag,Description="coding region">
##INFO=<ID=RNR,Number=0,Type=Flag,Description="rRNA">
##INFO=<ID=TRN,Number=0,Type=Flag,Description="tRNA">
##INFO=<ID=DLOOP,Number=0,Type=Flag,Description="D-loop">
##INFO=<ID=HG,Number=1,Type=String,Description="haplogroup">
##INFO=<ID=NUMT,Number=0,Type=Flag,Description="NUMT like">
##INFO=<ID=HV,Number=0,Type=Flag,Description="Hypervariable">
##INFO=<ID=AP,Number=1,Type=String,Description="Mitimpact APOGEE">
##INFO=<ID=APS,Number=1,Type=String,Description="Mitimpact APOGEE_score">
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chrM 64 . C T . clustered_events;haplotype HV=HV-II;DLOOP;GT=0|1;DP=66;AF=1 SM sample_A
chrM 73 . A G . PASS HV=HV-II;DLOOP;NUMT=MT780574.1|KM281532.1|KM281533.1|KM281524.1;GT=0/1;DP=51;AF=1 SM sample_B
chrM 73 . A G . PASS HV=HV-II;DLOOP;NUMT=MT780574.1|KM281532.1|KM281533.1|KM281524.1;GT=0/1;DP=46;AF=1 SM sample_C
chrM 73 . A G . clustered_events;haplotype HV=HV-II;DLOOP;NUMT=MT780574.1|KM281532.1|KM281533.1|KM281524.1;GT=0|1;DP=70;AF=1 SM sample_A
chrM 146 . T C . clustered_events;haplotype HV=HV-II;DLOOP;GT=0|1;DP=88;AF=1 SM sample_A
...
chrM 1037 . A C . PASS RNR=RNR1;GT=0/1;DP=70;AF=0.178 SM sample_B
chrM 1038 . C G . PASS RNR=RNR1;GT=0/1;DP=70;AF=0.165 SM sample_A
chrM 1042 . T A . PASS RNR=RNR1;GT=0/1;DP=60;AF=0.24 SM sample_C
...
chrM 3988 . T G . PASS CDS=ND1;AP=Pathogenic;APS=0.52;GT=0/1;DP=60;AF=0.224;AP=Pathogenic;APS=0.52 SM sample_B
chrM 3989 . A T . PASS CDS=ND1;AP=Pathogenic;APS=0.52;GT=0/1;DP=65;AF=0.237;AP=Pathogenic;APS=0.52 SM sample_A
chrM 5186 . A T . PASS CDS=ND2;HG=U;AP=Pathogenic;APS=0.51;GT=0|1;DP=97;AF=0.212;AP=Pathogenic;APS=0.51 SM sample_C
...
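Heteroplasmic PASS calls (AF<1) can be pulled straight out of the concat VCF; a minimal awk sketch that parses only the INFO field shown above:

```shell
# Sketch: print CHROM, POS, REF, ALT and AF for PASS records with AF < 1.
awk '!/^#/ && $7 == "PASS" && match($8, /AF=[0-9.]+/) {
       af = substr($8, RSTART + 3, RLENGTH - 3)   # value after "AF="
       if (af + 0 < 1) print $1, $2, $4, $5, af
     }' mutect2.03.concat.vcf
```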
$ cat mutect2.03.merge.vcf
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample_A sample_B sample_C
...
chrM 3988 . T G . . AC=1;AN=2;CDS=ND1;AP=Pathogenic;APS=0.52 GT:DP:AF .:.:. 0/1:60:0.224 .:.:.
chrM 3989 . A T . . AC=1;AN=2;CDS=ND1;AP=Pathogenic;APS=0.52 GT:DP:AF 0/1:65:0.237 .:.:. .:.:.
chrM 3993 . A T . . AC=1;AN=2;CDS=ND1 GT:DP:AF .:.:. .:.:. 0/1:52:0.258
...
$ cat mutect2.03.merge.sitesOnly.vcf
...
#CHROM POS ID REF ALT QUAL FILTER INFO
...
chrM 3988 . T G . . AC=1;AN=2;CDS=ND1;AP=Pathogenic;APS=0.52
chrM 3989 . A T . . AC=1;AN=2;CDS=ND1;AP=Pathogenic;APS=0.52
chrM 3993 . A T . . AC=1;AN=2;CDS=ND1
...
# 1st iteration
$ cat mutect2.03.tab
Run H h S s I i Hp hp Sp sp Ip ip A
sample_A 30 45 28 36 2 9 28 44 28 36 0 8 75
sample_B 26 43 25 34 1 9 22 42 22 34 0 8 69
sample_C 39 43 35 35 4 8 37 43 35 35 2 8 82
# 2nd iteration
$ cat mutect2.mutect2.03.tab
Run H h S s I i Hp hp Sp sp Ip ip A
sample_A 30 44 28 36 2 8 28 44 28 36 0 8 74 # one less heteroplasmy (h) compared with the 1st iteration
sample_B 26 42 25 34 1 8 22 42 22 34 0 8 68 # one less heteroplasmy (h) compared with the 1st iteration
sample_C 39 43 35 35 4 8 37 43 35 35 2 8 82
# 1st iteration
$ cat mutect2.03.summary | column -t
id count nonZero min max median mean sum
H 3 3 26 39 30 31.67 95
h 3 3 43 45 43 43.67 131
S 3 3 25 35 28 29.33 88
s 3 3 34 36 35 35 105
I 3 3 1 4 2 2.33 7
i 3 3 8 9 9 8.67 26
Hp 3 3 22 37 28 29 87
hp 3 3 42 44 43 43 129
Sp 3 3 22 35 28 28.33 85
sp 3 3 34 36 35 35 105
Ip 3 1 0 2 0 0.67 2
ip 3 3 8 8 8 8 24
A 3 3 69 82 75 75.33 226
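Each summary row is just per-sample column statistics; for instance, the H row above can be recomputed from the .tab file with a short awk sketch (H is column 2):

```shell
# Sketch: count/min/max/mean/sum of the homozygous-SNV column (H, column 2).
awk 'NR > 1 { n++; s += $2
              if (n == 1 || $2 < min) min = $2
              if ($2 > max) max = $2 }
     END { printf "H count=%d min=%d max=%d mean=%.2f sum=%d\n", n, min, max, s/n, s }' \
    mutect2.03.tab
```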
$ cat mutect2.haplogroup.tab
Run haplogroup
sample_A A2+(64)
sample_B B2
sample_C C
$ cat mutect2.haplocheck.tab | column -t
Run ContaminationStatus ContaminationLevel Distance SampleCoverage
sample_A NO ND 14 78
sample_B NO ND 14 79
sample_C NO ND 14 78
$ reAnnotateVcf.sh old.vcf new.vcf
-
Edit init.sh: variables HP_RNAME, HP_RMT, HP_RNUMT, HP_RCOUNT, HP_RURL, HP_MTLEN
-
Or use an alternative init.*.sh
# Example: mm39 mouse reference
# INSTALL section: replace ./init.sh with ./init.mm39.sh
cd $HP_SDIR/
mv ./init.sh ./init.default.sh
cp ./init.mm39.sh ./init.sh
-
Rerun "init" scripts (!!! important)
cd $HP_SDIR/
. ./init.sh # init environment variables
./install_prerequisites.sh # downloads the new reference from $HP_RURL, extracts the $HP_MT sequence, creates new reference .fasta, .fai, .dict, .bwt files
-
One has to pre-align the FASTQ reads to the whole-genome reference
# Example: hs38DH human reference
# cd to the work directory; run "init.sh"
cp -i $HP_SDIR/init.sh .
. ./init.sh
# create hs38DH bwa index (only run once)
bwa_index.sh
# create directory paths (symlinks)
ln -s path_to_fastq_directory fastq
mkdir bams
# for each sample: test fwd/rev FASTQ files exist
ls -l fastq/sample_[12].f*q*
# for each sample: run bwa mem
bwa_mem.sh fastq/sample bams/sample
ls bams/
-
By default, SNVs with DP<100 or labeled "strict_strand|strand_bias|base_qual|map_qual|weak_evidence|slippage|position|HP" are filtered out
-
To change the criteria, update the following two variables in the "init.sh" file
export HP_FRULE="egrep -v 'strict_strand|strand_bias|base_qual|map_qual|weak_evidence|slippage|position|HP'" export HP_DP=100
-
Samples with multiple problems (failing haplocheck, multiple NUMTs, multiple heteroplasmies belonging to a different haplogroup, average coverage < HP_DP ...) are saved in "*.suspicious.{id,tab}" files
-
By default, suspicious samples are not removed; users should check these files and manually remove such samples
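A minimal sketch of such a manual removal, assuming the .ids file lists one sample name per line matching column 1 of $HP_IN:

```shell
# Sketch: drop suspicious samples from the input list, then re-run the pipeline.
grep -v -w -f mutect2.03.suspicious.ids "$HP_IN" > in.clean.txt
mv in.clean.txt "$HP_IN"
```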
-
Update the git repository: download scripts/{filterHiFi.sh,fixbcftoolsVcf.pl,bcftools.vcf}; make the *.pl scripts executable
-
Run, replacing the default SNV caller (mutect2) with bcftools, filter.sh with filterHiFi.sh, and the minimum coverage of 100 with 25
$ run.sh | sed 's|mutect2|bcftools|g' | sed 's|filter.sh|filterHiFi.sh|' | sed 's|100|25|g' > run.all.sh