This document outlines all the steps and considerations for calling and merging a trio using DeepVariant and GLnexus. These best practices were developed and evaluated as described in the bioRxiv preprint Accurate, scalable cohort variant calls using DeepVariant and GLnexus.
The process involves 3 major stages: running DeepVariant to create individual genome call sets, running GLnexus to merge call sets, and analyzing the merged call set.
NOTE: This case study demonstrates an example of how to run DeepVariant end-to-end on one machine. The steps below were done on a machine with this example command to start a machine.
The steps in this document can be extended to merge larger cohorts as well.
See this workflow:
A few things to note before we start:
- If you are looking for ways to run DeepVariant in larger batches, please refer to the third party solutions section.
- It is recommended to use BAM files with original quality scores. In the case
that BAM files went through recalibration, optional DV flags can be used in
order to use original scores:
--parse_sam_aux_fields
,--use_original_quality_scores
. - DeepVariant optionally allows gVCF output. This option is required for further GLnexus analysis in this document.
The Whole Exome Sequencing (WES) dataset we're using is from:
ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/
- HG002_NA24385_son
- HG003_NA24149_father
- HG004_NA24143_mother
Just for convenience, we use aria2 to download our data. You can change it to whatever other tools (wget, curl) that you prefer.
To install aria2, you can run: sudo apt-get -y install aria2
DIR="${PWD}/trio"
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam -o HG002.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bai -o HG002.bai
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG003-EEogPU_v02-KIT-Av5_TCTTCACA_L008.posiSrt.markDup.bam -o HG003.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG003-EEogPU_v02-KIT-Av5_TCTTCACA_L008.posiSrt.markDup.bai -o HG003.bai
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG004-EEogPU_v02-KIT-Av5_CCGAAGTA_L008.posiSrt.markDup.bam -o HG004.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG004-EEogPU_v02-KIT-Av5_CCGAAGTA_L008.posiSrt.markDup.bai -o HG004.bai
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/hs37d5.fa.gz
gunzip ${DIR}/hs37d5.fa.gz
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/hs37d5.fa.fai
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/agilent_sureselect_human_all_exon_v5_b37_targets.bed
There have been newer version of the truth files, including v4.1, GRCh37 for HG002, and v4.2, GRCh38 for HG002-4. In the future we will plan to update this documentation with newer versions.
HG002:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz -o HG002_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz.tbi -o HG002_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -o HG002_truth.bed
HG003:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv3.3.2/GRCh37/HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf.vcf.gz -o HG003_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv3.3.2/GRCh37/HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf.vcf.gz.tbi -o HG003_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv3.3.2/GRCh37/HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -o HG003_truth.bed
HG004:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv3.3.2/GRCh37/HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf.vcf.gz -o HG004_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv3.3.2/GRCh37/HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf.vcf.gz.tbi -o HG004_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv3.3.2/GRCh37/HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -o HG004_truth.bed
Here are example commands used to install these tools. You should also look at http://www.htslib.org for official instructions.
sudo apt-get install -y build-essential libncurses5-dev zlib1g-dev libbz2-dev liblzma-dev tabix
wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
tar -xvf samtools-1.9.tar.bz2
pushd samtools-1.9 && ./configure && make && sudo make install && popd
wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
tar -xvf bcftools-1.9.tar.bz2
pushd bcftools-1.9 && ./configure && make && sudo make install && popd
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar -xvf htslib-1.9.tar.bz2
pushd htslib-1.9 && ./configure && make && sudo make install && popd
First, install docker if you don't have it yet: sudo apt-get -y install docker.io
With the example command below, it runs DeepVariant on the trio one by one. This is for demonstration only. If you're running this on a large cohort, running serially is not the most effective approach.
N_SHARDS=$(nproc) # Or change to the number of cores you want to use
CAPTURE_BED=agilent_sureselect_human_all_exon_v5_b37_targets.bed
VERSION=1.1.0
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
BAM=${SAMPLE}.bam
OUTPUT_VCF=${SAMPLE}.vcf.gz
OUTPUT_GVCF=${SAMPLE}.g.vcf.gz
time sudo docker run \
-v "${DIR}":"/data" \
google/deepvariant:${VERSION} \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WES \
--ref="/data/hs37d5.fa" \
--reads="/data/${BAM}" \
--regions="/data/${CAPTURE_BED}" \
--output_vcf="/data/${OUTPUT_VCF}" \
--output_gvcf="/data/${OUTPUT_GVCF}" \
--num_shards=${N_SHARDS}
done
And then run GLnexus with this config:
sudo docker pull quay.io/mlin/glnexus:v1.2.7
time sudo docker run \
-v "${DIR}":"/data" \
quay.io/mlin/glnexus:v1.2.7 \
/usr/local/bin/glnexus_cli \
--config DeepVariantWES \
--bed "/data/${CAPTURE_BED}" \
/data/HG004.g.vcf.gz /data/HG003.g.vcf.gz /data/HG002.g.vcf.gz \
| bcftools view - | bgzip -c > ${DIR}/deepvariant.cohort.vcf.gz
When we ran on this WES trio, it took only about 13 seconds. For more details on performance, see GLnexus performance guide.
For a WGS cohort, we recommend using --config DeepVariantWGS
instead of
DeepVariantWES
. Another preset DeepVariant_unfiltered
is available in
glnexus:v1.2.7
or later versions for merging DeepVariant gVCFs with no QC
filters or genotype revision (see
GitHub issue #326 for a
potential use case). The details of these presets can be found
here.
Create an SDF template from our reference file:
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools format \
-o /data/hs37d5.sdf /data/hs37d5.fa
Create a PED file $DIR/trio.ped
that looks like this (with the sample name
of the trio):
#PED format pedigree
#
#fam-id/ind-id/pat-id/mat-id: 0=unknown
#sex: 1=male; 2=female; 0=unknown
#phenotype: -9=missing, 0=missing; 1=unaffected; 2=affected
#
#fam-id ind-id pat-id mat-id sex phen
1 Sample_Diag-excap51-HG002-EEogPU Sample_Diag-excap51-HG003-EEogPU Sample_Diag-excap51-HG004-EEogPU 1 0
1 Sample_Diag-excap51-HG003-EEogPU 0 0 1 0
1 Sample_Diag-excap51-HG004-EEogPU 0 0 2 0
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools mendelian \
-i /data/deepvariant.cohort.vcf.gz \
-o /data/deepvariant.annotated.vcf.gz \
--pedigree=/data/trio.ped \
-t /data/hs37d5.sdf \
| tee ${DIR}/deepvariant.input_rtg_output.txt
The output is:
Checking: /data/deepvariant.cohort.vcf.gz
Family: [Sample_Diag-excap51-HG003-EEogPU + Sample_Diag-excap51-HG004-EEogPU] -> [Sample_Diag-excap51-HG002-EEogPU]
1 non-pass records were skipped
Concordance Sample_Diag-excap51-HG002-EEogPU: F:58583/59080 (99.16%) M:58912/59051 (99.76%) F+M:58299/58951 (98.89%)
Sample Sample_Diag-excap51-HG002-EEogPU has less than 99.0 concordance with both parents. Check for incorrect pedigree or sample mislabelling.
839/59304 (1.41%) records did not conform to expected call ploidy
59206/59304 (99.83%) records were variant in at least 1 family member and checked for Mendelian constraints
203/59206 (0.34%) records had indeterminate consistency status due to incomplete calls
667/59206 (1.13%) records contained a violation of Mendelian constraints
From this report, we know that there is a 1.13% Mendelian violation rate, and
0.34% of the records had incomplete calls (with .
) so RTG couldn't determine
whether there is violation or not.
In addition to the cohort quality statistics, for completeness we generate single-sample quality metrics.
We run bcftools stats
on the 3 VCF outputs. Since our DeepVariant run already
constrained to just the capture regions, no need to specify it again here. We
had to pass in the -f PASS
flag so that only the PASS calls are considered.
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
bcftools stats -f PASS \
${DIR}/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.stats
done
Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st ALT) | [7]tv (1st ALT) | [8]ts/tv (1st ALT) |
---|---|---|---|---|---|---|
HG002 | 29964 | 11677 | 2.57 | 29951 | 11656 | 2.57 |
HG003 | 29830 | 11761 | 2.54 | 29822 | 11740 | 2.54 |
HG004 | 30059 | 11839 | 2.54 | 30049 | 11826 | 2.54 |
If you want to restrict to the truth BED files, use this command:
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
bcftools stats -f PASS \
-T ${DIR}/${SAMPLE}_truth.bed \
${DIR}/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.with_truth_bed.stats
done
Which resulted in this table:
Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st ALT) | [7]tv (1st ALT) | [8]ts/tv (1st ALT) |
---|---|---|---|---|---|---|
HG002 | 24473 | 9254 | 2.64 | 24468 | 9244 | 2.65 |
HG003 | 24170 | 9184 | 2.63 | 24167 | 9176 | 2.63 |
HG004 | 24310 | 9336 | 2.60 | 24303 | 9329 | 2.61 |
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools vcfstats \
/data/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.vcfstats
done
which shows the following:
HG002:
Location : /data/HG002.vcf.gz
Failed Filters : 14476
Passed Filters : 45376
SNPs : 41607
MNPs : 0
Insertions : 1908
Deletions : 1840
Indels : 20
Same as reference : 1
SNP Transitions/Transversions: 2.57 (41826/16299)
Total Het/Hom ratio : 1.51 (27266/18109)
SNP Het/Hom ratio : 1.52 (25111/16496)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.12 (1006/902)
Deletion Het/Hom ratio : 1.59 (1129/711)
Indel Het/Hom ratio : - (20/0)
Insertion/Deletion ratio : 1.04 (1908/1840)
Indel/SNP+MNP ratio : 0.09 (3768/41607)
HG003:
Location : /data/HG003.vcf.gz
Failed Filters : 15262
Passed Filters : 45259
SNPs : 41562
MNPs : 0
Insertions : 1887
Deletions : 1792
Indels : 18
Same as reference : 0
SNP Transitions/Transversions: 2.52 (41610/16524)
Total Het/Hom ratio : 1.50 (27137/18122)
SNP Het/Hom ratio : 1.51 (25010/16552)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.15 (1011/876)
Deletion Het/Hom ratio : 1.58 (1098/694)
Indel Het/Hom ratio : - (18/0)
Insertion/Deletion ratio : 1.05 (1887/1792)
Indel/SNP+MNP ratio : 0.09 (3697/41562)
HG004:
Location : /data/HG004.vcf.gz
Failed Filters : 14923
Passed Filters : 45590
SNPs : 41873
MNPs : 0
Insertions : 1895
Deletions : 1801
Indels : 20
Same as reference : 1
SNP Transitions/Transversions: 2.55 (41650/16332)
Total Het/Hom ratio : 1.58 (27939/17650)
SNP Het/Hom ratio : 1.60 (25781/16092)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.17 (1022/873)
Deletion Het/Hom ratio : 1.63 (1116/685)
Indel Het/Hom ratio : - (20/0)
Insertion/Deletion ratio : 1.05 (1895/1801)
Indel/SNP+MNP ratio : 0.09 (3716/41873)
sudo docker pull pkrusche/hap.py
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
sudo docker run -i \
-v "${DIR}":"/data" \
pkrusche/hap.py /opt/hap.py/bin/hap.py \
"/data/${SAMPLE}_truth.vcf.gz" \
"/data/${SAMPLE}.vcf.gz" \
-f "/data/${SAMPLE}_truth.bed" \
-T "/data/${CAPTURE_BED}" \
-r "/data/hs37d5.fa" \
-o "/data/${SAMPLE}.happy.output" \
--engine=vcfeval > ${DIR}/${SAMPLE}.stdout
done
Accuracy F1 scores:
Sample | Indel | SNP |
---|---|---|
HG002 | 0.972570 | 0.999110 |
HG003 | 0.971865 | 0.999130 |
HG004 | 0.973802 | 0.999361 |