You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi all, I encountered some unexpected results when testing the GATK joint genotype calling pipeline on my dataset.
The number of variants in the final recalibrated VCF (joint_germline_recalibrated.vcf.gz) is exactly double that in the raw VCF (joint_germline.vcf.gz) outputted from GenotypeGVCFs. This is because all variants are completely duplicated due to merging separately recalibrated SNP and indel VCFs, the entries are also not identical as shown below. The fix is to sequentially recalibrate for SNPs then indels, or vice versa, on the same VCF.
Raw VCF counts and example entries:
$ bcftools +counts joint_germline.vcf.gz
Number of samples: 15
Number of SNPs: 13717272
Number of INDELs: 3503929
Number of MNPs: 0
Number of others: 0
Number of sites: 17113477
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 10146 rs375931351 AC A 102.67 . AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;QD=14.67;ReadPosRankSum=0.967;SOR=0.223 GT:AD:DP:GQ:PL
1 10327 rs112750067 T C 106.77 . AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;QD=25.36;SOR=2.833 GT:AD:DP:GQ:PGT:PID:PL:PS
Recalibrated VCF counts and example entries:
$ bcftools +counts joint_germline_recalibrated.vcf.gz
Number of samples: 15
Number of SNPs: 27434544
Number of INDELs: 7007858
Number of MNPs: 0
Number of others: 0
Number of sites: 34226954
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 10146 rs375931351 AC A 102.67 PASS AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=14.67;ReadPosRankSum=0.967;SOR=0.223;VQSLOD=-9.728e-01;culprit=DP GT:AD:DP:GQ:PL
1 10146 rs375931351 AC A 102.67 . AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;QD=14.67;ReadPosRankSum=0.967;SOR=0.223 GT:AD:DP:GQ:PL
1 10327 rs112750067 T C 106.77 . AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;QD=25.36;SOR=2.833 GT:AD:DP:GQ:PGT:PID:PL:PS
1 10327 rs112750067 T C 106.77 VQSRTrancheSNP99.90to100.00 AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.36;SOR=2.833;VQSLOD=-1.242e+00;culprit=SOR GT:AD:DP:GQ:PGT:PID:PL:PS
I used the --merge annotation method and the final annotated VCF (joint_germline_recalibrated_snpEff_VEP.ann.vcf.gz) is missing almost half of the variants compared to the un-annotated VCFs, which was pretty alarming. After tracing the changes to the VCF it appears that VEP is intentionally filtering out common variants via the --filter_common external argument, which removes alleles with global AF > 0.01. While this is not a bug per-se, it would be helpful to make this an optional argument or at least document it somewhere to prevent scares and confusion.
Variant counts following through the annotation pipeline:
$ bcftools +counts joint_germline_snpEff.ann.vcf.gz
Number of samples: 15
Number of SNPs: 13717272
Number of INDELs: 3503929
Number of MNPs: 0
Number of others: 0
Number of sites: 17113477
$ bcftools +counts joint_germline_snpEff_VEP.ann.vcf.gz
Number of samples: 15
Number of SNPs: 5954775
Number of INDELs: 3370807
Number of MNPs: 0
Number of others: 0
Number of sites: 9223960
The text was updated successfully, but these errors were encountered:
Hey @evanwu1119 ! Thanks for reporting and investigating the issues. We can for sure add a parameter to prevent this. However in the meantime, you can also add a custom config for VEP that removes the filter_common parameter
Hi all, I encountered some unexpected results when testing the GATK joint genotype calling pipeline on my dataset.
joint_germline_recalibrated.vcf.gz
) is exactly double that in the raw VCF (joint_germline.vcf.gz
) outputted from GenotypeGVCFs. This is because all variants are completely duplicated due to merging separately recalibrated SNP and indel VCFs, the entries are also not identical as shown below. The fix is to sequentially recalibrate for SNPs then indels, or vice versa, on the same VCF.Raw VCF counts and example entries:
Recalibrated VCF counts and example entries:
--merge
annotation method and the final annotated VCF (joint_germline_recalibrated_snpEff_VEP.ann.vcf.gz
) is missing almost half of the variants compared to the un-annotated VCFs, which was pretty alarming. After tracing the changes to the VCF it appears that VEP is intentionally filtering out common variants via the--filter_common
external argument, which removes alleles with global AF > 0.01. While this is not a bug per-se, it would be helpful to make this an optional argument or at least document it somewhere to prevent scares and confusion.Variant counts following through the annotation pipeline:
The text was updated successfully, but these errors were encountered: