When three SNPs lie in close proximity, with the first homozygous-alt and the other two heterozygous and in trans with each other, GATK incorrectly outputs genotypes and phasing indicating that the two hets are in cis. I haven't tested more broadly (e.g. with more than three variants, or with indels), but my suspicion is that the cause is the first variant in the phase set being homozygous.
This was seen happening on real data from a real sample, but I have also been able to reproduce this with synthetic test data that I can attach here.
Steps to reproduce
I've attached phasing.zip to this issue. It contains a BAM file of synthetic data where I've introduced two variant haplotypes at 50 locations each separated by about 1000 bases. My goal in doing this was just to have a number of different sequence contexts and variant alleles in case that affected anything. It also contains the resulting VCF from running this GATK command using 4.1.4.1:
While the BAM clearly shows the two hets as in trans with one another:
The resulting variant calls are given as in-cis:
chr2 179393825 . C A,<NON_REF> 2686.03 . DP=60;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=216000,60 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,60,0:60:99:0|1:179393825_C_A:2700,181,0,2700,181,2700:179393825:0,0,60,0
chr2 179393826 . T <NON_REF> . . END=179393826 GT:DP:GQ:MIN_DP:PL 0/0:60:99:60:0,120,1800
chr2 179393827 . T G,<NON_REF> 1386.60 . BaseQRankSum=0.000;DP=60;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=216000,60;ReadPosRankSum=0.157 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:25,35,0:60:99:0|1:179393825_C_A:1394,0,944,1470,1050,2519:179393825:25,0,35,0
chr2 179393828 . A <NON_REF> . . END=179393828 GT:DP:GQ:MIN_DP:PL 0/0:60:99:60:0,120,1800
chr2 179393829 . A C,<NON_REF> 936.60 . BaseQRankSum=0.000;DP=60;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=216000,60;ReadPosRankSum=-0.217 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:35,25,0:60:99:0|1:179393825_C_A:944,0,1394,1050,1470,2519:179393825:35,0,25,0
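To make the cis call explicit, here is a minimal Python sketch that reads the phased `GT` and `PS` fields from the three variant records above and reports the relationship between each pair of het calls in the same phase set. The helper names (`parse_sample`, `phase_relation`, `het_relations`) are hypothetical and not part of GATK, and the sketch assumes biallelic diploid calls only.

```python
from itertools import combinations

FORMAT = "GT:AD:DP:GQ:PGT:PID:PL:PS:SB"

# (position, FORMAT, sample column) copied from the variant records above.
RECORDS = [
    (179393825, FORMAT,
     "1|1:0,60,0:60:99:0|1:179393825_C_A:2700,181,0,2700,181,2700:179393825:0,0,60,0"),
    (179393827, FORMAT,
     "0|1:25,35,0:60:99:0|1:179393825_C_A:1394,0,944,1470,1050,2519:179393825:25,0,35,0"),
    (179393829, FORMAT,
     "0|1:35,25,0:60:99:0|1:179393825_C_A:944,0,1394,1050,1470,2519:179393825:35,0,25,0"),
]


def parse_sample(fmt, sample):
    """Map FORMAT keys to the values in one sample column."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def phase_relation(gt_a, gt_b):
    """Return 'cis' or 'trans' for two phased het genotypes, else None."""
    if "|" not in gt_a or "|" not in gt_b:
        return None  # at least one genotype is unphased
    a, b = gt_a.split("|"), gt_b.split("|")
    alt_a = [i for i, allele in enumerate(a) if allele != "0"]
    alt_b = [i for i, allele in enumerate(b) if allele != "0"]
    if len(alt_a) != 1 or len(alt_b) != 1:
        return None  # homozygous (or multi-alt) call: relation is undefined
    return "cis" if alt_a[0] == alt_b[0] else "trans"


def het_relations(records):
    """Pairwise relation between het calls that share a PS value."""
    parsed = [(pos, parse_sample(fmt, sample)) for pos, fmt, sample in records]
    result = {}
    for (pos_a, a), (pos_b, b) in combinations(parsed, 2):
        if a.get("PS") is None or a.get("PS") != b.get("PS"):
            continue
        relation = phase_relation(a["GT"], b["GT"])
        if relation is not None:
            result[(pos_a, pos_b)] = relation
    return result


print(het_relations(RECORDS))
```

For the records above this reports the pair (179393827, 179393829) as cis, since both carry `0|1` in the same phase set; the homozygous-alt call at 179393825 is skipped because cis/trans is undefined for it.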
Expected behavior
A phase set of three variants should be emitted that shows the two het SNPs in-trans (i.e. the ref allele for one in phase with the alt allele for the other).
Actual behavior
A phase set of three variants is emitted with the two het SNPs in-cis (i.e. alt alleles in phase).
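The expected result can be sketched from read-level evidence. The Python below is an illustration only, not GATK's phasing algorithm: `expected_phasing` is a hypothetical helper, and the read list is synthetic, built on the assumption that the 35 alt-supporting reads at 179393827 are the same reads that carry the reference allele at 179393829 (which matches the AD values of 25,35 and 35,25 and the in-trans pattern described for the BAM).

```python
def expected_phasing(read_alleles, ref_a, ref_b):
    """Vote cis vs trans from reads spanning two het sites.

    read_alleles: iterable of (allele_at_site_a, allele_at_site_b), one per read.
    Reads carrying alt+alt or ref+ref support cis; alt+ref or ref+alt support trans.
    Returns 'cis', 'trans', or None on a tie.
    """
    cis = trans = 0
    for allele_a, allele_b in read_alleles:
        a_is_alt = allele_a != ref_a
        b_is_alt = allele_b != ref_b
        if a_is_alt == b_is_alt:
            cis += 1
        else:
            trans += 1
    if cis == trans:
        return None
    return "cis" if cis > trans else "trans"


# Synthetic reads (assumption): site A = 179393827 (T>G), site B = 179393829 (A>C).
reads = [("G", "A")] * 35 + [("T", "C")] * 25

print(expected_phasing(reads, ref_a="T", ref_b="A"))
```

Under that assumption every spanning read supports trans, so the two hets would be expected to carry opposite phased genotypes (for example `1|0` and `0|1`) rather than the `0|1` and `0|1` that was emitted.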
This doesn't really seem related to the other work I've been doing on the phasing code, but I took a look since I've been working in that area. After some testing I can confirm that this does appear to be an error in the phasing algorithm logic, and that it occurs when the first variant in the set of called variants is homozygous alt, as @tfenne suggests. I'll try to come up with a fix and either package it with my fix to #6845 or submit it as a separate PR.
Bug Report
Affected tool(s) or class(es)
HaplotypeCaller when emitting physical phasing.
Affected version(s)
4.1.4.1 (used to produce the attached VCF)