HaplotypeCaller GGA mode crashes ("allele ... not in the variant context" exception) for alleles input including certain split multiallelic records and genotypes #5355
Comments
Hi @gmagoon, I'm taking a look at this but haven't been able to reproduce it so far. Is there any chance you could share a snippet of your |
Hi @cwhelan, In my testing, the issue doesn't seem to be sensitive to the choice of BAM (e.g. same crash occurs with a BAM that doesn't have any read coverage at this locus). If you haven't been able to reproduce the issue, I'm wondering if the critical factor may be some aspect of the particular genotypes in the |
Thanks @gmagoon, that helped me figure out the cause of the problem (the samples needed not only to have genotypes but also to have the right combination of alleles in their genotypes). The root cause of this issue was similar to #5336 but cropped up at a different place. I'll prepare a pull request and get this fixed shortly.
Awesome, that's great @cwhelan . Out of curiosity, is there a particular reason that HaplotypeCaller GGA is considering the genotypes in the |
Short answer: No, it's not using any information from the genotypes at all. You should get the same results with or without genotypes. Long technical answer: Both of these issues are related to the fact that when combining multiple possible variants at a site (i.e. variants with different alleles from the GGA |
I see, that makes sense... thanks very much @cwhelan
I came across a case where --alleles input containing a set of records involving a multiallelic STR locus, split into two records and carrying genotypes, causes HaplotypeCaller (in GENOTYPE_GIVEN_ALLELES or "GGA" mode) to crash.
The --alleles input (originating from the 1000 Genomes Phase 3 v5a call set) has the multiallelic STR split into a record associated with insertions and a record with a deletion, along with a bunch of phased genotypes:
Running
$HOME/gatk-4.0.11.0/gatk --java-options "-Xmx4g" HaplotypeCaller -R $HOME/GRCh37files/hs37d5.fa -I /mnt/fast/test.bam -O test.out.vcf.gz -L 22 --genotyping-mode GENOTYPE_GIVEN_ALLELES --alleles test.vcf.gz
the resulting error is:
This is with version 4.0.11.0. Unlike #5336, it doesn't seem to be related to #4963, since version 4.0.5.1 has the same issue.
Now in my view, it would be better for the multiallelic STR records to be combined into a single multiallelic record (as HaplotypeCaller would do with the output), as follows:
And indeed, with an --alleles input containing the single condensed record, HaplotypeCaller runs without error.
Additionally, omitting the genotypes also runs without error:
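For anyone wanting to try the genotype-free workaround, here is a minimal Python sketch of turning VCF text into a sites-only version. It is not the exact procedure used above: the function name `drop_genotypes` is hypothetical, and it assumes an uncompressed, tab-delimited VCF whose first eight columns are the standard fixed fields. In practice a tool such as `bcftools view -G` is probably the more robust route.

```python
def drop_genotypes(vcf_lines):
    """Return sites-only VCF lines (hypothetical helper, illustration only).

    Keeps the eight fixed columns (CHROM..INFO) and drops the FORMAT
    and per-sample genotype columns. '##' meta lines pass through unchanged.
    """
    sites_only = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are kept as-is.
            sites_only.append(line)
        else:
            # Applies to both the '#CHROM' header line and the data records.
            sites_only.append("\t".join(line.split("\t")[:8]))
    return sites_only


# Made-up record for illustration; NOT the record from this issue.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "22\t1000\t.\tCA\tCAA\t.\tPASS\t.\tGT\t0|1\t1|0",
]
for row in drop_genotypes(example):
    print(row)
```

The output would still need to be bgzip-compressed and indexed before being passed to --alleles.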
So it seems to be the combination of split-record multiallelics and genotypes in the --alleles file that is problematic here. This is probably an edge case by most definitions (and straightforward to work around by either omitting genotypes or condensing multiallelics into a single record), but I figured it was worth pointing out.
I should probably also add that many other split multiallelics seem to be processed fine, without a crash, e.g.:
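As a rough illustration of the condensing workaround mentioned above, the sketch below merges split records at one position into a single multiallelic site. It rests on several assumptions: the function `condense_records` and the tuple layout are hypothetical, every record's REF is a prefix of the longest REF at that position, alleles are plain base strings (no symbolic alleles), and genotypes are discarded rather than re-indexed. A dedicated tool such as `bcftools norm -m +any` handles the general case.

```python
def condense_records(records):
    """Merge split records at one site into one multiallelic site.

    Hypothetical helper for illustration. Each record is a tuple
    (chrom, pos, ref, alts). Returns (chrom, pos, ref, alts) for the
    merged site; genotypes are not handled here.
    """
    chrom, pos = records[0][0], records[0][1]
    if any((r[0], r[1]) != (chrom, pos) for r in records):
        raise ValueError("records must share CHROM and POS")

    # The merged REF must span the longest REF among the inputs.
    longest_ref = max((r[2] for r in records), key=len)

    merged_alts = []
    for _, _, ref, alts in records:
        if not longest_ref.startswith(ref):
            raise ValueError("each REF must be a prefix of the longest REF")
        # Pad each ALT with the REF bases that this record did not cover.
        suffix = longest_ref[len(ref):]
        for alt in alts:
            padded = alt + suffix
            if padded not in merged_alts:
                merged_alts.append(padded)
    return chrom, pos, longest_ref, merged_alts


# Made-up alleles for illustration; NOT the STR records from this issue.
split = [
    ("22", 1000, "C", ["CA", "CAA"]),  # insertion record
    ("22", 1000, "CA", ["C"]),         # deletion record
]
print(condense_records(split))
```

Padding the insertion alleles with the trailing REF base is what lets the insertions and the deletion share one REF allele in the merged record.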