Fixes parsing variant annotations for multi-allelic rows #1346
Can one of the admins verify this patch?
Jenkins, test this please
Thanks for reporting this, @majkiw! The test looks good; give me a bit to review the fix, to make sure there aren't other cases where we're off. That's probably the only method in If you could extract a test file from a non-gVCF with multiple alts, that would be helpful. Additionally, we should probably craft one with some pathological cases.
Test PASSed.
LGTM! Thank you for the fix, @majkiw! I will merge this manually.
Thank you, @majkiw!
Thank you guys for processing this PR so quickly :)
When I tried to run the code in master, I got java.lang.ArrayIndexOutOfBoundsException exceptions in org.bdgenomics.adam.converters.VariantContextConverter#fromArrayExtractor whenever a multi-allelic row was encountered. It turned out the index was off by 1 when accessing multivalued annotations.
For example, gvcf_dir/gvcf_multiallelic.g.vcf contains this line:
chr22 18030096 . TAAA T,TA,TAA,<NON_REF> 564.73 . BaseQRankSum=-0.133;ClippingRankSum=-1.438;DP=114;MLEAC=0,1,1,0;MLEAF=0.00,0.500,0.500,0.00;MQ=69.72;MQ0=0;MQRankSum=-0.686;ReadPosRankSum=-0.013 GT:AD:DP:GQ:PL 2/3:13,3,17,17,0:50:86:602,508,1628,86,678,553,137,342,0,281,467,744,353,309,659
ALT is T,TA,TAA,<NON_REF>, with alt allele indexes (as returned by vc.getAlleleIndex(allele)) of 1,2,3,4.
When accessing the annotation MLEAC=0,1,1,0, the code would load indexes 1,2,3 for T,TA,TAA, resulting in 1,1,0. However, there is never a value for the reference allele here! So instead it should be loading indexes 0,1,2, resulting in values 0,1,1.
Incidentally, the test using gvcf_dir/gvcf_multiallelic.g.vcf wasn't breaking for you because there was always an extra 0 as the last value, corresponding to <NON_REF>, which resulted in a silent shift of values. This becomes much easier to break in a "real" VCF, where there is no <NON_REF> anymore.
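To make the index arithmetic concrete, here is a minimal, self-contained sketch of the mapping described above. It is not the ADAM code: the class and method names are hypothetical, and the annotation is modelled as a plain String array with one entry per ALT allele. The second array in main (MLEAC=0,1,1 with no <NON_REF>) is an assumed example of a "real" VCF row, not taken from a test file.

```java
// Hypothetical illustration only; these names do not exist in ADAM or htsjdk.
final class AltAnnotationIndexSketch {

    // A per-ALT annotation has no entry for the reference allele, so an
    // allele index (REF = 0, first ALT = 1) maps to array position index - 1.
    static String valueForAllele(String[] perAltValues, int alleleIndex) {
        if (alleleIndex < 1) {
            throw new IllegalArgumentException("the reference allele has no per-ALT value");
        }
        return perAltValues[alleleIndex - 1];
    }

    // The off-by-one lookup: uses the allele index as the array position.
    static String shiftedValueForAllele(String[] perAltValues, int alleleIndex) {
        return perAltValues[alleleIndex];
    }

    public static void main(String[] args) {
        // gVCF row: ALT = T,TA,TAA,<NON_REF>; MLEAC=0,1,1,0
        String[] gvcfMleac = "0,1,1,0".split(",");
        for (int alleleIndex = 1; alleleIndex <= 3; alleleIndex++) {
            System.out.println("allele " + alleleIndex
                + " corrected=" + valueForAllele(gvcfMleac, alleleIndex)
                + " shifted=" + shiftedValueForAllele(gvcfMleac, alleleIndex));
        }

        // Assumed non-gVCF row: ALT = T,TA,TAA; MLEAC=0,1,1 (no <NON_REF> entry)
        String[] vcfMleac = "0,1,1".split(",");
        try {
            shiftedValueForAllele(vcfMleac, 3);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("shifted lookup fails for the last ALT allele");
        }
    }
}
```

With the trailing <NON_REF> entry present, the shifted lookup never runs past the end of the array, which is why the existing gVCF test did not catch it. Without that entry, the lookup for the last ALT allele throws.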