-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BED Region off-by-one error #193
Comments
Thanks for the test data. Looking at the examples, I believe no tool was correct. Truvari's consistency of handling these cases has been updated. I5 and I6 should be considered outside. Truvari was mishandling them due to how pyintervaltree defines overlaps. The logic for why I5 and I6 are considered outside is also why D2 and D3 are outside. Specifically - the bed boundaries cover the last nucleotide known by the baseline. While the variant D3 claims that VCF position 10 is present, the bed boundaries not covering VCF position 10 means the baseline doesn't. The giab logic to my understanding for some years has been that a variant must start and end within a single includebed region.
|
Thanks for the explanation Adam! Would you be able to find a reference supporting the fact that GIAB requires all query variant bases to be within the BED region? I believe you and do see your point that the reference allele includes a base outside the region... I would just like to point to an accepted standard that won't change before I modify how vcfdist operates. Personally, I find it very unintuitive that D3 would be considered outside and D4 considered inside. Both deletions go up until (but not over) the edge of the BED region; the fact that the preceding reference base is included in D3 is just due to an arbitrary VCF formatting convention. D3 could also be reported as 11 GCGCGCG G (including the ref base after). Or you could not include a reference base in the VCF representation: 11 GCGCGC _. Because that reference base doesn't change, I personally wouldn't count it as part of the variant. As an aside, how do you think query variants versus ground-truth variants should be handled when using BEDs during benchmarking? If we restrict both to exactly within the BED regions, then if a truth variant is shifted a single base right then the query variant on the BED border will be marked as a FP (even if an excluded truth variant is equivalent). We could add a slop to the truth variants, and then only allow TPs for these truth variants (if a truth variant just outside the region isn't equivalent to a query variant within the region, discard it). Have you considered this before? How does Truvari currently operate? |
The D3 example has a reference base. But there's nothing in the VCF specification that forces that. So the variant could also be describing a substitution to VCF position 10 (e.g. I agree that the VCF conventions make this tricky. However, if VCF positions 11 to 15 were deleted, the bed file - which should span all confident baseline variants to be included in the benchmark - would need an 'anchor base' equivalent in order to capture D3. This coupling of the baseline variants to their accompanying bed files is why Truvari doesn't consider baseline variants outside of the regions. However users can allow comparison variants outside of bed regions to contribute to matches with |
Ah, I see. Basically, because the currently-used benchmarking BED files were generated by converting (included/excluded) VCF variants to regions using a particular methodology, it's important to use the same methodology when converting back and choosing which variants to evaluate? |
Version :
Truvari 4.1.0
Describe the bug :
I spent way too long figuring this out when comparing vcfdist, vcfeval, and Truvari, but there appears to be an off-by-one error in Truvari somewhere. In the following example with BED region
reference 10 20
, Truvari incorrectly decides which variants to include and exclude.BED files are zero indexed, start inclusive and stop exclusive. VCF files are 1-indexed. If I modify the BED region file to be
reference 9 19
, Truvari includes the correct variants, but its behavior still differs from vcfeval and vcfdist for variants on the region border. There's no real "correct" answer here, but it would be nice for different tools to have the same behavior.What do you think should be done for insertions and deletions in this case, and why? For vcfdist I tried to be consistent with vcfeval, but I would consider changing my implementation if you think another option is better.
To Reproduce :
I've attached a zip file containing the BED files, reference, and VCFs for each mutation listed, as well as a script to run vcfeval, vcfdist, or Truvari.
bed_example.zip
The text was updated successfully, but these errors were encountered: