Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BED Region off-by-one error #193

Closed
TimD1 opened this issue Feb 2, 2024 · 4 comments
Closed

BED Region off-by-one error #193

TimD1 opened this issue Feb 2, 2024 · 4 comments

Comments

@TimD1
Copy link

TimD1 commented Feb 2, 2024

Version :

Truvari 4.1.0

Describe the bug :

I spent way too long figuring this out when comparing vcfdist, vcfeval, and Truvari, but there appears to be an off-by-one error in Truvari somewhere. In the following example with BED region reference 10 20, Truvari incorrectly decides which variants to include and exclude.
BED region example

BED files are zero indexed, start inclusive and stop exclusive. VCF files are 1-indexed. If I modify the BED region file to be reference 9 19, Truvari includes the correct variants, but its behavior still differs from vcfeval and vcfdist for variants on the region border. There's no real "correct" answer here, but it would be nice for different tools to have the same behavior.

What do you think should be done for insertions and deletions in this case, and why? For vcfdist I tried to be consistent with vcfeval, but I would consider changing my implementation if you think another option is better.

To Reproduce :
I've attached a zip file containing the BED files, reference, and VCFs for each mutation listed, as well as a script to run vcfeval, vcfdist, or Truvari.
bed_example.zip

@ACEnglish
Copy link
Owner

Thanks for the test data. Looking at the examples, I believe no tool was correct. Truvari's consistency of handling these cases has been updated.

I5 and I6 should be considered outside. Truvari was mishandling them due to how pyintervaltree defines overlaps. The logic for why I5 and I6 are considered outside is also why D2 and D3 are outside. Specifically - the bed boundaries cover the last nucleotide known by the baseline. While the variant D3 claims that VCF position 10 is present, the bed boundaries not covering VCF position 10 means the baseline doesn't. The giab logic to my understanding for some years has been that a variant must start and end within a single includebed region.

reference	4	D1	TATATAT	T	30	.	include=out	GT	1|1 is out
reference	7	D2	ATATGCG	A	30	.	include=out	GT	1|1 is out
reference	9	I1	A	ATATATA	30	.	include=out	GT	1|1 is out
reference	10	S1	T	C	30	.	include=out	GT	1|1 is out
reference	10	I2	T	TGCGCGC	30	.	include=out	GT	1|1 is out
reference	10	D3	TGCGCGC	T	30	.	include=out	GT	1|1 is out
reference	11	S2	G	A	30	.	include=in	GT	1|1 is in
reference	11	I3	G	GCGCGCG	30	.	include=in	GT	1|1 is in
reference	14	D4	CGCGCGC	C	30	.	include=in	GT	1|1 is in
reference	17	D5	GCGCATA	G	30	.	include=out	GT	1|1 is out
reference	19	I4	G	GCGCGCG	30	.	include=in	GT	1|1 is in
reference	20	I5	C	CATATAT	30	.	include=out	GT	1|1 is out
reference	20	S3	C	T	30	.	include=in	GT	1|1 is in
reference	20	D6	CATATAT	C	30	.	include=out	GT	1|1 is out
reference	21	I6	A	ATATATA	30	.	include=out	GT	1|1 is out
reference	21	S3	A	G	30	.	include=out	GT	1|1 is out

@TimD1
Copy link
Author

TimD1 commented Feb 3, 2024

Thanks for the explanation Adam! Would you be able to find a reference supporting the fact that GIAB requires all query variant bases to be within the BED region? I believe you and do see your point that the reference allele includes a base outside the region... I would just like to point to an accepted standard that won't change before I modify how vcfdist operates.

Personally, I find it very unintuitive that D3 would be considered outside and D4 considered inside. Both deletions go up until (but not over) the edge of the BED region; the fact that the preceding reference base is included in D3 is just due to an arbitrary VCF formatting convention. D3 could also be reported as 11 GCGCGCG G (including the ref base after). Or you could not include a reference base in the VCF representation: 11 GCGCGC _. Because that reference base doesn't change, I personally wouldn't count it as part of the variant.

As an aside, how do you think query variants versus ground-truth variants should be handled when using BEDs during benchmarking? If we restrict both to exactly within the BED regions, then if a truth variant is shifted a single base right then the query variant on the BED border will be marked as a FP (even if an excluded truth variant is equivalent). We could add a slop to the truth variants, and then only allow TPs for these truth variants (if a truth variant just outside the region isn't equivalent to a query variant within the region, discard it). Have you considered this before? How does Truvari currently operate?

@ACEnglish
Copy link
Owner

The D3 example has a reference base. But there's nothing in the VCF specification that forces that. So the variant could also be describing a substitution to VCF position 10 (e.g. ref 7 D2 ATATGCG T .... The bed file not including VCF10/Bed9 means any change to VCF10 is unknown and cannot be accurately classified by the benchmark.

I agree that the VCF conventions make this tricky. However, if VCF positions 11 to 15 were deleted, the bed file - which should span all confident baseline variants to be included in the benchmark - would need an 'anchor base' equivalent in order to capture D3. This coupling of the baseline variants to their accompanying bed files is why Truvari doesn't consider baseline variants outside of the regions. However users can allow comparison variants outside of bed regions to contribute to matches with --extend.

@TimD1
Copy link
Author

TimD1 commented Feb 5, 2024

Ah, I see. Basically, because the currently-used benchmarking BED files were generated by converting (included/excluded) VCF variants to regions using a particular methodology, it's important to use the same methodology when converting back and choosing which variants to evaluate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants