Softening the boundaries of the bed file regions for matching variants #99

bnoyvert · 2022-02-25T19:06:47Z

bnoyvert
Feb 25, 2022

I would like to suggest "softening" the boundaries of regions used to prefilter variants in truvari bench.

Truvari bench optionally takes a bed file of regions of interest (the --includebed argument). Only truth and comparison set variants that lie fully within the regions are considered for matching. As a result, a variant is often called a false positive or false negative even though it would match a variant in the other set, because that match lies just outside the region. A typical example is a comparison of structural variants called from HG002 long-read data against the truth set of variants in the high-confidence regions described here.

I am suggesting a strategy similar to the one implemented in truvari bench for size matching - only the truth set variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match shorter comparison set variants of size sizefilt (30 by default) or longer.
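
For readers unfamiliar with the size thresholds, the asymmetry can be sketched roughly as below. This is an illustration only, not Truvari's code: the function names are hypothetical, and treating both thresholds as inclusive is an assumption. Only the defaults (50 and 30) come from the text above.

```python
SIZEMIN = 50   # default minimum size for truth (base) variants, per the text
SIZEFILT = 30  # default minimum size for comparison variants, per the text


def keep_base(size: int, sizemin: int = SIZEMIN) -> bool:
    """Hypothetical helper: truth-set variants need at least sizemin bases."""
    return size >= sizemin  # inclusive boundary is an assumption


def keep_comp(size: int, sizefilt: int = SIZEFILT) -> bool:
    """Hypothetical helper: comparison variants only need sizefilt bases."""
    return size >= sizefilt  # inclusive boundary is an assumption


# A 45 bp comparison call may still match a truth variant of 50 bp or more,
# but a 45 bp truth variant is never considered.
print(keep_comp(45), keep_base(45))
```

The looser threshold applies only to the comparison side, which is the same shape as the region proposal that follows.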

I suggest introducing a parameter, extend, for extending the regions on each side. By default it is equal to the refdist parameter. Variants that are not fully included in the extended regions are filtered out from the beginning. Then, if one of two matching variants is fully covered by the original unextended region, the other is allowed a more relaxed overlap: if it is included in the extended region, both variants are called true positives. If neither matching variant is fully covered by the original region, the pair is simply skipped. A variant that has no match and is not fully covered by the original region is skipped as well. To be counted as FP or FN, a variant has to be fully included in the unextended region and have no match in the extended region.
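
The rules in the paragraph above could be sketched like this for a single region. It is a simplified illustration and not the code in the fork: the names `inside`, `extended` and `classify`, the half-open interval convention, and the single-region setup are all assumptions.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, assumed half-open


def inside(variant: Interval, region: Interval) -> bool:
    """True if the variant lies fully within the region."""
    return region[0] <= variant[0] and variant[1] <= region[1]


def extended(region: Interval, extend: int) -> Interval:
    """Pad a region by `extend` bases on each side, not going below 0."""
    return (max(0, region[0] - extend), region[1] + extend)


def classify(variant: Interval, match: Optional[Interval],
             region: Interval, extend: int) -> str:
    """Classify one variant given its candidate match (or None)."""
    wide = extended(region, extend)
    if not inside(variant, wide):
        return "filtered"          # removed before any matching happens
    if match is not None and not inside(match, wide):
        match = None               # the partner was itself filtered out
    if match is None:
        # Unmatched: only counts if fully inside the original region.
        return "FP/FN" if inside(variant, region) else "skipped"
    if inside(variant, region) or inside(match, region):
        return "TP"                # one side is anchored in the original region
    return "skipped"               # both sides sit only in the extension


region = (1000, 2000)
print(classify((1950, 2100), (1900, 1990), region, 500))  # TP
print(classify((2050, 2150), None, region, 500))          # skipped
```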

The approach is implemented in the fork of Truvari here:
bnoyvert@6465454

I would like to submit a merge request, please let me know if you would consider the merge.
Thank you,
Boris

ACEnglish · 2022-02-25T22:49:32Z

ACEnglish
Feb 25, 2022
Maintainer

Hello Boris,

Thank you for your post and interest in Truvari. I’ve reviewed your code and believe it’s a reasonable approach. But I have hesitancies.

1. The sizefilt / sizemin soft-thresholding has the documented side effect of .. giving the call a better chance to be useful and less chance to be detrimental to final statistics. In the interest of fair comparisons, I'm not enthusiastic about these unbalanced parameters that only help boost performance. But I agree the reality is these edge cases do need to be addressed.
2. This may not be relevant for the code currently, but I'm working on an implementation that leverages parallelization to speed up processing. If --extend were implemented it would cause problems in situations where the user's --includebed has regions which overlap after they're extended. If a variant is within two extended regions, it could potentially be double counted.
3. I believe a non-Truvari solution to this problem may be a reasonable ask of users. If there's interest in comparison calls that lie just outside of --includebed regions, one could simply expand the bed file regions, e.g. `awk 'BEGIN{OFS="\t"} {$2 -= 500; $3 += 500; print $0}'`
4. A quick check of HG002 GIAB v0.6 Tier1 SVs found ~1.6K SVs that are PASS and have start/end boundaries within 500bp of a Tier1 bed region. Note that this quick/simple check doesn't fully capture your proposed changes. However, it does point to how opening the door to allow tp-comp calls in 'soft' regions could easily extend to questions about why we don't allow tp-base calls in these 'soft' regions to become potential TPs. This is a 'slippery-slope' argument, but theoretically one could imagine --extend on --extend until the --includebed is expanded to the full chromosome and therefore becomes unnecessary.
5. If I'm understanding the proposed changes correctly, the --extend soft boundaries could make comparisons between runs more difficult. "Then if one of the matching variants is fully covered by the original unextended region then the other is allowed a more relaxed overlap". Imagine we have two replicates of HG002 that we're comparing against GIAB. If in only one of the replicates the caller has a 'soft boundary' TP from GIAB, that replicate will have a higher 'base cnt' than the replicate that doesn't have a matching call. This would make it harder to compare between replicates, since the performance of the first replicate is calculated against an extra base-TP.
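
For anyone who prefers to do the bed expansion mentioned in the list above in Python rather than awk, a minimal sketch could look like the following. The function names are made up for illustration. Unlike the awk one-liner, it clamps the start at 0 and keeps any extra columns; it does not clamp the end to the chromosome length, since that would need a separate table of chromosome sizes.

```python
from typing import Iterable, Iterator


def expand_bed_line(line: str, pad: int = 500) -> str:
    """Pad one tab-separated BED record by `pad` bases on each side."""
    fields = line.rstrip("\n").split("\t")
    start = max(0, int(fields[1]) - pad)  # never go below coordinate 0
    end = int(fields[2]) + pad            # not clamped to chromosome length
    return "\t".join([fields[0], str(start), str(end)] + fields[3:])


def expand_bed(lines: Iterable[str], pad: int = 500) -> Iterator[str]:
    """Expand every record; pass blank, comment and header lines through."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            yield line.rstrip("\n")
        else:
            yield expand_bed_line(line, pad)


for record in expand_bed(["chr1\t100\t900\tregionA\n", "chr1\t5000\t6000\n"]):
    print(record)
```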

I understand the motivation behind this, and I believe you've implemented it well enough. In order to keep the tool simple, I've tried to keep the functionality minimal around "This is how comparison should be done". The --extend feature feels more like "This is how comparison could be done". It's again very reasonable, but given the above points, I don't know if I can adopt this functionality into Truvari.

Have a great day,
~/Adam English

bnoyvert · 2022-03-04T20:47:41Z

bnoyvert
Mar 4, 2022
Author

Thank you @ACEnglish, I appreciate your detailed reply and your point of view. I agree that these edge cases need to be addressed somehow. In my evaluations 10-20% of false negative calls arise because the matching variant in the comparison set is just outside of the high confidence region. Simply extending the regions by 500 bases is not the same as using the --extend 500 option, since then the variants from the extension will contribute to false negative and false positive calls.

> The --extend feature feels more like "This is how comparison could be done"

So do you think one could introduce the --extend option as a non-default one? I.e. extend=0 by default, but can be set to a positive value by the user.

Or alternatively one could think of adding a post-processing script flagging variants in the fn.vcf and fp.vcf files that match variants just outside the regions.

Many thanks and best wishes,
Boris

ACEnglish · 2022-03-04T22:45:24Z

ACEnglish
Mar 4, 2022
Maintainer

That proposal definitely solves most of my concerns. I think I'm becoming convinced. I'll review the code more to see if how you've implemented it is best.

But something we'll still need to address is point number 5. When the --includebed feature was requested by GIAB, we designed it specifically to only include GIAB variants inside the high-confidence regions. So I see the high-confidence bed file as being an extension of the --base variants. This means that base variants outside the --includebed shouldn't be considered equal to those inside. Therefore, I don't think we should allow non-high-confidence calls to become TPs. Plus, this keeps the number of base variants the same between runs.

So --comp calls can sit in the --extend regions and turn a FN into a TP. For the last case I can imagine, perhaps we could mark comparison calls that match a non-hc call so that they aren't considered FPs, i.e. they're neither TP nor FP? Though this makes the summary statistics a little more difficult to interpret: the --comp calls could be one of TP, FP, or sorta-TP-if-not-for-includebed. Thoughts?

5 replies

bnoyvert Mar 5, 2022
Author

> So --comp calls can sit in the --extend regions and turn a FN into a TP.

Yes, I think this would be a fair treatment. SV callers often place the calls in tandem repeats far away from the truth set variant, often outside the designated high confidence region.

But I agree, allowing non-high-confidence --base calls to become TPs could be controversial. From my evaluations I can see that there are just a handful of such cases, so not a big problem. It is perhaps still a good idea to mark the variants in fp.vcf matching --base calls just outside the regions as "sorta-TP-if-not-for-includebed".

I could modify my fork to implement the above approach, or obviously I don't mind if you implement it.

Thank you!

ACEnglish Mar 9, 2022
Maintainer

Hello,

After spending some time thinking about it, I believe we can settle on the minimal features for --extend. By default it is 0, and let's drop the "sorta-TP-if-not-for-includebed" idea. We can revisit it once the FN-to-TP conversion based on --comp calls in the --extend regions is implemented.

I'll leave it to you to produce a pull request to implement your feature. As a note, I would like you to reconsider adding the extra if statements and extra parameters to multiple methods. I haven't thought it through fully, but one example of how this could be designed better: instead of sending --extend to RegionVCFIterator.__init__, it could be sent directly to the iterate method (e.g. comp_i = regions.iterate(comp, args.extend)). And I feel it might be possible to not add a parameter to output_writer. The code you've added inside the bench_main loop, for call in itertools.chain.from_iterable(map(compare_chunk, chunks)):, should be extracted to a method named something like update_MatchResult_based_on_extend. This way the output_writer only outputs a MatchResult and we don't conflate output and data manipulation operations inside a single method.

There are several requirements for a pull request to be accepted:

- CI/CD actions will need to pass (pylint and functests). I recommend following the Docker instructions to run them locally / quickly.
- You'll need to design functional tests which are added to repo_utils/truvari_ssshtests.sh such that your feature has test coverage.
- I'm flexible on what the test coverage is, but please aim for it to be as complete as possible. There is a hard-failure triggered at 85% (which I'm about to raise to 90% since I've had the coverage up to ~91% for a while now).
- Provide a short writeup for the wiki documentation that describes to a user what the new parameter is and how/why to use it (just enough to get someone started; if they have questions we can point them to this discussion).

After all of that is done, we may need to revisit the possibility of this feature conflicting with my previous point number 2: "If --extend were implemented it would cause problems in situations where the user's --includebed has regions which overlap after they're extended." I just pushed a change to develop that starts to address the issue (see the new RegionVCFIterator.merge_overlaps function), but we need to keep an eye on it.
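
To make the overlap concern concrete, here is a rough sketch of extending regions and then merging any that collide, so that a variant cannot fall inside two extended regions. It is not the RegionVCFIterator.merge_overlaps implementation; `extend_and_merge` is a hypothetical standalone function, and merging intervals that merely touch is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def extend_and_merge(regions: Iterable[Tuple[str, int, int]],
                     extend: int) -> Dict[str, List[Tuple[int, int]]]:
    """Pad each (chrom, start, end) region, then merge overlaps per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((max(0, start - extend), end + extend))

    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                # Overlapping (or touching) intervals collapse into one.
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


beds = [("chr1", 1000, 2000), ("chr1", 2600, 3000), ("chr2", 100, 200)]
print(extend_and_merge(beds, 0))    # regions stay separate
print(extend_and_merge(beds, 500))  # the two chr1 regions become one
```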

All of this development is a big ask. But I would like to say up-front: 1. thank you for helping with the project; 2. I'll try to keep an open mind, but as you can tell from the CI/CD requirements and such, I'm trying (probably too) hard to keep the code as clean and well designed as possible. If we can't implement this feature in a straightforward way, it is possible I'll reject a pull request.

Let me know if you have any follow up questions.

Have a great day,
~/Adam

bnoyvert May 9, 2022
Author

Thank you for the recommendations @ACEnglish. I implemented them in the new code, please see bnoyvert@1afc9c1

In summary:

- The default for --extend is 0.
- Overlapping extended regions are merged using your new merge_overlaps method.
- Only --comp variants are allowed to be in the extensions, potentially matching --base variants in the original regions and turning FNs to TPs. --base calls are always in the original regions, and total base count is not affected by the --extend option.
- The code modification is minimal: there are no new parameters in the existing functions and methods.
- A new extend method for the RegionVCFIterator class is added.
- A simple check is added in the bench_main loop, for call in itertools.chain.from_iterable(map(compare_chunk, chunks)), to prevent unmatched --comp calls in extended regions from being counted as FPs.
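
As a rough illustration of the behaviour summarised above (not the code in the commit), the outcome for a single region could be sketched like this. The function names and the "ignored" / "filtered" labels are made up for the example.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, assumed half-open


def inside(call: Interval, region: Interval) -> bool:
    """True if the call lies fully within the region."""
    return region[0] <= call[0] and call[1] <= region[1]


def classify_base(call: Interval, region: Interval, matched: bool) -> str:
    """--base calls are only ever taken from the original region."""
    if not inside(call, region):
        return "filtered"
    return "TP" if matched else "FN"


def classify_comp(call: Interval, region: Interval,
                  extend: int, matched: bool) -> str:
    """--comp calls may also sit in the extension, but never become FPs there."""
    wide = (max(0, region[0] - extend), region[1] + extend)
    if inside(call, region):
        return "TP" if matched else "FP"
    if inside(call, wide):
        # In the extension only: can rescue a FN, cannot add a FP.
        return "TP" if matched else "ignored"
    return "filtered"


region = (1000, 2000)
print(classify_comp((2050, 2100), region, 500, matched=True))   # TP
print(classify_comp((2050, 2100), region, 500, matched=False))  # ignored
print(classify_comp((2050, 2100), region, 0, matched=True))     # filtered
```

With extend set to 0 the extended region equals the original one, so the behaviour reduces to the plain --includebed filtering, and the number of --base calls is the same for every value of extend.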

Please could you have a look? If you are happy with the code I will proceed to design functional tests.

Thank you,
Boris

ACEnglish May 12, 2022
Maintainer

Looks pretty good. Great job! I may still do a little refactoring but that will be more for me to make sure I've got my head around it so that I can deal with future tickets should they arise.

Feel free to pull request the code as is. For the functional tests, you should look into the possibility of just designing a .bed file that pokes all the --extend edge cases for some combination of comparisons between repo_utils/test_files/*.vcf.gz. I'd rather not add more vcfs, but a small bed would be fine. Once you make that, I'm fine adding the test to repo_utils/truvari_ssshtests.sh

bnoyvert May 16, 2022
Author

I added the functional tests using existing vcf files. I have just opened the pull request. I hope it is all fine; let me know if there are any problems.
Thank you,
Boris
