
No true positives being found on some chromosomes #91

Closed
fbursa opened this issue Dec 20, 2021 · 5 comments
@fbursa

fbursa commented Dec 20, 2021

For some chromosomes no true positives are found at all. For other chromosomes the behaviour is normal.

I am using versions 3.0 and 3.1.0-dev (git show-ref --hash HEAD: 0aeccee), with the datasets at
https://labs.epi2me.io/gm24385_q20_2021.10/
downloaded using
aws s3 sync --no-sign-request s3://ont-open-data/gm24385_q20_2021.10/extra_analysis/structural_variants structural_variants
With version 3.0 the behaviour is as I would expect, i.e. similar proportions of FPs, FNs and TPs on every chromosome, and recall and precision > 90%. However, with version 3.1.0-dev recall and precision are down to ~50%.

Summary results files:
summary_3.0.txt
summary_dev.txt

Examining the output VCFs, this appears to be due to all calls on chromosomes 10-22 (i.e. all the double-digit ones - possibly not a coincidence?) being reported as FPs or FNs:

grep -v ^# results_dev/tp-base.vcf | cut -f 1 | sort | uniq -c
711 1
701 2
565 3
654 4
519 5
557 6
545 7
540 8
380 9
234 X
5 Y

grep -v ^# results_dev/fn.vcf | cut -f 1 | sort | uniq -c
15 1
476 10
453 11
467 12
394 13
267 14
243 15
262 16
341 17
245 18
323 19
16 2
266 20
177 21
184 22
7 3
9 4
9 5
12 6
15 7
9 8
5 9
3 X

grep -v ^# results_dev/fp.vcf | cut -f 1 | sort | uniq -c
24 1
492 10
454 11
463 12
399 13
278 14
249 15
271 16
348 17
238 18
333 19
40 2
261 20
180 21
190 22
24 3
32 4
25 5
35 6
28 7
25 8
28 9
19 X
2 Y

The command I am using is
truvari bench --passonly -b HG002_SVs_Tier1_v0.6.vcf.gz --includebed eval_high_conf.bed -p 0 -c cutesv_filtered_sorted.vcf.gz -f human_g1k_v37.fasta -o results_3.0

@ACEnglish
Owner

Hello,

I'm able to recreate this error. Thanks for the informative ticket.
As to why this is happening: The base and comparison VCF files are sorted differently, i.e.

$ zgrep -v "#" HG002_SVs_Tier1_v0.6.vcf.gz | cut -f1 | uniq -c | head -n3
5397 1
5684 2
4257 3
$ zgrep -v "#" cutesv_filtered_sorted.vcf.gz | cut -f1 | uniq -c | head -n3
3241 1
2125 10
2029 11
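The difference between the two orderings is easy to see in miniature: VCF contig names are strings, so a plain string sort puts "10" before "2". A small illustration (plain Python, not truvari code; the `natural_key` helper is just an example of how a numeric ordering can be recovered):

```python
# Chromosome names are strings, so a plain sort is lexicographic:
chroms = ["1", "2", "3", "10", "11", "22", "X"]
print(sorted(chroms))  # ['1', '10', '11', '2', '22', '3', 'X']

# A natural/numeric key restores the order the base VCF uses,
# putting numeric contigs first in integer order, then the rest:
def natural_key(c):
    return (0, int(c)) if c.isdigit() else (1, c)

print(sorted(chroms, key=natural_key))  # ['1', '2', '3', '10', '11', '22', 'X']
```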

3.1-dev has a new 'file zipper' iteration technique that improved performance and enabled some new features. However, it's clearly broken here. In the short term, if you'd like to continue using 3.1-dev, you'd simply need to run the same sorting/indexing command on the base VCF as your pipeline runs on the comparison VCF. You'd also be okay staying with 3.0.1 for now, as 3.1 isn't supposed to change any results, just expand the bench annotations.
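As a rough sketch of why a zipper-style merge cares about sort order (plain Python using `heapq.merge`, not truvari's actual implementation): a k-way merge assumes every input stream is sorted the same way. When the base VCF is numerically sorted but the comparison VCF is lexicographically sorted, records from the same contig no longer come out adjacent, so a matcher walking the merged stream never sees both files' calls for chromosomes 10-22 together.

```python
import heapq

# Hypothetical miniature records: (contig, pos) tuples standing in for VCF lines.
# The base stream is numerically sorted (1, 2, 10); the comparison stream is
# lexicographically sorted (1, 10, 2) -- mirroring the zgrep output above.
base = [("1", 100), ("2", 50), ("10", 75)]
comp = [("1", 120), ("10", 60), ("2", 90)]

# A 'file zipper' is essentially a k-way merge over pre-sorted streams.
merged = list(heapq.merge(base, comp))

# Because `base` is not sorted under string comparison, the merged stream is
# not globally ordered: the two chr10 records end up separated by a chr2
# record, so lockstep pairing of base vs comparison calls breaks down.
print(merged)
```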

I'll keep this ticket open for referencing commits while I work to fix this.

ACEnglish added a commit that referenced this issue Dec 20, 2021
Custom wrappers of pysam.VariantFile iterators work for now, for bench.
Anywhere truvari.bench.file_zipper is used (e.g. collapse), I'll need to do the same wrapping.
This will be considered technical debt until I can figure out a better long-term solution.

Also, the Dockerfile somehow became outdated/broken. Updated it to use pip instead of easy_install to fix.
Minor pylint change in vcf2df and formatting in setup.py
ACEnglish added a commit that referenced this issue Dec 21, 2021
Removed custom_iterator and refactored GenomeTree into RegionVCFIter.
Better code reuse, and no need for the weird __next__/__iter__.
Should be faster, particularly with includebeds.
Also found an error where casting TruScore from float to int lost precision and discarded optimal matches.
@ACEnglish
Owner

Fixed and verified identical results on your test data between 3.0.1 and develop. Please let me know if you run into other issues and I'll be happy to help.

@fbursa
Author

fbursa commented Dec 21, 2021 via email

@ACEnglish
Owner

Quick follow-up. While trying to fix this and debugging with your data, I noticed an opportunity to improve some things. For your data, not much is different since the calls are of high quality. However, a few calls now find a better, 'closer' match. You should be able to recreate this by checking out develop. If you're interested in seeing how I used your data to illustrate the new 'TruScore' and INS reciprocal overlap, see discussions #92 #93

@fbursa
Author

fbursa commented Dec 22, 2021 via email
