Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Truvari bench finds no calls in vcf file. #137

Closed
Irallia opened this issue Oct 20, 2022 · 5 comments
Closed

Truvari bench finds no calls in vcf file. #137

Irallia opened this issue Oct 20, 2022 · 5 comments

Comments

@Irallia
Copy link

Irallia commented Oct 20, 2022

Version :

Truvari v3.0.0

Describe the bug :

Hello.
I am currently simulating a vcf file with the help of Mason and a fastq file from it. My tool calculates its own vcf from the fastq file and with truvari bench I compare the truth (the vcf from Mason) with my own.
My problem: Mason's vcf does not seem to meet truvari's requirements, as truvari does not find any calls:

2022-10-19 17:20:49,926 [INFO] Matching base to calls
2022-10-19 17:20:49,968 [WARNING] No TP or FN calls in base!
2022-10-19 17:20:49,968 [INFO] Parsing FPs from calls
2022-10-19 17:20:49,994 [WARNING] No TP or FN calls in base!

However, I don't get any error message and therefore wanted to ask which header information and which column information truvari accesses.

To Reproduce :
Here is my simulation script: https://github.com/Irallia/iGenVar/blob/TEST/benchmarks/dataset_comparison_simulate_data/test/benchmark/simulation/mason_simulation.sh
I can also send you my created vcf file for running truvari bench with it.

Expected behavior :
A error message like: Row 80 in your base vcf file was not accepted because the QUAL field was corrupted. would be super helpful.

Example Data :
As the files are huge, please ask and I give you access.

@ACEnglish
Copy link
Owner

Hello,

Those warnings are just warnings, not errors. Truvari will throw errors if there is a problem with the VCF such as corrupted qual fields. Instead, the warnings are telling us that the --base VCF file has no SVs. Please check the file and make sure there are SVs. For Truvari to consider a VCF entry to be an SV, it has to pass some set of conditionals:

  • The call must be between --sizemin and --sizemax in length. See docs for how size is determined
  • The call must be non-reference-homozygous (0/0) when using --no-ref
  • The call must have a PASS for FILTER if --passonly
  • The call's start and end must land within a region from the --includebed if it was provided

It's possible I'm forgetting something, so please check these things and if you're still having issues I will need to see the VCF. Perhaps not the entire file, but just your hg38_SV_simulation_noSNP_sorted_fixed.vcf.gz if that file is used as the --base or the other non-mason VCF I believe you described you're comparing against. I wouldn't need to see the entire file, just a handful of entries you believe are SVs but Truvari is disagreeing.

Have a great day,
~/Adam English

@Irallia
Copy link
Author

Irallia commented Oct 24, 2022

Hello,
thank you for your answer. I checked the conditionals you gave me, but I still have empty base calls.

Hello,

Those warnings are just warnings, not errors. Truvari will throw errors if there is a problem with the VCF such as corrupted qual fields. Instead, the warnings are telling us that the --base VCF file has no SVs.

Sorry, of course I meant warnings, not errors.

Please check the file and make sure there are SVs. For Truvari to consider a VCF entry to be an SV, it has to pass some set of conditionals:

* The call must be between `--sizemin` and `--sizemax` in length. See [docs](https://truvari.readthedocs.io/en/latest/truvari.html#entry-size) for how size is determined

Yes, there is an SVLEN. Plus we have start and end information.

* The call must be non-reference-homozygous (0/0) when using `--no-ref`

I hope that the GT ./. works, as it works in my other vcf files.

* The call must have a PASS for FILTER if `--passonly`

Yes mason does only create entrys with PASS.

* The call's start and end must land within a region from the `--includebed` if it was provided

The bed file is created by convert2bed --input=vcf --output=bed < hg38_SV_simulation_noSNP_sorted_fixed.vcf > hg38_SV_simulation_noSNP.bed thus it should include all regions.

It's possible I'm forgetting something, so please check these things and if you're still having issues I will need to see the VCF. Perhaps not the entire file, but just your hg38_SV_simulation_noSNP_sorted_fixed.vcf.gz if that file is used as the --base or the other non-mason VCF I believe you described you're comparing against. I wouldn't need to see the entire file, just a handful of entries you believe are SVs but Truvari is disagreeing.

Here is the start of my hg38_SV_simulation_noSNP_sorted_fixed.vcf.gz and I think at least the sim_sv_indel* ones should got used by truvari.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=TARGETPOS,Number=1,Type=String,Description="Target position for duplications.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr21,length=46709983>
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##reference=hg38_chr21.fa
##source=mason_variator
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	simulated
chr21	5030334	sim_trans_0	N	A[chr21:5031186[	1	PASS	SVTYPE=BND	GT	./.
chr21	5030335	sim_trans_0	N	[chr21:5047468[A	1	PASS	SVTYPE=BND	GT	./.
chr21	5031185	sim_trans_0	N	C[chr21:5047469[	1	PASS	SVTYPE=BND	GT	./.
chr21	5031186	sim_trans_0	N	[chr21:5030334[A	1	PASS	SVTYPE=BND	GT	./.
chr21	5047468	sim_trans_0	N	T[chr21:5030335[	1	PASS	SVTYPE=BND	GT	./.
chr21	5047469	sim_trans_0	N	[chr21:5031185[C	1	PASS	SVTYPE=BND	GT	./.
chr21	5060293	sim_dup_0	N	<DUP>	1	PASS	END=5073031;SVLEN=12738;SVTYPE=DUP;TARGETPOS=chr21:5087033	GT	./.
chr21	5105328	sim_small_indel_0	N	CTTTCAC	1	PASS	.	GT	./.
chr21	5126727	sim_small_indel_1	N	C	1	PASS	.	GT	./.
chr21	5127415	sim_sv_indel_0	N	<DEL>	1	PASS	SVLEN=-18647;SVTYPE=DEL;END=5146062	GT	./.
chr21	5152875	sim_small_indel_2	N	ACT	1	PASS	.	GT	./.
chr21	5217965	sim_sv_indel_1	N	<INS>	1	PASS	SVLEN=5919;SVTYPE=INS;END=5217965	GT	./.
chr21	5223979	sim_sv_indel_2	N	<DEL>	1	PASS	SVLEN=-9486;SVTYPE=DEL;END=5233465	GT	./.
chr21	5252802	sim_trans_1	N	T[chr21:5271385[	1	PASS	SVTYPE=BND	GT	./.
chr21	5252803	sim_trans_1	N	[chr21:5288484[T	1	PASS	SVTYPE=BND	GT	./.
chr21	5256763	sim_small_indel_3	N	C	1	PASS	.	GT	./.
chr21	5271384	sim_trans_1	N	A[chr21:5288485[	1	PASS	SVTYPE=BND	GT	./.
chr21	5271385	sim_trans_1	N	[chr21:5252802[A	1	PASS	SVTYPE=BND	GT	./.
chr21	5288484	sim_trans_1	N	T[chr21:5252803[	1	PASS	SVTYPE=BND	GT	./.
chr21	5288485	sim_trans_1	N	[chr21:5271384[T	1	PASS	SVTYPE=BND	GT	./.
chr21	5315180	sim_small_indel_4	N	AGACAGAGAGGCTTGGA	1	PASS	.	GT	./.
chr21	5316839	sim_inv_0	N	<INV>	1	PASS	END=5335052;SVLEN=18213;SVTYPE=INV	GT	./.
chr21	5337350	sim_dup_1	N	<DUP>	1	PASS	END=5350120;SVLEN=12770;SVTYPE=DUP;TARGETPOS=chr21:5366707	GT	./.
...

Have a great day, ~/Adam English

Thank you for your quick and helpful answer!
Best Lydia Buntrock

@ACEnglish
Copy link
Owner

I believe the problem might be with the --includebed. Could you try to run once without that file and see if it changes anything?

The reason I'm suspicious of the bed is that the example entries you provided above worked just fine (see below).
It could be as simple as an off by one error since beds/vcfs use different indexing (0/1 based respectively). I'd also be curious to see what would happen if you inflated your bed regions (see bedtools slop here).

git checkout tags/v3.0.0
python3 -m pip install .
truvari bench -b ticket137.vcf.gz -c ticket137.vcf.gz -o test3.0 -f reference/grch38/GRCh38_1kg_mainchrs.fa
cat test3.0/summary.txt
{
    "TP-base": 6,
    "TP-call": 6,
    "FP": 0,
    "FN": 0,
    "precision": 1.0,
    "recall": 1.0,
    "f1": 1.0,
    "base cnt": 6,
    "call cnt": 6,
    "TP-call_TP-gt": 6,
    "TP-call_FP-gt": 0,
    "TP-base_TP-gt": 6,
    "TP-base_FP-gt": 0,
    "gt_concordance": 1.0
}

@Irallia
Copy link
Author

Irallia commented Oct 25, 2022

It was indeed the bedfile. Thank you very much for your help. Now I have learned a lot more about truvari.
From my point of view, the issue can now be closed.

@ACEnglish
Copy link
Owner

Thanks for reporting this.

As a note: I looked into it more and I believe you've actually found an error. Apparently pyintervaltree isn't behaving as I was expecting and therefore your bed files didn't contain the SVs. For example, say we have coordinates from BedOps convert2bed using repo_utils/test_files/input1.vcf.gz :

>>> x = IntervalTree()
>>> x.addi(66234, 66235) # convert2bed coordinates of first entry
>>> x.overlaps(66234) # pysam.VariantRecord.start
True
>>> x.overlaps(66235) # pysam.VariantRecord.stop
False

Position 66235 not overlapping position 66235 is not desired behavior. I'll close this ticket after I get the change checked in

ACEnglish added a commit that referenced this issue Oct 25, 2022
pyintervaltree overlaps method doesn't behave well with variant ends. Adding one to the ends when doing `addi` produces
desired behavior
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants