
Question about SV calling consistency #61

Open
riyasj327 opened this issue Oct 2, 2024 · 8 comments

riyasj327 commented Oct 2, 2024

Greetings,

Thank you so much for this amazing tool!
I am checking the consistency between SV calls after randomizing and splitting the reads into two equal datasets. Each dataset is nanopore data at around 40x coverage.

I used truvari consistency to find the intersection between the VCFs of the two datasets.
For only SVs:
#VCFs Calls Pct
2 1042752 43.81%
1 1337473 56.19%

The overlap here is 43.8%.

With SNVs also:
#VCFs Calls Pct
2 5248295 70.38%
1 2208985 29.62%
The overlap here is 70%.

Truvari consistency calls a variant a match only if CHROM, POS, ID, REF, and ALT are exactly the same, with no differences allowed.

I saw that some mismatches are due to SV length differences. I wanted to understand how the PAV algorithm finds breakpoints and what is causing the mismatches, because the datasets are technically from the same sample and we would expect the same variants.

Any thoughts on this would be really helpful!

thanks,
Riya

paudano (Contributor) commented Oct 4, 2024

I think the intersect parameters are far too stringent for the data. ONT reads have a higher homopolymer error rate, and if you are matching on exact size and position, then a single homopolymer error can shift the size of the SV by one or two bases, which would cause them not to match under these intersect criteria. When downsampling and assembling, there is less coverage per haplotype to assemble, which is going to increase the error rate. This is true for ONT and PacBio, though ONT R9 is going to be noisier than R10, which is probably going to be noisier than modern HiFi (though R10 is getting closer).
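
For illustration (coordinates and sequences are made up), here are two simplified VCF-style records for the same insertion that would fail an exact CHROM/POS/REF/ALT comparison because of a single extra homopolymer base:

```
#CHROM  POS     REF  ALT                (SVLEN)
chr1    100500  T    TACGT...AAAAA      52   <- split 1
chr1    100500  T    TACGT...AAAAAA     53   <- split 2
```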

I'm assuming this is diploid and that there is a high divergence between your reference and organism (2.3 M SVs and 7 M SNVs is massive). You must have 2 or more alleles (assuming diploid for this answer). For homozygous sites, PAV has to pick one haplotype to represent the SV itself. So, it could be choosing the maternal haplotype for one split and the paternal for the other split. The first haplotype is determined by the order of haplotypes input, so if you have a way of phasing so that the maternal or paternal haplotype is always first, then it would reduce this effect (although this is not necessary for most variant callsets). Any biological difference within the SV sequence between the two alleles can then change the representation in one split vs the other (i.e. an indel inside the SV that one haplotype has but the other does not). This would explain many of the differences.

Given the biology and the technology, the intersect parameters need to be adjusted. For SVs, I intersect with 50% reciprocal overlap and require at least 80% sequence identity. I use my own tool (SV-Pop) for this, which isn't the easiest, but Truvari can do it too (it's harder to tune Truvari to handle SVs and indels well in the same run).
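
For reference, a roughly equivalent Truvari run might look like the sketch below (VCF names are placeholders; both files need to be bgzipped and tabix-indexed, and the thresholds may need tuning for your data):

```bash
# Require ~50% reciprocal overlap and ~80% sequence similarity between matches.
truvari bench \
    -b split1.vcf.gz \
    -c split2.vcf.gz \
    -o truvari_50_80/ \
    --pctovl 0.5 \
    --pctseq 0.8 \
    --passonly
```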

I hope this helps.

riyasj327 (Author) commented Oct 8, 2024

Thank you so much for your valuable insights! This is very helpful.
Yes, I agree it is too stringent in the case of SV calling. So I also tried a few other tools like SURVIVOR merge and truvari bench.

SURVIVOR merge with default values (https://github.com/fritzsedlazeck/SURVIVOR/wiki#consensus-calling-or-comparing-svs-across-samples), other than the breakpoint distance which I changed to 50 bp, gave the following call counts for datasets 1 and 2 and the number of overlaps (see the sketch below):
1 - 27842
2 - 27839
overlaps - 21674
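
The command I ran was along these lines (file names are placeholders):

```bash
# List the two per-dataset VCFs, then merge with a 50 bp max breakpoint
# distance, >=1 supporting caller, matching SV type and strand, no
# size-based distance estimation, and a 30 bp minimum SV size.
ls dataset1.vcf dataset2.vcf > vcf_list.txt
SURVIVOR merge vcf_list.txt 50 1 1 1 0 30 merged.vcf
```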

I also tried truvari bench (https://github.com/ACEnglish/truvari/wiki/bench) and saw that around 20% were non-overlapping (false positives and negatives) with the default truvari bench parameters (they allow some differences, unlike truvari consistency, which looks for exact matches).

Yes, this is diploid, and I don't know why truvari consistency shows such a high number of variant calls compared to truvari bench or SURVIVOR merge. Maybe they are breakpoints and not individual variant calls? The total number of PAV calls for the datasets is around 41700!

Also, I am using hapdup to get diploid assemblies, which gives me haplotypes 1 and 2; I always pass hapdup_dual_1.fasta as HAP1 and hapdup_dual_2.fasta as HAP2.

I am in the process of increasing coverage and checking how the number of overlaps changes.

paudano (Contributor) commented Oct 9, 2024

It might be worth looking through this paper; I think it's relevant to what you are doing:
https://www.nature.com/articles/s41467-024-46614-z

I'm assuming this is human, and the numbers don't make sense. You should have higher-quality assemblies at 40x. Overall, assembly-based approaches are pretty consistent, but there's something different about this case. I can make some other guesses about what it might be, but it's hard to tell from just what's posted here.

riyasj327 (Author) commented
Yes, I have had a look at this paper before. Yes, it is human. But since I am converting the haploid assembly I got at 40x into a diploid one, the coverage per haplotype will be even lower? That is why I thought of increasing the coverage. Also, I think it's better to stick with truvari bench instead of truvari consistency, which looks for exact matches.

Based on the paper and my personal experience, I have now fixed the truvari bench parameters as (--pctseq 0.5 --pctsize 0.5 -r 500 --passonly -s 50), and these are the truvari scores I got:

Dataset 1 VCF with Dataset 2 VCF (these 2 datasets were shuffled and split equally from 80x data, then taken through diploid assembly and SV calling; see the command sketch after the scores):
"TP-base": 25320,
"TP-comp": 25320,
"FP": 2552,
"FN": 2463,
"precision": 0.9084385763490241,
"recall": 0.9113486664507072,
"f1": 0.909891294582697,
"base cnt": 27783,
"comp cnt": 27872,

And depending on how I change the -r value, I see some changes in the false negatives and positives, which is as expected.

Dataset 1 VCF (first run) with Dataset 1 VCF (second run) (the same dataset 1 used before; I did diploid assembly and SV calling on it again, so I expected more consistency this time):

"TP-base": 26278,
"TP-comp": 26278,
"FP": 1497,
"FN": 1505,
"precision": 0.9461026102610262,
"recall": 0.9458301839254221,
"f1": 0.9459663774793909,
"base cnt": 27783,
"comp cnt": 27775,

I think the scores are not too bad. Are the precision, recall, and F1 scores good enough?

paudano (Contributor) commented Oct 10, 2024

I always look at these numbers with the whole callset, as you are doing, but also with tandem repeats excluded. On T2T-CHM13, I'd drop tandem repeats and CenSat loci (UCSC track "CenSat Annotation"); anything other than CT and MON produces many false positive calls.

Tandem repeats change so quickly and have so much homology that aligners have difficulty handling them consistently. One small change (a single SNP or indel) can give an SV a very different representation, sometimes shifting it by several kbp or breaking it into multiple smaller SVs. Some progress has been made for these loci, but they remain a source of noise in callsets.
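
One way to do that exclusion, sketched with placeholder file names (a tandem-repeat BED plus the non-CT/MON CenSat intervals exported from the UCSC track):

```bash
# Build a combined exclude BED, then drop calls that touch it.
cat tandem_repeats.bed censat_non_ct_mon.bed \
    | sort -k1,1 -k2,2n \
    | bedtools merge -i - > exclude.bed
bedtools intersect -header -v -a calls.vcf -b exclude.bed > calls.noTR.vcf
```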

riyasj327 (Author) commented Oct 22, 2024

Hmm, I totally agree. I was using hg38 as the reference since that gave me more overlaps than T2T. When I included the GIAB GRCh38 HG002 structural variant benchmark BED file, the scores improved and are much better now, meaning most of the non-overlaps were in the non-confident regions.
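
Restricting the comparison to the confident regions was along these lines (the BED is the one distributed with the GIAB HG002 SV benchmark; other parameters as before):

```bash
truvari bench \
    -b dataset1.vcf.gz -c dataset2.vcf.gz -o bench_confident/ \
    --includebed HG002_SVs_Tier1_v0.6.bed \
    --pctseq 0.5 --pctsize 0.5 -r 500 --passonly -s 50
```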

Also, I now assume that most of the stochasticity was in the assembly step. When I called variants from the exact same diploid assembly a second time, the overlaps are very high, as expected, unlike before, when the reads and assemblies were different even though they came from a single sample.

Summary of consistency
#VCFs Calls Pct
2 6337816 91.54%
1 585793 8.46%

and with just the SVs, 95.41% overlap.

paudano (Contributor) commented Oct 26, 2024

I don't often run ONT-only assemblies, but the amount of variation you are getting by subsetting the same data seems odd. I might try assembling each sequencing run separately and comparing to see if one stands out (you might also be able to run Merqury, checking the assemblies against each sequencing run, to see if one of those stands out). I'd be looking for contamination or a mix-up in sequence data, which could cause the assemblies to differ quite a bit. It's hard for me to tell how likely this is from what's in this thread. PAV and the tools it runs are pretty deterministic; we've run it multiple times on the same assembly in the past and gotten the same callset from both runs, so if this kind of volatility across subsets is happening, I'd suspect the assembly stage. I'm always on the lookout for PAV bugs though, so if you find something, please let me know.
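
A sketch of that Merqury check, assuming k=21 (a common choice for human) and placeholder file names:

```bash
# Build a k-mer database from one sequencing run, then evaluate both
# haplotype assemblies against it; repeat per run and compare QV/completeness.
meryl k=21 count output run1.meryl run1_reads.fastq.gz
merqury.sh run1.meryl hap1.fasta hap2.fasta run1_vs_asm
```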

Yes, you are going to get more volatility in the repetitive regions, and it's worse with the T2T assembly because it includes more of these (real centromeres and rDNA arrays, VNTRs that are not as collapsed as in GRCh38, etc.). PAV will call most loci where minimap2 aligns, but the alignments are very volatile across samples. They indicate real variation, but small variants lead to very large differences in how they are aligned and represented. If you are seeing lots of volatility outside those loci, then something is likely going on with the assemblies, and I'd focus some QC effort on the input data to make sure it isn't affecting your results.

This doesn't mean you have to throw out all calls in repetitive loci, but excluding them for some basic QC analyses is almost always helpful.

Good luck!

riyasj327 (Author) commented Oct 28, 2024

Thanks, @paudano!

Yes, I have checked with the developer of the assembler that I used, and they mentioned that even if the reads are exactly the same, I will not get exactly the same assemblies, because of non-deterministic steps in assembly. So, especially with different reads, I got many non-overlaps. Using the same reads, I still got a good number of non-overlaps. Now, in a third experiment, I passed the exact same diploid assemblies to variant calling to check whether any stochasticity happens during variant calling, and I saw that it is very low, which is good.

Truvari bench results:
"TP-base": 23443,
"TP-comp": 23443,
"FP": 0,
"FN": 8,
"precision": 1.0,
"recall": 0.9996588631614857,
"f1": 0.999829402482194,
"base cnt": 23451,
"comp cnt": 23443
with --pctseq 0.5 --pctsize 0.5 -r 500 --passonly -s 50 --pctovl 0 and the include BED.

And when only exact matches with no differences are allowed, it showed the following (I am not sure how variants are counted here, because the numbers don't make sense compared to the ones above; maybe truvari bench filters to SVs alone while consistency looks at all the variants):
Truvari consistency:
#VCFs Calls Pct
2 6337816 91.54%
1 585793 8.46%

91.5% overlaps!!

So most of the randomness is caused during assembly or the other steps that I am using. But one thing I am worried about: if you are sure that I should get 100% overlap for the same assemblies, maybe I can run again and check? Because even when I pass the exact same diploid assemblies (to avoid any randomness caused by any other step), I do not get 100% overlap.
