dorado correct discarding reads in repeats #851
Comments
Hi @diego-rt - thanks for highlighting this. I think you're right; we will look into it.
I think this is a problem for us despite the nice gains from dorado correct. With dorado-corrected reads we see hugely improved contig N50s (from 37 to 68 Mb) for a plant genome of about 700 Mb. That's great, but the total assembly size is typically around 670 Mb using flye or hifiasm with raw 10.4.1 ONT reads. With dorado-corrected reads, we only see a total assembly size of about 640-644 Mb (about 30 Mb, or roughly 5%, less), suggesting that repeat-rich reads and regions are probably missing.
To support @diego-rt's point: this issue is due to the all-vs-all (AvA) overlap step from minimap2, which discards reads coming from long repetitive regions. Note: we used HERRO instead of dorado correct, since it has separate scripts for running the three steps of the HERRO correction pipeline (preprocessing, AvA, HERRO inference). Below, we compare the read mapping coverage across the HG002 chromosome 19 maternal reference for the raw read set and the HERRO-corrected read set, respectively. Regards,
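For anyone who wants to reproduce this kind of comparison, here is a minimal sketch, assuming both read sets have been mapped to the same reference and written as PAF. It sums aligned bases per reference window and lists windows that are well covered by raw reads but mostly empty after correction. The function names, the 100 kb window and both thresholds are illustrative choices, not anything used in the analysis above.

```python
from collections import defaultdict

def window_coverage(paf_lines, window=100_000):
    """Sum aligned reference bases per (target, window index) from PAF records."""
    cov = defaultdict(int)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # skip records without the 12 mandatory PAF columns
        tname, tstart, tend = f[5], int(f[7]), int(f[8])
        pos = tstart
        while pos < tend:
            idx = pos // window
            win_end = (idx + 1) * window
            # Count only the part of the alignment that falls in this window.
            cov[(tname, idx)] += min(tend, win_end) - pos
            pos = win_end
    return cov

def dropped_windows(raw_cov, corrected_cov, window=100_000,
                    min_raw_depth=5.0, max_ratio=0.2):
    """Windows with decent raw depth whose corrected coverage collapsed.

    min_raw_depth and max_ratio are arbitrary illustrative thresholds.
    """
    out = []
    for key, raw_bases in sorted(raw_cov.items()):
        if raw_bases / window < min_raw_depth:
            continue
        ratio = corrected_cov.get(key, 0) / raw_bases
        if ratio <= max_ratio:
            out.append((key[0], key[1] * window, ratio))
    return out
```

Feeding it the two PAF files (for example `window_coverage(open("raw.paf"))`, with hypothetical file names) gives a list of `(target, window_start, corrected/raw ratio)` tuples that can be checked against known satellite or repeat annotations.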
Hi @tijyojwad, is this fixed in the v0.7.2 release or was that a separate alignment issue mentioned in the notes?
We will explore applying wfmash to this. It should behave differently in repeats.
This issue should have been resolved in dorado 0.7.3, and there have also been improvements to the tool's general stability in the newly released dorado 0.8.0. Closing this issue as resolved, but please re-open it or create a new issue if it has not been properly addressed. Kind regards,
I don't think it has been fixed yet. I'm using dorado 0.7.3 and I can confirm it's still a problem. If you are referring to the fix The problem is that minimap2 discards high-frequency minimisers (i.e. as in option
I'm experimenting with the idea suggested by @ekg, but I need some guidance on what dorado expects the PAF file to look like. It would be good to know what minimum alignment lengths, minimum identities, maximum indels, PAF sorting order, optional tags, etc. this PAF file should have when using the option I've tried a few seemingly valid PAF files already, and depending on the mapping parameters I either get a segmentation fault or very few reads are actually corrected (despite a 'normal' number being reported in the verbose log). Using dorado correct for both alignment and correction, I get normal results with this small dataset.
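While waiting for guidance, a quick sanity check on an externally generated PAF can at least rule out structurally broken records. The sketch below checks only what the PAF format itself defines (12 mandatory tab-separated columns, consistent coordinates) and reports the shortest alignment block and lowest approximate identity. It says nothing about what dorado actually requires, which is not stated in this thread; `check_paf` is a hypothetical helper name.

```python
def check_paf(lines):
    """Summarise PAF records and flag structurally inconsistent ones."""
    records = 0
    min_block_len = None
    min_identity = None
    problems = []  # (line number, description)
    for lineno, line in enumerate(lines, start=1):
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            problems.append((lineno, "fewer than 12 columns"))
            continue
        try:
            qlen, qstart, qend = int(f[1]), int(f[2]), int(f[3])
            tlen, tstart, tend = int(f[6]), int(f[7]), int(f[8])
            matches, block = int(f[9]), int(f[10])
        except ValueError:
            problems.append((lineno, "non-integer coordinate field"))
            continue
        if f[4] not in ("+", "-"):
            problems.append((lineno, "strand is not + or -"))
            continue
        if not (0 <= qstart < qend <= qlen):
            problems.append((lineno, "query coordinates out of range"))
            continue
        if not (0 <= tstart < tend <= tlen):
            problems.append((lineno, "target coordinates out of range"))
            continue
        if block <= 0 or matches > block:
            problems.append((lineno, "match count exceeds block length"))
            continue
        records += 1
        # Column 10 / column 11 is only an approximate (BLAST-like) identity.
        identity = matches / block
        min_block_len = block if min_block_len is None else min(min_block_len, block)
        min_identity = identity if min_identity is None else min(min_identity, identity)
    return {
        "records": records,
        "min_block_len": min_block_len,
        "min_identity": min_identity,
        "problems": problems,
    }
```

Running this on both a PAF that works and one that segfaults, and comparing the summaries, might help narrow down which property differs.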
Hi @diego-rt! Thanks for experimenting with this, this is an interesting problem to solve. The PAF file should be formatted like this:
There is no minimal length per se, though the
Can you provide a bit more information about the segfault? It would be great to fix it. Let us know how it goes!
Winnowmap has removed the AvA mode, as far as I know. What is the order of reads in the input? Note that AvA will work badly if reads are sorted by position.
Hi @lh3
@lh3 I have not used winnowmap in a while, but can you not just recreate the -ava-ont preset with command-line argument values?
The winnowmap algorithm is optimized for read-to-genome alignment, not for AvA. It requires a list of high-frequency k-mers, which takes time to produce. It is also times slower than minimap2. If minimap2 is already the sole performance bottleneck, winnowmap will be more problematic. AvA assumes reads are in random order. With the HERRO setting, 2.7X of human reads are indexed in each batch. With sorted input, you will have 60X in one batch, so you will need a threshold roughly 20-fold higher. The distorted k-mer count distribution will also screw up frequency-based threshold estimation. The solution is to either increase the k-mer count threshold with
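To make the batch argument concrete, here is a small simulation sketch. It places reads uniformly on a toy genome at 60X, takes the first batch (sized to 2.7X of the genome, as in the numbers above) from either random-order or position-sorted input, and measures the depth over the region that batch actually spans. The genome size, read length and function names are all illustrative assumptions; this is not how minimap2 or HERRO batch reads internally.

```python
import random

def simulate_reads(genome_size, read_len, depth, seed=0):
    """Return (start, length) reads placed uniformly at random, in random order."""
    rng = random.Random(seed)
    n = genome_size * depth // read_len
    return [(rng.randrange(0, genome_size - read_len), read_len) for _ in range(n)]

def first_batch_local_depth(reads, batch_bases):
    """Depth over the genomic span covered by the first batch of reads."""
    total = 0
    lo = hi = None
    for start, length in reads:
        if total >= batch_bases:
            break  # batch is full
        total += length
        lo = start if lo is None else min(lo, start)
        hi = start + length if hi is None else max(hi, start + length)
    return total / (hi - lo)

genome_size = 1_000_000                # toy genome, not human-sized
batch_bases = int(2.7 * genome_size)   # batch holding 2.7X of the genome
reads = simulate_reads(genome_size, read_len=10_000, depth=60)

random_order_depth = first_batch_local_depth(reads, batch_bases)
sorted_depth = first_batch_local_depth(sorted(reads), batch_bases)
print(f"random order: {random_order_depth:.1f}X in the first batch")
print(f"position-sorted: {sorted_depth:.1f}X in the first batch")
```

With random order the batch is spread thinly over the whole genome, while with sorted input the same number of bases piles up on a small region, which is why a fixed k-mer occurrence threshold behaves so differently. Shuffling the reads before AvA (for instance with `random.shuffle` on a list of records, for datasets that fit in memory) restores the random-order assumption.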
Oh I see! Well, in my case I'm only doing targeted assembly of tangles, so for instance this region only has 2.2 Gbp of ONT reads (~40x or so), so it should all fit in one batch. I've run the AvA alignments using shuffled and position-sorted reads, and because of the small size of the dataset I get essentially identical results:
But thanks a lot for bringing it up @lh3, I hadn't thought of that and will shuffle the reads when using datasets larger than the batch size. Regarding winnowmap, bizarrely it actually works without a high-frequency minimiser list and, even more bizarrely, it is actually faster without the high-frequency minimiser list. We benchmarked it in this issue here. That being said, winnowmap is still much faster than minimap2 with tweaked I guess one could indeed put together the AvA preset @jelber2, but I'm also not sure how well this will work. I will test but I don't have high hopes.
Without the high-frequency k-mer (not minimizer) list, winnowmap is close to the old minimap2. That is why it is faster and uses less memory, but then it would lose its main advantage and is less accurate for human reads. The improved results observed in the other thread could be verkko-specific, or could be because the winnowmap-specific improvement is not optimized for non-human genomes. As to
@lh3 Would it not make more sense to adjust
Hi,
I wanted to run dorado correct on a set of reads spanning a complex satellite region. This worked fine, but after mapping I realised that it discarded all the reads in the most repetitive region. I assume this might have to do with the way you do the all-versus-all alignments, which might involve discarding the most abundant minimisers. This is a big problem because these are obviously the most useful reads.
Thanks a ton!
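To illustrate the suspected mechanism, here is a toy sketch of frequency-based seed filtering. It picks lexicographic (k, w)-minimizers, counts in how many reads each one occurs, and drops any that exceed an occurrence cap. Reads made entirely of a satellite repeat end up with no seeds at all and so can never be overlapped. This is a deliberately simplified illustration: minimap2 uses hashed minimizers and its own thresholds, and the names and parameters here (`minimizers`, `seeds_after_filter`, `max_occ`) are made up for the example.

```python
from collections import Counter

def minimizers(seq, k=5, w=4):
    """Lexicographically smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    out = set()
    for i in range(max(1, len(kmers) - w + 1)):
        window = kmers[i:i + w]
        if window:
            out.add(min(window))
    return out

def seeds_after_filter(reads, max_occ, k=5, w=4):
    """Per-read minimizers that survive an occurrence cap across the read set."""
    per_read = {name: minimizers(seq, k, w) for name, seq in reads.items()}
    counts = Counter()
    for mins in per_read.values():
        counts.update(mins)  # number of reads containing each minimizer
    return {
        name: {m for m in mins if counts[m] <= max_occ}
        for name, mins in per_read.items()
    }

satellite = "ACGTT" * 8
reads = {f"sat{i}": satellite[i:i + 30] for i in range(5)}  # repeat-only reads
reads["uniq1"] = "GATTACAGGCTAGCTTAGGA"
reads["uniq2"] = "TTGACCATGGCAATCCGGTA"

kept = seeds_after_filter(reads, max_occ=2)
for name, seeds in kept.items():
    print(name, len(seeds), "seeds kept")
```

In this toy set the satellite reads lose every seed under the cap while the unique reads keep theirs; raising `max_occ` to the number of satellite reads brings their seeds back, which is the same trade-off as raising the occurrence threshold in a real aligner.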