`dorado correct` discarding reads in repeats #851

diego-rt · 2024-05-29T14:11:01Z

Hi,

I wanted to run dorado correct on a set of reads spanning a complex satellite region. This worked fine but after mapping I realised that it discarded all the reads in the most repetitive region. I assume this might have to do with the way you do the all versus all alignments which might involve discarding the most abundant minimisers. This is a big problem because these are obviously the most useful reads.

Thanks a ton!

The text was updated successfully, but these errors were encountered:

tijyojwad · 2024-05-30T00:38:02Z

Hi @diego-rt - thanks for highlighting this. I think you're right, we will look into it.

colindaven · 2024-06-10T11:59:51Z

I think this is a problem for us despite the nice gains through dorado-correct.

What I see is hugely improved contig N50s (from 37 -> 68 MB N50) for a plant genome of about 700 MB when using dorado corrected reads.

That's great, but the total assembly size is typically around 670 MB using flye or hifiasm with raw 10.4.1 ONT reads. With dorado corrected reads, we only see a total genome size of about 640-644 MB (so about 30 MB or 5% less), indicating that probably repeat rich reads and regions are missing.

bpanda-dev · 2024-06-11T06:37:49Z

To support @diego-rt point , this issue is due to the All vs All overlap from minimap2. Minimap2 discards reads coming from long repetitive regions.
To check this, we mapped the reads present in AvA minimap2 output to the reference genome and analysed regions with low coverage, we find that it does not contain the reads pertaining to repeat regions especially the centromeric satellite regions.

Note: we used HERRO instead of dorado correct , since it has separate scripts for running the three steps (preprocessing, AvA, Herro Inference) of the HERRO correction pipeline.

Given below , we compare the read mapping coverage across the hg002 chromosome 19 MATERNAL reference for the raw read set and the HERRO corrected read set respectively.

Regards,
Bikram Panda

bdrosen · 2024-06-18T18:15:50Z

Hi @tijyojwad, is this fixed in the v0.7.2 release or was that a separate alignment issue mentioned in the notes?
"3b51c1b - Fix sub-par alignments in dorado correct"
Thanks!

ekg · 2024-07-25T04:52:17Z

We will explore applying wfmash to this. It should behave differently in repeats.

HalfPhoton · 2024-09-17T09:23:57Z

This issue should have been resolved in dorado 0.7.3 and there has also been improvements to the tool's general stability in the newly released dorado 0.8.0.

Closing this issue as resolved but please re-open or create a new issue if it has not been properly addressed.

Kind regards,
Rich

diego-rt · 2024-09-17T20:34:20Z

Hi @HalfPhoton @tijyojwad

I don't think it has been fixed yet. I'm using dorado 0.7.3 and I can confirm it's still a problem.

If you are referring to the fix Remove limit on number of overlaps considered during all-vs-all alignment in dorado correct introduced in dorado 0.7.3, I don't think that this addresses the underlying issue of this problem.

The problem is that minimap2 discards high frequency minimisers (i.e. as in option -f in the minimap2 manual). If you want to be able to correct reads in satellites you are going to have to use an approach like winnowmap, a minimap2 fork that does not discard high frequency minimisers. What @ekg proposed also sounds like a great idea.

diego-rt · 2024-10-02T18:32:57Z

Hi @HalfPhoton @tijyojwad

I'm experimenting with the idea suggested by @ekg but I need some guidance on what is the expectation from dorado's side on how the PAF file should look like. It would be good to know what alignment minimal lengths, minimal identities, max indels, PAF sorting order, optional tags, etc. this PAF file should have when using the option dorado correct --from-paf.

I've tried a few seemingly valid PAF files already and depending on the mapping parameters I either get a segmentation fault or very few reads are actually corrected (despite a 'normal' number being reported in the verbose log). Using dorado correct for both alignment and correction I get normal results using this small dataset.

svc-jstone · 2024-10-07T21:30:20Z

Hi @diego-rt! Thanks for experimenting with this, this is an interesting problem to solve.

The PAF file should be formatted like this:

Sorted by target name.
Not contain self-mappings.
Needs to contain dual mappings.
IMPORTANT: Be careful that both the forward and the reverse alignments have left-aligned indels with respect to the target. If you generate the reciprocal overlaps instead of computing them (e.g. by reversing the symmetric overlap from a symmetric alignment on the revcmp strand), your consensus will be sub par.
There needs to be a CIGAR tag cg:Z:.
CIGAR needs to have =/X operations and not M. This may cause the process to terminate instead of failing gracefully at the moment (needs a small bugfix).
Fields 10, 11 and 12 are ignored for this purpose at the moment.

There is no minimal length per se, though the dorado correct overlap settings use minimum chaining score cutoff of 4000.
Other Minimap2 parameters are mentioned here:
#1058

get a segmentation fault or very few reads are actually corrected

Can you provide a bit more information about the segfault? Would be great to fix it.

Let us know how it goes!
I'm also interested in hearing the compute time comparisons between methods, if it's possible to measure 🙂

lh3 · 2024-10-09T03:22:15Z

Winnowmap has removed the AvA mode so far as I know.

What is the order of reads in the input? Note that AvA will work badly if reads are sorted by position.

diego-rt · 2024-10-09T08:12:41Z

Winnowmap has removed the AvA mode so far as I know.

What is the order of reads in the input? Note that AvA will work badly if reads are sorted by position.

Hi @lh3

Yes I just realised recently that AvA is missing in winnowmap.
You are totally right that reads were fetched from a position-sorted BAM file since I was fetching reads mapped to unitigs in tangles. I will test with reads in a randomised order but I doubt it will fully resolve this. Please forgive my ignorance, but why is the order detrimental?
Do you have any suggestions on what we can try to overcome this issue? I've tried decreasing the -f parameter in minimap2 but obviously this leads to unfeasible runtimes and memory requirements. If I'm not mistaken based on @svc-jstone input, this would be the equivalent minimap2 alignment command:

minimap2 -K8g -cx ava-ont -k25 -w17 -e200 -r150,2000 -m4000 -z200,200 -t 16 --eqx --dual=yes ${ont_reads} ${ont_reads} > aln.paf

jelber2 · 2024-10-09T09:56:00Z

@lh3 I have not used winnowmap in a while, but can you not just recreate the -ava-ont preset with commandline argument values?

lh3 · 2024-10-09T12:59:47Z

Winnowmap algorithm is optimized for read-to-genome alignment, not for AvA. It requires a list of high-frequency k-mers which takes time to produce. It is also times slower than minimap2. If minimap2 is already the sole performance bottleneck, winnowmap will be more problematic.

AvA assumes reads in random order. With the HERRO setting, 2.7X of human reads are indexed in each batch. With sorted input, you will have 60X in one batch. You will need a threshold of 20-fold higher. The distorted k-mer count distribution will also screw up frequency-based threshold estimation.

The solution is to either increase the k-mer count threshold with -f or reduce the batch size with -I, or both, but in general, it is tricky to reassemble aligned reads in centromeres. The read alignment can be wrong.

diego-rt · 2024-10-09T13:35:53Z

Oh I see! Well in my case I'm only doing targeted assembly of tangles so for instance this region only has 2.2 Gbp of ONT reads (~40x or so), so it should all fit in one batch.

I've ran the AvA alignments using shuffled and position sorted reads and because of the small size of the dataset I get essentially identical results:

wc -l ont_reads.aln.paf 
3012688 ont_reads.aln.paf

wc -l ont_reads.shuffled.aln.paf 
3012684 ont_reads.shuffled.aln.paf

But thanks a lot for bringing it up @lh3 , I hadn't thought of that and will shuffle the reads when using datasets larger than the batch size.

Regarding winnowmap, bizarrely it actually works without a high frequency minimiser list and even more bizarrely, it is actually faster without the high frequency minimiser list. We benchmarked it in this issue here.

That being said, winnowmap is still much faster than minimap2 with tweaked -f flags when aligning to a reference. In the case of AvA the minimap2 runs for this datasets with a decreased -f 0 are taking upwards of 600G memory and have been running for more than 2 days, compared to ~30 min for minimap2 with the settings listed above, so this is clearly unfeasible. I guess I could search for an -f value that gives reasonable runtimes but I fear that I would have to adjust it for each satellite, which would be unpractical.

I guess one could indeed put together the AvA preset @jelber2 but I'm also not sure how well this will work. I will test but I don't have high hopes.

lh3 · 2024-10-09T13:56:03Z

Without the high-frequency k-mer (not minimizer) list, winnowmap is close to the old minimap2. That is why it is faster and uses less memory, but then it would lose the main advantage and is less accurate for human reads. The observation of improved result in the other thread could be verkko specific or because the winnowmap-specific improvement is not optimized for non-human genomes.

As to -f, try -f5000, -f10000 or higher, or as I said, reduce the batch size. -f0 uses all k-mers. It is too extreme. Note that current read aligners, winnowmap included, are less accurate than whole-genome assemblers. When alignment is wrong, you can't get good local reassembly anyway. I would recommend whole-genome assembly.

jelber2 · 2024-10-15T07:59:38Z

@lh3 Would it not make more sense to adjust
-f
from the default
-f 0.0002
to
-f 0.00002
or
-f 0.000002 ,
since one does not know the number of repetitive minimizer occurrences, and the smaller fractions I list (i.e., they are still bigger than 0) would be more liberal than the default and allow for reads with more repetitive minimizers to be included?

HalfPhoton added the read_correction Read error correction label May 29, 2024

HalfPhoton closed this as completed Sep 17, 2024

HalfPhoton reopened this Sep 18, 2024

diego-rt mentioned this issue Oct 9, 2024

Winnowmap for read overlapping marbl/Winnowmap#17

Open

diego-rt mentioned this issue Oct 11, 2024

AvA of ONT reads waveygang/wfmash#281

Open

jelber2 mentioned this issue Oct 15, 2024

Caution about herro/dorado correct tossing out repetitive reads prio to Shasta assembly paoloshasta/shasta#38

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`dorado correct` discarding reads in repeats #851

`dorado correct` discarding reads in repeats #851

diego-rt commented May 29, 2024

tijyojwad commented May 30, 2024

colindaven commented Jun 10, 2024

bpanda-dev commented Jun 11, 2024 •

edited

Loading

bdrosen commented Jun 18, 2024

ekg commented Jul 25, 2024

HalfPhoton commented Sep 17, 2024

diego-rt commented Sep 17, 2024

diego-rt commented Oct 2, 2024

svc-jstone commented Oct 7, 2024

lh3 commented Oct 9, 2024

diego-rt commented Oct 9, 2024

jelber2 commented Oct 9, 2024

lh3 commented Oct 9, 2024

diego-rt commented Oct 9, 2024

lh3 commented Oct 9, 2024 •

edited

Loading

jelber2 commented Oct 15, 2024

dorado correct discarding reads in repeats #851

dorado correct discarding reads in repeats #851

Comments

diego-rt commented May 29, 2024

tijyojwad commented May 30, 2024

colindaven commented Jun 10, 2024

bpanda-dev commented Jun 11, 2024 • edited Loading

bdrosen commented Jun 18, 2024

ekg commented Jul 25, 2024

HalfPhoton commented Sep 17, 2024

diego-rt commented Sep 17, 2024

diego-rt commented Oct 2, 2024

svc-jstone commented Oct 7, 2024

lh3 commented Oct 9, 2024

diego-rt commented Oct 9, 2024

jelber2 commented Oct 9, 2024

lh3 commented Oct 9, 2024

diego-rt commented Oct 9, 2024

lh3 commented Oct 9, 2024 • edited Loading

jelber2 commented Oct 15, 2024

`dorado correct` discarding reads in repeats #851

`dorado correct` discarding reads in repeats #851

bpanda-dev commented Jun 11, 2024 •

edited

Loading

lh3 commented Oct 9, 2024 •

edited

Loading