Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dorado correct discarding reads in repeats #851

Open
diego-rt opened this issue May 29, 2024 · 16 comments
Open

dorado correct discarding reads in repeats #851

diego-rt opened this issue May 29, 2024 · 16 comments
Labels
read_correction Read error correction

Comments

@diego-rt
Copy link

Hi,

I wanted to run dorado correct on a set of reads spanning a complex satellite region. This worked fine but after mapping I realised that it discarded all the reads in the most repetitive region. I assume this might have to do with the way you do the all versus all alignments which might involve discarding the most abundant minimisers. This is a big problem because these are obviously the most useful reads.

Thanks a ton!

Screenshot 2024-05-28 at 22 44 03

@HalfPhoton HalfPhoton added the read_correction Read error correction label May 29, 2024
@tijyojwad
Copy link
Collaborator

Hi @diego-rt - thanks for highlighting this. I think you're right, we will look into it.

@colindaven
Copy link

I think this is a problem for us despite the nice gains through dorado-correct.

What I see is hugely improved contig N50s (from 37 -> 68 MB N50) for a plant genome of about 700 MB when using dorado corrected reads.

That's great, but the total assembly size is typically around 670 MB using flye or hifiasm with raw 10.4.1 ONT reads. With dorado corrected reads, we only see a total genome size of about 640-644 MB (so about 30 MB or 5% less), indicating that probably repeat rich reads and regions are missing.

@bpanda-dev
Copy link

bpanda-dev commented Jun 11, 2024

To support @diego-rt point , this issue is due to the All vs All overlap from minimap2. Minimap2 discards reads coming from long repetitive regions.
To check this, we mapped the reads present in AvA minimap2 output to the reference genome and analysed regions with low coverage, we find that it does not contain the reads pertaining to repeat regions especially the centromeric satellite regions.

Note: we used HERRO instead of dorado correct , since it has separate scripts for running the three steps (preprocessing, AvA, Herro Inference) of the HERRO correction pipeline.

Given below , we compare the read mapping coverage across the hg002 chromosome 19 MATERNAL reference for the raw read set and the HERRO corrected read set respectively.

chr19_raw sorted_bam2plot_chr19_MATERNAL
chr19_herro sorted_bam2plot_chr19_MATERNAL

Regards,
Bikram Panda

@bdrosen
Copy link

bdrosen commented Jun 18, 2024

Hi @tijyojwad, is this fixed in the v0.7.2 release or was that a separate alignment issue mentioned in the notes?
"3b51c1b - Fix sub-par alignments in dorado correct"
Thanks!

@ekg
Copy link

ekg commented Jul 25, 2024

We will explore applying wfmash to this. It should behave differently in repeats.

@HalfPhoton
Copy link
Collaborator

This issue should have been resolved in dorado 0.7.3 and there has also been improvements to the tool's general stability in the newly released dorado 0.8.0.

Closing this issue as resolved but please re-open or create a new issue if it has not been properly addressed.

Kind regards,
Rich

@diego-rt
Copy link
Author

Hi @HalfPhoton @tijyojwad

I don't think it has been fixed yet. I'm using dorado 0.7.3 and I can confirm it's still a problem.

If you are referring to the fix Remove limit on number of overlaps considered during all-vs-all alignment in dorado correct introduced in dorado 0.7.3, I don't think that this addresses the underlying issue of this problem.

The problem is that minimap2 discards high frequency minimisers (i.e. as in option -f in the minimap2 manual). If you want to be able to correct reads in satellites you are going to have to use an approach like winnowmap, a minimap2 fork that does not discard high frequency minimisers. What @ekg proposed also sounds like a great idea.

@HalfPhoton HalfPhoton reopened this Sep 18, 2024
@diego-rt
Copy link
Author

diego-rt commented Oct 2, 2024

Hi @HalfPhoton @tijyojwad

I'm experimenting with the idea suggested by @ekg but I need some guidance on what is the expectation from dorado's side on how the PAF file should look like. It would be good to know what alignment minimal lengths, minimal identities, max indels, PAF sorting order, optional tags, etc. this PAF file should have when using the option dorado correct --from-paf.

I've tried a few seemingly valid PAF files already and depending on the mapping parameters I either get a segmentation fault or very few reads are actually corrected (despite a 'normal' number being reported in the verbose log). Using dorado correct for both alignment and correction I get normal results using this small dataset.

@svc-jstone
Copy link
Contributor

Hi @diego-rt! Thanks for experimenting with this, this is an interesting problem to solve.

The PAF file should be formatted like this:

  • Sorted by target name.
  • Not contain self-mappings.
  • Needs to contain dual mappings.
  • IMPORTANT: Be careful that both the forward and the reverse alignments have left-aligned indels with respect to the target. If you generate the reciprocal overlaps instead of computing them (e.g. by reversing the symmetric overlap from a symmetric alignment on the revcmp strand), your consensus will be sub par.
  • There needs to be a CIGAR tag cg:Z:.
  • CIGAR needs to have =/X operations and not M. This may cause the process to terminate instead of failing gracefully at the moment (needs a small bugfix).
  • Fields 10, 11 and 12 are ignored for this purpose at the moment.

There is no minimal length per se, though the dorado correct overlap settings use minimum chaining score cutoff of 4000.
Other Minimap2 parameters are mentioned here:
#1058

get a segmentation fault or very few reads are actually corrected

Can you provide a bit more information about the segfault? Would be great to fix it.

Let us know how it goes!
I'm also interested in hearing the compute time comparisons between methods, if it's possible to measure 🙂

@lh3
Copy link

lh3 commented Oct 9, 2024

Winnowmap has removed the AvA mode so far as I know.

What is the order of reads in the input? Note that AvA will work badly if reads are sorted by position.

@diego-rt
Copy link
Author

diego-rt commented Oct 9, 2024

Winnowmap has removed the AvA mode so far as I know.

What is the order of reads in the input? Note that AvA will work badly if reads are sorted by position.

Hi @lh3

  1. Yes I just realised recently that AvA is missing in winnowmap.
  2. You are totally right that reads were fetched from a position-sorted BAM file since I was fetching reads mapped to unitigs in tangles. I will test with reads in a randomised order but I doubt it will fully resolve this. Please forgive my ignorance, but why is the order detrimental?
  3. Do you have any suggestions on what we can try to overcome this issue? I've tried decreasing the -f parameter in minimap2 but obviously this leads to unfeasible runtimes and memory requirements. If I'm not mistaken based on @svc-jstone input, this would be the equivalent minimap2 alignment command:
minimap2 -K8g -cx ava-ont -k25 -w17 -e200 -r150,2000 -m4000 -z200,200 -t 16 --eqx --dual=yes ${ont_reads} ${ont_reads} > aln.paf

@jelber2
Copy link

jelber2 commented Oct 9, 2024

@lh3 I have not used winnowmap in a while, but can you not just recreate the -ava-ont preset with commandline argument values?

@lh3
Copy link

lh3 commented Oct 9, 2024

Winnowmap algorithm is optimized for read-to-genome alignment, not for AvA. It requires a list of high-frequency k-mers which takes time to produce. It is also times slower than minimap2. If minimap2 is already the sole performance bottleneck, winnowmap will be more problematic.

AvA assumes reads in random order. With the HERRO setting, 2.7X of human reads are indexed in each batch. With sorted input, you will have 60X in one batch. You will need a threshold of 20-fold higher. The distorted k-mer count distribution will also screw up frequency-based threshold estimation.

The solution is to either increase the k-mer count threshold with -f or reduce the batch size with -I, or both, but in general, it is tricky to reassemble aligned reads in centromeres. The read alignment can be wrong.

@diego-rt
Copy link
Author

diego-rt commented Oct 9, 2024

Oh I see! Well in my case I'm only doing targeted assembly of tangles so for instance this region only has 2.2 Gbp of ONT reads (~40x or so), so it should all fit in one batch.

I've ran the AvA alignments using shuffled and position sorted reads and because of the small size of the dataset I get essentially identical results:

wc -l ont_reads.aln.paf 
3012688 ont_reads.aln.paf

wc -l ont_reads.shuffled.aln.paf 
3012684 ont_reads.shuffled.aln.paf

But thanks a lot for bringing it up @lh3 , I hadn't thought of that and will shuffle the reads when using datasets larger than the batch size.

Regarding winnowmap, bizarrely it actually works without a high frequency minimiser list and even more bizarrely, it is actually faster without the high frequency minimiser list. We benchmarked it in this issue here.

That being said, winnowmap is still much faster than minimap2 with tweaked -f flags when aligning to a reference. In the case of AvA the minimap2 runs for this datasets with a decreased -f 0 are taking upwards of 600G memory and have been running for more than 2 days, compared to ~30 min for minimap2 with the settings listed above, so this is clearly unfeasible. I guess I could search for an -f value that gives reasonable runtimes but I fear that I would have to adjust it for each satellite, which would be unpractical.

I guess one could indeed put together the AvA preset @jelber2 but I'm also not sure how well this will work. I will test but I don't have high hopes.

@lh3
Copy link

lh3 commented Oct 9, 2024

Without the high-frequency k-mer (not minimizer) list, winnowmap is close to the old minimap2. That is why it is faster and uses less memory, but then it would lose the main advantage and is less accurate for human reads. The observation of improved result in the other thread could be verkko specific or because the winnowmap-specific improvement is not optimized for non-human genomes.

As to -f, try -f5000, -f10000 or higher, or as I said, reduce the batch size. -f0 uses all k-mers. It is too extreme. Note that current read aligners, winnowmap included, are less accurate than whole-genome assemblers. When alignment is wrong, you can't get good local reassembly anyway. I would recommend whole-genome assembly.

@jelber2
Copy link

jelber2 commented Oct 15, 2024

@lh3 Would it not make more sense to adjust
-f
from the default
-f 0.0002
to
-f 0.00002
or
-f 0.000002 ,
since one does not know the number of repetitive minimizer occurrences, and the smaller fractions I list (i.e., they are still bigger than 0) would be more liberal than the default and allow for reads with more repetitive minimizers to be included?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
read_correction Read error correction
Projects
None yet
Development

No branches or pull requests

10 participants