Which part of over 200 X Hifi reads is selected ? #2241

sunriseTM · 2023-07-09T08:39:26Z

Hi, I am assembling a microorganism genome from HiFi reads at well over 200X coverage, but Canu told me that only 200X was selected for assembly. Do you have any idea which part of the data was selected?

Canu says this:

For genome size of 28000000 bases,
retain 5600000000 bases (200.00X coverage).

Found 1106878 reads with 20999061795 bases (749.97X coverage).
Dropped 811663 reads with 15399054876 bases (549.97X coverage).
Retained 295215 reads with 5600006919 bases (200.00X coverage).
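For reference, the coverage figures in this log appear to be total bases divided by genome size. The following is a minimal sketch of that arithmetic, not Canu's own code; the `coverage` helper is a hypothetical name introduced only for illustration.

```python
def coverage(total_bases: int, genome_size: int) -> float:
    """Fold coverage = total sequenced bases / genome size (illustrative helper)."""
    return total_bases / genome_size


GENOME_SIZE = 28_000_000  # genome size reported in the log above

# Base counts copied from the log above.
found = coverage(20_999_061_795, GENOME_SIZE)
dropped = coverage(15_399_054_876, GENOME_SIZE)
retained = coverage(5_600_006_919, GENOME_SIZE)

print(f"found    {found:.2f}X")
print(f"dropped  {dropped:.2f}X")
print(f"retained {retained:.2f}X")
```

The dropped and retained counts add up to the found counts (811663 + 295215 = 1106878 reads), so the log accounts for every input read.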

brianwalenz · 2023-07-09T11:52:15Z

At that point in the process it is picking random reads. Parameter readSamplingCoverage controls this behavior: https://canu.readthedocs.io/en/latest/parameter-reference.html#readsamplingcoverage. Increasing the coverage here will greatly slow down correction and the only real benefit is if there are sequences (e.g., plasmids) in the input.

After correction, reads are further down-sampled to (roughly) 40x, but this does take into account 'rare' sequence. This is done because overlap-layout-consensus methods seem to suffer with excessive coverage. This particular parameter is corOutCoverage: https://canu.readthedocs.io/en/latest/faq.html?highlight=coroutcoverage#why-is-my-assembly-is-missing-my-favorite-short-plasmid

sunriseTM · 2023-07-09T14:54:30Z

Thanks for your prompt reply! I know Canu only outputs 40X of corrected reads when assembling PacBio CLR reads. Is there also a 'correction' step for HiFi reads? I had assumed HiCanu assembles HiFi reads directly.

brianwalenz · 2023-07-10T10:36:05Z

Oops, I missed that you had HiFi data! Sorry for the confusion. There is no correction phase for HiFi reads; they're also assumed to be pre-trimmed.

skoren · 2023-07-10T11:34:17Z

There is still default subsampling for HiFi reads (it should have been 50x, not 200x) because higher coverage doesn't help overlap-layout type algorithms. I would suggest randomly subsampling your reads to 50x. Definitely do not select the longest reads with HiFi, as those will be the lowest quality.
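One way to do such a random subsample by hand is sketched below. It assumes the read lengths are already in memory, and the function name `subsample_to_coverage` is hypothetical. It is not how Canu selects reads internally; it only illustrates picking reads in random order, regardless of length, until the target number of bases is reached.

```python
import random


def subsample_to_coverage(read_lengths, genome_size, target_coverage, seed=0):
    """Pick reads in random order until ~target_coverage is reached.

    Returns (sorted indices of chosen reads, total bases chosen).
    Read length plays no role in the ordering, so long reads are not favored.
    """
    target_bases = genome_size * target_coverage
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # seeded for reproducibility

    chosen, total = [], 0
    for i in order:
        if total >= target_bases:
            break
        chosen.append(i)
        total += read_lengths[i]
    return sorted(chosen), total


# Toy data: 1000 reads of 1 kb against a 10 kb "genome" is 100x in total.
lengths = [1000] * 1000
picked, bases = subsample_to_coverage(lengths, genome_size=10_000, target_coverage=50)
print(len(picked), "reads,", bases / 10_000, "x coverage")
```

If the input holds less than the target coverage, the loop simply keeps every read.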

skoren · 2023-07-14T14:27:51Z

There was a bug introduced in v2.2 that set the HiFi subsampling to 200x as well, instead of 50x. I suggest using maxInputCoverage=50 for HiFi data until the next release.
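As a rough sketch of how that workaround might look on the command line, the snippet below only builds and prints a Canu invocation; it does not run it. The prefix `asm`, the output directory `asm-out`, and the file name `reads.fastq.gz` are placeholders, not values from this thread, and `canu_hifi_command` is a hypothetical helper.

```python
import shlex


def canu_hifi_command(prefix, out_dir, genome_size, reads, max_input_coverage=50):
    """Build (but do not run) a Canu command line for HiFi reads."""
    return [
        "canu",
        "-p", prefix,                                 # assembly name prefix (placeholder)
        "-d", out_dir,                                # output directory (placeholder)
        f"genomeSize={genome_size}",
        f"maxInputCoverage={max_input_coverage}",     # workaround suggested above
        "-pacbio-hifi", *reads,
    ]


cmd = canu_hifi_command("asm", "asm-out", "28m", ["reads.fastq.gz"])
print(shlex.join(cmd))
```

To run it, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where Canu is installed.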

sunriseTM changed the title ~~Which part of over 200 Xis selected over 200X~~ Which part of over 200 X Hifi reads is selected ? Jul 9, 2023

skoren added a commit that referenced this issue Jul 14, 2023

Fix default coverage subsampling for HiFi (e.g. #2241)

e0ed3bb

skoren closed this as completed Jul 14, 2023
