Which part of over 200 X Hifi reads is selected ? #2241

Closed
sunriseTM opened this issue Jul 9, 2023 · 5 comments

Comments

@sunriseTM

sunriseTM commented Jul 9, 2023

Hi, I am assembling a microorganism genome with over 200X HiFi reads, but Canu told me that only 200X was selected for assembly. Do you have any idea which part of the reads was selected?

Canu says this:

For genome size of 28000000 bases,
retain 5600000000 bases (200.00X coverage).

Found 1106878 reads with 20999061795 bases (749.97X coverage).
Dropped 811663 reads with 15399054876 bases (549.97X coverage).
Retained 295215 reads with 5600006919 bases (200.00X coverage).
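
The coverage figures in this log follow from dividing each base count by the genome size. A minimal sketch that reproduces the arithmetic (the function name `coverage` is only illustrative and is not part of Canu):

```python
def coverage(total_bases: int, genome_size: int) -> float:
    """Depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


genome_size = 28_000_000      # genome size reported in the log
found = 20_999_061_795        # bases in all input reads
dropped = 15_399_054_876      # bases in reads that were dropped
retained = 5_600_006_919      # bases in reads that were kept

for label, bases in [("Found", found), ("Dropped", dropped), ("Retained", retained)]:
    print(f"{label:<9}{bases:>15,d} bases  {coverage(bases, genome_size):7.2f}X")

# The retained bases are exactly what is left after dropping.
assert found - dropped == retained
```

So the input is about 750X, and the 200X that was retained is roughly 27% of the reads by base count.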

sunriseTM changed the title from "Which part of over 200 Xis selected over 200X" to "Which part of over 200 X Hifi reads is selected ?" Jul 9, 2023
@brianwalenz
Member

At that point in the process it is picking random reads. Parameter readSamplingCoverage controls this behavior: https://canu.readthedocs.io/en/latest/parameter-reference.html#readsamplingcoverage. Increasing the coverage here will greatly slow down correction, and the only real benefit is if there are rare sequences (e.g., plasmids) in the input.

After correction, reads are further down-sampled to (roughly) 40x, but this does take into account 'rare' sequence. This is done because overlap-layout-consensus methods seem to suffer with excessive coverage. This particular parameter is corOutCoverage: https://canu.readthedocs.io/en/latest/faq.html?highlight=coroutcoverage#why-is-my-assembly-is-missing-my-favorite-short-plasmid

@sunriseTM
Author

Thanks for your prompt reply! I know Canu only outputs 40X of corrected reads when assembling PacBio CLR reads. Is there also a 'correction' process for HiFi reads? I had assumed HiCanu assembles HiFi reads directly.

@brianwalenz
Member

Oops, I missed that you had HiFi data! Sorry for the confusion. There is no correction phase for HiFi reads; they're also assumed to be pre-trimmed.

@skoren
Member

skoren commented Jul 10, 2023

There is still default subsampling for HiFi reads (it should have been 50x, not 200x) because higher coverage doesn't help overlap-layout type algorithms. I would suggest randomly subsampling your reads to 50x. Definitely do not select the longest reads with HiFi, as those will be the lowest quality.
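
As an illustration of random subsampling to a target coverage, here is a minimal sketch. It is not Canu's internal sampling code: the function `subsample_random` and the synthetic read lengths are hypothetical, and for real FASTQ files an established subsampling tool is the safer choice. It visits reads in a shuffled order rather than longest-first and stops once the target number of bases is reached.

```python
import random


def subsample_random(read_lengths, genome_size, target_coverage, seed=0):
    """Pick reads in random order until their total length reaches
    target_coverage * genome_size. Returns (kept indices, total bases)."""
    target_bases = target_coverage * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # random order, not longest-first
    kept, total = [], 0
    for i in order:
        if total >= target_bases:
            break
        kept.append(i)
        total += read_lengths[i]
    return kept, total


# Synthetic example: 2,000 reads of 10-25 kb against a 100 kb "genome".
rng = random.Random(1)
lengths = [rng.randint(10_000, 25_000) for _ in range(2_000)]
kept, total = subsample_random(lengths, genome_size=100_000, target_coverage=50)
print(f"kept {len(kept)} of {len(lengths)} reads, {total / 100_000:.2f}X")
```

Because the stopping check runs before each read is added, the result overshoots the target by less than one read length. If the input holds less than the target coverage, every read is kept.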

@skoren
Member

skoren commented Jul 14, 2023

A bug introduced in v2.2 set the HiFi subsampling to 200x as well, instead of 50x. I suggest using maxInputCoverage=50 for HiFi data until the next release.

@skoren skoren closed this as completed Jul 14, 2023