larger genome size than expected #2354
Comments
What's the ploidy of this plant? Is the 1.5 Gb the haploid size, before accounting for ploidy? I ask because, based on the overlap coverage and the k-mer plot in the report, the coverage you have is between 12-16x, which implies a 7-9 Gb genome and is in line with the assembly. It's expected that HiFi data will separate and assemble all the haplotypes and generate an assembly larger than the haploid genome size (e.g. for human the asm is 6 Gb, not 3 Gb). You can see this in the FAQ: https://canu.readthedocs.io/en/latest/faq.html#my-genome-size-and-assembly-size-are-different-help. If you haven't yet, I'd suggest running genomescope2 on the k-mer histogram for your genome to get its estimate of size and ploidy. You can then rely on a tool like purge_dups to remove the alt loci in the assembly.
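To make the coverage argument concrete, here is a minimal Python sketch of the arithmetic. The total base count is an assumption: the thread does not state it, and the roughly 110 Gb used here is only back-calculated from the 12-16x and 7-9 Gb figures above. The function names are hypothetical.

```python
def genome_size_from_coverage(total_bases: float, coverage: float) -> float:
    """Genome size (bp) implied by a total read base count and an observed depth."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_bases / coverage


def expected_coverage(total_bases: float, genome_size: float) -> float:
    """Depth you would expect if the genome really had the given size (bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# ASSUMPTION: about 110 Gb of HiFi bases. This number is not in the thread;
# it is inferred from "12-16x or 7-9 Gb" and is for illustration only.
TOTAL_BASES = 110e9

if __name__ == "__main__":
    for cov in (12, 16):
        size_gb = genome_size_from_coverage(TOTAL_BASES, cov) / 1e9
        print(f"observed {cov}x -> implied size {size_gb:.1f} Gb")

    # If the genome were really 1.5 Gb, the same data would pile up much deeper.
    depth = expected_coverage(TOTAL_BASES, 1.5e9)
    print(f"depth expected for a 1.5 Gb genome: {depth:.0f}x")
```

Under that assumed input, a 1.5 Gb genome would show a depth several times higher than the 12-16x seen in the report, which is why the observed coverage points to a much larger total assembly.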
Dear skoren, thank you very much for your response. According to genomescope2, the plant is diploid, and its genome size is 1.486 Gb.
Is that measured with Illumina data or with the same HiFi input? This is all one plant tissue/sample, right? You wouldn't expect population variability (as you would if you used a collection of gametes)?
I ran genomescope2 on both the Illumina and the HiFi reads: Illumina gives 1.486 Gb while HiFi gives 1.5 Gb, both with a high heterozygosity of 6.13%. Please, I need your input on this assembly; I have spent a lot of time on it. Thanks
I don't see anything in the assembly indicating that there is an issue. The logs are consistent with a much larger genome, so I'm not sure what you can do to change it. Can you share the k-mer histogram file from Canu's run in the unitigging/0-mercounts folder? Generally, collapsing haplotypes doesn't work on HiFi data. There are a few suggestions in the FAQ to try to trim the data, but I don't think they will make any difference here. I think you have multiple haplotypes in the assembly. You can confirm this using BUSCO or similar core gene counts and, assuming it is due to haplotype separation, your best bet is likely to rely on purge_dups, as I initially suggested.
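As a rough illustration of the BUSCO check, the sketch below parses a one-line BUSCO summary string and flags a high duplicated fraction, which is what haplotype separation would be expected to produce. The summary line, its values, and the 10% threshold are all made up for illustration; the line format follows the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout of BUSCO short summaries as I understand it, and may differ between BUSCO versions.

```python
import re

# Layout assumed: C:<pct>%[S:<pct>%,D:<pct>%],F:<pct>%,M:<pct>%,n:<count>
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> dict:
    """Extract complete/single/duplicated/fragmented/missing percentages and n."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(scores["n"])
    return scores


def looks_haplotype_duplicated(scores: dict, max_duplicated_pct: float = 10.0) -> bool:
    """True when the duplicated fraction exceeds an (arbitrary) threshold."""
    return scores["D"] > max_duplicated_pct


if __name__ == "__main__":
    # HYPOTHETICAL summary line, not taken from this assembly.
    example = "C:97.5%[S:12.0%,D:85.5%],F:1.0%,M:1.5%,n:1614"
    scores = parse_busco_summary(example)
    print(scores)
    print("likely multiple haplotypes:", looks_haplotype_duplicated(scores))
```

If most complete genes are reported as duplicated, that supports the multiple-haplotype explanation, and purge_dups is the next step; if they are mostly single-copy, the extra sequence has some other cause.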
Ah, that fit looks quite bogus: it's classifying the main peak as error k-mers, and the peak it has identified for the full model doesn't actually exist in the data. So I wouldn't believe that result at all. Have you tried increasing the ploidy to see if you get a better fit?
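One way to sanity-check a fit like this by hand is to locate the main peak in the k-mer histogram yourself and see what size it implies. The sketch below is a simplified stand-in, not what genomescope2 does: it assumes a plain histogram of (k-mer depth, number of distinct k-mers), uses a hypothetical `min_depth` cutoff to skip the low-depth error tail, and runs on a small made-up histogram.

```python
def find_main_peak(hist: dict, min_depth: int = 5) -> int:
    """Depth with the most distinct k-mers, ignoring depths below min_depth."""
    candidates = {depth: count for depth, count in hist.items() if depth >= min_depth}
    if not candidates:
        raise ValueError("no histogram bins at or above min_depth")
    return max(candidates, key=candidates.get)


def estimate_size(hist: dict, peak_depth: int, min_depth: int = 5) -> float:
    """Total non-error k-mers divided by the depth taken to be one genome copy."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    total_kmers = sum(depth * count for depth, count in hist.items() if depth >= min_depth)
    return total_kmers / peak_depth


if __name__ == "__main__":
    # TOY histogram (made up): an error tail at depths 1-3 and a peak near 14.
    toy_hist = {1: 1000, 2: 200, 3: 50, 12: 80, 13: 95, 14: 100, 15: 90, 16: 70, 28: 10}
    peak = find_main_peak(toy_hist)
    print("main peak depth:", peak)
    print("implied size in k-mers:", estimate_size(toy_hist, peak))
```

The estimate depends entirely on which peak is treated as a single genome copy. If a model assigns the real main peak to errors and places its peak elsewhere, the reported genome size will be off by a correspondingly large factor.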
Thank you for your time. You are right that the assembler is not identifying the right peaks. I ran hifiasm with several tweaks; the result is always about 7.5 Gb, and hifiasm does not identify the correct homozygous and heterozygous coverage. Please find the k-mer counts attached and advise me. Thank you.
Idle. The issue looks to be a genome that is truly much larger than the original, incorrect genomescope estimate.
Hello, thank you once again for developing canu.
I am trying to assemble a plant genome with a heterozygosity of 6.13% and high repeat content. The estimated genome size is 1.486 Gb, and Canu 2.2 gives me an assembly of 9 Gb.
My command is:
./canu -p myassembly -d canu_assembly maxThreads=32 genomeSize=1.5g -pacbio-hifi /nfs_fs/nfs4/Samaila/project/GingerGenome/SH/HiFi_Hic/HIFI_DATA/ShHIFI.fasta.gz
I have the assembly report for you below. Please, I need your assistance on how to improve the assembly using Canu 2.2.