2.1.1: overlapInCorePartition generates 67GB large shellscript full of just if and fi #1924
Comments
You're running correction, which should not use overlapInCore by default, so it likely hasn't been configured correctly or tested for this step. It is definitely not recommended or supported to run overlapInCore here. Did you set canu to use overlapInCore instead of the default mhap? As for the if/elif, the script is auto-generated, so it makes sense to keep the format of all partitions the same.
Here is the command line from the logs, showing how it was configured the first time:
The Here is what I tried so far (some did not finish yet) ;-)
I have only 5x haploid genomic DNA coverage: nominally 10x, but the input sample was diploid tissue, hence only 5x.
Yeah, |
About the limited amount of data: I know. We may get more, but I wonder whether we will reach the 20-30x range (haploid). Given that canu cannot mix Nanopore and PacBio HiFi reads, I think we should go for another batch of Nanopore reads, probably 1D reads again. We just received some 1.3 billion Illumina 2x150 PE reads for polishing. We have some 800 Mbp assembled using canu (read correction) and the shasta assembler (somebody else did that). Probably not ideal, but at least something to work with. How would you tweak my commands above to get something meaningful out? My attempts with canu above yielded: attempt6 gave [UNITIGGING/CONSENSUS] attempt8 [UNITIGGING/CONSENSUS] attempt9 gave [UNITIGGING/CONSENSUS]
I'd vote for 20x HiFi over 20x ONT personally. When you use the non-default overlapper options like minimap or mhap, you probably also need to specify utgReAlign=true (the shortcut -fast is overlapper=mhap utgReAlign=true), otherwise you'll likely end up with lots of overlaps that aren't usable for assembly. This will slow down the run. Honestly, having about 4gb assembled for a 7gb genome at 5-10x isn't bad. |
Thank you for your comments. If you think it is better to sacrifice the ONT data, I am fine with following your advice; I would not dare to say that myself. Great, I will incorporate the options you propose. I was happy to get any output from Did I say 4 Gbp assembled? No, 420 Mbp, unless I am blind. That is not bad either; we care about protein-coding regions only, so I would be happy to shrink the remaining 62-64 Gbp of unassembled data into something more compact and easier to search through using
I added some info for the utgReAlign option to the docs. The tip also defaults it to on so you'd have to turn it off explicitly. Yes, 420 you're right, I misread it on GitHub, 4gb would be too good. You don't have to throw away the ONT data, you can use it to fill gaps and/or for validation of the HiFi assembly. |
So how would you modify the commands I tried so far in #1924 (comment)? The minimap2-based assemblies finished, the remaining ones did not. I will recompile canu to current Meanwhile I may have a first set of Illumina-only contigs, so I could think about correcting the raw nanopore reads before the canu assembly.
There isn't a command that would speed it up significantly; a 7gb genome will take time, especially on a single node with 100ish cores. You can try overlapper=mhap or minimap together with utgReAlign=true; it is compatible with both overlappers and might help. If you have Illumina data plus low-coverage nanopore, you could try a hybrid assembler like MaSuRCA, which is designed for that data type.
Thank you. |
Hi @skoren , please replace the |
The shell code here is auto-generated, so special-casing the first and last statements would add complexity to the perl scripts. In the interest of keeping that code simple, I'd prefer to leave the if construct alone. I don't think it makes much difference in shell speed, assuming a reasonable number of jobs/partitions.
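For readers following the if/fi versus if/elif discussion, here is a minimal sketch of the two forms. The function names, the `jobid` and `range` variables, and the ranges are all hypothetical; this is not copied from canu's generated overlap.sh.

```shell
#!/bin/sh
# Hypothetical per-partition dispatch. All names and ranges are illustrative
# assumptions, not taken from canu's generated script.

# Form 1: independent if/fi blocks. Every test runs for every job,
# but each block has the same shape, so a generator can emit them uniformly.
select_independent() {
  jobid=$1
  range=""
  if [ "$jobid" -eq 1 ]; then range="1-1000"; fi
  if [ "$jobid" -eq 2 ]; then range="1001-2000"; fi
  if [ "$jobid" -eq 3 ]; then range="2001-3000"; fi
  echo "$range"
}

# Form 2: a single if/elif chain. Evaluation stops at the first match,
# but the generator must treat the first and last branches specially.
select_chained() {
  jobid=$1
  range=""
  if   [ "$jobid" -eq 1 ]; then range="1-1000"
  elif [ "$jobid" -eq 2 ]; then range="1001-2000"
  elif [ "$jobid" -eq 3 ]; then range="2001-3000"
  fi
  echo "$range"
}
```

Both forms select the same range for a given job. In either form the shell still has to read the whole file, so switching to elif would probably not make a multi-gigabyte script cheap to interpret.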
Hi,
it happened to me that overlapInCorePartition generated too many partitions and a 67GB correction/1-overlapper/overlap.sh shell script, which is practically just a huge if/else block followed by a call to a perl utility. Interpreting this file takes several CPU cores.
When canu configured itself, it logged:
Then, it did its math:
I think a roadblock should definitely be installed somewhere so that such a huge piece of code is prevented from further processing. Then we can think about what to change in the code (I don't know where).
BTW, the if and fi should at least be turned into if and elif.
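One possible shape for such a roadblock, as a rough sketch only: check the size of the generated script before running it. The function name `check_script_size` and the byte limit are hypothetical and not part of canu.

```shell
#!/bin/sh
# Hypothetical guard, not part of canu: refuse to continue when a generated
# script exceeds a size limit given in bytes.
check_script_size() {
  script=$1
  max_bytes=$2
  # A missing file is reported separately from an oversized one.
  [ -f "$script" ] || return 2
  # wc -c may pad its output with spaces on some systems; strip them.
  size=$(wc -c < "$script" | tr -d ' ')
  if [ "$size" -gt "$max_bytes" ]; then
    echo "ERROR: $script is $size bytes (limit $max_bytes); too many partitions?" >&2
    return 1
  fi
  return 0
}
```

A caller could run something like `check_script_size correction/1-overlapper/overlap.sh 100000000 || exit 1` before submitting jobs; the limit here is an arbitrary illustration, and a check on the number of partitions at configuration time might be a better place for it.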