How to generate a de novo assembly (Meccus pallidipennis)? (Question) #44
Comments
We don't have an assembly configuration optimized for your situation, but I suggest using, at least as a starting point, one of the two we developed for the ULK114 toolkit as announced by ONT in May 2024:
Since your reads are shorter than the ULK114 reads the results will not be optimal. But if you can post here the following output files from your assembly I may then be able to suggest tweaks for improvement:
Do you have an estimate of genome size for this species? Any idea of heterozygosity rate? |
Thanks for your response. I tried running Shasta with the aforementioned configuration files; however, they are not available among Shasta's --config options. Where can I access or download the configuration files? I would also like to add that we don't have data on the genome size or heterozygosity rate of M. pallidipennis. However, the closest species to M. pallidipennis is T. infestans, whose genome size is around 17,301 bp. |
Sorry, my directions were incorrect. You should omit the
OR
I apologize for the confusion. |
That genome size of 17301 bp seems too small. Are you sure that is not the number of genes instead? |
I ran the de novo assembly using the following command: /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --anonymousmemoryMode --t 50 In the end, only the stdout.log file was generated. We have also confirmed that no genome size estimate or gene count is available for M. pallidipennis or for the closest species (T. infestans). |
There are several things happening here. Let's discuss them one at a time. It appears that your assembly process was killed while still running. Is it possible that this happened because of a memory issue or other system limit issue? Did you see a "Killed" message on the process output? That message would not make it to
By the time the process was killed some other files should have been generated. Can you post here a list of the files that were created? Your reads are very short, and the 10 Kb read length cutoff used by that assembly configuration resulted in most of your coverage being discarded: as you can see from line 19 of
To find a reasonable cutoff value you can use the information in Given that your reads are so short I would also override the minimum length of an alignment by also adding the following to the command line:
Do you know why your reads are so short? The ULK114 reads described by ONT in May 2024 have a read N50 around 100 Kb, and even without an Ultra-Long (UL) protocol you should be able to easily get 30 to 50 Kb read N50. Working with these short reads will be a serious handicap for the assembly process even if we are able to optimize the assembly configuration. |
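The effect of a read length cutoff on the retained coverage, and the read N50 mentioned above, can be checked with a few lines of Python. This is only a minimal sketch: the function names `read_n50` and `bases_kept` and the toy read lengths are made up for illustration and are not part of Shasta, and the real lengths would have to come from your own FASTQ file.

```python
def read_n50(lengths):
    """Return the read N50: the length L such that reads of
    length >= L contain at least half of all sequenced bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def bases_kept(lengths, min_read_length):
    """Return (bases kept, fraction of bases kept) when reads
    shorter than min_read_length are discarded."""
    total = sum(lengths)
    kept = sum(length for length in lengths if length >= min_read_length)
    return kept, (kept / total if total else 0.0)


if __name__ == "__main__":
    # Toy read lengths in bases (hypothetical, for illustration only).
    lengths = [2000, 3000, 4000, 6000, 12000, 25000]
    print("read N50:", read_n50(lengths))
    for cutoff in (10000, 5000):
        kept, fraction = bases_kept(lengths, cutoff)
        print(f"cutoff {cutoff}: {kept} bases kept ({fraction:.0%})")
```

Trying a few cutoff values this way shows how much coverage each one would discard before a value is passed to `--Reads.minReadLength`.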
No "Killed" message was generated.
About the machine memory:
head -1 /proc/meminfo
ulimit -a
List of files created:
Binned-ReadLengthHistogram.csv
As for why the reads are short, we suspect the library preparation: most of the short fragments were not successfully removed. |
Ok, your machine configuration looks fine, but we have no explanation why the assembly was interrupted. Is it possible that you are using a batch system that requires you to specify a time limit? Or other resource limitations? Are you running interactively or under a job submission system? In any event, try running again, this time using my above suggestions:
In addition to
|
We ran the new de novo assembly interactively (no job submission system), with the following command: /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --Reads.minReadLength 5000 --Align.minAlignedMarkerCount 100 Here are the new results: |
The new
So, most of your 11 Gb of coverage was again discarded (see lines 19 and 23 of
It is likely that this is a crash caused by the pathologically low coverage (only 5 alignment candidates were found, see line 132 of |
Sorry, here are the new results: |
Even with the reduced 5 Kb read length cutoff, still more than half of your 11 Gb of coverage is discarded. From
Your reads are just too short. Shasta is designed for long reads with an N50 of 20-30 Kb or more. I am not an expert in sequencing protocols, but you might consider asking ONT for help, because I know that most people who use ONT reads get much longer reads than you are getting. In addition, given the low number of alignment candidates found, it is still possible that your coverage is just too low for the genome you want to assemble. For example, if you were to assume a genome size of 1 Gb, your 4 Gb in reads longer than 5 Kb would correspond to only 4x coverage, entirely insufficient for de novo assembly (you need to be around 30x at least). |
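The coverage arithmetic in the example above can be written out as a short Python sketch. The 1 Gb genome size is only the assumed value from that example, not a measured size for M. pallidipennis, and the helper names `coverage` and `bases_needed` are hypothetical.

```python
def coverage(total_read_bases, genome_size):
    """Return the average depth of coverage (x) for a given
    number of read bases and an assumed genome size in bases."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_read_bases / genome_size


def bases_needed(genome_size, target_coverage=30):
    """Return the number of read bases needed to reach target_coverage."""
    return genome_size * target_coverage


if __name__ == "__main__":
    assumed_genome_size = 1_000_000_000  # 1 Gb, an assumption only
    usable_bases = 4_000_000_000         # about 4 Gb in reads longer than 5 Kb
    print("coverage:", coverage(usable_bases, assumed_genome_size), "x")
    print("bases needed for 30x:", bases_needed(assumed_genome_size))
```

Under that assumption, reaching about 30x would take roughly 30 Gb of reads above the length cutoff, compared with the 4 Gb available here.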
Hello, do you have any suggestions for performing a de novo assembly of Meccus pallidipennis, and for generating the corresponding config file? I have a fastq.gz file with 5.93 million unphased reads generated with the Ligation Sequencing DNA V14 kit (SQK-LSK114-XL) and the FLO-MIN114 flow cell.