
How to generate a de novo assembly (Meccus pallidipennis)? (Question) #44

mfonseca91 opened this issue Dec 2, 2024 · 12 comments

@mfonseca91

Hello, do you have any suggestions for performing a de novo assembly of Meccus pallidipennis, and for generating the corresponding config file? I have a fastq.gz file with 5.93 million unphased reads generated with the Ligation Sequencing DNA V14 kit (SQK-LSK114-XL) and a FLO-MIN114 flow cell.

@paoloshasta
Owner

We don't have an assembly configuration optimized for your situation, but I suggest using, at least as a starting point, one of the two we developed for the ULK114 kit announced by ONT in May 2024:

  • If you used HERRO error correction after base calling, use `--config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024.conf`.
  • If you did not use HERRO error correction and are working with reads from the standard base calling process, use `--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024.conf`.

Since your reads are shorter than ULK114 reads, the results will not be optimal. But if you post the following output files from your assembly here, I may be able to suggest tweaks for improvement:

  • AssemblySummary.html
  • stdout.log

Do you have an estimate of genome size for this species? Any idea of heterozygosity rate?
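When no published estimate exists, genome size can be roughly inferred from a k-mer frequency histogram (this is the idea behind tools such as GenomeScope). A minimal sketch of the underlying arithmetic, using a synthetic histogram; the `(multiplicity, count)` input layout is an assumption for illustration, not a Shasta output format:

```python
# Sketch: estimate haploid genome size from a k-mer frequency histogram.
# Hypothetical input: a list of (multiplicity, distinct_kmer_count) pairs,
# not tied to any specific tool's output format.

def estimate_genome_size(histogram, min_multiplicity=5):
    """Genome size ~ total k-mer observations / homozygous coverage peak.

    Multiplicities below min_multiplicity are treated as sequencing
    errors and ignored.
    """
    counts = dict(histogram)
    # Homozygous coverage peak: the most populated multiplicity
    # outside the low-multiplicity error region.
    peak = max(
        (m for m in counts if m >= min_multiplicity),
        key=counts.get,
    )
    total = sum(m * c for m, c in counts.items() if m >= min_multiplicity)
    return total // peak

# Synthetic data: error k-mers at multiplicity 1-2, coverage peak at 30x.
hist = [(1, 5_000_000), (2, 1_000_000)] + [
    (m, 100_000 - 2_000 * abs(m - 30)) for m in range(10, 51)
]
print(estimate_genome_size(hist))  # 3,260,000 for this synthetic input
```

A real estimate would model the heterozygous and repeat peaks separately, which is what GenomeScope does; this sketch only captures the basic total-k-mers-over-peak-depth idea.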

@paoloshasta added the question label Dec 2, 2024
@mfonseca91
Author


Thanks for your response. I tried running Shasta with the aforementioned configuration files; however, they are not available among Shasta's --config options. Where can I access or download the configuration files? I would also like to add that we don't have data on the genome size or heterozygosity rate of M. pallidipennis. However, the genome size of the closest species to M. pallidipennis, T. infestans, is around 17,301 bp.

@paoloshasta
Owner

Sorry, my directions were incorrect. You should omit the `.conf` portion of the assembly configuration names, because those are built-in configurations that don't require a configuration file. Also make sure to use the latest Shasta release, 0.13.0. So you should use either:

`--config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024`

OR

`--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024`

I apologize for the confusion.

@paoloshasta
Owner

That genome size of 17,301 bp seems too small. Are you sure that is not the number of genes instead?

@mfonseca91
Author


I have run the de novo assembly using the following command:

/home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --memoryMode anonymous --t 50

In the end, only the stdout.log file was generated.
stdout.log

On the other hand, we have confirmed that there is no published genome size or gene count for M. pallidipennis or for the closest species (T. infestans).

@paoloshasta
Owner

There are several things happening here. Let's discuss them one at a time.

It appears that your assembly process was killed while still running. Is it possible that this happened because of a memory issue or other system limit issue? Did you see a "Killed" message on the process output? That message would not make it to stdout.log. How much memory does the machine you are using have? Please post the output of the following commands:

head -1 /proc/meminfo
ulimit -a

By the time the process was killed some other files should have been generated. Can you post here a list of the files that were created?

Your reads are very short, and the 10 Kb read length cutoff used by that assembly configuration resulted in most of your coverage being discarded: as you can see from line 19 of stdout.log, 10.1 Gb of coverage in reads shorter than 10 Kb were discarded, with the result that the assembly used only 0.8 Gb of coverage in reads longer than 10 Kb (see line 23 of stdout.log). So the first thing to do is decrease the read length cutoff. You can do this via option --Reads.minReadLength. For example, to reduce the read length cutoff to 5 Kb add the following to your command line:

--Reads.minReadLength 5000

To find a reasonable cutoff value you can use the information in Binned-ReadLengthHistogram.csv. Choose a value that keeps about 80% of your 11 Gb of coverage. Hopefully this will result in a read length cutoff that is not too low.
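That choice can be automated by scanning candidate cutoffs against the histogram. A minimal sketch, assuming Binned-ReadLengthHistogram.csv can be reduced to `(bin_start_bp, bases_in_bin)` pairs; the actual column layout of that file may differ:

```python
# Sketch: pick the largest read length cutoff that still keeps a target
# fraction of total coverage. Input is assumed to be (bin_start_bp,
# bases_in_bin) pairs summarizing the read length histogram.

def choose_cutoff(bins, keep_fraction=0.8):
    total = sum(bases for _, bases in bins)
    best = 0
    for cutoff in sorted({start for start, _ in bins}):
        kept = sum(bases for start, bases in bins if start >= cutoff)
        if kept >= keep_fraction * total:
            best = cutoff  # still keeps enough coverage; try higher
    return best

# Toy histogram (units: bases): 1 Gb below 5 Kb, 9 Gb at or above it.
bins = [(0, 600_000_000), (2_500, 400_000_000),
        (5_000, 4_000_000_000), (10_000, 5_000_000_000)]
print(choose_cutoff(bins))  # 5000 for this toy histogram
```

For this invented histogram the suggested cutoff happens to be 5000; with real data the answer depends entirely on the shape of the read length distribution.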

Given that your reads are so short I would also override the minimum length of an alignment by also adding the following to the command line:

--Align.minAlignedMarkerCount 100

Do you know why your reads are so short? The ULK114 reads described by ONT in May 2024 have a read N50 around 100 Kb, and even without an Ultra-Long (UL) protocol you should be able to easily get 30 to 50 Kb read N50. Working with these short reads will be a serious handicap for the assembly process even if we are able to optimize the assembly configuration.
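For reference, read N50 (the largest length L such that reads of length at least L contain half of all sequenced bases) can be computed directly from the read lengths; a minimal sketch:

```python
# Sketch: compute read N50 from a list of read lengths.

def read_n50(lengths):
    half = sum(lengths) / 2
    running = 0
    # Walk reads longest-first until half the total bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0

# Toy example: total 15 bases, half is 7.5; 5 + 4 = 9 crosses it.
print(read_n50([1, 2, 3, 4, 5]))  # → 4
```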

@mfonseca91
Author


No “Killed” message was generated.

About the machine memory:

head -1 /proc/meminfo
MemTotal:       1056443908 kB

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126359
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126359
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

List of files created:

Binned-ReadLengthHistogram.csv
DuplicateReads.csv
KmerFrequencyHistogram.csv
LowHashBucketHistogram.csv
performance.log
ReadLengthHistogram.csv
ReadLowHashStatistics.csv
shasta.conf
stdout.log
SuppressedAlignmentCandidates.csv

As for why our reads are so short, we suspect it is related to the library preparation; most of the short fragments were not successfully eliminated.

@paoloshasta
Owner

Ok, your machine configuration looks fine, but we have no explanation why the assembly was interrupted. Is it possible that you are using a batch system that requires you to specify a time limit? Or other resource limitations? Are you running interactively or under a job submission system?

In any event, try running again, this time using my above suggestions:

--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --Reads.minReadLength 5000 --Align.minAlignedMarkerCount 100

In addition to stdout.log, please also post the following for the new assembly:

  • Binned-ReadLengthHistogram.csv for the new assembly, so I can get a better idea of your read length distribution.
  • performance.log

@mfonseca91
Author


We have run the new de novo assembly interactively (no job submission system), with the following command:

/home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --Reads.minReadLength 5000 --Align.minAlignedMarkerCount 100

Here are the new results:

stdout.log
Binned-ReadLengthHistogram.csv
performance.log

@paoloshasta
Owner

The new stdout.log shows (at line 3) that you used the following command line, which is identical to what you used previously:

/home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --memoryMode anonymous --t 50

So, most of your 11 Gb of coverage was again discarded (see lines 19 and 23 of stdout.log). Please add the options I suggested:

--Reads.minReadLength 5000 --Align.minAlignedMarkerCount 100

It is likely that this is a crash caused by the pathologically low coverage (only 5 alignment candidates were found, see line 132 of stdout.log). I can investigate the crash if you somehow give me access to the input file (you can do that privately via the e-mail address on my GitHub profile). However, fixing the crash will not improve the low coverage situation. Reducing the read length cutoff to 5 Kb might help, but that depends on the read length distribution of your reads, which appear to be abnormally short.

@mfonseca91
Author


Sorry, here are the new results:

stdout.log
Binned-ReadLengthHistogram.csv
performance.log

@paoloshasta
Owner

Even with the reduced 5 Kb read length cutoff, still more than half of your 11 Gb of coverage is discarded. From stdout.log:

Discarded 5357248 short reads for a total 6628819705 bases.
Total number of raw bases is 4232871866.

Your reads are just too short. Shasta is designed for long reads with an N50 of 20-30 Kb or more. I am not an expert in sequencing protocols, but you might consider asking ONT for help, because I know that most people who use ONT reads get much longer reads than you are getting.

In addition, given the low number of alignment candidates found, it is still possible that your coverage is just too low for the genome you want to assemble. For example, if you were to assume a genome size of 1 Gb, your 4 Gb in reads longer than 5 Kb would correspond to only 4x coverage, entirely insufficient for de novo assembly (you need to be around 30x at least).
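The coverage arithmetic in that last paragraph is simple enough to sketch. The 1 Gb genome size is the assumed figure from the example above, not a measured value; the usable base count is taken from the stdout.log excerpt:

```python
# Sketch: depth of coverage = total usable bases / genome size.

def coverage_depth(total_bases, genome_size):
    return total_bases / genome_size

usable_bases = 4_232_871_866  # bases surviving the 5 Kb cutoff (stdout.log)
genome_size = 1_000_000_000   # assumed 1 Gb, for illustration only

depth = coverage_depth(usable_bases, genome_size)
shortfall = 30 * genome_size - usable_bases  # extra bases needed for ~30x
print(f"{depth:.1f}x coverage; about {shortfall / 1e9:.0f} Gb more for 30x")
```

Under this assumption the run has roughly 4.2x coverage, far below the ~30x usually considered a minimum for de novo assembly.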
