
How to generate a de novo assembly (Meccus pallidipennis)? (Question) #44

mfonseca91 opened this issue Dec 2, 2024 · 12 comments

@mfonseca91

Hello, do you have any suggestions for performing a de novo assembly of Meccus pallidipennis, and for generating the corresponding config file? I have a fastq.gz file with 5.93 million unphased reads generated with the Ligation Sequencing DNA V14 kit (SQK-LSK114-XL) and a FLO-MIN114 flow cell.

@paoloshasta
Owner

We don't have an assembly configuration optimized for your situation, but I suggest using, at least as a starting point, one of the two we developed for the ULK114 kit announced by ONT in May 2024:

  • If you used HERRO error correction after base calling, use `--config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024.conf`.
  • If you did not use HERRO error correction and are working with reads from the standard base calling process, use `--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024.conf`.

Since your reads are shorter than ULK114 reads, the results will not be optimal. But if you post the following output files from your assembly here, I may be able to suggest tweaks for improvement:

  • AssemblySummary.html
  • stdout.log

Do you have an estimate of genome size for this species? Any idea of heterozygosity rate?
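When no published estimate exists, genome size can be roughly inferred from a k-mer frequency histogram (this is the idea behind tools such as GenomeScope). A minimal sketch of the underlying arithmetic, using a synthetic histogram; the `(multiplicity, count)` input layout is an assumption for illustration, not a Shasta output format:

```python
# Sketch: estimate haploid genome size from a k-mer frequency histogram.
# Hypothetical input: a list of (multiplicity, distinct_kmer_count) pairs,
# not tied to any specific tool's output format.

def estimate_genome_size(histogram, min_multiplicity=5):
    """Genome size ~ total k-mer observations / homozygous coverage peak.

    Multiplicities below min_multiplicity are treated as sequencing
    errors and ignored.
    """
    counts = dict(histogram)
    # Homozygous coverage peak: the most populated multiplicity
    # outside the low-multiplicity error region.
    peak = max(
        (m for m in counts if m >= min_multiplicity),
        key=counts.get,
    )
    total = sum(m * c for m, c in counts.items() if m >= min_multiplicity)
    return total // peak

# Synthetic data: error k-mers at multiplicity 1-2, coverage peak at 30x.
hist = [(1, 5_000_000), (2, 1_000_000)] + [
    (m, 100_000 - 2_000 * abs(m - 30)) for m in range(10, 51)
]
print(estimate_genome_size(hist))  # 3,260,000 for this synthetic input
```

A real estimate would model the heterozygous and repeat peaks separately, which is what GenomeScope does; this sketch only captures the basic total-k-mers-over-peak-depth idea.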

@paoloshasta added the question label Dec 2, 2024
@mfonseca91
Author


Thanks for your response. I tried running Shasta with the aforementioned configuration files; however, they are not available among Shasta's --config options. Where can I access or download the configuration files? I would also like to add that we don't have data on the genome size or heterozygosity rate of M. pallidipennis. However, the genome size of the closest species to M. pallidipennis, T. infestans, is around 17,301 bp.

@paoloshasta
Owner

Sorry, my directions were incorrect. You should omit the `.conf` portion of the assembly configuration names, because those are built-in configurations that don't require a configuration file. Also make sure to use the latest Shasta release, 0.13.0. So you should use either:

`--config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024`

OR

`--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024`

I apologize for the confusion.

@paoloshasta
Owner

That genome size of 17,301 bp seems too small. Are you sure that is not the number of genes instead?

@mfonseca91
Author


I have run the de novo assembly using the following command:

/home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --memoryMode anonymous --t 50

In the end, only the stdout.log file was generated.
stdout.log

On the other hand, we have confirmed that there is no published genome size or gene count for M. pallidipennis or for the closest species (T. infestans).

@paoloshasta
Owner

There are several things happening here. Let's discuss them one at a time.

It appears that your assembly process was killed while still running. Is it possible that this happened because of a memory issue or other system limit issue? Did you see a "Killed" message on the process output? That message would not make it to stdout.log. How much memory does the machine you are using have? Please post the output of the following commands:

head -1 /proc/meminfo
ulimit -a

By the time the process was killed some other files should have been generated. Can you post here a list of the files that were created?

Your reads are very short, and the 10 Kb read length cutoff used by that assembly configuration resulted in most of your coverage being discarded: as you can see from line 19 of stdout.log, 10.1 Gb of coverage in reads shorter than 10 Kb were discarded, with the result that the assembly used only 0.8 Gb of coverage in reads longer than 10 Kb (see line 23 of stdout.log). So the first thing to do is decrease the read length cutoff. You can do this via option --Reads.minReadLength. For example, to reduce the read length cutoff to 5 Kb add the following to your command line:

--Reads.minReadLength 5000

To find a reasonable cutoff value you can use the information in Binned-ReadLengthHistogram.csv. Choose a value that keeps about 80% of your 11 Gb of coverage. Hopefully this will result in a read length cutoff that is not too low.
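That choice can be automated by scanning candidate cutoffs against the histogram. A minimal sketch, assuming Binned-ReadLengthHistogram.csv can be reduced to `(bin_start_bp, bases_in_bin)` pairs; the actual column layout of that file may differ:

```python
# Sketch: pick the largest read length cutoff that still keeps a target
# fraction of total coverage. Input is assumed to be (bin_start_bp,
# bases_in_bin) pairs summarizing the read length histogram.

def choose_cutoff(bins, keep_fraction=0.8):
    total = sum(bases for _, bases in bins)
    best = 0
    for cutoff in sorted({start for start, _ in bins}):
        kept = sum(bases for start, bases in bins if start >= cutoff)
        if kept >= keep_fraction * total:
            best = cutoff  # still keeps enough coverage; try higher
    return best

# Toy histogram (units: bases): 1 Gb below 5 Kb, 9 Gb at or above it.
bins = [(0, 600_000_000), (2_500, 400_000_000),
        (5_000, 4_000_000_000), (10_000, 5_000_000_000)]
print(choose_cutoff(bins))  # 5000 for this toy histogram
```

For this invented histogram the suggested cutoff happens to be 5000; with real data the answer depends entirely on the shape of the read length distribution.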

Given that your reads are so short I would also override the minimum length of an alignment by also adding the following to the command line:

--Align.minAlignedMarkerCount 100

Do you know why your reads are so short? The ULK114 reads described by ONT in May 2024 have a read N50 around 100 Kb, and even without an Ultra-Long (UL) protocol you should be able to easily get 30 to 50 Kb read N50. Working with these short reads will be a serious handicap for the assembly process even if we are able to optimize the assembly configuration.
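For reference, read N50 (the largest length L such that reads of length at least L contain half of all sequenced bases) can be computed directly from the read lengths; a minimal sketch:

```python
# Sketch: compute read N50 from a list of read lengths.

def read_n50(lengths):
    half = sum(lengths) / 2
    running = 0
    # Walk reads longest-first until half the total bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0

# Toy example: total 15 bases, half is 7.5; 5 + 4 = 9 crosses it.
print(read_n50([1, 2, 3, 4, 5]))  # → 4
```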

@mfonseca91
Author


No “Killed” message was generated.

About the machine memory:

head -1 /proc/meminfo
MemTotal:       1056443908 kB

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126359
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126359
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

List of files created:

Binned-ReadLengthHistogram.csv
DuplicateReads.csv
KmerFrequencyHistogram.csv
LowHashBucketHistogram.csv
performance.log
ReadLengthHistogram.csv
ReadLowHashStatistics.csv
shasta.conf
stdout.log
SuppressedAlignmentCandidates.csv

As for why our reads are so short, we suspect it is related to the library preparation; most of the short fragments were not successfully eliminated.

@paoloshasta
Owner

Ok, your machine configuration looks fine, but we have no explanation why the assembly was interrupted. Is it possible that you are using a batch system that requires you to specify a time limit? Or other resource limitations? Are you running interactively or under a job submission system?

In any event, try running again, this time using my above suggestions:

--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --Reads.minReadLength 5000 --Align.minAlignedMarkerCount 100

In addition to stdout.log, please also post the following for the new assembly:

  • Binned-ReadLengthHistogram.csv for the new assembly, so I can get a better idea of your read length distribution.
  • performance.log

@mfonseca91
Author


We have run the new de novo assembly interactively (no job submission system), with the following command:

/home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --Reads.minReadLength 5000 --Align.minAlignedMarkerCount 100

Here are the new results:

stdout.log
Binned-ReadLengthHistogram.csv
performance.log

@paoloshasta
Owner

The new stdout.log shows (at line 3) that you used the following command line, which is identical to what you used previously:

/home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --memoryMode anonymous --t 50

So, most of your 11 Gb of coverage was again discarded (see lines 19 and 23 of stdout.log). Please add the options I suggested:

--Reads.minReadLength 5000 --Align.minAlignedMarkerCount 100

It is likely that this is a crash caused by the pathologically low coverage (only 5 alignment candidates were found, see line 132 of stdout.log). I can investigate the crash if you somehow give me access to the input file (you can do that privately via the e-mail address on my GitHub profile). However, fixing the crash will not improve the low coverage situation. Reducing the read length cutoff to 5 Kb might help, but that depends on the read length distribution of your reads, which appear to be abnormally short.

@mfonseca91
Author


Sorry, here are the new results:

stdout.log
Binned-ReadLengthHistogram.csv
performance.log

@paoloshasta
Owner

Even with the reduced 5 Kb read length cutoff, still more than half of your 11 Gb of coverage is discarded. From stdout.log:

Discarded 5357248 short reads for a total 6628819705 bases.
Total number of raw bases is 4232871866.

Your reads are just too short. Shasta is designed for long reads with an N50 of 20-30 Kb or more. I am not an expert in sequencing protocols, but you might consider asking ONT for help, because I know that most people who use ONT reads get much longer reads than you are getting.

In addition, given the low number of alignment candidates found, it is still possible that your coverage is just too low for the genome you want to assemble. For example, if you were to assume a genome size of 1 Gb, your 4 Gb in reads longer than 5 Kb would correspond to only 4x coverage, entirely insufficient for de novo assembly (you need to be around 30x at least).
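The coverage arithmetic in that last paragraph is simple enough to sketch. The 1 Gb genome size is the assumed figure from the example above, not a measured value; the usable base count is taken from the stdout.log excerpt:

```python
# Sketch: depth of coverage = total usable bases / genome size.

def coverage_depth(total_bases, genome_size):
    return total_bases / genome_size

usable_bases = 4_232_871_866  # bases surviving the 5 Kb cutoff (stdout.log)
genome_size = 1_000_000_000   # assumed 1 Gb, for illustration only

depth = coverage_depth(usable_bases, genome_size)
shortfall = 30 * genome_size - usable_bases  # extra bases needed for ~30x
print(f"{depth:.1f}x coverage; about {shortfall / 1e9:.0f} Gb more for 30x")
```

Under this assumption the run has roughly 4.2x coverage, far below the ~30x usually considered a minimum for de novo assembly.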
