learnErrors() on PacBio 16S very slow #2005
A couple of things you can do to speed things up a bit:
This seems low to me. Is this 30-35% of all available processors? Or 30-35% of a single processor?
Hi benjjneb, first of all, thanks again for your timely response.
It has now been running for approximately 1.5 days and the output is:
I only have 4 samples, so I'm surprised to see "The max qual score of...." 5 times already. Does this output give any indication of the overall progress? I also used library(parallel) to confirm the number of CPUs and set them (minus 1) in the code:
And here is my filtering step tracked:
Yes. That indicates that you've made it through the
This is the part that confuses me: you are seeing such low CPU usage. For example, when I run a job like this on my laptop (OSX) I typically see 400%-700% CPU usage (based on 100% being full usage of one core). I am not a Windows user... are the numbers you are reporting a percent of a single core? Or a percent of all 16 cores (e.g. 100% = full usage of all 16 cores)?
It has now looped 8 times, two more to go.
Sorry about not being clearer. 100% = full usage of all 16 cores.
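To compare the two CPU-usage conventions discussed here, the arithmetic can be sketched as follows. This is only an illustration in Python (not part of the DADA2 workflow); the function name `to_per_core_percent` is hypothetical, and the core count is taken from the original post.

```python
def to_per_core_percent(percent_of_all_cores: float, logical_cores: int) -> float:
    """Convert usage reported as a share of ALL logical cores
    (100% = every core busy) to the per-core convention
    (100% = one core fully busy)."""
    return percent_of_all_cores * logical_cores


# From the thread: 8 physical cores, 16 logical processors.
LOGICAL_CORES = 16

for reported in (30, 35):
    per_core = to_per_core_percent(reported, LOGICAL_CORES)
    # per_core / 100 is the approximate number of logical cores kept busy.
    print(f"{reported}% of all cores = {per_core:.0f}% per-core "
          f"(about {per_core / 100:.1f} logical cores busy)")
```

Under this reading, 30-35% of 16 logical cores corresponds to roughly 480%-560% in the per-core convention, although logical (hyperthreaded) processors are not equivalent to full physical cores, so the comparison is approximate.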
No, I am not loving those error model plots. Arguably the key feature to look for is that error rates are decreasing with increasing quality scores, and that does not seem to be the case here. What does a
You are arguably on the very edge of when you should even consider using DADA2 based on the low rate of duplication amongst your input data. Can you say a little more about what amplicon you are sequencing, and from what environment? Also what instrument/chemistry are you using?
I still don't understand why the utilization is so low, but I guess it got there in the end.
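The key feature mentioned above, error rates decreasing as quality scores increase, can be expressed as a crude check. The sketch below is in Python and uses made-up error rates purely for illustration; it does not read a DADA2 error model, and the helper name `is_non_increasing` is hypothetical. Real fitted error rates are noisy, so a strict test like this is only a rough first look.

```python
def is_non_increasing(error_rates):
    """Return True if each error rate is <= the one before it,
    with the list ordered from lowest to highest quality score."""
    return all(later <= earlier
               for earlier, later in zip(error_rates, error_rates[1:]))


# Hypothetical error rates at increasing quality scores (NOT from this thread).
well_behaved = [1e-1, 3e-2, 1e-2, 2e-3, 5e-4]
suspicious = [1e-2, 3e-2, 5e-3, 2e-2, 1e-2]

print(is_non_increasing(well_behaved))
print(is_non_increasing(suspicious))
```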
Here is a plotQualityProfile. It's PacBio HiFi full-length 16S data using Revio systems and should be these primers:
Could one of the issues be the lack of replicates? I got this data from a colleague who collected 4 soil samples from different environments and got them sequenced, without replication or pseudo-replication.
Hi,
I'm trying to use dada2 to analyse PacBio 16S full amplicon data. I only have four samples which I filtered with these settings:
The resulting fastq.gz files are 22 to 23 MB each.
Dereplication shows:
I realise I have a lot of unique sequences.
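One way to put a number on "a lot of unique sequences" is the fraction of reads that are unique after dereplication. The sketch below is a Python illustration only; the counts are hypothetical placeholders (the actual dereplication output is not reproduced here), and `duplication_summary` is an invented helper, not a DADA2 function.

```python
def duplication_summary(total_reads: int, unique_sequences: int) -> dict:
    """Summarise how much duplication a dereplicated sample contains."""
    if total_reads <= 0 or not 0 < unique_sequences <= total_reads:
        raise ValueError("need 0 < unique_sequences <= total_reads")
    return {
        # Close to 1.0 means almost every read is distinct (little duplication).
        "unique_fraction": unique_sequences / total_reads,
        # Average number of reads collapsing into each unique sequence.
        "mean_reads_per_unique": total_reads / unique_sequences,
    }


# Hypothetical counts, for illustration only.
summary = duplication_summary(total_reads=10_000, unique_sequences=9_000)
print(f"unique fraction: {summary['unique_fraction']:.2f}")
print(f"mean reads per unique sequence: {summary['mean_reads_per_unique']:.2f}")
```

A unique fraction near 1 means there is very little duplication among the input reads.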
Next, I want to learn the error rates but it is extremely slow. My output so far:
learnErrors() has been running for about 24 hours now and is still going. Is this normal? Can it be sped up somehow?
My computer should be sufficient for the calculation. I have 32 GB of RAM and a CPU with 8 cores, 16 logical processors and a base speed of 4.20 GHz. CPU consumption for the R process is consistently around 30-35%.
Any help with this would be very much appreciated. Thank you.