learnErrors() on PacBio 16S very slow #2005

Open
trawler-crawler opened this issue Aug 22, 2024 · 7 comments

@trawler-crawler

Hi,

I'm trying to use dada2 to analyse PacBio full-length 16S amplicon data. I only have four samples, which I filtered with these settings:

> track2 <- filterAndTrim(list.files(input_dir, pattern="\\.fastq\\.gz$", full.names=TRUE),
                         filts2, 
                         minQ=3, minLen=1100, maxLen=1800, 
                         maxN=0, rm.phix=FALSE, maxEE=2, 
                         compress=TRUE, multithread=TRUE)

> print(track2)
           reads.in reads.out
1.fastq.gz   190343     63771
2.fastq.gz   187175     63361
3.fastq.gz   203640     68187
4.fastq.gz   184003     62988

The resulting fastq.gz files are 22 to 23 MB in size.

Dereplication shows:

> drp2 <- derepFastq(filts2, verbose=TRUE)
Dereplicating sequence entries in Fastq file: C:/Users/1.fastq.gz
Encountered 56052 unique sequences from 63771 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/2.fastq.gz
Encountered 54948 unique sequences from 63361 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/3.fastq.gz
Encountered 55390 unique sequences from 68187 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/4.fastq.gz
Encountered 53148 unique sequences from 62988 total sequences read.

I realise I have a lot of unique sequences.

Next, I want to learn the error rates but it is extremely slow. My output so far:

> err2 <- learnErrors(drp2, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
186771417 total bases in 127132 reads from 2 samples will be used for learning the error rates.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.

learnErrors() has been running for about 24 hours now and is still going. Is this normal? Can it be sped up somehow?

My computer should be sufficient for this calculation. I have 32 GB of RAM and a CPU with 8 cores, 16 logical processors, and a base speed of 4.20 GHz. CPU consumption for the R process is consistently around 30-35%.

Any help with this would be very much appreciated. Thank you.

@benjjneb
Owner

learnErrors is running the core DADA2 denoising algorithm repeatedly. Increasing numbers of unique sequences, increasing true diversity in a sample, and increasing sequence length all increase computation time. learnErrors can be the most computationally intensive step in the DADA2 workflow, and long-read data is more challenging computationally, so what you're seeing is not outside the realm of expectation. That said, given your number of unique sequences, I would expect this to be computationally tractable in ~days on a laptop.

A couple of things you can do to speed things up a bit (see the sketch after this list):

  1. Filter more stringently.
  2. Remove the BAND_SIZE=32 parameter (the default is 16 and will be faster).
  3. Move to a higher performance compute node.
  4. Check whether there is any issue with memory pressure. If so, remove the derepFastq step and run learnErrors on the filtered fastq files instead (they will then be loaded into memory one-at-a-time on the fly).
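For concreteness, a minimal sketch of how points 1, 2, and 4 could be combined. The maxEE=1 value and the input_dir/filtered_dir names are placeholders, not tuned recommendations:

library(dada2)

fns    <- list.files(input_dir, pattern="\\.fastq\\.gz$", full.names=TRUE)
filts2 <- file.path(filtered_dir, basename(fns))  # placeholder output directory

# 1. Filter more stringently, e.g. a lower maxEE than before.
filterAndTrim(fns, filts2,
              minQ=3, minLen=1100, maxLen=1800,
              maxN=0, rm.phix=FALSE, maxEE=1,
              compress=TRUE, multithread=TRUE)

# 2 + 4. Use the default BAND_SIZE (16) and pass the filtered files directly,
# so reads are loaded one file at a time instead of as in-memory derep objects.
err2 <- learnErrors(filts2, errorEstimationFunction=PacBioErrfun, multithread=TRUE)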

CPU consumption for R process is consistently around 30-35%

This seems low to me. Is this 30-35% of all available processors? Or 30-35% of a single processor?

@trawler-crawler
Author

trawler-crawler commented Sep 16, 2024

Hi benjjneb,

first of all, thanks again for your timely response.

  1. I adjusted maxEE to filter more stringently.
  2. I removed the BAND_SIZE=32 parameter.
  3. I used the university's compute node, which allowed me to use more RAM (>200 GB) but only gives me 8 cores or so. After 1.5 days of running learnErrors, my session expired and logged me out, losing any progress. Again, it didn't appear to use 100% of the CPU but 60-70%, though I believe this number is skewed because the CPU is also used by other users. It runs Windows, just like my own computer.
  4. I'm running the script again on my own computer and am currently using only 70% of my RAM (32 GB DDR5), so I don't believe there is any memory pressure. That's 70% of my total RAM; RStudio/R is sitting at 5.5 GB.
  5. It is 30-35% of all available processors/cores. But right now it's even just 10%.

It's now been running for approximately 1.5 days and the output is:

err2 <- learnErrors(drp2, errorEstimationFunction=PacBioErrfun, multithread=cores_to_use)
186771417 total bases in 127132 reads from 2 samples will be used for learning the error rates.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.

I only have 4 samples, so I'm surprised to see "The max qual score of...." now 5 times already. Is there any indication on the overall progress based on this output?

I also used library(parallel) to confirm the number of cores and set that number (minus one) in the code:

library(parallel)
available_cores <- detectCores() - 1  # Reserve one core for the OS
cores_to_use <- available_cores  

# Learn error rates
err2 <- learnErrors(drp2, errorEstimationFunction=PacBioErrfun, multithread=cores_to_use)

print(paste("Available cores:", available_cores))
[1] "Available cores: 15"

And here is my filtering step tracked:

print(track2)
           reads.in reads.out
1.fastq.gz   190343     63771
2.fastq.gz   187175     63361
3.fastq.gz   203640     68187
4.fastq.gz   184003     62988

@benjjneb
Owner

I only have 4 samples, so I'm surprised to see "The max qual score of...." now 5 times already. Is there any indication on the overall progress based on this output?

Yes. That indicates that you've made it through the selfConsist loop 5 times. The default maximum number of selfConsist steps is 10, so you are over halfway to the process finishing based on that output.
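For reference, a minimal sketch of how this shows up in the learnErrors arguments (MAX_CONSIST and verbose are documented learnErrors parameters; the MAX_CONSIST value shown is just the default made explicit):

# verbose output makes the selfConsist progress explicit, and MAX_CONSIST
# caps the number of selfConsist rounds (default 10).
err2 <- learnErrors(drp2,
                    errorEstimationFunction=PacBioErrfun,
                    multithread=TRUE,
                    MAX_CONSIST=10,  # default; lowering it trades accuracy for time
                    verbose=TRUE)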

Again, it didn't appear to use 100% of the CPU but 60-70%, but I believe this number is skewed because the CPU would also be used by other users.

It is 30-35% of all available processors/cores. But right now even just 10%.

The part that is confusing to me is that you are seeing such low CPU usage. For example, when I run a job like this on my laptop (OSX) I typically see 400%-700% CPU usage (with 100% being full usage of one core). I am not a Windows user... are the numbers you are reporting a percentage of a single core, or a percentage of all 16 cores (e.g. 100% = full usage of all 16 cores)?

@trawler-crawler
Author

It has now looped 8 times, two more to go.

The part that is confusing to me is that you are seeing such low CPU usage. For example, when I run a job like this on my laptop (OSX) I typically see 400%-700% CPU usage (with 100% being full usage of one core). I am not a Windows user... are the numbers you are reporting a percentage of a single core, or a percentage of all 16 cores (e.g. 100% = full usage of all 16 cores)?

Sorry about not being clearer. 100% = full usage of all 16 cores.
When I had 30-40%, I was basically using 3-4 cores.
But since yesterday I'm down to 1-10% CPU usage, almost nothing. It does seem to still be progressing, though, because the selfConsist loop has completed a few more rounds.

@trawler-crawler
Author

The error-learning step has now completed, and I thought it would be good to also show you the resulting error plots. Does this look okay so far?

[Image: plot_errors — learnErrors error model plots]

@benjjneb
Owner

No, I am not loving those error model plots. Arguably the key feature to look for is that error rates are decreasing with increasing quality scores, and that does not seem to be the case here. What does a plotQualityProfile plot look like for an example sample?
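In case it's useful, a minimal sketch of the two plotting calls being discussed, assuming the filts2 and err2 objects from the commands above:

# Quality profile for one example filtered file, and the fitted error model
# with the nominal Q-score expectation overlaid for comparison.
plotQualityProfile(filts2[1])
plotErrors(err2, nominalQ=TRUE)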

Dereplicating sequence entries in Fastq file: C:/Users/1.fastq.gz
Encountered 56052 unique sequences from 63771 total sequences read.

You are arguably on the very edge of when you should even consider using DADA2 based on the low rate of duplication amongst your input data. Can you say a little more about what amplicon you are sequencing, and from what environment? Also what instrument/chemistry are you using?

But since yesterday I'm down to 1-10% CPU usage, almost nothing. But it seems that it still progressed because the selfConsist looped a few more times.

I still don't understand why the utilization is so low, but I guess it got there in the end.

@trawler-crawler
Author

[Image: Rplot05 — plotQualityProfile output]

Here is a plotQualityProfile.

It's PacBio HiFi full-length 16S data sequenced on the Revio system, and these should be the primers:

primer_fwd = "AGRGTTYGATYMTGGCTCAG"
primer_rev = "AAGTCGTAACAAGGTARCY"
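For context, the standard dada2 PacBio workflow removes and orients these primers with removePrimers() before filtering. A minimal sketch, where fns and nops are placeholder vectors of input and output file paths; the reverse primer string above is already in the reverse-complemented orientation that the official PacBio tutorial passes as primer.rev:

# removePrimers() trims both primers and orients all reads to the forward strand.
removePrimers(fns, nops,
              primer.fwd=primer_fwd,
              primer.rev=primer_rev,  # already the reverse complement of 1492R, as in the tutorial
              orient=TRUE)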

Could one of the issues be the lack of replicates? I got this data from a colleague who collected 4 soil samples from different environments and got them sequenced, without replication or pseudo-replication.
