
FilterAndTrim discards huge amount of reads with trunclen option #2006

Open
sofiehn opened this issue Aug 23, 2024 · 9 comments

sofiehn commented Aug 23, 2024

I'm experiencing something odd when I use filterAndTrim on some samples from a recent run. I haven't seen this before; I've put two examples in here.

There are >100,000 reads for almost all samples and the quality looks good, but when I add a truncation length to filterAndTrim only about 1/5 of the reads are left. If I leave out truncLen, almost all reads pass, and changing the quality settings makes little difference, since the quality is good.

I've checked the read lengths of the two samples:

51_T: R1 305,819 reads at 281 bp, R2 288,198 reads at 279 bp
38_G: R1 197,796 reads at 281 bp, R2 185,732 reads at 279 bp
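
A minimal sketch of one way to produce such a length tabulation (assuming the ShortRead Bioconductor package is installed, using the 51_T forward-read file as an example):

library(ShortRead)
fq <- readFastq("51_T_rmprimer_R1.fastq.gz")  # forward reads for sample 51_T
table(width(sread(fq)))                       # number of reads at each length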

But even when I use the actual length of most of the reads (281 for R1 and 279 for R2) as the truncLen, most of the reads are still discarded, while almost all reads pass without the truncLen option. Any idea why this is? Code snippets are below and quality plots are attached.

out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
                     maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE, truncLen=c(281, 279),
                     compress=TRUE, multithread=FALSE, matchIDs = TRUE)

out
                          reads.in reads.out
51_T_rmprimer_R1.fastq.gz   320012     42715
38_G_rmprimer_R1.fastq.gz   206572     30676

out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
                     maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE, truncLen=c(280, 260),
                     compress=TRUE, multithread=FALSE, matchIDs = TRUE)

out
                          reads.in reads.out
51_T_rmprimer_R1.fastq.gz   320012     43634
38_G_rmprimer_R1.fastq.gz   206572     31330

out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
                     maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=FALSE, matchIDs = TRUE)

out
                          reads.in reads.out
51_T_rmprimer_R1.fastq.gz   320012    305001
38_G_rmprimer_R1.fastq.gz   206572    197720

Read2_quality.pdf
Read1_quality.pdf

benjjneb (Owner) commented

truncLen does two things: it truncates reads to the defined length, and it discards all reads shorter than that length. There is probably a small amount of length variation in your reads (which apparently have already been trimmed/truncated in some way). If you reduce your truncation lengths slightly, almost all the reads should pass, based on the sharp cumulative distribution of your read lengths (the red line in your quality profile plots).
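
As a minimal check of that (a sketch assuming the ShortRead package and using the 51_T forward reads as an example), you can compute what fraction of reads actually reaches a candidate truncation length:

library(ShortRead)
fq <- readFastq("51_T_rmprimer_R1.fastq.gz")
lens <- width(sread(fq))
mean(lens >= 281)   # fraction of reads long enough to survive truncLen = 281
mean(lens >= 270)   # fraction surviving a slightly shorter truncation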


sofiehn commented Aug 26, 2024

Yes, I would have thought so too, but that's not the case; I've tried many different truncation lengths without any luck.
Here are two more. I think I also went down to 210 once with about the same output; only without truncation do all reads get through.

out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
                     maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE, truncLen=c(275, 270),
                     compress=TRUE, multithread=FALSE, matchIDs = TRUE)
out
                          reads.in reads.out
51_T_rmprimer_R1.fastq.gz   320012     43593
38_G_rmprimer_R1.fastq.gz   206572     31292

out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
                     maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE, truncLen=c(270, 265),
                     compress=TRUE, multithread=FALSE, matchIDs = TRUE)
out
                          reads.in reads.out
51_T_rmprimer_R1.fastq.gz   320012     43625
38_G_rmprimer_R1.fastq.gz   206572     31321

benjjneb (Owner) commented

That's mysterious. I'm not sure.

Could you create a test sample that shows this behavior and share it with me?

It could be one selected sample, or even better a subsetted sample (e.g. just the first 5000 reads of each) that demonstrates this behavior.
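
One way to make such a subset (a sketch assuming the ShortRead package; the input name matches the files above and the output name is just a suggestion):

library(ShortRead)
fs <- FastqStreamer("38_G_rmprimer_R1.fastq.gz", n = 5000)  # stream records in chunks of 5000
sub <- yield(fs)                                            # the first 5000 reads
close(fs)
writeFastq(sub, "38_G_sub_R1_001.fastq.gz")                 # write the subsetted sample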


sofiehn commented Aug 27, 2024

It really is! I've extracted 5000 reads for each and attached them. With truncLen 275, 270 only around 700 reads get through, but with 0, 0 about 4,700 get through.

38_G_sub_R1_001.fastq.gz
38_G_sub_R2_001.fastq.gz
51_T_sub_R1_001.fastq.gz
51_T_sub_R2_001.fastq.gz


benjjneb commented Sep 2, 2024

OK, so I figured out what is going on here. Using just your first example file, I ran the following (filtering with no truncLen), which resulted in 97% of the reads passing the filter, but...

fnF <- "38_G_sub_R1_001.fastq"
filtF <- file.path("filt", basename(fnF))
filterAndTrim(fnF, filtF, maxEE=2, verbose=TRUE)
plotQualityProfile(filtF)

[quality profile plot of the filtered output: filt_tL0]

The quality profile shows there is extensive length variation in the output, as shown by the red line (the fraction of reads that reach that nucleotide position). This is why many reads fail to pass when truncLen is added: they don't reach the truncation length. What is going on here? There are two relevant filters that are turned on by default in filterAndTrim: truncQ=2 and maxN=0. The length variation we are seeing in the filtered output is caused by truncQ=2; if you turn it off, that length variation goes away, BUT now most reads are lost to the filter because...

filterAndTrim(fnF, filtF, maxEE=2, truncQ=0, verbose=TRUE)
plotQualityProfile(filtF)

[quality profile plot of the filtered output with truncQ=0: filt_tL0_tQ0]

the quality score 2 positions are mostly being assigned N, so when we stop truncating at the quality score 2 positions, we now have Ns in the sequences. You can check this with filterAndTrim(..., truncQ=0, maxN=Inf), which causes all reads to pass.
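
Spelled out with the file handles defined above, that check would look something like this (a sketch, not a recommended setting):

filterAndTrim(fnF, filtF, maxEE=2, truncQ=0, maxN=Inf, verbose=TRUE)  # no quality truncation, Ns allowed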

In algorithmic terms, the following "filters" are being applied in this order: truncQ first, truncLen second, maxN third. Because your data has Ns that are almost always associated with a quality score of 2, the default filters allow reads to pass with ragged lengths, truncated where the first Q=2/N was seen. But when truncLen is enforced, it throws away all those reads that were truncated at an early Q=2/N position, which unfortunately is most of your data.

Unsatisfyingly, I don't have an easy workaround for you other than accepting this large read loss. DADA2 does not handle Ns in the sequences, and most of your data has Ns in the sequences. That isn't something we usually see, so it may be worth checking with the sequencing provider, if that is an option.
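
A quick way to quantify how widespread the Ns are (a sketch assuming the ShortRead package, which also loads Biostrings, applied to the subsetted forward reads):

library(ShortRead)
fq <- readFastq("38_G_sub_R1_001.fastq.gz")
n_per_read <- alphabetFrequency(sread(fq))[, "N"]  # count of N bases in each read
mean(n_per_read > 0)                               # fraction of reads containing at least one N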


sofiehn commented Sep 3, 2024

So it is actually due to the poor quality of the reads, which would also explain why this never happened before. I will talk with our sequencing center. Thanks for your help!


sofiehn commented Sep 3, 2024

And is there a more automatic way of finding the best truncLen, or of getting the best cutoff position from the plot expressed in bp somehow?


benjjneb commented Sep 5, 2024

Michael Weinstein, when he was at Zymo Research, created a tool for this purpose (Figaro: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7424690/).

In general, we think doing it by hand works well. Don't worry about optimizing perfectly. The goal is just to cut off the "quality crash" parts of reads (often seen at the end of Illumina reverse reads), with the requirement that the reads are still long enough after truncation to be successfully merged (the sum of the forward + reverse truncation lengths should be the length of the sequenced amplicon plus 20 nt or more, so that they overlap enough to be merged).
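
As a small worked example of that rule (the amplicon length here is an assumed value, not one taken from this dataset):

amplicon_len <- 450                        # assumed post-primer-removal amplicon length
truncF <- 275; truncR <- 270               # candidate truncation lengths
overlap <- truncF + truncR - amplicon_len  # nucleotides of overlap left for merging
overlap >= 20                              # TRUE means the truncated pairs should still merge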


sofiehn commented Sep 6, 2024

ok, thanks :)
