FilterAndTrim discards huge amount of reads with trunclen option #2006
Comments
Yes, I would also think so, but that's not the case; I've tried many different truncation lengths without any luck.

out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
That's mysterious. I'm not sure. Could you create a test sample that shows this behavior that you can share with me? It could be one selected sample, or even better a subsetted sample (e.g. just the first 5000 reads of each) that demonstrates this behavior.
It really is! I've extracted 5000 reads from each and attached them. With truncLen 275, 270 only around 700 reads get through, but with 0, 0 ~4700 get through.

38_G_sub_R1_001.fastq.gz
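(One way to make such a subsetted test file, assuming the ShortRead package; this is just a sketch, not necessarily how it was done here, and the input file name is a placeholder:)

```r
library(ShortRead)

# Write a test file containing just the first 5000 records of a fastq.gz.
# Input file name is a placeholder; output name matches the attachment above.
fs <- FastqStreamer("38_G_R1_001.fastq.gz", n = 5000)
sub <- yield(fs)
close(fs)
writeFastq(sub, "38_G_sub_R1_001.fastq.gz", compress = TRUE)
```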
Ok, so I figured out what is going on here. Using just your first example file I ran the following (filtering with no truncation) and then plotted the quality profile of the output.
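(A minimal sketch of that kind of diagnostic, in case it's useful to follow along; the file and output names are placeholders, not the exact command from this comment:)

```r
library(dada2)

# Filter the attached subsetted forward reads with no truncation length
# (truncLen = 0) so that no read is discarded for being too short,
# then plot the quality profile of the filtered output.
fnF   <- "38_G_sub_R1_001.fastq.gz"
filtF <- "filtered/38_G_sub_R1_001.filt.fastq.gz"

out <- filterAndTrim(fnF, filtF, truncLen = 0, truncQ = 2, maxN = 0,
                     maxEE = Inf, verbose = TRUE)
out

# The red line in this plot is the fraction of reads that extend to each
# position, which reveals the length variation after filtering.
plotQualityProfile(filtF)
```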
The quality profile shows there is extensive length variation in the output, as shown by the red line (the fraction of reads that reach that nucleotide position). This is why so many reads are lost when truncLen is set: any read that ends up shorter than truncLen is discarded by filterAndTrim.
The quality score 2 positions are mostly being assigned where the N base calls are. In algorithmic terms, the following "filters" are being applied, in order: reads are first truncated at the first quality score of 2 or less (the default truncQ=2), and reads that are then shorter than truncLen are discarded. Unsatisfyingly, I don't have an easy workaround for you outside of accepting this large read loss. DADA2 does not handle Ns in the sequences, and most of your data has Ns in the sequences. That isn't something we usually see, so it may be worth checking with the sequencing provider if that is an option.
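(A small sketch of how one might confirm this on the subsetted file, assuming the ShortRead package is available; the file name is the attachment from above:)

```r
library(ShortRead)

# Read the attached subsetted forward reads and ask what fraction of them
# contain at least one N call (which DADA2 cannot handle).
fq <- readFastq("38_G_sub_R1_001.fastq.gz")
n_per_read <- alphabetFrequency(sread(fq))[, "N"]

mean(n_per_read > 0)   # fraction of reads containing an N
summary(n_per_read)    # how many Ns per read

# Where does the first quality score <= 2 occur in each read? Reads with an
# early value here are truncated short by the default truncQ = 2 and then
# fail the truncLen length requirement.
qmat <- as(quality(fq), "matrix")
first_q2 <- apply(qmat, 1, function(q) which(q <= 2)[1])
summary(first_q2)
```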
So it is actually due to the bad quality of the reads, which will also be the reason why this never happened before. I will talk with our sequencing center. Thanks for your help!
And is there a more automatic way of finding the best truncLen, or of reading off the best cutoff point from the quality plot in bp somehow?
Michael Weinstein, when he was at Zymo Research, created a tool for this purpose (Figaro: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7424690/). In general we think doing it by hand works well. Don't worry about perfectly optimizing. The goal is just to cut off the "quality crash" parts of the reads (often seen at the end of Illumina reverse reads), with the requirement that the reads are still long enough after truncation to be successfully merged: the sum of the forward and reverse truncation lengths should be the length of the sequenced amplicon + 20 nts or more, so that they overlap enough to be merged.
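(The arithmetic behind that rule of thumb, using the truncation lengths tried above and a made-up amplicon length as an example:)

```r
# Quick check of whether a candidate truncLen pair leaves enough overlap
# for merging. The amplicon length here is a made-up example; plug in the
# actual length of your sequenced amplicon.
amplicon_len <- 440
truncLen_fwd <- 275
truncLen_rev <- 270

overlap <- truncLen_fwd + truncLen_rev - amplicon_len
overlap                                              # nucleotides of overlap after truncation
truncLen_fwd + truncLen_rev >= amplicon_len + 20     # TRUE if there is enough overlap to merge
```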
ok, thanks :) |
I'm experiencing a weird thing when I use filterAndTrim on some samples from a recent run. I haven't experienced it before; I've put two examples in here.
There are >100,000 reads for almost all samples and the quality seems good, but when I add a truncation length to filterAndTrim only 1/5 or so are left. However, if I leave out truncLen almost all reads pass, and it makes little difference if I change the quality settings, as the quality is good.
I've checked the read lengths of the two samples:
51_T: R1 305819 reads at 281 bp, R2 288198 reads at 279 bp
38_G: R1 197796 reads at 281 bp, R2 185732 reads at 279 bp
But even when I put in the actual length of most of the reads (281 for R1 and 279 for R2) as the truncLen, 1/5 of the reads is still discarded, while almost all reads pass without the truncLen option. Any idea why this is? Code snippets below and quality plots attached.
out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
out <- filterAndTrim(fwd=r1, filt=filts, rev=r2, filt.rev=filts,
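(A sketch of the two calls being compared, with and without truncLen; all file paths and parameter values other than truncLen are placeholders, not the original settings:)

```r
library(dada2)

# Placeholder paths standing in for the objects in the snippets above
# (r1, r2, filts); these are not the actual file names from the run.
r1        <- "51_T_R1_001.fastq.gz"
r2        <- "51_T_R2_001.fastq.gz"
filts     <- "filtered/51_T_R1_filt.fastq.gz"
filts.rev <- "filtered/51_T_R2_filt.fastq.gz"

# Same filtering parameters, with and without a truncation length. Values
# other than truncLen are illustrative assumptions.
out_trunc <- filterAndTrim(fwd = r1, filt = filts, rev = r2, filt.rev = filts.rev,
                           truncLen = c(281, 279), maxN = 0, maxEE = c(2, 2),
                           truncQ = 2, multithread = TRUE)

out_notrunc <- filterAndTrim(fwd = r1, filt = filts, rev = r2, filt.rev = filts.rev,
                             maxN = 0, maxEE = c(2, 2),
                             truncQ = 2, multithread = TRUE)

out_trunc     # many fewer reads.out with truncLen set
out_notrunc   # nearly all reads pass without truncLen
```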
Read2_quality.pdf
Read1_quality.pdf