Mixed Orientations and Large Data Processing #938
I think you've laid out the options rather nicely. If it were me, I would go with option (3), i.e. run dada2 twice, and then merge the tables together at the end (after reverse-complementing the sequences in one of the tables).
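A minimal sketch of option (3). The file names `seqtab_fwd.rds` and `seqtab_rev.rds` are hypothetical placeholders for the two per-orientation sequence tables produced by running the dada2 workflow separately on each orientation subset; `dada2::rc()` reverse-complements the ASV sequences before the tables are merged:

```r
library(dada2)

# Hypothetical paths: one sequence table per read orientation.
st.fwd <- readRDS("seqtab_fwd.rds")
st.rev <- readRDS("seqtab_rev.rds")

# Put both tables in the same orientation by reverse-complementing
# the ASV sequences (the column names) of one of them.
colnames(st.rev) <- dada2::rc(colnames(st.rev))

# Merge into one table; repeats = "sum" adds the counts when the same
# sample/ASV combination occurs in both tables.
st.all <- mergeSequenceTables(st.fwd, st.rev, repeats = "sum")
```

This keeps each orientation's error model internally consistent, at the cost of running the denoising step twice.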
Yes, our general recommendation is to learn the error rates separately for each run (or lane), since each run has its own error profile.
In theory, one could get more accurate error models with more data. In practice, the gain is typically negligible, at least on normal Illumina data.
Do you care about high sensitivity to rare per-sample variants? Then I recommend pseudo-pooling. (Note that you will also pick up other rare things, like rare contaminants, as well.)
It shouldn't affect anything much; the 16S/18S/12S sequences will naturally be denoised into different ASVs, which can then be removed afterwards. There is a price to be paid in terms of compute time, but that's all, I think.
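For reference, the pooling behaviour discussed above is selected via the `pool` argument of `dada()`. A sketch, where `filts` and `err` are assumed to come from earlier `filterAndTrim()` and `learnErrors()` steps:

```r
library(dada2)

# pool = FALSE    : independent per-sample inference (fastest, least sensitive)
# pool = "pseudo" : pseudo-pooling, better sensitivity to rare per-sample variants
# pool = TRUE     : full pooling, most sensitive and most computationally expensive
dd <- dada(filts, err = err, pool = "pseudo", multithread = TRUE)
```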
Thank you for the quick feedback!
Sounds good. Does dada2 have an inbuilt routine that deals with redundancy? I see the
Ok, I will learn errors based on the lane.
I think it could be useful to use a slightly larger number, so that we define the error model on more than one sample. dada2 seems to multithread up to 16 cores, so it should go relatively quickly.
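That idea can be sketched as follows; `filts.lane1` is a hypothetical vector of filtered fastq files for one lane, and `2.5e8` is an illustrative value larger than the `nbases = 1e8` default, chosen so the error model is fit on reads from more than one sample:

```r
library(dada2)

# Learn the error model per lane, pulling in ~2.5e8 bases of data
# (illustrative value; the default is 1e8).
err <- learnErrors(filts.lane1, nbases = 2.5e8, multithread = 16)
```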
We have two blank samples for each lane, so contaminants are not an issue. I favour pooling or pseudo-pooling. We will have to benchmark the computational burden...
I agree. 18S will be removed by dada2, as the reads don't overlap. I will give it a try and let you know if I bump into any other issues. Hans
Great, let us know if you hit any other issues.
No, but that can be accomplished using base R (albeit a bit hackily, since base R isn't great at string manipulation):
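The code block in this reply did not survive the export. A base-R reverse complement along the lines described (a sketch, no Biostrings required; the function name `rc` is our own) could look like:

```r
# Reverse-complement a vector of DNA strings using only base R:
# reverse the characters, then complement with chartr().
rc <- function(seqs) {
  vapply(seqs, function(s) {
    rev.s <- paste(rev(strsplit(s, "")[[1]]), collapse = "")
    chartr("ACGTacgt", "TGCAtgca", rev.s)
  }, character(1), USE.NAMES = FALSE)
}

# Note: ambiguity codes other than ACGT (e.g. Y, M, R in the primers above)
# are left untouched by this simple chartr() mapping.
rc("AACG")  # "CGTT"
```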
It is the reverse complement after all... Anyhow, splitting with cutadapt seems to work, and I'm now looking at first results for an individual lane. Read numbers were subsampled to 20k per sample, so that I have a good test dataset and can check whether the pipeline works. I find the learnErrors plot rather peculiar: the observed error rate seems to be off (for the better) around Phred scores of 30. Other lanes look very similar. Can you explain? Should I be worried?
Does this data have binned quality scores?
I'm not sure what you mean. I ran
Yes, the
These plots strongly suggest that the data you have uses binned quality scores, i.e. only a subset of quality scores was used (perhaps only increments of 7, or something like that) in order to save space upon compression. The default fitting of the error model in dada2 can struggle in the presence of binned quality scores. Our suggestion at this point in time is to enforce monotonicity on the matrix of error rates that is learned.
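"Enforce monotonicity" can be expressed as a sketch on a plain matrix: error rates should not increase as the quality score increases, so each row (one transition type, columns ordered by increasing Q) can be clamped with a running minimum. The function name `enforce_monotone` is our own; the dada2-specific wiring via a custom error-estimation function is discussed further down this thread.

```r
# Clamp each row of an error-rate matrix to be non-increasing in Q.
enforce_monotone <- function(err) {
  t(apply(err, 1, cummin))
}

# Toy row with a bump at the third quality bin:
m <- matrix(c(0.10, 0.02, 0.05, 0.01), nrow = 1)
enforce_monotone(m)  # 0.10 0.02 0.02 0.01
```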
Hi,
This results in a much more linear fit. Would that be a good way to solve the problem?
Crudely, those parameters and results look reasonable, and I would not be uncomfortable using them. I am being cagey because we are still waiting on good binned-Q data (probably from NovaSeq), with mock communities of known composition, to develop official recommendations. But unofficially, that does seem reasonable.
Hi Ben,
Thanks for the agreement. We can safely proceed :). Actually, we have a mostly functional pipeline set up and just ran all subsampled (to 20k reads a sample) samples through it to test the performance and check for any hiccups. One thing that feels really weird is the way dada2 uses multiple threads. If I submit 1 job with 64 cores to our cluster, both
I checked by setting OMP threads manually, but that doesn't change anything. Do you have any explanation for this? Example of the command that uses multithreading:
Thanks a lot,
Hi Ben,
Just another question. I ran a full NextSeq run without pooling through dada (1.14). I get an
Is this something to worry about? Will the counter start at 0 again after an overflow? Considering the dada2 call, I would suggest that the warning is created during the

Here is the call:

```r
err.r1 <- readRDS(err.rds.r1)
dd.r1 <- dada(s.f.r1, err = err.r1, pool = FALSE, multithread = threads)
dd.r1[[1]]
seqtab.r1 <- makeSequenceTable(dd.r1)
saveRDS(seqtab.r1, file = outfile.r1)
```
Happily, no, this isn't something you need to worry about. It is happening during the
We should handle this more gracefully, though. As datasets have gotten larger and larger, we are hitting spots where R's signed 32-bit integers are not ideal.
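The limit in question is easy to demonstrate in plain R: integer arithmetic past `.Machine$integer.max` produces `NA` with an "NAs produced by integer overflow" warning, rather than silently wrapping around.

```r
# R's integers are signed 32-bit.
x <- .Machine$integer.max   # 2147483647
x + 1L                      # NA, with an integer-overflow warning

# Doubles avoid the overflow (exact up to 2^53):
as.numeric(x) + 1           # 2147483648
```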
I don't have an entirely satisfactory response, but I can tell you that OMP thread values are not an issue here. The multithreading for all dada functions (except
One thing I would try is to see if specifying the number of threads, i.e.
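Passing an explicit thread count means something along these lines (a sketch reusing the variable names from the call pasted above):

```r
library(dada2)

# Explicit thread count instead of multithread = TRUE
# (which auto-detects the number of available cores):
dd.r1 <- dada(s.f.r1, err = err.r1, pool = FALSE, multithread = 64)
```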
Hi Ben,
Sounds reasonable. Good that the overflow doesn't break the ASV table :). Regarding multithreading... I tested with
Closing, but please re-open as needed.
Hi @hjruscheweyh, I'm encountering the same type of challenges you're describing above, especially those regarding the funky results from learnErrors(). It looks like the sequencing facility also sent us results with binned quality scores, and I've been scratching my head over how to address/correct for that. If I may, I'd like to ask you a question about the solution you used. You state above that you used:
I might be guilty of being dense here, but for the sake of clarity: when carrying this out, is it as simple as just passing
Thanks, Jake
Hi Jacob, I think @GuillemSalazar is far more qualified to answer this question ;). Best,
Thank you @hjruscheweyh. @GuillemSalazar, would you be able to chime in on my question above?
Hi Jacob,
and then used it inside the
Hope this helps!
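The code blocks in this reply did not survive the export. The usual shape of this workaround is sketched below; `loessErrfun.mod` is a placeholder name for a user-defined, monotonicity-enforcing variant of dada2's `loessErrfun` (its definition is not shown here):

```r
library(dada2)

# loessErrfun.mod: user-defined variant of dada2::loessErrfun that
# enforces monotonicity on the fitted error rates (assumed defined above).
err <- learnErrors(filts,
                   errorEstimationFunction = loessErrfun.mod,
                   multithread = TRUE)

# Inspect the fit against the nominal Q-score expectation.
plotErrors(err, nominalQ = TRUE)
```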
I'm sorry, I thought I had responded to your reply. We were hoping that the sequencing facility would be able to recover the original (non-binned) versions of the quality scores, so I had not tried your method yet. It looks like they aren't able to give us the original data we are looking for, so I'll be taking your approach. Thank you for your response, and for being explicit about how you addressed the issue. Jake
Hi Ben,

I have recently started using dada2 and want to incorporate your tool into our 16S analysis pipeline. Don't be shocked, I have a couple of questions :).

We get our data from HiSeq runs where the protocol doesn't guarantee the standard orientation of sequencing reads. We're using the standard primers for V4/V5 extraction, with Forward-515F=`GTGYCAGCMGCCGCGGTAA` and Reverse-926R=`CCGYCAATTYMTTTRAGTTT`. You would expect that all R1 reads start with the Forward-515F primer and all R2 reads start with the Reverse-926R primer. But they don't: ~50% of the R1 reads start with the Forward-515F primer and the other 50% start with the Reverse-926R primer. Using `cutadapt` I can remove the primers easily, but get 4 files out: the `R*_1_noprimer.fastq` files catch the `515F----926R` orientated read pairs and the `R*_2_noprimer.fastq` files catch the `926R----515F` read pairs.

The question is how to proceed. The options are: … (`515F----926R` and `926R----515F`). What do you think the right way would be, and how could it be implemented?

I work on a relatively large 16S sequencing project where multiple HiSeq runs generated ~5000 samples and ~1TB of gzipped fastq files. I would like to process all of this data with dada2 and produce one big table that represents all samples. How do you recommend running dada2, assuming that you want to get the best possible result within a maximum of 1 month of compute time on a large server (100 cores, 1TB RAM)? Your big data tutorial for paired-end reads (https://benjjneb.github.io/dada2/bigdata_paired.html) already has some recommendations.

So my questions relate to how to run the `dada()` and `learnErrors()` functions: … `nbases=1e8` for learning? … the `mergePairs` step. Is there any systematic problem that would justify removing these sequences before analysis?

Thanks a lot for your feedback,
Hans