
Mixed Orientations and Large Data Processing #938

Closed
hjruscheweyh opened this issue Feb 7, 2020 · 22 comments

@hjruscheweyh

Hi Ben

I have recently started using dada2 and want to incorporate your tool into our 16S analysis pipeline. Don't be shocked, I have a couple of questions :).

  1. Mixed orientations of R1/R2 reads

We get our data from Hiseq runs where the protocol doesn't guarantee the standard orientation of sequencing reads.

We're using the standard primers for V4/V5 extraction, Forward-515F=GTGYCAGCMGCCGCGGTAA and Reverse-926R=CCGYCAATTYMTTTRAGTTT. You would expect all R1 reads to start with the Forward-515F primer and all R2 reads to start with the Reverse-926R primer. But they don't: ~50% of the R1 reads start with the Forward-515F primer and the other 50% start with the Reverse-926R primer. Using cutadapt I can remove the primers easily but get 4 files out:

cutadapt --discard-untrimmed -g GTGYCAGCMGCCGCGGTAA -G CCGYCAATTYMTTTRAGTTT  -o R1_1_noprimer.fastq -p R2_1_noprimer.fastq R1.fastq R2.fastq -j 8 --pair-adapters --max-n 0 --minimum-length 75

cutadapt --discard-untrimmed -G GTGYCAGCMGCCGCGGTAA -g CCGYCAATTYMTTTRAGTTT  -o R1_2_noprimer.fastq -p R2_2_noprimer.fastq R1.fastq R2.fastq -j 8 --pair-adapters --max-n 0 --minimum-length 75

The R*_1_noprimer.fastq files catch the 515F----926R orientated read pairs and R*_2_noprimer.fastq files catch the 926R----515F read pairs.

The question is how to proceed. The options are:

  • Throw all R1 sequences into one file, do the same for R2, and run dada2 on that.
    • The advantage is that the way dada2 estimates the errors won't be biased by mixing R1 and R2 sequences together
    • The disadvantage is that abundances of ASVs are halved and that rare ASVs might be lost on the way.
    • Another possible disadvantage is that merging is difficult and that half of the ASVs need to be reversed after merging
  • Define the R1 file as all sequences that start with the 515F primer and R2 as the file with all sequences starting with 926R.
    • The advantage is that we won't run into trouble with merging and that dada2 can use proper abundances for ASV calculation
    • The disadvantage is that the error model is now calculated from a mixed R1 file and a mixed R2 file.
  • Run dada2 twice for both read sets (515F----926R and 926R----515F)
    • It's more complex to design
    • It's more robust when merging ASV tables, as you know that all sequences in one ASV table need to be reversed

What do you think the right way would be and how could it be implemented?

  2. Hiseq, multiple runs, multiple lanes

I work on a relatively large 16S sequencing project where multiple Hiseq runs generated ~5000 samples and ~1TB of gzipped fastq files. I would like to process all of this data with dada2 and produce one big table that represents all samples. How do you recommend running dada2, assuming that you want to get the best possible result within a maximum of 1 month of compute time on a large server (100 cores, 1TB RAM)? Your big data tutorial for paired end reads (https://benjjneb.github.io/dada2/bigdata_paired.html) already has some recommendations.

So my questions relate to how to run the dada() and learnErrors() functions:

  • I have information on sample, run, and lane for each sequencing file. I guess I learn errors based on all sequencing files created from a lane, right?
  • Is there any advantage of using more than nbases=1e8 for learning?
  • Should I run sample inference naively, pooled, or pseudo-pooled, assuming I'm not limited by runtime/cpu resources? In the big data example you use sample-by-sample inference. Is there a specific reason besides runtime?
  • My data contains a lot of 18S and 12S sequences that should be removed from downstream analysis. 18S for example won't merge and therefore will drop out in the mergePairs step. Is there any systematic problem that would justify removing these sequences before analysis?

Thanks a lot for your feedback,

Hans

@benjjneb
Owner

benjjneb commented Feb 7, 2020

The question is how to proceed. The options are:

I think you've rather nicely laid out the options. If it was me, I would go with option (3), i.e. run dada2 twice, and then merge the tables together at the end (after reverse-complementing 926R----515F). I think that is the best balance between avoiding mixing the F/R error models, and using all the data.
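
Not from the thread, but a minimal sketch of what that merge could look like, assuming seqtab.fr and seqtab.rf are the two sequence tables produced by the separate dada2 runs (515F----926R and 926R----515F). It uses Biostrings (a dada2 dependency) to reverse-complement one table before merging:

library(dada2)
library(Biostrings)

# Put the 926R----515F table into the same orientation as the 515F----926R table.
colnames(seqtab.rf) <- as.character(reverseComplement(DNAStringSet(colnames(seqtab.rf))))

# Merge the two tables; repeats = "sum" adds counts for samples present in both.
seqtab <- mergeSequenceTables(seqtab.fr, seqtab.rf, repeats = "sum")

# Optionally collapse ASVs that are identical up to length/shifts.
seqtab <- collapseNoMismatch(seqtab)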

I have information on sample, run, and lane for each sequencing file. I guess I learn errors based on all sequencing files created from a lane, right?

Yes, our general recommendation is to run learnErrors on each "processing batch", so that the assumption that the error process is the same across the samples sharing an error model is valid.
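
As an illustration (not from the thread), a minimal sketch of per-lane error learning, assuming filts.by.lane is a named list mapping each lane to the filtered fastq paths for that lane:

library(dada2)

# One error model per lane / processing batch.
errs.by.lane <- lapply(filts.by.lane, function(fns) {
  learnErrors(fns, nbases = 1e8, multithread = TRUE)
})

# Denoise each lane's samples with that lane's error model.
dds.by.lane <- mapply(function(fns, err) {
  dada(fns, err = err, multithread = TRUE)
}, filts.by.lane, errs.by.lane, SIMPLIFY = FALSE)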

Is there any advantage of using more than nbases=1e8 for learning?

In theory, one could get more accurate error models with more data. In practice, the gain is typically negligible, at least on normal Illumina data.

Should I run sample inference naively, pooled, or pseudo-pooled, assuming I'm not limited by runtime/cpu resources? In the big data example you use sample-by-sample inference. Is there a specific reason besides runtime?

Do you care about high sensitivity to rare per-sample variants? Then I recommend pseudo-pooling. (Note that you will also pick up other rare things, like rare contaminants, as well.)
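
For reference, pseudo-pooling is selected via the pool argument; a minimal sketch (the filts and err objects here are placeholders):

dd <- dada(filts, err = err, pool = "pseudo", multithread = TRUE)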

My data contains a lot of 18S and 12S sequences that should be removed from downstream analysis. 18S for example won't merge and therefore will drop out in the mergePairs step. Is there any systematic problem that would justify removing these sequences before analysis?

It shouldn't affect anything much; the 16S/18S/12S sequences will naturally be denoised into different ASVs, which can then be removed afterwards. There is a price to be paid in terms of compute time, but that's all, I think.
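
One way to do that removal afterwards (my own sketch, not from the thread; the SILVA training-set filename is a placeholder) is to assign taxonomy and keep only ASVs classified to the expected kingdoms:

library(dada2)

# Assign taxonomy to the merged sequence table (reference file path is a placeholder).
tax <- assignTaxonomy(seqtab, "silva_nr_v132_train_set.fa.gz", multithread = TRUE)

# Keep only ASVs classified as Bacteria or Archaea; drops 18S/12S (eukaryotic) ASVs.
keep <- !is.na(tax[, "Kingdom"]) & tax[, "Kingdom"] %in% c("Bacteria", "Archaea")
seqtab.16s <- seqtab[, keep, drop = FALSE]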

@hjruscheweyh
Author

Thank you for the quick feedback!

The question is how to proceed. The options are:

I think you've rather nicely laid out the options. If it was me, I would go with option (3), i.e. run dada2 twice, and then merge the tables together at the end (after reverse-complementing 926R----515F). I think that is the best balance between avoiding mixing the F/R error models, and using all the data.

Sounds good. Does dada2 have a built-in routine that deals with the redundancy? I see the collapseNoMismatch function, but it doesn't reverse the sequence (we need to reverse the sequence, not reverse-complement it).

I have information on sample, run, and lane for each sequencing file. I guess I learn errors based on all sequencing files created from a lane, right?

Yes, our general recommendation is to run learnErrors on each "processing batch", so that the assumption that the error process is the same across the samples sharing an error model is valid.

Ok, I will learn errors based on the lane.

Is there any advantage of using more than nbases=1e8 for learning?

In theory, one could get more accurate error models with more data. In practice, the gain is typically negligible, at least on normal Illumina data.

I think it could be useful to use a slightly larger number so that we define the error model on more than 1 sample. dada2 seems to multithread up to 16 cores, so it should go relatively quickly.

Should I run sample inference naively, pooled, or pseudo-pooled, assuming I'm not limited by runtime/cpu resources? In the big data example you use sample-by-sample inference. Is there a specific reason besides runtime?

Do you care about high sensitivity to rare per-sample variants? Then I recommend pseudo-pooling. (Note that you will also pick up other rare things, like rare contaminants, as well.)

We have two blank samples for each lane, so contaminants are not an issue. I favour pooling or pseudo-pooling. We will have to benchmark the computational burden...

My data contains a lot of 18S and 12S sequences that should be removed from downstream analysis. 18S for example won't merge and therefore will drop out in the mergePairs step. Is there any systematic problem that would justify removing these sequences before analysis?

It shouldn't affect anything much; the 16S/18S/12S sequences will naturally be denoised into different ASVs, which can then be removed afterwards. There is a price to be paid in terms of compute time, but that's all, I think.

I agree. 18S will be removed by dada2 as reads don't overlap.

I will give it a try and let you know if I bump into any other issues.

Hans

@benjjneb
Owner

Great, let us know if you hit any other issues.

Does dada2 have a built-in routine that deals with the redundancy? I see the collapseNoMismatch function, but it doesn't reverse the sequence (we need to reverse the sequence, not reverse-complement it).

No, but that can be accomplished using base R (albeit a bit hackily since base R isn't great with string manipulation):

dna.rev <-  intToUtf8(rev(utf8ToInt(dna)))

@hjruscheweyh
Author

hjruscheweyh commented Feb 14, 2020

It is the reverse complement after all... Anyhow, splitting with cutadapt seems to work and I'm now looking at the first results for an individual lane. Read numbers were subsampled to 20k per sample so that I have a good test dataset and can see whether the pipeline works. I find the learnErrors plot relatively peculiar: it seems that the error/Phred-quality relationship is off (for the better) for Phred scores around 30. Other lanes look very similar. Can you explain? Should I be worried?

err_example_1 .pdf
err_example_2.pdf

@benjjneb
Owner

Does this data have binned quality scores?

@hjruscheweyh
Author

I'm not sure what you mean.

I ran filterAndTrim for an entire lane (between 29 and 230 samples depending on the lane) of subsampled reads and then ran learnErrors with nbases=1e8 so that most samples contributed to the learning. Are you looking for the plotQualityProfile plot?

@benjjneb
Owner

Yes, the plotQualityProfile plot would be helpful.

@hjruscheweyh
Author

I attached the quality plot for the raw and QC'd files from the R1 read. I also added the result of learnErrors. Looking at the QC profile, I believe the absence of low-quality bases may mess with the way that errors are learned.
[Attachments: fw_R1 learnErrors RDS; quality-profile plots of the QC'd and raw R1 reads]

@benjjneb
Owner

These plots strongly suggest that your data uses binned quality scores, i.e. only a subset of quality scores is present (perhaps only increments of 7 or something like that) in order to save space upon compression.

The default fitting of the error model in dada2 can struggle in the presence of binned quality scores. Our suggestion at this point in time is to enforce monotonicity on the matrix of error rates that is learned.
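
As an illustration of what enforcing monotonicity could look like (my own sketch, not an official dada2 recommendation), one can wrap the stock loessErrfun() so that each substitution rate is forced to be non-increasing as the quality score increases, and pass the wrapper to learnErrors():

library(dada2)

# Sketch only: wrap loessErrfun() and enforce monotone-decreasing error rates.
loessErrfun_mono <- function(trans) {
  err <- loessErrfun(trans)  # 16 x nQ matrix with rows "A2A", "A2C", ...
  # The 12 off-diagonal rows are the substitution (error) rates.
  sub.rows <- rownames(err)[substr(rownames(err), 1, 1) != substr(rownames(err), 3, 3)]
  for (r in sub.rows) {
    err[r, ] <- cummin(err[r, ])  # rates may only decrease as Q increases
  }
  # Recompute the self-transition rows so each A/C/G/T block still sums to 1.
  for (nt in c("A", "C", "G", "T")) {
    self <- paste0(nt, "2", nt)
    others <- setdiff(paste0(nt, "2", c("A", "C", "G", "T")), self)
    err[self, ] <- 1 - colSums(err[others, , drop = FALSE])
  }
  err
}

# err.mono <- learnErrors(filts, errorEstimationFunction = loessErrfun_mono, multithread = TRUE)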

@GuillemSalazar

Hi
We tried to enforce monotonicity by changing the parameters (span=2 and the log-transformed totals as weights) of the loess function used by loessErrfun():

mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),span = 2)

This results in a much more linear fit. Would that be a good way to solve the problem?

I attach the comparison below.
[Attachment: dada2_loess_modified, a comparison of the default and modified loess fits]

@benjjneb
Owner

Crudely, those parameters and results look reasonable, and I would not be uncomfortable about using them.

I am being cagey, because we are still waiting on good binned-Q data (probably from NovaSeq) with mock communities of known composition to develop official recommendations. But unofficially, that does seem reasonable.

@hjruscheweyh
Author

Hi Ben

Thanks for the agreement. We can safely proceed :). Actually, we have a mostly functional pipeline set up and just ran all subsampled samples (20k reads per sample) through it to test the performance and check for any hiccups. One thing that feels really weird is the way dada2 uses multiple threads. If I submit 1 job with 64 cores to our cluster, both learnErrors and dada seem to use only up to 20 cores; the other 44 cores remain idle. That is OK since I have plenty of jobs to submit in parallel. So I submit 4 16-core jobs as individual scripts (all on the same physical server). All 4 jobs use only 4-5 cores (the same jobs that used 20 cores before). Do you know any reason for this behaviour? It seems that there is a global variable that dictates maximal usage.

I checked by setting omp threads manually:

set OMP_NUM_THREADS=64
#or
export OMP_NUM_THREADS=64

But that doesn't change anything. Do you have any explanation for this?

Example of the command that uses multithreading:

dd.r1 <- dada(s.f.r1, err=err.r1, pool=TRUE, multithread = threads)

Thanks a lot,
Hans

@hjruscheweyh
Author

Hi Ben

Just another question. I ran a full NextSeq run without pooling through dada (1.14). I get an integer overflow issue. See below:

Sample 1 - 351212 reads in 30538 unique sequences.
Sample 2 - 715450 reads in 34912 unique sequences.
Sample 3 - 508777 reads in 25100 unique sequences.
Sample 4 - 490106 reads in 69691 unique sequences.
Sample 5 - 281025 reads in 20950 unique sequences.
Sample 6 - 344128 reads in 48603 unique sequences.
Sample 7 - 544861 reads in 43337 unique sequences.
Sample 8 - 560897 reads in 29655 unique sequences.
Sample 9 - 482598 reads in 46598 unique sequences.
Sample 10 - 869021 reads in 67254 unique sequences.
Sample 11 - 978367 reads in 62491 unique sequences.
Sample 12 - 747545 reads in 145082 unique sequences.
Sample 13 - 390345 reads in 31410 unique sequences.
Sample 14 - 590625 reads in 88199 unique sequences.
Sample 15 - 659543 reads in 58882 unique sequences.
Sample 16 - 1067459 reads in 105692 unique sequences.
Sample 17 - 446157 reads in 54103 unique sequences.
Sample 18 - 783800 reads in 70840 unique sequences.
Sample 19 - 1044266 reads in 128867 unique sequences.
Sample 20 - 556955 reads in 120568 unique sequences.
Sample 21 - 711144 reads in 132977 unique sequences.
Sample 22 - 524082 reads in 56383 unique sequences.
Sample 23 - 497646 reads in 31314 unique sequences.
Sample 24 - 388973 reads in 30250 unique sequences.
Sample 25 - 543830 reads in 22780 unique sequences.
Sample 26 - 414769 reads in 17575 unique sequences.
Sample 27 - 424535 reads in 48922 unique sequences.
Sample 28 - 580920 reads in 89356 unique sequences.
Sample 29 - 754170 reads in 114243 unique sequences.
Sample 30 - 532297 reads in 35909 unique sequences.
Sample 31 - 410756 reads in 43897 unique sequences.
Sample 32 - 655691 reads in 38228 unique sequences.
Sample 33 - 440342 reads in 24788 unique sequences.
Sample 34 - 267031 reads in 17722 unique sequences.
Sample 35 - 333218 reads in 16916 unique sequences.
Sample 36 - 478222 reads in 25560 unique sequences.
Sample 37 - 349399 reads in 22206 unique sequences.
Sample 38 - 395663 reads in 28953 unique sequences.
Sample 39 - 401721 reads in 19853 unique sequences.
Sample 40 - 493948 reads in 29174 unique sequences.
Sample 41 - 273872 reads in 34111 unique sequences.
Sample 42 - 303505 reads in 26666 unique sequences.
Sample 43 - 329251 reads in 45456 unique sequences.
Sample 44 - 814162 reads in 146546 unique sequences.
Sample 45 - 928740 reads in 100951 unique sequences.
Sample 46 - 423992 reads in 49015 unique sequences.
Sample 47 - 944860 reads in 86988 unique sequences.
Sample 48 - 1178791 reads in 126231 unique sequences.
Sample 49 - 215092 reads in 14849 unique sequences.
Sample 50 - 1143874 reads in 86430 unique sequences.
Sample 51 - 757443 reads in 141917 unique sequences.
Sample 52 - 694078 reads in 71720 unique sequences.
Sample 53 - 804583 reads in 145016 unique sequences.
Sample 54 - 760782 reads in 66370 unique sequences.
Sample 55 - 907568 reads in 82049 unique sequences.
Sample 56 - 429540 reads in 56058 unique sequences.
Sample 57 - 522284 reads in 39068 unique sequences.
Sample 58 - 1014281 reads in 108653 unique sequences.
Sample 59 - 579333 reads in 123978 unique sequences.
Sample 60 - 725787 reads in 93378 unique sequences.
Sample 61 - 623413 reads in 119078 unique sequences.
Sample 62 - 715183 reads in 92045 unique sequences.
Sample 63 - 724224 reads in 35930 unique sequences.
Sample 64 - 427033 reads in 25868 unique sequences.
Sample 65 - 503935 reads in 19992 unique sequences.
Sample 66 - 673159 reads in 30431 unique sequences.
Sample 67 - 510420 reads in 58367 unique sequences.
Sample 68 - 358977 reads in 41997 unique sequences.
Sample 69 - 595986 reads in 38220 unique sequences.
Sample 70 - 535834 reads in 31515 unique sequences.
Sample 71 - 312531 reads in 22720 unique sequences.
Sample 72 - 470833 reads in 20763 unique sequences.
Sample 73 - 301065 reads in 18402 unique sequences.
Sample 74 - 282795 reads in 42583 unique sequences.
Sample 75 - 227250 reads in 18164 unique sequences.
Sample 76 - 244842 reads in 34272 unique sequences.
Sample 77 - 609205 reads in 47344 unique sequences.
Sample 78 - 604785 reads in 30716 unique sequences.
Sample 79 - 364555 reads in 24459 unique sequences.
Sample 80 - 576879 reads in 21796 unique sequences.
Sample 81 - 816342 reads in 34079 unique sequences.
Sample 82 - 511440 reads in 61557 unique sequences.
Sample 83 - 440398 reads in 29454 unique sequences.
Sample 84 - 452220 reads in 52282 unique sequences.
Sample 85 - 617149 reads in 40831 unique sequences.
Sample 86 - 609355 reads in 45217 unique sequences.
Sample 87 - 567823 reads in 31745 unique sequences.
Sample 88 - 865608 reads in 90645 unique sequences.
Sample 89 - 643062 reads in 108151 unique sequences.
Sample 90 - 523234 reads in 60372 unique sequences.
Sample 91 - 463230 reads in 22168 unique sequences.
Sample 92 - 619403 reads in 31414 unique sequences.
Sample 93 - 413890 reads in 51083 unique sequences.
Sample 94 - 447178 reads in 65991 unique sequences.
Sample 95 - 280427 reads in 22384 unique sequences.
Sample 96 - 339920 reads in 44630 unique sequences.
Sample 97 - 567251 reads in 41982 unique sequences.
Sample 98 - 933535 reads in 96262 unique sequences.
Sample 99 - 950937 reads in 35135 unique sequences.
Sample 100 - 526168 reads in 35557 unique sequences.
Sample 101 - 537072 reads in 26301 unique sequences.
Sample 102 - 823232 reads in 44566 unique sequences.
Sample 103 - 543038 reads in 79279 unique sequences.
Sample 104 - 434677 reads in 31803 unique sequences.
Sample 105 - 540475 reads in 67741 unique sequences.
Sample 106 - 482124 reads in 31974 unique sequences.
Sample 107 - 494407 reads in 27844 unique sequences.
Sample 108 - 426101 reads in 29132 unique sequences.
Sample 109 - 384887 reads in 17177 unique sequences.
Sample 110 - 605725 reads in 37939 unique sequences.
Sample 111 - 283327 reads in 35177 unique sequences.
Sample 112 - 411370 reads in 27475 unique sequences.
Sample 113 - 738941 reads in 93453 unique sequences.
Sample 114 - 466651 reads in 30718 unique sequences.
Sample 115 - 489540 reads in 43275 unique sequences.
Sample 116 - 631197 reads in 32602 unique sequences.
Sample 117 - 908527 reads in 55642 unique sequences.
Sample 118 - 311299 reads in 42533 unique sequences.
Sample 119 - 806491 reads in 56382 unique sequences.
Sample 120 - 634925 reads in 61296 unique sequences.
Sample 121 - 605364 reads in 34744 unique sequences.
Sample 122 - 540710 reads in 44721 unique sequences.
Sample 123 - 322667 reads in 44563 unique sequences.
Sample 124 - 998462 reads in 88132 unique sequences.
Sample 125 - 1020303 reads in 57904 unique sequences.
Sample 126 - 437042 reads in 63050 unique sequences.
Sample 127 - 377367 reads in 24407 unique sequences.
Sample 128 - 318160 reads in 36955 unique sequences.
Sample 129 - 682924 reads in 52293 unique sequences.
Sample 130 - 396760 reads in 20120 unique sequences.
Sample 131 - 389061 reads in 33373 unique sequences.
Sample 132 - 757474 reads in 34484 unique sequences.
Sample 133 - 966686 reads in 54061 unique sequences.
Sample 134 - 498868 reads in 64678 unique sequences.
Sample 135 - 1102668 reads in 72959 unique sequences.
Sample 136 - 1058348 reads in 52536 unique sequences.
Sample 137 - 617600 reads in 57259 unique sequences.
Sample 138 - 470937 reads in 63810 unique sequences.
Sample 139 - 530213 reads in 37780 unique sequences.
Sample 140 - 1229099 reads in 73671 unique sequences.
Sample 141 - 698508 reads in 36275 unique sequences.
Sample 142 - 762751 reads in 110834 unique sequences.
Sample 143 - 507742 reads in 42413 unique sequences.
Sample 144 - 421899 reads in 54406 unique sequences.
Sample 145 - 694575 reads in 51037 unique sequences.
Sample 146 - 1054888 reads in 122967 unique sequences.
Sample 147 - 519121 reads in 119042 unique sequences.
Sample 148 - 520157 reads in 71854 unique sequences.
Sample 149 - 453936 reads in 77250 unique sequences.
Sample 150 - 514423 reads in 49384 unique sequences.
Sample 151 - 256358 reads in 15848 unique sequences.
Sample 152 - 353658 reads in 25660 unique sequences.
Sample 153 - 578110 reads in 25939 unique sequences.
Sample 154 - 564404 reads in 33160 unique sequences.
Sample 155 - 317172 reads in 40154 unique sequences.
Sample 156 - 651152 reads in 37385 unique sequences.
Sample 157 - 440152 reads in 55582 unique sequences.
Sample 158 - 746817 reads in 59516 unique sequences.
Sample 159 - 468810 reads in 25676 unique sequences.
Sample 160 - 416283 reads in 28748 unique sequences.
Sample 161 - 592442 reads in 24338 unique sequences.
Sample 162 - 371323 reads in 52251 unique sequences.
Sample 163 - 234308 reads in 15578 unique sequences.
Sample 164 - 830571 reads in 82223 unique sequences.
Sample 165 - 219809 reads in 15874 unique sequences.
Sample 166 - 608131 reads in 27132 unique sequences.
Sample 167 - 618864 reads in 127906 unique sequences.
Sample 168 - 855238 reads in 92880 unique sequences.
Sample 169 - 629280 reads in 113934 unique sequences.
Sample 170 - 929830 reads in 116317 unique sequences.
Sample 171 - 528834 reads in 54220 unique sequences.
Sample 172 - 690503 reads in 42588 unique sequences.
Sample 173 - 553222 reads in 33490 unique sequences.
Sample 174 - 352484 reads in 30471 unique sequences.
Sample 175 - 457372 reads in 31885 unique sequences.
Sample 176 - 580432 reads in 21688 unique sequences.
Sample 177 - 589975 reads in 30152 unique sequences.
Sample 178 - 376157 reads in 44228 unique sequences.
Sample 179 - 390658 reads in 30892 unique sequences.
Sample 180 - 286415 reads in 38268 unique sequences.
Sample 181 - 626501 reads in 30635 unique sequences.
Sample 182 - 438099 reads in 21334 unique sequences.
Sample 183 - 393738 reads in 29008 unique sequences.
Sample 184 - 407496 reads in 18682 unique sequences.
Sample 185 - 467012 reads in 23527 unique sequences.
Sample 186 - 542730 reads in 72130 unique sequences.
Sample 187 - 532527 reads in 26692 unique sequences.
Sample 188 - 533958 reads in 54953 unique sequences.
Sample 189 - 425544 reads in 32603 unique sequences.
Sample 190 - 425461 reads in 42137 unique sequences.
Sample 191 - 1008071 reads in 51648 unique sequences.
Sample 192 - 766610 reads in 64112 unique sequences.
Sample 193 - 450537 reads in 60441 unique sequences.
Sample 194 - 957894 reads in 60520 unique sequences.
Sample 195 - 866874 reads in 72894 unique sequences.
Sample 196 - 496739 reads in 84762 unique sequences.
Sample 197 - 545488 reads in 62296 unique sequences.
Sample 198 - 738860 reads in 144436 unique sequences.
Sample 199 - 459875 reads in 28038 unique sequences.
Sample 200 - 449384 reads in 28772 unique sequences.
Sample 201 - 306527 reads in 25468 unique sequences.
Sample 202 - 285946 reads in 11344 unique sequences.
Sample 203 - 553047 reads in 26212 unique sequences.
Sample 204 - 469381 reads in 59822 unique sequences.
Sample 205 - 437080 reads in 23638 unique sequences.
Sample 206 - 195941 reads in 24240 unique sequences.
Sample 207 - 366690 reads in 24631 unique sequences.
Sample 208 - 592025 reads in 38582 unique sequences.
Sample 209 - 258381 reads in 21392 unique sequences.
Sample 210 - 738862 reads in 52154 unique sequences.
Sample 211 - 1068967 reads in 76322 unique sequences.
Sample 212 - 557706 reads in 44625 unique sequences.
Sample 213 - 958885 reads in 55090 unique sequences.
Sample 214 - 628206 reads in 34072 unique sequences.
Sample 215 - 511720 reads in 35449 unique sequences.
Sample 216 - 307537 reads in 45232 unique sequences.
Sample 217 - 606819 reads in 73886 unique sequences.
Sample 218 - 368785 reads in 26080 unique sequences.
Sample 219 - 927827 reads in 149280 unique sequences.
Sample 220 - 754982 reads in 51405 unique sequences.
Sample 221 - 240727 reads in 14116 unique sequences.
Sample 222 - 595837 reads in 68246 unique sequences.
Sample 223 - 353888 reads in 13935 unique sequences.
Sample 224 - 258141 reads in 21685 unique sequences.
Sample 225 - 572448 reads in 55347 unique sequences.
Sample 226 - 580444 reads in 43897 unique sequences.
Sample 227 - 1109885 reads in 55443 unique sequences.
Sample 228 - 357341 reads in 13441 unique sequences.
Sample 229 - 222906 reads in 15170 unique sequences.
Sample 230 - 752673 reads in 37262 unique sequences.
Sample 231 - 875523 reads in 46453 unique sequences.
Sample 232 - 744938 reads in 116636 unique sequences.
Sample 233 - 883347 reads in 62757 unique sequences.
Sample 234 - 1185173 reads in 89559 unique sequences.
Sample 235 - 703968 reads in 87625 unique sequences.
Sample 236 - 469223 reads in 40192 unique sequences.
Warning messages:
1: In rval[, 1:ncol(tt)] + tt : NAs produced by integer overflow
2: In rval[, 1:ncol(tt)] + tt : NAs produced by integer overflow
3: In rval[, 1:ncol(tt)] + tt : NAs produced by integer overflow
4: In rval[, 1:ncol(tt)] + tt : NAs produced by integer overflow
5: In rval[, 1:ncol(tt)] + tt : NAs produced by integer overflow
dada-class: object describing DADA2 denoising results
896 sequence variants were inferred from 30538 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

Is this something to worry about? Will the counter start at 0 again after an overflow? Judging from the dada2 call, I would say that the warning is created during the dada() call and not during the makeSequenceTable() call.

Here is the call:

err.r1 <- readRDS(err.rds.r1)


dd.r1 <- dada(s.f.r1, err=err.r1, pool=FALSE, multithread = threads)
dd.r1[[1]]
seqtab.r1 <- makeSequenceTable(dd.r1)

saveRDS(seqtab.r1, file = outfile.r1)

@benjjneb
Owner

Is this something to worry about? Will the counter start at 0 again after an overflow? Judging from the dada2 call, I would say that the warning is created during the dada() call and not during the makeSequenceTable() call.

Happily, no, this isn't something you need to worry about. It is happening during the dada call, but the overflowing computation isn't relevant to the sample inference/denoising; it is part of the post-hoc summation of the observed error rates over all the samples. But those post-hoc error rates aren't used for anything.

We should fix this more gracefully though. As datasets have gotten larger and larger we are hitting spots where R's signed 32-bit integers are not ideal.

@benjjneb
Owner

All 4 jobs use only 4-5 cores (the same jobs that used 20 cores before). Do you know any reason for this behaviour? It seems that there is a global variable that dictates maximal usage.

I don't have an entirely satisfactory response, but I can tell you that OMP thread values are not an issue here. The multithreading for all dada functions (except filterAndTrim) uses the RcppParallel package, which relies on Intel's TBB (Threading Building Blocks) for the backend implementation. So, if there is a non-ideal interaction between the compute environment/architecture and the relevant TBB functionality, one can end up with non-optimal thread usage.

One thing I would try is to see if specifying the number of threads, i.e. multithread=16 rather than multithread=TRUE, does a better job. That would be the case if the issue is in auto-detecting the available thread numbers. But beyond that, I just don't understand the multithreading backend well enough to offer much guidance.
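
One thing worth trying (my own suggestion, not something verified in this thread) is to check and pin the RcppParallel/TBB thread count directly before calling the dada2 functions:

library(RcppParallel)

# Check how many threads the TBB backend auto-detects in this environment.
defaultNumThreads()

# Pin the thread pool explicitly before running learnErrors()/dada()
# (note: passing a number to multithread= may set this again internally).
setThreadOptions(numThreads = 16)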

@hjruscheweyh
Author

Hi Ben

Sounds reasonable. Good that the overflow doesn't break the ASV table :).

Regarding multithreading... I tested with multithread=N, which is why I found the behaviour strange. Other parts of the pipeline are perfectly multithreaded, e.g. the bimera-removal part. So I guess the dada algorithm with pool=TRUE just doesn't multithread that well. Let's hope that I can push large quantities of data through the pipeline in a reasonable time :).

@benjjneb
Owner

benjjneb commented Apr 6, 2020

Closing, but please re-open as needed.

@JacobRPrice

Hi @hjruscheweyh,

I'm encountering the same type of challenges you're describing above, especially those regarding the funky results from learnErrors(). It looks like the sequencing facility also sent us results with binned quality scores and I've been scratching my head on how to address/correct for that. If I may, I'd like to ask you a question about the solution you used.

You state above that you used:

mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),span = 2)

I might be guilty of being dense here, but for the sake of clarity: when carrying this out, is it as simple as passing mod.lo as the errorEstimationFunction argument of dada(), i.e.:

dadaFs <- dada(
  derep = filtFs,
  err = NULL,
  errorEstimationFunction = mod.lo, 
  selfConsist = TRUE,
  pool = FALSE,
  multithread = parallel::detectCores() - 1, 
  verbose = TRUE
)

Thanks,

Jake

@hjruscheweyh
Author

Hi Jacob

I think @GuillemSalazar Is far more qualified to answer this question ;).

Best,
Hans

@JacobRPrice

Thank you @hjruscheweyh.

@GuillemSalazar, would you be able to chime in on my question above?

@GuillemSalazar

Hi Jacob.
We just manually defined a modified loessErrfun() function as follows:

loessErrfun_mod <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow = 0, ncol = length(qq))
  for (nti in c("A", "C", "G", "T")) {
    for (ntj in c("A", "C", "G", "T")) {
      if (nti != ntj) {
        errs <- trans[paste0(nti, "2", ntj), ]
        tot <- colSums(trans[paste0(nti, "2", c("A", "C", "G", "T")), ])
        rlogp <- log10((errs + 1) / tot)
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q = qq, errs = errs, tot = tot, rlogp = rlogp)
        # Modified fit: log-transformed totals as weights and span = 2.
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), span = 2)
        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred) > maxrli] <- pred[[maxrli]]
        pred[seq_along(pred) < minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      }
    }
  }
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-07
  est[est > MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est < MIN_ERROR_RATE] <- MIN_ERROR_RATE
  # Reassemble the full 16-row matrix, filling in the self-transition rows.
  err <- rbind(1 - colSums(est[1:3, ]), est[1:3, ],
               est[4, ], 1 - colSums(est[4:6, ]), est[5:6, ],
               est[7:8, ], 1 - colSums(est[7:9, ]), est[9, ],
               est[10:12, ], 1 - colSums(est[10:12, ]))
  rownames(err) <- paste0(rep(c("A", "C", "G", "T"), each = 4), "2", c("A", "C", "G", "T"))
  colnames(err) <- colnames(trans)
  return(err)
}

and then used it inside the dada() function:

dd <- dada(..., errorEstimationFunction = loessErrfun_mod)

Hope this helps!

@JacobRPrice

@GuillemSalazar

I'm sorry, I thought I had responded to your reply. We were hoping that the sequencing facility would be able to recover the original (non-binned) versions of the quality scores, so I had not tried your method yet. It looks like they aren't able to give us the original data we're looking for, so I'll be taking your approach. Thank you for your response and for being explicit about how you addressed the issue.

Jake
