Best way to run proovread on large dataset #48
Comments
Hi Beatriz, the recommendation regarding chunk size in the manual is somewhat outdated. With the latest versions of proovread, the rule of thumb is: Use chunks as large as possible, but with respect to the following limits:
Also, max out --threads; running different instances on the same machine is no longer necessary.
Hi Thomas, thank you for your prompt reply and your useful comments. I was interested in using ccs reads as well as Illumina reads for correction. However, I see that the %masked drops considerably in the last iteration when I use Illumina+ccs. Is this normal? Is it better to avoid mixing data types? At the moment I am doing these runs with low coverage, so I don't know if that would be a factor.
ILLUMINA HISEQ (16X) ONLY
ILLUMINA HISEQ (16X) + CCS (1X)
Any suggestions? I am getting MiSeq data soon (~50X coverage). In that case, do you recommend using only MiSeq data for correction? Thanks in advance, Beatriz
The difference between the two runs is the running mode: sr (short reads, for HiSeq <=100bp) vs mr (medium reads, for MiSeq or merged HiSeq reads >100bp). Because of the longer ccs reads, proovread decided to use the mr mode for the second run, but that mode is not sensitive enough to properly align the shorter HiSeq reads. You could explicitly set the mode to sr when using ccs as well. However, you won't have that problem with MiSeq data anyway. 50X MiSeq is more or less the perfect data set. If possible, use merged overlapping MiSeq reads for correction. They will work well with or without ccs reads.
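To make the sr/mr distinction concrete, here is a toy sketch of a length-based mode choice. It is not proovread's actual selection logic, which is not shown in this thread: the function name `pick_mode`, the use of the longest read as the deciding value, and the exact 100bp cutoff are all assumptions for illustration.

```python
def pick_mode(read_lengths, threshold=100):
    """Hypothetical heuristic (NOT proovread's real code): choose 'sr' when
    every correction read is <= threshold bp, otherwise 'mr'."""
    if not read_lengths:
        raise ValueError("no read lengths given")
    # Assumption: a single long read (e.g. a ccs read) is enough to tip the choice.
    return "sr" if max(read_lengths) <= threshold else "mr"


hiseq_only = [100, 100, 100]           # HiSeq reads only
hiseq_plus_ccs = [100, 100, 1500]      # same reads plus one longer ccs read

print(pick_mode(hiseq_only))       # all reads <= 100bp
print(pick_mode(hiseq_plus_ccs))   # longer ccs read present
```

Under these assumptions, adding a few long ccs reads to a HiSeq-only data set is enough to switch the mode, which matches the behaviour described above and is why setting the mode explicitly can help.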
Hi Thomas, I wanted to ask you for advice regarding overlapping the MiSeq reads. So far, my quality control pipeline includes trimming (trimmomatic) followed by error correction (musket). Do you know which approach is more successful for correcting long PacBio reads using MiSeq data with proovread:
I ask because my MiSeq reads have poor quality at the 3' end. My coverage at this point is PE-HiSeq (11X), PE-MiSeq (23X), ccs (0.86X). Do you suggest using all the data for proovread to get good enough coverage? Or would it be best to use only MiSeq? Thank you for your time, Beatriz
Hi Beatriz, I would go for overlapping the reads directly. Overlapping already decreases error rates in the tails, and very poor ends won't produce merged reads anyway. You can do trimming/correction afterwards, but I don't think it is necessary. Since proovread creates a consensus from multiple Illumina reads, random errors in single reads don't affect correction accuracy. I guess that your coverage will decrease during overlapping, so you should use both HiSeq and MiSeq reads. Make sure to set
Cheers
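The point about consensus absorbing random errors can be illustrated with a toy majority vote. This is only a sketch: it assumes the short reads are already aligned and of equal length, and it is not proovread's actual consensus algorithm. The function name `column_consensus` and the example reads are made up for illustration.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Majority vote per column over equal-length, pre-aligned reads.
    Toy illustration only, not proovread's consensus method."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )


# Five hypothetical reads covering the same region; two carry one random error each.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",  # error at position 3
    "ACGTAC",
    "ACGTTC",  # error at position 5
]

print(column_consensus(reads))  # ACGTAC
```

Because the two errors fall in different columns, each is outvoted by the four correct reads, so the consensus is unaffected. The same reasoning explains why sufficient coverage matters more than perfectly clean individual reads.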
Hi there, I wonder how to tell the program to deal with many .fq files. In proovread.cfg, it says: "LIST of Pacbio read files to correct. FASTA or FASTQ format." 'long-reads' => [], but it is not clear how to use this setting. #74 Thanks.
Hi,
I am interested in using proovread to correct PacBio long reads. My question is about usage: the manual says to try a subset of the data first (which I did). Now that I have seen that it works, should I run proovread on my whole dataset (a 19GB file), or should I always run it on small subsets of PacBio reads (20M in the manual)?
I currently have about 18X Illumina coverage and 37X PacBio long-read coverage. The genome that I am sequencing is 260Mbp. I am planning on getting more Illumina paired-end data to reach at least 50X coverage, but I am doing some tests while I wait for the new data.
Thank you for your time.
Beatriz
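For reference, the coverage figures in this post follow from simple arithmetic: coverage is total sequenced bases divided by genome size. The sketch below uses the 260Mbp genome size from the post; the helper names are made up for illustration, and the base counts are derived from the stated coverages rather than measured.

```python
GENOME_SIZE_BP = 260_000_000  # 260Mbp, as stated in the post


def coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Average fold coverage given the total number of sequenced bases."""
    return total_bases / genome_size


def bases_for_coverage(target_coverage, genome_size=GENOME_SIZE_BP):
    """Total bases needed to reach a target fold coverage."""
    return target_coverage * genome_size


print(bases_for_coverage(37))          # bases implied by 37X PacBio coverage
print(bases_for_coverage(50))          # bases needed for 50X Illumina coverage
print(coverage(bases_for_coverage(18)))  # round trip for the current 18X Illumina
```

For this genome, 37X corresponds to about 9.62 Gbp of long reads and 50X to 13 Gbp of short reads.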