-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Running with tutorial dataset giving all empty file #119
Comments
You skipped this part of the the tutorial:
For more information, in the latest master branch of V-pipe, you can look in file V-pipe/config/config.html, see section "input", parameter "read_length".
Yes. The length cutoff is: 200
…
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 1): 150.55
…
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 2): 150.66 Because there's no length specified (e.g.: in the third column as suggested in the tutorial), the cut-off value is computed from the default length (based on 250bp) and set to ≥200bp. The read's length (150bp) is shorter than this cut-off, every single read is considered "too short" and filtered out. There are no remaining reads after quality filtering. Future version of V-pipe will be able to illustrate the stage at which reads were rejected (we currently have custom scripts that haven't been incorporated into V-pipe's master yet). |
Hi, Thanks a lot for pointing it out! I somehow missed the information in the tutorial. Now I have run the tutorial dataset with success. However, when I tried to run my sewage samples dataset (with just two samples), it keep running without an error message and I had to stop it after a few hours. It seems like there is a problem during alignment. Do you think that the pipeline will work for read length of 101 bp ? and where could be the problem according to you? Best regards, |
regarding alignmentsCheck if the previous step (running prinseq in rule preprocessing) has successfully completed and you successfully got the Which aligner are you using?
Also if you have extremely deep coverage as happens e.g. with Illumina NovaSeq, due to the sheer amount of reads the alignment can take some time, and the SNV calling with ShoRAH can be extremely slow (we are aware of these shortcomings, but it will take quite some time to fix. In the meantime you can fall back to LoFreq as an SNV caller). regarding the read-lengthAlso, although we have run V-pipe successfully even on 50bp or 75bp read-length (some of the labs that sequence clinical swab samples do shotgun sequencing), you do need to keep in mind with sewage sequencing is that read-length matters: source: doi:10.1101/2021.05.26.21257861 Other groups have indeed shown that the optimal amplicon length for wastewater seem to be around ~400bp as done by, e.g., the Artic protocols (our monitoring used V3 in the past and has now switched to V4 which has better performance on Omicron and plans to switch to V4.1 as soon as it is available in commercial ready-made kits). With the much shorter amplicons (e.g. like Nimagen), you might have fewer opportunities for cooccurrences of mutations. (and with much longer amplicons than Artic (e.g. long reads) there might be fewer fragments as large that manage to survive through the wastewater all the way to the sequencing). This is important to keep in mind if you would want to later run our tool cojac on the output to look for mutation cooccurrences. (note: as of the time being, check its dev branch as we haven't released version 0.2 yet). Looking for individual mutations:
For all our wastewater sequencing, we have used Artic protocols that produce ~400bp amplicon and sequenced those with paired-end 250bp sequencing in order to cover well each whole amplicon. |
Hi, we have updated the config's readme section about the samples TSV to make this more clear. |
Hi, Many thanks for your answers and suggestions! Best regards, |
First, making sure that the alignment step ( In the output
If you only have one region that is highly covered (i.e.: only a few fragment of the virus reached all the way to sequencing) there might by some wet-lab problems (in our experience, the RNA extraction and concentration is a very difficult task. Your lab might want to discuss with our coauthors from Eawag (Swiss Federal Institute of Aquatic Science and Technology) on our paper. Some labs we work with have reported success using the RNA extraction kits produced by Promega). As soon as you have some good aligner reads, you might want to give a try to our tool cojac. Use the dev branch as we haven't released a package 0.2 on bioconda yet. Now regarding SNV calling:
If it's the later case: there is a corner case in ShoRAH that handles poorly when the local-windows for the local-haplotypes aren't aligned to the boundaries of amplicons, leading to a lot of the reads being lost. |
Hi,
I have tried to run the tutorial dataset (SARS-CoV-2) with the command mentioned on the webpage,
https://cbg-ethz.github.io/V-pipe/tutorial/sars-cov2/
However, after running successfully the whole script, the generated output files are all empty. Could anyone help me find out where is the problem?
Is it happeining in prinseq? Below is the output of prinseq;
The length cutoff is: 200
[prinseq-lite-0.20.4] [01/21/2022 23:24:18] Executing PRINSEQ with command: "perl prinseq-lite.pl -fastq samples/SRR10903401/20200102/extracted_data/R1.fastq -fastq2 samples/SRR10903401/20200102/extracted_data/R2.fastq -out_format 3 -trim_qual_right 30 -trim_qual_left 30 -trim_qual_type min -trim_qual_rule lt -trim_qual_window 10 -trim_qual_step 1 -log samples/SRR10903401/20200102/preprocessed_data/prinseq.out.log -min_len 200 -min_qual_mean 30 -ns_max_n 4 -out_good samples/SRR10903401/20200102/preprocessed_data/R -out_bad null"
[prinseq-lite-0.20.4] [01/21/2022 23:24:18] Parsing and processing input data: "samples/SRR10903401/20200102/extracted_data/R1.fastq" and "samples/SRR10903401/20200102/extracted_data/R2.fastq"
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Done parsing and processing input data
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input sequences (file 1): 476,632
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input bases (file 1): 71,758,102
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 1): 150.55
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input sequences (file 2): 476,632
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input bases (file 2): 71,807,572
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 2): 150.66
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (pairs): 0
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (singletons file 1): 0 (0.00%)
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (singletons file 2): 0 (0.00%)
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad sequences (file 1): 476,632 (100.00%)
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad bases (file 1): 71,758,102
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad mean length (file 1): 150.55
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad sequences (file 2): 0 (0.00%)
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Sequences filtered by specified parameters:
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] trim_qual_left: 180134
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] min_len: 773130
Please let me know,
Best regards,
Wasim
The text was updated successfully, but these errors were encountered: