Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test - error report - 0 corrected reads? #34

Open
desmodus1984 opened this issue Feb 20, 2021 · 7 comments
Open

Test - error report - 0 corrected reads? #34

desmodus1984 opened this issue Feb 20, 2021 · 7 comments

Comments

@desmodus1984
Copy link

Hi,

I sequenced a genome and I got 16 paired-end libraries. To test the speed of the program, I picked on library ~11 Gb each .gz file, and used the following code:
./lighter -r A-407_1.fq.gz -r A-407_2.fq.gz -K 15 2300000000 -t 48 -od test

I checked the report, and it is:
[2021-02-19 23:57:45] =============Start====================
[2021-02-19 23:57:45] Scanning the input files to infer alpha(sampling rate)
[2021-02-20 00:01:37] Average coverage is 7.923 and alpha is 0.883
[2021-02-20 00:01:41] Bad quality threshold is "5"
[2021-02-20 00:07:35] Finish sampling kmers
[2021-02-20 00:07:35] Bloom filter A's false positive rate: 0.000000
[2021-02-20 00:18:33] Finish storing trusted kmers
[2021-02-20 00:37:52] Finish error correction
Processed 182234180 reads:
0 are error-free
Corrected 0 bases(0.000000 corrections for reads with errors)
Trimmed 0 reads with average trimmed bases 0.000000
Discard 0 reads

It looks very weird because 0 reads are error-free, and also looks like none were corrected?

Any suggestions of this result?

Thanks;

@mourisl
Copy link
Owner

mourisl commented Feb 20, 2021

The false positive rate of bloom filter A is 0 may suggest that Lighter did not sample any kmers. This maybe due to the computed bad quality score threshold "5" is too high. Can you try run Lighter with option --noQual?

There could be another issue. It seems the read coverage is relatively low, and Lighter is not able to distinguish trust and untrusted kmers As a result, the error correction results might not be optimal.

@desmodus1984
Copy link
Author

desmodus1984 commented Feb 20, 2021 via email

@mourisl
Copy link
Owner

mourisl commented Feb 20, 2021

From my own experience, Lighter works well with coverage as low as 15X coverage. When <10, Lighter may fail to correct any errors. For example, in this study, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01988-3/figures/2 , Lighter did not correct any errors with 8X coverage. I think this is also the combined effects of read coverage and kmer size. If you set a large kmer size, there might still be statistical powers to distinguish trusted and untrusted kmers in Lighter.

On your second run when combining all samples, it seems the correction sensitivity is low. I think kmer size 15 might be too small given the genome size is 2.3G. You can either try -K 31, or change the MAX_KMER_LENGTH in "utils.h" to 128, recompile Lighter and run Lighter with -K 65. When setting MAX_KMER_LENGTH > 32, Lighter will be slower.

Hope this helps. Thank you!

@desmodus1984
Copy link
Author

desmodus1984 commented Feb 20, 2021 via email

@mourisl
Copy link
Owner

mourisl commented Feb 21, 2021

For the human genome, the kmer size should be at least 19. Since 3G/4^19 is about 1%, this suggests that by random chance, two kmers from two different locations have the sequence is about 1%. In practice, due to repeats, we may want to have a larger kmer size. So I think in many studies, kmer size are picking as 31 to be the balance of computation efficiency (can be represented by a 64 bit variable) and kmer's uniqueness.

@desmodus1984
Copy link
Author

desmodus1984 commented Feb 21, 2021 via email

@mourisl
Copy link
Owner

mourisl commented Feb 21, 2021

It's unlikely to have no error reads. Since your data already has some filtrations with quality score, I would suggest to use "--noQual" option in Lighter. I feel Lighter would still fail to correct given 8X coverage and this high value of alpha (0.960). I would suggest using BFC, Musket or BLESS. If you plan to run all the samples together (~60X coverage), Lighter should work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants