test_load_graph_multithread() in test_scripts.py finds different unique k-mers #1248
Right, this sort of thing has been brewing for a while. It could be due to the probabilistic nature of the Count-Min Sketch or it could be a threading bug. @luizirber @camillescott may have insights to share.
On Thu, Aug 13, 2015 at 02:08:58PM -0700, Michael R. Crusoe wrote:
> Having dug around in that code recently (and now simplified it), it shouldn't

Let's take a look through the issue tracker to find you a new bug.
@ctb Is this "known-issues" worthy?
On Fri, Sep 04, 2015 at 09:49:16AM -0700, Michael R. Crusoe wrote:
yep.
This should be resolved by #876 (and serve as extra incentive for me to get that merged).
Another occurrence of a Heisenbug in
I'm looking into both of these. For reference, both produce varying numbers of unique k-mers when executed with more than one thread.
The easy fix is to add a lock at a fairly high level, which wipes out all the benefits of having more threads. The long-term solution is to have a queue between the reader and the hashtable: then you can have multiple threads reading the input file (in a chunked fashion) and separately control the number of threads that consume the reads from the queue. At the very least we need a Parser instance per thread that does not share anything with the others. Right now the filling of a read is protected by a home-made lock, but there doesn't seem to be anything to stop other threads from calling it concurrently.

(When reading a gzip'ed file you probably want several producers, as decompression takes quite a bit of CPU. For a plain-text file a single reader thread is probably enough and you'd want more threads consuming the reads.)
The gzip format makes it ~impossible to do stuff in parallel. Tools like pigz speed up compression but can't do much for decompression (~6s for gunzip or pigz on ecoli). So start with a single-threaded reader feeding a queue, and consume the reads in parallel.
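A minimal sketch of that single-reader/queue/multi-consumer shape in Python (all names here are illustrative, not khmer's actual API; in khmer the consumers would call into the C++ hashtable rather than update a dict, and that's where any real parallel speedup would come from):

```python
import queue
import threading

def reader(records, q, n_consumers):
    """Single producer: push records, then one sentinel per consumer."""
    for rec in records:
        q.put(rec)
    for _ in range(n_consumers):
        q.put(None)

def consumer(q, table, lock):
    """Pull records off the queue until a sentinel arrives."""
    while True:
        rec = q.get()
        if rec is None:
            break
        with lock:  # stand-in for the hashtable's own internal locking
            table[rec] = table.get(rec, 0) + 1

def count_in_parallel(records, n_consumers=4):
    q = queue.Queue(maxsize=1024)  # bounded, so the reader can't run ahead forever
    table, lock = {}, threading.Lock()
    workers = [threading.Thread(target=consumer, args=(q, table, lock))
               for _ in range(n_consumers)]
    for w in workers:
        w.start()
    reader(records, q, n_consumers)
    for w in workers:
        w.join()
    return table
```

The point of the bounded queue is back-pressure: the reader blocks when consumers fall behind, so memory stays flat regardless of file size.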
After some digging into this and failing to reproduce it, I noticed that the bug appears if you reduce the size of the BF. From that I gathered that there isn't a bug anywhere, but that whether or not a k-mer is "new" depends on which k-mers were entered into the BF before it. With only one thread the order is always the same, but with multiple threads the order changes between re-runs (a hand-built example is sketched after the next paragraph).
So even with no bug you can get a different answer for how many unique k-mers there are. The reason you don't see it for (very) large BFs is probably that the BF is super sparse.
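To make that concrete, here's a toy illustration (not khmer code; the bit positions are chosen by hand so the effect is deterministic):

```python
# Three items in a tiny Bloom filter. Hand-picked bit positions:
#   a -> {1, 2}, b -> {3, 4}, c -> {1, 3}
POSITIONS = {"a": [1, 2], "b": [3, 4], "c": [1, 3]}

def count_unique(order, m=8):
    table = [False] * m
    unique = 0
    for item in order:
        pos = POSITIONS[item]
        if not all(table[p] for p in pos):  # not all bits set: looks new
            unique += 1
        for p in pos:
            table[p] = True
    return unique

print(count_unique("abc"))  # c's bits were already set by a and b -> 2
print(count_unique("cab"))  # every item looks new in this order   -> 3
```

In order "abc", c is a false positive (bits 1 and 3 were set by a and b) and isn't counted; in order "cab" nothing fully collides. Same three items, two different "unique" counts.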
Short gist with some code to reproduce this: https://gist.github.com/betatim/35112d0d6fd0ad543748ae3355cd0cbc
This makes sense to me. Not sure what to do about it, but it makes sense to me.
I don't think there is anything that can be done.
@betatim @luizirber take a look at https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter - this might be a better (approximate) way of computing n_unique_kmers.
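That estimator only looks at the final bit pattern, which is the same no matter what order the k-mers were inserted in, so it would give a thread-count-independent answer. A minimal sketch of the formula from that page (function name is made up here):

```python
import math

def estimate_n_unique(bits_set, m, k):
    """Swamidass & Baldi estimate of distinct items in a Bloom filter:
    n* = -(m / k) * ln(1 - X / m), where X is the number of set bits,
    m the filter size in bits, and k the number of hash functions."""
    if bits_set >= m:
        raise ValueError("filter is saturated; the estimate diverges")
    return -(m / k) * math.log(1.0 - bits_set / m)
```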
Working on #998, using this command:

With the default args (ran ~20x), stdout showed 4 different values for unique k-mers:

When changing the args to -T 1, only one unique k-mer count appeared (ran ~20x):
OS info: