
test_load_graph_multithread() in test_scripts.py finds different unique k-mers #1248

Open
johnsolk opened this issue Aug 13, 2015 · 14 comments

@johnsolk (Member)

Working on #998, using this command:

./setup.py nosetests --tests tests/test_scripts.py:test_load_graph_multithread

default args:

args = ['-N', '4', '-x', '1e7', '-T', '8', outfile, infile]

Running it ~20 times displayed 4 different unique k-mer counts in stdout:

Total number of unique k-mers: 1124051
Total number of unique k-mers: 1124050
Total number of unique k-mers: 1124049
Total number of unique k-mers: 1124052

When the args are changed to -T 1, only one count appears (ran ~20x):

Total number of unique k-mers: 1124049

OS info:

ProductName: Mac OS X
ProductVersion: 10.9.5
BuildVersion: 13F34

@mr-c (Contributor) commented Aug 13, 2015

Right, this sort of thing has been brewing for a while. It could be due to the probabilistic nature of the Count-Min Sketch or it could be a threading bug. @luizirber @camillescott may have insights to share

mr-c added this to the 2.0+ milestone Aug 14, 2015
@ctb (Member) commented Aug 14, 2015

On Thu, Aug 13, 2015 at 02:08:58PM -0700, Michael R. Crusoe wrote:

> Right, this sort of thing has been brewing for a while. It could be due to the probabilistic nature of the Count-Min Sketch or it could be a threading bug. @luizirber @camillescott may have insights to share

Having dug around in that code recently (and now simplified it), it shouldn't be happening if the code was working the way I understood it to work. So there's almost certainly a bug... Thanks, @ljcohen!

Let's take a look through the issue tracker to find you a new bug to chase while we figure this out :)

@mr-c (Contributor) commented Sep 4, 2015

@ctb Is this "known-issues" worthy?

@ctb (Member) commented Sep 4, 2015

On Fri, Sep 04, 2015 at 09:49:16AM -0700, Michael R. Crusoe wrote:

> @ctb Is this "known-issues" worthy?

yep.

@camillescott (Member)

This should be resolved by #876 (and serve as extra incentive for me to get that merged)

@betatim (Member) commented Nov 2, 2016

Another occurrence of a Heisenbug, this time in test_abundance_dist_threaded.

betatim mentioned this issue Nov 18, 2016
@betatim (Member) commented Dec 9, 2016

I'm looking into both of these.

For reference:

$ for n in {1..20}; do scripts/abundance-dist-single.py -s -x 1e7 -N 2 -k 17 -z --threads 2 tests/test-data/test-reads.fa /tmp/test.dist 2>&1 | grep Total; done
$ for n in {1..20}; do scripts/load-graph.py -N 4 -x 1e7 -T 8 /tmp/test tests/test-data/test-reads.fa 2>&1 | grep "Total"; done;

Both produce varying numbers of unique k-mers when executed with more than one thread.

@betatim (Member) commented Dec 12, 2016

The easy fix is to add a lock at a fairly high level, which wipes out all the benefits of having more threads.

The long-term solution is to have a queue between the reader and the hashtable. Then you can have multiple threads reading the input file (in a chunked fashion) and separately control the number of threads that consume the reads from the queue; see the sketch below. At the very least we need a Parser instance per thread that does not share anything with the others. Right now the filling of a read is protected by a home-made lock, but there doesn't seem to be anything to stop other threads from calling read->reset() at any moment, or replacing the contents of a read while one thread is still using it to fill a hashtable.

(When reading a gzip'ed file you probably want several producers as that takes quite a bit of CPU. For a plain text file a single thread probably is enough and you'd want more threads consuming the reads.)
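
A minimal sketch of that reader -> queue -> consumers layout, in pure Python. parse_chunks and CountingTable are made-up stand-ins for khmer's parser and hashtable, not the real API; the point is the shape: one reader thread filling a bounded queue, with each consumer owning its chunk outright so no read is shared mid-use.

import threading
import queue

# Hypothetical stand-ins for khmer's parser and counting table.
def parse_chunks(reads, chunk_size=2):
    """Yield reads in chunks so the queue carries batches, not single reads."""
    for i in range(0, len(reads), chunk_size):
        yield reads[i:i + chunk_size]

class CountingTable:
    """Toy shared table; a lock guards each update."""
    def __init__(self):
        self.counts = {}
        self.lock = threading.Lock()

    def consume(self, seq, k=3):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            with self.lock:
                self.counts[kmer] = self.counts.get(kmer, 0) + 1

def producer(reads, q, n_consumers):
    for chunk in parse_chunks(reads):
        q.put(chunk)
    for _ in range(n_consumers):  # one sentinel per consumer to shut it down
        q.put(None)

def consumer(q, table):
    while True:
        chunk = q.get()
        if chunk is None:
            break
        for seq in chunk:  # this thread owns the chunk; nothing can reset it
            table.consume(seq)

reads = ["ACGTACGT", "TTGACCAA", "ACGTTTGA", "CCAAACGT"]
q = queue.Queue(maxsize=8)  # bounded, so the reader can't race far ahead
table = CountingTable()
workers = [threading.Thread(target=consumer, args=(q, table)) for _ in range(2)]
for w in workers:
    w.start()
producer(reads, q, n_consumers=2)
for w in workers:
    w.join()
print(sorted(table.counts.items()))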

@betatim (Member) commented Dec 12, 2016

The gzip format makes it nearly impossible to decompress in parallel. Tools like pigz speed up compression but can't do much for decompression (~6s for gunzip or pigz on ecoli).

So: start with a single-threaded reader feeding a queue, and consume the reads in parallel.

@betatim (Member) commented Jan 6, 2017

After some digging into this and failing to reproduce it, I noticed that the bug appears if you reduce the size of the BF. From that I gathered that there isn't a bug anywhere; rather, whether or not a k-mer is "new" depends on which k-mers were entered into the BF before it. With only one thread the order is always the same, but with multiple threads the order changes between re-runs. Example:

A simple BF with 9 'slots' and 3 hash functions, and three k-mers with these bit patterns:

A: 100010100
B: 000010101
C: 111000000

Entered in different orders:

ABC -> 3 unique k-mers
CAB -> 3 unique k-mers
BCA -> 2 unique k-mers

So even with no bug you can get different answers for how many unique k-mers there are. The reason you don't see it for (very) large BFs is probably that the BF is super sparse.
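
To make the order-dependence concrete, here is a toy reproduction using the three bit patterns above. This is not khmer's code; the Bloom filter is reduced to explicit sets of bit positions.

from itertools import permutations

# Each k-mer is represented by the set of bit positions its three
# hash functions would set (the patterns from the comment above).
BITS = {
    "A": {0, 4, 6},  # 100010100
    "B": {4, 6, 8},  # 000010101
    "C": {0, 1, 2},  # 111000000
}

def count_unique(order):
    """Count k-mers whose bits were not already all set on insertion."""
    filter_bits = set()
    unique = 0
    for kmer in order:
        if not BITS[kmer] <= filter_bits:  # at least one bit is still unset
            unique += 1
        filter_bits |= BITS[kmer]
    return unique

for order in permutations("ABC"):
    print("".join(order), "->", count_unique(order))
# ABC -> 3, CAB -> 3, BCA -> 2: same k-mers, different counts.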

@betatim (Member) commented Jan 6, 2017

Short gist with some code to reproduce this: https://gist.github.com/betatim/35112d0d6fd0ad543748ae3355cd0cbc

@ctb (Member) commented Jan 19, 2017

This makes sense to me. Not sure what to do about it, but it makes sense to me.

@betatim (Member) commented Jan 19, 2017

I don't think there is anything that can be done.

@ctb (Member) commented Feb 14, 2017

@betatim @luizirber take a look at https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter - this might be a better (approximate) way of computing n_unique_kmers.
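
For reference, that estimator as a sketch (m = filter size in bits, k = number of hash functions, X = number of bits set to one; the example numbers below are made up, not taken from this issue):

import math

def estimate_n(m, k, X):
    # Swamidass & Baldi estimate of the number of distinct items
    # inserted into a Bloom filter: n* = -(m / k) * ln(1 - X / m).
    return -(m / k) * math.log(1 - X / m)

# e.g. a 1e7-bit filter with 4 hash functions and 3.6e6 set bits:
print(round(estimate_n(1e7, 4, 3.6e6)))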
