
test_load_graph_multithread() in test_scripts.py finds different unique k-mers #1248

Open
johnsolk opened this issue Aug 13, 2015 · 14 comments

@johnsolk (Member)

Working on #998, using this command:

./setup.py nosetests --tests tests/test_scripts.py:test_load_graph_multithread

default args:

args = ['-N', '4', '-x', '1e7', '-T', '8', outfile, infile]

Running it ~20 times displayed 4 different unique k-mer counts in stdout:

Total number of unique k-mers: 1124051
Total number of unique k-mers: 1124050
Total number of unique k-mers: 1124049
Total number of unique k-mers: 1124052

When the args are changed to -T 1, only one count appears (ran ~20x):

Total number of unique k-mers: 1124049

OS info:

ProductName: Mac OS X
ProductVersion: 10.9.5
BuildVersion: 13F34

@mr-c (Contributor) commented Aug 13, 2015

Right, this sort of thing has been brewing for a while. It could be due to the probabilistic nature of the Count-Min Sketch or it could be a threading bug. @luizirber @camillescott may have insights to share

mr-c added this to the 2.0+ milestone Aug 14, 2015
@ctb (Member) commented Aug 14, 2015

On Thu, Aug 13, 2015 at 02:08:58PM -0700, Michael R. Crusoe wrote:

> Right, this sort of thing has been brewing for a while. It could be due to the probabilistic nature of the Count-Min Sketch or it could be a threading bug. @luizirber @camillescott may have insights to share

Having dug around in that code recently (and now simplified it), it shouldn't be happening if the code was working the way I understood it to work. So there's almost certainly a bug... Thanks, @ljcohen!

Let's take a look through the issue tracker to find you a new bug to chase while we figure this out :)

@mr-c (Contributor) commented Sep 4, 2015

@ctb Is this "known-issues" worthy?

@ctb (Member) commented Sep 4, 2015

On Fri, Sep 04, 2015 at 09:49:16AM -0700, Michael R. Crusoe wrote:

> @ctb Is this "known-issues" worthy?

yep.

@camillescott (Member)

This should be resolved by #876 (and serve as extra incentive for me to get that merged)

@betatim (Member) commented Nov 2, 2016

Another occurrence of a Heisenbug, this time in test_abundance_dist_threaded.

betatim mentioned this issue Nov 18, 2016
@betatim (Member) commented Dec 9, 2016

I'm looking into both of these.

For reference:

$ for n in {1..20}; do scripts/abundance-dist-single.py -s -x 1e7 -N 2 -k 17 -z --threads 2 tests/test-data/test-reads.fa /tmp/test.dist 2>&1 | grep Total; done
$ for n in {1..20}; do scripts/load-graph.py -N 4 -x 1e7 -T 8 /tmp/test tests/test-data/test-reads.fa 2>&1 | grep "Total"; done;

Both produce varying numbers of unique k-mers when executed with more than one thread.

@betatim (Member) commented Dec 12, 2016

The easy fix is to add a lock at a fairly high level, which wipes out all the benefits of having more threads.

The long-term solution is to have a queue between the reader and the hashtable. Then you can have multiple threads reading the input file (in a chunked fashion) and separately control the number of threads that consume the reads from the queue; see the sketch below. At the very least we need a Parser instance per thread that does not share anything with the others. Right now the filling of a read is protected by a home-made lock, but there doesn't seem to be anything to stop other threads from calling read->reset() at any moment, or replacing the contents of a read while one thread is still using it to fill a hashtable.

(When reading a gzip'ed file you probably want several producers as that takes quite a bit of CPU. For a plain text file a single thread probably is enough and you'd want more threads consuming the reads.)
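
A minimal sketch of that reader -> queue -> consumers layout, in pure Python. parse_chunks and CountingTable are made-up stand-ins for khmer's parser and hashtable, not the real API; the point is the shape: one reader thread filling a bounded queue, with each consumer owning its chunk outright so no read is shared mid-use.

import threading
import queue

# Hypothetical stand-ins for khmer's parser and counting table.
def parse_chunks(reads, chunk_size=2):
    """Yield reads in chunks so the queue carries batches, not single reads."""
    for i in range(0, len(reads), chunk_size):
        yield reads[i:i + chunk_size]

class CountingTable:
    """Toy shared table; a lock guards each update."""
    def __init__(self):
        self.counts = {}
        self.lock = threading.Lock()

    def consume(self, seq, k=3):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            with self.lock:
                self.counts[kmer] = self.counts.get(kmer, 0) + 1

def producer(reads, q, n_consumers):
    for chunk in parse_chunks(reads):
        q.put(chunk)
    for _ in range(n_consumers):  # one sentinel per consumer to shut it down
        q.put(None)

def consumer(q, table):
    while True:
        chunk = q.get()
        if chunk is None:
            break
        for seq in chunk:  # this thread owns the chunk; nothing can reset it
            table.consume(seq)

reads = ["ACGTACGT", "TTGACCAA", "ACGTTTGA", "CCAAACGT"]
q = queue.Queue(maxsize=8)  # bounded, so the reader can't race far ahead
table = CountingTable()
workers = [threading.Thread(target=consumer, args=(q, table)) for _ in range(2)]
for w in workers:
    w.start()
producer(reads, q, n_consumers=2)
for w in workers:
    w.join()
print(sorted(table.counts.items()))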

@betatim (Member) commented Dec 12, 2016

The gzip format makes it nearly impossible to decompress in parallel. Tools like pigz speed up compression but can't do much for decompression (~6s for gunzip or pigz on ecoli).

So: start with a single-threaded reader feeding a queue, and consume the reads in parallel.

@betatim (Member) commented Jan 6, 2017

After some digging into this and failing to reproduce it, I noticed that the bug appears if you reduce the size of the BF. From that I gathered that there isn't a bug anywhere; rather, whether or not a k-mer is "new" depends on which k-mers were entered into the BF before it. With only one thread the order is always the same, but with multiple threads the order changes between re-runs. Example:

A simple BF with 9 'slots' and 3 hash functions, and three k-mers with these bit patterns:

A: 100010100
B: 000010101
C: 111000000

Entered in different orders:

ABC -> 3 unique k-mers
CAB -> 3 unique k-mers
BCA -> 2 unique k-mers

So even with no bug you can get different answers for how many unique k-mers there are. The reason you don't see it for (very) large BFs is probably that the BF is super sparse.
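
To make the order-dependence concrete, here is a toy reproduction using the three bit patterns above. This is not khmer's code; the Bloom filter is reduced to explicit sets of bit positions.

from itertools import permutations

# Each k-mer is represented by the set of bit positions its three
# hash functions would set (the patterns from the comment above).
BITS = {
    "A": {0, 4, 6},  # 100010100
    "B": {4, 6, 8},  # 000010101
    "C": {0, 1, 2},  # 111000000
}

def count_unique(order):
    """Count k-mers whose bits were not already all set on insertion."""
    filter_bits = set()
    unique = 0
    for kmer in order:
        if not BITS[kmer] <= filter_bits:  # at least one bit is still unset
            unique += 1
        filter_bits |= BITS[kmer]
    return unique

for order in permutations("ABC"):
    print("".join(order), "->", count_unique(order))
# ABC -> 3, CAB -> 3, BCA -> 2: same k-mers, different counts.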

@betatim (Member) commented Jan 6, 2017

Short gist with some code to reproduce this: https://gist.github.com/betatim/35112d0d6fd0ad543748ae3355cd0cbc

@ctb (Member) commented Jan 19, 2017

This makes sense to me. Not sure what to do about it, but it makes sense to me.

@betatim (Member) commented Jan 19, 2017

I don't think there is anything that can be done.

@ctb (Member) commented Feb 14, 2017

@betatim @luizirber take a look at https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter - this might be a better (approximate) way of computing n_unique_kmers.
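
For reference, that estimator as a sketch (m = filter size in bits, k = number of hash functions, X = number of bits set to one; the example numbers below are made up, not taken from this issue):

import math

def estimate_n(m, k, X):
    # Swamidass & Baldi estimate of the number of distinct items
    # inserted into a Bloom filter: n* = -(m / k) * ln(1 - X / m).
    return -(m / k) * math.log(1 - X / m)

# e.g. a 1e7-bit filter with 4 hash functions and 3.6e6 set bits:
print(round(estimate_n(1e7, 4, 3.6e6)))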
