Experimenting with queues and threads #1553
base: master
Conversation
Why do we get a different number of unique kmers with the two versions of the script?

With only one thread modifying the countgraph the discrepancy goes away.

Idea: one bloom filter per thread, then merge them at the end.
Query: our current data structures do not support cache locality and cross-CPU NUMA memory access. How big a difference would that make? Could be tested by changing graph sizes.
To be a little clearer: in the search for speed, would a better target be the data structures and support for cache locality? A concern with the one-bloom-filter-per-thread approach is that in memory-limited situations (or with data sets that require big mem) it won't actually help :). A potentially productive alternative would be to support easy chunking of files across different machines: if you have a 50 GB file and 5 machines, tell machine 1 to handle GB 0-10, machine 2 to handle GB 10-20, etc. Then you could do various kinds of processing in parallel and merge afterwards. This strategy could work well for load-into-counting, load-graph, normalize-by-median, and trim-low-abund - all of the major use cases, basically.
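A rough sketch of the byte-range chunking idea above; this is illustrative only (the function name is made up, the input is assumed to be an uncompressed file, and in practice each boundary would still have to be moved to the next record start before processing):

```python
import os

def byte_ranges(path, n_machines):
    """Split a file into n_machines contiguous (start, end) byte ranges."""
    size = os.path.getsize(path)
    step = size // n_machines
    ranges = []
    for i in range(n_machines):
        start = i * step
        # the last chunk absorbs any remainder
        end = size if i == n_machines - 1 else (i + 1) * step
        ranges.append((start, end))
    return ranges

# e.g. a 50 GB file across 5 machines -> five ~10 GB ranges
# for rank, (start, end) in enumerate(byte_ranges('reads.fa', 5)):
#     print(rank, start, end)
```

Gzipped input can't be split by raw byte offset like this, so the chunking would have to happen on uncompressed or block-compressed data.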
Probably a lot. You'd go and measure how many instructions per cycle are executed, or directly how many cache misses you have. For a (large) bloom filter I am not sure how much you can do, since you want the index into table N+1 to be essentially random with respect to the index into table N, and the index of the next k-mer to be random with respect to the index of the previous one (even though they differ by only one base). If that weren't the case, it might make sense to handle insertion into table N for several k-mers before moving on to table N+1. I think cuckoo filters promise to be better in this regard.
Yes, I agree with all of that. The speedup can be tested by varying the size of the tables for straight insertions and queries (without doing anything with the results).

And yes, there is no way to achieve locality with bloom filters. You could keep a cache of things to insert into the bloom filters, though, organized by location in the filter, and then flush from there periodically.
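A rough sketch of that buffer-and-flush idea, purely illustrative (one bit table, arbitrary region size and flush threshold); it only shows the pattern of grouping pending inserts by location and then touching the table region by region:

```python
from collections import defaultdict

class BufferedBitTable:
    """Toy bit table with inserts buffered by region and flushed in batches."""

    def __init__(self, size, region_bits=12, flush_every=10000):
        self.bits = bytearray(size // 8 + 1)
        self.region_bits = region_bits       # bucket key = position >> region_bits
        self.flush_every = flush_every
        self.pending = defaultdict(list)     # region -> list of bit positions
        self.n_pending = 0

    def add(self, position):
        self.pending[position >> self.region_bits].append(position)
        self.n_pending += 1
        if self.n_pending >= self.flush_every:
            self.flush()

    def flush(self):
        # visit regions in order, so memory is touched roughly sequentially
        for region in sorted(self.pending):
            for pos in self.pending[region]:
                self.bits[pos >> 3] |= 1 << (pos & 7)
        self.pending.clear()
        self.n_pending = 0
```

The caller hashes the k-mer to a position before calling add() and must call flush() once at the end; whether this actually wins anything over direct insertion would have to be measured.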
I think it is OK to practice how to split things into components connected by a queue using threads. We will need that if you want to go to multiple machines (maybe via multiple processes first?). Data structures that are inherently parallel would be super useful.

If we make one BF per thread to handle 1/Nth of the reads, would each BF have to be as large as a BF that can handle all reads? How does this work if you want to merge them later on?

Do you know the typical ratio of allocated vs used memory for these very large BFs? Wondering what trickery the kernel has available for allocating more memory than is available if you never use it. Not enough of a memory expert to know off the top of my head.

This branch as it is achieves 166% CPU and 55.235s elapsed (81.93s user) to run through ecoli with one thread for reading and one for filling the countgraph, compared to 99% CPU and 1:18.11 elapsed (75.15s user). So you have some overhead, but if you have two cores, less human time passes to do the same job. (🚧 Doing the same with two consumer threads gives ~270% CPU and 31s elapsed (80.96s user), but the answer is "wrong" because the two threads somehow step on each other's toes. 🚧)
The binary bloom filter (node table/graph) can be built in pieces and merged. This approach will not work for the counting bloom filter (count table/graph).
On Tue, Dec 13, 2016 at 10:09:57AM -0800, Tim Head wrote:

> I think it is OK to practice how to split things into components connected by a queue using threads. We will need that if you want to go to multiple machines (maybe via multiple processes first?). Data structures that are inherently parallel would be super useful.

Maybe?

> If we make one BF per thread to handle 1/Nth of the reads, would each BF have to be as large as a BF that can handle all reads? How does this work if you want to merge them later on?

Yes: because the data comes in random order, sampled from across all true k-mers, to a first approximation split-out BFs have to be able to handle all the data. This is also the same condition required for merging them, so I guess it all works out.

> Do you know the typical ratio of allocated vs used memory for these very large BFs? Wondering what trickery the kernel has available for allocating more memory than is available if you never use it. Not enough of a memory expert to know off the top of my head.

100%: whatever memory is allocated is used, because of the way Bloom filters and hashing work. In these situations Bloom filters have close to the best (smallest) possible storage footprint. But they are also quite slow and may be suboptimal for many other tasks that aren't memory challenged.

(One of the goals of #1215 (and your work on #1551) is to give us a variety of data structures with different guarantees so that we can pick and choose as needed.)
> This approach will not work for the counting bloom filter (count table/graph).

I think it will, no?
I guess it should work for this use case: you should be able to simply add the counts. The `update_from` had me thinking of the more general case where it doesn't really make sense: combining multiple samples into a single countgraph.
agree.
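A sketch of the per-thread-graph-plus-merge idea in terms of the methods mentioned in this thread (`consume`, `update_from`, `n_unique_kmers`); the constructor arguments, file splitting, and threading here are illustrative rather than this branch's actual code, and each per-thread graph is sized for the full data set as discussed above:

```python
import threading

import khmer
import screed  # khmer's companion sequence reader

K, TABLE_SIZE, N_TABLES = 21, int(1e8), 4
N_THREADS = 4
ALL_FILES = ['reads.1.fa', 'reads.2.fa', 'reads.3.fa', 'reads.4.fa']

def fill(graph, filenames):
    """Feed one thread's share of the input into its own graph."""
    for filename in filenames:
        for record in screed.open(filename):
            graph.consume(record.sequence)

# one full-size graph per thread, each fed a disjoint slice of the input
graphs = [khmer.Nodegraph(K, TABLE_SIZE, N_TABLES) for _ in range(N_THREADS)]
threads = [threading.Thread(target=fill, args=(g, ALL_FILES[i::N_THREADS]))
           for i, g in enumerate(graphs)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# merge everything into the first graph (bitwise OR for a node graph)
merged = graphs[0]
for other in graphs[1:]:
    merged.update_from(other)
print('Total number of unique k-mers:', merged.n_unique_kmers())
```

Whether the fill threads actually run in parallel depends on `consume` releasing the GIL, and for a countgraph the merge would have to add counts rather than OR bits, as discussed above.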
```python
loading_thread.join()

log_info('Total number of unique k-mers: {nk}',
         nk=countgraph.n_unique_kmers())
```
Any general thoughts on the yay or nay-ness of reworking things into this kind of pattern (read from L142 to here)?
Looks familiar to me :) -- see https://github.com/dib-lab/khmer/blob/master/khmer/thread_utils.py#L73, which @camillescott once convinced me didn't add anything to the speed of things.
So I like the pattern, if it can be made fast!
This fixes the discrepancy with the single-threaded abundance-dist-single.py script.
Producing values is faster than consuming them: one thread reading from a .gz can keep the queue full (10 items) for 3 consuming threads. With 4 consumers they sometimes stall, as measured by printing out the size of the queue every once in a while.
Related to #1551 (comment): does someone understand why this is not threadsafe? Does someone have time to take a look, or listen to me explain it to them?
To insert something into one of the tables we need to compute the index (hash the k-mer), fetch that chunk of memory, and twiddle some bits. Two things we could do (mainly to help me think about this):
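To make that access pattern concrete, here is a toy multi-table Bloom filter in Python (the table sizes and hash function are arbitrary, not khmer's); the point is that each insert hashes once and then touches one essentially random byte in every table:

```python
import hashlib

class MultiTableBloom:
    """Toy multi-table Bloom filter: one bit array per (prime) table size."""

    def __init__(self, table_sizes=(999983, 999979, 999961, 999959)):
        self.table_sizes = table_sizes
        self.tables = [bytearray(size // 8 + 1) for size in table_sizes]

    @staticmethod
    def _hash(kmer):
        # hash the k-mer once; one 64-bit value reused for every table
        digest = hashlib.blake2b(kmer.encode()).digest()
        return int.from_bytes(digest[:8], 'little')

    def add(self, kmer):
        h = self._hash(kmer)                      # compute the hash
        for size, table in zip(self.table_sizes, self.tables):
            pos = h % size                        # index into this table
            table[pos >> 3] |= 1 << (pos & 7)     # fetch that byte, twiddle a bit

    def get(self, kmer):
        h = self._hash(kmer)
        for size, table in zip(self.table_sizes, self.tables):
            pos = h % size
            if not (table[pos >> 3] >> (pos & 7)) & 1:
                return False
        return True
```

Consecutive k-mers hash to unrelated positions, so each add is a handful of scattered memory touches; that is where the cache misses come from.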
This is an experimental branch to both address the bugs mentioned in #1248 and see if we can improve the speed of filter-abund and friends.
Current ideas are based on:
- Use one thread to read the input and dump batches of reads into a queue.
- Several consumer threads get a batch from the queue, call into C land, convert the sequences in the batch to `char*`s, release the GIL, enter them into the hashtable, and re-acquire the GIL.

Focussing on comparing `scripts/abundance-dist-single.py` with one thread to `scripts/abundance-dist-single-threaded.py`, which is the experimental version.
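A minimal sketch of that reader/consumer pattern using the standard library `queue` and `threading`; the batch size, the sentinel handling, and the helper names are made up, and whether the consumers actually run in parallel depends on the C++ side releasing the GIL inside `consume`:

```python
import queue
import threading

import screed  # khmer's companion sequence reader

BATCH_SIZE = 1000
N_CONSUMERS = 3
SENTINEL = None

def reader(filename, q):
    """Single producer: read records and enqueue them in batches."""
    batch = []
    for record in screed.open(filename):
        batch.append(record.sequence)
        if len(batch) == BATCH_SIZE:
            q.put(batch)
            batch = []
    if batch:
        q.put(batch)
    for _ in range(N_CONSUMERS):
        q.put(SENTINEL)                  # one stop signal per consumer

def consumer(countgraph, q):
    """Consumer: pull a batch and push every sequence into the countgraph."""
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        for seq in batch:
            countgraph.consume(seq)

def run(countgraph, filename):
    q = queue.Queue(maxsize=10)          # ~10 pending batches, as in the experiment
    threads = [threading.Thread(target=reader, args=(filename, q))]
    threads += [threading.Thread(target=consumer, args=(countgraph, q))
                for _ in range(N_CONSUMERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

This also shows where the thread-safety question above comes in: several consumers call `countgraph.consume` on the same graph concurrently, so the counts are only correct if the underlying increments are safe to run in parallel.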
- `make test` Did it pass the tests?
- `make clean diff-cover` If it introduces new functionality in `scripts/`, is it tested?
- `make format diff_pylint_report cppcheck doc pydocstyle` Is it well formatted?
- Did it change the command-line interface? Only backwards-compatible additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- Is it documented in `CHANGELOG.md`? See keepachangelog for more details.
- Was a spellchecker run on the source code and documentation after changes were made?
- Do the changes respect streaming IO? (Are they tested for streaming IO?)