Experimenting with queues and threads #1553
base: master
Conversation
Why do we get a different number of unique kmers with the two versions of the script?

With only one thread modifying the countgraph the discrepancy goes away.

Idea: one bloom filter per thread, then merge them at the end.
Query: our current data structures do not support cache locality and cross-CPU NUMA memory access. How big a difference would that make? Could be tested by changing graph sizes.
To be a little clearer: in the search for speed, would a better target be the data structures and support for cache locality? A concern with the one-bloom-filter-per-thread approach is that in memory-limited situations (or with data sets that require big mem) it won't actually help :). A potentially productive alternative would be to support easy chunking of files across different machines: if you have a 50 GB file and 5 machines, tell machine 1 to handle GB 0-10, machine 2 to handle GB 10-20, etc. Then you could do various kinds of processing in parallel and merge afterwards. This strategy could work well for load-into-counting, load-graph, normalize-by-median, and trim-low-abund - all of the major use cases, basically.
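A rough sketch of the byte-range chunking idea above; this is illustrative only (the function name is made up, the input is assumed to be an uncompressed file, and in practice each boundary would still have to be moved to the next record start before processing):

```python
import os

def byte_ranges(path, n_machines):
    """Split a file into n_machines contiguous (start, end) byte ranges."""
    size = os.path.getsize(path)
    step = size // n_machines
    ranges = []
    for i in range(n_machines):
        start = i * step
        # the last chunk absorbs any remainder
        end = size if i == n_machines - 1 else (i + 1) * step
        ranges.append((start, end))
    return ranges

# e.g. a 50 GB file across 5 machines -> five ~10 GB ranges
# for rank, (start, end) in enumerate(byte_ranges('reads.fa', 5)):
#     print(rank, start, end)
```

Gzipped input can't be split by raw byte offset like this, so the chunking would have to happen on uncompressed or block-compressed data.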
Probably a lot. You'd go and measure how many instructions per cycle are executed, or directly how many cache misses you have. For a (large) bloom filter I am not sure how much you can do, since you want the index into table N+1 to be essentially random with respect to the index into table N, and the index of the next k-mer to be random with respect to the index of the previous one (even though they differ by only one base). If that weren't the case, it might make sense to handle insertion into table N for several k-mers before moving on to table N+1. I think cuckoo filters promise to be better in this regard.
Yes, I agree with all of that. The speedup can be tested by varying the size of the tables for straight insertions and queries (without doing anything with the results).

And yes, there is no way to achieve locality with bloom filters. You could keep a cache of things to insert into the bloom filters, though, organized by location in the filter, and then flush from there periodically.
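A rough sketch of that buffer-and-flush idea, purely illustrative (one bit table, arbitrary region size and flush threshold); it only shows the pattern of grouping pending inserts by location and then touching the table region by region:

```python
from collections import defaultdict

class BufferedBitTable:
    """Toy bit table with inserts buffered by region and flushed in batches."""

    def __init__(self, size, region_bits=12, flush_every=10000):
        self.bits = bytearray(size // 8 + 1)
        self.region_bits = region_bits       # bucket key = position >> region_bits
        self.flush_every = flush_every
        self.pending = defaultdict(list)     # region -> list of bit positions
        self.n_pending = 0

    def add(self, position):
        self.pending[position >> self.region_bits].append(position)
        self.n_pending += 1
        if self.n_pending >= self.flush_every:
            self.flush()

    def flush(self):
        # visit regions in order, so memory is touched roughly sequentially
        for region in sorted(self.pending):
            for pos in self.pending[region]:
                self.bits[pos >> 3] |= 1 << (pos & 7)
        self.pending.clear()
        self.n_pending = 0
```

The caller hashes the k-mer to a position before calling add() and must call flush() once at the end; whether this actually wins anything over direct insertion would have to be measured.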
I think it is OK to practice how to split things into components connected by a queue using threads. We will need that if you want to go to multiple machines (maybe via multiple processes first?). Data structures that are inherently parallel would be super useful.

If we make one BF per thread to handle 1/Nth of the reads, would each BF have to be as large as a BF that can handle all reads? How does this work if you want to merge them later on?

Do you know the typical ratio of allocated vs used memory for these very large BFs? Wondering what trickery the kernel has available for allocating more memory than is available if you never use it. Not enough of a memory expert to know off the top of my head.

This branch as it is achieves 166% CPU and 55.235s elapsed (81.93s user) to run through ecoli with one thread for reading and one for filling the countgraph, compared to 99% CPU and 1:18.11 elapsed (75.15s user). So you have some overhead, but if you have two cores, less human time passes to do the same job. (🚧 Doing the same with two consumer threads gives ~270% CPU and 31s elapsed (80.96s user), but the answer is "wrong" because the two threads somehow step on each other's toes. 🚧)
The binary bloom filter (node table/graph) can be built in pieces and merged. This approach will not work for the counting bloom filter (count table/graph).
On Tue, Dec 13, 2016 at 10:09:57AM -0800, Tim Head wrote:

> I think it is OK to practice how to split things into components connected by a queue using threads. We will need that if you want to go to multiple machines (maybe via multiple processes first?). Data structures that are inherently parallel would be super useful.

Maybe?

> If we make one BF per thread to handle 1/Nth of the reads, would each BF have to be as large as a BF that can handle all reads? How does this work if you want to merge them later on?

Yes: because the data comes in random order, sampled from across all true k-mers, to a first approximation split-out BFs have to be able to handle all the data. This is also the same condition required for merging them, so I guess it all works out.

> Do you know the typical ratio of allocated vs used memory for these very large BFs? Wondering what trickery the kernel has available for allocating more memory than is available if you never use it. Not enough of a memory expert to know off the top of my head.

100%: whatever memory is allocated is used, because of the way Bloom filters and hashing work. In these situations Bloom filters have close to the best (smallest) possible storage footprint. But they are also quite slow and may be suboptimal for many other tasks that aren't memory challenged.

(One of the goals of #1215 (and your work on #1551) is to give us a variety of data structures with different guarantees so that we can pick and choose as needed.)
> This approach will not work for the counting bloom filter (count table/graph).

I think it will, no?
I guess it should work for this use case: you should be able to simply add the counts. The `update_from` had me thinking of the more general case where it doesn't really make sense: combining multiple samples into a single countgraph.
agree.
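A sketch of the per-thread-graph-plus-merge idea in terms of the methods mentioned in this thread (`consume`, `update_from`, `n_unique_kmers`); the constructor arguments, file splitting, and threading here are illustrative rather than this branch's actual code, and each per-thread graph is sized for the full data set as discussed above:

```python
import threading

import khmer
import screed  # khmer's companion sequence reader

K, TABLE_SIZE, N_TABLES = 21, int(1e8), 4
N_THREADS = 4
ALL_FILES = ['reads.1.fa', 'reads.2.fa', 'reads.3.fa', 'reads.4.fa']

def fill(graph, filenames):
    """Feed one thread's share of the input into its own graph."""
    for filename in filenames:
        for record in screed.open(filename):
            graph.consume(record.sequence)

# one full-size graph per thread, each fed a disjoint slice of the input
graphs = [khmer.Nodegraph(K, TABLE_SIZE, N_TABLES) for _ in range(N_THREADS)]
threads = [threading.Thread(target=fill, args=(g, ALL_FILES[i::N_THREADS]))
           for i, g in enumerate(graphs)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# merge everything into the first graph (bitwise OR for a node graph)
merged = graphs[0]
for other in graphs[1:]:
    merged.update_from(other)
print('Total number of unique k-mers:', merged.n_unique_kmers())
```

Whether the fill threads actually run in parallel depends on `consume` releasing the GIL, and for a countgraph the merge would have to add counts rather than OR bits, as discussed above.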
```python
loading_thread.join()

log_info('Total number of unique k-mers: {nk}',
         nk=countgraph.n_unique_kmers())
```
Any general thoughts on the yay or nay-ness of reworking things into this kind of pattern (read from L142 to here)?
Looks familiar to me :) -- see https://github.com/dib-lab/khmer/blob/master/khmer/thread_utils.py#L73, which @camillescott once convinced me didn't add anything to the speed of things.
So I like the pattern, if it can be made fast!
This fixes the discrepancy with the single-threaded abundance-dist-single.py script.
Producing values is faster than consuming them: one thread reading from a .gz can keep the queue full (10 items) for 3 consuming threads. With 4 consumers they sometimes stall, as measured by printing out the size of the queue every once in a while.
Related to #1551 (comment): does someone understand why this is not threadsafe? Does someone have time to take a look, or listen to me explain it to them?
To insert something into one of the tables we need to compute the index (hash the k-mer), fetch that chunk of memory, and twiddle some bits. Two things we could do (mainly to help me think about this):
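To make that access pattern concrete, here is a toy multi-table Bloom filter in Python (the table sizes and hash function are arbitrary, not khmer's); the point is that each insert hashes once and then touches one essentially random byte in every table:

```python
import hashlib

class MultiTableBloom:
    """Toy multi-table Bloom filter: one bit array per (prime) table size."""

    def __init__(self, table_sizes=(999983, 999979, 999961, 999959)):
        self.table_sizes = table_sizes
        self.tables = [bytearray(size // 8 + 1) for size in table_sizes]

    @staticmethod
    def _hash(kmer):
        # hash the k-mer once; one 64-bit value reused for every table
        digest = hashlib.blake2b(kmer.encode()).digest()
        return int.from_bytes(digest[:8], 'little')

    def add(self, kmer):
        h = self._hash(kmer)                      # compute the hash
        for size, table in zip(self.table_sizes, self.tables):
            pos = h % size                        # index into this table
            table[pos >> 3] |= 1 << (pos & 7)     # fetch that byte, twiddle a bit

    def get(self, kmer):
        h = self._hash(kmer)
        for size, table in zip(self.table_sizes, self.tables):
            pos = h % size
            if not (table[pos >> 3] >> (pos & 7)) & 1:
                return False
        return True
```

Consecutive k-mers hash to unrelated positions, so each add is a handful of scattered memory touches; that is where the cache misses come from.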
This is an experimental branch to both address the bugs mentioned in #1248 and see if we can improve the speed of filter-abund and friends.
Current ideas are based on:
- Use one thread to read the input and dump batches of reads into a queue.
- Several consumer threads get a batch from the queue, call into C land, convert the sequences in the batch to `char*`s, release the GIL, enter them into the hashtable, and re-acquire the GIL.

Focussing on comparing `scripts/abundance-dist-single.py` with one thread to `scripts/abundance-dist-single-threaded.py`, which is the experimental version.
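A minimal sketch of that reader/consumer pattern using the standard library `queue` and `threading`; the batch size, the sentinel handling, and the helper names are made up, and whether the consumers actually run in parallel depends on the C++ side releasing the GIL inside `consume`:

```python
import queue
import threading

import screed  # khmer's companion sequence reader

BATCH_SIZE = 1000
N_CONSUMERS = 3
SENTINEL = None

def reader(filename, q):
    """Single producer: read records and enqueue them in batches."""
    batch = []
    for record in screed.open(filename):
        batch.append(record.sequence)
        if len(batch) == BATCH_SIZE:
            q.put(batch)
            batch = []
    if batch:
        q.put(batch)
    for _ in range(N_CONSUMERS):
        q.put(SENTINEL)                  # one stop signal per consumer

def consumer(countgraph, q):
    """Consumer: pull a batch and push every sequence into the countgraph."""
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        for seq in batch:
            countgraph.consume(seq)

def run(countgraph, filename):
    q = queue.Queue(maxsize=10)          # ~10 pending batches, as in the experiment
    threads = [threading.Thread(target=reader, args=(filename, q))]
    threads += [threading.Thread(target=consumer, args=(countgraph, q))
                for _ in range(N_CONSUMERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

This also shows where the thread-safety question above comes in: several consumers call `countgraph.consume` on the same graph concurrently, so the counts are only correct if the underlying increments are safe to run in parallel.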
- `make test` Did it pass the tests?
- `make clean diff-cover` If it introduces new functionality in `scripts/`, is it tested?
- `make format diff_pylint_report cppcheck doc pydocstyle` Is it well formatted?
- Did it change the command-line interface? Only backwards-compatible additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- Is it documented in `CHANGELOG.md`? See keepachangelog for more details.
- Was a spellchecker run on the source code and documentation after changes were made?
- Do the changes respect streaming IO? (Are they tested for streaming IO?)