Support for k>32 #1426
Comments
Yep, seems about right. The reversibility is the biggest problem; for probably 95% of the code that we actually use at the moment, we could go straight for a cyclic hash of some sort and it wouldn't matter, because all we're doing is indexing into hash tables. But there is a lot of code that relies on reversibility, and we have to figure out what to do with it, even if it's not necessarily heavily used code right now. Unfortunately I am also probably the primary expert on this code (the partitioning code)...

Note that my primary concern with supporting multiple hash functions is performance; virtual functions and the like can impose significant performance hits. I suspect there are some relatively small refactorings that could make this all much easier, but I don't know enough C++ to be confident in the right approach here.
Regarding the first point, I think I grok the idea of hashing in pieces (similar to the ...). Regarding the second point, I did a bit of "research" earlier today and also found templates (and "functors") to be the most performant alternative to the function-pointer vs. virtual-function question. I have zero experience with C++ templating, though, so I hope @camillescott and @luizirber don't mind if I bug them with questions on the topic!
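For what it's worth, a minimal sketch of the template/functor approach is below, so the hash call can be inlined instead of going through a vtable. All names here (`TwoBitHash`, `bucket_of`) are hypothetical, not khmer's actual API.

```cpp
// Sketch only: a hash "functor" passed as a template parameter, so the call
// can be inlined -- no virtual dispatch. TwoBitHash and bucket_of are
// hypothetical names, not khmer's actual API.
#include <cstdint>
#include <iostream>
#include <string>

// Stand-in exact 2-bit hash (only meaningful for k <= 32).
struct TwoBitHash {
    uint64_t operator()(const std::string& kmer) const {
        uint64_t h = 0;
        for (char c : kmer) {
            h <<= 2;
            switch (c) {
                case 'C': h |= 1; break;
                case 'G': h |= 2; break;
                case 'T': h |= 3; break;
                default: break;           // 'A' (and anything else) -> 0
            }
        }
        return h;
    }
};

// Calling code is parameterized on the functor type; the compiler
// instantiates a specialized copy, so there is no per-call dispatch cost.
template <typename HashFunctor>
uint64_t bucket_of(const std::string& kmer, uint64_t table_size,
                   HashFunctor hash = HashFunctor()) {
    return hash(kmer) % table_size;
}

int main() {
    std::cout << bucket_of<TwoBitHash>("ACGT", 1000003) << "\n";
}
```

Swapping hash functions then means instantiating the template with a different functor type, rather than paying for a function pointer or virtual call on every k-mer.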
The templating will also come in handy with respect to providing BAM support. SeqAn has a BAM parser in addition to the Fast[a|q] parser already in use, and I'm sure ...
Cyclic hash == https://en.wikipedia.org/wiki/Rolling_hash ?

Is there a setup to benchmark things (in terms of cache misses, etc.) as the hash functions are swapped around? Because: measuring gives you a leg up on experts who are too good to measure.

@ctb, where would you start looking for the "multiple small refactorings"?
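In case it's useful as a starting point, a bare-bones timing harness could look like the sketch below. The names (`hash_all`, the stand-in hash) are made up for illustration, and std::chrono only gives wall-clock time; cache-miss counts would need perf, PAPI, or cachegrind.

```cpp
// Bare-bones timing sketch; hash_all and the stand-in hash are made up for
// illustration. std::chrono gives wall-clock time only; cache-miss counts
// would need perf, PAPI, or cachegrind.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <vector>

template <typename HashFunctor>
double hash_all(const std::vector<std::string>& kmers, HashFunctor hash) {
    auto start = std::chrono::steady_clock::now();
    uint64_t sink = 0;                      // accumulate so the loop isn't optimized away
    for (const auto& k : kmers) sink ^= hash(k);
    auto stop = std::chrono::steady_clock::now();
    volatile uint64_t keep = sink; (void)keep;
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    // Random 21-mers as a toy workload.
    std::mt19937 rng(42);
    const char bases[] = "ACGT";
    std::vector<std::string> kmers(100000, std::string(21, 'A'));
    for (auto& k : kmers)
        for (auto& c : k) c = bases[rng() % 4];

    auto stand_in = [](const std::string& k) {   // swap in the real candidates here
        uint64_t h = 0;
        for (char c : k) h = h * 31 + static_cast<uint8_t>(c);
        return h;
    };
    std::cout << "stand-in hash: " << hash_all(kmers, stand_in) << " s\n";
}
```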
No benchmarking setup, and yes, I think adding some benchmarking is the place to start. This paper of ours (https://github.com/dib-lab/2013-khmer-counting) might be useful here.

Sadly it probably involves some diagramming and class-structure work. It's a bit of a tangled mess ATM, is what I'm saying.
As a trial, I removed most of the code that depends on reversibility in #1431, and then eliminated the sliding-window optimization and _revhash code in #1432. Almost all tests pass at this point! Then #1433 swapped in murmurhash as the default hash function, which of course breaks a bunch of stuff. Most of it looks like stuff that should be broken (and hence should be removed or refactored when using different hash functions), but there are a few things that bear closer examination.

At this point, we should be able to re-enable the sliding-window optimizations in KmerIterator with any rolling hash function. For example, this one https://github.com/lemire/rollinghashcpp/blob/master/example.cpp should work.
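For reference, the sliding-window trick with a cyclic-polynomial (Buzhash-style) rolling hash looks roughly like the sketch below. The per-base codes and helper names are illustrative only; this is not the rollinghashcpp API.

```cpp
// Illustrative cyclic-polynomial rolling hash (not the rollinghashcpp API):
// H(s[0..k-1]) = rol^{k-1}(h(s[0])) ^ rol^{k-2}(h(s[1])) ^ ... ^ h(s[k-1])
// Sliding the window by one base is a rotate plus two XORs -- O(1), not O(k).
#include <cstdint>
#include <iostream>
#include <string>

static uint64_t rol(uint64_t x, unsigned r) {
    r %= 64;
    return r ? (x << r) | (x >> (64 - r)) : x;
}

// Arbitrary fixed 64-bit codes per base (a real implementation would use random tables).
static uint64_t code(char c) {
    switch (c) { case 'A': return 0x9ae16a3b2f90404fULL;
                 case 'C': return 0xc949d7c7509e6557ULL;
                 case 'G': return 0xff51afd7ed558ccdULL;
                 default:  return 0xc4ceb9fe1a85ec53ULL; }  // T (and anything else)
}

int main() {
    const std::string seq = "ACGTACGGTTCA";
    const unsigned k = 5;

    // Hash the first window from scratch.
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i) h = rol(h, 1) ^ code(seq[i]);
    std::cout << "pos 0: " << h << "\n";

    // Slide: rotate, XOR out the base that left the window, XOR in the new one.
    for (unsigned i = k; i < seq.size(); ++i) {
        h = rol(h, 1) ^ rol(code(seq[i - k]), k) ^ code(seq[i]);
        std::cout << "pos " << (i - k + 1) << ": " << h << "\n";
    }
}
```

The update cost per position is independent of k, which is what would make re-enabling the KmerIterator optimization attractive for large k.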
Note: one of the problems was that our DNA-sanitizing code was broken in places. Sigh. See #1434.
Anyway, a question for the C++-philes amongst us: how hard would it be to refactor the C++ code base so that the code removed in #1431 is isolated in a ghetto with its own hash function?
OK, pretty much done playing around with hash code refactoring for now :). Here's my summary. Ignoring the partitioning code (test-removed in #1431), there are three kinds of things we need to do:
The first and second points can be dealt with easily, and are more or less done in #1437. For the third point, it seems to me like @camillescott's comment is a good way to go: implement a fast new data type that lets you store large k-mers exactly in 2-bit form, and go back and forth to strings when needed. The important operations here would be prepending and appending individual bases (to support ...).

With only a little extra effort, this new data type might be able to resolve the issues in the partitioning code as well, although I am OK with ignoring that for the moment.

If @camillescott can take a look at all of this in the next few days and check me, maybe we can schedule a chat about it online with @betatim and @standage so that we're all on the same page, and then figure out where to go from there?
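To make the third point concrete, here is a minimal sketch of what such an exact 2-bit large-k-mer type could look like. The `LongKmer` name and its methods are hypothetical, not existing khmer code; it is fixed at two 64-bit words (k up to 64) to keep the sketch short, and only shows append — prepend would be the mirror-image shift in from the left.

```cpp
// Hypothetical sketch (not khmer code): an exact 2-bit encoding for k <= 64,
// stored in two 64-bit words, with single-base append and round-tripping to strings.
#include <array>
#include <cstdint>
#include <iostream>
#include <string>

struct LongKmer {
    std::array<uint64_t, 2> w{{0, 0}};   // w[0] holds the most significant bases
    unsigned k;

    explicit LongKmer(unsigned k_) : k(k_) {}

    static uint64_t code(char c) {       // A=0 C=1 G=2 T=3
        switch (c) { case 'C': return 1; case 'G': return 2; case 'T': return 3; default: return 0; }
    }
    static char base(uint64_t b) { return "ACGT"[b & 3]; }

    // Shift everything left by two bits and push the new base on the right.
    void append(char c) {
        w[0] = (w[0] << 2) | (w[1] >> 62);
        w[1] = (w[1] << 2) | code(c);
        mask();
    }

    std::string str() const {
        std::string s(k, 'A');
        for (unsigned i = 0; i < k; ++i) {
            unsigned bit = 2 * (k - 1 - i);          // bit offset of base i from the right
            uint64_t word = bit < 64 ? w[1] : w[0];
            s[i] = base(word >> (bit % 64));
        }
        return s;
    }

private:
    void mask() {                        // clear any bits above position 2*k
        if (k < 32)      { w[0] = 0; w[1] &= (1ULL << (2 * k)) - 1; }
        else if (k < 64) { w[0] &= (1ULL << (2 * (k - 32))) - 1; }
    }
};

int main() {
    LongKmer km(40);                     // k = 40 > 32 needs more than one 64-bit word
    for (char c : std::string("ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT")) km.append(c);
    std::cout << km.str() << "\n";       // round-trips back to the input string
}
```

A real version would presumably size the word array from k at construction time rather than hard-coding two words.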
For the binary encoding of DNA, we might use 3 bits per DNA base instead of just two, or use a more complicated encoding overall; it would be nice to potentially allow for Ns in the sequence.
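For illustration only, one possible 3-bit code that reserves a value for N could look like this; nothing here reflects an actual decision.

```cpp
// Hypothetical 3-bit base code: room for A/C/G/T plus N (and spare values),
// at the cost of 3 bits per base instead of 2 (so only 21 bases per 64-bit word).
#include <cstdint>
#include <iostream>

enum class Base3 : uint8_t { A = 0, C = 1, G = 2, T = 3, N = 4 };

static Base3 encode3(char c) {
    switch (c) { case 'A': return Base3::A; case 'C': return Base3::C;
                 case 'G': return Base3::G; case 'T': return Base3::T;
                 default:  return Base3::N; }   // anything ambiguous maps to N
}

int main() {
    // Pack "ACGNT" into one 64-bit word, 3 bits per base.
    uint64_t packed = 0;
    for (char c : {'A', 'C', 'G', 'N', 'T'})
        packed = (packed << 3) | static_cast<uint64_t>(encode3(c));
    std::cout << std::hex << packed << "\n";    // 5 * 3 = 15 bits used
}
```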
Talking with @camillescott, one question is whether or not we should support lossy hashing at all, or just go with the bitstring encoding and reversible hashing needed for ...

I'm not clear on how to do this, but if we can build a simple functor interface such that we can specify when we need reversibility, we can probably fit it all in without too much complexity.
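One hedged way that interface could look (all names hypothetical): each functor carries a compile-time flag saying whether it is reversible, and code paths that need to invert the hash require that flag with a static_assert.

```cpp
// Hypothetical sketch of a hash-functor interface with a compile-time
// "is_reversible" tag; none of these names exist in khmer today.
#include <cstdint>
#include <iostream>
#include <string>

struct TwoBitHash {                      // exact/reversible 2-bit encoding (k <= 32 here)
    static constexpr bool is_reversible = true;
    uint64_t operator()(const std::string& kmer) const {
        uint64_t h = 0;
        for (char c : kmer)
            h = (h << 2) | (c == 'C' ? 1 : c == 'G' ? 2 : c == 'T' ? 3 : 0);
        return h;
    }
    std::string invert(uint64_t h, unsigned k) const {
        std::string s(k, 'A');
        for (unsigned i = 0; i < k; ++i) { s[k - 1 - i] = "ACGT"[h & 3]; h >>= 2; }
        return s;
    }
};

struct LossyHash {                       // murmur-style lossy hash; cannot be inverted
    static constexpr bool is_reversible = false;
    uint64_t operator()(const std::string& kmer) const {
        uint64_t h = 0xcbf29ce484222325ULL;             // FNV-1a as a simple stand-in
        for (char c : kmer) { h ^= static_cast<uint8_t>(c); h *= 0x100000001b3ULL; }
        return h;
    }
};

// Partitioning-style code can demand reversibility at compile time.
template <typename HashFunctor>
std::string roundtrip(const std::string& kmer, HashFunctor hash) {
    static_assert(HashFunctor::is_reversible, "this code path needs a reversible hash");
    return hash.invert(hash(kmer), kmer.size());
}

int main() {
    std::cout << roundtrip("ACGTACGT", TwoBitHash{}) << "\n";   // OK
    // roundtrip("ACGTACGT", LossyHash{});                      // would fail to compile
}
```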
Also: for the partitioning code, we need to support STL set membership, so I think all we need to implement is a perfect multi-long-long bitstring encoding that supports == and <.
Hmm, actually, we need to support the following uses of HashIntoType:
so we need < (weak ordering), copying, and assignment, I believe.
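As a quick sanity check of what std::set actually requires, a two-word key with a strict weak ordering is enough; the `WideHash` name below is made up for illustration.

```cpp
// Sketch: a two-word "bitstring" key with the strict weak ordering, copying,
// and assignment that std::set (and std::map) require. Names are illustrative.
#include <array>
#include <cstdint>
#include <iostream>
#include <set>

struct WideHash {
    std::array<uint64_t, 2> w;           // high word first, so lexicographic compare works

    bool operator==(const WideHash& o) const { return w == o.w; }
    bool operator<(const WideHash& o) const { return w < o.w; }  // strict weak ordering
};

int main() {
    std::set<WideHash> seen;
    seen.insert({{0x1, 0xdeadbeef}});
    seen.insert({{0x1, 0xdeadbeef}});    // duplicate, ignored
    seen.insert({{0x2, 0x0}});
    std::cout << seen.size() << "\n";    // prints 2
    std::cout << (seen.count({{0x1, 0xdeadbeef}}) ? "member" : "absent") << "\n";
}
```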
I am also unclear on what, exactly, needs to be done on the Python side to support something like the reversible Kmer class. Right now we take full advantage of the mapping between C++ HashIntoType and Python's native unsigned long long (ParseTuple type "K"). Perhaps another advantage of the lossy hashing approach is that we won't need to make the Python API more complicated for it.
Supporting k values > 32 has been on the table for a while (since #27 and #60 in 2013). Some progress was made on this with #624, but this PR has stalled.
I'd like to suggest that we prioritize this feature.
From my reading of the relevant threads, there seem to be a couple of possible approaches and concerns.
Am I reading things correctly? Anything I've missed?