Support for k>32 #1426
Comments
Yep, seems about right. The reversibility is the biggest problem; for probably 95% of the code that we actually use at the moment, we could go straight for a cyclic hash of some sort and it wouldn't matter, because all we're doing is indexing into hash tables. But there is a lot of code that relies on reversibility, and we have to figure out what to do with it, even if it's not necessarily heavily used code right now. Unfortunately I am also probably the primary expert on this code (the partitioning code)...

Note that my primary concern with supporting multiple hash functions is performance; virtual functions and the like can impose significant performance hits. I suspect there are some relatively small refactorings that could make this all much easier, but I don't know enough C++ to be confident in the right approach here.
Regarding the first point, I think I grok the idea of hashing in pieces (similar to the ...). Regarding the second point, I did a bit of "research" earlier today and also found templates (and "functors") to be the most performant alternative to the function-pointer vs. virtual-function question. I have zero experience with C++ templating, though, so I hope @camillescott and @luizirber don't mind if I bug them with questions on the topic!
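For what it's worth, a minimal sketch of the template/functor approach is below, so the hash call can be inlined instead of going through a vtable. All names here (`TwoBitHash`, `bucket_of`) are hypothetical, not khmer's actual API.

```cpp
// Sketch only: a hash "functor" passed as a template parameter, so the call
// can be inlined -- no virtual dispatch. TwoBitHash and bucket_of are
// hypothetical names, not khmer's actual API.
#include <cstdint>
#include <iostream>
#include <string>

// Stand-in exact 2-bit hash (only meaningful for k <= 32).
struct TwoBitHash {
    uint64_t operator()(const std::string& kmer) const {
        uint64_t h = 0;
        for (char c : kmer) {
            h <<= 2;
            switch (c) {
                case 'C': h |= 1; break;
                case 'G': h |= 2; break;
                case 'T': h |= 3; break;
                default: break;           // 'A' (and anything else) -> 0
            }
        }
        return h;
    }
};

// Calling code is parameterized on the functor type; the compiler
// instantiates a specialized copy, so there is no per-call dispatch cost.
template <typename HashFunctor>
uint64_t bucket_of(const std::string& kmer, uint64_t table_size,
                   HashFunctor hash = HashFunctor()) {
    return hash(kmer) % table_size;
}

int main() {
    std::cout << bucket_of<TwoBitHash>("ACGT", 1000003) << "\n";
}
```

Swapping hash functions then means instantiating the template with a different functor type, rather than paying for a function pointer or virtual call on every k-mer.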
The templating will also come in handy with respect to providing BAM support. SeqAn has a BAM parser in addition to the Fast[a|q] parser already in use, and I'm sure ...
Cyclic hash == https://en.wikipedia.org/wiki/Rolling_hash ?

Is there a setup to benchmark things (in terms of cache misses, etc.) as the hash functions are swapped around? Because: measuring gives you a leg up on experts who are too good to measure.

@ctb, where would you start looking for the "multiple small refactorings"?
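In case it's useful as a starting point, a bare-bones timing harness could look like the sketch below. The names (`hash_all`, the stand-in hash) are made up for illustration, and std::chrono only gives wall-clock time; cache-miss counts would need perf, PAPI, or cachegrind.

```cpp
// Bare-bones timing sketch; hash_all and the stand-in hash are made up for
// illustration. std::chrono gives wall-clock time only; cache-miss counts
// would need perf, PAPI, or cachegrind.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <vector>

template <typename HashFunctor>
double hash_all(const std::vector<std::string>& kmers, HashFunctor hash) {
    auto start = std::chrono::steady_clock::now();
    uint64_t sink = 0;                      // accumulate so the loop isn't optimized away
    for (const auto& k : kmers) sink ^= hash(k);
    auto stop = std::chrono::steady_clock::now();
    volatile uint64_t keep = sink; (void)keep;
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    // Random 21-mers as a toy workload.
    std::mt19937 rng(42);
    const char bases[] = "ACGT";
    std::vector<std::string> kmers(100000, std::string(21, 'A'));
    for (auto& k : kmers)
        for (auto& c : k) c = bases[rng() % 4];

    auto stand_in = [](const std::string& k) {   // swap in the real candidates here
        uint64_t h = 0;
        for (char c : k) h = h * 31 + static_cast<uint8_t>(c);
        return h;
    };
    std::cout << "stand-in hash: " << hash_all(kmers, stand_in) << " s\n";
}
```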
No benchmarking setup, and yes, I think adding some benchmarking is the place to start. This paper of ours (https://github.com/dib-lab/2013-khmer-counting) might be useful here.

Sadly it probably involves some diagramming and class-structure work. It's a bit of a tangled mess ATM, is what I'm saying.
As a trial, I removed most of the code that depends on reversibility in #1431, and then eliminated the sliding-window optimization and _revhash code in #1432. Almost all tests pass at this point! Then #1433 swapped in murmurhash as the default hash function, which of course breaks a bunch of stuff. Most of it looks like stuff that should be broken (and hence should be removed or refactored when using different hash functions), but there are a few things that bear closer examination.

At this point, we should be able to re-enable the sliding-window optimizations in KmerIterator with any rolling hash function. For example, this one https://github.com/lemire/rollinghashcpp/blob/master/example.cpp should work.
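For reference, the sliding-window trick with a cyclic-polynomial (Buzhash-style) rolling hash looks roughly like the sketch below. The per-base codes and helper names are illustrative only; this is not the rollinghashcpp API.

```cpp
// Illustrative cyclic-polynomial rolling hash (not the rollinghashcpp API):
// H(s[0..k-1]) = rol^{k-1}(h(s[0])) ^ rol^{k-2}(h(s[1])) ^ ... ^ h(s[k-1])
// Sliding the window by one base is a rotate plus two XORs -- O(1), not O(k).
#include <cstdint>
#include <iostream>
#include <string>

static uint64_t rol(uint64_t x, unsigned r) {
    r %= 64;
    return r ? (x << r) | (x >> (64 - r)) : x;
}

// Arbitrary fixed 64-bit codes per base (a real implementation would use random tables).
static uint64_t code(char c) {
    switch (c) { case 'A': return 0x9ae16a3b2f90404fULL;
                 case 'C': return 0xc949d7c7509e6557ULL;
                 case 'G': return 0xff51afd7ed558ccdULL;
                 default:  return 0xc4ceb9fe1a85ec53ULL; }  // T (and anything else)
}

int main() {
    const std::string seq = "ACGTACGGTTCA";
    const unsigned k = 5;

    // Hash the first window from scratch.
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i) h = rol(h, 1) ^ code(seq[i]);
    std::cout << "pos 0: " << h << "\n";

    // Slide: rotate, XOR out the base that left the window, XOR in the new one.
    for (unsigned i = k; i < seq.size(); ++i) {
        h = rol(h, 1) ^ rol(code(seq[i - k]), k) ^ code(seq[i]);
        std::cout << "pos " << (i - k + 1) << ": " << h << "\n";
    }
}
```

The update cost per position is independent of k, which is what would make re-enabling the KmerIterator optimization attractive for large k.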
Note: one of the problems was that our DNA-sanitizing code was broken in places. Sigh. See #1434.
Anyway, a question for the C++-philes amongst us: how hard would it be to refactor the C++ code base so that the code removed in #1431 is isolated in a ghetto with its own hash function?
OK, pretty much done playing around with hash code refactoring for now :). Here's my summary. Ignoring the partitioning code (test-removed in #1431), there are three kinds of things we need to do:
The first and second points can be dealt with easily, and are more or less done in #1437. For the third point, it seems to me like @camillescott's comment is a good way to go: implement a fast new data type that lets you store large k-mers exactly in 2-bit form, and go back and forth to strings when needed. The important operations here would be prepending and appending individual bases (to support ...).

With only a little extra effort, this new data type might be able to resolve the issues in the partitioning code as well, although I am OK with ignoring that for the moment.

If @camillescott can take a look at all of this in the next few days and check me, maybe we can schedule a chat about it online with @betatim and @standage so that we're all on the same page, and then figure out where to go from there?
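To make the third point concrete, here is a minimal sketch of what such an exact 2-bit large-k-mer type could look like. The `LongKmer` name and its methods are hypothetical, not existing khmer code; it is fixed at two 64-bit words (k up to 64) to keep the sketch short, and only shows append — prepend would be the mirror-image shift in from the left.

```cpp
// Hypothetical sketch (not khmer code): an exact 2-bit encoding for k <= 64,
// stored in two 64-bit words, with single-base append and round-tripping to strings.
#include <array>
#include <cstdint>
#include <iostream>
#include <string>

struct LongKmer {
    std::array<uint64_t, 2> w{{0, 0}};   // w[0] holds the most significant bases
    unsigned k;

    explicit LongKmer(unsigned k_) : k(k_) {}

    static uint64_t code(char c) {       // A=0 C=1 G=2 T=3
        switch (c) { case 'C': return 1; case 'G': return 2; case 'T': return 3; default: return 0; }
    }
    static char base(uint64_t b) { return "ACGT"[b & 3]; }

    // Shift everything left by two bits and push the new base on the right.
    void append(char c) {
        w[0] = (w[0] << 2) | (w[1] >> 62);
        w[1] = (w[1] << 2) | code(c);
        mask();
    }

    std::string str() const {
        std::string s(k, 'A');
        for (unsigned i = 0; i < k; ++i) {
            unsigned bit = 2 * (k - 1 - i);          // bit offset of base i from the right
            uint64_t word = bit < 64 ? w[1] : w[0];
            s[i] = base(word >> (bit % 64));
        }
        return s;
    }

private:
    void mask() {                        // clear any bits above position 2*k
        if (k < 32)      { w[0] = 0; w[1] &= (1ULL << (2 * k)) - 1; }
        else if (k < 64) { w[0] &= (1ULL << (2 * (k - 32))) - 1; }
    }
};

int main() {
    LongKmer km(40);                     // k = 40 > 32 needs more than one 64-bit word
    for (char c : std::string("ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT")) km.append(c);
    std::cout << km.str() << "\n";       // round-trips back to the input string
}
```

A real version would presumably size the word array from k at construction time rather than hard-coding two words.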
For the binary encoding of DNA, we might use 3 bits per DNA base instead of just two, or use a more complicated encoding overall; it would be nice to potentially allow for Ns in the sequence.
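For illustration only, one possible 3-bit code that reserves a value for N could look like this; nothing here reflects an actual decision.

```cpp
// Hypothetical 3-bit base code: room for A/C/G/T plus N (and spare values),
// at the cost of 3 bits per base instead of 2 (so only 21 bases per 64-bit word).
#include <cstdint>
#include <iostream>

enum class Base3 : uint8_t { A = 0, C = 1, G = 2, T = 3, N = 4 };

static Base3 encode3(char c) {
    switch (c) { case 'A': return Base3::A; case 'C': return Base3::C;
                 case 'G': return Base3::G; case 'T': return Base3::T;
                 default:  return Base3::N; }   // anything ambiguous maps to N
}

int main() {
    // Pack "ACGNT" into one 64-bit word, 3 bits per base.
    uint64_t packed = 0;
    for (char c : {'A', 'C', 'G', 'N', 'T'})
        packed = (packed << 3) | static_cast<uint64_t>(encode3(c));
    std::cout << std::hex << packed << "\n";    // 5 * 3 = 15 bits used
}
```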
Talking with @camillescott, one question is whether or not we should support lossy hashing at all, or just go with the bitstring encoding and reversible hashing needed for ...

I'm not clear on how to do this, but if we can build a simple functor interface such that we can specify when we need reversibility, we can probably fit it all in without too much complexity.
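One hedged way that interface could look (all names hypothetical): each functor carries a compile-time flag saying whether it is reversible, and code paths that need to invert the hash require that flag with a static_assert.

```cpp
// Hypothetical sketch of a hash-functor interface with a compile-time
// "is_reversible" tag; none of these names exist in khmer today.
#include <cstdint>
#include <iostream>
#include <string>

struct TwoBitHash {                      // exact/reversible 2-bit encoding (k <= 32 here)
    static constexpr bool is_reversible = true;
    uint64_t operator()(const std::string& kmer) const {
        uint64_t h = 0;
        for (char c : kmer)
            h = (h << 2) | (c == 'C' ? 1 : c == 'G' ? 2 : c == 'T' ? 3 : 0);
        return h;
    }
    std::string invert(uint64_t h, unsigned k) const {
        std::string s(k, 'A');
        for (unsigned i = 0; i < k; ++i) { s[k - 1 - i] = "ACGT"[h & 3]; h >>= 2; }
        return s;
    }
};

struct LossyHash {                       // murmur-style lossy hash; cannot be inverted
    static constexpr bool is_reversible = false;
    uint64_t operator()(const std::string& kmer) const {
        uint64_t h = 0xcbf29ce484222325ULL;             // FNV-1a as a simple stand-in
        for (char c : kmer) { h ^= static_cast<uint8_t>(c); h *= 0x100000001b3ULL; }
        return h;
    }
};

// Partitioning-style code can demand reversibility at compile time.
template <typename HashFunctor>
std::string roundtrip(const std::string& kmer, HashFunctor hash) {
    static_assert(HashFunctor::is_reversible, "this code path needs a reversible hash");
    return hash.invert(hash(kmer), kmer.size());
}

int main() {
    std::cout << roundtrip("ACGTACGT", TwoBitHash{}) << "\n";   // OK
    // roundtrip("ACGTACGT", LossyHash{});                      // would fail to compile
}
```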
Also: for the partitioning code, we need to support STL set membership, so I think all we need to implement is a perfect multi-long-long bitstring encoding that supports == and <.
Hmm, actually, we need to support the following uses of HashIntoType:
so we need < (weak ordering), copying, and assignment, I believe.
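As a quick sanity check of what std::set actually requires, a two-word key with a strict weak ordering is enough; the `WideHash` name below is made up for illustration.

```cpp
// Sketch: a two-word "bitstring" key with the strict weak ordering, copying,
// and assignment that std::set (and std::map) require. Names are illustrative.
#include <array>
#include <cstdint>
#include <iostream>
#include <set>

struct WideHash {
    std::array<uint64_t, 2> w;           // high word first, so lexicographic compare works

    bool operator==(const WideHash& o) const { return w == o.w; }
    bool operator<(const WideHash& o) const { return w < o.w; }  // strict weak ordering
};

int main() {
    std::set<WideHash> seen;
    seen.insert({{0x1, 0xdeadbeef}});
    seen.insert({{0x1, 0xdeadbeef}});    // duplicate, ignored
    seen.insert({{0x2, 0x0}});
    std::cout << seen.size() << "\n";    // prints 2
    std::cout << (seen.count({{0x1, 0xdeadbeef}}) ? "member" : "absent") << "\n";
}
```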
I am also unclear on what, exactly, needs to be done on the Python side to support something like the reversible Kmer class. Right now we take full advantage of the mapping between C++ HashIntoType and Python's native unsigned long long (ParseTuple type "K"). Perhaps another advantage of the lossy hashing approach is that we won't need to make the Python API more complicated for it.
Supporting k values > 32 has been on the table for a while (since #27 and #60 in 2013). Some progress was made on this with #624, but this PR has stalled.
I'd like to suggest that we prioritize this feature.
From my reading of the relevant threads, there seem to be a couple of possible approaches and concerns.
Am I reading things correctly? Anything I've missed?