A type to store large hash values (>64bit) #1444
Also unsure where the class should live. |
I know @ctb suggests ignoring the partitioning stuff, but it would still be interesting to try and comprehend why some of those tests fail. |
Is now a good time to look? Titus Brown, ctbrown@ucdavis.edu
|
Wow, that is an esoteric place for things to fail! But it's something minor and detailed, rather than something big. I can't dig into it right now, but I suggest you go about your business... |
I think the failure is because there is a bug in the new class I added. It happens for longish k-mer sizes that are less than 32. For some reason it diverges from the value you get with a uint64. #test #test #test |
Nice work! The tests pass for me on Mac OS X. What's the plan moving forward? Benchmarking, for sure... if there's no major slowdown, it seems to me that the code is clean enough to merge, and then we can start working on supporting k > 32, yah? |
Yes, I'd like to tidy it up a bit (make sure all the methods I defined make sense, and put in an assert or two for when you pass in arguments that are too big) and work out what we would need to add to make
work. I think that will be a whole new PR but doing a quick survey of what is missing might point to fundamental flaws in this idea (or not). |
The only thing I worry about (other than performance) is suitably generalizing the save-to-disk code. But we have lots of tests for that so we just need to duplicate them for k > 32. |
You may find that Note that |
Current coverage is 77.19% (diff: 100%)
|
I've just noticed that you're using If you do stick with |
The overall idea of using an array of bytes is based around In the long term you want to make the size of the array configurable (what you really care about is some efficient way to store large integers). Not sure yet if you want to provide just one (fixed at compile time) or several or even one that chooses the size at run time (unlikely to perform well?). First though, stick with 8 bytes and make sure we can reproduce the |
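A minimal sketch of the byte-array idea, to make the discussion concrete. The class name `ByteHash`, the little-endian layout and the `shift_left2()` helper are assumptions made for illustration, not the code in this PR; only `as_ull()` is a name that appears in the diff. With `N == 8` the type should behave exactly like a `uint64`, which is the property worth checking first.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration: an unsigned integer stored in N bytes,
// least significant byte first.
template <std::size_t N>
class ByteHash
{
    static_assert(N >= 8, "need at least 8 bytes to hold a uint64");
    std::array<unsigned char, N> bytes{}; // zero-initialised

public:
    ByteHash() = default;

    // Build from a 64-bit value; any bytes above the eighth stay zero.
    ByteHash(std::uint64_t v)
    {
        for (std::size_t i = 0; i < 8; ++i) {
            bytes[i] = static_cast<unsigned char>(v >> (8 * i));
        }
    }

    // Return the low 64 bits; this is lossless when N == 8.
    std::uint64_t as_ull() const
    {
        std::uint64_t v = 0;
        for (std::size_t i = 0; i < 8; ++i) {
            v |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
        }
        return v;
    }

    // Shift left by two bits (one nucleotide), carrying across bytes.
    // Bits shifted out of the top byte are dropped, as with a uint64.
    ByteHash& shift_left2()
    {
        unsigned char carry = 0;
        for (std::size_t i = 0; i < N; ++i) {
            const unsigned char next = static_cast<unsigned char>(bytes[i] >> 6);
            bytes[i] = static_cast<unsigned char>((bytes[i] << 2) | carry);
            carry = next;
        }
        return *this;
    }

    bool operator==(const ByteHash& other) const
    {
        return bytes == other.bytes;
    }
};
```

Comparing `ByteHash<8>` against a plain `uint64` for each operation is a cheap way to catch divergence before moving on to larger `N`.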
Fair enough. It's unfortunate there's no |
If you have some experience with
|
I don't, sorry.
Add the |
Some kind of progress. Most of the last commit is disentangling the use of Tempted to make this part a separate PR as it touches a lot of different parts of the code. The hope is to make it easier to review. Right now this is still a bit of a construction site. Especially the c <-> python interface is a hack. |
I agree that using |
Shall I look at this commit or wait for you to cherry pick it out to a new
PR? :)
|
Wait please. |
@@ -123,11 +123,13 @@ protected:

void _init_bitstuff()
{
bitmask = 0;
#if 1
//bitmask = 0;
commenting this out is a bug?
I don't understand.
this was more a note-to-self than anything else.
@@ -40,6 +40,7 @@ Contact: khmer-project@idyll.org
#include <string.h>
#include <algorithm>
#include <string>
#include <iostream>
remove me
@@ -203,16 +204,13 @@ KmerIterator::KmerIterator(const char * seq,
unsigned char k) :
KmerFactory(k), _seq(seq)
{
bitmask = 0;
bug?
@@ -38,11 +38,14 @@ Contact: khmer-project@idyll.org
#ifndef KMER_HASH_HH
#define KMER_HASH_HH

#include <array>
#include <iostream>
remove me
@@ -113,6 +116,117 @@ HashIntoType _hash_murmur(const std::string& kmer,
HashIntoType& h, HashIntoType& r);
HashIntoType _hash_murmur_forward(const std::string& kmer);


template <typename T, std::size_t N>
Given that this can't work with anything but unsigned char at the moment, should we remove the type template argument?
@@ -163,7 +277,7 @@ public:
/// @warning The default constructor builds an invalid k-mer.
Kmer()
{
kmer_f = kmer_r = kmer_u = 0;
//kmer_f = kmer_r = kmer_u = 0;
initialiser list instead? Needs some kind of init.
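A sketch of what the initialiser-list version could look like. The `HashIntoType` typedef here is a stand-in (assumed to be a plain 64-bit integer) so that the snippet compiles on its own; the member names come from the diff, and the three-argument constructor is an assumption added for illustration.

```cpp
#include <cstdint>

// Stand-in for the real hash type; assumed to be a 64-bit integer here.
typedef std::uint64_t HashIntoType;

class Kmer
{
public:
    HashIntoType kmer_f, kmer_r, kmer_u;

    /// @warning The default constructor builds an invalid k-mer.
    // Members are still given defined values, via the initialiser list
    // rather than by assignment in the constructor body.
    Kmer() : kmer_f(0), kmer_r(0), kmer_u(0) {}

    // Hypothetical convenience constructor, shown for symmetry.
    Kmer(HashIntoType f, HashIntoType r, HashIntoType u)
        : kmer_f(f), kmer_r(r), kmer_u(u) {}
};
```

If the hash type becomes a class with its own zeroing default constructor, the explicit `(0)` arguments only need a constructor that accepts an integer.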
@@ -515,17 +515,18 @@ void SubsetPartition::do_partition(
CallbackFn callback,
void * callback_data)
{
HashIntoType empty;
add operator bool() instead of this?
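One way the `operator bool()` suggestion could be sketched, assuming an 8-byte little-endian array as storage. `BigHashType` is the name used in the commit messages, but the body below is a hypothetical reconstruction, not the PR's code. The operator is marked `explicit` so the type does not silently convert to an integer in arithmetic.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

class BigHashType
{
    std::array<unsigned char, 8> bytes{}; // zero-initialised

public:
    BigHashType() = default;

    BigHashType(std::uint64_t v)
    {
        for (std::size_t i = 0; i < bytes.size(); ++i) {
            bytes[i] = static_cast<unsigned char>(v >> (8 * i));
        }
    }

    // True when any byte is non-zero, so `if (hash)` replaces a
    // comparison against a separately constructed "empty" value.
    explicit operator bool() const
    {
        for (unsigned char b : bytes) {
            if (b != 0) {
                return true;
            }
        }
        return false;
    }
};
```

With this in place, a check such as `if (!kmer)` reads the same as it would for a plain integer hash.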
@@ -546,7 +547,7 @@ void SubsetPartition::do_partition(

// run callback, if specified
if (total_reads % CALLBACK_PERIOD == 0 && callback) {
cout << "...subset-part " << first_kmer << "-" << last_kmer << ": "
cout << "...subset-part " << first_kmer.as_ull() << "-" << last_kmer.as_ull() << ": "
@ctb what is the intention of this code? What if the hash does not fit in 64bit, what should this print?
doesn't really matter. It's just diagnostic output that can be removed w/o a problem.
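If the diagnostic output is kept rather than removed, one option that does not depend on the value fitting in 64 bits is to print the bytes as hexadecimal. The helper name `to_hex` and the little-endian byte order are assumptions for this sketch.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Render a little-endian byte array as a hex string,
// most significant byte first. Works for any width N.
template <std::size_t N>
std::string to_hex(const std::array<unsigned char, N>& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(2 * N);
    for (std::size_t i = N; i-- > 0;) {
        out.push_back(digits[bytes[i] >> 4]);   // high nibble
        out.push_back(digits[bytes[i] & 0x0f]); // low nibble
    }
    return out;
}
```

The result is a `std::string`, so it can be streamed to `cout` in place of the `as_ull()` calls.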
@@ -43,7 +43,7 @@ using namespace std;
Traverser::Traverser(const Hashtable * ht) :
KmerFactory(ht->ksize()), graph(ht)
{
bitmask = 0;
bitmask;
bug?
The two tests that are failing are related to storing tags as binary format on disk. The test checks that the size of what was written to disk is "as expected", which now fails. That kinda makes sense as the Otherwise things compile and the tests pass! Wohoo! There are a few bits that are quite a mess, so I will tidy those up and address the comments I made above. If you want to start reviewing and help point out the messy parts, please do so. The benchmark runs in ~0.58, so several times slower than master. Nothing that can't be fixed, hopefully. Probably worth investigating the switch to |
return NULL;
}
if (PyLong_Check(val)) {
_PyLong_AsByteArray((PyLongObject *)val,
This seems like a nice place to write a macro - replace _PyLong_AsByteArray(...) with a standard function/macro whatever, that could work for whatever we end up doing.
@@ -163,7 +163,6 @@ unsigned int Hashtable::consume_string(const std::string &s)

while(!kmers.done()) {
HashIntoType kmer = kmers.next();

?
@@ -1080,4 +1079,3 @@ const
}

// vim: set sts=2 sw=2:

?
Overall, this is really nice - thanks! I think with virtually no effort you could backport most of these patches into the codebase immediately, and eliminate all but the hash-function specific stuff from consideration. What do you think? |
Not sure I get what you mean 😕 what is the hash function specific stuff? Contrary to what I wrote before, I am now not so sure anymore about lifting bits out of this PR into smaller ones. Could you take a look at the failing tests? Can I safely increase the size it compares to or ...? |
On Thu, Sep 22, 2016 at 06:13:40AM -0700, Tim Head wrote:
The HashIntoType => uint64 stuff is easy to extract, no?
yes. |
This is an experimental branch to trial different options for dealing with hash values that do not fit into 64bit.
The first test passes!
pow() returns a floating point number not an integer.
Main work was changing the uses of HashIntoType that wanted a large integer, not a hash. khmer compiles ...
Python2 makes a difference between PyInt and PyLong, deal with both when converting a hash back to a kmer
In python2 there is a difference between int and long, make sure reverse_hash works with both
Added an assignment operator to BigHashType to convert from uint64
More and more code was making assumptions about how many bytes are used, so stopped templating as it was not going to work with anything but 8 bytes for the moment.
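On the commit note about pow(): `std::pow` works in floating point, and a `double` has a 53-bit significand, so integer results above 2^53 cannot all be represented exactly. A sketch of computing the same quantities with shifts only follows; the function names are hypothetical, and two bits per base is assumed.

```cpp
#include <cstdint>

// All-ones mask covering a k-mer of length k at two bits per base.
// Valid for 0 <= k <= 32; built by shifting, so no floating point.
inline std::uint64_t kmer_bitmask(unsigned int k)
{
    std::uint64_t mask = 0;
    for (unsigned int i = 0; i < k; ++i) {
        mask = (mask << 2) | 3;
    }
    return mask;
}

// 4^k as an exact integer. Valid for k <= 31 (4^32 overflows 64 bits).
inline std::uint64_t four_to_the(unsigned int k)
{
    return std::uint64_t(1) << (2 * k);
}
```

For `k <= 31` the two agree as `kmer_bitmask(k) == four_to_the(k) - 1`, which makes a convenient unit test.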
a26acda to d902125
This branch is becoming a place to try things. All the useful things are being extracted into smaller PRs: |
@betatim could you write up a brief summary of the > 64 bit hash value stuff on
a new issue, and link in the various trial efforts you made (and just closed
:)?
documentation FTW :)
|
Exploratory surgery for #1442
This uses a byte array to store the hash value. We can template or similar to support "arbitrarily" large hash values.
At the moment I am trying to make it work and get some tests to pass; after that comes speeding up the code. It currently also contains various things that are useful for development but that you might want to remove later.
- bits vs bytes vs uint64 as type of the array
- #include s
- | or & values larger than one byte
- make test: Did it pass the tests?
- make clean diff-cover: If it introduces new functionality in scripts/ is it tested?
- make format diff_pylint_report cppcheck doc pydocstyle: Is it well formatted?
- without a major version increment. Changing file formats also requires a major version number increment.
- ChangeLog? http://en.wikipedia.org/wiki/Changelog#Format
- changes were made?
- tested for streaming IO?)