A type to store large hash values (>64bit) #1444

betatim · 2016-09-12T15:01:07Z

Exploratory surgery for #1442

This uses a byte array to store the hash value. We can template or similar to support "arbitrarily" large hash values.

At the moment trying to make it work and get some tests to pass. After that speeding up the code. it currently also contains various stuff that is useful for development but you might want to remove later.

betatim · 2016-09-12T15:07:16Z

Also unsure where the class should live. kmer_hash.hh?

betatim · 2016-09-12T15:14:41Z

I know @ctb suggests ignoring the partitioning stuff, but it would still be interesting to try and comprehend why some of those tests fail.

ctb · 2016-09-12T15:37:54Z

Is now a good time to look?

Titus Brown, ctbrown@ucdavis.edu

On Sep 12, 2016, at 8:14 AM, Tim Head notifications@github.com wrote:

I know @ctb suggests ignoring the partitioning stuff, but it would still be interesting to try and comprehend why some of those tests fail.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

ctb · 2016-09-12T15:49:36Z

Wow, that is an esoteric place for things to fail! But it's something minor and detailed, rather than something big. I can't dig into it right now, but I suggest you go about your business...

betatim · 2016-09-13T13:12:52Z

I think the failure is because there is a bug in the new class I added. Happens for long~ish k-mer sizes that are less than 32. For some reason it diverges from the value you get with a uint64. #test #test #test

ctb · 2016-09-13T15:54:42Z

Nice work! The tests pass for me on Mac OS X.

What's the plan moving forward? Benchmarking, for sure... if there's no major slowdown, it seems to me that the code is clean enough to merge, and then we can start working on supporting k > 32, yah?

betatim · 2016-09-13T15:57:19Z

Yes, I'd like to tidy it up a bit (make sure all the methods I defined make sense, and put in an assert or two for when you pass in arguments that are too big) and work out what we would need to add to make

typedef BigHashType HashIntoType;

work. I think that will be a whole new PR but doing a quick survey of what is missing might point to fundamental flaws in this idea (or not).

ctb · 2016-09-13T15:59:53Z

The only thing I worry about (other than performance) is suitably generalizing the save-to-disk code. But we have lots of tests for that so we just need to duplicate them for k > 32.

sjackman · 2016-09-13T16:11:42Z

You may find that std:bitset is more convenient that std::array when you need bitwise operators, particularly the bit shift operators (<< and >>). Most of the operators in class BigHashType is implemented in std::bitset.

Note that std::bitset is a misnomer. It's an array of bits, not a set of bits. You can think of it as a set of integers implemented using an array of bits. That probably explains its name.

codecov-io · 2016-09-13T16:16:28Z

Current coverage is 77.19% (diff: 100%)

No coverage report found for master at 4637e63.

Powered by Codecov. Last update 4637e63...409644d

sjackman · 2016-09-13T17:43:35Z

I've just noticed that you're using std::array<uint8_t, 8>, which is 64 bits. Just to check, the plan is to to eventually increase 8 to a larger value, in a second PR?

If you do stick with std::array over std::bitset, you may find that you get better performance using std::array<uint64_t, N> rather than uint8_t.

betatim · 2016-09-13T18:46:18Z

The overall idea of using an array of bytes is based around _PyLong_FromByteArray. This would make it easy to transport large integers from c++ to python land (and back again). Hence bytes and not uint64 (or single bits) in the array. Not benchmarked yet what the performance difference/trade off is as a function of the size of the type in the array. This is on the to do list.

In the long term you want to make the size of the array configurable (what you really care about is some efficient way to store large integers). Not sure yet if you want to provide just one (fixed at compile time) or several or even one that chooses the size at run time (unlikely to perform well?). First though, stick with 8 bytes and make sure we can reproduce the uint64 behaviour 🚧 👷 🚧.

sjackman · 2016-09-13T19:48:52Z

Fair enough. It's unfortunate there's no PyLong_FromUnsignedLongLongArray

betatim · 2016-09-14T14:48:48Z

Some benchmarks:

hashing_basic is the current implementation with a uint64. bighash uses an array of N integers of type T. The last two numbers indicate the length of the kmer (8, 16, 24, 32) and which one of two strings was used (both random sequences). My conclusions from this wall of numbers:

single uint64 wrapped in a BigHashType is nearly as fast as just a uint64
as you increase the size of the basic type (and reduce N) performance improves
hashing_bighash<uint8_t, 8> is a bad idea?!
(not shown here but in the benchmark bitset is slower than hashing_bighash<uint8_t, 8>)
converting from a BigHashType to a uint64 is pretty slow, we should avoid doing that when ever possible (-> compute the modulus directly)

If you have some experience with google-benchmark I'd like to chat with you. With compilers being super sneaky these days I worry that these benchmarks don't actually benchmark what I want to test.

Benchmark                                  Time           CPU Iterations
------------------------------------------------------------------------
hashing_basic/8/1                         20 ns         40 ns   17541837
hashing_basic/8/2                         20 ns         23 ns  178492200
hashing_basic/16/1                        44 ns         52 ns   86685160
hashing_basic/16/2                        43 ns         49 ns   84504320
hashing_basic/24/1                        63 ns         72 ns   54933840
hashing_basic/24/2                        64 ns         71 ns   52610220
hashing_basic/32/1                        87 ns         98 ns   41023000
hashing_basic/32/2                        90 ns        177 ns    3775009
hashing_bighash<uint8_t, 8>/8/1          104 ns        107 ns   10000000
hashing_bighash<uint8_t, 8>/8/2          102 ns        203 ns    3449941
hashing_bighash<uint8_t, 8>/16/1         213 ns        228 ns   15286310
hashing_bighash<uint8_t, 8>/16/2         215 ns        430 ns    1642105
hashing_bighash<uint8_t, 8>/24/1         321 ns        641 ns    1000000
hashing_bighash<uint8_t, 8>/24/2         319 ns        638 ns    1113550
hashing_bighash<uint8_t, 8>/32/1         450 ns        523 ns    8079410
hashing_bighash<uint8_t, 8>/32/2         453 ns        499 ns    7537900
hashing_bighash<uint16_t, 4>/8/1          66 ns        131 ns    5386107
hashing_bighash<uint16_t, 4>/8/2          71 ns        139 ns   10000000
hashing_bighash<uint16_t, 4>/16/1        141 ns        155 ns   24197510
hashing_bighash<uint16_t, 4>/16/2        135 ns        150 ns   25038090
hashing_bighash<uint16_t, 4>/24/1        202 ns        227 ns   16956210
hashing_bighash<uint16_t, 4>/24/2        203 ns        227 ns   16856020
hashing_bighash<uint16_t, 4>/32/1        272 ns        543 ns    1314850
hashing_bighash<uint16_t, 4>/32/2        271 ns        313 ns   13142580
hashing_bighash<uint32_t, 2>/8/1          30 ns         34 ns  116678330
hashing_bighash<uint32_t, 2>/8/2          29 ns         31 ns  108268630
hashing_bighash<uint32_t, 2>/16/1         61 ns        121 ns    5961404
hashing_bighash<uint32_t, 2>/16/2         62 ns         71 ns   58215930
hashing_bighash<uint32_t, 2>/24/1         93 ns        102 ns   36724590
hashing_bighash<uint32_t, 2>/24/2         90 ns        180 ns    3883926
hashing_bighash<uint32_t, 2>/32/1        122 ns        139 ns   29083780
hashing_bighash<uint32_t, 2>/32/2        123 ns        140 ns   28921320
hashing_bighash<uint64_t, 1>/8/1          22 ns         23 ns  100000000
hashing_bighash<uint64_t, 1>/8/2          20 ns         40 ns   17014326
hashing_bighash<uint64_t, 1>/16/1         42 ns         48 ns   82092180
hashing_bighash<uint64_t, 1>/16/2         42 ns         49 ns   84606460
hashing_bighash<uint64_t, 1>/24/1         64 ns         74 ns   55392890
hashing_bighash<uint64_t, 1>/24/2         70 ns         81 ns   52367770
hashing_bighash<uint64_t, 1>/32/1         88 ns         98 ns   39777700
hashing_bighash<uint64_t, 1>/32/2         86 ns        170 ns    4043905
hashing_bighash<uint64_t, 2>/8/1          37 ns         72 ns   10000000
hashing_bighash<uint64_t, 2>/8/2          34 ns         68 ns   10315659
hashing_bighash<uint64_t, 2>/16/1         69 ns         80 ns   52080230
hashing_bighash<uint64_t, 2>/16/2         70 ns         80 ns   50454090
hashing_bighash<uint64_t, 2>/24/1        109 ns        129 ns   34405140
hashing_bighash<uint64_t, 2>/24/2        109 ns        116 ns   10000000
hashing_bighash<uint64_t, 2>/32/1        148 ns        296 ns    2319986
hashing_bighash<uint64_t, 2>/32/2        146 ns        163 ns   23393220
hashing_bighash<uint64_t, 2>/40/1        182 ns        209 ns   19664030
hashing_bighash<uint64_t, 2>/40/2        181 ns        201 ns   18983150

sjackman · 2016-09-14T15:32:03Z

If you have some experience with google-benchmark I'd like to chat with you.

I don't, sorry.

With compilers being super sneaky these days I worry that these benchmarks don't actually benchmark what I want to test.

Add the -save-temps option to GCC, which will save the assembler code in a .s file. You can then check the function you're testing to see which instructions it uses. If for example you're testing the >> operator, there should be a shift-right shr instruction in the function that you're testing. It can also give insight into why one implementation is faster than another. For example, hashing_bighash<uint64_t, 2> would almost certainly be loop unrolled, whereas hashing_bighash<uint8_t, 16> might be implemented with a loop.

betatim · 2016-09-19T16:20:29Z

Some kind of progress. Most of the last commit is disentangling the use of HashIntoType to mean "I want a hash" and "I want a big integer to use as index/for counting stuff". (I think)

Tempted to make this part a separate PR as it touches a lot of different parts of the code. The hope is to make it easier to review.

Right now this is still a bit of a construction site. Especially the c <-> python interface is a hack.

standage · 2016-09-19T16:24:29Z

I agree that using HashIntoType as a large index is unnecessarily confusing. I also agree this might be better submitted as a separate PR, to make it easier to review the changes.

ctb · 2016-09-19T18:13:50Z

Shall I look at this commit or wait for you to cherry pick it out to a new PR? :)

betatim · 2016-09-19T20:47:39Z

Wait please.

betatim · 2016-09-19T21:30:46Z

lib/hashtable.hh

@@ -123,11 +123,13 @@ protected:

    void _init_bitstuff()
    {
-        bitmask = 0;
+      #if 1
+        //bitmask = 0;


commenting this out is a bug?

I don't understand.

this was more a note-to-self than anything else.

betatim · 2016-09-19T21:31:17Z

lib/kmer_hash.cc

@@ -40,6 +40,7 @@ Contact: khmer-project@idyll.org
 #include <string.h>
 #include <algorithm>
 #include <string>
+#include <iostream>


betatim · 2016-09-19T21:31:30Z

lib/kmer_hash.cc

@@ -203,16 +204,13 @@ KmerIterator::KmerIterator(const char * seq,
                           unsigned char k) :
    KmerFactory(k), _seq(seq)
 {
-    bitmask = 0;


betatim · 2016-09-19T21:31:39Z

lib/kmer_hash.hh

@@ -38,11 +38,14 @@ Contact: khmer-project@idyll.org
 #ifndef KMER_HASH_HH
 #define KMER_HASH_HH

+#include <array>
+#include <iostream>


betatim · 2016-09-19T21:32:36Z

lib/kmer_hash.hh

@@ -113,6 +116,117 @@ HashIntoType _hash_murmur(const std::string& kmer,
                          HashIntoType& h, HashIntoType& r);
 HashIntoType _hash_murmur_forward(const std::string& kmer);

+
+template <typename T, std::size_t N>


given that this can't work with anything but unsigned char at the moment, should we remove the type template argument?

betatim · 2016-09-19T21:33:43Z

lib/kmer_hash.hh

@@ -163,7 +277,7 @@ public:
    /// @warning The default constructor builds an invalid k-mer.
    Kmer()
    {
-        kmer_f = kmer_r = kmer_u = 0;
+        //kmer_f = kmer_r = kmer_u = 0;


initialiser list instead? Needs some kind of init.

betatim · 2016-09-19T21:35:02Z

lib/subset.cc

@@ -515,17 +515,18 @@ void SubsetPartition::do_partition(
    CallbackFn		callback,
    void *		callback_data)
 {
+    HashIntoType empty;


add operator bool() instead of this?

betatim · 2016-09-19T21:35:49Z

lib/subset.cc

@@ -546,7 +547,7 @@ void SubsetPartition::do_partition(

        // run callback, if specified
        if (total_reads % CALLBACK_PERIOD == 0 && callback) {
-            cout << "...subset-part " << first_kmer << "-" << last_kmer << ": "
+            cout << "...subset-part " << first_kmer.as_ull() << "-" << last_kmer.as_ull() << ": "


@ctb what is the intention of this code? What if the hash does not fit in 64bit, what should this print?

doesn't really matter. It's just diagnostic output that can be removed w/o a problem.

betatim · 2016-09-19T21:36:02Z

lib/traversal.cc

@@ -43,7 +43,7 @@ using namespace std;
 Traverser::Traverser(const Hashtable * ht) :
    KmerFactory(ht->ksize()), graph(ht)
 {
-    bitmask = 0;
+    bitmask;


betatim · 2016-09-21T20:16:19Z

The two tests that are failing are related to storing tags as binary format on disk. The test checks that the size of what was written to disk is "as expected", which now fails. That kinda makes sense as the BigHashType is bigger?? I think it'll be faster to get some advice from someone who knows the code than for me to figure this out.

Otherwise things compile and the tests pass! Wohoo! There are a few bits that are quite a mess, so I will tidy those up and address the comments I made above. If you want to start reviewing and help point out the messy parts, please do so.

The benchmark runs in ~0.58, so several times slower than master. Nothing that can't be fixed, hopefully. Probably worth investigating the switch to uint64s: #1444 (comment)

ctb · 2016-09-21T23:53:27Z

khmer/_khmer.cc

+        return NULL;
+    }
+    if (PyLong_Check(val)) {
+        _PyLong_AsByteArray((PyLongObject *)val,


This seems like a nice place to write a macro - replace _PyLong_AsByteArray(...) with a standard function/macro whatever, that could work for whatever we end up doing.

ctb · 2016-09-21T23:54:22Z

lib/hashtable.cc

@@ -163,7 +163,6 @@ unsigned int Hashtable::consume_string(const std::string &s)

    while(!kmers.done()) {
        HashIntoType kmer = kmers.next();
-


ctb · 2016-09-21T23:54:31Z

lib/hashtable.cc

@@ -1080,4 +1079,3 @@ const
 }

 // vim: set sts=2 sw=2:
-


ctb · 2016-09-21T23:58:32Z

Overall, this is really nice - thanks!

I think with virtually no effort you could backport most of these patches into the codebase immediately, and eliminate all but the hash-function specific stuff from consideration. What do you think?

betatim · 2016-09-22T13:13:40Z

Not sure I get what you mean 😕 what is the hash function specific stuff?

Contrary to what I wrote before I am now not so sure anymore about lifting bits out of this PR into smaller ones.

could you take a look at the failing tests? Can I safely increase the size it compares to or ...?

ctb · 2016-09-22T13:15:34Z

On Thu, Sep 22, 2016 at 06:13:40AM -0700, Tim Head wrote:

Not sure I get what you mean ???? what is the hash function specific stuff?

Contrary to what I wrote before I am now not so sure anymore about lifting bits out of this PR into smaller ones.

The HashIntoType => uint64 stuff is easy to extract, no?

could you take a look at the failing tests? Can I safely increase the size it compares to or ...?

yes.

This is an experimental branch to trial different options for dealing with hash values that do not fit into 64bit.

The first test passes!

pow() returns a floating point number not an integer.

Main work was changing the uses of HashIntoType that wanted a large integer, not a hash. khmer compiles ...

Python2 makes a difference between PyInt and PyLong, deal with both when converting a hash back to a kmer

In python2 there is a difference between int and long, make sure reverse_hash works with both

Added an asignment operator to BigHashType to convert from uint64

More and more code was making assumptions about how many bytes are used, so stopped templating as it was not going to work with anythign but 8bytes for the moment.

betatim · 2016-09-26T13:52:59Z

This branch is becoming a place to try things. All the useful things are being extracted into smaller PRs:

ctb · 2016-11-15T15:48:44Z

@betatim could you write up a brief summary of the > 64 bit hash value stuff on a new issue, and link in the various trial efforts you made (and just closed :)? documentation FTW :)

betatim commented Sep 19, 2016

View reviewed changes

ctb reviewed Sep 21, 2016

View reviewed changes

lib/hashtable.cc

@@ -1080,4 +1079,3 @@ const

}

// vim: set sts=2 sw=2:

Copy link

Member

ctb Sep 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

betatim added 3 commits September 23, 2016 16:28

Creating a new type to store hash values with >64bit

f9b41d1

This is an experimental branch to trial different options for dealing with hash values that do not fit into 64bit.

Now using HashType in _hash instead of HashIntoType

8e4dcad

The first test passes!

Fixed bug in as_ull()

64dc9bb

pow() returns a floating point number not an integer.

betatim added 11 commits September 23, 2016 16:28

Use uint64 as index/size/count type instead of HashIntoType

baed432

Main work was changing the uses of HashIntoType that wanted a large integer, not a hash. khmer compiles ...

Tidying up after the big changes

c43518e

Allow hash reversal from python

aa64f80

Reverse hashes that are int and long

58654b1

Python2 makes a difference between PyInt and PyLong, deal with both when converting a hash back to a kmer

Explicitly test reverse hash with long

4fbe688

In python2 there is a difference between int and long, make sure reverse_hash works with both

Fix murmurhash

40706a6

Added an asignment operator to BigHashType to convert from uint64

Switch HashSet to BigHashType

9c43e1a

Stop templating BigHashType

1a2df3a

More and more code was making assumptions about how many bytes are used, so stopped templating as it was not going to work with anythign but 8bytes for the moment.

Finding neighbours works

593ac04

More tagging

ffcb2dc

Towards supporting >64bit values in BigHashType

d902125

betatim force-pushed the large-hash-values branch from a26acda to d902125 Compare September 23, 2016 14:28

This was referenced Sep 23, 2016

Refactor uint64_t vs HashIntoType #1454

Merged

Beyond 64bit hash values #1455

Closed

betatim mentioned this pull request Oct 10, 2016

Plan for hash function refactoring #1441

Closed

betatim closed this Nov 15, 2016

betatim mentioned this pull request Nov 15, 2016

Exploring reversible hashing for k>32 #1519

Closed

		@@ -163,7 +163,6 @@ unsigned int Hashtable::consume_string(const std::string &s)

		while(!kmers.done()) {
		HashIntoType kmer = kmers.next();

A type to store large hash values (>64bit) #1444

A type to store large hash values (>64bit) #1444

Conversation

betatim commented Sep 12, 2016 • edited Loading

betatim commented Sep 12, 2016

betatim commented Sep 12, 2016

ctb commented Sep 12, 2016

ctb commented Sep 12, 2016

betatim commented Sep 13, 2016

ctb commented Sep 13, 2016

betatim commented Sep 13, 2016

ctb commented Sep 13, 2016

sjackman commented Sep 13, 2016 • edited Loading

codecov-io commented Sep 13, 2016 • edited Loading

Current coverage is 77.19% (diff: 100%)

sjackman commented Sep 13, 2016 • edited Loading

betatim commented Sep 13, 2016

sjackman commented Sep 13, 2016

betatim commented Sep 14, 2016

sjackman commented Sep 14, 2016 • edited Loading

betatim commented Sep 19, 2016

standage commented Sep 19, 2016

ctb commented Sep 19, 2016 via email

betatim commented Sep 19, 2016

betatim Sep 19, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

betatim commented Sep 21, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ctb commented Sep 21, 2016

betatim commented Sep 22, 2016

ctb commented Sep 22, 2016

betatim commented Sep 26, 2016

ctb commented Nov 15, 2016 via email

betatim commented Sep 12, 2016 •

edited

Loading

sjackman commented Sep 13, 2016 •

edited

Loading

codecov-io commented Sep 13, 2016 •

edited

Loading

sjackman commented Sep 13, 2016 •

edited

Loading

sjackman commented Sep 14, 2016 •

edited

Loading

betatim Sep 19, 2016 •

edited

Loading