Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A type to store large hash values (>64bit) #1444

Closed
wants to merge 14 commits into from

Conversation

betatim
Copy link
Member

@betatim betatim commented Sep 12, 2016

Exploratory surgery for #1442

This uses a byte array to store the hash value. We can template or similar to support "arbitrarily" large hash values.

At the moment trying to make it work and get some tests to pass. After that speeding up the code. it currently also contains various stuff that is useful for development but you might want to remove later.

  • benchmark bits vs bytes vs uint64 as type of the array
  • clean up spurious #includes
  • asserts or other protection when users try and | or & values larger than one byte
  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
  • Is the Copyright year up to date?

@betatim
Copy link
Member Author

betatim commented Sep 12, 2016

Also unsure where the class should live. kmer_hash.hh?

@betatim
Copy link
Member Author

betatim commented Sep 12, 2016

I know @ctb suggests ignoring the partitioning stuff, but it would still be interesting to try and comprehend why some of those tests fail.

@ctb
Copy link
Member

ctb commented Sep 12, 2016

Is now a good time to look?

Titus Brown, ctbrown@ucdavis.edu

Titus Brown, ctbrown@ucdavis.edu

On Sep 12, 2016, at 8:14 AM, Tim Head notifications@github.com wrote:

I know @ctb suggests ignoring the partitioning stuff, but it would still be interesting to try and comprehend why some of those tests fail.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@ctb
Copy link
Member

ctb commented Sep 12, 2016

Wow, that is an esoteric place for things to fail! But it's something minor and detailed, rather than something big. I can't dig into it right now, but I suggest you go about your business...

@betatim
Copy link
Member Author

betatim commented Sep 13, 2016

I think the failure is because there is a bug in the new class I added. Happens for long~ish k-mer sizes that are less than 32. For some reason it diverges from the value you get with a uint64. #test #test #test

@ctb
Copy link
Member

ctb commented Sep 13, 2016

Nice work! The tests pass for me on Mac OS X.

What's the plan moving forward? Benchmarking, for sure... if there's no major slowdown, it seems to me that the code is clean enough to merge, and then we can start working on supporting k > 32, yah?

@betatim
Copy link
Member Author

betatim commented Sep 13, 2016

Yes, I'd like to tidy it up a bit (make sure all the methods I defined make sense, and put in an assert or two for when you pass in arguments that are too big) and work out what we would need to add to make

typedef BigHashType HashIntoType;

work. I think that will be a whole new PR but doing a quick survey of what is missing might point to fundamental flaws in this idea (or not).

@ctb
Copy link
Member

ctb commented Sep 13, 2016

The only thing I worry about (other than performance) is suitably generalizing the save-to-disk code. But we have lots of tests for that so we just need to duplicate them for k > 32.

@sjackman
Copy link

sjackman commented Sep 13, 2016

You may find that std:bitset is more convenient that std::array when you need bitwise operators, particularly the bit shift operators (<< and >>). Most of the operators in class BigHashType is implemented in std::bitset.

Note that std::bitset is a misnomer. It's an array of bits, not a set of bits. You can think of it as a set of integers implemented using an array of bits. That probably explains its name.

@codecov-io
Copy link

codecov-io commented Sep 13, 2016

Current coverage is 77.19% (diff: 100%)

No coverage report found for master at 4637e63.

Powered by Codecov. Last update 4637e63...409644d

@sjackman
Copy link

sjackman commented Sep 13, 2016

I've just noticed that you're using std::array<uint8_t, 8>, which is 64 bits. Just to check, the plan is to to eventually increase 8 to a larger value, in a second PR?

If you do stick with std::array over std::bitset, you may find that you get better performance using std::array<uint64_t, N> rather than uint8_t.

@betatim
Copy link
Member Author

betatim commented Sep 13, 2016

The overall idea of using an array of bytes is based around _PyLong_FromByteArray. This would make it easy to transport large integers from c++ to python land (and back again). Hence bytes and not uint64 (or single bits) in the array. Not benchmarked yet what the performance difference/trade off is as a function of the size of the type in the array. This is on the to do list.

In the long term you want to make the size of the array configurable (what you really care about is some efficient way to store large integers). Not sure yet if you want to provide just one (fixed at compile time) or several or even one that chooses the size at run time (unlikely to perform well?). First though, stick with 8 bytes and make sure we can reproduce the uint64 behaviour 🚧 👷 🚧.

@sjackman
Copy link

Fair enough. It's unfortunate there's no PyLong_FromUnsignedLongLongArray

@betatim
Copy link
Member Author

betatim commented Sep 14, 2016

Some benchmarks:

hashing_basic is the current implementation with a uint64. bighash uses an array of N integers of type T. The last two numbers indicate the length of the kmer (8, 16, 24, 32) and which one of two strings was used (both random sequences). My conclusions from this wall of numbers:

  • single uint64 wrapped in a BigHashType is nearly as fast as just a uint64
  • as you increase the size of the basic type (and reduce N) performance improves
  • hashing_bighash<uint8_t, 8> is a bad idea?!
  • (not shown here but in the benchmark bitset is slower than hashing_bighash<uint8_t, 8>)
  • converting from a BigHashType to a uint64 is pretty slow, we should avoid doing that when ever possible (-> compute the modulus directly)

If you have some experience with google-benchmark I'd like to chat with you. With compilers being super sneaky these days I worry that these benchmarks don't actually benchmark what I want to test.

Benchmark                                  Time           CPU Iterations
------------------------------------------------------------------------
hashing_basic/8/1                         20 ns         40 ns   17541837
hashing_basic/8/2                         20 ns         23 ns  178492200
hashing_basic/16/1                        44 ns         52 ns   86685160
hashing_basic/16/2                        43 ns         49 ns   84504320
hashing_basic/24/1                        63 ns         72 ns   54933840
hashing_basic/24/2                        64 ns         71 ns   52610220
hashing_basic/32/1                        87 ns         98 ns   41023000
hashing_basic/32/2                        90 ns        177 ns    3775009
hashing_bighash<uint8_t, 8>/8/1          104 ns        107 ns   10000000
hashing_bighash<uint8_t, 8>/8/2          102 ns        203 ns    3449941
hashing_bighash<uint8_t, 8>/16/1         213 ns        228 ns   15286310
hashing_bighash<uint8_t, 8>/16/2         215 ns        430 ns    1642105
hashing_bighash<uint8_t, 8>/24/1         321 ns        641 ns    1000000
hashing_bighash<uint8_t, 8>/24/2         319 ns        638 ns    1113550
hashing_bighash<uint8_t, 8>/32/1         450 ns        523 ns    8079410
hashing_bighash<uint8_t, 8>/32/2         453 ns        499 ns    7537900
hashing_bighash<uint16_t, 4>/8/1          66 ns        131 ns    5386107
hashing_bighash<uint16_t, 4>/8/2          71 ns        139 ns   10000000
hashing_bighash<uint16_t, 4>/16/1        141 ns        155 ns   24197510
hashing_bighash<uint16_t, 4>/16/2        135 ns        150 ns   25038090
hashing_bighash<uint16_t, 4>/24/1        202 ns        227 ns   16956210
hashing_bighash<uint16_t, 4>/24/2        203 ns        227 ns   16856020
hashing_bighash<uint16_t, 4>/32/1        272 ns        543 ns    1314850
hashing_bighash<uint16_t, 4>/32/2        271 ns        313 ns   13142580
hashing_bighash<uint32_t, 2>/8/1          30 ns         34 ns  116678330
hashing_bighash<uint32_t, 2>/8/2          29 ns         31 ns  108268630
hashing_bighash<uint32_t, 2>/16/1         61 ns        121 ns    5961404
hashing_bighash<uint32_t, 2>/16/2         62 ns         71 ns   58215930
hashing_bighash<uint32_t, 2>/24/1         93 ns        102 ns   36724590
hashing_bighash<uint32_t, 2>/24/2         90 ns        180 ns    3883926
hashing_bighash<uint32_t, 2>/32/1        122 ns        139 ns   29083780
hashing_bighash<uint32_t, 2>/32/2        123 ns        140 ns   28921320
hashing_bighash<uint64_t, 1>/8/1          22 ns         23 ns  100000000
hashing_bighash<uint64_t, 1>/8/2          20 ns         40 ns   17014326
hashing_bighash<uint64_t, 1>/16/1         42 ns         48 ns   82092180
hashing_bighash<uint64_t, 1>/16/2         42 ns         49 ns   84606460
hashing_bighash<uint64_t, 1>/24/1         64 ns         74 ns   55392890
hashing_bighash<uint64_t, 1>/24/2         70 ns         81 ns   52367770
hashing_bighash<uint64_t, 1>/32/1         88 ns         98 ns   39777700
hashing_bighash<uint64_t, 1>/32/2         86 ns        170 ns    4043905
hashing_bighash<uint64_t, 2>/8/1          37 ns         72 ns   10000000
hashing_bighash<uint64_t, 2>/8/2          34 ns         68 ns   10315659
hashing_bighash<uint64_t, 2>/16/1         69 ns         80 ns   52080230
hashing_bighash<uint64_t, 2>/16/2         70 ns         80 ns   50454090
hashing_bighash<uint64_t, 2>/24/1        109 ns        129 ns   34405140
hashing_bighash<uint64_t, 2>/24/2        109 ns        116 ns   10000000
hashing_bighash<uint64_t, 2>/32/1        148 ns        296 ns    2319986
hashing_bighash<uint64_t, 2>/32/2        146 ns        163 ns   23393220
hashing_bighash<uint64_t, 2>/40/1        182 ns        209 ns   19664030
hashing_bighash<uint64_t, 2>/40/2        181 ns        201 ns   18983150

@sjackman
Copy link

sjackman commented Sep 14, 2016

If you have some experience with google-benchmark I'd like to chat with you.

I don't, sorry.

With compilers being super sneaky these days I worry that these benchmarks don't actually benchmark what I want to test.

Add the -save-temps option to GCC, which will save the assembler code in a .s file. You can then check the function you're testing to see which instructions it uses. If for example you're testing the >> operator, there should be a shift-right shr instruction in the function that you're testing. It can also give insight into why one implementation is faster than another. For example, hashing_bighash<uint64_t, 2> would almost certainly be loop unrolled, whereas hashing_bighash<uint8_t, 16> might be implemented with a loop.

@betatim
Copy link
Member Author

betatim commented Sep 19, 2016

Some kind of progress. Most of the last commit is disentangling the use of HashIntoType to mean "I want a hash" and "I want a big integer to use as index/for counting stuff". (I think)

Tempted to make this part a separate PR as it touches a lot of different parts of the code. The hope is to make it easier to review.

Right now this is still a bit of a construction site. Especially the c <-> python interface is a hack.

@standage
Copy link
Member

I agree that using HashIntoType as a large index is unnecessarily confusing. I also agree this might be better submitted as a separate PR, to make it easier to review the changes.

@ctb
Copy link
Member

ctb commented Sep 19, 2016 via email

@betatim
Copy link
Member Author

betatim commented Sep 19, 2016

Wait please.

@@ -123,11 +123,13 @@ protected:

void _init_bitstuff()
{
bitmask = 0;
#if 1
//bitmask = 0;
Copy link
Member Author

@betatim betatim Sep 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commenting this out is a bug?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was more a note-to-self than anything else.

@@ -40,6 +40,7 @@ Contact: khmer-project@idyll.org
#include <string.h>
#include <algorithm>
#include <string>
#include <iostream>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove me

@@ -203,16 +204,13 @@ KmerIterator::KmerIterator(const char * seq,
unsigned char k) :
KmerFactory(k), _seq(seq)
{
bitmask = 0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug?

@@ -38,11 +38,14 @@ Contact: khmer-project@idyll.org
#ifndef KMER_HASH_HH
#define KMER_HASH_HH

#include <array>
#include <iostream>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove me

@@ -113,6 +116,117 @@ HashIntoType _hash_murmur(const std::string& kmer,
HashIntoType& h, HashIntoType& r);
HashIntoType _hash_murmur_forward(const std::string& kmer);


template <typename T, std::size_t N>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that this can't work with anything but unsigned char at the moment, should we remove the type template argument?

@@ -163,7 +277,7 @@ public:
/// @warning The default constructor builds an invalid k-mer.
Kmer()
{
kmer_f = kmer_r = kmer_u = 0;
//kmer_f = kmer_r = kmer_u = 0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialiser list instead? Needs some kind of init.

@@ -515,17 +515,18 @@ void SubsetPartition::do_partition(
CallbackFn callback,
void * callback_data)
{
HashIntoType empty;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add operator bool() instead of this?

@@ -546,7 +547,7 @@ void SubsetPartition::do_partition(

// run callback, if specified
if (total_reads % CALLBACK_PERIOD == 0 && callback) {
cout << "...subset-part " << first_kmer << "-" << last_kmer << ": "
cout << "...subset-part " << first_kmer.as_ull() << "-" << last_kmer.as_ull() << ": "
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ctb what is the intention of this code? What if the hash does not fit in 64bit, what should this print?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't really matter. It's just diagnostic output that can be removed w/o a problem.

@@ -43,7 +43,7 @@ using namespace std;
Traverser::Traverser(const Hashtable * ht) :
KmerFactory(ht->ksize()), graph(ht)
{
bitmask = 0;
bitmask;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug?

@betatim
Copy link
Member Author

betatim commented Sep 21, 2016

The two tests that are failing are related to storing tags as binary format on disk. The test checks that the size of what was written to disk is "as expected", which now fails. That kinda makes sense as the BigHashType is bigger?? I think it'll be faster to get some advice from someone who knows the code than for me to figure this out.

Otherwise things compile and the tests pass! Wohoo! There are a few bits that are quite a mess, so I will tidy those up and address the comments I made above. If you want to start reviewing and help point out the messy parts, please do so.


The benchmark runs in ~0.58, so several times slower than master. Nothing that can't be fixed, hopefully. Probably worth investigating the switch to uint64s: #1444 (comment)

return NULL;
}
if (PyLong_Check(val)) {
_PyLong_AsByteArray((PyLongObject *)val,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a nice place to write a macro - replace _PyLong_AsByteArray(...) with a standard function/macro whatever, that could work for whatever we end up doing.

@@ -163,7 +163,6 @@ unsigned int Hashtable::consume_string(const std::string &s)

while(!kmers.done()) {
HashIntoType kmer = kmers.next();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@@ -1080,4 +1079,3 @@ const
}

// vim: set sts=2 sw=2:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@ctb
Copy link
Member

ctb commented Sep 21, 2016

Overall, this is really nice - thanks!

I think with virtually no effort you could backport most of these patches into the codebase immediately, and eliminate all but the hash-function specific stuff from consideration. What do you think?

@betatim
Copy link
Member Author

betatim commented Sep 22, 2016

Not sure I get what you mean 😕 what is the hash function specific stuff?

Contrary to what I wrote before I am now not so sure anymore about lifting bits out of this PR into smaller ones.


could you take a look at the failing tests? Can I safely increase the size it compares to or ...?

@ctb
Copy link
Member

ctb commented Sep 22, 2016

On Thu, Sep 22, 2016 at 06:13:40AM -0700, Tim Head wrote:

Not sure I get what you mean ???? what is the hash function specific stuff?

Contrary to what I wrote before I am now not so sure anymore about lifting bits out of this PR into smaller ones.

The HashIntoType => uint64 stuff is easy to extract, no?

could you take a look at the failing tests? Can I safely increase the size it compares to or ...?

yes.

This is an experimental branch to trial different options
for dealing with hash values that do not fit into 64bit.
pow() returns a floating point number not an integer.
Main work was changing the uses of HashIntoType that wanted
a large integer, not a hash. khmer compiles ...
Python2 makes a difference between PyInt and PyLong,
deal with both when converting a hash back to a kmer
In python2 there is a difference between int and long, make
sure reverse_hash works with both
Added an asignment operator to BigHashType to convert from uint64
More and more code was making assumptions about how many
bytes are used, so stopped templating as it was not going
to work with anythign but 8bytes for the moment.
@betatim
Copy link
Member Author

betatim commented Sep 26, 2016

This branch is becoming a place to try things. All the useful things are being extracted into smaller PRs:

@ctb
Copy link
Member

ctb commented Nov 15, 2016 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants