Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swap murmurhash in as primary hash function. #1433

Closed
wants to merge 5 commits into from

Conversation

ctb
Copy link
Member

@ctb ctb commented Sep 3, 2016

Starting from #1432 (simplification of hash function code), swap in murmurhash as the default hash function for khmer.

This will support k > 32. cc #1426

The tests that are broken are either checks on the exact hash values, OR are sensitive to collisions (a lot of the occupancy numbers are like this), OR are broken for unknown reasons - I'm not sure why the graph alignment code is broken, in particular. Still, we're down to 15 broken tests which is pretty good...

@ctb ctb mentioned this pull request Sep 3, 2016
@betatim
Copy link
Member

betatim commented Sep 5, 2016

Does it make sense to investigate a general purpose string/byte hashing algorithm? MH is pretty speedy as far as hash functions go but it is still ~100x slower than master. Considering that khmer doesn't need a hash function that can handle all 256 values of a byte but only maybe 10, does it make sense to brew a custom hash function? It seems like you should be able to capitalise on this limitation. thinkingoutloud

@ctb
Copy link
Member Author

ctb commented Sep 5, 2016

Good question ;). Poor man's solution could be this:

  • hash DNA into two bit form;
  • use the bytes emerging from that two bit form as the input to a
    normal hash function.

Should be much faster, too!

I do think we should avoid developing our own hash function because it's
an area with a lot of already developed expertise.

@betatim
Copy link
Member

betatim commented Sep 5, 2016

I do think we should avoid developing our own hash function because it's
an area with a lot of already developed expertise.

... and it probably is a bit like crypto, you think it will be a one afternoon adventure and two years later you are still developing and discovering what others learnt many years ago.

@betatim
Copy link
Member

betatim commented Sep 5, 2016

I was thinking that the way you'd design your hash function would be different if you only have 10 values or if you have 256. I'd be surprised if MH runs faster on byte arrays that only contain 10 different values than on one that contains 256 different values.

@ctb
Copy link
Member Author

ctb commented Sep 5, 2016

On Mon, Sep 05, 2016 at 08:09:51AM -0700, Tim Head wrote:

I was thinking that the way you'd design your hash function would be different if you only have 10 values or if you have 256. I'd be surprised if MH runs faster on byte arrays that only contain 10 different values than on one that contains 256 different values.

Right - but if you hash four DNA characters at a time in an 8-bit string
you're effectively hashing the entire range of 256 characters.

@ctb
Copy link
Member Author

ctb commented Oct 1, 2016

This was a demonstration PR - closing now that #1450 and #1444 are under way.

@ctb ctb closed this Oct 1, 2016
@ctb ctb deleted the hashrefactor_try_murmur branch January 21, 2017 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants