Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storing reversible hashes for k>32 #1442

Closed
betatim opened this issue Sep 9, 2016 · 7 comments
Closed

Storing reversible hashes for k>32 #1442

betatim opened this issue Sep 9, 2016 · 7 comments
Labels

Comments

@betatim
Copy link
Member

betatim commented Sep 9, 2016

Place for thoughts and progress on representing reversible hash values of (arbitrarily) large k-mers.

This is one of the things needed for #1441.

First idea from @camillescott is to switch from using a single int64 to a struct of integers.

@standage
Copy link
Member

standage commented Sep 9, 2016

Another idea discussed was using the 128 bit integer implementation (__int128) discussed here.

Pros:

  • sticking with a "native" data type (on GCC and Clang at least)
  • could support quick, reversible hashing up to k=64 without perturbing the codebase too much

Cons:

  • cannot support arbitrary k
  • may be issues with non-standard compilers, OSes
  • there may be good reasons to support different hash functions anyway, so this would just be kicking the can down the road

Probably not the best idea in the long term, but worth bringing up.

@betatim
Copy link
Member Author

betatim commented Sep 12, 2016

Some experiments in this gist. This uses a int128 and int256 implementation I found by googling. They were useful to quickly get started. Having 33-mers seems to "just work" if you have a bigger int type available.

The second thing I investigated is how to transport these big integers into python land. Python's integer type has no limit and you can create them from an array of bytes _PyLong_FromByteArray (which despite underscore seems to be fairly stable). The reverse _PyLong_AsByteArray also exists. On the pure python side there is int.from_bytes and int.to_bytes which use these.

After this digging around I think implementing the big-ints with a byte array instead of a pair of smaller integers is the way to go given we need to make those arrays for going to/from python.

@betatim
Copy link
Member Author

betatim commented Sep 12, 2016

My only worry about using a byte array is that it might be slower?? I will make a small benchmark pitting the uint128_t from the gist against one based on an array to answer that.

Other remark: int.{to,from}_bytes is not available in legacy python. The C API however has been there since way back when the tim bot was still active.

@ctb
Copy link
Member

ctb commented Sep 12, 2016

blinks Well, this looks pretty awesome... What's the performance hit?

@betatim
Copy link
Member Author

betatim commented Sep 12, 2016

I don't know yet ;) but I might have just achieved one of the tests passing after switching _hash to use a class that uses a byte array to store stuff.

@ctb
Copy link
Member

ctb commented Feb 16, 2017

in #1519 and friends => too slow. #wontfix

@ctb ctb closed this as completed Feb 16, 2017
@ctb
Copy link
Member

ctb commented Feb 16, 2017

(#1511 added irreversible hashing for k > 32)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants