
Comparison between multiple 128-bit and 256-bit non-cryptographic hashes #257

Closed
DonaldTsang opened this issue Sep 21, 2019 · 17 comments

Comments

@DonaldTsang
Author

DonaldTsang commented Sep 21, 2019

@Cyan4973
Owner

While xxh3 has enough margin to produce a 256-bit variant if need be,
it is still a lot of work and a lot of code, with the associated creation and maintenance costs.
And the problem is, I haven't found a use case where this capability would make sense.

There are multiple use cases which benefit from having more bits.

One of them is to use the produced hash as a bitmap, from which bit fields are extracted for various usages. Bloom filters are a good example, because they require many bitfields, and can need more than 64 bits in certain scenarios.
But more than 128 bits? I've never seen that.
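As an illustration of that bitfield use case (my sketch, not from this thread): the two 64-bit halves of a single 128-bit hash are already enough to derive any number of Bloom-filter bit positions via the classic double-hashing trick, assuming the stable XXH3 API (XXH3_128bits() returning XXH128_hash_t).

```c
/* Sketch only: derive k Bloom-filter bit positions from one 128-bit hash,
 * using its two 64-bit halves for double hashing (h1 + i*h2).
 * Assumes the stable XXH3 API: XXH3_128bits() returning XXH128_hash_t. */
#include <stddef.h>
#include <stdint.h>
#include "xxhash.h"

static void bloom_positions(const void* key, size_t len,
                            uint64_t nbits, int k, uint64_t* out)
{
    XXH128_hash_t h  = XXH3_128bits(key, len);
    uint64_t      h1 = h.low64;
    uint64_t      h2 = h.high64 | 1;   /* odd increment avoids short cycles */
    for (int i = 0; i < k; i++)
        out[i] = (h1 + (uint64_t)i * h2) % nbits;
}
```

With this construction, a filter rarely needs more than 128 bits of hash output, which matches the point above.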

Another case is checksumming: assuming perfect collision properties (which is not always the case, see this recent study), the probability of 2 objects generating the same hash is governed by the birthday paradox.
With a 64-bit hash, the probability of 2 files generating the same hash starts to become non-negligible after several million files. Some large scenarios are able to reach that amount.
For a 128-bit hash, getting into the same uncomfortable territory would require thousands of billions of files! This is generally way above any plausible scenario, and therefore I have yet to find a single one where it could matter.
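For reference, the birthday-bound approximation behind this argument (my addition; n items, ideal b-bit hash):

```latex
P(\text{collision}) \;\approx\; 1 - e^{-n(n-1)/2^{b+1}} \;\approx\; \frac{n^2}{2^{b+1}}
\qquad \text{when } n^2 \ll 2^{b}.
```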

The only scenario where 256-bit seems useful is cryptographic applications: here, the larger bit space makes it impossible to "break" a hash (hence to intentionally generate a collision) with a brute-force attack. But for that to be true, one needs a cryptographic-quality hash, that is, one where no known formula is able to "improve" the probability of collision compared to brute force. This is paid for with a corresponding speed cost.
xxHash is a non-cryptographic hash, so it doesn't claim to offer this property.
Blake2 is an excellent crypto hash. And while it's branded "fast" (and yes, it is fast, for a cryptographic hash), it's still many times slower than xxHash.
Different scenarios, different needs.

And that's it. I don't see any scenario that would actually require a non-cryptographic 256-bit hash.

I've read the thread where you mention that your needs are :

if the server has 2^23 files (a small site)? what about 2^28 files (a large media company)?

These quantities are suitable for a 128-bit hash. The chances of collisions are way too low to matter in practice (smaller than 1 / 2^64).
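Plugging those figures into the birthday approximation above (my arithmetic, not part of the original comment):

```latex
% 2^{28} files with a 128-bit hash:
P \approx \frac{(2^{28})^2}{2^{129}} = 2^{-73} \ll 2^{-64}
% the same 2^{28} files with a 64-bit hash:
P \approx \frac{(2^{28})^2}{2^{65}} = 2^{-9} \approx 0.2\%
```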

Cyan4973 self-assigned this Sep 24, 2019
@DonaldTsang
Author

DonaldTsang commented Sep 25, 2019

@Cyan4973 Bloom filters (for text) and ultra-fast non-secure checksumming are indeed some of my intended goals.
But the problem is that nobody has ever made a benchmark comparing 128-bit and 256-bit hashes, so there is a need for that.

DonaldTsang changed the title from "Comparison between multiple 256-bit non-cryptographic hashes" to "Comparison between multiple 128-bit and 256-bit non-cryptographic hashes" Sep 25, 2019
@Cyan4973
Owner

A 256-bit variant is necessarily going to be slower than a 128-bit one, due to the need to mix bits over a wider set of accumulators.

So this speed cost can only be justified if it brings some advantage.

For checksumming, I don't think there is any justifiable scenario where 256-bit offers a uniqueness advantage over 128-bit. Even if a scenario must deal with a thousand billion unique elements, 128-bit is still good enough.

For generation of "any number of bitfields", one must find a case where more than 128 bits are needed. Even in this case, a quick workaround is to rehash the initial 128-bit value to produce another 128 bits to pick from.
It's obviously not the same as a 256-bit hash, and this method would be bad for checksumming, but for the purpose of generating multiple smaller bitfields, it's good enough: any difference will be hard to measure (one could use sha256 or blake2b for a comparison of collision rates).
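A minimal sketch of that rehash workaround, assuming the stable XXH3 API (the names and struct are mine, for illustration only):

```c
/* Sketch of the workaround described above: extend a 128-bit hash to 256 bits
 * of bitfield material by rehashing the first result. Good enough for
 * deriving bitfields, but (as noted) not equivalent to a real 256-bit hash. */
#include <stddef.h>
#include "xxhash.h"

typedef struct { XXH128_hash_t lo, hi; } bits256_t;

static bits256_t hash256_bitfields(const void* data, size_t len)
{
    bits256_t r;
    r.lo = XXH3_128bits(data, len);            /* first 128 bits */
    r.hi = XXH3_128bits(&r.lo, sizeof r.lo);   /* rehash to get 128 more */
    return r;
}
```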

@scottcarey

256 bit or not, having benchmarks for the 128-bit variant alongside the 64-bit one would be nice.

FWIW, using processor-level AES-128 might be competitive -- two rounds are sufficient to produce a good hash that is not cryptographic. (I got the idea from here: https://openjdk.java.net/jeps/8201462)

For anyone wanting a 256-bit hash, using processor-intrinsic AES instructions might be faster than the large number of multiplies, shifts, and rotates that would be needed for an accumulator 256 bits wide -- not necessarily a full official AES hash, but a truncated process with fewer rounds that satisfies the mixing and randomness properties without being cryptographic.
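A rough sketch of that idea, assuming x86 AES-NI (compile with -maes); the constants and structure are illustrative only, not a vetted hash design:

```c
/* Sketch only: use AES-NI rounds as a fast, non-cryptographic 128-bit mixer.
 * Two aesenc rounds per block give good diffusion but make no security claim. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wmmintrin.h>   /* _mm_aesenc_si128 */

static __m128i aes_mix128(const uint8_t* data, size_t len)
{
    __m128i state = _mm_set_epi64x((long long)0x9E3779B97F4A7C15ULL,
                                   (long long)len);
    while (len >= 16) {                       /* absorb full 16-byte blocks */
        __m128i block = _mm_loadu_si128((const __m128i*)data);
        state = _mm_aesenc_si128(_mm_xor_si128(state, block), state);
        data += 16; len -= 16;
    }
    if (len) {                                /* naive tail handling */
        uint8_t tail[16] = {0};
        memcpy(tail, data, len);
        __m128i block = _mm_loadu_si128((const __m128i*)tail);
        state = _mm_aesenc_si128(_mm_xor_si128(state, block), state);
    }
    /* two finalization rounds, as suggested above */
    state = _mm_aesenc_si128(state, _mm_set_epi64x((long long)0xC2B2AE3D27D4EB4FULL,
                                                   (long long)0x165667B19E3779F9ULL));
    state = _mm_aesenc_si128(state, _mm_set_epi64x((long long)0x27D4EB2F165667C5ULL,
                                                   (long long)0x85EBCA77C2B2AE63ULL));
    return state;                             /* 128-bit result */
}
```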

@Cyan4973
Owner

Cyan4973 commented Oct 7, 2019

There will be an update of the benchmark, featuring 128-bit variants.

@Cyan4973
Owner

Cyan4973 commented Oct 8, 2019

There are many 64-bit variants in the list.

Note that, if you are interested in a quick speed comparison between any of these variants, you can also do it directly by plugging them into the benchmark program at https://github.com/Cyan4973/xxHash/tree/dev/tests/bench .

@DonaldTsang
Author

DonaldTsang commented Oct 9, 2019

@Cyan4973 I am referring to the 128-bit variants in the list; the MRX list is broken, as there is no documentation saying which 6 out of the 10 have 128-bit variants.

@Alain2019

Did you run any tests for SpookyHash (V2)?

@ranbenbasat

@Cyan4973 - There are definitely more applications for hashes with a large number of bits in streaming algorithms. I am using multiple 128-bit hashes for several things :).

@Cyan4973
Owner

@Alain2019 SpookyHash is listed in https://github.com/Cyan4973/xxHash/wiki/Performance-comparison

@Alain2019

Thanks a lot, a small correction: SpookyHash is a 128-bit hash, which can be used as 64-bit or even 32-bit.

@Cyan4973
Owner

Cyan4973 commented Jun 23, 2020

Good point.
I've been testing SpookyHash's Hash64() entry point,
but indeed, I see now that there is also a Hash128() symbol available.
They should have similar performance, but that nonetheless deserves a check, just to be sure: sometimes the compiler can pull off pretty clever optimizations upon realizing that only one part of the output is effectively being used.

@Cyan4973
Owner

Cyan4973 commented Aug 2, 2020

This issue has morphed into a request to benchmark a lot of (sometimes experimental) hash algorithms for comparison purposes.

I believe there's a misunderstanding in the purpose of this repository.
Its only goal is to provide the xxhash library, and present it in context, hence compared to a few relatively well known algorithms and reference implementations.

It is not to provide a full comparison exercise of every hash algorithm available.
For such an objective, I would rather recommend a dedicated website, one which focuses mostly or solely on this goal, and whose main author preferably doesn't have their own algorithm in the race, for (hopefully) improved neutrality.

@DonaldTsang
Author

@Cyan4973, in that case, would it be possible to start a Telegram channel, a Discord or IRC group, or even a mailing list for such purposes? If so, who should we invite into the community? All ideas are welcome.

I know that there is an organization already doing this for cryptographic hashes, but we are targeting "fast" hashes.
For reference: https://bench.cr.yp.to/results-hash.html

@Cyan4973
Owner

Cyan4973 commented Aug 3, 2020

Well, that's a difficult question.

The reference you provide on speed evaluation of cryptographic hashes is excellent.
Finding an equivalent source for non-cryptographic hashes would be great, but I don't know of anything close.

@rurban maintains an interesting evaluation of hashes. It's more concentrated on quality, which is fair, as otherwise it would be too easy to evaluate some fast "hash" with terrible quality. But it also collects a few elements about speed. So it could be a nice starting point.

Let's be clear, it's a lot of work. Github may simplify a few important issues, such as creating and maintaining a public web page, but it still requires a lot of effort to create something, and even more to keep it up to date.
