# SMHasher
SMHasher's test functions can be broken down into 4 classes:

1. Sanity. Does the hash function work? Does it do bad things with its input?

2. Performance. How fast can the hash function consume input, and how many cycles does it take to complete a hash?

3. Differentials. How often do almost-but-not-quite identical keys produce identical hash values? What is the smallest difference between two keys that can possibly cause the hash function to produce identical values?

4. Keysets. Take a bunch of keys, hash them, and analyze the resulting hash values. How evenly are the hash values spread over the 2^N possible values (I've come up with an interesting metric for this, which you can read about on the FillFactor page)? How often do unrelated keys produce identical hash values, and does this happen more often than statistics would predict?
To address #4, we generate a variety of different hard-to-hash keysets and run our collision & distribution tests against all of them.
## Sanity

Hash functions must not do bad things:
- They must not read bytes before the beginning or past the end of the key they're hashing.
- They should process all bits of the key.
- They should always produce the same hash value given the same key.
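A minimal sketch of the flavor of the first and third checks above, using FNV-1a as a stand-in for the hash under test - this is illustrative, not SMHasher's actual sanity suite. (Out-of-bounds reads are usually caught with guard pages or an address sanitizer rather than in plain code.)

```cpp
// Illustrative sanity checks, not SMHasher's actual suite.
#include <cstdint>
#include <cstdio>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    uint8_t key[16];
    for (size_t i = 0; i < sizeof(key); i++) key[i] = (uint8_t)i;

    // The same key must always produce the same hash value.
    if (fnv1a(key, sizeof(key)) != fnv1a(key, sizeof(key)))
        printf("FAIL: hash is not deterministic\n");

    // Every byte of the key should affect the hash value.
    const uint32_t base = fnv1a(key, sizeof(key));
    for (size_t i = 0; i < sizeof(key); i++) {
        key[i] ^= 0xFF;  // corrupt one byte
        if (fnv1a(key, sizeof(key)) == base)
            printf("FAIL: byte %zu of the key is ignored\n", i);
        key[i] ^= 0xFF;  // restore it
    }
    printf("sanity checks done\n");
    return 0;
}
```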
## Performance

Applications that use hash functions to generate semi-unique identifiers for large blocks of data care most about the bulk speed of a hash function, while applications that use hash functions to implement efficient hash tables for short keys care most about the total time needed to hash a key. SMHasher tests both:
To determine how fast a hash function can consume data, we create a large block of random data and measure how long the function takes to hash it. Due to caching, interrupts, and other sources of noise on an average PC, we repeat the test a large number of times to ensure we get a clean run.

To determine how fast a hash function can completely hash a key (including function call overhead, code setup overhead, the hashing itself, finalization, and any other source of latency), we generate a large number of tiny keys, measure the number of cycles needed to hash each of them over a large number of runs, and track the fastest time.
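A minimal timing sketch of those two measurements, again with FNV-1a standing in for the hash under test. The block size, run counts, and std::chrono timing are illustrative choices, not SMHasher's actual cycle-counting harness:

```cpp
// Illustrative timing harness; SMHasher itself counts cycles.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    using clock = std::chrono::steady_clock;
    volatile uint32_t sink = 0;  // keeps the compiler from discarding the hashes

    // Bulk speed: hash a large block repeatedly, keep the fastest (cleanest) run.
    std::vector<uint8_t> block(256 * 1024);
    for (size_t i = 0; i < block.size(); i++) block[i] = (uint8_t)(i * 31 + 7);
    double best = 1e30;
    for (int run = 0; run < 100; run++) {
        auto t0 = clock::now();
        sink = fnv1a(block.data(), block.size());
        double s = std::chrono::duration<double>(clock::now() - t0).count();
        if (s < best) best = s;
    }
    printf("bulk: %.1f MB/s\n", block.size() / best / 1e6);

    // Small-key latency: fastest time to hash a tiny key, amortized over a batch.
    uint8_t tiny[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    best = 1e30;
    for (int run = 0; run < 1000; run++) {
        auto t0 = clock::now();
        for (int i = 0; i < 1000; i++) {
            tiny[0] = (uint8_t)i;  // vary the key so the hash isn't hoisted out
            sink = fnv1a(tiny, sizeof(tiny));
        }
        double s = std::chrono::duration<double>(clock::now() - t0).count();
        if (s < best) best = s;
    }
    printf("tiny key: %.1f ns/hash\n", best / 1000 * 1e9);
    return 0;
}
```

Keeping the fastest of many runs, rather than the average, is what filters out the noise from caching and interrupts.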
## Differentials

For all possible small N-bit subsets of an M-bit key, generate 1000 random key pairs that differ in only those N bits and see if any of them hash to the same value. If so, those N bits tend to cancel out inside the hash function and generate more collisions than expected.
SMHasher tests all 5-bit differentials for 64-bit keys, all 4-bit differentials for 128-bit keys, and all 3-bit differentials for 256-bit keys. The number of differentials is very large (millions) and we do 1000 tests per differential, so in the case of a 32-bit hash value we're quite likely to see random collisions.
Because of this, SMHasher reports only collisions that occur more than once - seeing two or more random collisions on the same differential is so astronomically unlikely that these multi-collisions are almost certainly the sign of a bad differential that the hash function handles poorly.
SuperFastHash fails this test badly - a large number of N-bit patterns can cause collisions with high (over 50% in some cases) probability.
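A scaled-down sketch of the differential test - all 2-bit differentials of a 32-bit key rather than millions of differentials, with FNV-1a as the stand-in hash:

```cpp
// Scaled-down differential test: 2-bit differentials of 32-bit keys.
#include <cstdint>
#include <cstdio>
#include <random>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    std::mt19937 rng(12345);
    // Enumerate every 2-bit subset of a 32-bit key.
    for (int i = 0; i < 32; i++) {
        for (int j = i + 1; j < 32; j++) {
            const uint32_t diff = (1u << i) | (1u << j);
            int collisions = 0;
            for (int t = 0; t < 1000; t++) {
                uint32_t a = rng();
                uint32_t b = a ^ diff;  // pair differs in exactly these bits
                if (fnv1a(&a, 4) == fnv1a(&b, 4)) collisions++;
            }
            // Repeated collisions on one differential are a strong red flag.
            if (collisions > 1)
                printf("bad differential %08x: %d collisions\n", diff, collisions);
        }
    }
    return 0;
}
```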
## Avalanche

Do all key bits affect all hash bits equally? Does flipping a key bit cause each hash bit to flip with a 50/50 probability?
For an N-bit hash function, SMHasher generates 2 million keys of N*2 bits each and flips each bit of each key, comparing the hash of the original key against the hash of the flipped-bit key. For each pair of key bit and output bit, it computes how far the output bit's flip probability is from an ideal 50/50, and reports the worst deviation, or 'bias', it finds - 0% indicates a perfect 50/50 probability, while 100% indicates that the output bit either never flipped or always flipped.

Due to random fluctuations we generally won't see exactly 50/50 even with truly random inputs, but 2 million tests is enough to measure bias down to about 0.25%.
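A minimal sketch of the bias measurement, scaled down to 100,000 trials over 64-bit keys with the 32-bit stand-in hash:

```cpp
// Scaled-down avalanche/bias measurement: 64-bit keys, 32-bit stand-in hash.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    const int TRIALS = 100000;
    std::mt19937_64 rng(1);
    static long long flips[64][32] = {};  // [key bit][hash bit] flip counts

    for (int t = 0; t < TRIALS; t++) {
        uint64_t key = rng();
        uint32_t h0 = fnv1a(&key, 8);
        for (int i = 0; i < 64; i++) {
            uint64_t k2 = key ^ (1ull << i);  // flip one key bit
            uint32_t d = h0 ^ fnv1a(&k2, 8);  // which hash bits flipped?
            for (int j = 0; j < 32; j++) flips[i][j] += (d >> j) & 1;
        }
    }

    // Bias: 0% is a perfect 50/50; 100% means the bit never or always flips.
    double worst = 0;
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 32; j++)
            worst = std::max(worst, fabs(flips[i][j] / (double)TRIALS - 0.5) * 2);
    printf("worst bias: %.2f%%\n", worst * 100);
    return 0;
}
```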
## Keyset 'Cyclic'

Generate 10 million keys consisting of N repetitions of M bytes, hash them, and test their collision & distribution properties. If identical blocks of the key tend to cancel out inside the hash function, this should produce a lot more collisions than expected.
These keysets break MurmurHash2, and were the motivation for creating MurmurHash3.
These keysets are similar to the ones generated by Bob Jenkins' 'frog.c' test code.
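A sketch of how such a keyset can be generated, scaled down to 100,000 keys of four repetitions of an 8-byte random block; the counts and sizes here are illustrative:

```cpp
// Scaled-down cyclic keyset: 100,000 keys, each 4 repetitions of 8 random bytes.
#include <cstdint>
#include <cstdio>
#include <random>
#include <set>
#include <vector>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    const int M = 8, N = 4, KEYS = 100000;
    std::mt19937 rng(7);
    std::set<uint32_t> seen;
    int collisions = 0;
    for (int c = 0; c < KEYS; c++) {
        std::vector<uint8_t> block(M), key;
        for (auto& b : block) b = (uint8_t)rng();
        for (int r = 0; r < N; r++)  // repeat the same block N times
            key.insert(key.end(), block.begin(), block.end());
        if (!seen.insert(fnv1a(key.data(), key.size())).second) collisions++;
    }
    printf("%d hash collisions among %d cyclic keys\n", collisions, KEYS);
    return 0;
}
```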
## Keyset 'Sparse'

Generate all N-bit keys with only M bits set, where N >> M. Very, very hard to hash both quickly and well.
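A sketch of sparse key generation for one concrete case - all 64-bit keys with exactly 3 bits set (41,664 keys), with the stand-in hash as before:

```cpp
// Sparse keyset, one concrete case: all 64-bit keys with exactly 3 bits set.
#include <cstdint>
#include <cstdio>
#include <set>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    std::set<uint32_t> seen;
    int keys = 0, collisions = 0;
    for (int a = 0; a < 64; a++)
        for (int b = a + 1; b < 64; b++)
            for (int c = b + 1; c < 64; c++) {
                const uint64_t key = (1ull << a) | (1ull << b) | (1ull << c);
                keys++;
                if (!seen.insert(fnv1a(&key, 8)).second) collisions++;
            }
    printf("%d hash collisions among %d sparse keys\n", collisions, keys);
    return 0;
}
```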
## Keyset 'Permutation'

Breaking a key into blocks and hashing the blocks in parallel is a good way to increase performance, but a hash function must ensure that changing the order of the blocks in the key also changes the resulting hash value.
The permutation keyset tests this by creating ten small, sparse blocks of key data and generating all ~3.6 million possible keys that are permutations of those blocks.
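A sketch of the idea, scaled down to six 4-byte blocks (720 orderings) instead of ten:

```cpp
// Scaled-down permutation keyset: six 4-byte blocks, all 720 orderings.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <set>
#include <vector>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    std::vector<uint32_t> blocks = {0, 1, 2, 4, 8, 16};  // small, sparse blocks
    std::sort(blocks.begin(), blocks.end());
    std::set<uint32_t> seen;
    int keys = 0, collisions = 0;
    do {
        uint8_t key[24];
        memcpy(key, blocks.data(), sizeof(key));  // concatenate in this order
        keys++;
        if (!seen.insert(fnv1a(key, sizeof(key))).second) collisions++;
    } while (std::next_permutation(blocks.begin(), blocks.end()));
    printf("%d hash collisions among %d permutation keys\n", collisions, keys);
    return 0;
}
```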
## Keyset 'Window'

This is more of a sanity check than a difficult keyset to hash. Given an N-bit hash, generate all possible N*2-bit keys that have a contiguous 20-bit-wide "window" of bits set to all 2^20 possible values.
If nearby bits in the key tend to cancel each other out in the hash function, this should produce a lot more random collisions than expected.
Because this test generates an enormous amount of data (N*2-20 keysets, each containing 1 million keys) and each keyset is relatively easy to hash (it's just a set of contiguous integers left-shifted some amount), we skip the distribution tests and just report collision results.
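A sketch of the windowed keyset for 64-bit keys, sliding the 20-bit window across every offset and reporting only collision counts:

```cpp
// Windowed keyset for 64-bit keys: a 20-bit window slid across every offset.
#include <cstdint>
#include <cstdio>
#include <set>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    const int WINDOW = 20;
    for (int shift = 0; shift <= 64 - WINDOW; shift++) {
        std::set<uint32_t> seen;
        int collisions = 0;
        for (uint32_t v = 0; v < (1u << WINDOW); v++) {
            const uint64_t key = (uint64_t)v << shift;  // window at this offset
            if (!seen.insert(fnv1a(&key, 8)).second) collisions++;
        }
        // Only collision counts are reported; distribution tests are skipped.
        printf("shift %2d: %d collisions\n", shift, collisions);
    }
    return 0;
}
```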
## Keyset 'Text'

This is a convenient place for users to create keyset patterns that match those found in their own applications.
Given a prefix string P, a suffix string S, a set of characters C, and a length value L, generate all possible keys of the form P+(L characters from C)+S and test their collision + distribution properties.
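A sketch of the generation scheme; the prefix, suffix, character set, and length here are arbitrary examples:

```cpp
// Text keyset: all keys of the form P + (L chars from C) + S.
#include <cstdint>
#include <cstdio>
#include <set>
#include <string>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    const std::string P = "key_", S = "_end", C = "abcdefgh";
    const int L = 4;  // |C|^L = 4096 middle sections
    std::set<uint32_t> seen;
    int keys = 0, collisions = 0;
    int idx[L] = {};  // odometer over the L middle characters
    for (;;) {
        std::string key = P;
        for (int i = 0; i < L; i++) key += C[idx[i]];
        key += S;
        keys++;
        if (!seen.insert(fnv1a(key.data(), key.size())).second) collisions++;
        int i = 0;  // advance the odometer; stop once it wraps all the way
        while (i < L && ++idx[i] == (int)C.size()) idx[i++] = 0;
        if (i == L) break;
    }
    printf("%d hash collisions among %d text keys\n", collisions, keys);
    return 0;
}
```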
## Keyset 'Zeroes'

Hash functions should produce different outputs even if the keys they're hashing consist of nothing but zeroes - as long as those keys are all of different lengths.
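A sketch of the zeroes keyset, scaled down to lengths 0 through 8192:

```cpp
// Zeroes keyset, scaled down: all-zero keys of length 0 through 8192.
#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

// Stand-in hash (FNV-1a); substitute the function under test.
static uint32_t fnv1a(const void* k, size_t n) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    std::vector<uint8_t> zeroes(8192, 0);
    std::set<uint32_t> seen;
    int collisions = 0;
    // Hash every prefix length of the all-zero buffer, including length 0.
    for (size_t len = 0; len <= zeroes.size(); len++)
        if (!seen.insert(fnv1a(zeroes.data(), len)).second) collisions++;
    printf("%d hash collisions among all-zero keys\n", collisions);
    return 0;
}
```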
## Keyset 'Seed'

A hash function that accepts a 'seed' or 'nonce' value is quite useful when we intentionally want a single key to generate multiple hash values in a repeatable way. Ideally, every bit of the seed should affect every bit of the hash function's output just as if it were a part of the key we were hashing.
We treat this in the same way as we would a keyset test, except we use a constant key ("The quick brown fox...") and use the integers from 0 to 1 million as our seed values.
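A sketch of the seed test. Note the stand-in hash here takes a seed parameter, mixed into the FNV initial state as an illustrative choice; real seeded hashes take their seeds in their own ways:

```cpp
// Seed test: one fixed key hashed under a million different seeds.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <set>

// Stand-in seeded hash: FNV-1a with the seed XORed into the initial state.
static uint32_t fnv1a_seeded(const void* k, size_t n, uint32_t seed) {
    const uint8_t* p = (const uint8_t*)k;
    uint32_t h = 2166136261u ^ seed;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

int main() {
    const char* key = "The quick brown fox jumps over the lazy dog";
    std::set<uint32_t> seen;
    int collisions = 0;
    for (uint32_t seed = 0; seed < 1000000; seed++)
        if (!seen.insert(fnv1a_seeded(key, strlen(key), seed)).second)
            collisions++;
    printf("%d hash collisions across 1000000 seeds\n", collisions);
    return 0;
}
```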