Hash function can't handle leading zeroes #6136

mjoud · 2017-07-23T16:08:09Z

import hashes

var
  a = @[42]
  b = @[0, 42]
  c = @[0, 0, 0, 0, 0, 0, 42]

echo a.hash == b.hash  # true
echo a.hash == c.hash  # true

I don't know the significance of this, the sets and tables modules also takes length into consideration, but still.

The text was updated successfully, but these errors were encountered:

Araq · 2017-07-24T17:56:36Z

definitely weird but collisions can always happen.

krux02 · 2017-07-25T00:36:05Z

when you know the hash function you can always construct collisions. Collisions can't be prevented. I would close this issue until shown that it is relevant.

andreaferretti · 2017-07-25T07:51:08Z

I don't agree. While the default hash functions are not cryptographic hashes, it is bad to be able to construct collisions trivially. This can lead to impredictable slowdonws when using tables, or even denial of service. Moreover the solution is easy (just take length into account)

bluenote10 · 2017-07-25T08:00:41Z

Wouldn't it be as simple as adding result = result !& x.len here (and maybe something similar for the two functions below).

mjoud · 2017-07-27T23:49:14Z

So I did some reading and it appears that this was a very hot topic a couple a years back, after this presentation showing vulnerabilities in hashing algorithms used in multiple languages and libraries.

Nim is currently using Jenkins's One-at-a-time-hash function which, AFAICT, does not have known vulnerabilities. There is a problem, however, that the implementation is not seeded and it is relatively easy to generate malicious json files, for example, to mount a hash DoS-attack. Moreover, the algorithm operates on one byte at a time which makes unnecessarily slow compared to other algorithms.

The Python, Rust, Ruby and Perl 5 languages all use the SipHash algorithm. g++ uses the vulnerable MurmurHash2 in std::hash (used in std::unordered_map), for some reason. Ruby and Rust have transitioned from SipHash-2-4 to the faster SipHash-1-3, and Python is in progress. Perl 5 is transitioning to other hash functions. Interestingly, Perl until now used a hardened variant of the Jenkins's hash function, created after this 2003 bug report, when Perl was in a similar state as Nim is in now: using Jenkins's without seeding.

If I understand correctly, all the mentioned languages (except g++) use a random seed (read from /dev/urandom or similar) for hashes, generated at process start. Rust additionally uses a second seed (or "salt"/"pepper") for every unique hash table.

SipHash operate on 64 bits at a time and the hash value is uint64. Jenkins's is 32 bit, so currently the upper 32 bits are unused on 64-bit platforms. There are other very fast hashes out there, such as xxHash which has 32 and 64-bit variants.

Maybe it's time to step up and be proactive in this issue. A reasonable strategy would be to

use a faster and more recent hash function (Siphash/xxHash? See links below)
create a seed at process start (module-level)
perhaps make it possible for the user to seed every new hash table (bool useSeed, or even user-provided value defaulting to 0, no seed?)

Just my 2 cents.

Some random links:

Araq · 2017-07-28T07:53:00Z

SipHash sounds fine but an initial seed means these cannot be .noSideEffect (well we can trick the type checker of course) which requires a careful tradeoff. Also the compiler uses the same algorithm for string case statement hashings. In order to generate deterministic C code we cannot use a seed there at all.

demerphq · 2017-09-22T12:43:31Z

One-At-A-Time-Hashing is broken from a security perspective. Since it is 32 bit it is trivial to generate arbitrary sets of colliding keys by pure brute force. Even worse, the way you guys use it, with a 0 seed, is completely insecure, as constructing arbitrary sets of collisions is as trivial as prefixing any string with an arbitrary set of null bytes. Adding a seed does not save you, as the last byte of the string is not properly mixed by Jenkins OAAT, which allows a seed discovery attack in less 2^32 operations. (The Perl core development team has code that discovers the seed for a seed Jenkins OAAT in an eyeblink.)

Even the "hardened" version of Jenkins OAAT that Perl has used for some years now is obsolete. (Using a 32 bit seed mixed with the length, combined with an additional random suffix added to each key.) It is well within the capabilities of a logic solver to find collisions.

I don't fully understand your use-case, but there are definitely better options available depending on what you want to do. But I am quite confident that what you are doing now is about the worst thing you could do.

FWIW, Perl has switched to using either Stadtx Hash, or Zaphod32 Hash, optionally combined with tabular hashing for short strings. Either way we seed the hash function randomly at process start. This has provided both a securty AND performance boost. One-At-A-Time hashes are not very efficient.

Yves (aka demerphq).

demerphq · 2017-09-22T12:49:07Z

@bluenote10: mixing the length into the hash state before hashing will help a bit, as it will ensure that the state is non-zero when the mixing function starts. However it is basically polishing a turd, any unseeded hash with a 32 bit state can be brute forced for collisions on a modern CPU in seconds.

data-man · 2017-10-29T18:45:35Z

@mjoud

var
  a = @[42]
  b = @[0, 42]
  c = @[0, 0, 0, 0, 0, 0, 42]

echo a.hash
echo b.hash
echo c.hash

echo a.metroHash64
echo b.metroHash64
echo c.metroHash64

echo a.metroHash128
echo b.metroHash128
echo c.metroHash128

Ouput:

12871827045
12871827045
12871827045
2C17D80E6D0C2C41
6C685027F3404212
2C74D09BA49AC6F5
9AED6F1544CAD0691A66717F491E0798
BBB8668BA727F00CCE8EF3F67E6AD372
3D4A94FE5FC44338D2C761B40CBCF7F4

Benchmarks for @[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]:

Hash function	relative	time/iter	iters/s
GlobalBenchmark		5.25ns	190.32M
defaultHash		521.14ns	1.92M
metrohash64Context	186.95%	278.76ns	3.59M
metrohash64	131.88%	395.16ns	2.53M
metrohash128Context	181.16%	287.67ns	3.48M
metrohash128	102.46%	508.62ns	1.97M

ghost · 2017-10-29T20:01:20Z

@data-man yeah, I'll close this after your PR will be merged :)

data-man · 2017-10-29T20:16:36Z

@Yardanico
Then metrohash.nim should be in lib/pure.

data-man · 2017-10-29T20:19:36Z

And I did not try it for the JS backend. Maybe it will not work there.

ghost · 2017-10-29T20:20:11Z

@data-man ah, yes, it would probably not work since you're using unsafe casts and pointers

data-man · 2017-10-29T20:46:10Z

Hmm, Hash (== int) size on x32 systems also 32-bit.

demerphq · 2017-10-30T14:24:02Z

Sorry for top posting. First the names you are using below are a bit ambiguous as to which variants of metrohash you are actually using here. For what it is worth all metrohash functions fail the SAC under some conditions and the CRC variants have known multicollision attacks. What criteria did you use for your selection and which metrohash functions do these names map to? Yves

…

On 29 Oct 2017 19:45, "Dmitry Atamanov" ***@***.***> wrote: @mjoud <https://github.com/mjoud> var a = @[42] b = @[0, 42] c = @[0, 0, 0, 0, 0, 0, 42] echo a.hashecho b.hashecho c.hash echo a.metroHash64echo b.metroHash64echo c.metroHash64 echo a.metroHash128echo b.metroHash128echo c.metroHash128 Ouput: 12871827045 12871827045 12871827045 2C17D80E6D0C2C41 6C685027F3404212 2C74D09BA49AC6F5 9AED6F1544CAD0691A66717F491E0798 BBB8668BA727F00CCE8EF3F67E6AD372 3D4A94FE5FC44338D2C761B40CBCF7F4 Benchmarks for @[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]: Hash function relative time/iter iters/s GlobalBenchmark 5.25ns 190.32M defaultHash 521.14ns 1.92M metrohash64Context 186.95% 278.76ns 3.59M metrohash64 131.88% 395.16ns 2.53M metrohash128Context 181.16% 287.67ns 3.48M metrohash128 102.46% 508.62ns 1.97M — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#6136 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAGexy2bvoJ2AiDjNhsGrbxkMaUnCfUeks5sxMfXgaJpZM4OgfsK> .

data-man · 2017-10-30T14:59:58Z

@demerphq

What criteria did you use for your selection

SMhasher results, speed, simplicity, MIT License.

which metrohash functions do these names map to?

See original C++ sources: MetroHash

Benchmarking C version xxHash and my metroHash on pure Nim : benchHashes.nim

demerphq · 2017-10-31T14:43:34Z

On 30 October 2017 at 16:00, Dmitry Atamanov ***@***.***> wrote: @demerphq <https://github.com/demerphq> What criteria did you use for your selection SMhasher <https://github.com/rurban/smhasher> results, speed, simplicity, MIT License.

Reini's SMHasher fork is a direct descendant of the original Google Murmur-Hash test suite. IIRC, It only tests 32bit hashes, and it does not test seeding, and it uses some pretty dodgy math for some of its tests. You may want to look at my fork https://github.com/demerphq/smhasher which is coded to a) use proper math for distribution tests b) test seeding properly c) test all output bits of the hash d) time seeding the hash function independently from hashing (many applications can seed once at process start and do not need to pay the seeding time twice). It also includes more hash functions.

which metrohash functions do these names map to? See original C++ source: MetroHash <https://github.com/jandrewrogers/MetroHash>

The original source includes 6 hash functions. It was not clear from what you posted which of the six your names correspond to: - 64-bit hash functions, "metrohash64_1" and "metrohash64_2" - 128-bit hash functions, "metrohash128_1" and "metrohash128_2" - 128-bit hash functions using CRC instructions, "metrohash128crc_1" and "metrohash128crc_2" There are also CRC variants of the 64 bit versions and there are pure C versions too. See the reports: https://github.com/demerphq/smhasher/tree/master/doc Specifically but not limited to: https://github.com/demerphq/smhasher/blob/master/doc/metrohash64_1.64.64.64.txt https://github.com/demerphq/smhasher/blob/master/doc/metrohash64_2.64.64.64.txt https://github.com/demerphq/smhasher/blob/master/doc/metrohash64crc_1.64.64.64.txt https://github.com/demerphq/smhasher/blob/master/doc/metrohash64crc_2.64.64.64.txt https://github.com/demerphq/smhasher/blob/master/doc/metrohash128_1.64.64.128.txt https://github.com/demerphq/smhasher/blob/master/doc/metrohash128_2.64.64.128.txt https://github.com/demerphq/smhasher/blob/master/doc/metrohash128crc_1.64.64.128.txt https://github.com/demerphq/smhasher/blob/master/doc/metrohash128crc_2.64.64.128.txt You can see a summary here: https://github.com/demerphq/smhasher/blob/master/doc/summary.txt The pertinent part of the summary: cmetrohash64_1 64 64 64 FAILED 31 of 183 tests. Avalanche: 17-41, OverAll: 183, Seed: 169, 171, 173, 175, 177 cmetrohash64_1o 64 64 64 FAILED 31 of 183 tests. Avalanche: 17-41, OverAll: 183, Seed: 169, 171, 173, 175, 177 cmetrohash64_2 64 64 64 FAILED 31 of 183 tests. Avalanche: 17-41, OverAll: 183, Seed: 169, 171, 173, 175, 177 metrohash128_1 64 64 128 FAILED 2 of 183 tests. Avalanche: 17, 40 metrohash128_2 64 64 128 FAILED 1 of 183 tests. Avalanche: 17 metrohash128crc_1 64 64 128 FAILED 19 of 183 tests. Avalanche: 17-24, Crc-MultiCollision: 81-90, OverAll: 183 metrohash128crc_2 64 64 128 FAILED 21 of 184 tests. Avalanche: 17-24, 38, 42, Crc-MultiCollision: 82-91, OverAll: 184 metrohash64_1 64 64 64 FAILED 5 of 183 tests. Avalanche: 17, 19, 21, 41, OverAll: 183 metrohash64_2 64 64 64 FAILED 6 of 183 tests. Avalanche: 17-19, 21, 41, OverAll: 183 metrohash64crc_1 64 64 64 FAILED 17 of 183 tests. Avalanche: 17, 19, 21, 41, Crc-MultiCollision: 81-90, Cyclic: 42, 52, OverAll: 183 metrohash64crc_2 64 64 64 FAILED 17 of 183 tests. Avalanche: 17, 19, 21, 41, Crc-MultiCollision: 81-90, Cyclic: 42, 52, OverAll: 183 A graph of the "Metrohash Family" performance at various key lengths is available in the doc/graph directory. https://github.com/demerphq/smhasher/blob/master/doc/graph/MetroHash_Family.medium.svg Avalanche 17 is testing how well mixed a seed is on the 0 length string. Every variant of metrohash fails this test, and thus leaks data about the seed when hashing an empty string. Failing more than Avalanche-17 is very bad. Failing Avalanche-17 alone is bad, but not fatal, depending on other factors, but failing any of the other avalanche tests is a serious matter. The CRC variants are of course faster, and completely insecure, as they allow multi-collision attacks on the hash function. (Meaning it is trivial to precompute, as I have done, an infinite set of colliding strings which will break any CRC metrohash variant independent of how it was seeded.) I haven't properly licensed my hash functions, but you can assume I am fine with using any of them under the MIT license. I will update my repository when I get a chance. Good luck folks! This is a complex subject with lots of trade offs, and making a decision is not so easy. If you were able to provide a good description of your use cases and requirements I would be happy to make a recommendation! cheers, Yves

…

-- perl -Mre=debug -e "/just|another|perl|hacker/"

Araq · 2020-07-27T18:36:06Z

The hash function was changed.

andreaferretti added the Standard Library label Jul 25, 2017

mjoud mentioned this issue Oct 29, 2017

WIP: Experimental using metrohash in the compiler #6600

Closed

mjoud mentioned this issue Dec 26, 2018

Predictable hash collisions for int64/uint64 values #10097

Closed

Araq closed this as completed Jul 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hash function can't handle leading zeroes #6136

Hash function can't handle leading zeroes #6136

mjoud commented Jul 23, 2017 •

edited

Loading

Araq commented Jul 24, 2017

krux02 commented Jul 25, 2017

andreaferretti commented Jul 25, 2017

bluenote10 commented Jul 25, 2017

mjoud commented Jul 27, 2017

Araq commented Jul 28, 2017

demerphq commented Sep 22, 2017 •

edited

Loading

demerphq commented Sep 22, 2017

data-man commented Oct 29, 2017

ghost commented Oct 29, 2017

data-man commented Oct 29, 2017

data-man commented Oct 29, 2017

ghost commented Oct 29, 2017

data-man commented Oct 29, 2017

demerphq commented Oct 30, 2017 via email

data-man commented Oct 30, 2017 •

edited

Loading

demerphq commented Oct 31, 2017 via email

Araq commented Jul 27, 2020

Hash function can't handle leading zeroes #6136

Hash function can't handle leading zeroes #6136

Comments

mjoud commented Jul 23, 2017 • edited Loading

Araq commented Jul 24, 2017

krux02 commented Jul 25, 2017

andreaferretti commented Jul 25, 2017

bluenote10 commented Jul 25, 2017

mjoud commented Jul 27, 2017

Araq commented Jul 28, 2017

demerphq commented Sep 22, 2017 • edited Loading

demerphq commented Sep 22, 2017

data-man commented Oct 29, 2017

ghost commented Oct 29, 2017

data-man commented Oct 29, 2017

data-man commented Oct 29, 2017

ghost commented Oct 29, 2017

data-man commented Oct 29, 2017

demerphq commented Oct 30, 2017 via email

data-man commented Oct 30, 2017 • edited Loading

demerphq commented Oct 31, 2017 via email

Araq commented Jul 27, 2020

mjoud commented Jul 23, 2017 •

edited

Loading

demerphq commented Sep 22, 2017 •

edited

Loading

data-man commented Oct 30, 2017 •

edited

Loading