regression: hashes makes tables 100x slower on some inputs, eg oids #11581
Comments
see also #11767 (comment) in which I show the multibyte jenkins can give a 5X slowdown for inputs as small as
I'm working on this; there are some things that need to be fixed first.
Hopefully it won't conflict too much with the #11767 PR, which also touches hashes.nim.
My implementation uses the 32-bit version of Murmur3, and it already works in the VM. The only remaining problem is the JS backend.
This should be fixed when #12022 is merged.
Just a note: currently the overall "best" known test suite is https://github.com/rurban/smhasher (and even better with rurban/smhasher#62). As can be seen there, murmur3 is moderately slow though very portable. Feel free to reconsider the choice.
@dumblob no need to cross-post the same comment; if you do, at least mention that you've posted the same in #12022 (comment). Yes, there are better hashes available, depending on what dimensions you care about. However:
Most importantly, it's already implemented. I don't think we should reconsider; we should merge @narimiran's PR (once ready) because it solves an urgent, serious performance regression. This doesn't mean the hash function can't be changed down the line though (we shouldn't promise hash compatibility across Nim releases, that would be anti-performance); it just needs someone to write a PR that shows a better tradeoff. Ideally, it'd be pluggable so users can experiment easily with alternative hashes.
Personally, I like wyhash, which is also simple and about
summary
#11203 (/cc @narimiran) introduced a performance regression where tables (and anything that uses hashes, e.g. StringTable etc.) can become 100X slower (see tests below). This is the case, for example, when using oids as keys.
details
The original implementation in hashes.nim (although not mentioned in the code) used the Jenkins one-at-a-time hash function (https://en.wikipedia.org/wiki/Jenkins_hash_function), which (as its name suggests) takes 1 byte at a time and updates the hash.
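For reference, a minimal sketch of that scheme using the stdlib's mixing operators (the pre-#11203 string hash in hashes.nim was essentially this):

```nim
# minimal sketch of the one-at-a-time scheme: every input byte feeds a `!&`
# mixing step, and `!$` finalizes the accumulated hash
import hashes

proc hashOneAtATime(x: string): Hash =
  var h: Hash = 0
  for c in x:
    h = h !& ord(c)   # mix in one byte
  result = !$h        # finalize (avalanche the accumulated bits)
```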
This hash algorithm doesn't behave well at all when n bytes are consumed at a time, as done in #11203. The reason is that, depending on the length of the input strings, the avalanche effect is lost and some input bits have no effect on some output bits of the hash. This can be seen by looking at the hashImpl code closely: e.g. if x is of length 16, we'd have just 2 calls to !&, and the low bits of the hash are independent of a majority of the input bits.
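A simplified sketch of what that multibyte scheme looks like (this is not the exact hashImpl code; the unaligned cast is revisited in the side note at the bottom):

```nim
# simplified sketch of the multibyte scheme from #11203: sizeof(Hash) input
# bytes are consumed per `!&` call, so a 16-byte key on a 64-bit build gets
# only 2 mixing steps before finalization
import hashes

proc hashMultibyteSketch(x: string): Hash =
  var h: Hash = 0
  var i = 0
  while i + sizeof(Hash) <= x.len:
    let n = cast[ptr Hash](unsafeAddr x[i])[]  # read sizeof(Hash) bytes at once
    h = h !& n
    inc i, sizeof(Hash)
  while i < x.len:        # remaining tail bytes, one at a time
    h = h !& ord(x[i])
    inc i
  result = !$h
```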

proposal: use murmur3 instead of current hash (jenkins)
I implemented murmur3 in https://github.com/timotheecour/vitanim/blob/master/murmur/murmur.nim and IMO it is preferable to the current hash, see rationale below.
(note: there are already packages for murmur3, but they're old and they don't implement the efficient multibyte MurmurHash3_x64_128.)
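For reference, a sketch of the standard MurmurHash3_x86_32 mixing (the 32-bit variant; shown here only to illustrate the structure of the published algorithm, not the code from murmur.nim or the pending PR):

```nim
# MurmurHash3_x86_32: 4-byte blocks are multiplied, rotated and mixed into a
# running 32-bit state; leftover bytes form a tail block; a final "fmix"
# stage forces avalanche
proc murmurHash3_x86_32(s: string, seed: uint32 = 0): uint32 =
  const c1 = 0xcc9e2d51'u32
  const c2 = 0x1b873593'u32
  var h = seed
  let nblocks = s.len div 4
  # body: consume 4 bytes (little-endian) per round
  for i in 0 ..< nblocks:
    var k = uint32(ord(s[4*i])) or (uint32(ord(s[4*i+1])) shl 8) or
            (uint32(ord(s[4*i+2])) shl 16) or (uint32(ord(s[4*i+3])) shl 24)
    k = k * c1
    k = (k shl 15) or (k shr 17)   # rotl 15
    k = k * c2
    h = h xor k
    h = (h shl 13) or (h shr 19)   # rotl 13
    h = h * 5 + 0xe6546b64'u32
  # tail: 0..3 leftover bytes
  let tail = nblocks * 4
  let rem = s.len and 3
  var kTail = 0'u32
  if rem >= 3: kTail = kTail xor (uint32(ord(s[tail+2])) shl 16)
  if rem >= 2: kTail = kTail xor (uint32(ord(s[tail+1])) shl 8)
  if rem >= 1:
    kTail = kTail xor uint32(ord(s[tail]))
    kTail = kTail * c1
    kTail = (kTail shl 15) or (kTail shr 17)
    kTail = kTail * c2
    h = h xor kTail
  # finalization (fmix32)
  h = h xor uint32(s.len)
  h = h xor (h shr 16)
  h = h * 0x85ebca6b'u32
  h = h xor (h shr 13)
  h = h * 0xc2b2ae35'u32
  h = h xor (h shr 16)
  result = h
```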

analysis: entropy and key collision
A reasonable measure of how good a hash is for a class of input keys is entropy, as defined by sum_i -p(i) log(p(i)), where p(i) is the normalized collision count. I'm measuring this entropy in https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0128.nim along with the max number of collisions. It shows that, e.g., input keys of the form align($number, padding) behave badly (low entropy, high max collision count) when padding is 8 thru 11 or 16 thru 19, etc. Ditto when using oids, which also behave badly with the current hash. Murmur3, in contrast, consistently produces entropy values very close to the ideal entropy of a uniform distribution, as shown in that test code, for all cases.
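For concreteness, a sketch of how such an entropy score can be computed from per-hash-value collision counts (this is not the code from t0128.nim, just the formula above written out):

```nim
# entropy of a collision-count distribution: p(i) is the normalized count of
# keys that landed on hash value i, and the score is sum_i -p(i) * ln(p(i));
# a perfectly uniform distribution maximizes it
import hashes, math, tables

proc collisionEntropy[K](counts: CountTable[K]): float =
  var total = 0
  for c in counts.values: total += c
  for c in counts.values:
    let p = c.float / total.float
    result -= p * ln(p)

when isMainModule:
  var counts = initCountTable[Hash]()
  for i in 0 ..< 1000:
    counts.inc hash($i)        # bucket keys by their hash value
  echo collisionEntropy(counts)
```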
analysis: performance of hashing, and of key insertion and retrieval
This test file https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0129.nim compares the performance of 3 hash functions.
It computes performance in 2 build settings: -d:danger vs regular builds.
It also computes performance in 2 key settings:
- -d:case_long_random_key: keys produced by getRandomString (pkg/sysrandom)
- oids, which are problematic for the current multibyte jenkins because of their length and common prefix

The numbers (see that test file) show that murmur3 is always either the fastest or on par with the fastest in all cases, whereas the current multibyte jenkins degrades badly on oids inputs.
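For illustration, a rough timing sketch along these lines (this is not the t0129.nim harness; randomKey below is a hypothetical stand-in for getRandomString from pkg/sysrandom, and absolute numbers depend on the Nim version and build flags):

```nim
# compare table insertion time for oid-string keys vs random keys of the
# same length; on a regressed hash the oid case is dominated by collisions
import oids, tables, times, random

proc randomKey(n: int): string =
  result = newString(n)
  for i in 0 ..< n:
    result[i] = chr(rand(ord('a') .. ord('z')))

proc timeInserts(keys: seq[string]): float =
  var t = initTable[string, int]()
  let start = cpuTime()
  for i, k in keys:
    t[k] = i
  result = cpuTime() - start

when isMainModule:
  var oidKeys, rndKeys: seq[string]
  for i in 0 ..< 100_000:
    oidKeys.add $genOid()
  for i in 0 ..< 100_000:
    rndKeys.add randomKey(oidKeys[0].len)
  echo "oid keys:    ", timeInserts(oidKeys), " s"
  echo "random keys: ", timeInserts(rndKeys), " s"
```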
conclusion
side note: unaligned pointer
hashImpl uses n = cast[ptr Hash](unsafeAddr x[i])[]; isn't that undefined behavior? See https://www.viva64.com/en/w/V1032/, which mentions either undefined behavior or a potential performance penalty. I didn't actually run into an issue related to that alignment aspect (and actually varying the offset sPos in hash*(sBuf: string, sPos, ePos: int) gave 0 performance difference in my tests), but that could be machine specific. Indeed, when compiling the generated C file manually with clang -c -Wcast-align (and removing -w), we get a cast-align warning. Note that D's hash.d takes alignment into account before the cast for its multibyte hash, see https://github.com/dlang/druntime/blob/000b0e8862a21d2754ae1389da4134dc6fea12c2/src/core/internal/hash.d#L691
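For reference, one alignment-safe alternative is to copy the bytes into a local variable instead of dereferencing a possibly misaligned pointer (a sketch, not a proposed patch to hashes.nim):

```nim
# read sizeof(Hash) bytes starting at x[i] without a potentially misaligned
# `ptr Hash` dereference; copyMem (i.e. memcpy) has no alignment requirement
# on the source pointer
import hashes

proc readHashWord(x: string, i: int): Hash =
  assert i + sizeof(Hash) <= x.len   # caller handles the tail bytes separately
  copyMem(addr result, unsafeAddr x[i], sizeof(Hash))
```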