Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regression: hashes makes tables 100x slower on some inputs, eg oids #11581

Closed
timotheecour opened this issue Jun 25, 2019 · 8 comments · Fixed by #12022
Closed

regression: hashes makes tables 100x slower on some inputs, eg oids #11581

timotheecour opened this issue Jun 25, 2019 · 8 comments · Fixed by #12022

Comments

@timotheecour
Copy link
Member

timotheecour commented Jun 25, 2019

summary

#11203 (/cc @narimiran) introduced a performance regression where tables (and anything that uses hashes, eg StringTable etc) can become 100X slower (see tests below). This is the case for example when using oids's as keys.

details

The original implementation in hashes.nim (although not mentioned in the code) was using the jenkins one at a time hash function (https://en.wikipedia.org/wiki/Jenkins_hash_function), which (as its name suggests) takes 1 byte at a time and updates the hash.

This hash algorithm doesn't behave well at all when using n bytes at a time as done in #11203, for example:

echo hash("5d0fe4:a12345678") and 16383 # 5165
echo hash("5d0fe4:c1234xxxx") and 16383 # 5165
# ditto when changing `xxxx` to anything

The reason is, depending on the length of the input strings, the resulting avalanche effect is gone, and some input bits have no effect on some output bits of the hash when we process n bytes at a time, this can be shown by looking at the hashImpl code closely; eg if x is of length 16, we'd have just 2 calls to !& and the low bits of the hash are independent of a majority of the input bits.

proposal: use murmur3 instead of current hash (jenkins)

I implemented murmur3 in https://github.com/timotheecour/vitanim/blob/master/murmur/murmur.nim and IMO it is preferable to current hash, see rationale below.
(note: there are already packages for murmur3 but they're old and they don't implement the efficient multibyte MurmurHash3_x64_128)

analysis: entropy and key collision

A reasonable measure of how good a hash is for a class of input keys is entropy; as defined by sum_i -p(i) log(p(i)) where p(i) is the normalized collision count. I'm measuring this entropy here https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0128.nim
along with max number of collisions. It shows that for eg, input keys of the form align($number, padding) behave badly (low entropy, high max collision count) when padding is 8 thru 11 or 16 thru 19, etc.
Ditto when using oids, which also behave badly with current hash

Murmur3, in contrast, consistently produces entropy values very close to the ideal entropy of a uniform distribution, as shown in that test code, for all cases.

analysis: performance of hashing, and of key insertion and retrieval

This test file https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0129.nim compares performance of 3 hash functions:

  • old hash function (jenkins one at a time), as existed before that PR faster hashing #11203
  • current hash function (jenkins per with several bytes at a time)
  • proposed hash (murmur3)

It computes performance in 2 settings: -d:danger vs regular builds

It also computes performance in 2 key settings:

  • long random keys -d:case_long_random_key as produced by getRandomString (pkg/sysrandom)
  • oids, which are problematic for current multibyte jenkins because of their length and common prefix

the numbers (see that test file) show that murmur3 is always either the fastest or on par with the fastest in all cases, whereas:

  • multibyte jenkins is 100X slower for oids inputs
  • classical jenkins (old nim hash) is ~2X slower than both murmur3 and current hash for long random keys, because it operates per byte instead of multibyte, which affects hashing performance

conclusion

  • murmur3 produces good hash characteristics (randomness) in all cases, yet is as efficient as current hash since it can be computed n bytes at a time
  • One caveat is making it work at compile time, because of current VM limitations; this could be made a magic as a workaround so it'd also work in VM
  • alternatives I considered: modifying the current multibyte jenkins to avoid these problematic cases; this might work for current problematic cases but might have other collision issues for other classes of input keys, due to lack of good mixing; also it'd be non-standard as opposed to murmur3 which is widely used and standard and known to be performant.

side note: unaligned pointer

hashImpl uses n = cast[ptr Hash](unsafeAddr x[i])[] ; isn't that undefined behavior? see https://www.viva64.com/en/w/V1032/ which mentions either undefined behavior or potential performance penalty;
I didn't actually run into an issue related to that alignment aspect (and actually varying the offset sPos in hash*(sBuf: string, sPos, ePos: int) gave 0 performance difference in my tests) but that could be machine specific; and indeed, when compiling manually the C file generated by this:

let s = "abcdefghijabcdefgh"
hash(s, 1, high(s))

passing in clang -c -Wcast-align (and removing -w) we get a warning:

warning: cast from 'NIM_CHAR *' (aka 'char *') to 'NI *' (aka 'long long *') increases required alignment from 1 to 8 [-Wcast-align]

note that D's hash.d takes into account alignment before cast for multibyte hash, see https://github.com/dlang/druntime/blob/000b0e8862a21d2754ae1389da4134dc6fea12c2/src/core/internal/hash.d#L691

@timotheecour timotheecour changed the title regression: hashes makes tables 100x slower on some inputs regression: hashes makes tables 100x slower on some inputs, eg oids Jul 17, 2019
@timotheecour
Copy link
Member Author

timotheecour commented Jul 17, 2019

see also #11767 (comment) in which I show the multibyte jenkins can give a 5X slowdown for inputs as small as uint32, compared to both murmur3 as well as the prior bytewiseHashing

@narimiran
Copy link
Member

I'm working on this, there are some things that need to be fixed first.

@timotheecour
Copy link
Member Author

timotheecour commented Jul 18, 2019

hopefully won't conflict too much with #11767 PR that also touches hashes.nim
feel free to reuse/adapt/benchmark against my nim implementation in https://github.com/timotheecour/vitanim/blob/master/murmur/murmur.nim (which is faster than other nim implementations i found as it's based on MurmurHash3_x64_128)
for the dealing with CT code, I ended up using VM callbacks which IMO is the simplest:

  • no need for writing extra code like
    # we cannot cast in VM, so we do it manually
    to work around VM limitations
  • probably faster since execution at compile time will use native code instead of vm interpreter

@narimiran
Copy link
Member

My implementation uses 32-bit version of Murmur3, and it already works in VM. The only remaining problem is JS backend.

@narimiran
Copy link
Member

This should be fixed when #12022 is merged.

@dumblob
Copy link

dumblob commented Aug 27, 2019

Just a note - currently overall "best" known tests are to be found in https://github.com/rurban/smhasher (and even better with rurban/smhasher#62 ). As can be seen, murmur3 is moderately slow though very portable. Feel free to reconsider the choice.

@timotheecour
Copy link
Member Author

timotheecour commented Aug 27, 2019

@dumblob no need to cross-post the same comment, if you do at least mention that you've posted the same in #12022 (comment)

yes there are better hashes available, depending on what dimensions you care about. However: murmu3 is a good tradeoff between:

  • simplicity, no external dependency
  • efficiency
  • entropy / likelihood of collisions in common cases

Most importantly, it's already implemented. I don't think we should reconsider, we should merge @narimiran 's PR (once ready) because it solves an urgent serious performance regression. This doesn't mean the hash function can't be changed down the line though (we shouldn't promise hash compatibility across nim releases, that would be anti-performance), it just needs someone to write a PR that shows a better tradeoff. Ideally, it'd be pluggable so users can experiment easily with alternative hashes.
As mentioned here #12022 (comment), one tricky aspect when exploring alternative hashes is js backend, as for the vm backend can just reuse c backend via VM callback.

@c-blake
Copy link
Contributor

c-blake commented Aug 27, 2019

Personally, I like wy hash which is also simple and about T_hash =~ 0.2 cycle/byte + 7 cycles (i.e. good small key performance). I have a 30 line impl in cligen/strUt.nim. There are issues/complexity to make it portable, though (getting the 128-bit product of 2 64-bit numbers), and I don't do the small key optimization in that 30 line impl. I'm not against Timothee's murmur. Sounds like anything is better than current situation, even the old bytewise version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants