Skip to content

Commit

Permalink
Add hashWangYi1 (#13823)
Browse files Browse the repository at this point in the history
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the  `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.

* This is an alternate resolution to #13393
(which arguably could be resolved outside the stdlib).

Add version1 of Wang Yi's hash specialized to 8 byte integers.  This gives
simple help to users having trouble with overly colliding hash(key)s.  I.e.,
  A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))`
      in the instantiation context of a `HashSet` or `Table`
or
  B) more globally, compile with `nim c -d:hashWangYi1`.

No hash can be all things to all use cases, but this one is A) vetted to
scramble well by the SMHasher test suite (a necessarily limited but far
more thorough test than prior proposals here), B) only a few ALU ops on
many common CPUs, and C) possesses an easy via "grade school multi-digit
multiplication" fall back for weaker deployment contexts.

Some people might want to stampede ahead unbridled, but my view is that a
good plan is to
  A) include this in the stdlib for a release or three to let people try it
     on various key sets nim-core could realistically never access/test
     (maybe mentioning it in the changelog so people actually try it out),
  B) have them report problems (if any),
  C) if all seems good, make the stdlib more novice friendly by adding
     `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1`
     with some `when defined` rearranging so users can `-d:hashIdentity`
     if they want the old behavior back.
This plan is compatible with any number of competing integer hashes if
people want to add them.  I would strongly recommend they all *at least*
pass the SMHasher suite since the idea here is to become more friendly to
novices who do not generally understand hashing failure modes.

* Re-organize to work around `when nimvm` limitations; Add some tests; Add
a changelog.md entry.

* Add less than 64-bit CPU when fork.

* Fix decl instead of call typo.

* First attempt at fixing range error on 32-bit platforms; Still do the
arithmetic in doubled up 64-bit, but truncate the hash to the lower
32-bits, but then still return `uint64` to be the same.  So, type
correct but truncated hash value.  Update `thashes.nim` as well.

* A second try at making 32-bit mode CI work.

* Use a more systematic identifier convention than Wang Yi's code.

* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a
very long time, I think).  `$a`/`$b` depend on iteration order which
varies with table range reduced hash order which varies with range for
some `hash()`.  With 3 elements, 3!=6 is small and we've just gotten
lucky with past experimental `hash()` changes.  An alternate fix here
would be to not stringify but use the HashSet operators, but it is not
clear that doesn't alter the "spirit" of the test.

* Fix another stringified test depending upon hash order.

* Oops - revert the string-keyed test.

* Fix another stringify test depending on hash order.

* Add a better than always zero `defined(js)` branch.

* It turns out to be easy to just work all in `BigInt` inside JS and thus
guarantee the same low order bits of output hashes (for `isSafeInteger`
input numbers).  Since `hashWangYi1` output bits are equally random in
all their bits, this means that tables will be safely scrambled for table
sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the
integer keys are all < 2**53 (also likely fine).  (I'm unsure why the
infidelity with C/C++ back ends cut off is 32, not 53 bits.)

Since HashSet & Table only use the low order bits, a quick corollary of
this is that `$` on most int-keyed sets/tables will be the same in all
the various back ends which seems a nice-to-have trait.

* These string hash tests fail for me locally.  Maybe this is what causes
the CI hang for testament pcat collections?

* Oops. That failure was from me manually patching string hash in hashes.  Revert.

* Import more test improvements from #13410

* Fix bug where I swapped order when reverting the test.  Ack.

* Oh, just accept either order like more and more hash tests.

* Iterate in the same order.

* `return` inside `emit` made us skip `popFrame` causing weird troubles.

* Oops - do Windows branch also.

* `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic
resolutions are "1 == identity", 1 for Nim Version 1, 1 for
first/simplest/fastest in a series of possibilities.  Should be very
easy to remember.)

* Re-organize `when nimvm` logic to be a strict `when`-`else`.

* Merge other changes.

* Lift constants to a common area.

* Fall back to identity hash when `BigInt` is unavailable.

* Increase timeout slightly (probably just real-time perturbation of CI
system performance).
  • Loading branch information
c-blake authored Apr 15, 2020
1 parent 3a2697d commit a0b33f9
Show file tree
Hide file tree
Showing 8 changed files with 107 additions and 12 deletions.
3 changes: 3 additions & 0 deletions changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,9 @@
- `uri` adds Data URI Base64, implements RFC-2397.
- Add [DOM Parser](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser)
to the `dom` module for the JavaScript target.
- The default hash for `Ordinal` has changed to something more bit-scrambling.
`import hashes; proc hash(x: myInt): Hash = hashIdentity(x)` recovers the old
one in an instantiation context while `-d:nimIntHash1` recovers it globally.

## Language changes

Expand Down
5 changes: 2 additions & 3 deletions lib/pure/collections/sets.nim
Original file line number Diff line number Diff line change
Expand Up @@ -334,8 +334,7 @@ proc pop*[A](s: var HashSet[A]): A =
## * `clear proc <#clear,HashSet[A]>`_
runnableExamples:
var s = toHashSet([2, 1])
assert s.pop == 1
assert s.pop == 2
assert [s.pop, s.pop] in [[1, 2], [2,1]] # order unspecified
doAssertRaises(KeyError, echo s.pop)

for h in 0 .. high(s.data):
Expand Down Expand Up @@ -1008,7 +1007,7 @@ when isMainModule and not defined(release):

block toSeqAndString:
var a = toHashSet([2, 7, 5])
var b = initHashSet[int]()
var b = initHashSet[int](rightSize(a.len))
for x in [2, 7, 5]: b.incl(x)
assert($a == $b)
#echo a
Expand Down
73 changes: 71 additions & 2 deletions lib/pure/hashes.nim
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,66 @@ proc `!$`*(h: Hash): Hash {.inline.} =
res = res + res shl 15
result = cast[Hash](res)

proc hiXorLoFallback64(a, b: uint64): uint64 {.inline.} =
let # Fall back in 64-bit arithmetic
aH = a shr 32
aL = a and 0xFFFFFFFF'u64
bH = b shr 32
bL = b and 0xFFFFFFFF'u64
rHH = aH * bH
rHL = aH * bL
rLH = aL * bH
rLL = aL * bL
t = rLL + (rHL shl 32)
var c = if t < rLL: 1'u64 else: 0'u64
let lo = t + (rLH shl 32)
c += (if lo < t: 1'u64 else: 0'u64)
let hi = rHH + (rHL shr 32) + (rLH shr 32) + c
return hi xor lo

proc hiXorLo(a, b: uint64): uint64 {.inline.} =
# Xor of high & low 8B of full 16B product
when nimvm:
result = hiXorLoFallback64(a, b) # `result =` is necessary here.
else:
when Hash.sizeof < 8:
result = hiXorLoFallback64(a, b)
elif defined(gcc) or defined(llvm_gcc) or defined(clang):
{.emit: """__uint128_t r = a; r *= b; `result` = (r >> 64) ^ r;""".}
elif defined(windows) and not defined(tcc):
{.emit: """a = _umul128(a, b, &b); `result` = a ^ b;""".}
else:
result = hiXorLoFallback64(a, b)

proc hashWangYi1*(x: int64|uint64|Hash): Hash {.inline.} =
## Wang Yi's hash_v1 for 8B int. https://github.com/rurban/smhasher has more
## details. This passed all scrambling tests in Spring 2019 and is simple.
## NOTE: It's ok to define ``proc(x: int16): Hash = hashWangYi1(Hash(x))``.
const P0 = 0xa0761d6478bd642f'u64
const P1 = 0xe7037ed1a0b428db'u64
const P58 = 0xeb44accab455d165'u64 xor 8'u64
when nimvm:
cast[Hash](hiXorLo(hiXorLo(P0, uint64(x) xor P1), P58))
else:
when defined(js):
asm """
if (typeof BigInt == 'undefined') {
`result` = `x`; // For Node < 10.4, etc. we do the old identity hash
} else { // Otherwise we match the low 32-bits of C/C++ hash
function hi_xor_lo_js(a, b) {
const prod = BigInt(a) * BigInt(b);
const mask = (BigInt(1) << BigInt(64)) - BigInt(1);
return (prod >> BigInt(64)) ^ (prod & mask);
}
const P0 = BigInt(0xa0761d64)<<BigInt(32)|BigInt(0x78bd642f);
const P1 = BigInt(0xe7037ed1)<<BigInt(32)|BigInt(0xa0b428db);
const P58 = BigInt(0xeb44acca)<<BigInt(32)|BigInt(0xb455d165)^BigInt(8);
var res = hi_xor_lo_js(hi_xor_lo_js(P0, BigInt(`x`) ^ P1), P58);
`result` = Number(res & ((BigInt(1) << BigInt(53)) - BigInt(1)));
}"""
else:
cast[Hash](hiXorLo(hiXorLo(P0, uint64(x) xor P1), P58))

proc hashData*(data: pointer, size: int): Hash =
## Hashes an array of bytes of size `size`.
var h: Hash = 0
Expand Down Expand Up @@ -112,10 +172,19 @@ proc hash*[T: proc](x: T): Hash {.inline.} =
else:
result = hash(pointer(x))

proc hash*[T: Ordinal](x: T): Hash {.inline.} =
## Efficient hashing of integers.
proc hashIdentity*[T: Ordinal](x: T): Hash {.inline.} =
## The identity hash. I.e. ``hashIdentity(x) = x``.
cast[Hash](ord(x))

when defined(nimIntHash1):
proc hash*[T: Ordinal](x: T): Hash {.inline.} =
## Efficient hashing of integers.
hashIdentity(x)
else:
proc hash*[T: Ordinal](x: T): Hash {.inline.} =
## Efficient hashing of integers.
hashWangYi1(uint64(ord(x)))

proc hash*(x: float): Hash {.inline.} =
## Efficient hashing of floats.
var y = x + 0.0 # for denormalization
Expand Down
6 changes: 3 additions & 3 deletions lib/pure/sugar.nim
Original file line number Diff line number Diff line change
Expand Up @@ -291,7 +291,7 @@ macro collect*(init, body: untyped): untyped {.since: (1, 1).} =
let z = collect(initTable(2)):
for i, d in data.pairs: {i: d}

assert z == {1: "word", 0: "bird"}.toTable
assert z == {0: "bird", 1: "word"}.toTable
# analyse the body, find the deepest expression 'it' and replace it via
# 'result.add it'
let res = genSym(nskVar, "collectResult")
Expand Down Expand Up @@ -334,8 +334,8 @@ when isMainModule:

let data = @["bird", "word"] # if this gets stuck in your head, its not my fault
assert collect(newSeq, for (i, d) in data.pairs: (if i mod 2 == 0: d)) == @["bird"]
assert collect(initTable(2), for (i, d) in data.pairs: {i: d}) == {1: "word",
0: "bird"}.toTable
assert collect(initTable(2), for (i, d) in data.pairs: {i: d}) == {0: "bird",
1: "word"}.toTable
assert initHashSet.collect(for d in data.items: {d}) == data.toHashSet

let x = collect(newSeqOfCap(4)):
Expand Down
8 changes: 6 additions & 2 deletions tests/collections/tcollections.nim
Original file line number Diff line number Diff line change
Expand Up @@ -76,8 +76,12 @@ doAssert $(toOrderedSet(["1", "2", "3"])) == """{"1", "2", "3"}"""
doAssert $(toOrderedSet(['1', '2', '3'])) == """{'1', '2', '3'}"""

# Tests for tables
doAssert $({1: "1", 2: "2"}.toTable) == """{1: "1", 2: "2"}"""
doAssert $({"1": 1, "2": 2}.toTable) == """{"1": 1, "2": 2}"""
when defined(nimIntHash1):
doAssert $({1: "1", 2: "2"}.toTable) == """{1: "1", 2: "2"}"""
else:
doAssert $({1: "1", 2: "2"}.toTable) == """{2: "2", 1: "1"}"""
let tabStr = $({"1": 1, "2": 2}.toTable)
doAssert (tabStr == """{"2": 2, "1": 1}""" or tabStr == """{"1": 1, "2": 2}""")

# Tests for deques
block:
Expand Down
5 changes: 4 additions & 1 deletion tests/collections/tcollections_to_string.nim
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,10 @@ doAssert $(toOrderedSet(["1", "2", "3"])) == """{"1", "2", "3"}"""
doAssert $(toOrderedSet(['1', '2', '3'])) == """{'1', '2', '3'}"""

# Tests for tables
doAssert $({1: "1", 2: "2"}.toTable) == """{1: "1", 2: "2"}"""
when defined(nimIntHash1):
doAssert $({1: "1", 2: "2"}.toTable) == """{1: "1", 2: "2"}"""
else:
doAssert $({1: "1", 2: "2"}.toTable) == """{2: "2", 1: "1"}"""
doAssert $({"1": 1, "2": 2}.toTable) == """{"1": 1, "2": 2}"""

# Tests for deques
Expand Down
17 changes: 17 additions & 0 deletions tests/stdlib/thashes.nim
Original file line number Diff line number Diff line change
Expand Up @@ -15,3 +15,20 @@ suite "hashes":
test "0.0 and -0.0 should have the same hash value":
var dummy = 0.0
check hash(dummy) == hash(-dummy)

test "VM and runtime should make the same hash value (hashIdentity)":
const hi123 = hashIdentity(123)
check hashIdentity(123) == hi123

test "VM and runtime should make the same hash value (hashWangYi1)":
const wy123 = hashWangYi1(123)
check hashWangYi1(123) == wy123

test "hashIdentity value incorrect at 456":
check hashIdentity(456) == 456

test "hashWangYi1 value incorrect at 456":
when Hash.sizeof < 8:
check hashWangYi1(456) == 1293320666
else:
check hashWangYi1(456) == -6421749900419628582
2 changes: 1 addition & 1 deletion tests/vm/tslow_tables.nim
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
discard """
timeout: "4"
timeout: "5"
action: "compile"
nimout: '''create
search
Expand Down

0 comments on commit a0b33f9

Please sign in to comment.