Add `hashWangYi1` #13823

c-blake · 2020-03-31T19:12:23Z

This is an alternate resolution to #13393 (which arguably could be resolved outside the stdlib).

Add version1 of Wang Yi's hash specialized to 8 byte integers. This gives simple help to users having trouble with overly colliding `hash(key)`s. I.e.,
A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))` in the instantiation context of a `HashSet` or `Table`
or
B) more globally, compile with `nim c -d:hashWangYi1`.

No hash can be all things to all use cases, but this one is A) vetted to scramble well by the SMHasher test suite (a necessarily limited but far more thorough test than prior proposals here), B) only a few ALU ops on many common CPUs, and C) possesses an easy fallback via "grade school multi-digit multiplication" for weaker deployment contexts.
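The "grade school" fallback in C) can be sketched roughly as below. This is a Python model, not the PR's Nim code: it assembles the 128-bit product of two 64-bit values from four 32-bit partial products, which is the idea a target without a native wide multiply would rely on. The names `mul64_grade_school` and `hi_xor_lo` are hypothetical, chosen only for this illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def mul64_grade_school(a, b):
    """Return the (hi, lo) 64-bit halves of the 128-bit product a * b."""
    # Split each 64-bit operand into two 32-bit "digits".
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    # Four partial products, each of which fits in 64 bits.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # Middle column: carry out of ll plus the low digits of the cross terms.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    lo = ((mid & MASK32) << 32) | (ll & MASK32)
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi & MASK64, lo

def hi_xor_lo(a, b):
    """Fold the 128-bit product to 64 bits by XORing its two halves."""
    hi, lo = mul64_grade_school(a, b)
    return hi ^ lo
```

On hardware with a 64x64 -> 128 multiply the whole thing collapses to one instruction plus an XOR, which is where the "few ALU ops" in B) comes from.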

Some people might want to stampede ahead unbridled, but my view is that a good plan is to:
A) include this in the stdlib for a release or three to let people try it on various key sets nim-core could realistically never access/test (maybe mentioning it in the changelog so people actually try it out),
B) have them report problems (if any),
C) if all seems good, make the stdlib more novice friendly by adding `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1` with some `when defined` rearranging so users can `-d:hashIdentity` if they want the old behavior back.

This plan is compatible with any number of competing integer hashes if people want to add them. I would strongly recommend they all at least pass the SMHasher suite since the idea here is to become more friendly to novices who do not generally understand hashing failure modes.

stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794 Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`.


c-blake · 2020-03-31T19:43:58Z

I should perhaps have said, if it isn't obvious, that if users do have a secret key, they can merge that into the hash in some instantiation context like:

import hashes, sets, mysecret
proc hash(x: int): Hash =
  hashWangYi1(x + mysecret.secret)

Things can be made stronger than that using new secrets/salt on some per-resize/per-data instance basis as in https://github.com/c-blake/adix (which can even use the Linux getrandom, and Windows may well have something similar).

The best attack resilience requires more invasive changes, though..actually detecting and then mitigating the attack somehow. It's probably wisest to get experience with that in adix for a while before trying to harden the stdlib hash tables.

juancarlospaco · 2020-03-31T20:59:21Z

Needs Tests, and static: Tests because you say it works on NimVM.

c-blake · 2020-03-31T21:02:56Z

static: where, exactly? I will work on tests tomorrow. Any comment about the general ideas/approach?

Araq · 2020-03-31T21:06:49Z

I like it but IMO we can already make it the new default and emulate the old hashing via -d:nimV1hash or something.


Araq · 2020-04-02T08:45:21Z

lib/pure/hashes.nim

+  const P0 = 0xa0761d6478bd642f'u64
+  const P1 = 0xe7037ed1a0b428db'u64
+  const P5x8 = 0xeb44accab455d165'u64 xor 8'u64
+  Hash(hiXorLo(hiXorLo(P0, uint64(x) xor P1), P5x8))


This should be a cast[Hash], I think.

Ok. I'll try that. Yeah I was getting i386/32-bit errors in the CI but there was no line number in the range exception. So, I was kind of guessing my way around. Thanks for the pointer! Pushed an attempt to fix. We'll see how it goes.

Well, that cast instead of convert seems to have fixed the Linux i386.
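To make the cast-versus-convert point concrete, here is a Python model of the function in the diff above. It assumes `hiXorLo(a, b)` is the high half of the 128-bit product XORed with the low half, as in the JS snippet quoted later in this thread. `to_signed` and `checked_convert` are hypothetical helpers: the first stands in for a bit-reinterpreting `cast`, the second for a range-checked conversion.

```python
MASK64 = (1 << 64) - 1
P0 = 0xa0761d6478bd642f
P1 = 0xe7037ed1a0b428db
P5X8 = 0xeb44accab455d165 ^ 8

def hi_xor_lo(a, b):
    """High 64 bits XOR low 64 bits of the 128-bit product a * b."""
    prod = a * b
    return (prod >> 64) ^ (prod & MASK64)

def hash_wang_yi1(x):
    """Model of the hash in the diff; the result is an unsigned 64-bit value."""
    return hi_xor_lo(hi_xor_lo(P0, (x & MASK64) ^ P1), P5X8)

def to_signed(value, bits):
    """Reinterpret the low `bits` bits as two's complement (models a cast)."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def checked_convert(value, bits):
    """Range-checked conversion to a signed `bits`-bit integer."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise OverflowError(f"{value} does not fit in {bits} signed bits")
    return value
```

A 64-bit unsigned hash usually falls outside the signed 32-bit range, so a checked conversion raises while the reinterpretation simply keeps the low bits. That matches the range exception seen on i386 and the fix reported above.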

Any ideas why tests/closure/tclosure.nim would be failing on the CI? It works for me locally on a Linux_amd64 system, but seems to stop between the end of block doNotation and block tforum. My guess would be block fib50 since that engages the new hash(), but I don't get an assertion error from the CI.

Well, though that tclosure.nim failure had repeated quite a few times it now appears to have worked and Linux i386 passes all tests.

The testament nim pcat collections seems to be infinite looping holding up other people's CI's, and I am unsure why, and I do not see how to kill my CI jobs in the UI. So you should kill them which I think you (or someone) also did yesterday.

Also, before merging this we still need to decide on a defined(js) branch. It seems the arithmetic that JS does produces all-zero output hashes with hiXorLoFallback64. Over at https://github.com/c-blake/adix in althash.nim, I have a hiXorLoFallback32 in 32-bit arithmetic that gives bit identical hashes with the C backend, but different hashes with the JS backend (not sure why, but may be fixable). It also uses 16 multiplications and so is probably kinda slow.

Never mind Re killing jobs. I guess they timeout at 90 minutes. We do need to figure out why they infinite loop, though, and also decide about defined(js).

I'm investigating.

Ok. Thanks!

I found a decent answer for the JS part that gives the same function (in low 32-bits anyway, which is what sets & tables care about). Might not be hard to extend that to 53 bits if people complain. Getting all the way to 64 bits will require changing the backend to use BigInt rather than Number for int64/uint64, but I doubt anyone is using even 4 gigaEntry int-keyed tables on the JS backend. That'd be at least 64 GiB.
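The 53-bit limit mentioned above comes from JS `Number` being an IEEE-754 double. Python floats use the same format, so the precision loss can be sketched in Python. This is only an illustration of the number format, not of the JS backend's actual code paths; `is_safe_integer` is a hypothetical stand-in for JS `Number.isSafeInteger` restricted to integer inputs.

```python
P0 = 0xa0761d6478bd642f
P1 = 0xe7037ed1a0b428db
MAX_SAFE_INTEGER = 2**53 - 1

def is_safe_integer(n):
    """True when the integer n survives a round trip through a double."""
    return abs(n) <= MAX_SAFE_INTEGER

# Just past 2**53, consecutive integers are no longer distinguishable.
rounded = int(float(2**53 + 1))

# A product of two 64-bit constants needs ~128 bits; a double keeps only 53.
exact = P0 * P1
approx = int(float(P0) * float(P1))
lost_bits = exact != approx
```

Here `rounded` comes back as `2**53` and `lost_bits` is true, which is why a full 64-bit result needs `BigInt` (or multi-limb arithmetic) rather than plain `Number` math.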

very long time, I think). `$a`/`$b` depend on iteration order which varies with table range reduced hash order which varies with range for some `hash()`. With 3 elements, 3!=6 is small and we've just gotten lucky with past experimental `hash()` changes. An alternate fix here would be to not stringify but use the HashSet operators, but it is not clear that doesn't alter the "spirit" of the test.

guarantee the same low order bits of output hashes (for `isSafeInteger` input numbers). Since `hashWangYi1` output bits are equally random in all their bits, this means that tables will be safely scrambled for table sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the integer keys are all < 2**53 (also likely fine). (I'm unsure why the infidelity with C/C++ back ends cut off is 32, not 53 bits.) Since HashSet & Table only use the low order bits, a quick corollary of this is that `$` on most int-keyed sets/tables will be the same in all the various back ends which seems a nice-to-have trait.
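A small Python sketch of why agreement in the low 32 bits is enough for tables of up to 2**32 slots, assuming (as stated above) that the table reduces a hash with a power-of-two mask. `slot` and `agree_up_to` are hypothetical names for this illustration.

```python
LOW32 = 0xFFFFFFFF

def slot(h, table_size):
    """Slot for hash h in a power-of-two sized table (hash AND mask)."""
    assert table_size > 0 and table_size & (table_size - 1) == 0
    return h & (table_size - 1)

def agree_up_to(full_hash, truncated_hash, max_log2):
    """True if both hashes pick the same slot for every size 2**0 .. 2**max_log2."""
    return all(
        slot(full_hash, 1 << k) == slot(truncated_hash, 1 << k)
        for k in range(max_log2 + 1)
    )

full = 0xFFFFFFFFFFFFFFFF   # stand-in for a 64-bit hash from the C backend
truncated = full & LOW32    # stand-in for a backend that keeps only 32 bits
same_slots = agree_up_to(full, truncated, 32)
```

Beyond 2**32 slots the mask starts reading bits the truncated hash no longer has, so the two can diverge; for this particular `full` value they do at size 2**33.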


c-blake · 2020-04-04T13:47:08Z

I doubt this has to do with the tests/closure/tclosure.nim test weirdness but #13410 also added some registerCallback calls to compiler/vmops.nim. Do we want one of those there for this PR?

Also, looking at that vmops.nim code, the res = cast[int32](res) in hashVmImpl may well be the cause of keeping only the low 32 bits instead of low 53 bits. I doubt it matters any year soon, though. By the time it does Nim will hopefully be using BigInt in the JS backend more thoroughly.

c-blake · 2020-04-09T12:23:45Z

@narimiran - it looks like your itertools runnableExamples tests for groupBy & unique depend upon hash order. So, you probably want to adapt them with sorted or sortedPairs along the lines of how #13418 changed tests/collections/ttablesthreads.nim or else if you would like to stay depending on order conditionalize them on the new defined(nimIntHash1). Both make the example code more ornate, as do the asserts. I guess you could also lift that stuff into some external tests. I think the sorting way is probably best, but it's your judgement call.
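The "sorting way" can be sketched as follows. This is a Python stand-in, not the itertools package's API: `unique` here is a hypothetical helper whose result, like a Nim `HashSet`, has no guaranteed iteration order.

```python
def unique(items):
    """Deduplicate via a set; iteration order of the result is unspecified."""
    return set(items)

result = unique([3, 1, 3, 2, 1])

# Fragile: comparing str(result) against a literal bakes in one iteration order.
# Robust: sort first, or compare by membership.
assert sorted(result) == [1, 2, 3]
assert result == {3, 2, 1}
```

Either assertion keeps passing if the underlying hash function changes, which is the property the adapted tests need.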

c-blake · 2020-04-09T13:03:11Z

As for the nimsl package failure..That seems to be specific to the Javascript test where nim js -r --path:$HOME/pkg/nim/variant -d:nodejs nimsl/private/var_decls.nim results in:

$HOME/pkg/nim/nimsl/nimsl/private/var_decls.nim(71, 11) template/generic instantiation of `setVar` from here
$HOME/pkg/nim/nimsl/nimsl/private/var_decls.nim(31, 26) template/generic instantiation of `getTypeId` from here
/usr/lib64/nim/lib/pure/hashes.nim(109, 5) Error: cannot generate VM code for asm "      function hi_xor_lo_js(a, b) {\n        var prod = BigInt(a) * BigInt(b);\n        var mask = (BigInt(1) << BigInt(64)) - BigInt(1);\n        return (prod >> BigInt(64)) ^ (prod & mask);\n      }\n      const P0  = BigInt(0xa0761d64)<<BigInt(32)|BigInt(0x78bd642f);\n      const P1  = BigInt(0xe7037ed1)<<BigInt(32)|BigInt(0xa0b428db);\n      const P58 = BigInt(0xeb44acca)<<BigInt(32)|BigInt(0xb455d165) ^ BigInt(8);\n      var res   = hi_xor_lo_js(hi_xor_lo_js(P0, BigInt("x

The error message is cut off (interestingly in a source-whitespace neutral way), but in compiler/vmgen.nim it seems nothing would come after it anyway. When I just nim js -r -d:nodejs tests/collections/tcollections.nim it works fine. Do you or maybe @Araq have any ideas why this compile failure might happen? Thanks!

c-blake · 2020-04-09T13:19:18Z

As for the third and final nimble packages failure, zero-functional -- I am not sure what's wrong at all. When I nim c -r test.nim in a git checkout of that package I get "OK" for all the tests. The error message almost looks like the problem is related to how testament is running the tests for it, not the hash change.

narimiran · 2020-04-10T05:40:06Z

> it looks like your itertools runnableExamples tests for groupBy & unique depend upon hash order

Indeed they were. I just pushed a fix for that.

c-blake · 2020-04-10T13:23:10Z

Cool. Thanks, @narimiran.

Any idea about the compiler crash on VM code generation or what about the nimsl compile environment might tickle that vs. not being tickled in other situations?

(And if you are following the other thread on the original bug report, my conclusion is this proposed hash function has higher quality output at lower CPU time cost and lower assembly size than all others mentioned, where the output is being measured with something like 500+ statistical tests..So, virtually indistinguishable from random. I doubt it'll ever fail us except in targeted, direct attacks where anything without secrets can fail.)

c-blake · 2020-04-14T13:58:59Z

BigInt is on by default in Node >=10.4 (marked "stable" on April 11, 2013) while the CI uses 8.17.0. So, I just fall back to the identity when BigInt is undefined. If you want to keep the same low-order bit hash values on ancient JS platforms, it is possible but we would need a large, ugly quad precision 128-bit product in 32-bit arithmetic impl in the JS branch. I think that's probably overzealous. This is likely a very rare deployment in practice these days and hashIdentity served Nim as an ok default for 12 years. Asking someone with "slowness problems" of any kind to upgrade their JS runtime would be prudent regardless of what we do here, and once they did upgrade it should have BigInt and use the new hash.

Also, I misinterpreted the CI. zero-functional was not failing and passes fine.

It seems that the only failing test is now INim, and that failure is a failure to import noise (another nimble package in its requirements list). If I locally nimble install noise and run his tests it works fine, though. Also, INim does not actually use sets, tables, or hashes. So, I am pretty sure this CI failure is some version skew/environmental thing, and I'd suspect all current PR CIs are failing with it. So, if you are ok with ancient JS falling back to identity then I think this is ready to merge.

narimiran · 2020-04-14T16:01:07Z

> BigInt is on by default in Node >=10.4 ~~(marked "stable" on April 11, 2013)~~ while the CI uses 8.17.0.

@Araq @alaviss, is there a reason why we use Node 8.x in our CIs? (Btw, according to this table, it reached end of life at the end of 2019.)
Can we update to a newer release, e.g. 12.x?

And just a minor correction: @c-blake, the crossed-out part above is about 0.10.x, not 10.x, if I'm reading the table linked above correctly.

c-blake · 2020-04-14T16:19:24Z

Oops. You are right, @narimiran. BigInt is more like a 2018 innovation. From your table, though it seems like 10.x is the oldest long term support version. So, the question of how old support we need/want remains. Sorry/thanks.

alaviss · 2020-04-14T17:25:35Z

> why we use Node 8.x in our Cis?

Because that's how it was set up when I ported everything to Azure Pipelines. I guess this depends on what version of JS we want to support.

c-blake · 2020-04-14T19:28:02Z

Note that, presently, this PR will work either with the old version or with a newer one, just giving a different hash order across them and just testing for the existence of BigInt.

What NodeJS version in general on the CI sounds like a broader question. Just my two cents, but given that < 10.x is no longer 'well supported' "at home"/"off Azure", it's likely easier on everyone trying to get CIs to work to bump that version.


c-blake · 2020-04-15T13:05:16Z

All checks have passed now. (I made tests/vm/tslow_tables.nim more robust with a 25% increase in the timeout).

narimiran · 2020-04-15T13:43:32Z

lib/pure/collections/sets.nim

@@ -1008,7 +1007,7 @@ when isMainModule and not defined(release):

    block toSeqAndString:
      var a = toHashSet([2, 7, 5])
-      var b = initHashSet[int]()
+      var b = initHashSet[int](rightSize(a.len))


Is this change still needed? Does the example work without it?

That change is necessary because toHashSet also uses rightSize for efficiency (to prevent doubling-up however many times). If you drop the assert you could drop the rightSize. If you compare the output sets by membership, not by string/$ which depends upon hash order then you could also drop the rightSize. OTOH, if you are considering it a "good example" then rightSize is good style since you start with the right size of an output table which is always good style if you know it (and sadly not often provided by hash table libraries).

Oh, also it's part of a block toSeqAndString which, at least to me, made it seem more correct to keep the stringification but make the HashSet table size the same, not whatever the default initial size is. "Hash order" is really a concept inherently relative to a given hash table size since it's hash() and mask. So, if you did want to drop the $s we would also probably want to change the block title. But I think it's fine as-is.
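The point that "hash order" is relative to table size can be illustrated with a toy open-addressing model in Python. It is a sketch under simplifying assumptions: an identity hash (the old default mentioned in this thread), power-of-two sizes, linear probing, and no resizing. `iteration_order` is a hypothetical name, and the sizes 4 and 8 are illustrative, not what `rightSize` would return.

```python
def iteration_order(keys, table_size, hash_fn=lambda k: k):
    """Insert keys into a toy linear-probing table and list them in slot order.

    Assumes table_size is a power of two larger than len(keys).
    """
    mask = table_size - 1
    slots = {}
    for k in keys:
        i = hash_fn(k) & mask
        while i in slots:           # linear probing on collision
            i = (i + 1) & mask
        slots[i] = k
    return [slots[i] for i in sorted(slots)]

small = iteration_order([2, 7, 5], 4)   # mask 3: 2 -> slot 2, 7 -> 3, 5 -> 1
large = iteration_order([2, 7, 5], 8)   # mask 7: 2 -> slot 2, 7 -> 7, 5 -> 5
```

Here `small` is `[5, 2, 7]` and `large` is `[2, 5, 7]`: the same keys and the same hash give different stringification order once the table sizes differ, which is why the test sizes both sets identically.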

The change to sugar.nim is, however, not strictly necessary but the sugar-internal consistency might be considered "better style".

c-blake · 2020-04-15T18:18:43Z

You can also re-close #11764 if you want.


* test for issue #15624 and PR #15915 for patch #13823
* Update thashes.nim no need mention PR #15915, fixed in #15937
* rebase to devel(issue maybe fixed), ignore outputs
* Apply suggestions from code review

Co-authored-by: flywind <43030857+xflywind@users.noreply.github.com>


c-blake added 5 commits March 31, 2020 09:25

Fix data.len -> dataLen problem.

0433f4b

Merge /u/cb/pkg/nim/Nim-devel into devel

5dc4b0d

Merge branch 'devel' of https://github.com/Araq/Nim into add_hashWangYi1

1b69485

c-blake added 7 commits April 1, 2020 06:07

Merge branch 'devel' of https://github.com/Araq/Nim into add_hashWangYi1

e89fd81

Re-organize to work around when nimvm limitations; Add some tests; Add a changelog.md entry.

4b0c41a

Add less than 64-bit CPU when fork.

a7cada9

Fix decl instead of call typo.

4ed8e1a

First attempt at fixing range error on 32-bit platforms; Still do the arithmetic in doubled up 64-bit, but truncate the hash to the lower 32-bits, but then still return `uint64` to be the same. So, type correct but truncated hash value. Update `thashes.nim` as well.

875e7fb

Merge branch 'devel' of https://github.com/Araq/Nim into add_hashWangYi1

dc4dba2

A second try at making 32-bit mode CI work.

b3510c3

Araq reviewed Apr 2, 2020


c-blake added 10 commits April 2, 2020 04:57

Use a more systematic identifier convention than Wang Yi's code.

38fe03c

Fix another stringified test depending upon hash order.

b78e18d

Oops - revert the string-keyed test.

29424cc

Fix another stringify test depending on hash order.

add55fe

Add a better than always zero defined(js) branch.

00a708e

These string hash tests fail for me locally. Maybe this is what causes the CI hang for testament pcat collections?

33240f9

Oops. That failure was from me manually patching string hash in hashes. Revert.

f5aae61

Merge branch 'devel' of https://github.com/Araq/Nim into add_hashWangYi1

56e1df8

c-blake mentioned this pull request Apr 4, 2020

HashSet[uint64] slow insertion depending on values #11764

Closed

Import more test improvements from #13410

131667a

This was referenced Apr 13, 2020

cannot generate VM code for asm yglukhov/nimsl#3

Closed

testament failure in Nim CI zero-functional/zero-functional#60

Closed

c-blake added 5 commits April 14, 2020 06:55

Re-organize when nimvm logic to be a strict when-else.

aa30554

Merge other changes.

fa879a5

Merge branch 'devel' of https://github.com/Araq/Nim into add_hashWangYi1

97e685b

Lift constants to a common area.

67eab12

Fall back to identity hash when BigInt is unavailable.

f3c236e

c-blake added 2 commits April 15, 2020 07:05

Merge branch 'devel' of https://github.com/Araq/Nim into add_hashWangYi1

dfe9ee8

Increase timeout slightly (probably just real-time perturbation of CI system performance).

e97c81a

narimiran reviewed Apr 15, 2020


Araq merged commit a0b33f9 into nim-lang:devel Apr 15, 2020

c-blake deleted the add_hashWangYi1 branch April 15, 2020 18:15

ringabout mentioned this pull request Oct 30, 2020

[Regression] JS: HashSet hash at compile time and run time not equal #15624

Closed

bung87 added a commit to bung87/Nim that referenced this pull request Nov 12, 2020

test for issue nim-lang#15624 and PR nim-lang#15915 for patch nim-lang#13823

9bb6d88

bung87 added a commit to bung87/Nim that referenced this pull request Nov 13, 2020

test for issue nim-lang#15624 and PR nim-lang#15915 for patch nim-lang#13823

fa14573
