
Slow insertion of uint in Tables and Sets Collections depending on the value #13393

Closed
rockcavera opened this issue Feb 11, 2020 · 12 comments · Fixed by #13418

Comments

@rockcavera
Contributor

rockcavera commented Feb 11, 2020

A long time ago I noticed that the Tables and Sets collections took a long time to add values, so I decided to investigate further.

I noticed that the problem was: the higher the most significant set bit of an integer, the slower its insertion becomes.

Using uint32, inserting values that only use the lowest 16 bits is 14 times faster than inserting values that use the highest 16 bits.

Using uint64 the situation is worse: for the Sets collection, inserting values that only use the low 32 bits is about 256 times faster than inserting values that use the highest 32 bits. For the Tables collection, the difference grows to 348 times.

I will not paste sample code here, because I created a GitHub repository demonstrating the problem. There you can see the code used, a quick workaround using seqs and sort, and code using similar collections in other languages doing the same work with much more consistent execution times. A rough sketch of the benchmark's shape follows below.
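
A minimal sketch of that benchmark shape (synthetic keys here; the real code reads them from files in the repository):

import sets, times

proc timeIncl(keys: seq[uint64]): float =
  ## Seconds taken to insert all keys into a fresh HashSet.
  var s = initHashSet[uint64]()
  let t0 = epochTime()
  for k in keys:
    s.incl k
  result = epochTime() - t0

var lowKeys, highKeys: seq[uint64]
for i in 1'u64 .. 100_000'u64:
  lowKeys.add i            # only low bits set
  highKeys.add i shl 32    # only high bits set

echo "low keys:  ", timeIncl(lowKeys)
echo "high keys: ", timeIncl(highKeys)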

It is a problem related to, or the same as, the one discussed in #11764.

Example and Current Output

Visit this link: https://github.com/rockcavera/nim-problem

Expected Output

A much smaller difference between insertions is expected, that is, closer execution times, as happens in other languages.

Possible Solution

  • I can't say whether the problem is the design adopted for the collections or a poor choice of hash function... I didn't delve into the current collection code due to lack of time.

Additional Information

  • The test was carried out on several Nim versions, from before the 1.0 release onward, and all of them exhibit the same problem
@Araq
Member

Araq commented Feb 12, 2020

I personally think that hashing in a stdlib is an endless whack-a-mole game, so @narimiran is working on giving us BTrees everywhere instead.

@jyapayne
Contributor

jyapayne commented Feb 12, 2020

I did some investigation on this. The hash function is not the problem here; it is super fast, and you probably won't encounter this on a 32-bit machine. If you replace your code with:

# newcodesetsu64.nim
import strutils
import streams
import times
import hashes

let
  fileName1 = "lowu64.txt"
  fileName2 = "highu64.txt"

proc processFile(fileName: string) =
  let iTime = epochTime()

  var
    fh = newFileStream(fileName, fmRead, 1000000)

  if isNil(fh):
    raise newException(IOError, "Problem to open file")

  var
    read: string

  while readLine(fh, read):
    # Only compute hash here
    discard hash(uint64(parseUInt(read)))

  close(fh)

  let eTime = epochTime()

  echo fileName, "\nduration: ", eTime - iTime

processFile(fileName1)
echo "\n"
processFile(fileName2)

Results in (nim c -d:danger -d:release -r newcodesetsu64.nim):

lowu64.txt
duration: 0.04606485366821289

highu64.txt
duration: 0.05289411544799805

I ran some profiling and it looks like the main offender is in the lib/pure/collections/hashcommon.nim module, specifically in the rawGet proc.

Further investigation revealed the offender. In template rawGetKnownHCImpl(), there is this line:

var h: Hash = hc and maxHash(t)

maxHash is a template that expands to t.data.high, so the real line is:

var h: Hash = hc and t.data.high

t.data is a seq, so this is the highest index of the seq (its capacity minus 1), which in your case is 524287.

So then what's the problem, you ask?

Well, it seems that h in the above code is either 0 or 11 for these uint64 numbers, which causes many, many collisions.

You can see this behaviour with the below code:

import strutils
import streams
import times
import math
import hashes

let
  fileName1 = "lowu64.txt"
  fileName2 = "highu64.txt"

proc processFile(fileName: string) =
  let iTime = epochTime()

  var
    fh = newFileStream(fileName, fmRead, 1000000)

  if isNil(fh):
    raise newException(IOError, "Problem to open file")

  var
    read: string

  let capacity = nextPowerOfTwo(400000) - 1

  while readLine(fh, read):
    let parsed = parseUint(read).uint64
    echo hash(parsed).uint64 and capacity.uint64

  close(fh)

  let eTime = epochTime()

  echo fileName, "\nduration: ", eTime - iTime

processFile(fileName1)
echo "\n"
processFile(fileName2)

You'll notice that if you run it, the first file prints valid indices between 0 and 524287, while the second file only produces the numbers 0 and 11.

I'm not sure what the cause is yet or what the fix could be. Maybe @Araq can shed some light?

@jyapayne
Contributor

Oh wait, you said highu64.txt only has high bits set, so when you AND those values with 524287, the result is 0. It looks like the 11 might come from the hash function in lib/pure/hashes.nim:

proc hash*(x: uint64): Hash {.inline.} =
  ## Efficient hashing of `uint64` integers.
  result = cast[Hash](cast[uint](x * prime))

Where the prime variable is always 11.
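
A minimal arithmetic sketch, assuming hash(x) = x * 11 as shown above, of why only 0 and 11 show up after masking:

# The table mask is 2^19 - 1 = 524287 (the capacity seen above).
let mask = 524287'u64
echo ((1'u64 shl 60) * 11'u64) and mask  # 0: every set bit lies above the mask
echo (1'u64 * 11'u64) and mask           # 11: the prime itself shows through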

@demotomohiro
Contributor

I think which hash function gives the fastest sets or tables depends on the properties of the input keys.
If the lower bits of the keys are uniformly distributed (consecutive numbers, or uniformly distributed random numbers), using the identity function (the function that returns its input as is) as the hash function would be fastest. In that case there are few collisions even with the identity hash, and using any complicated hash function will not reduce the number of collisions.
(Also, when keys are consecutive numbers, a non-identity function can cause random access to the internal storage of the set and many cache misses.)
If the higher half of the bits is uniformly distributed but the lower half is not, a hash function that swaps the high and low halves might be fastest. But it would cause many collisions when keys are consecutive numbers.
If you don't know what kind of keys will be given to the set and want to avoid the worst case as much as possible, a hash function that scrambles the bits of the key and returns uniformly distributed values, like MurmurHash, might be best.

Whatever hash function is used for Sets or Tables, there are cases that cause many collisions and slowdowns.
For example, in the following code, the value obtained by applying the hash function to myfunc(i) equals reverseBits(i), and applying reverseBits to consecutive numbers yields many keys with the same lower bits, causing many collisions.

import sets
import bitops  # for reverseBits

# Illustrative pseudocode: `inverseFunctionOfHash` is hypothetical and stands
# for the inverse of the set's hash function, so that
# hash(myfunc(i)) == reverseBits(i) for every i.
func myfunc(x: int): int =
  result = inverseFunctionOfHash(reverseBits(x))

var values = initHashSet[int]()
for i in 0..100000:
  values.incl(myfunc(i))

If the hash function is not invertible, it is not an injection, which means there are at least two different input values that map to the same output.
If you give a set different keys that map to the same value, or that share the same low bits, that can cause many collisions.

We might need to know what kinds of int keys are used in the real world in order to choose the default hash function for int keys.
If some people need int keys that share the same lower bits, or whose values are not uniformly distributed, a hash function like MurmurHash should be used (and a user whose keys are consecutive numbers can switch the hash function to the identity function).
But if those cases are very rare and most people use consecutive numbers or uniform random numbers as int keys, the identity function would be fine.

When a string (a person's name or something) is used as a key, there are many keys that share the same characters and differ in only a few bits.
But are there similar cases when an int is used as a key?
When you use pointers as keys, they can be multiples of 4, 8 or 16. But in that case, just dropping the lowest 3 or 4 bits would yield uniformly distributed numbers, as sketched below.
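
A hedged sketch of that last idea, assuming 16-byte-aligned pointer keys (the proc name is illustrative):

import hashes

# 16-byte alignment means the low 4 bits of the address are always zero;
# shifting them away restores uniformly distributed low bits for the mask.
proc hashAlignedPtr(p: pointer): Hash =
  cast[Hash](cast[uint](p) shr 4)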

@c-blake
Contributor

c-blake commented Feb 27, 2020

I hope everyone realizes that all this HashSet/Table implementation churn (which is still far from over, with comments today about proc enlarge[A](s: var OrderedSet[A]) being broken with deletions in some workloads) could probably have been avoided by just telling the person filing this issue A) that the default Nim hash function is not considered perfectly general purpose, B) that there is no perfect general-purpose hash, and C) that they could have just added

import hashes
proc hash(x: uint64): Hash = Hash(x * 0x9549249549249549'u64)

or some similar very large multiplier to their code to fix this. (If for some reason that does not override the hashes module hash, that is a real bug.)

Or, if we wanted a 1-line patch in the stdlib to harden integer hashes "probably enough", we could even have tossed a multiplier into the code at the call to hash(x). That extra multiply could even be optional, activated on the next resize after observing a long probe sequence via a new table-wide flag, thus never changing anything for users who define their own decent hash. The cost of an inc, a cmp, and a perfectly predicted branch somewhere would probably be hard to measure. Tables would still be "ageless" and have the same memory utilization and speed under delete workloads, etc.

I did do a Robin Hood impl, and it is a little faster in some cases, but it cannot be a drop-in replacement. It breaks the current OrderedSet/HashSet layering (as any reorganizing-on-insert method must), and similarly for Table. So, if we go that way, we would probably want two styles of set representation for the insertion-ordered and unordered cases. And if we did that, we would probably want to refactor the whole 4-way set (5 if you include CountTable) so that a Table is just a HashSet[Pair[K,V]].

Personally, I think the old impl with a zero-, one-, or few-line patch to harden the hash in bad cases is simpler and has better general properties. As Araq always points out, hash methods have fundamentally tricky failure modes and so will never be novice friendly. Non-novices should know how to define their own hash(x).

@demotomohiro
Contributor

@c-blake
Multiplying by whatever number in the hash function alone doesn't solve this issue, because the keys @rockcavera describes as causing the slowdown share the same bit patterns in their lower bits.
Suppose there are uint values a and b whose lowest n bits are the same but whose other bits differ.
Whatever number they are multiplied by, their lowest n bits remain the same.
More operations that mix the higher bits into the lower bits are required in a hash function to solve this issue without pseudorandom probing; a sketch of such a mixer follows after the link below.
At first, @timotheecour tried to solve this issue with better hash functions, but it did not get merged:
#13410
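
A hedged sketch of that kind of high-to-low bit mixing (the splitmix64 finalizer, chosen here only for illustration; it is not the hash proposed in #13410):

import hashes

# Each xor-shift folds high bits down into low positions; the odd
# multipliers spread them out again, so every output bit depends on
# every input bit, including the low bits the table mask uses.
proc mixHash(x: uint64): Hash =
  var z = x
  z = (z xor (z shr 30)) * 0xbf58476d1ce4e5b9'u64
  z = (z xor (z shr 27)) * 0x94d049bb133111eb'u64
  result = cast[Hash](z xor (z shr 31))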

@timotheecour
Member

timotheecour commented Feb 28, 2020

a Table is just a HashSet[Pair[K,V]]

that doesn't make sense.
It should be the other way around:
a HashSet[K] could be a thin wrapper around Table[K, Void] with type Void = object (it works), or Table[K, bool] (it also works), with different properties, in particular for deletions; a sketch of that layering follows below.
This should be benchmarked to see whether it gives the same, similar, or better performance compared to the current HashSet.
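
A hedged sketch of that layering (type and proc names are illustrative, not stdlib API):

import tables

type
  Void = object                  # zero-size value type
  SetAsTable[K] = object
    t: Table[K, Void]

proc incl[K](s: var SetAsTable[K], k: K) = s.t[k] = Void()
proc excl[K](s: var SetAsTable[K], k: K) = s.t.del(k)
proc contains[K](s: SetAsTable[K], k: K): bool = k in s.t
proc len[K](s: SetAsTable[K]): int = s.t.len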

I agree, though, that we should factor the implementations we have into a smaller number, and benchmark to make sure we don't lose any performance in doing so;
I highly doubt we need that many different implementations for:
table, orderedtable, counttable, sharedtable, hashset, set, intset, cellset, strtabs ( + others )

"premature optimization is the root of all evil" applies here.

did do a Robin Hood impl

please share code so I/we can benchmark against #13440 and have an apples-to-apples comparison; the benchmarking code I wrote covers a wide range of use cases, and improvements to it are welcome (notably, deletion workloads should be tested more)

comments today about proc enlarge[A](s: var OrderedSet[A]) being broken with deletions in workloads

please file an issue; I'm not seeing this at https://github.com/nim-lang/Nim/issues

by just telling the person filing this issue A) that the default Nim hash function is not considered perfectly general purpose, B) that there is no perfect general-purpose hash, and C) that they could have just added
proc hash(x: uint64): Hash = Hash(x * 0x9549249549249549'u64)
or some similar very large multiplier to their code to fix this.
we could even have tossed a multiplier in the code at the call to hash(x)

2X slower for many simple distributions

@c-blake
Contributor

c-blake commented Feb 28, 2020

There is always an attack pattern, even for PR probing, which is why the right answer is notifying the user, via some configurable File, that they have a problem and how to solve it. Maybe they should use their own hash, resilient to whatever key-distribution hostility they expect. With PR probing someone can still engineer all-identical hash codes. You need secret-seeded hashing with an unrecoverable secret to really be safe, and that's very expensive. I agree the run-time should warn a few times when hash distribution assumptions go far out of bounds, probably inactive in -d:danger mode but active in -d:release mode and with its own independent switch. It's a common enough DoS attack vector.

More to your specific point, I think you may be wrong. A big multiplier like 0b101010101..., the same size as the key, toggling every other or every 3rd bit, will spread bits from all over the key into all over the double-key-sized product (most densely in the middle). A subsequent half-key-sized right rotation should then yield a good diversity of low hash-code bits; a sketch follows below. There used to be a PRNG method called "middle square" that is sort of like this. For the biggest ints, where you only get the modulus of the product, you might want to rotate a little more. I don't see that this extracts key entropy into the low-order bits inherently worse than the xor-shifty one, but any hash will have failure modes. The exact size of tables, the variations of bits, etc. all matter.
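
A minimal sketch of that multiply-then-rotate idea (the multiplier is the one suggested earlier in this thread; the proc name is illustrative):

import bitops

# The wide multiply concentrates mixed key bits toward the middle of the
# product; rotating by half the word size brings them down into the
# low-order positions that the table mask actually uses.
proc mulRotHash(x: uint64): uint64 =
  rotateRightBits(x * 0x9549249549249549'u64, 32)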

In any event, if you don't like my particular multiply-and-rotate suggestion, you could have that other one (or both, and maybe a 3rd and 4th, perhaps numbered by expected run-time): proc hash(x: myInt) = whicheverHash(x). While Araq might complain about the default changing, it seems pretty reasonable to have a few choices on hand for users. I think the mistake of #13410 is trying to have one hash fit all, which is honestly a false sense of security. Give people A) notice of the issue, which is easy to detect with a loop counter, and B) a few choices that are easy to switch to. That's already a pretty good situation. Anyway, I realize it was Araq, not you, who rejected the new hash.

You could make it better by dynamically switching the implementation at the next rehash event, and even triggering rehashes from overlong collision chains instead of table load (while also requiring, say, over 25% load to prevent runaway memory). That is also a more incremental/localized change with no impact on delete-heavy workloads/table aging and all these follow-on bugs. { There's still (at least) one bug in proc hash*[A](s: HashSet[A]): Hash, where deleted-marker codes get hashed in there... maybe fixed somewhere already? } As it is, we're doing a bunch of work to make something good for less general workloads but more novice-user bulletproof. That sounds like Python priorities to me.

@Araq
Member

Araq commented Feb 28, 2020

That sounds like Python priorities to me.

Yes. If we follow Python's dict implementation as closely as is reasonable, people coming from Python can't complain. That's really the goal here, IMHO. We are also getting BTrees into the stdlib, which are a general solution without these flaws; they are slower, though, so naive benchmarkers are better off with the hash table implementations.

@timotheecour
Member

with comments today about proc enlarge[A](s: var OrderedSet[A]) being broken with deletions in workloads
...
There's still (at least) a bug in proc hash*[A](s: HashSet[A]): Hash where deleted marker codes get hashed in there

if you find a bug, please file a GitHub issue; otherwise it's very likely to be forgotten and not acted on

@c-blake
Contributor

c-blake commented Feb 28, 2020

I don't know what kind of huge set of branches you have or which bugs are already fixed. I won't forget. Here's another one: tables.Table.== will fail if you have two tables with different numbers of identical (key, val) pairs. Since tables can add however many they like, this is a rarely triggered bug. It's maybe more arguable what you want to support there.

Also - I wrote my last long post just before @timotheecour posted, but he beat me. Really it's all just "one tuple", part of which happens to be the key that you project down with hash and ==; a sketch of that view follows below. I know about the Table[K,void] alternative. I'm not sure we can factor things to get it to work that way... maybe. To me it seems less clean. The STL uses set(pairs), and I bet compilers optimize a couple of unused struct inits out to the right assembly. Even without the STL, that has to be one of the top 10 optimizations in any compiler. People who actually wanted Set[complexThingWithAKeySomewhere] would also miss out, and would at least be "led" by the API to duplicate the key instead of just providing key and == extractors. I think it's better to "lead people" to more efficient solutions. It's also just what a "table" is - a set of rows, and maybe those rows are two things or maybe they're many. So, from a natural-language, conceptualization, and algorithmic point of view, sets as a basis seem more right to me. Key/value separation is just more "popular", but lots of popular things are not quite the right mental model.
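
A hedged sketch of that set-of-rows view (the Pair type and the key-projecting hash/== are illustrative, not stdlib code):

import sets, hashes

type Pair[K, V] = object
  key: K
  val: V

# Hash and equality project onto the key alone, so the set deduplicates
# rows by key, which is exactly what a table does.
proc hash[K, V](p: Pair[K, V]): Hash = hash(p.key)
proc `==`[K, V](a, b: Pair[K, V]): bool = a.key == b.key

var t = initHashSet[Pair[string, int]]()
t.incl Pair[string, int](key: "a", val: 1)
echo Pair[string, int](key: "a", val: 0) in t  # true: lookup is by key alone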

Anyway, they are clearly mostly "duals" of each other, but the separate impls have allowed a lot of wandering. Another weird divergence is that HashSet and Table have gradually grown divergent capabilities: withValue, the Ref variants, and probably other examples. I think we agree either one defined in terms of the other would be better. They both target the same general, mostly unknown workload idea.

To quickly address some other points: the issue I mentioned came up on IRC, IIRC, and was already fixed. I've already critiqued hot-loop benchmarks with more words than I should have, and that's all you have in your vitanim, though as I've agreed, the LP threshold eases my basic concern for insert/get-only workloads. Also, all I see are tests of different distributions. There's not a single steady-state insert-delete aging benchmark, which is the workload where this change could move things. It's not hard - just randomly insert stuff, then randomly insert and delete with a sample space small enough to have high hit rates, and run it for a long time; a sketch follows below. Even that is vulnerable to a "perfectly hot loop" complaint and doesn't really test "actually random" access, as I may or may not have successfully explained.
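
A hedged sketch of such a steady-state aging benchmark (all parameters arbitrary):

import sets, random, times

# After warm-up, each iteration inserts one random key and deletes another
# from a small key space, so the table "ages" under sustained churn.
randomize(42)
var s = initHashSet[int]()
const keySpace = 100_000
for _ in 0 ..< keySpace div 2:   # warm-up inserts give a high hit rate later
  s.incl rand(keySpace)
let t0 = epochTime()
for _ in 0 ..< 5_000_000:        # long churn phase
  s.incl rand(keySpace)
  s.excl rand(keySpace)
echo "churn time: ", epochTime() - t0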

In filesystems research, not aging things will get your papers rejected or at least mocked, and I'm not just saying that to be snarky. As I've mentioned before, both in instruction counts for typical latency and in bandwidth-to-latency ratios, modern memory is ever more like the disks of 50 years ago, but with a much smarter prefetching CPU confounding benchmarking. Tombstones both increase probe length (on purpose - it's what they do) and decrease effective memory utilization. So it becomes an at least two-dimensional, cache-cliff problem for which there are no easy general answers - except maybe to avoid the conundrum in the first place as a default, which is what I've advocated out of simplicity. The one thing this change actually makes a big difference to, just on its face, seems still not benchmarked, and "worst hit is 1.06x" claims are still being bandied about as if they mean something. That's a bit worrisome, but then again, who knows...? Maybe no one ever deletes! :-) I often don't. Good benchmarks are actual applications which perturb perfectly oiled loops with IO, system calls, and other things, but which do "a lot" of something of interest. I get that "a lot but not all" is vague. Sorry. I think great answers here are elusive, but your benchmarks are only about distribution, not "real" performance - related, but distinct. I doubt more details about why I think what I think would help.

It seems Araq just wants to have an easy answer for Python newbies, though the current trajectory is actually better than that. Python uses a double indirection all the time, with a separate array as the table part and just a dense seq for ordering. And I misremembered: they *do* still use the pseudorandom probe idea on the first level of indirection - but because the ordered "main seq" part is in "insert order", not "hash order" (rarely correlated), the probes all get wildly indirect. I'd think they would do better to spend the space to put the hash code in the randomly ordered hash-table part, take only one of those hits, and use @timotheecour's initial linear probe trick. That idea is so obviously better that I'd say "watch for it in Python someday". Indeed, it's pretty clear that move was because they cared about cache-resident ordered dicts for the language, not very large ordered dicts. Structurally, just looking at the data layout, it has to be much slower on those - at least 2x, maybe 3x or more - but you might never notice if you test only with a loop the CPU can predict perfectly. I think they also claimed "faster", or 1.06x worse, or something else I consider quite wrong.

Also, on the more-like-Python note, and since you're overhauling the impls anyway, you should probably at least put in "length change" asserts in all the iterators (bolded so it is not missed in this long message); a sketch follows below. Python will raise exceptions on mutation during iteration. We might want exceptions long term, but we can at least do cheap/free asserts now. Long term, exceptions are probably a good plan if the idea is that the default tables are for people who may have weak backgrounds and we have fast goto-based exceptions.
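
A hedged sketch of such an assert, written here as a wrapper over the public iterator (a stdlib version would check internally instead):

import sets

iterator itemsChecked[A](s: HashSet[A]): A =
  let startLen = s.len
  for x in s.items:
    assert s.len == startLen, "collection modified during iteration"
    yield x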

As Araq observes/knows, using a hash table is usually much faster than a B-tree, but it comes with the fundamental, ineliminable burden/gotcha of at least potentially having to provide a proper hash-code generator. The perf difference will never go away, so neither will this gotcha. So, a variety of choices for hash() redefinition in the stdlib would make sense, just to let people rapidly try something with likely distinct failure modes without having to hunt for dependencies (plus notification, as I've mentioned, so they know quickly when to try). I like wyhash lately. It's fast and very mixing, passing all those SMHasher tests (and mostly multiplication-based, btw, though decidedly more ornate than my suggestion above, which would probably work in 99% of all non-constructed/attack scenarios). I think if those measures were in place, all this collision-resolution re-org would have been avoidable, and might even now be better.

So, look -- the (obvious) ultimate answer to this whole area of questions is a diversity of set representations, with and without satellite-data capability, all with variously different properties. There are so many we could rattle off: aggregate bit vectors, space-optimized direct indexing, Robin Hood, compactified à la Py36 for the ordered case, what we had before with a decent hash, the path @timotheecour is on now, probably Cuckoo to boot if people have switchable families of strong hash functions, and external chaining just for "insert/delete while you iterate" cases. Some "intrusive" options would also be nice. With a good template/macro framework, each of these things is only a couple hundred lines of distinct code.

Good defaults are nice and all, but arguing about defaults is not productive if the real answer is choice, hopefully within a consistent interface framework, and arguing in manifesto-length screeds takes time away from spending it on increasing such choice, which is where I feel my work on this has moved, personally. I bet Araq feels like this every damn day, honestly. ;-) Boost made C++ drastically more appealing, just as one example. In light of the obvious ultimate answer, I will probably just do a separate library with a compatible API. Maybe someone will care to change an import/tweak a declaration sometime. If Nim core wants to copy it, that's fine. If they don't, I'll just make noise about it from time to time, or someone can do an RFC. I think the right refactoring is probably a totally different impl for OrderedSet and HashSet (that's what Python did, but they ditched "choice" in the stdlib because... audience) and Table implemented in terms of Set.

Where we want to get to, ecosystem-wise, is the sequence "User X: This is slow! Advisor Y: Try this instead!" Really, one would hope (but be disappointed) that users would profile before asking for advice. Maybe sometimes things are simple enough that the RTL is the advisor, not a person. Heck, maybe the RTL can even eventually be super smart and dynamically switch, or at least have a mode to. A well-written B-tree can give you efficient insert in the middle of dynamic "seq"s as well as sorted keys. It helps to separate "seek" and "mutation" in your implementation to support either; then you can seek by rank or seek by key, and you can even do away with the key and have rank be just like a seq index. (I tried to steer @narimiran this way in narimiran/sorta#1, but he didn't respond, so I don't know if he even understood my point.) Anyway, it might be more advanced than our target RTL, but it wouldn't even be totally crazy to have seq maintain statistics about the locations of inserts/deletes and (in some mode), beyond some threshold, convert to a B-tree - much as we are discussing bad-case dynamic switchover from hash tables to B-trees instead of rehashing, which could also be automagic upon giant probe sequences. That's one vector towards a novice/attack/crazy-circumstance-proof RTL without sacrificing common-case perf, anyway.

@c-blake
Contributor

c-blake commented Feb 28, 2020

Incidentally, saying that a "better" hash "didn't help" or "wouldn't help" when it seems you tried exactly one hash - or maybe that and some super-slow string hash - sounds like arguing from ignorance/not trying much. Did you even instrument your probing loop to see whether the "supposedly" better hash was really getting short clusters against those hostile keys? All I see are naive wall-clock timings. When I check distributions, I check probe counts; then you know quality. You talk as if you did that, because the better hash was slower, but that xor-shr thing doesn't look that much slower, and also not that mix-y. Maybe you did some other kind of profiling to disambiguate why the slower hash was "slow" in context? (Obviously everything is slower than nothing, but the xor-shr thing looked pretty fast to me compared to even a few loops and probes.) Or maybe you just leapt to a conclusion?

I'd bet there are plenty of choices that could have gone into a little library of 3..5 well-vetted fast integer hashes, one of which would have solved this guy's problem. (Or at the very least made him never notice his original small bias problem, which he then focused into what is basically a harder-to-resist "attack".) Such a little library may seem beside the point with the PRNG probes, but you would still prefer not to collide if there's any LP segment keeping any locality (which you would prefer). And you still want a warning system for attackers who can engineer the same whole hash code - maybe not for integer keys with the identity hash, but for all the other client code that shares that impl, especially strings.

So, in short, the resolutions that could have addressed this besides PRNG probing should all still happen, just "less soon". With those in place, the restructuring would not have been necessary; there would be no questions about aging or benchmarking - because you can't do better than "1" DRAM hit (we do get that one with your limit, though) and 0 extra space pressure - and no (new) whack-a-moles. It's not quite a mathematical theorem, because of diverse use cases and workloads, some in cache, but it's not actually so far from one.

Totally off to the side: WTF is with renaming bitops.rotateRightBits locally to hashcommon.translateBits? Intentionally different spatial metaphors to make some kind of point? The assembly instructions have been called "rotate", too, for some 50-60 years; that aside, the stdlib should at least be internally consistent, even for non-exported procs, I'd say.

Anyway, I'm busily coding up my own library now, so I'm going to be less noisy about all this, and that's something positive to come out of it for me, at least. Apologies for the verbosity, but enough things were being said that made no sense that I didn't feel y'all were tracking important things { except Araq, who clearly knows his stuff but mostly seems to wish he had written a B-tree 15 years ago and never put hash tables in the stdlib :-D }. How I respond to such a feeling: "just try saying it a little differently" - but then people tune out the walls of text. Oh well. Good luck and best wishes to all!

narimiran pushed a commit that referenced this issue Mar 31, 2020
…nges (#13816)

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.
Araq pushed a commit that referenced this issue Apr 15, 2020

* This is an alternate resolution to #13393
(which arguably could be resolved outside the stdlib).

Add version1 of Wang Yi's hash specialized to 8 byte integers.  This gives
simple help to users having trouble with overly colliding hash(key)s.  I.e.,
  A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))`
      in the instantiation context of a `HashSet` or `Table`
or
  B) more globally, compile with `nim c -d:hashWangYi1`.

No hash can be all things to all use cases, but this one is A) vetted to
scramble well by the SMHasher test suite (a necessarily limited but far
more thorough test than prior proposals here), B) only a few ALU ops on
many common CPUs, and C) possesses an easy via "grade school multi-digit
multiplication" fall back for weaker deployment contexts.

Some people might want to stampede ahead unbridled, but my view is that a
good plan is to
  A) include this in the stdlib for a release or three to let people try it
     on various key sets nim-core could realistically never access/test
     (maybe mentioning it in the changelog so people actually try it out),
  B) have them report problems (if any),
  C) if all seems good, make the stdlib more novice friendly by adding
     `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1`
     with some `when defined` rearranging so users can `-d:hashIdentity`
     if they want the old behavior back.
This plan is compatible with any number of competing integer hashes if
people want to add them.  I would strongly recommend they all *at least*
pass the SMHasher suite since the idea here is to become more friendly to
novices who do not generally understand hashing failure modes.

* Re-organize to work around `when nimvm` limitations; Add some tests; Add
a changelog.md entry.

* Add a `when` fork for less-than-64-bit CPUs.

* Fix decl instead of call typo.

* First attempt at fixing range error on 32-bit platforms; Still do the
arithmetic in doubled up 64-bit, but truncate the hash to the lower
32-bits, but then still return `uint64` to be the same.  So, type
correct but truncated hash value.  Update `thashes.nim` as well.

* A second try at making 32-bit mode CI work.

* Use a more systematic identifier convention than Wang Yi's code.

* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a
very long time, I think).  `$a`/`$b` depend on iteration order which
varies with table range reduced hash order which varies with range for
some `hash()`.  With 3 elements, 3!=6 is small and we've just gotten
lucky with past experimental `hash()` changes.  An alternate fix here
would be to not stringify but use the HashSet operators, but it is not
clear that doesn't alter the "spirit" of the test.

* Fix another stringified test depending upon hash order.

* Oops - revert the string-keyed test.

* Fix another stringify test depending on hash order.

* Add a better than always zero `defined(js)` branch.

* It turns out to be easy to just work all in `BigInt` inside JS and thus
guarantee the same low order bits of output hashes (for `isSafeInteger`
input numbers).  Since `hashWangYi1` output bits are equally random in
all their bits, this means that tables will be safely scrambled for table
sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the
integer keys are all < 2**53 (also likely fine).  (I'm unsure why the
infidelity with C/C++ back ends cut off is 32, not 53 bits.)

Since HashSet & Table only use the low order bits, a quick corollary of
this is that `$` on most int-keyed sets/tables will be the same in all
the various back ends which seems a nice-to-have trait.

* These string hash tests fail for me locally.  Maybe this is what causes
the CI hang for testament pcat collections?

* Oops. That failure was from me manually patching string hash in hashes.  Revert.

* Import more test improvements from #13410

* Fix bug where I swapped order when reverting the test.  Ack.

* Oh, just accept either order like more and more hash tests.

* Iterate in the same order.

* `return` inside `emit` made us skip `popFrame` causing weird troubles.

* Oops - do Windows branch also.

* `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic
resolutions are "1 == identity", 1 for Nim Version 1, 1 for
first/simplest/fastest in a series of possibilities.  Should be very
easy to remember.)

* Re-organize `when nimvm` logic to be a strict `when`-`else`.

* Merge other changes.

* Lift constants to a common area.

* Fall back to identity hash when `BigInt` is unavailable.

* Increase timeout slightly (probably just real-time perturbation of CI
system performance).
dom96 pushed a commit that referenced this issue Jun 10, 2020

* Add neglected API call `find` to heapqueue.

* Add a changelog.md entry, `since` annotation and rename parameter to be
`heap` like all the other procs for consistency.

* Add missing import.