Slow insertion of uint in Tables and Sets Collections depending on the value #13393
I personally think that hashing in a stdlib is an endless whack-a-mole game, so @narimiran is working on giving us BTrees instead everywhere.
I did some investigation on this. The hash function is not the problem here and is super fast; also, you probably won't encounter this on a 32-bit machine. If you replace your code with:

```nim
# newcodesetsu64.nim
import strutils
import streams
import times
import hashes

let
  fileName1 = "lowu64.txt"
  fileName2 = "highu64.txt"

proc processFile(fileName: string) =
  let iTime = epochTime()
  var fh = newFileStream(fileName, fmRead, 1000000)
  if isNil(fh):
    raise newException(IOError, "Problem to open file")
  var read: string
  while readLine(fh, read):
    # Only compute hash here
    discard hash(uint64(parseUInt(read)))
  close(fh)
  let eTime = epochTime()
  echo fileName, "\nduration: ", eTime - iTime

processFile(fileName1)
echo "\n"
processFile(fileName2)
```

Results in (…)
I ran some profiling and it looks like the main offender is in the … Further investigation revealed the offender. In

```nim
var h: Hash = hc and maxHash(t)
```

which reduces to

```nim
var h: Hash = hc and t.data.high
```
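The effect of that masking can be seen directly in a small sketch (the constants here are illustrative, not taken from the stdlib): the slot index keeps only the low-order bits of the hash code, so any entropy in the high bits of `hc` is simply discarded.

```nim
import math

# The slot index keeps only the low-order bits of the hash code.
let mask = nextPowerOfTwo(400_000) - 1          # 524287 == 2^19 - 1
let hc = 0x0123_4567_89AB_CDEF                  # an arbitrary hash code
doAssert (hc and mask) == hc mod (mask + 1)     # masking is mod 2^19
# Changing bits above bit 18 never changes the chosen slot:
doAssert ((hc or 0x7FF0_0000_0000_0000) and mask) == (hc and mask)
```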
So then what's the problem, you ask? Well, it seems like … You can see this behaviour with the code below:

```nim
import strutils
import streams
import times
import math
import hashes

let
  fileName1 = "lowu64.txt"
  fileName2 = "highu64.txt"

proc processFile(fileName: string) =
  let iTime = epochTime()
  var fh = newFileStream(fileName, fmRead, 1000000)
  if isNil(fh):
    raise newException(IOError, "Problem to open file")
  var read: string
  let capacity = nextPowerOfTwo(400000) - 1
  while readLine(fh, read):
    let parsed = parseUint(read).uint64
    echo hash(parsed).uint64 and capacity.uint64
  close(fh)
  let eTime = epochTime()
  echo fileName, "\nduration: ", eTime - iTime

processFile(fileName1)
echo "\n"
processFile(fileName2)
```

You'll notice that if you run it, the first file will print valid indices between … I'm not sure what the cause is yet or what the fix could be. Maybe @Araq can shine some light?
Oh wait, you said the … The relevant `hash` proc is:

```nim
proc hash*(x: uint64): Hash {.inline.} =
  ## Efficient hashing of `uint64` integers.
  result = cast[Hash](cast[uint](x * prime))
```

Where the `prime` is …
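Why a plain multiplicative hash fails on the reported keys can be sketched as follows: modulo 2^64, the low bits of `x * prime` depend only on the low bits of `x`, so keys that are multiples of 2^32 all get hash codes whose low 32 bits are zero, and every one of them masks to slot 0. (The multiplier below is a hypothetical stand-in, not necessarily the stdlib's `prime`.)

```nim
const prime = 11400714819323198485'u64  # hypothetical odd 64-bit multiplier
let mask = 4095'u64                     # a 4096-slot table

for i in 1'u64 .. 100'u64:
  let key = i shl 32                    # key using only the high 32 bits
  # (key * prime) mod 2^64 == ((i * prime) mod 2^32) shl 32, so its
  # low 32 bits - and hence the masked slot index - are always zero.
  doAssert ((key * prime) and mask) == 0
```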
I think which hash function provides the fastest sets or tables depends on the properties of the input keys. Whatever hash function is used for Sets or Tables, there are cases that cause many collisions and slow things down.

```nim
import sets

func myfunc(x: int): int =
  result = inverseFunctionOfHash(reverseBits(x))

var values = initHashSet[int]()
for i in 0..100000:
  values.incl(myfunc(i))
```

If the hash function is not invertible, that means it is not an injection and there are at least 2 different input values that map to the same value. We might need to know what kinds of int keys are used in the real world to choose the default hash function for int keys. When a string (a person's name or something) is used as a key, there are many keys that share the same characters and differ in only a few bits.
I hope everyone realizes that all this … could have been fixed by affected users adding

```nim
import hashes
proc hash(x: uint64): Hash = Hash(x * 0x9549249549249549'u64)
```

or some similar very large multiplier to their code. (If for some reason that does not override the hashes module `hash`, that is a real bug.) Or, if we wanted a 1-line patch in the stdlib to harden integer hashes "probably enough", we could even have tossed a multiplier in the code at the call to …

I did do a Robin Hood impl, and it is a little faster in some cases, but it cannot be a drop-in replacement. It breaks the current …

Personally, I think the old impl with a zero, 1, or few line patch to harden the hash in bad cases is simpler with better general properties. As Araq always points out, hash methods have fundamentally tricky failure modes and so will never be novice friendly. Non-novices should know how to define their own …
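What the first paragraph describes - a user-supplied `hash` in the instantiation context overriding the stdlib one - looks roughly like this in practice. This is a sketch: the multiplier is the one quoted above, and `cast` is used instead of a `Hash(...)` conversion to avoid a range error when the product exceeds `int64.high`.

```nim
import sets, hashes

# A user-supplied hash for uint64, visible where the HashSet is
# instantiated; multiplying by a large odd constant scrambles keys
# that differ only in their high bits into different low index bits.
proc hash(x: uint64): Hash = cast[Hash](x * 0x9549249549249549'u64)

var s = initHashSet[uint64]()
for i in 1'u64 .. 10_000'u64:
  s.incl(i shl 32)          # the kind of keys reported as pathological
doAssert s.len == 10_000
```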
@c-blake … that doesn't make sense. I agree though that we should factor the implementations we have into a smaller number, and benchmark to make sure we don't lose any performance in doing so.

- Please share code so I/we can benchmark against #13440 and have an apples-to-apples comparison; the benchmarking code I wrote covers a wide range of use cases, and improvements to it are welcome (notably, deletion workloads should be tested more).
- Please file an issue; I'm not seeing this here: https://github.com/nim-lang/Nim/issues
There is always an attack pattern, even for PR probing, which is why the right answer is notifying the user at some configurable …

More to your specific point, I think you may be wrong. A big multiplier like 0b101010101…, the same size as the key, toggling every other or every 3rd bit, will spread bits from all over the key into all over the double-key-sized product (the most in the middle). A subsequent half-key-sized right rotation should yield a good diversity of hash code lower bits. There used to be a PRNG method called "middle square". It's sort of like that. For the biggest ints, where you only get the modulus of the product, you might want to rotate a little more. I don't see that this is inherently less extracting of key entropy into lower-order bits than the xor-shifty one, but any hash will have failure modes. The exact size of tables and variations of bits all matter, etc.

In any event, if you don't like my particular multiply-and-rotate suggestion, you could have that other one (or both, and maybe a 3rd and 4th, maybe numbered by expected run-time). You could make it better by dynamically switching the implementation at the next rehash event, and even trigger rehashes from overlong collision chains instead of table load (while also over 25% load or something, to prevent runaway memory). That is also a more incremental/localized change with no impact on delete-heavy workloads/table aging and all these follow-on bugs. { There's still (at least) a bug in … }
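The multiply-and-rotate idea sketched above might look like this; the proc name and constants are mine, purely illustrative, not anything proposed for the stdlib.

```nim
import hashes, bitops

# Multiply by a large alternating-bit-style odd constant, then rotate
# right by half the key width so the well-mixed middle bits of the
# product land in the low bits that the table mask actually uses.
proc mulRotHash(x: uint64): Hash =
  const m = 0xAAAAAAAAAAAAAAAB'u64
  cast[Hash](rotateRightBits(x * m, 32))

# High-bits-only keys no longer all collapse into slot 0:
let mask = 4095'u64
var slots: seq[uint64]
for i in 1'u64 .. 8'u64:
  let slot = cast[uint64](mulRotHash(i shl 32)) and mask
  slots.add slot
doAssert slots[0] != 0
```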
Yes. If we follow Python's dict implementation as closely as reasonable, people coming from Python can't complain. That's the goal here, really, IMHO. We are also getting BTrees into the stdlib, which is a general solution without these flaws; they are slower, though, and so naive benchmarkers are better off with the hash table implementations.
If you find a bug, please file a GitHub issue; otherwise it's very likely to be forgotten and not acted on.
I don't know what kind of huge set of branches you have, or which bugs are already fixed. I won't forget. Here's another one, `tables.Table` …

Also - I wrote my last long post just before @timotheecour posted, but he beat me. Really it's all just "one tuple", part of which happens to be the key that you project down with `hash` and `==`. I know about the … Anyway, they are clearly mostly "duals" of each other, but the separate impls have allowed a lot of wandering. Another weird divergence is that …

To quickly address some other points: that issue I mentioned came up on IRC, IIRC, and was already fixed. I've already critiqued hot-loop benchmarks with more words than I should have, and that's all you have in your … In filesystems research, not aging things will get your papers rejected or at least mocked, and I'm not just saying that to be snarky. As I've mentioned before, both in instruction count for typical latency and in bandwidth-to-latency ratios, modern memory is ever more like disks were 50 years ago, but with a much smarter prefetching CPU confounding benchmarking. Tombstones both increase probe length (on purpose - it's what they do) and decrease effective memory utilization. So, it becomes at least a two-dimensional and cache-cliff problem for which there are no easy general answers - except maybe to avoid the conundrum in the first place as a default, which is what I've advocated out of simplicity.

The one thing this change actually makes a big difference to, just on its face, seems still not benchmarked, and "worst hit is 1.06x" claims are still being bandied about as if they mean something. That's a bit worrisome, but then again, who knows...? Maybe no one ever deletes! :-) I often don't. Good benchmarks are actual applications which perturb perfectly oiled loops with IO, system calls and other things, but which do "a lot" of something of interest. I get that "a lot but not all" is vague. Sorry.
I think great answers here are elusive, but your benchmarks are only about distribution, not "real" performance - related but distinct. I doubt more details about why I think what I think would help. It seems Araq just wants to have an easy answer for Python newbies, though the current trajectory is actually better than that.

Python uses a double indirection all the time, with a separate array as the table part and just a dense seq for ordering. And I misremembered: they *do* still use the pseudorandom probe idea on the first level of indirection - but then, because the ordered "main seq" part is in "insert order" not "hash order" (rarely correlated), the probes all get wildly indirect. I'd think they would do better to spend the space to put the hash code in the random-ordered hash table part, only take one of those hits, and use @timotheecour's initial linear probe trick. That idea is so obviously better that I'd say "watch for it in Python someday". Indeed, it's pretty clear that move was because they cared about cache-resident ordered dicts for the language, not very large ordered dicts. Structurally, just looking at the data layout, it just has to be so much slower on those. At least 2x, but maybe 3x or more, though you might never notice it if you test only with a loop the CPU can predict perfectly. I think they also claimed "faster" or 1.06x worse or something else I think quite wrong.

Also, on the more-like-Python note, and since you're overhauling the impls anyway, you should probably at least **put in "length change" asserts in all the iterators** (bolded to not be missed in the long message). Python will raise exceptions on mutation during iteration. We might want exceptions long term, but we can at least do cheap/free asserts now. Long term, exceptions are probably a good plan if the idea is that the default tables are for people who may have weak backgrounds and we have fast goto-based exceptions.
As Araq observes/knows, using a hash table is usually much faster than a B-tree, but it comes with a fundamental, ineliminable burden/gotcha of at least potentially providing a proper hash code generator. The perf difference will never go away. So, neither will this gotcha. So, a variety of choices for …

So, look - the (obvious) ultimate answer to this whole area of questions is a diversity of set representations, with and without satellite data capability, all with variously different properties. There are so many we could rattle off... aggregate bit vectors, space-optimized direct indexing, Robin Hood, compactified a la Py36 for the ordered case, what we had before with a decent hash, the path @timotheecour's on now, probably Cuckoo to boot if people have switchable families of strong hash functions, and external chaining just for "insert/delete while you …" cases. Some "intrusive" options would also be nice. With a good template/macro framework, each one of these things is only a couple hundred lines of distinct code. Good defaults are nice and all, but arguing about defaults is not productive if the real answer is choice, hopefully within a consistent interface framework, and arguing in manifesto-length screeds takes time away from me spending it on increasing such choice, which is where I kind of feel I've moved to in my work on this, personally. I bet Araq feels like this every damn day, honestly. ;-) Boost made C++ drastically more appealing, just as one example.

In light of the obvious ultimate answer, I will probably just do a separate library with a compatible API. Maybe someone will care to change an import/tweak a declare sometime. If Nim core wants to copy it, that's fine. If they don't, I'll just make noise about it from time to time, or someone can do an RFC. I think the right refactoring is probably a totally different impl for …

Where we want to get to ecosystem-wise is the sequence "User X - This is slow! Advisor Y - Try this instead!"
Really, one would hope (but be disappointed) that users would profile before asking for advice. Maybe sometimes things are simple enough that the RTL is the advisor, not a person. Heck, maybe the RTL can even eventually be super smart and dynamically switch, or at least have a mode to.

A well-written B-tree can give you efficient insert in the middle of dynamic "seq"s as well as sorted keys. It helps to separate "seek" and "mutation" in your implementation to support either. Then you can seek by rank or seek by key, and you can even do away with the key and have rank be just like a seq index. (I tried to steer @narimiran this way in narimiran/sorta#1 but he didn't respond. So, I don't know if he even understood my point.) Anyway, it might be more advanced than our target RTL, but it wouldn't even be totally crazy to have …
Incidentally, saying that a "better" hash "didn't help" or "wouldn't help" when it seems you only tried exactly one hash - or maybe that and some super slow string hash - all sounds like arguments from ignorance/not trying much. Did you even instrument your probing loop to see if the "supposedly" better hash was really getting short clusters against those hostile keys? All I see are naive wall-clock timings. When I check distributions, I check probe counts. Then you know quality. You talk as if you did do that because the better hash is slower, but that xor-shr thing doesn't look that much slower and also not that mix-y. Maybe you did some other kind of profiling to disambiguate why the slower hash was "slow" in context? (Obviously everything is slower than nothing, but the xor-shr thing looked pretty fast to me compared to even a few loops & probes.) Or maybe you just leapt to a conclusion?

I'd bet there are plenty of choices that could have been in a little library of 3..5 well-vetted fast integer hashes, one of which would have solved this guy's problem. (Or at the very least made him never notice his original small bias problem, which he then focused into basically a harder-to-resist "attack".) Such a little library may seem beside the point with the PRNG probes, but you would still prefer not to collide if there's any LP segment keeping any locality (which you would prefer).
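Instrumenting probe counts rather than wall time, as suggested above, can be as simple as this toy linear-probing loop. Everything here is illustrative: `toyHash` and `probeCounts` are hypothetical names, and `toyHash` is a stand-in for whatever hash is being evaluated.

```nim
proc toyHash(x: uint64): uint64 = x   # identity hash, worst case here

proc probeCounts(keys: seq[uint64], size: int): seq[int] =
  ## Record how many linear-probe steps each insertion needs;
  ## the distribution, not the clock, shows hash quality.
  var occupied = newSeq[bool](size)
  let mask = uint64(size - 1)         # size must be a power of two
  for k in keys:
    var i = int(toyHash(k) and mask)
    var probes = 1
    while occupied[i]:
      i = (i + 1) and (size - 1)      # linear probing, wrap around
      inc probes
    occupied[i] = true
    result.add probes

# Keys that are multiples of the table size all mask to slot 0,
# so each insertion probes one step further than the last:
doAssert probeCounts(@[0'u64, 16, 32, 48], 16) == @[1, 2, 3, 4]
```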
And you still want a warning system for attackers who can engineer the same whole hash code... maybe not for integer keys with the identity hash, but for all other client code that shares that impl, especially …

So, in short, the resolutions that could have addressed this besides PRNG probing should all still happen, just "less soon". With those things, the restructuring would not have been necessary, there would be no questions about aging or benchmarking - because you can't do better than "1" for DRAM hits (we do get that one with your limit, though) and 0 for extra space pressure - and no (new) whack-a-moles. It's not quite a mathematical theorem because of diverse use cases & workloads, some in cache, but it's not actually so far from one.

Totally off to the side, WTF with renaming …

Anyway, I'm busily coding up my own library now. So, I'm going to be less noisy about all this, and that's something positive to come out of it for me, at least. Apologies for the verbosity, but enough things were being said that made no sense that I didn't feel y'all were tracking important things { except Araq, who clearly knows his stuff but mostly seems to wish he wrote a B-Tree 15 years ago and never put hash tables in the stdlib :-D }. How I respond to such a feeling: "just try saying it a little differently", but then people tune out the walls of text. Oh well. Good luck and best wishes to all!
…nges (#13816)

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes, which has still been causing bugs over a month later (e.g., two days ago, #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794. Alternative/non-mandatory stronger integer hashes (or, vice versa, opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing, such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes, which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard-coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`.
* Fix `data.len` -> `dataLen` problem.
* This is an alternate resolution to #13393 (which arguably could be resolved outside the stdlib). Add version 1 of Wang Yi's hash specialized to 8-byte integers. This gives simple help to users having trouble with overly colliding hash(key)s. I.e., A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))` in the instantiation context of a `HashSet` or `Table`, or B) more globally, compile with `nim c -d:hashWangYi1`.
  No hash can be all things to all use cases, but this one is A) vetted to scramble well by the SMHasher test suite (a necessarily limited but far more thorough test than prior proposals here), B) only a few ALU ops on many common CPUs, and C) possesses an easy "grade school multi-digit multiplication" fallback for weaker deployment contexts. Some people might want to stampede ahead unbridled, but my view is that a good plan is to A) include this in the stdlib for a release or three to let people try it on various key sets nim-core could realistically never access/test (maybe mentioning it in the changelog so people actually try it out), B) have them report problems (if any), C) if all seems good, make the stdlib more novice friendly by adding `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1` with some `when defined` rearranging so users can `-d:hashIdentity` if they want the old behavior back. This plan is compatible with any number of competing integer hashes if people want to add them. I would strongly recommend they all *at least* pass the SMHasher suite, since the idea here is to become more friendly to novices who do not generally understand hashing failure modes.
* Re-organize to work around `when nimvm` limitations; add some tests; add a changelog.md entry.
* Add less-than-64-bit-CPU `when` fork.
* Fix decl-instead-of-call typo.
* First attempt at fixing range error on 32-bit platforms: still do the arithmetic in doubled-up 64-bit, but truncate the hash to the lower 32 bits, then still return `uint64` to be the same. So, type-correct but truncated hash value. Update `thashes.nim` as well.
* A second try at making 32-bit mode CI work.
* Use a more systematic identifier convention than Wang Yi's code.
* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a very long time, I think). `$a`/`$b` depend on iteration order, which varies with table-range-reduced hash order, which varies with range for some `hash()`. With 3 elements, 3! = 6 is small and we've just gotten lucky with past experimental `hash()` changes. An alternate fix here would be to not stringify but use the HashSet operators, but it is not clear that doesn't alter the "spirit" of the test.
* Fix another stringified test depending upon hash order.
* Oops - revert the string-keyed test.
* Fix another stringify test depending on hash order.
* Add a better-than-always-zero `defined(js)` branch.
* It turns out to be easy to just work all in `BigInt` inside JS and thus guarantee the same low-order bits of output hashes (for `isSafeInteger` input numbers). Since `hashWangYi1` output bits are equally random in all their bits, this means that tables will be safely scrambled for table sizes up to 2**32 or 4 gigaentries, which is probably fine, as long as the integer keys are all < 2**53 (also likely fine). (I'm unsure why the infidelity-with-C/C++-back-ends cutoff is 32, not 53 bits.) Since HashSet & Table only use the low-order bits, a quick corollary of this is that `$` on most int-keyed sets/tables will be the same in all the various back ends, which seems a nice-to-have trait.
* These string hash tests fail for me locally. Maybe this is what causes the CI hang for testament pcat collections?
* Oops. That failure was from me manually patching string hash in hashes. Revert.
* Import more test improvements from #13410
* Fix bug where I swapped order when reverting the test. Ack.
* Oh, just accept either order like more and more hash tests.
* Iterate in the same order.
* `return` inside `emit` made us skip `popFrame`, causing weird troubles.
* Oops - do Windows branch also.
* `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic resolutions are "1 == identity", 1 for Nim Version 1, 1 for first/simplest/fastest in a series of possibilities. Should be very easy to remember.)
* Re-organize `when nimvm` logic to be a strict `when`-`else`.
* Merge other changes.
* Lift constants to a common area.
* Fall back to identity hash when `BigInt` is unavailable.
* Increase timeout slightly (probably just real-time perturbation of CI system performance).
* Add neglected API call `find` to heapqueue.
* Add a changelog.md entry, `since` annotation, and rename parameter to be `heap` like all the other procs, for consistency.
* Add missing import.
A long time ago I noticed that the Tables and Sets collections took a long time to add values, and I decided to investigate further.
I noticed that the problem was: the higher the most significant set bit of an integer, the slower the insertion of the value.
Using uint32, inserting values that fit in the lowest 16 bits is about 14 times faster than inserting values that use the highest 16 bits.
Using uint64, the situation is more serious: inserting values that fit in the low 32 bits is about 256 times faster than inserting values that use the highest 32 bits, for the Sets collection. For the Tables collection, the difference increases to 348 times.
I will not put sample code here, because I created a GitHub repository for the problem, where it is possible to see the code used and even a quick workaround using seqs and sort, as well as code using similar collections in other languages to do the same work, which shows much more consistent execution times.
It is a problem related or identical to the one treated in #11764.
Example and Current Output
Visit this link https://github.com/rockcavera/nim-problem
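For readers who do not want to visit the repository, a minimal sketch of the shape of the reported workload (key counts and the `fill` helper are illustrative, not the repository's actual benchmark code):

```nim
import sets

# Insert the same number of keys, once confined to the low 32 bits and
# once using only the high 32 bits; with the affected stdlib versions,
# the second population was reported to insert orders of magnitude
# more slowly.
proc fill(keys: seq[uint64]): HashSet[uint64] =
  result = initHashSet[uint64]()
  for k in keys:
    result.incl(k)

var lowKeys, highKeys: seq[uint64]
for i in 1'u64 .. 10_000'u64:
  lowKeys.add i            # low-bit values
  highKeys.add i shl 32    # high-bit values
doAssert fill(lowKeys).len == 10_000
doAssert fill(highKeys).len == 10_000
```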
Expected Output
A less significant difference is expected between the additions of values, that is, closer execution times; the same occurs in other languages.
Possible Solution
Additional Information