How smart should Nim sets & tables be? #201
Just as one concrete example, this program:

```nim
import metab                   # adix meta-module: compile-time switchable sets/tables
var one = initSet[int]()
for i in 0 ..< (1 shl 23):
  one.incl (i shl 25)          # only high bits vary: a worst case for identity hash
var ds = one.depths
echo ds
echo one.len, "/", one.getCap
echo "Stats: ", one.depthStats
```

produces this output (Robin Hood mode on by default, but with `hash()` output rehashing not pre-activated, as per the current version of adix):
So, you can see that early on (when depth is 7) the bad hashing is detected, then hash code rehashing is activated, and the end result is a near-perfect hash table (only one item has a probe depth greater than the minimum). The rehasher is currently a pretty simple Fibonacci-esque rotate & multiply. Your runtime results may vary a little from the above because right now the rehasher salts with the table's VM address. That last is actually a safety feature (see, e.g., https://accidentallyquadratic.tumblr.com/post/153545455987/rust-hash-iteration-reinsertion, though depth-triggered resize also fixes that particular problem. So, this would be a "belt & suspenders" solution to that blog post). It might be better to actually "stage activation" of the per-VM-addr ordering, too, maybe just for reproducibility, with a slightly slower but more scrambling-than-default identity hash. It might also make sense to have other fallback hashes. I'm open to suggestions. Mostly I think we can do a lot better than Python's hard-coded, fancier-hash-built-into-the-probe-sequence (and therefore fixed!) approach and its associated need for tombstones.
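For concreteness, here's a minimal sketch of what such a Fibonacci-esque rotate & multiply rehasher can look like; the constant, rotation count, and salt mixing below are illustrative assumptions, not adix's actual code:

```nim
import hashes  # for the Hash type

proc rehash(h, salt: Hash): Hash =
  # Multiply by a large odd Fibonacci-style constant to spread entropy
  # into the high bits, fold in a per-table salt, then rotate so those
  # high bits also reach the low bits that index the table.
  var x = cast[uint64](h) * 0x9E3779B97F4A7C15'u64
  x = x xor cast[uint64](salt)
  cast[Hash]((x shl 27) or (x shr 37))
```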
I don't like anything based on "random" memory addresses as it makes bugs much harder to reproduce. This should only be done via
Yeah. I was leaning toward a user-redefinable `hash(h, salt)`.
(In fact, I committed a comment similar to that idea 10 minutes before your comment: c-blake/adix@508e3b9, though I'm sure you hadn't seen that. I do appreciate the feedback, though. Thanks!)
You can now either `import althash; proc hash(h, salt: Hash): Hash = ...` to change how hashes are rehashed (when that is even necessary) or compile with `-d:unstableHashHash` to do so with the built-in hash rehasher.
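As a usage sketch of that hook (hedged: only the `import althash` and `hash(h, salt)` signature above are from the comment; the mixer choice here is arbitrary):

```nim
import althash, hashes  # althash from adix; hashes for the Hash type

proc hash(h, salt: Hash): Hash =
  # User-level override of the rehash hook described above. Route it
  # through any mixer you trust; std `hashes.hash` on the salted value
  # is just one illustrative choice.
  hashes.hash(h xor salt)
```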
Since we now have a good hash function for integers, I consider this RFC settled. The current implementation is good enough; Robin Hood hashing, for example, can still be introduced later, but I think this RFC can now be closed.
Ok. Seems fair.
So, I just put up a sort of sketch of what I've been trying to say abstractly at https://github.com/c-blake/adix in response to a variety of changes @timotheecour has been making to the stdlib tables, e.g. nim-lang/Nim#13440, nim-lang/Nim#13393, and nim-lang/Nim#13418.
It is pretty drop-in compatible with the stdlib sets & tables, and I tried to keep the internal names similar as well. I just added a few generic procs for the `CountTable` functionality, and sorting does not yet reconstitute the index.

I didn't commit all my test programs, but I have been using the primary linear probing with Robin Hood re-org for the past week or two as a replacement for the stdlib tables (via that `metab.nim` module, short for metaTab, which makes compile-time switching easy). I'm not sure I'm happy with a lot of things about this code -- the `getKey`/`setKey`, tables written as wrappers around sets, various things in triplicate instead of template bodies, etc.

Nevertheless, it does show concretely how we could do both unordered and ordered tables with just linear probing, with run-time per-table optional Robin Hood hashing and also run-time per-table `hash()`-rehashing, with the latter two modes "automatically activated" on overdeep probes on sparse tables. It also shows another instance of the current algorithms (I believe more cleanly, via the `probeSeq` iterator). The weak-hash mitigation is mostly checked for in a `proc tooFull` in the various `??set.nim` implementations (sketched below).

From a mathematical-property point of view, I believe the non-tombstone variants let users who know what they are doing suffer no performance penalty in time or space on delete-heavy workloads, while at the same time automatically adapting for users who are blissfully unaware of hash table performance pitfalls - up to a point. { Beyond the point we can adapt to, the user really needs to provide inequality element comparison, since there just isn't enough diversity in the output of the `hash()` they're using (by definition of "up to a point"). As far as I can tell, that inequality requirement is ineliminable. } I sort of feel like it's almost a theorem that this is a better approach for tables of unknown workloads (unless you want to force performance-sensitive people to do their own table implementations or use this library). Even so, a broader conversation than just comments on pull requests seemed warranted. Hence this RFC issue.

Beyond the above virtues, probe-depth-triggered resize (in any of the variants) has a bonus: if the user does have a great hash function, then they get to enjoy near 100% space utilization. For example, with my current hash-the-output-of-hash via this multiply-rotate gizmo, the running example of only high bits varying (like what started this whole arc of development) becomes almost a perfect hash table with Robin Hood and rehash both activated.
In terms of performance, Robin Hood usually wins big when you have a "miss-heavy" workload - e.g., building a pair of sets and then doing an intersection of them where the result is mostly empty. For a more "hit-heavy" workload, like a histogram where the counts average much greater than 1, vanilla linear probing is usually best. I don't believe there is a single algorithmic solution that is fastest for all workloads, though. Simple hot-loop timings can mislead, especially when generalizing to situations without perfectly oiled cache prefetching.
Personally, I think all the recent churn in table stuff could profitably just be unwound and replaced with advice to use such a better integer hash function, possibly with a warning from the run-time to guide users, and then possibly a few alternative hashes in the standard library under different names from just `hash`, so users can do `proc hash(x: int): Hash = hash2(x)` to activate them. Then we could add hash3, hash4 (or give them better names). The `althash` module in `adix` is sort of like that.

Anyway, there are also some NOTES.md and TODO.md and doc comments at the tops of modules.
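As a sketch of that opt-in arrangement (`hash2` and its mixer are hypothetical stand-ins, not existing stdlib names):

```nim
import hashes

proc hash2(x: int): Hash =
  # Hypothetical alternative integer hash: a multiply & rotate, unlike
  # the identity-style default this thread works around.
  let u = cast[uint64](x) * 0xD1B54A32D192ED03'u64
  cast[Hash]((u shl 32) or (u shr 32))

# The one-line opt-in suggested above: a module-local `hash` wins
# overload resolution, so sets/tables built here use hash2 for ints.
proc hash(x: int): Hash = hash2(x)
```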