Adaptive hashing #1796
Conversation
I just wanted to say that this RFC was a delight to read. I'm not familiar with robin hood hashing, and so the part on choosing constants was foreign to me, but regardless, I felt the RFC made it really easy to see the motivation, benefits, and implementation. Thanks!
About this complexity: one could imagine a two-tier hash map infrastructure. The bottom tier implements the hash map but just "throws complaints" if it "gets upset" about probing length or whatever, while the upper tier acts as "combinators" that can decide to rebuild with a new hash function and/or hash table. That's not simpler, but it would allow more tweaks, including switching to another hash table entirely.
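A rough sketch of this two-tier idea, with hypothetical names throughout: the bottom tier only reports that probing got long, and the upper tier owns the policy of rebuilding with a new hash function or table.

```rust
/// Complaint emitted by the bottom tier when a probe sequence exceeds
/// its comfort threshold (hypothetical type for this sketch).
struct Complaint {
    probe_length: usize,
}

/// Bottom tier: a plain table that complains instead of deciding policy.
trait RawTable<K, V> {
    /// Insert, returning a complaint if probing got "upsetting".
    fn insert(&mut self, key: K, value: V) -> Option<Complaint>;
    fn lookup(&self, key: &K) -> Option<&V>;
}

/// Upper tier: a combinator that reacts to complaints, e.g. by rebuilding
/// `table` with a stronger hash function, or migrating to another table.
struct Adaptive<T> {
    table: T,
    rebuilt_with_safe_hasher: bool,
}
```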
> ## Choosing hash functions
>
> For hashing integers, the best choice is a mixer similar to the one used in SipHash’s finalizer. For
> strings and slices of integers, we will use FarmHash. (The Hasher trait must allow one-shot hashing
No design that I know of has resolved the issue that we run into with one-shot hashing, even for a few select types: custom types that define Hash exactly like strings/slices do.
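To make this point concrete: nothing stops a user type from producing exactly the byte stream a string produces, so a one-shot fast path keyed on the concrete type can't be cleanly separated from user `Hash` impls. A minimal illustration (`MyStr` is a hypothetical type, not from the RFC):

```rust
use std::hash::{Hash, Hasher};

struct MyStr(String);

impl Hash for MyStr {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Delegates to String's impl, so from the Hasher's perspective this
        // is byte-for-byte identical to hashing the inner string directly.
        self.0.hash(state);
    }
}
```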
text/0000-adaptive-hashing.md (outdated):

> creating a HashMap from a given list of keys:
>
> ```rust
> fn make_map(keys: Vec<usize>) => HashMap<usize> {
> ```
typo on `=>` (should be `->`)
text/0000-adaptive-hashing.md (outdated):

> the probability of having a run length of more than 640 buckets may be higher than the probability
> we want, it should be low enough.
>
> <img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/lookup_cost.png">
These might want to get encoded as base64-encoded data URIs, so the RFC is self-contained.
text/0000-adaptive-hashing.md (outdated):

> One day, we may want to hash HashMaps. The hashing infrastructure can be changed to allow it. The
> implementation of Hash for HashMap can hash the hashes stored in every map, rather than However,
rather than what?
text/0000-adaptive-hashing.md (outdated):

> Currently, HashMaps have nondeterministic order of iteration by default. This is seen as a good
> thing, because programmers that test programs won’t learn to rely on a fixed iteration order.
nit: “programmers that test programs” makes little grammatical sense. Maybe:

> because programmers then ensure programs do not rely on a fixed iteration order
text/0000-adaptive-hashing.md (outdated):

> - We can restrict adaptive hashing to integer keys. With this limitation, we don't need Farmhash in
>   the standard library.
> - We can use some other fast one-shot hasher instead of Farmhash.
> - We can add use an additional fast hash function for fast streaming hashing. The improvement would
nit: “add use” maybe should be either “add” or “use”.
Could you expand on what the APIs will look like here? I'm guessing we'd add a second hasher type parameter. I also agree with @bluss that supporting one-shot hashing is very much not a thing that should be mentioned offhand as something required before this RFC can be implemented.
There are a bunch of subtle concerns here, like: Is FarmHash problematic for certain embedded targets due to its large code size? It's easy to switch hash functions, but you might not want this two-hash-function version in that setting either. Is this extra table rebuild problematic for some scenarios? At present, you avoid any rebuilds by specifying the capacity correctly, but now you must disable this switching behavior too. If you're switching hash functions, then maybe you should switch from robin hood hashing to cuckoo hashing too?
@sfackler The signature of HashMap will stay the same.

@burdges I don't think having two tiers makes sense. There is no other, better implementation I can imagine.
As far as I know, people writing for embedded targets already remember to optimize for code size, and changing the default is easy. There should be a guide for embedded programming in Rust, where optimization is mentioned. SipHash alone adds some code size, because its 1+3 rounds are inlined for speed, and there are few 64-bit embedded systems, so it's not good for them. We need measurements if you think code size is a priority.
Excellent question. Adaptive hashing may be a problem for realtime systems. I suggest changing HashMap's load factor to 85% and increasing the threshold for run length by a small number, so that having any reallocations is less likely. I hope changing the hasher or getting a HashMap from an external crate is easy enough in case you need a 100% guarantee of cheap insertion. Changing the load factor is proposed in rust-lang/rust#38003.

One more thing: in the RFC, the probability distribution of run lengths is wrong. I have an idea how to fix the math. Soon, I'll make other corrections to the RFC too.
So is …
@sfackler No, but …
Ok, so could you write down what is actually changing, and in what way, since it is apparently not clear from the RFC as written?
Abstract the map and set interfaces into traits. I'd say the drawback is that it's kinda boost-ish to do this, and it involves lots more discussion over what should be in this new trait vs what should be specific to implementations.
The kinds of traits necessary to cover set and map functionality aren't definable in a reasonable way without HKT.
With the early resizing this indirectly helps with #36481, right?
There is no need for adaptive hashing for "simple keys", because the mix function from this RFC could easily be extended to a strong enough seeded variant. If we have a 128-bit seed, then we can simply apply the "Even–Mansour scheme": `self.hash = mix(msg_data ^ seed.k0) ^ seed.k1;`

But adaptive hashing could still be useful for complex keys, so this RFC needs a fast hash function for complex keys. contain-rs/hashmap2#5 (comment)
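A minimal sketch of this scheme, assuming MurmurHash3's fmix64 as a stand-in for the RFC's SipHash-finalizer-style integer mixer (the RFC's actual mixer and constants may differ):

```rust
/// 128-bit seed split into two whitening keys, mirroring `seed.k0`/`seed.k1`.
struct Seed {
    k0: u64,
    k1: u64,
}

/// Stand-in integer mixer: MurmurHash3's fmix64 finalizer.
fn mix(mut x: u64) -> u64 {
    x ^= x >> 33;
    x = x.wrapping_mul(0xff51afd7ed558ccd);
    x ^= x >> 33;
    x = x.wrapping_mul(0xc4ceb9fe1a85ec53);
    x ^= x >> 33;
    x
}

/// Even–Mansour: whiten the input with k0, apply the public permutation,
/// then whiten the output with k1.
fn seeded_mix(msg_data: u64, seed: &Seed) -> u64 {
    mix(msg_data ^ seed.k0) ^ seed.k1
}
```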
Why do you think FarmHash doesn't exploit ILP? Your SeaHash doesn't differ much from xxHash or MetroHash. I'm not saying SeaHash is bad. It is just not unique. What if I say almost the same performance could be achieved without multiplication? What if I say the first xor-shift from block N+1 cancels the result of the third xor-shift from block N? And if you don't care about hash-flood attacks, then there is no need for the second per-block multiplication (and shift), so your function could be twice as fast.
> Why do you think FarmHash doesn't exploit ILP?

It does, but only partially. It is not as optimized as it could be.

> Your SeaHash doesn't differ much from xxHash or MetroHash.

Right, the differences are minor, but it is still able to beat both in performance.

> It is just not unique.

I never claimed it was ;). It is a fairly standard highly-optimized DM construction.

> What if I say almost the same performance could be achieved without multiplication?

You can't. You need to look at the avalanche diagram.

> What if I say the first xor-shift from block N+1 cancels the result of the third xor-shift from block N?

It doesn't. There is a multiplication between them.

> There is no need for the second per-block multiplication (and shift), so your function could be twice as fast.

The diffusion function could be, yes, but you have destroyed all of its statistical properties. The comments in the code contain some fairly detailed information about why I chose to do it this way. Essentially, I need to move entropy up, then down, then up again. I wrote about the design here: http://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/
Here's the construction you're describing: … That's bad: flipping high bits won't flip lower ones.

So, you could ask, why does quality matter for adaptive hashing. It matters because you don't want to accidentally switch to the secure hash function just because there is some unexpected bias in the output, and when you lose entropy every round, it's pretty bad.
> You can't. You need to look at the avalanche diagram.

Yes, I can. I've already done it. I achieve 3.3 bytes per clock without multiplication, while the best multiplicative hash achieves 4.4 bytes/clock. Even the non-multiplicative SpookyHash achieves 4 bytes per cycle, but it uses 12 variables for state, and I use 4 (with no hidden temporary variables needed for the xor-shifts). I haven't published it yet, but if you want, I will.

> It doesn't. There is a multiplication between them.

There is no multiplication between blocks. And the first xor-shift from block N+1 cancels the result of the third xor-shift from block N.

> ...you have destroyed all of its statistical properties.

There is no need for great statistical properties in inter-block diffusion unless you're concerned about security. But if you are, then you should add at least proper seeding. But yes, a single multiplication is not OK. I take that suggestion back. It's too easy to construct an inter-block collision.
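The cancellation claim is easy to sanity-check for the special case of two adjacent xor-shifts by the same amount `s` with `2*s >= 64`: such a xor-shift is an involution, so two in a row with nothing between them are a no-op. A self-contained check (my own illustration, not code from either hash function):

```rust
// x ^ (x >> 32) applied twice on a u64 restores the original value:
// the second application cancels the first, because the residual term
// (conceptually x >> 64) vanishes once 2*s reaches the word width.
fn xorshift32(x: u64) -> u64 {
    x ^ (x >> 32)
}

fn main() {
    let x = 0x0123_4567_89ab_cdef_u64;
    assert_eq!(xorshift32(xorshift32(x)), x);
    println!("xor-shift by 32 is an involution on u64");
}
```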
Those numbers are wrong. SpookyHash doesn't outperform SeaHash nor xxHash (I've benchmarked it), and it doesn't matter if you can remove a cycle or two. If you want, you could replace the whole function by the identity and you would get excellent benchmarks™, but the fact of the matter is that quality is worse. When you remove the multiplication as you do, you also remove the BIC invariant as well as the avalanche criterion.

> The first xor-shift from block N+1 cancels the result of the third xor-shift from block N.

Now, that's a true observation. It's an implementation mistake, because …
Btw, look at the non-assembler version of the Golang hash function. It also uses two multiplications per block, but it uses a single rotation instead of three xor-shifts. And it is seeded. In fact, I really believe that a properly seeded Golang hash function (and yes, SeaHash also) will be as safe as SipHash for usage in a hash table.
You're wrong here. It is trivial to break. I can generate collisions easily for Golang's hash, xxHash, and SeaHash alike.
@funny-falcon I read about the Even–Mansour scheme. It won't be secure with incremented keys. Currently, the seeds in two maps may differ by the least significant bit. With known iteration order for one map, an attacker could choose the first N entries from that map and increment their keys. Then, constructing a map with these keys takes O(n^2) time. This O(n^2) blowup is similar to the one in rust-lang/rust#36481. So there's a tradeoff: the scheme may work, but we must get the seeds from an RNG.

@arthurprs The early resize is fine by itself. I would rather keep the early resize restricted to maps with the default hasher. This way, the flag can be moved into RandomState. It just needs 2 special values of SipHash state.

@seanmonstar I'm glad you liked it.
@pczarn Instead of incrementing, a PRNG could be used. Xorshift generators could be taken to efficiently mix the bits of a state of any size (1×32-bit, 2×32-bit, 1×64-bit, 2×64-bit, or even greater). http://xoroshiro.di.unimi.it/
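A sketch of that suggestion, assuming Marsaglia's classic xorshift64 step (shift triple 13/7/17; the state must be nonzero) in place of a plain increment, so that per-map seeds differ in many bits rather than only the lowest one:

```rust
/// One step of Marsaglia's xorshift64 generator; a full-period permutation
/// of the nonzero 64-bit states.
fn xorshift64(state: &mut u64) -> u64 {
    debug_assert_ne!(*state, 0, "xorshift state must be nonzero");
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut seed_state = 0x9e37_79b9_7f4a_7c15_u64; // arbitrary nonzero start
    // Hypothetical per-map keys for the seeded mixer sketched earlier.
    let (k0, k1) = (xorshift64(&mut seed_state), xorshift64(&mut seed_state));
    println!("k0 = {:016x}, k1 = {:016x}", k0, k1);
}
```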
I am still waiting on an update which concretely outlines how the public-facing API is going to change.
I think the proposal works under the current public API. It's only implemented for the default hasher. IMO, exposing it as a public API would be way too cumbersome.
Another approach is simply providing an API to make the hashing more efficient by reducing the number of times it needs to be done.

Idea 1: I think the …

Idea 2: Use a fast hasher in the table itself, but use a wrapper type for keys that adds a more secure hash value. If you store these prehashed wrappers, then your slow hashing operations correspond 1-to-1 with user input, while you can speed up all the repeated hashing operations required by your algorithm (see the sketch below).

In this vein, has anyone written this adaptive hashing scheme as a crate that provides a wrapper on HashMap?
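A sketch of "Idea 2" with hypothetical names: pay the slow, secure hash once per key and cache it, so the table's fast hasher only ever re-hashes the cached 64-bit value on lookups and rebuilds.

```rust
use std::hash::{BuildHasher, Hash, Hasher};

/// Key wrapper carrying a hash computed once by a secure hasher.
#[derive(PartialEq, Eq)]
struct Prehashed<K> {
    hash: u64,
    key: K,
}

impl<K: Hash> Prehashed<K> {
    /// Hash `key` once with the given (slow, secure) hasher factory.
    fn new<S: BuildHasher>(key: K, secure: &S) -> Self {
        let mut h = secure.build_hasher();
        key.hash(&mut h);
        Prehashed { hash: h.finish(), key }
    }
}

impl<K> Hash for Prehashed<K> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Only the cached value is fed to the table's fast hasher.
        state.write_u64(self.hash);
    }
}
```

A map would then look something like `HashMap<Prehashed<String>, V, SomeFastBuildHasher>`, with the secure hash paid exactly once per distinct user input.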
The public-facing API is not going to change. The RFC does not propose any such API. The user can't configure anything. For example, think of specializing the implementation for the default hasher.
@pczarn Ping on updating the RFC?
AFAICT the WIP PR doesn't include the hashing changes, just an adjustment to the load factor if a long chain is encountered.
If I understand, anyone wanting to avoid this complexity just specifies their own hasher.

Would this be implemented by making …? In that case, an attacker can reliably produce like 8 cache misses per query without setting …

I think you should consider using something like … I suppose …
Adaptive hashmap implementation

All credits to @pczarn, who wrote rust-lang/rfcs#1796 and contain-rs/hashmap2#5

**Background**

The Rust std lib hashmap puts a strong emphasis on security. We made some improvements in #37470, but in some very specific cases and for non-default hashers it's still vulnerable (see #36481). This is a simplified version of the rust-lang/rfcs#1796 proposal, sans switching hashers on the fly and other things that require an RFC process and further decisions. I think this part has great potential by itself.

**Proposal**

This PR adds code checking for extra-long probe and shift lengths (see the code comments and rust-lang/rfcs#1796 for details). When those are encountered, the hashmap will grow (even if the capacity limit is not reached yet), greatly attenuating the degenerate performance case. We need a lower bound on the minimum occupancy that may trigger the early resize, otherwise in extreme cases it's possible to turn the CPU attack into a memory attack. The PR code puts that lower bound at half of the max occupancy (defined by ResizePolicy). This reduces the protection (it could potentially be exploited between 0-50% occupancy) but makes it completely safe.

**Drawbacks**

* May interact badly with poor hashers. Maps using those may not use the desired capacity.
* It adds 2-3 branches to the common insert path; luckily those are highly predictable, and there's room to shave some off in future patches.
* May complicate exposure of ResizePolicy in the future, as the constants are a function of the fill factor.

**Example**

Example code that exploits the exposure of iteration order and a weak hasher:

```rust
const MERGE: usize = 10_000usize;

#[bench]
fn merge_dos(b: &mut Bencher) {
    let first_map: $hashmap<usize, usize, FnvBuilder> = (0..MERGE).map(|i| (i, i)).collect();
    let second_map: $hashmap<usize, usize, FnvBuilder> = (MERGE..MERGE * 2).map(|i| (i, i)).collect();
    b.iter(|| {
        let mut merged = first_map.clone();
        for (&k, &v) in &second_map {
            merged.insert(k, v);
        }
        ::test::black_box(merged);
    });
}
```

_91 is stdlib and _ad is patched (the end capacity in both cases is the same):

```
running 2 tests
test _91::merge_dos ... bench:  47,311,843 ns/iter (+/- 2,040,302)
test _ad::merge_dos ... bench:     599,099 ns/iter (+/- 83,270)
```
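The heart of the merged heuristic, reduced to a sketch (the names and the probe-length constant here are illustrative, not the patch's actual values):

```rust
/// Illustrative probe-length threshold; the real patch derives its
/// constants from the fill factor.
const LONG_PROBE_THRESHOLD: usize = 128;

/// Grow early when probing gets suspiciously long, but only above half of
/// the maximum occupancy, so the CPU attack can't be converted into a
/// memory attack by triggering endless early resizes.
fn should_resize_early(probe_len: usize, len: usize, max_occupancy: usize) -> bool {
    probe_len > LONG_PROBE_THRESHOLD && len >= max_occupancy / 2
}
```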
I couldn't find the right math to describe insertion cost, so I made measurements on a real implementation. The results are a bit disappointing. The smallest reasonable forward-shift threshold is 1500. The chance of reaching it at random is almost 10^-13. Having such a threshold means we allow hashmap construction to take …

I suppose a threshold of 1500 at a load factor equal to 0.833 is a decent choice. I'm going to change the RFC to propose it.

Also, we can have an additional check for early detection of the O(n^2) blowup to improve the performance of map merges. Whenever we detect an abnormally long chunk, we need to calculate the proportion of its length to the number of all entries in the map. If we see that the chunk is a major part of the map, we can switch to safer hashing.

The code for drawing the chart is in the IJulia notebook. Measurements are implemented here: pczarn/hashmap2@0127ceb
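That additional check might look like this sketch (the "major part of the map" fraction is an assumption; only the 1500 threshold comes from the measurements above):

```rust
/// Forward-shift threshold taken from the measurements in this thread.
const FORWARD_SHIFT_THRESHOLD: usize = 1500;

/// Switch to safer hashing when an abnormally long chunk of occupied
/// buckets makes up a large share of all entries, e.g. a quarter of them.
fn should_switch_to_safe_hashing(chunk_len: usize, num_entries: usize) -> bool {
    chunk_len >= FORWARD_SHIFT_THRESHOLD && chunk_len * 4 >= num_entries
}
```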
I spent the last hour or so running simulations (thanks for the gists) and I agree 1500 will work fairly well in practice (I got 8e-5 probability at 0.9 load factor). I missed your comment about the forward-shift math being wrong... so now I have to fix rust-lang/rust#38368.

Edit: My results for 0.9 load factor can be seen here: https://gist.github.com/arthurprs/09d3f39d4c8cbf211919dd40ad317d21
I updated the RFC. Sorry for the delay. I think only one small change is yet to be done: I need to convert the state graph to plain ASCII.
I have a couple of problems with the approach here: …
This RFC has been stalled for quite a while now. While we'd really like to see improvements along these lines, I'm going to propose to close for now, until the RFC is revised to take into account the various feedback given. Please feel free to reopen with a revision!

@rfcbot fcp close
Team member @aturon has proposed to close this. The next step is review by the rest of the tagged teams. No concerns are currently listed. Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me.
@kennytm what in particular is confusing about closing this?
@sfackler Why close, not postpone?
Postpone means "we want to do the thing proposed here, but not right now". We're happy to do something like what's proposed here whenever, but the specifics need to change.
🔔 This is now entering its final comment period, as per the review above. 🔔
The final comment period is now complete.
Ok, looks like no new conversation happened during FCP, so I'm going to close and tag as postponed. Thanks again for the RFC, @pczarn!
- rendered
- Implementation
- Visualization of Robin Hood hashing
- IJulia notebook
- Related RFC: #1666, "Extend the `Hasher` trait with `fn delimit` to support one-shot hashing"