Adaptive hashing #1796
@@ -0,0 +1,193 @@ | ||
- Feature Name: adaptive_hashing | ||
- Start Date: 2016-11-20 | ||
- RFC PR: (leave this empty) | ||
- Rust Issue: (leave this empty) | ||
|
||
# Summary | ||
|
||
Implement adaptive hashing for HashMap. Initialize hash maps using the fastest practical hash | ||
function, and fall back to SipHash in case of a potential DoS attack. | ||
|
||
# Motivation | ||
|
||
Hash DoS is a denial-of-service attack that targets hash tables: the attacker supplies keys chosen
to collide, so that operations on the table become slow. Consider creating a HashMap from a given
list of keys:
|
||
```rust
use std::collections::HashMap;

fn make_map(keys: Vec<usize>) -> HashMap<usize, usize> {
    let mut map = HashMap::new();
    for key in keys {
        // Every key maps to a placeholder value of 0.
        map.insert(key, 0);
    }
    map
}
```
|
||
Let's suppose that the `keys` array is an input that comes from the outside world. A simple case
of DoS happens when a server receives an HTTP request with thousands of deliberately chosen
parameters. Processing just one such request can take minutes.
|
||
The `keys` array is manipulated to get the slowest possible run time. In the worst case, all keys
hash to the same bucket, so we no longer benefit from hashing. Each iteration of the loop in the
example code takes O(n) time, and the entire function executes in O(n^2) time. The hash map behaves
like a dynamic array that is searched linearly. We might as well write:
|
||
```rust
fn make_map(keys: Vec<usize>) -> Vec<(usize, usize)> {
    let mut map: Vec<(usize, usize)> = vec![];
    for key in keys {
        // Linear search for an existing entry with the same key.
        if let Some(index) = map.iter().position(|&(k, _)| k == key) {
            map[index] = (key, 0);
        } else {
            map.push((key, 0));
        }
    }
    map
}
```
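The worst case can be reproduced without crafting keys by plugging in a deliberately bad hasher. The sketch below is only an illustration: `ConstantHasher`, `CollidingMap` and `make_colliding_map` are hypothetical names, and the hasher gives every key the same hash value, so each insertion has to be compared against keys that are already stored.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical hasher for illustration: every key hashes to 0,
/// which mimics an attacker who has found keys that all collide.
#[derive(Default)]
struct ConstantHasher;

impl Hasher for ConstantHasher {
    fn write(&mut self, _bytes: &[u8]) {
        // Ignore the input entirely.
    }

    fn finish(&self) -> u64 {
        0
    }
}

type CollidingMap = HashMap<usize, usize, BuildHasherDefault<ConstantHasher>>;

/// Same loop as `make_map`, but every insertion collides.
fn make_colliding_map(keys: Vec<usize>) -> CollidingMap {
    let mut map = CollidingMap::default();
    for key in keys {
        map.insert(key, 0);
    }
    map
}
```

The map stays correct under these collisions; only the running time degrades, which is why the problem is a denial of service rather than a correctness bug.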
|
||
We are only considering slow insertions, because we don’t need to worry about lookup. The cost of | ||
inserting an element includes the cost of searching for that element. Immediately after inserting an | ||
element, the cost of looking it up will be equal or smaller. Later, after some number of unrelated | ||
insertions, the cost of looking up that element will still be limited by some threshold. | ||
|
||
To prevent all Hash DoS attacks, we need to make sure that HashMap is protected. The standard | ||
library's HashMap currently uses SipHash-1-3 for all its lookups to protect from Hash DoS. | ||
Unfortunately, this comes with a tradeoff. Some people believe SipHash is too slow. They consider | ||
non-ideal performance of HashMap for small keys as its main drawback. Others see the use of SipHash | ||
as a good solution to the tradeoff between security and speed. | ||
|
||
Is SipHash really slow, and why? We can simply count the number of instructions it performs. | ||
SipHash’s round involves 14 64-bit operations. SipHash-1-3 runs one round for each 8 bytes of input, | ||
and three rounds for finalization, so it involves 16 operations for each 8 bytes of input, and 42 | ||
operations for finalization. Hashing an input of 8 bytes needs 58 operations. However, out-of-order | ||
execution allows more than one operation per cycle on modern CPUs. Also, SipHash uses simple | ||
operations, i.e. addition, bitwise rotation and XOR. Still, we can see that SipHash is relatively | ||
slow for small values. Ideally, hashing an integer should take only 7 instructions. | ||
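The count of 14 operations can be read off the round function directly. The following is a sketch of one SipHash round as given in the SipHash specification, not the standard library's code; `sip_round` is a name chosen here for illustration.

```rust
/// One SipRound over the four 64-bit state words.
/// 4 additions + 6 rotations + 4 XORs = 14 operations.
fn sip_round(v: &mut [u64; 4]) {
    v[0] = v[0].wrapping_add(v[1]); // add
    v[1] = v[1].rotate_left(13); // rotate
    v[1] ^= v[0]; // xor
    v[0] = v[0].rotate_left(32); // rotate

    v[2] = v[2].wrapping_add(v[3]); // add
    v[3] = v[3].rotate_left(16); // rotate
    v[3] ^= v[2]; // xor

    v[0] = v[0].wrapping_add(v[3]); // add
    v[3] = v[3].rotate_left(21); // rotate
    v[3] ^= v[0]; // xor

    v[2] = v[2].wrapping_add(v[1]); // add
    v[1] = v[1].rotate_left(17); // rotate
    v[1] ^= v[2]; // xor
    v[2] = v[2].rotate_left(32); // rotate
}
```

Each 8-byte block is XORed into the state once before and once after its round, which gives 14 + 2 = 16 operations per block. Finalization runs three rounds, 3 × 14 = 42, so an 8-byte input costs 16 + 42 = 58 operations.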
|
||
Several dynamic programming languages use SipHash for their hash tables. However, Rust is a systems
programming language, so the relative slowdown from hashing is more noticeable than in those
languages.
|
||
Perl uses a mechanism similar to adaptive hashing for its dictionaries implemented with chaining. | ||
Java uses chaining and changes a linked list to a binary tree when its length exceeds some | ||
threshold. | ||
|
||
Fortunately, Robin Hood hashing can be easily extended with adaptive hashing. | ||
|
||
# Detailed design | ||
## The algorithm for adaptive hashing | ||
|
||
A HashMap with adaptive hashing has two states. One state is called “fast mode” and the other is
“safe mode”. The fast mode is the initial state for HashMaps with keys of a type that can be hashed
in one shot. Otherwise, a HashMap with complex keys is always in safe mode. We switch to the safe
mode when all of the following conditions are met:
|
||
- an inserted entry's displacement >= 128, or the number of entries displaced by an inserted | ||
entry >= 512 | ||
- the load of the map is smaller than 20% | ||
- the map is in the fast mode | ||
|
||
The second condition reduces the odds of switching to safe hashing. The chance that the first | ||
condition is satisfied is tiny, and the chance that both are satisfied at the same time is | ||
negligible. Moreover, we add a flag to the map. The flag delays displacement reduction until the | ||
next insertion to make code simpler. Otherwise, rebuilding the map would invalidate our entry. | ||
The pseudocode for a function that replaces `insert` is: | ||
|
||
```
fn safeguarded_insert(map, key):
    entry = insert(map, key)
    if the entry's displacement >= 128 or the number of entries displaced by entry >= 512:
        set the flag for reducing displacement
    return entry
```
|
||
Before the next insertion operation, the state must be checked. Conveniently, the `reserve` method | ||
is always called before insertion and entry search, so we add the following code to `reserve`: | ||
|
||
```
fn reserve(map, ...):
    if the flag for reducing displacement is set and the map uses fast hashing:
        if the load of the map is higher than 20%:
            grow the map
        else:
            switch the map's hash state to safe hashing
            rebuild the map
        clear the flag for reducing displacement
    // ...
```
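The two pieces of pseudocode above can be combined into a small state machine. The Rust sketch below models only the mode-switching decisions, not a real hash table. All names (`HashMode`, `Action`, `AdaptiveState`, `after_insert`, `before_reserve`) are hypothetical, and it assumes that the flag is cleared only when the map is in fast mode, which is one reading of the pseudocode.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashMode {
    Fast,
    Safe,
}

/// What `reserve` should do before the usual capacity check.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Nothing,
    Grow,
    SwitchToSafeAndRebuild,
}

const DISPLACEMENT_THRESHOLD: usize = 128;
const SHIFTED_THRESHOLD: usize = 512;

struct AdaptiveState {
    mode: HashMode,
    reduce_displacement: bool,
}

impl AdaptiveState {
    /// Maps whose keys can be hashed in one shot start in fast mode.
    fn new(one_shot_hashable: bool) -> Self {
        let mode = if one_shot_hashable { HashMode::Fast } else { HashMode::Safe };
        AdaptiveState { mode, reduce_displacement: false }
    }

    /// Mirrors `safeguarded_insert`: only records that something looked suspicious.
    fn after_insert(&mut self, displacement: usize, shifted: usize) {
        if displacement >= DISPLACEMENT_THRESHOLD || shifted >= SHIFTED_THRESHOLD {
            self.reduce_displacement = true;
        }
    }

    /// Mirrors the check at the top of `reserve`.
    fn before_reserve(&mut self, len: usize, capacity: usize) -> Action {
        if !(self.reduce_displacement && self.mode == HashMode::Fast) {
            return Action::Nothing;
        }
        self.reduce_displacement = false;
        // A load above 20% means len / capacity > 1/5.
        if len * 5 > capacity {
            Action::Grow
        } else {
            self.mode = HashMode::Safe;
            Action::SwitchToSafeAndRebuild
        }
    }
}
```

Deferring the decision to `before_reserve` keeps the insertion path free of rebuilds, so the entry returned by an insertion is never invalidated.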
|
||
Here’s a state diagram for HashMap with adaptive hashing. The dashed edge means the state change is | ||
very unlikely, and the dotted edge means the state change is enormously unlikely. | ||
|
||
<img width="800" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/adaptive.svg"> | ||
|
||
## Choosing constants | ||
|
||
The thresholds of 128 and 512 are chosen to minimize the chance of exceeding them. In particular, we | ||
want that chance to be less than 10^-8 with a load of 90% and less than 10^-30 with a load of 20%. | ||
For displacement, the smallest k that fits our needs is 90, so we round that up to 128. For the | ||
number of forward-shifted buckets, we choose k=512. Keep in mind that the run length is a sum of the | ||
displacement and the number of forward-shifted buckets, so its threshold is 128+512=640. Even though | ||
the probability of having a run length of more than 640 buckets may be higher than the probability | ||
we want, it should be low enough. | ||
|
||
<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/lookup_cost.png"> | ||
Review comment: These might want to get encoded as base64-encoded data URIs, so the RFC is self-contained.
||
|
||
<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/run_length.png"> | ||
|
||
## Choosing hash functions | ||
|
||
For hashing integers, the best choice is a mixer similar to the one used in SipHash’s finalizer. For
strings and slices of integers, we will use FarmHash. (The Hasher trait must allow one-shot hashing
for FarmHash.) Using any other key type means your HashMap will do safe hashing.

Review comment: No design that I know of has resolved the issue that we run into with one shot hashing, even for a few select types: custom types that define Hash exactly like strings/slices do.
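One way to read the difficulty with one-shot hashing for strings is that user types can feed a hasher exactly the same data as a `str`, and code may rely on the two hashes agreeing. The sketch below shows that agreement with today's streaming `Hasher`; `Name` and `hash_of` are hypothetical names, and the example says nothing about how a one-shot design would preserve the property.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical user type that hashes exactly like the string it wraps.
struct Name(String);

impl Hash for Name {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Delegate to `str`, so the hasher sees the same input as for a plain string.
        self.0.as_str().hash(state);
    }
}

/// Hashes a value with a freshly constructed `DefaultHasher`.
fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

If `str` were hashed by a separate one-shot function while `Name` still went through the streaming interface, the two results would only match if the one-shot path were applied to `Name` as well.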
|
||
# Consequences | ||
## For the performance of Rust programs | ||
|
||
The impact is minimal on programs that rarely use HashMaps. The increase in binary size should be | ||
small. For programs that spend a large portion of their run time using HashMap with primitive keys, | ||
the speedup should be noticeable. | ||
|
||
On 32-bit platforms, the benefit of using a 32-bit hash function instead of SipHash is higher, | ||
because SipHash’s round involves 30 32-bit operations. | ||
|
||
## For the HashMap API | ||
|
||
One day, we may want to hash HashMaps. The hashing infrastructure can be changed to allow it. The
implementation of Hash for HashMap can hash the hashes stored in every map, rather than However,
adaptive hashing makes it harder to write a correct and performant implementation of Hash for
HashMap. If two HashMaps (that can be compared) have equal values, they must hash to the same
integer. However, with adaptive hashing, HashMap can switch to the safe mode, which means it no
longer stores the same hashes as other HashMaps that remain in ‘fast’ mode. The only way to handle
the situation for the safe mode is to rehash all keys as if the HashMap were in ‘fast’ mode, which
may take a significant time.

Review comment: rather than what?
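The requirement that equal maps hash to the same integer can be illustrated with a minimal sketch. It is an assumption-laden example, not a proposed implementation: `hash_map_contents` is a hypothetical helper, it rehashes every entry because stored hashes are not reachable through the public API, and it combines the results with a commutative operation so that iteration order does not matter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: combines per-entry hashes with wrapping addition
/// so that the result does not depend on iteration order.
fn hash_map_contents<K: Hash, V: Hash>(map: &HashMap<K, V>) -> u64 {
    let mut combined: u64 = 0;
    for entry in map {
        // Each entry is hashed independently with a fixed-key hasher.
        let mut hasher = DefaultHasher::new();
        entry.hash(&mut hasher);
        combined = combined.wrapping_add(hasher.finish());
    }
    combined
}
```

This is the slow path described above, since every key is rehashed with one fixed function regardless of the hash state each map currently uses.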
|
||
## For the order of iteration | ||
|
||
Currently, HashMaps have nondeterministic order of iteration by default. This is seen as a good
thing, because programmers who test their programs won’t learn to rely on a fixed iteration order.
Otherwise, programmers might not know that their programs only work with a specific iteration order.
To keep nondeterministic order, SipHash’s thread-local seed may be used for all hashers.

Review comment: nit: “programmers that test programs” makes little grammatical sense. Maybe:
|
||
# Drawbacks | ||
|
||
More complex code needs to be maintained. There’s a risk of having a bug in the algorithm or in the | ||
code. | ||
|
||
# Alternatives | ||
|
||
- We can reject adaptive hashing. SipHash-1-3 may be fast enough.
- We can restrict adaptive hashing to integer keys. With this limitation, we don't need FarmHash in
  the standard library.
- We can use some other fast one-shot hasher instead of FarmHash.
- We can use an additional fast hash function for fast streaming hashing. The improvement would
  be small.
  Review comment: nit: “add use” maybe should be either “add” or “use”.
- We can set FarmHash's seed to a random value for nondeterminism.
- When a map is emptied, its hash function does not matter anymore. As a special case, we can detect
  operations that clear maps in safe mode, and reset them back to fast mode.
- We can let users declare their types as one-shot hashable.
|
||
# Unresolved questions | ||
|
||
Is there any hasher that is faster than FarmHash?
|
||
Are the chosen thresholds reasonably low? |
Review comment: typo on `=>`