- Feature Name: adaptive_hashing
- Start Date: 2016-11-20
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary

Implement adaptive hashing for HashMap. Initialize hash maps using the fastest practical hash
function, and fall back to SipHash in case of a potential DoS attack.
# Motivation

Hash DoS is a denial-of-service attack that exploits the worst-case behavior of hash tables.
Consider creating a HashMap from a given list of keys:
```rust
fn make_map(keys: Vec<usize>) -> HashMap<usize, usize> {
    let mut map = HashMap::new();
    for key in keys {
        map.insert(key, 0);
    }
    map
}
```
Let's suppose that the `keys` vector is input that comes from the outside world. A simple case
of DoS happens when a server receives an HTTP request with thousands of deliberately chosen
parameters. Processing just one such request can take minutes.

The `keys` vector is manipulated to produce the slowest possible run time. In the worst case, all
keys hash to the same bucket, so we no longer benefit from hashing: each iteration of the loop in
the example takes O(n) time, and the entire function executes in O(n^2) time. The hash map behaves
like a typical dynamic array. We might as well write:
```rust
fn make_map(keys: Vec<usize>) -> Vec<(usize, usize)> {
    let mut map = vec![];
    for key in keys {
        if let Some(index) = map.iter().position(|&(k, _)| k == key) {
            map[index] = (key, 0);
        } else {
            map.push((key, 0));
        }
    }
    map
}
```
We only need to consider slow insertions, because lookup takes care of itself. The cost of
inserting an element includes the cost of searching for that element, and immediately after
insertion, looking the element up costs at most as much. Even after some number of unrelated
insertions, the cost of looking up that element remains bounded by a threshold.

To prevent Hash DoS attacks, we need to make sure that HashMap is protected. The standard
library's HashMap currently uses SipHash-1-3 for all its lookups to protect against Hash DoS.
Unfortunately, this comes with a tradeoff. Some people believe SipHash is too slow, and consider
the non-ideal performance of HashMap for small keys its main drawback. Others see the use of
SipHash as a good compromise between security and speed.

Is SipHash really slow, and why? We can simply count the number of operations it performs. A
SipHash round involves 14 64-bit operations. SipHash-1-3 runs one round for each 8 bytes of input
and three rounds for finalization, so it performs 16 operations for each 8 bytes of input and 42
operations for finalization. Hashing an 8-byte input therefore takes 58 operations. However,
out-of-order execution allows more than one operation per cycle on modern CPUs, and SipHash uses
only simple operations: addition, bitwise rotation, and XOR. Still, we can see that SipHash is
relatively slow for small values. Ideally, hashing an integer should take only about 7
instructions.
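The operation counts above can be reproduced directly. Reading the 16 operations per block as one 14-operation round plus the two XORs that mix the message word into the state is an assumption of this sketch, not something the text spells out:

```rust
// Reproducing the SipHash-1-3 operation counts quoted above.
const ROUND_OPS: usize = 14;
// Assumption: 16 = one round plus two message-word XORs per 8-byte block.
const OPS_PER_BLOCK: usize = ROUND_OPS + 2;
const FINALIZATION_OPS: usize = 3 * ROUND_OPS; // three finalization rounds

/// Total operations for an input of `blocks` 8-byte words.
fn siphash13_ops(blocks: usize) -> usize {
    blocks * OPS_PER_BLOCK + FINALIZATION_OPS
}
```

For one 8-byte block this yields the 58 operations mentioned in the text.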
Several dynamic programming languages use SipHash for their hash tables. However, Rust is a
systems programming language, so the slowdown from hashing is more noticeable than in other
languages.

Perl uses a mechanism similar to adaptive hashing for its dictionaries, which are implemented
with chaining. Java uses chaining and converts a bucket's linked list to a binary tree when its
length exceeds a threshold.

Fortunately, Robin Hood hashing can be easily extended with adaptive hashing.
# Detailed design

## The algorithm for adaptive hashing

A HashMap with adaptive hashing has two states, called "fast mode" and "safe mode". Fast mode is
the initial state for HashMaps whose keys have a type that can be hashed in one shot. Otherwise,
a HashMap with complex keys is always in safe mode. We switch to safe mode when all of the
following conditions are met:

- an inserted entry's displacement is >= 128, or the number of entries displaced by an inserted
  entry is >= 1500
- the load of the map is smaller than 20%
- the map is in fast mode
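The conditions above can be sketched as a single predicate. All names, constants aside, are hypothetical and not taken from the actual HashMap implementation:

```rust
// Thresholds from the RFC text; everything else is illustrative.
const DISPLACEMENT_THRESHOLD: usize = 128;
const FORWARD_SHIFT_THRESHOLD: usize = 1500;
const LOAD_CUTOFF: f64 = 0.20;

#[derive(PartialEq)]
enum HashMode {
    Fast,
    Safe,
}

/// Returns true when an insertion should trigger the fallback to safe
/// (SipHash) hashing: a suspiciously long probe at a suspiciously low load.
fn should_switch_to_safe(mode: &HashMode, displacement: usize, displaced: usize, load: f64) -> bool {
    *mode == HashMode::Fast
        && (displacement >= DISPLACEMENT_THRESHOLD || displaced >= FORWARD_SHIFT_THRESHOLD)
        && load < LOAD_CUTOFF
}
```

A long probe at high load is expected Robin Hood behavior and merely grows the table; only a long probe at low load is treated as evidence of an attack.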
The second condition reduces the odds of switching to safe hashing. The chance that the first
condition is satisfied is tiny, and the chance that both are satisfied at the same time is
negligible. Moreover, we add a flag to the map that delays displacement reduction until the next
insertion, to keep the code simpler. Otherwise, rebuilding the map would invalidate our entry.
The pseudocode for a function that replaces `insert` is:

```
fn safeguarded_insert(map, key):
    entry = insert(map, key)
    if the entry's displacement >= 128 or the number of entries displaced by entry >= 1500:
        set the flag for reducing displacement
    return entry
```
Before the next insertion, the state must be checked. Conveniently, the `reserve` method is
always called before insertion and entry search, so we add the following code to `reserve`:

```
fn reserve(map, ...):
    if the flag for reducing displacement is set and the map uses fast hashing:
        if the load of the map is higher than 20%:
            grow the map
        else:
            switch the map's hash state to safe hashing
        rebuild the map
        clear the flag for reducing displacement
    // ...
```
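The same check can be sketched in Rust. The struct, its fields, and the load computation are hypothetical simplifications, not the real implementation:

```rust
// Hypothetical sketch of the state check `reserve` performs.
struct AdaptiveMap {
    reduce_displacement: bool, // flag set by a long insertion
    fast_mode: bool,
    len: usize,
    capacity: usize,
}

impl AdaptiveMap {
    fn reserve(&mut self) {
        if self.reduce_displacement && self.fast_mode {
            if self.len as f64 / self.capacity as f64 > 0.20 {
                // High load explains the long probe: just grow.
                self.capacity *= 2;
            } else {
                // Low load plus a long probe looks like an attack:
                // fall back to safe (SipHash) hashing.
                self.fast_mode = false;
            }
            self.rebuild();
            self.reduce_displacement = false;
        }
    }

    fn rebuild(&mut self) {
        // Rehash every entry with the current hash state (elided).
    }
}
```

Both branches rebuild the table, since growing also rehashes every entry.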
Here's a state diagram for HashMap with adaptive hashing. A dashed edge means the state change is
very unlikely, and a dotted edge means the state change is enormously unlikely.

<img width="800" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/adaptive.svg">
## Load factor

We decrease the load factor of `HashMap` from 0.909 to 0.833.
## Choosing constants

The thresholds of 128 and 1500 are chosen to minimize the chance of exceeding them. In
particular, we want that chance to be less than 10^-8 at a load of 90% and less than 10^-30 at a
load of 20%. For displacement, the smallest threshold that meets these bounds is 90, so we round
it up to 128. For the number of forward-shifted buckets, we choose 1500. Keep in mind that the
run length is the sum of the displacement and the number of forward-shifted buckets, so its
threshold is 128 + 1500 = 1628. We can tolerate a probability of exceeding these thresholds that
is slightly worse than the targets above.
### Lookup cost

```
At load factor 0.909
Pr{lookup cost >= 100} = 1.0e-9
Pr{lookup cost >= 128} = 3.1e-12
Pr{lookup cost >= 150} = 3.3e-14
```

```
At load factor 0.833
Pr{lookup cost >= 100} = 4.1e-16
Pr{lookup cost >= 128} = 2.0e-20
Pr{lookup cost >= 150} = 8.0e-24
```

```
At load factor 0.2
Pr{lookup cost >= 100} = 6.2e-116
```
### Forward shift cost

At a load factor near the current limit of 0.909, the cost of forward shift is too high to allow
it.

```
At load factor 0.833
Pr{forward shift cost >= 1200} = 2.6e-10
Pr{forward shift cost >= 1500} = 4.1e-12
Pr{forward shift cost >= 1800} = 6.4e-14
```
## Choosing hash functions

For hashing integers, the best choice is a mixer similar to the one used in SipHash's finalizer.
For strings and slices of integers, we will use FarmHash. (The `Hasher` trait must allow one-shot
hashing for FarmHash.) Any other key type means the HashMap will use safe hashing.
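As an illustration of such an integer mixer, here is the finalization step of SplitMix64, a well-known function in the same spirit; the RFC does not prescribe this exact mixer:

```rust
// One-shot integer mixer of the kind described above: a short sequence of
// xor-shifts and multiplies, comparable in cost to the ~7 instructions
// mentioned earlier. This is SplitMix64's finalizer, shown only as an
// example of the technique.
fn mix_u64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^= x >> 31;
    x
}
```

Like SipHash's finalizer, it alternates multiplication with xor-shifts so that every input bit influences every output bit.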
# Consequences

## For the hashing API

This RFC does not propose any public-facing changes to the hashing infrastructure.
## For the performance of Rust programs

The impact is minimal on programs that rarely use HashMaps. The load factor's new value is well
within the reasonable range. The increase in binary size should be small. For programs that spend
a large portion of their run time using HashMaps with primitive keys, the speedup should be
noticeable.

On 32-bit platforms, the benefit of using a 32-bit hash function instead of SipHash is greater,
because each SipHash round involves 30 32-bit operations.
## For the HashMap API

One day, we may want to hash HashMaps themselves. The hashing infrastructure can be changed to
allow it: the implementation of `Hash` for HashMap can hash the hashes stored in the map, rather
than hashing the contents of each key. However, adaptive hashing makes it harder to write a
correct and performant implementation of `Hash` for HashMap. If two comparable HashMaps have
equal contents, they must hash to the same value. With adaptive hashing, though, a HashMap can
switch to safe mode, which means it no longer stores the same hashes as other HashMaps that
remain in fast mode. The only way to handle a map in safe mode is to rehash all keys as if the
HashMap were in fast mode, which may take significant time.
## For the order of iteration

Currently, HashMaps have a nondeterministic order of iteration by default. This is seen as a good
thing, because testing will catch code that relies on a specific iteration order; otherwise,
programmers might not notice that their programs only work with one fixed iteration order. To
keep the order nondeterministic, SipHash's thread-local seed may be used for all hashers.
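One way this could look, sketched with the standard library's existing `RandomState` as the entropy source (an assumption of this sketch, not a decision of the RFC):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

thread_local! {
    // Per-thread seed derived once from libstd's randomly keyed hasher,
    // then shared by every fast hasher created on this thread.
    static FAST_HASH_SEED: u64 = {
        let mut h = RandomState::new().build_hasher();
        h.write_u64(0);
        h.finish()
    };
}

/// Seed handed to fast one-shot hashers: stable within a thread,
/// different across program runs.
fn fast_hash_seed() -> u64 {
    FAST_HASH_SEED.with(|s| *s)
}
```

Because the seed differs between runs, fast-mode iteration order stays nondeterministic, preserving the property discussed above.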
# Drawbacks

More complex code needs to be maintained. There's a risk of having a bug in the algorithm or in
the code.
# Alternatives

- We can reject adaptive hashing. SipHash-1-3 may be fast enough.
- We can restrict adaptive hashing to integer keys. With this limitation, we don't need FarmHash
  in the standard library.
- We can use some other fast one-shot hasher instead of FarmHash.
- We can use an additional fast hash function for fast streaming hashing. The improvement would
  be small.
- We can set FarmHash's seed to a random value for nondeterminism.
- When a map is emptied, its hash function no longer matters. As a special case, we can detect
  operations that clear maps in safe mode, and reset them back to fast mode.
- We can let users declare their types as one-shot hashable. The following public trait may allow
  such one-shot hashing.
```rust
#[cfg(not(target_pointer_width = "64"))]
type ShortHash = u32;
#[cfg(target_pointer_width = "64")]
type ShortHash = u64;

trait OneshotHashable {
    fn hash(&self) -> ShortHash;
}
```
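A hypothetical impl of this trait for a user type might look like the following. `UserId` and the Fibonacci multiplier are invented for illustration, and the trait is repeated so the snippet compiles on its own:

```rust
// Repeating the sketched trait from above so this snippet is self-contained.
#[cfg(not(target_pointer_width = "64"))]
type ShortHash = u32;
#[cfg(target_pointer_width = "64")]
type ShortHash = u64;

trait OneshotHashable {
    fn hash(&self) -> ShortHash;
}

// Hypothetical user type opting into one-shot hashing. The multiplier is
// only an illustrative mixer, not a recommendation.
struct UserId(u64);

impl OneshotHashable for UserId {
    fn hash(&self) -> ShortHash {
        self.0.wrapping_mul(0x9e37_79b9_7f4a_7c15) as ShortHash
    }
}
```

A type implementing the trait would let its HashMaps start in fast mode, just like maps with integer keys.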
# Unresolved questions

Is there any hasher that is faster than FarmHash?

Are the chosen thresholds reasonably low?
# Appendices

## Image for the lookup cost chart

<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/lookup_cost.png">

## Image for the forward shift cost chart

<img width="600" src="https://cdn.rawgit.com/pczarn/code/def92e19ae60b599e9620afa1bdcad1c36e6e982/rust/robinhood/extrapolated_insertion_cost_4.png">
No design that I know of has resolved the issue that we run into with one-shot hashing, even for a few select types: custom types that define `Hash` exactly the way strings and slices do.