From 4280916d6dc7d7c1d802519d51928d97ca2cadb1 Mon Sep 17 00:00:00 2001
From: Piotr Czarnecki <pioczarn@gmail.com>
Date: Thu, 7 Jul 2016 12:40:20 +0200
Subject: [PATCH 1/4] RFC: Adaptive hashing

---
 text/0000-adaptive-hashing.md | 193 ++++++++++++++++++++++++++++++++++
 1 file changed, 193 insertions(+)
 create mode 100644 text/0000-adaptive-hashing.md
diff --git a/text/0000-adaptive-hashing.md b/text/0000-adaptive-hashing.md
new file mode 100644
index 00000000000..a65bb282c06
--- /dev/null
+++ b/text/0000-adaptive-hashing.md
@@ -0,0 +1,193 @@
+- Feature Name: adaptive_hashing
+- Start Date: 2016-11-20
+- RFC PR: (leave this empty)
+- Rust Issue: (leave this empty)
+
+# Summary
+
+Implement adaptive hashing for HashMap. Initialize hash maps using the fastest practical hash
+function, and fall back to SipHash in case of a potential DoS attack.
+
+# Motivation
+
+Hash DoS is an example of a DoS attack. The goal of DoS attacks is a denial of service. Consider
+creating a HashMap from a given list of keys:
+
+```rust
+    fn make_map(keys: Vec<usize>) => HashMap<usize> {
+        let mut map = HashMap::new();
+        for key in keys {
+            map.insert(key, 0);
+        }
+        map
+    }
+```
+
+Let's suppose that the `keys` array is an input that comes from the outside world. A simple case
+of DoS happens when a server receives an HTTP request with thousands of deliberately chosen
+parameters. Processing just one such request can take minutes.
+
+The `keys` array is manipulated to get the slowest possible run time. In the worst case, all keys
+hash to the same bucket, so we no longer benefit from hashing. Each iteration of the loop in the
+example code takes O(n) time. The entire function executes in O(n**2) time. The hash map behaves
+like a typical dynamic array. We might as well write:
+
+```rust
+    fn make_map(keys: Vec<usize>) => HashMap<usize> {
+        let mut map = vec![];
+        for key in keys {
+            if let Some(index) = map.position(|(k, _)| k == key) {
+                map[index] = 0;
+            } else {
+                map.push(0);
+            }
+        }
+    }
+```
+
+We are only considering slow insertions, because we don’t need to worry about lookup. The cost of
+inserting an element includes the cost of searching for that element. Immediately after inserting an
+element, the cost of looking it up will be equal or smaller. Later, after some number of unrelated
+insertions, the cost of looking up that element will still be limited by some threshold.
+
+To prevent all Hash DoS attacks, we need to make sure that HashMap is protected.  The standard
+library's HashMap currently uses SipHash-1-3 for all its lookups to protect from Hash DoS.
+Unfortunately, this comes with a tradeoff. Some people believe SipHash is too slow. They consider
+non-ideal performance of HashMap for small keys as its main drawback. Others see the use of SipHash
+as a good solution to the tradeoff between security and speed.
+
+Is SipHash really slow, and why? We can simply count the number of instructions it performs.
+SipHash’s round involves 14 64-bit operations. SipHash-1-3 runs one round for each 8 bytes of input,
+and three rounds for finalization, so it involves 16 operations for each 8 bytes of input, and 42
+operations for finalization. Hashing an input of 8 bytes needs 58 operations. However, out-of-order
+execution allows more than one operation per cycle on modern CPUs. Also, SipHash uses simple
+operations, i.e. addition, bitwise rotation and XOR. Still, we can see that SipHash is relatively
+slow for small values. Ideally, hashing an integer should take only 7 instructions.
+
+Several dynamic programming languages use SipHash for their hash tables. However, Rust is a systems
+programming language. The slowdown from hashing is more noticeable than in other languages.
+
+Perl uses a mechanism similar to adaptive hashing for its dictionaries implemented with chaining.
+Java uses chaining and changes a linked list to a binary tree when its length exceeds some
+threshold.
+
+Fortunately, Robin Hood hashing can be easily extended with adaptive hashing.
+
+# Detailed design
+## The algorithm for adaptive hashing
+
+A HashMap with adaptive hashing has two states. One state is called “fast mode” and the other is
+“safe mode”. The fast mode is the initial state for HashMaps with keys of a type that can be hashed
+in one shot. Otherwise, a HashMap with complex keys is always in safe mode.  We switch to the safe
+mode when the following conditions are met:
+
+- an inserted entry's displacement >= 128, or the number of entries displaced by an inserted
+  entry >= 512
+- the load of the map is smaller than 20%
+- the map is in the fast mode
+
+The second condition reduces the odds of switching to safe hashing. The chance that the first
+condition is satisfied is tiny, and the chance that both are satisfied at the same time is
+negligible. Moreover, we add a flag to the map. The flag delays displacement reduction until the
+next insertion to make code simpler. Otherwise, rebuilding the map would invalidate our entry.
+The pseudocode for a function that replaces `insert` is:
+
+```
+fn safeguarded_insert(map, key):
+  entry = insert(map, key)
+  if the entry's displacement >= 128 or the number of entries displaced by entry >= 512:
+    set the flag for reducing displacement
+  return entry
+```
+
+Before the next insertion operation, the state must be checked. Conveniently, the `reserve` method
+is always called before insertion and entry search, so we add the following code to `reserve`:
+
+```
+fn reserve(map, ...):
+  if the flag for reducing displacement is set and the map uses fast hashing:
+    if the load of the map is higher than 20%:
+      grow the map
+    else:
+      switch the map's hash state to safe hashing
+      rebuild the map
+    clear the flag for reducing displacement
+  // ...
+```
+
+Here’s a state diagram for HashMap with adaptive hashing. The dashed edge means the state change is
+very unlikely, and the dotted edge means the state change is enormously unlikely.
+
+<img width="800" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/adaptive.svg">
+
+## Choosing constants
+
+The thresholds of 128 and 512 are chosen to minimize the chance of exceeding them. In particular, we
+want that chance to be less than 10^-8 with a load of 90% and less than 10^-30 with a load of 20%.
+For displacement, the smallest k that fits our needs is 90, so we round that up to 128. For the
+number of forward-shifted buckets, we choose k=512. Keep in mind that the run length is a sum of the
+displacement and the number of forward-shifted buckets, so its threshold is 128+512=640. Even though
+the probability of having a run length of more than 640 buckets may be higher than the probability
+we want, it should be low enough.
+
+<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/lookup_cost.png">
+
+<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/run_length.png">
+
+## Choosing hash functions
+
+For hashing integers, the best choice is a mixer similar to the one used in SipHash’s finalizer. For
+strings and slices of integers, we will use FarmHash. (The Hasher trait must allow one-shot hashing
+for FarmHash.) Using any other key type means your HashMap will do safe hashing.
+
+# Consequences
+## For the performance of Rust programs
+
+The impact is minimal on programs that rarely use HashMaps. The increase in binary size should be
+small. For programs that spend a large portion of their run time using HashMap with primitive keys,
+the speedup should be noticeable.
+
+On 32-bit platforms, the benefit of using a 32-bit hash function instead of SipHash is higher,
+because SipHash’s round involves 30 32-bit operations.
+
+## For the HashMap API
+
+One day, we may want to hash HashMaps. The hashing infrastructure can be changed to allow it. The
+implementation of Hash for HashMap can hash the hashes stored in every map, rather than  However,
+adaptive hashing makes it harder to write a correct and performant implementation of Hash for
+HashMap. If two HashMaps (that can be compared) have equal values, they must hash to the same
+integer. However, with adaptive hashing, HashMap can switch to the safe mode, which means it no
+longer stores the same hashes as other HashMaps that remain in ‘fast’ mode. The only way to handle
+the situation for the safe mode is to rehash all keys as if the HashMap were in ‘fast’ mode, which
+may take a significant time.
+
+## For the order of iteration
+
+Currently, HashMaps have nondeterministic order of iteration by default. This is seen as a good
+thing, because programmers that test programs won’t learn to rely on a fixed iteration order.
+Otherwise, programmers might not know that their programs only work with a specific iteration order.
+To keep nondeterministic order, SipHash’s thread-local seed may be used for all hashers.
+
+# Drawbacks
+
+More complex code needs to be maintained. There’s a risk of having a bug in the algorithm or in the
+code.
+
+# Alternatives
+
+- We can reject adaptive hashing. SipHash-1-3 may be fast enough.
+- We can restrict adaptive hashing to integer keys. With this limitation, we don't need Farmhash in
+  the standard library.
+- We can use some other fast one-shot hasher instead of Farmhash.
+- We can add use an additional fast hash function for fast streaming hashing. The improvement would
+  be small.
+- We can set FarmHash's seed to a random value for nondeterminism.
+- When a map is emptied, its hash function does not matter anymore. As a special case, we can detect
+  operations that clear maps in safe mode, and reset them back to fast mode.
+- We can let user declare their types as one-shot hashable.
+
+# Unresolved questions
+
+Is there any hasher that is faster than Farmhash?
+
+Are the chosen thresholds reasonably low?

From 8cce6212005b5b9f343e25bea0be8ab60dbbebf3 Mon Sep 17 00:00:00 2001
From: Piotr Czarnecki <pioczarn@gmail.com>
Date: Tue, 22 Nov 2016 19:40:04 +0100
Subject: [PATCH 2/4] Fix phrasing and typos

---
 text/0000-adaptive-hashing.md | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)
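
A compilable sketch of the linear-scan variant shown in the Motivation section,
assuming the intent is a `Vec` of `(key, value)` pairs searched on every
insertion. The name `make_assoc_list` and the `i32` value type are hypothetical
and not part of the RFC text.

```rust
/// Builds an association list that behaves like inserting every key into a
/// map with value 0. Each insertion scans the whole vector, so n distinct
/// keys cost O(n^2) comparisons in total.
fn make_assoc_list(keys: Vec<usize>) -> Vec<(usize, i32)> {
    let mut map: Vec<(usize, i32)> = Vec::new();
    for key in keys {
        // The linear search stands in for a hash lookup where all keys collide.
        if let Some(index) = map.iter().position(|&(k, _)| k == key) {
            map[index].1 = 0;
        } else {
            map.push((key, 0));
        }
    }
    map
}
```

Calling `make_assoc_list(vec![3, 1, 3, 2, 1])` keeps one entry per distinct key,
in first-seen order.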

diff --git a/text/0000-adaptive-hashing.md b/text/0000-adaptive-hashing.md
index a65bb282c06..f17441bd0b1 100644
--- a/text/0000-adaptive-hashing.md
+++ b/text/0000-adaptive-hashing.md
@@ -14,7 +14,7 @@ Hash DoS is an example of a DoS attack. The goal of DoS attacks is a denial of s
 creating a HashMap from a given list of keys:
 
 ```rust
-    fn make_map(keys: Vec<usize>) => HashMap<usize> {
+    fn make_map(keys: Vec<usize>) -> HashMap<usize> {
         let mut map = HashMap::new();
         for key in keys {
             map.insert(key, 0);
@@ -33,7 +33,7 @@ example code takes O(n) time. The entire function executes in O(n**2) time. The
 like a typical dynamic array. We might as well write:
 
 ```rust
-    fn make_map(keys: Vec<usize>) => HashMap<usize> {
+    fn make_map(keys: Vec<usize>) -> HashMap<usize> {
         let mut map = vec![];
         for key in keys {
             if let Some(index) = map.position(|(k, _)| k == key) {
@@ -153,20 +153,20 @@ because SipHash’s round involves 30 32-bit operations.
 ## For the HashMap API
 
 One day, we may want to hash HashMaps. The hashing infrastructure can be changed to allow it. The
-implementation of Hash for HashMap can hash the hashes stored in every map, rather than  However,
-adaptive hashing makes it harder to write a correct and performant implementation of Hash for
-HashMap. If two HashMaps (that can be compared) have equal values, they must hash to the same
-integer. However, with adaptive hashing, HashMap can switch to the safe mode, which means it no
-longer stores the same hashes as other HashMaps that remain in ‘fast’ mode. The only way to handle
-the situation for the safe mode is to rehash all keys as if the HashMap were in ‘fast’ mode, which
-may take a significant time.
+implementation of Hash for HashMap can hash the hashes stored in the map, rather than hash the
+contents of each key in the map. However, adaptive hashing makes it harder to write a correct and
+performant implementation of Hash for HashMap. If two HashMaps (that can be compared) have equal
+values, they must hash to the same integer. However, with adaptive hashing, HashMap can switch to
+the safe mode, which means it no longer stores the same hashes as other HashMaps that remain in the
+fast mode. The only way to handle the situation for the safe mode is to rehash all keys as if the
+HashMap were in the fast mode, which may take a significant time.
 
 ## For the order of iteration
 
 Currently, HashMaps have nondeterministic order of iteration by default. This is seen as a good
-thing, because programmers that test programs won’t learn to rely on a fixed iteration order.
-Otherwise, programmers might not know that their programs only work with a specific iteration order.
-To keep nondeterministic order, SipHash’s thread-local seed may be used for all hashers.
+thing, because testing will catch code that relies on a specific iteration order. Otherwise,
+programmers might not know that their programs only work with a fixed iteration order. To keep
+nondeterministic order, SipHash’s thread-local seed may be used for all hashers.
 
 # Drawbacks
 
@@ -179,7 +179,7 @@ code.
 - We can restrict adaptive hashing to integer keys. With this limitation, we don't need Farmhash in
   the standard library.
 - We can use some other fast one-shot hasher instead of Farmhash.
-- We can add use an additional fast hash function for fast streaming hashing. The improvement would
+- We can use an additional fast hash function for fast streaming hashing. The improvement would
   be small.
 - We can set FarmHash's seed to a random value for nondeterminism.
 - When a map is emptied, its hash function does not matter anymore. As a special case, we can detect

From 93b76034c12e4e0f86bfc32e7dafcbc0182520f1 Mon Sep 17 00:00:00 2001
From: Piotr Czarnecki <pioczarn@gmail.com>
Date: Wed, 1 Mar 2017 17:57:58 +0100
Subject: [PATCH 3/4] Clarify RFC text. Add some numbers in addition to charts

---
 text/0000-adaptive-hashing.md | 85 +++++++++++++++++++++++++++++------
 1 file changed, 71 insertions(+), 14 deletions(-)
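
A minimal sketch of how integer types might implement the `OneshotHashable`
trait that this patch adds under Alternatives. The mixer is the 64-bit
finalizer of MurmurHash3 (`fmix64`), swapped in purely for illustration: the
RFC only says the integer mixer should resemble SipHash's finalizer and does
not fix one. The sketch also assumes `ShortHash = u64` unconditionally to stay
short.

```rust
type ShortHash = u64;

trait OneshotHashable {
    fn hash(&self) -> ShortHash;
}

/// MurmurHash3's 64-bit finalizer: an invertible mixer built from xor-shifts
/// and wrapping multiplications. A stand-in, not the mixer the RFC proposes.
fn fmix64(mut x: u64) -> u64 {
    x ^= x >> 33;
    x = x.wrapping_mul(0xff51_afd7_ed55_8ccd);
    x ^= x >> 33;
    x = x.wrapping_mul(0xc4ce_b9fe_1a85_ec53);
    x ^= x >> 33;
    x
}

impl OneshotHashable for u64 {
    fn hash(&self) -> ShortHash {
        fmix64(*self)
    }
}

impl OneshotHashable for u32 {
    fn hash(&self) -> ShortHash {
        // Widen first so that equal integers hash equally across both types.
        fmix64(u64::from(*self))
    }
}
```

Because the mixer is invertible, distinct `u64` keys always produce distinct
hashes; collisions only appear once the hash is reduced to a bucket index.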

diff --git a/text/0000-adaptive-hashing.md b/text/0000-adaptive-hashing.md
index f17441bd0b1..ef4b9412698 100644
--- a/text/0000-adaptive-hashing.md
+++ b/text/0000-adaptive-hashing.md
@@ -120,19 +120,50 @@ very unlikely, and the dotted edge means the state change is enormously unlikely
 
 <img width="800" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/adaptive.svg">
 
+## Load factor
+
+We decrease the load factor of `HashMap` from 0.909 to 0.833.
+
 ## Choosing constants
 
-The thresholds of 128 and 512 are chosen to minimize the chance of exceeding them. In particular, we
-want that chance to be less than 10^-8 with a load of 90% and less than 10^-30 with a load of 20%.
-For displacement, the smallest k that fits our needs is 90, so we round that up to 128. For the
-number of forward-shifted buckets, we choose k=512. Keep in mind that the run length is a sum of the
-displacement and the number of forward-shifted buckets, so its threshold is 128+512=640. Even though
-the probability of having a run length of more than 640 buckets may be higher than the probability
-we want, it should be low enough.
+The thresholds of 128 and 1500 are chosen to minimize the chance of exceeding them. In particular,
+we want that chance to be less than 10^-8 with a load of 90% and less than 10^-30 with a load of
+20%. For displacement, the smallest k that fits our needs is 90, so we round that up to 128. For the
+number of forward-shifted buckets, we choose k=1500. Keep in mind that the run length is the sum of
+the displacement and the number of forward-shifted buckets, so its threshold is 128+1500=1628. We
+can tolerate a probability of exceeding our thresholds that is slightly higher than desired.
 
-<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/lookup_cost.png">
+### Lookup cost
+
+```
+At load factor 0.909
+Pr{lookup cost >= 100} = 1.0e-9
+Pr{lookup cost >= 128} = 3.1e-12
+Pr{lookup cost >= 150} = 3.3e-14
+```
+
+```
+At load factor 0.833
+Pr{lookup cost >= 100} = 4.1e-16
+Pr{lookup cost >= 128} = 2.0e-20
+Pr{lookup cost >= 150} = 8.0e-24
+```
 
-<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/run_length.png">
+```
+At load factor 0.2
+Pr{lookup cost >= 100} = 6.2e-116
+```
+
+### Forward shift cost
+
+At a load factor near the current limit of 0.909, the cost of forward shift is too high to allow it.
+
+```
+At load factor 0.833
+Pr{forward shift cost >= 1200} = 2.6e-10
+Pr{forward shift cost >= 1500} = 4.1e-12
+Pr{forward shift cost >= 1800} = 6.4e-14
+```
 
 ## Choosing hash functions
 
@@ -141,14 +172,18 @@ strings and slices of integers, we will use FarmHash. (The Hasher trait must all
 for FarmHash.) Using any other key type means your HashMap will do safe hashing.
 
 # Consequences
+## For the hashing API
+
+This RFC does not propose any public-facing changes to the hashing infrastructure.
+
 ## For the performance of Rust programs
 
-The impact is minimal on programs that rarely use HashMaps. The increase in binary size should be
-small. For programs that spend a large portion of their run time using HashMap with primitive keys,
-the speedup should be noticeable.
+The impact is minimal on programs that rarely use HashMaps. The load factor’s new value is well
+within the reasonable range. The increase in binary size should be small. For programs that spend a
+large portion of their run time using HashMap with primitive keys, the speedup should be noticeable.
 
 On 32-bit platforms, the benefit of using a 32-bit hash function instead of SipHash is higher,
-because SipHash’s round involves 30 32-bit operations.
+because each SipHash round involves 30 32-bit operations.
 
 ## For the HashMap API
 
@@ -184,10 +219,32 @@ code.
 - We can set FarmHash's seed to a random value for nondeterminism.
 - When a map is emptied, its hash function does not matter anymore. As a special case, we can detect
   operations that clear maps in safe mode, and reset them back to fast mode.
-- We can let user declare their types as one-shot hashable.
+- We can let users declare their types as one-shot hashable. The following public trait may allow
+  such one-shot hashing.
+
+```rust
+#[cfg(not(target_pointer_width = "64"))]
+type ShortHash = u32;
+#[cfg(target_pointer_width = "64")]
+type ShortHash = u64;
+
+trait OneshotHashable {
+  fn hash(&self) -> ShortHash;
+}
+```
 
 # Unresolved questions
 
 Is there any hasher that is faster than Farmhash?
 
 Are the chosen thresholds reasonably low?
+
+# Appendices
+
+## Image for the lookup cost chart
+
+<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/lookup_cost.png">
+
+## Image for the forward shift cost chart
+
+<img width="600" src="https://cdn.rawgit.com/pczarn/code/def92e19ae60b599e9620afa1bdcad1c36e6e982/rust/robinhood/extrapolated_insertion_cost_4.png">

From d84f34ceb5d706fa26321bce8dc62e395f8929ee Mon Sep 17 00:00:00 2001
From: Piotr Czarnecki <pioczarn@gmail.com>
Date: Wed, 1 Mar 2017 18:03:39 +0100
Subject: [PATCH 4/4] Fix numbers that were overlooked

---
 text/0000-adaptive-hashing.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
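
The mode-switching logic described by the `safeguarded_insert` and `reserve`
pseudocode can be sketched in Rust as below, using the thresholds this patch
settles on (128 and 1500) and the 20% load condition. All names here
(`AdaptiveState`, `Action`, `after_insert`, `before_insert`) are hypothetical;
the sketch models only the bookkeeping and leaves the table itself out.

```rust
const DISPLACEMENT_THRESHOLD: usize = 128;
const FORWARD_SHIFT_THRESHOLD: usize = 1500;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Fast,
    Safe,
}

#[derive(Debug, PartialEq, Eq)]
enum Action {
    Nothing,
    Grow,
    SwitchToSafeAndRebuild,
}

struct AdaptiveState {
    mode: Mode,
    reduce_displacement: bool,
}

impl AdaptiveState {
    fn new() -> Self {
        AdaptiveState { mode: Mode::Fast, reduce_displacement: false }
    }

    /// Mirrors `safeguarded_insert`: record that the last insertion was
    /// suspiciously expensive, but defer any reaction to the next insertion.
    fn after_insert(&mut self, displacement: usize, shifted: usize) {
        if displacement >= DISPLACEMENT_THRESHOLD || shifted >= FORWARD_SHIFT_THRESHOLD {
            self.reduce_displacement = true;
        }
    }

    /// Mirrors the check added to `reserve`: decide what to do before the
    /// next insertion, given the current entry count and capacity.
    fn before_insert(&mut self, len: usize, capacity: usize) -> Action {
        if !(self.reduce_displacement && self.mode == Mode::Fast) {
            return Action::Nothing;
        }
        self.reduce_displacement = false;
        // load > 20%  <=>  len / capacity > 1/5  <=>  5 * len > capacity
        if 5 * len > capacity {
            Action::Grow
        } else {
            self.mode = Mode::Safe;
            Action::SwitchToSafeAndRebuild
        }
    }
}
```

A well-loaded map that trips a threshold first responds by growing; only a
sparsely loaded map that still sees long displacements falls back to safe
hashing.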

diff --git a/text/0000-adaptive-hashing.md b/text/0000-adaptive-hashing.md
index ef4b9412698..7d93f324300 100644
--- a/text/0000-adaptive-hashing.md
+++ b/text/0000-adaptive-hashing.md
@@ -82,7 +82,7 @@ in one shot. Otherwise, a HashMap with complex keys is always in safe mode.  We
 mode when the following conditions are met:
 
 - an inserted entry's displacement >= 128, or the number of entries displaced by an inserted
-  entry >= 512
+  entry >= 1500
 - the load of the map is smaller than 20%
 - the map is in the fast mode
 
@@ -95,7 +95,7 @@ The pseudocode for a function that replaces `insert` is:
 ```
 fn safeguarded_insert(map, key):
   entry = insert(map, key)
-  if the entry's displacement >= 128 or the number of entries displaced by entry >= 512:
+  if the entry's displacement >= 128 or the number of entries displaced by entry >= 1500:
     set the flag for reducing displacement
   return entry
 ```