
HashMap, HashSet: impl Hash #48366

Closed
wants to merge 4 commits into from

Conversation

Centril
Contributor

@Centril Centril commented Feb 20, 2018

This PR makes HashMaps and HashSets implement Hash themselves.
Since neither of those have any defined order among elements, a commutative operation is used (as well as an independent Hasher for each one) to ensure that the order in which they are hashed does not matter.
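For concreteness, the strategy can be sketched as a free function (a sketch of the approach described above, not the exact patch; the function name is hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Sketch: hash each (key, value) pair with its own fresh hasher, then
// fold the per-entry digests with a commutative operation (wrapping_add)
// so that the map's iteration order cannot affect the result.
fn order_independent_hash<K: Hash, V: Hash>(map: &HashMap<K, V>) -> u64 {
    map.iter()
        .map(|kv| {
            let mut h = DefaultHasher::new();
            kv.hash(&mut h);
            h.finish()
        })
        .fold(0u64, u64::wrapping_add)
}
```

Two maps that compare equal fold the same multiset of per-entry digests, so they produce the same u64 regardless of insertion order.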

cc rust-lang/rfcs#2190 for improving this in the future.

r? @alexcrichton

@Centril Centril added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 20, 2018
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 20, 2018
@@ -757,6 +757,16 @@ impl<T, S> Eq for HashSet<T, S>
{
}

#[unstable(feature = "hashmap_hash", issue = "0")]
Contributor

impls are insta-stable

Contributor Author

@Centril Centril Feb 20, 2018

hmm yes, true... maybe it should be #[stable(feature = "hashmap_hash", since = "1.26.0")]?

@@ -1370,6 +1370,37 @@ impl<K, V, S> Eq for HashMap<K, V, S>
{
}

#[unstable(feature = "hashmap_hash", issue = "0")]
Contributor

impls are insta-stable

@@ -1370,6 +1370,37 @@ impl<K, V, S> Eq for HashMap<K, V, S>
{
}

#[unstable(feature = "hashmap_hash", issue = "0")]
impl<K, V, S> Hash for HashMap<K, V, S>
where K: Eq + Hash,
Contributor

Is the Eq bound necessary here?

Contributor Author

Yes, it is necessary for self.iter() - you can't call that otherwise. You could use self.table.iter() which would not require Eq. So self.iter() could be changed to not require Eq. But I think it is right to require K: Eq since you shouldn't be able to do anything useful with a HashMap<K, V, S> unless K: Eq.

// we might be able to do so in the future.
hasher.write_u64(
self.iter()
.map(|(k, v)| {
Contributor

no need to destructure and re-tuple below.

Contributor Author

Good point, fixing that.

{
fn hash<H: Hasher>(&self, hasher: &mut H) {
// We must preserve: x == y -> hash(x) == hash(y).
// HashMaps have no order, so we must use a commutative operation
Member

To be precise, the operation must be associative as well, because in theory you're iterating over arbitrary permutations of the elements. Fortunately, wrapping_add is both.
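As a quick illustrative check, both laws hold for wrapping_add on u64 (it is addition in the ring Z/2^64), so any permutation and any grouping of per-entry digests folds to the same value:

```rust
// Fold a slice of per-entry digests with wrapping_add. Because the
// operation is commutative and associative, every permutation of the
// slice folds to the same u64.
fn fold_digests(digests: &[u64]) -> u64 {
    digests.iter().fold(0u64, |acc, &d| acc.wrapping_add(d))
}
```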

Contributor Author

I'll note that. =)

@hanna-kruppe
Contributor

This hash function seems very weak. This probably can't be avoided, but are we sure we want to add a Hash impl to a standard library type that does not have any of the collision resistance and DoS protection properties Rust has generally tried (best-effort, of course) to provide?

// (.wrapping_add) so that the order does not matter.
// HashMaps have no order, so we must combine the elements with an
// associative and commutative operation • so that the order does not
// matter. In other words, (u64, •, 0) must form a commutative monoid.
Member

@varkor varkor Feb 20, 2018

In theory, it would be sufficient to use a pointed commutative semigroup, as you're not making use of the identity at all here, but I think there's a limit to the usefulness of mentioning such technicalities 😄 This looks fine to me!

Contributor Author

Haha =P The fold uses the identity element 0 though.

Member

@varkor varkor Feb 20, 2018

The fold just needs an initial (pointed) element (which doesn't need to be special in any way); you've picked the identity, but in theory you could have chosen any other u64.

Contributor Author

Oh yes, that is true =P /nerdsniped

@Centril
Contributor Author

Centril commented Feb 20, 2018

@rkruppe Hmm.. I agree it is weak.. But I can't think of a better way to do it.. It is always possible to add this impl on a newtype.. But I think it is nice to have the impl for ergonomics.. So for best effort I think we should provide the best we can.

@udoprog
Contributor

udoprog commented Feb 20, 2018

@rkruppe I believe that can be overcome by implementing supplemental hashing for entries in the collections themselves?

We should be doing this regardless, because there's always the risk a user might store something that implements a correct but vulnerable hash.

@hanna-kruppe
Contributor

@udoprog By supplemental hashing you mean something like what Java's HashMap does? Mangling the key's reported hash further to mix it up a bit? That may improve the distribution but does not help with DoS protection, since each key still has a predictable hash.
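For reference, Java 8's HashMap applies a "spread" step of roughly this shape (a transliteration, offered as illustration; it underlines the point that the step only redistributes bits and involves no secret):

```rust
// Java 8 HashMap.hash(): XOR the high half of the hash into the low
// half, so that power-of-two table indexing (which only inspects the
// low bits) sees some high-bit entropy. No key or salt is involved, so
// the output is exactly as predictable to an attacker as the input.
fn spread(h: i32) -> i32 {
    h ^ ((h as u32) >> 16) as i32
}
```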

@varkor
Member

varkor commented Feb 20, 2018

@rkruppe: would simply hashing the result of summing the individual hashed key-value pairs provide an improvement at all? You then shouldn't be able to so easily extract the hash of an individual item.

Edit: Ah, ignore me; I misread the method.

@udoprog
Contributor

udoprog commented Feb 20, 2018

That may improve the distribution but does not help with DoS protection, since each key still has a predictable hash

I think DoS protection can be implemented in the collection by incorporating a random salt into the supplemental hash?

My wider point is that I don't believe it's tenable to rely even on best-effort hash::Hash implementations for collision resistance and DoS protection. It's something better left to the collection.

@udoprog
Contributor

udoprog commented Feb 20, 2018

I also want to emphasize that I'm wildly in favor of this change. This particular impl is something I'm tired of sprinkling over every project.

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 20, 2018

@udoprog

I think DoS protection can be implemented in the collection by incorporating a random salt into the supplemental hash?

The supplemental hash can't undo collisions that happened during the actual hash computation.

@varkor

@rkruppe: would simply hashing the result of summing the individual hashed key-value pairs provide an improvement at all? You then shouldn't be able to so easily extract the hash of an individual item.

I don't understand how this is different from what's implemented right now?

But regardless, re: "extracting the hash of an individual item" -- using DefaultHasher::new makes the hash of each key trivial to figure out since that's a siphasher with a hardcoded key (0). I don't really see a good way to avoid that (a process-global salt adds synchronization costs and is weaker than the per-hash-map key we have right now).

Furthermore, even if that could be avoided, hand wavy arguments based on "it should not be possible to [do specific thing an attacker might do]" are very unconvincing.
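The predictability is easy to demonstrate (illustrative sketch; the helper names are hypothetical):

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hash, Hasher};

// DefaultHasher::new() is SipHash with a fixed, hardcoded key: the same
// input hashes to the same u64 on every call, in every process, so an
// attacker can precompute colliding inputs offline.
fn default_hash<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// By contrast, each RandomState carries its own per-instance random
// key; this is what gives std's HashMap its per-map DoS resistance.
fn keyed_hash<T: Hash>(state: &RandomState, t: &T) -> u64 {
    let mut h = state.build_hasher();
    t.hash(&mut h);
    h.finish()
}
```

For one RandomState the result is stable, but two independent RandomState instances will almost surely hash the same input differently.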

@udoprog
Contributor

udoprog commented Feb 20, 2018

The supplemental hash can't undo collisions that happened during the actual hash computation.

Ah, yeah. Now I get it.
It would be useful if Hasher could spawn sub-hashers inheriting some of its properties.

@Centril
Contributor Author

Centril commented Feb 20, 2018

@udoprog

It would be useful if Hasher could spawn sub-hashers inheriting some of its properties.

That was the point of mentioning a future ConstraintKinds lang feature such that we may be able to do:

// = () denotes the "I am adding no additional constraints" bound
pub trait Hash<trait Extra = ()> {
    fn hash<H>(&self, state: &mut H) where H: Hasher + Extra;
}

impl<K: Eq + Hash, V: Hash, S: BuildHasher> Hash<Clone> for HashMap<K, V, S> {
    fn hash<H: Hasher + Clone>(&self, hasher: &mut H) {
        let r = 
            self.iter()
                .map(|kv| { let mut h = hasher.clone(); kv.hash(&mut h); h.finish() })
                .fold(0, u64::wrapping_add);
        hasher.write_u64(r);
    }
}

EDIT: This was a bad idea.

Now that I've written this... It came to mind that we have access to S: BuildHasher and that we can change the impl to:

impl<K, V, S> Hash for HashMap<K, V, S>
    where K: Eq + Hash,
          V: Hash,
          S: BuildHasher
{
    fn hash<H: Hasher>(&self, hasher: &mut H) {
        // ...
        hasher.write_u64(
            self.iter()
                .map(|kv| {
                    let mut h = self.hash_builder.build_hasher(); // <--- look ma!
                    kv.hash(&mut h);
                    h.finish()
                })
                .fold(0, u64::wrapping_add)
        );
    }
}

at least then you won't always use DefaultHasher but can use something else for the inner hash-map (if you have nested hash maps) that is less predictive.

@rkruppe, @varkor: what do y'all think? Better / good enough?

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 20, 2018

@Centril I believe that doesn't work because different HashMaps will generally have different RandomState, and thus compute different hashes for the same key-value-pairs.
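Concretely (sketch; `digest` is a hypothetical helper mirroring the second impl above), seeding the per-entry hashers from the map's own BuildHasher ties the result to that map's RandomState:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};

// Hypothetical digest seeding each per-entry hasher from the map's own
// BuildHasher. Deterministic for any one map, but two *equal* maps
// carry independent RandomStates, so their digests almost surely
// disagree -- violating x == y -> hash(x) == hash(y).
fn digest<K: Hash, V: Hash, S: BuildHasher>(map: &HashMap<K, V, S>) -> u64 {
    map.iter()
        .map(|kv| {
            let mut h = map.hasher().build_hasher();
            kv.hash(&mut h);
            h.finish()
        })
        .fold(0u64, u64::wrapping_add)
}
```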

@Centril
Contributor Author

Centril commented Feb 20, 2018

@rkruppe Oh yes, that would break x == y -> hash(x) == hash(y) :(

@alexcrichton
Member

Thanks for the PR @Centril!

I think I personally share similar worries as those here in that this is a relatively weak hash function. You mention though that this can be improved in the future, and I was wondering if we could dig into that a bit? What are our future options for improving this hash function? Are there more complicated methods we know of that can work but don't want to implement just yet?

@Centril
Contributor Author

Centril commented Feb 26, 2018

@alexcrichton

So I think two principal improvement methods come to mind - and they both involve the ability to Clone Hashers, as seen above.

We could do this in two ways, the first being:

pub trait Hasher: Clone { ... }

This is backwards incompatible, but potentially doable since I think it would be unusual for a Hasher to not be Clone already. We can crater-run this perhaps and see if we have many or any breakages...

The second, backwards-compatible way is to provide parametric polymorphism of traits/types on bounds, as in the code example above. This requires a very large language change (which I'd argue for on other merits), so it is unlikely to happen in 2018. I haven't written an RFC for that yet, but I probably will.

This is all I can think of off the top of my head, but perhaps there are possible improvements to the algorithm itself?

@alexcrichton
Member

Ok thanks for the info! It seems, though, that such features are a very long way off, which means that to decide on this now we need to acknowledge that it's a weak hash algorithm. If we decide that a weak hash function shouldn't be implemented here, then I think we may wish to hold off on this for now.

cc @rust-lang/libs, do y'all have thoughts on this? The main caveat here is that the hash algorithm for a hashmap/hashset is relatively weak (just add up all the hashes of all the elements). It seems that improving that may take some time.

@Centril out of curiosity, do you know how this is implemented in other languages?

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 26, 2018

The main caveat here is that the hash algorithm for a hashmap/hashset is relatively weak (just add up all the hashes of all the elements)

Note: Not only is the method of combining the entry hashes of dubious strength, the entry hashes themselves are less DoS-resistant than usual as well -- they are always computed with DefaultHasher::new, which is not randomized at all (i.e., known to an attacker).

@Centril
Contributor Author

Centril commented Feb 26, 2018

@alexcrichton After some quick searching..

This is how Java does it: summing the entry hashes with wrapping_add (starting from 0), where each entry hash XORs the key's hash with the value's hash.

Java seems to not provide any DoS protection. The only difference between Java's implementation and this PR is that Java's xors the key and the value.
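Transliterated to Rust (a sketch; `entry_hash` stands in for Java's per-object hashCode using DefaultHasher), Java's AbstractMap.hashCode() looks like this, and the key/value XOR has a notable quirk:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn entry_hash<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// Java-style map hash: each entry contributes hash(key) XOR hash(value),
// and the map hash is the (wrapping) sum of the entry contributions.
// Because XOR is symmetric, swapping a key with its value leaves the
// hash unchanged: {1: 2} and {2: 1} collide under this scheme.
fn java_style<K: Hash, V: Hash>(map: &HashMap<K, V>) -> u64 {
    map.iter()
        .map(|(k, v)| entry_hash(k) ^ entry_hash(v))
        .fold(0u64, u64::wrapping_add)
}
```

Hashing the pair as a tuple, as this PR does, avoids that collision.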

@alexcrichton
Member

Ok interesting! Do languages like Ruby/JS/Python implement hashing for maps? Or maybe C++?

@Centril
Contributor Author

Centril commented Feb 27, 2018

@alexcrichton No idea about dynamic languages. With respect to C++, std::unordered_map<std::unordered_map<int, int>, int> is not permitted by default unless a hasher is provided, so I don't think so.

@kennytm
Member

kennytm commented Feb 28, 2018

Python doesn't implement Hash for their HashMap (dict).

>>> hash({1:2})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

JavaScript only supports string keys for their Object type, and their Map uses shallow comparison (===) so "how to hash a HashMap" does not enter the question.

Ruby supports hashing a HashMap:

irb(main):002:0> {1=>2}.hash
=> -2037178694041332840
irb(main):003:0> {1=>2,3=>4}.hash
=> -958810549599379095
irb(main):004:0> {3=>4,1=>2}.hash
=> -958810549599379095

Its implementation is mainly XOR-ing the hash of all entries together.

@hanna-kruppe
Contributor

Wrt Python, note that it does not implement hashing for any of its mutable collections (e.g., set, the moral equivalent of HashSet, and list, the moral equivalent of Vec), presumably because it's a logic error to change an object's hash while it's used as a key in a hash map. When there is an immutable equivalent, that one does implement hash -- there is no immutable dict, but there is frozenset. Its hash impl is quite a bit more complex but mostly boils down to XOR-ing the hashes with some bit shuffling per-element-hash and at the end. It also factors the number of elements into the hash.
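A frozenset-style combiner might look like this in Rust (a loose sketch; the constants and the final length mix are illustrative, not CPython's exact values):

```rust
// Loosely modeled on CPython's frozenset hashing: give each element
// hash a cheap bit shuffle before XOR-ing it in, then fold the element
// count into the result so that, e.g., collections of identical
// elements of different sizes still separate. Constants are arbitrary
// odd mixers chosen for illustration.
fn shuffle(h: u64) -> u64 {
    (h ^ (h << 16) ^ 0x2545_F491_4F6C_DD1D).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

fn frozenset_style(element_hashes: &[u64]) -> u64 {
    let folded = element_hashes.iter().fold(0u64, |acc, &h| acc ^ shuffle(h));
    folded ^ (element_hashes.len() as u64).wrapping_mul(0x0100_0000_01B3)
}
```

The per-element shuffle keeps XOR order-independent while making it harder for structured inputs to cancel each other out.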

@Centril
Contributor Author

Centril commented Feb 28, 2018

It also factors the number of elements into the hash.

We could do that =) It doesn't do anything against DoS, but still nice?
Should the key and the value be XOR-ed as in Java's implementation?

@kennytm
Member

kennytm commented Feb 28, 2018

Hashing key and value together as a 2-tuple (i.e. the current implementation kv.hash(&mut h)) seems better.

@hanna-kruppe
Contributor

We could do that =) It doesn't do anything against DDoS, but still nice?

Yeah it's not a bad idea, and matches what other collection types do. Doesn't do anything against DoS attacks though, since collection length is really easy to predict or extract.

Should the key and the value be XOR-ed as in Java's implementation?

I think Java's choice here is at least in part due to not having a canonical tuple type or other canonical way to combine two hashes. Rust's hashing strategy leaves us many more options.

But hash function design is an art and/or science in itself, so I'm wary of making any quick judgements.

@varkor
Member

varkor commented Feb 28, 2018

Though perhaps it would be better to use XOR instead of wrapping_add?
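One reason to prefer wrapping_add over XOR as the combiner (illustrative sketch):

```rust
// XOR is self-inverse, so any two entries whose per-entry digests
// collide cancel out of the combined hash entirely; wrapping_add at
// least keeps a trace of both.
fn combine_xor(digests: &[u64]) -> u64 {
    digests.iter().fold(0u64, |acc, &d| acc ^ d)
}

fn combine_add(digests: &[u64]) -> u64 {
    digests.iter().fold(0u64, |acc, &d| acc.wrapping_add(d))
}
```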

@Centril
Contributor Author

Centril commented Feb 28, 2018

But hash function design is an art and/or science in itself, so I'm wary of making any quick judgements.

Where is @ticki when you need them =P

@alexcrichton
Member

Ok so it sort of sounds like our options here are few and far between other than what's currently implemented. @Centril could you remind me of the motivation of this PR? Is it strong enough to push on landing a weak hash function? Or should we perhaps postpone to a later date?

@Centril
Contributor Author

Centril commented Mar 3, 2018

@alexcrichton I didn't have a particular use case in mind / don't have a personal need for this; iirc this was sparked by an IRC conversation on #rust. But @udoprog said in a comment that:

I also want to emphasize that I'm wildly in favor of this change. This particular impl is something I'm tired of sprinkling over every project.

So maybe they have something specific in mind?

@alexcrichton
Member

Ah ok, I'm personally somewhat tempted to close this until a future date where we may either deem the weaker hash ok or have a better solution for the weak hash here. But @udoprog maybe you can expand a bit what you use this for?

@Centril
Contributor Author

Centril commented Mar 8, 2018

Closing this for now and let's revisit if we have more motivations to do this / or a better hash function.

@Centril Centril closed this Mar 8, 2018
@udoprog
Contributor

udoprog commented Mar 10, 2018

@alexcrichton Sorry for the late reply. I've been out travelling for the last week.

Nothing apart from the obvious (storing maps in hashtables). I've never used it in a sensitive context, so I generally don't care about DoS protection.

One public instance where I actively worked around it can be found here:
https://github.com/udoprog/sysmon-rs/blob/master/src/metric/metric_id.rs#L38

@Centril
Contributor Author

Centril commented Jun 10, 2018

@alexcrichton don't know if I discussed this with you; but did you have a chance to look at @udoprog's use case?

@ldicarlo

ldicarlo commented Mar 10, 2022

Hey,
I just ran into this PR (coming from here).

I just want to say that BTreeSet and BTreeMap do implement Hash.

So in my case, I wanted to put some data, which contained a HashMap as a field, into a HashSet. (That cannot work, since HashMap does not implement Hash.)

Replacing HashMap with BTreeMap solves the problem.

Just saying, for future beginners like me, coming here.

@TheLostLambda

Heya! Figured I'd chime in with a use-case (and a reason why it would be quite nice to have this functionality) — with that being said, I'm not sure if this should be an issue I create on the hashbrown repo or here?

Either way, the use-case is pretty simple! I've got some structs that use a HashMap internally for one or more of their fields, and I have a super expensive operation to perform on them, but luckily it's one that can be cached. Unfortunately, since all of the performant caching crates use some sort of HashMap as their backing data structure, they all require that cache-keys are Hash, so I can't use my struct as a key with any of them.

Would love to know if people are still open to implementing something like this — I think it would be nice to have that parity with other languages that have hashable HashMaps. It would save people a lot of surprise (and Googling) down the line, even if it's not something that comes up all of the time.

Let me know if there is anything I can do to help out or revive this PR!
