
HashMap, HashSet: impl Hash #48366

Closed
wants to merge 4 commits into from

Conversation

Centril
Contributor

@Centril Centril commented Feb 20, 2018

This PR makes HashMaps and HashSets implement Hash themselves.
Since neither of those have any defined order among elements, a commutative operation is used (as well as an independent Hasher for each one) to ensure that the order in which they are hashed does not matter.
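For concreteness, the strategy can be sketched as a free function (a sketch of the approach described above, not the exact patch; the function name is hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Sketch: hash each (key, value) pair with its own fresh hasher, then
// fold the per-entry digests with a commutative operation (wrapping_add)
// so that the map's iteration order cannot affect the result.
fn order_independent_hash<K: Hash, V: Hash>(map: &HashMap<K, V>) -> u64 {
    map.iter()
        .map(|kv| {
            let mut h = DefaultHasher::new();
            kv.hash(&mut h);
            h.finish()
        })
        .fold(0u64, u64::wrapping_add)
}
```

Two maps that compare equal fold the same multiset of per-entry digests, so they produce the same u64 regardless of insertion order.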

cc rust-lang/rfcs#2190 for improving this in the future.

r? @alexcrichton

@Centril Centril added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 20, 2018
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 20, 2018
@@ -757,6 +757,16 @@ impl<T, S> Eq for HashSet<T, S>
{
}

#[unstable(feature = "hashmap_hash", issue = "0")]
Contributor

impls are insta-stable

Contributor Author

@Centril Centril Feb 20, 2018

hmm yes, true... maybe it should be #[stable(feature = "hashmap_hash", since = "1.26.0")]?

@@ -1370,6 +1370,37 @@ impl<K, V, S> Eq for HashMap<K, V, S>
{
}

#[unstable(feature = "hashmap_hash", issue = "0")]
Contributor

impls are insta-stable

@@ -1370,6 +1370,37 @@ impl<K, V, S> Eq for HashMap<K, V, S>
{
}

#[unstable(feature = "hashmap_hash", issue = "0")]
impl<K, V, S> Hash for HashMap<K, V, S>
where K: Eq + Hash,
Contributor

Is the Eq bound necessary here?

Contributor Author

Yes, it is necessary for self.iter() - you can't call that otherwise. You could use self.table.iter() which would not require Eq. So self.iter() could be changed to not require Eq. But I think it is right to require K: Eq since you shouldn't be able to do anything useful with a HashMap<K, V, S> unless K: Eq.

// we might be able to do so in the future.
hasher.write_u64(
self.iter()
.map(|(k, v)| {
Contributor

no need to destructure and re-tuple below.

Contributor Author

Good point, fixing that.

{
fn hash<H: Hasher>(&self, hasher: &mut H) {
// We must preserve: x == y -> hash(x) == hash(y).
// HashMaps have no order, so we must use a commutative operation
Member

To be precise, the operation must be associative as well, because in theory you're iterating over arbitrary permutations of the elements. Fortunately, wrapping_add is both.
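As a quick illustrative check, both laws hold for wrapping_add on u64 (it is addition in the ring Z/2^64), so any permutation and any grouping of per-entry digests folds to the same value:

```rust
// Fold a slice of per-entry digests with wrapping_add. Because the
// operation is commutative and associative, every permutation of the
// slice folds to the same u64.
fn fold_digests(digests: &[u64]) -> u64 {
    digests.iter().fold(0u64, |acc, &d| acc.wrapping_add(d))
}
```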

Contributor Author

I'll note that. =)

@hanna-kruppe
Contributor

This hash function seems very weak. This probably can't be avoided, but are we sure we want to add a Hash impl to a standard library type that does not have any of the collision resistance and DoS protection properties Rust has generally tried (best-effort, of course) to provide?

// (.wrapping_add) so that the order does not matter.
// HashMaps have no order, so we must combine the elements with an
// associative and commutative operation • so that the order does not
// matter. In other words, (u64, •, 0) must form a commutative monoid.
Member

@varkor varkor Feb 20, 2018

In theory, it would be sufficient to use a pointed commutative semigroup, as you're not making use of the identity at all here, but I think there's a limit to the usefulness of mentioning such technicalities 😄 This looks fine to me!

Contributor Author

Haha =P The fold uses the identity element 0 though.

Member

@varkor varkor Feb 20, 2018

The fold just needs an initial (pointed) element (which doesn't need to be special in any way); you've picked the identity, but in theory you could have chosen any other u64.

Contributor Author

Oh yes, that is true =P /nerdsniped

@Centril
Contributor Author

Centril commented Feb 20, 2018

@rkruppe Hmm.. I agree it is weak.. But I can't think of a better way to do it.. It is always possible to add this impl on a newtype.. But I think it is nice to have the impl for ergonomics.. So for best effort I think we should provide the best we can.

@udoprog
Contributor

udoprog commented Feb 20, 2018

@rkruppe I believe that can be overcome by implementing supplemental hashing for entries in the collections themselves?

We should be doing this regardless, because there's always the risk a user might store something that implements a correct but vulnerable hash.

@hanna-kruppe
Contributor

@udoprog By supplemental hashing you mean something like what Java's HashMap does? Mangling the key's reported hash further to mix it up a bit? That may improve the distribution but does not help with DoS protection, since each key still has a predictable hash.
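For reference, Java 8's HashMap applies a "spread" step of roughly this shape (a transliteration, offered as illustration; it underlines the point that the step only redistributes bits and involves no secret):

```rust
// Java 8 HashMap.hash(): XOR the high half of the hash into the low
// half, so that power-of-two table indexing (which only inspects the
// low bits) sees some high-bit entropy. No key or salt is involved, so
// the output is exactly as predictable to an attacker as the input.
fn spread(h: i32) -> i32 {
    h ^ ((h as u32) >> 16) as i32
}
```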

@varkor
Member

varkor commented Feb 20, 2018

@rkruppe: would simply hashing the result of summing the individual hashed key-value pairs provide an improvement at all? You then shouldn't be able to so easily extract the hash of an individual item.

Edit: Ah, ignore me; I misread the method.

@udoprog
Contributor

udoprog commented Feb 20, 2018

That may improve the distribution but does not help with DoS protection, since each key still has a predictable hash

I think DoS protection can be implemented in the collection by incorporating a random salt into the supplemental hash?

My wider point is that I don't believe it's tenable to rely even on best-effort hash::Hash implementations for collision resistance and DoS protection. It's something better left to the collection.

@udoprog
Contributor

udoprog commented Feb 20, 2018

I also want to emphasize that I'm wildly in favor of this change. This particular impl is something I'm tired of sprinkling over every project.

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 20, 2018

@udoprog

I think DoS protection can be implemented in the collection by incorporating a random salt into the supplemental hash?

The supplemental hash can't undo collisions that happened during the actual hash computation.

@varkor

@rkruppe: would simply hashing the result of summing the individual hashed key-value pairs provide an improvement at all? You then shouldn't be able to so easily extract the hash of an individual item.

I don't understand how this is different from what's implemented right now?

But regardless, re: "extracting the hash of an individual item" -- using DefaultHasher::new makes the hash of each key trivial to figure out since that's a siphasher with a hardcoded key (0). I don't really see a good way to avoid that (a process-global salt adds synchronization costs and is weaker than the per-hash-map key we have right now).

Furthermore, even if that could be avoided, hand wavy arguments based on "it should not be possible to [do specific thing an attacker might do]" are very unconvincing.
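The predictability is easy to demonstrate (illustrative sketch; the helper names are hypothetical):

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hash, Hasher};

// DefaultHasher::new() is SipHash with a fixed, hardcoded key: the same
// input hashes to the same u64 on every call, in every process, so an
// attacker can precompute colliding inputs offline.
fn default_hash<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// By contrast, each RandomState carries its own per-instance random
// key; this is what gives std's HashMap its per-map DoS resistance.
fn keyed_hash<T: Hash>(state: &RandomState, t: &T) -> u64 {
    let mut h = state.build_hasher();
    t.hash(&mut h);
    h.finish()
}
```

For one RandomState the result is stable, but two independent RandomState instances will almost surely hash the same input differently.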

@udoprog
Contributor

udoprog commented Feb 20, 2018

The supplemental hash can't undo collisions that happened during the actual hash computation.

Ah, yeah. Now I get it.
It would be useful if Hasher could spawn sub-hashers inheriting some of its properties.

@Centril
Contributor Author

Centril commented Feb 20, 2018

@udoprog

It would be useful if Hasher could spawn sub-hashers inheriting some of its properties.

That was the point of mentioning a future ConstraintKinds lang feature such that we may be able to do:

// = () denotes the "I am adding no additional constraints" bound
pub trait Hash<trait Extra = ()> {
    fn hash<H>(&self, state: &mut H) where H: Hasher + Extra;
}

impl<K: Eq + Hash, V: Hash, S: BuildHasher> Hash<Clone> for HashMap<K, V, S> {
    fn hash<H: Hasher + Clone>(&self, hasher: &mut H) {
        let r = 
            self.iter()
                .map(|kv| { let mut h = hasher.clone(); kv.hash(&mut h); h.finish() })
                .fold(0, u64::wrapping_add);
        hasher.write_u64(r);
    }
}

EDIT: This was a bad idea.

Now that I've written this... It came to mind that we have access to S: BuildHasher and that we can change the impl to:

impl<K, V, S> Hash for HashMap<K, V, S>
    where K: Eq + Hash,
          V: Hash,
          S: BuildHasher
{
    fn hash<H: Hasher>(&self, hasher: &mut H) {
        // ...
        hasher.write_u64(
            self.iter()
                .map(|kv| {
                    let mut h = self.hash_builder.build_hasher(); // <--- look ma!
                    kv.hash(&mut h);
                    h.finish()
                })
                .fold(0, u64::wrapping_add)
        );
    }
}

at least then you won't always use DefaultHasher but can use something else for the inner hash-map (if you have nested hash maps) that is less predictive.

@rkruppe, @varkor: what do y'all think? Better / good enough?

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 20, 2018

@Centril I believe that doesn't work because different HashMaps will generally have different RandomState, and thus compute different hashes for the same key-value-pairs.
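Concretely (sketch; `digest` is a hypothetical helper mirroring the second impl above), seeding the per-entry hashers from the map's own BuildHasher ties the result to that map's RandomState:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};

// Hypothetical digest seeding each per-entry hasher from the map's own
// BuildHasher. Deterministic for any one map, but two *equal* maps
// carry independent RandomStates, so their digests almost surely
// disagree -- violating x == y -> hash(x) == hash(y).
fn digest<K: Hash, V: Hash, S: BuildHasher>(map: &HashMap<K, V, S>) -> u64 {
    map.iter()
        .map(|kv| {
            let mut h = map.hasher().build_hasher();
            kv.hash(&mut h);
            h.finish()
        })
        .fold(0u64, u64::wrapping_add)
}
```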

@Centril
Contributor Author

Centril commented Feb 20, 2018

@rkruppe Oh yes, that would break x == y -> hash(x) == hash(y) :(

@alexcrichton
Member

Thanks for the PR @Centril!

I think I personally share similar worries as those here in that this is a relatively weak hash function. You mention though that this can be improved in the future, and I was wondering if we could dig into that a bit? What are our future options for improving this hash function? Are there more complicated methods we know of that can work but don't want to implement just yet?

@Centril
Contributor Author

Centril commented Feb 26, 2018

@alexcrichton

So I think two principal improvement methods come to mind - and they both involve the ability to Clone Hashers, as seen above.

We could do this in two ways, the first being:

pub trait Hasher: Clone { ... }

This is backwards incompatible, but potentially doable since I think it would be unusual for a Hasher to not be Clone already. We can crater-run this perhaps and see if we have many or any breakages...

The second, backwards-compatible way is to provide parametric polymorphism of traits/types on bounds, as in the code example above. This requires a very large language change (which I'd argue for on other merits), so it is unlikely to happen in 2018. I haven't written an RFC for that yet, but I probably will.

This is all I can think of off the top of my head, but perhaps there are possible improvements to the algorithm itself?

@alexcrichton
Member

Ok thanks for the info! It seems, though, that such features are a very long way off, which means that to decide on this now we need to acknowledge that it's a weak hash algorithm. If we decide that a weak hash function shouldn't be implemented here, then I think we may wish to hold off on this for now.

cc @rust-lang/libs, do y'all have thoughts on this? The main caveat here is that the hash algorithm for a hashmap/hashset is relatively weak (just add up all the hashes of all the elements). It seems that improving that may take some time.

@Centril out of curiosity, do you know how this is implemented in other languages?

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 26, 2018

The main caveat here is that the hash algorithm for a hashmap/hashset is relatively weak (just add up all the hashes of all the elements)

Note: Not only is the method of combining the entry hashes of dubious strength, the entry hashes themselves are less DoS-resistant than usual as well -- they are always computed with DefaultHasher::new, which is not randomized at all (i.e., known to an attacker).

@Centril
Contributor Author

Centril commented Feb 26, 2018

@alexcrichton After some quick searching..

This is how Java does it: summing the entry hashes with wrapping_add (starting from 0), where each entry hash XORs the key's hash with the value's hash.

Java seems to not provide any DoS protection. The only difference between Java's implementation and this PR is that Java's xors the key and the value.
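Transliterated to Rust (a sketch; `entry_hash` stands in for Java's per-object hashCode using DefaultHasher), Java's AbstractMap.hashCode() looks like this, and the key/value XOR has a notable quirk:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn entry_hash<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// Java-style map hash: each entry contributes hash(key) XOR hash(value),
// and the map hash is the (wrapping) sum of the entry contributions.
// Because XOR is symmetric, swapping a key with its value leaves the
// hash unchanged: {1: 2} and {2: 1} collide under this scheme.
fn java_style<K: Hash, V: Hash>(map: &HashMap<K, V>) -> u64 {
    map.iter()
        .map(|(k, v)| entry_hash(k) ^ entry_hash(v))
        .fold(0u64, u64::wrapping_add)
}
```

Hashing the pair as a tuple, as this PR does, avoids that collision.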

@alexcrichton
Member

Ok interesting! Do languages like Ruby/JS/Python implement hashing for maps? Or maybe C++?

@Centril
Contributor Author

Centril commented Feb 27, 2018

@alexcrichton No idea about dynamic languages. With respect to C++, std::unordered_map<std::unordered_map<int, int>, int> is not permitted by default unless a hasher is provided, so I don't think so.

@kennytm
Member

kennytm commented Feb 28, 2018

Python doesn't implement Hash for their HashMap (dict).

>>> hash({1:2})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

JavaScript only supports string keys for their Object type, and their Map uses shallow comparison (===) so "how to hash a HashMap" does not enter the question.

Ruby supports hashing a HashMap:

irb(main):002:0> {1=>2}.hash
=> -2037178694041332840
irb(main):003:0> {1=>2,3=>4}.hash
=> -958810549599379095
irb(main):004:0> {3=>4,1=>2}.hash
=> -958810549599379095

Its implementation is mainly XOR-ing the hash of all entries together.

@hanna-kruppe
Contributor

Wrt Python, note that it does not implement hashing for any of its mutable collections (e.g., set, the moral equivalent of HashSet, and list, the moral equivalent of Vec), presumably because it's a logic error to change an object's hash while it's used as a key in a hash map. When there is an immutable equivalent, that one does implement hash -- there is no immutable dict, but there is frozenset. Its hash impl is quite a bit more complex but mostly boils down to XOR-ing the hashes with some bit shuffling per-element-hash and at the end. It also factors the number of elements into the hash.
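A frozenset-style combiner might look like this in Rust (a loose sketch; the constants and the final length mix are illustrative, not CPython's exact values):

```rust
// Loosely modeled on CPython's frozenset hashing: give each element
// hash a cheap bit shuffle before XOR-ing it in, then fold the element
// count into the result so that, e.g., collections of identical
// elements of different sizes still separate. Constants are arbitrary
// odd mixers chosen for illustration.
fn shuffle(h: u64) -> u64 {
    (h ^ (h << 16) ^ 0x2545_F491_4F6C_DD1D).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

fn frozenset_style(element_hashes: &[u64]) -> u64 {
    let folded = element_hashes.iter().fold(0u64, |acc, &h| acc ^ shuffle(h));
    folded ^ (element_hashes.len() as u64).wrapping_mul(0x0100_0000_01B3)
}
```

The per-element shuffle keeps XOR order-independent while making it harder for structured inputs to cancel each other out.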

@Centril
Contributor Author

Centril commented Feb 28, 2018

It also factors the number of elements into the hash.

We could do that =) It doesn't do anything against DoS, but still nice?
Should the key and the value be XOR-ed as in Java's implementation?

@kennytm
Member

kennytm commented Feb 28, 2018

Hashing key and value together as a 2-tuple (i.e. the current implementation kv.hash(&mut h)) seems better.

@hanna-kruppe
Contributor

We could do that =) It doesn't do anything against DDoS, but still nice?

Yeah it's not a bad idea, and matches what other collection types do. Doesn't do anything against DoS attacks though, since collection length is really easy to predict or extract.

Should the key and the value be XOR-ed as in Java's implementation?

I think Java's choice here is at least in part due to not having a canonical tuple type or other canonical way to combine two hashes. Rust's hashing strategy leaves us many more options.

But hash function design is an art and/or science in itself, so I'm wary of making any quick judgements.

@varkor
Member

varkor commented Feb 28, 2018

Though perhaps it would be better to use XOR instead of wrapping_add?
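One reason to prefer wrapping_add over XOR as the combiner (illustrative sketch):

```rust
// XOR is self-inverse, so any two entries whose per-entry digests
// collide cancel out of the combined hash entirely; wrapping_add at
// least keeps a trace of both.
fn combine_xor(digests: &[u64]) -> u64 {
    digests.iter().fold(0u64, |acc, &d| acc ^ d)
}

fn combine_add(digests: &[u64]) -> u64 {
    digests.iter().fold(0u64, |acc, &d| acc.wrapping_add(d))
}
```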

@Centril
Contributor Author

Centril commented Feb 28, 2018

But hash function design is an art and/or science in itself, so I'm wary of making any quick judgements.

Where is @ticki when you need them =P

@alexcrichton
Member

Ok so it sort of sounds like our options here are few and far between other than what's currently implemented. @Centril could you remind me of the motivation of this PR? Is it strong enough to push on landing a weak hash function? Or should we perhaps postpone to a later date?

@Centril
Contributor Author

Centril commented Mar 3, 2018

@alexcrichton I didn't have a particular use case in mind / don't have a personal need for this; iirc this was sparked by an IRC conversation on #rust. But @udoprog said in a comment that:

I also want to emphasize that I'm wildly in favor of this change. This particular impl is something I'm tired of sprinkling over every project.

So maybe they have something specific in mind?

@alexcrichton
Member

Ah ok, I'm personally somewhat tempted to close this until a future date where we may either deem the weaker hash ok or have a better solution for the weak hash here. But @udoprog maybe you can expand a bit what you use this for?

@Centril
Contributor Author

Centril commented Mar 8, 2018

Closing this for now and let's revisit if we have more motivations to do this / or a better hash function.

@Centril Centril closed this Mar 8, 2018
@udoprog
Contributor

udoprog commented Mar 10, 2018

@alexcrichton Sorry for the late reply. I've been out travelling for the last week.

Nothing apart from the obvious (storing maps in hashtables). I've never used it in a sensitive context, so I generally don't care about DoS protection.

One public instance where I actively worked around it can be found here:
https://github.com/udoprog/sysmon-rs/blob/master/src/metric/metric_id.rs#L38

@Centril
Contributor Author

Centril commented Jun 10, 2018

@alexcrichton don't know if I discussed this with you; but did you have a chance to look at @udoprog's use case?

@ldicarlo

ldicarlo commented Mar 10, 2022

Hey,
I just ran into this PR (coming from here).

I just want to say that BTreeSet and BTreeMap do implement Hash.

So in my case, I wanted to put some data, which contained a HashMap as a field, into a HashSet. (That cannot work, since HashMap does not implement Hash.)

Replacing HashMap with BTreeMap solves the problem.

Just saying, for future beginners like me, coming here.

@TheLostLambda

Heya! Figured I'd chime in with a use-case (and a reason why it would be quite nice to have this functionality) — with that being said, I'm not sure if this should be an issue I create on the hashbrown repo or here?

Either way, the use-case is pretty simple! I've got some structs that use a HashMap internally for one or more of their fields, and I have a super expensive operation to perform on them, but luckily it's one that can be cached. Unfortunately, since all of the performant caching crates use some sort of HashMap as their backing data structure, they all require that cache-keys are Hash, so I can't use my struct as a key with any of them.

Would love to know if people are still open to implementing something like this — I think it would be nice to have that parity with other languages that have hashable HashMaps. It would save people a lot of surprise (and Googling) down the line, even if it's not something that comes up all of the time.

Let me know if there is anything I can do to help out or revive this PR!
