Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

custom hashing is too easy to accidentally break #12198

Open
StefanKarpinski opened this issue Jul 17, 2015 · 49 comments
Open

custom hashing is too easy to accidentally break #12198

StefanKarpinski opened this issue Jul 17, 2015 · 49 comments
Assignees
Labels
collections Data structures holding multiple items, e.g. sets equality Issues relating to equality relations: ==, ===, isequal hashing

Comments

@StefanKarpinski
Copy link
Member

If you define a custom == for a type and don't define a matching hash method for it, it silently breaks hashing of that type:

julia> type Foo
           x::Int
       end

julia> ==(a::Foo, b::Foo) = a.x == b.x
== (generic function with 109 methods)

julia> [unique([Foo(4), Foo(4), Foo(4)]) for _ = 1:25]
25-element Array{Array{Foo,1},1}:
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]

Whether you get one, two or three values depends on whether the last digits of their address in memory happen to be the same or not. This is because hashing is based on object identity, which is based on address. See discussion here:

Having it this easy to silently break the hashing of a type seems like a bad design. Although the problem is worse for mutable types since people inevitably want to overload == for them, it's not just a problem for mutable objects:

julia> immutable Bar
           x::Int
           y::Int
       end

julia> ==(a::Bar, b::Bar) = a.x == b.x
== (generic function with 110 methods)

julia> [unique([Bar(i,i+j), Bar(i,2i+j), Bar(i,i+2j)]) for i=1:25, j=1:3]
25x3 Array{Array{Bar,1},2}:
 [Bar(1,2),Bar(1,3)]                 [Bar(1,3),Bar(1,4)]                 [Bar(1,4),Bar(1,5),Bar(1,7)]
 [Bar(2,3),Bar(2,5),Bar(2,4)]        [Bar(2,4),Bar(2,6)]                 [Bar(2,5),Bar(2,7),Bar(2,8)]
 [Bar(3,4),Bar(3,7),Bar(3,5)]        [Bar(3,5),Bar(3,8),Bar(3,7)]        [Bar(3,6),Bar(3,9)]
 [Bar(4,5)]                          [Bar(4,6),Bar(4,10),Bar(4,8)]       [Bar(4,7),Bar(4,11),Bar(4,10)]
 [Bar(5,6),Bar(5,11),Bar(5,7)]       [Bar(5,7),Bar(5,12),Bar(5,9)]       [Bar(5,8),Bar(5,13),Bar(5,11)]
 [Bar(6,7),Bar(6,13)]                [Bar(6,8),Bar(6,14),Bar(6,10)]      [Bar(6,9),Bar(6,15)]
 [Bar(7,8),Bar(7,15),Bar(7,9)]       [Bar(7,9),Bar(7,16),Bar(7,11)]      [Bar(7,10),Bar(7,17),Bar(7,13)]
 [Bar(8,9),Bar(8,17),Bar(8,10)]      [Bar(8,10),Bar(8,18),Bar(8,12)]     [Bar(8,11),Bar(8,19),Bar(8,14)]
 [Bar(9,10),Bar(9,19),Bar(9,11)]     [Bar(9,11),Bar(9,20),Bar(9,13)]     [Bar(9,12),Bar(9,21),Bar(9,15)]
 [Bar(10,11),Bar(10,21),Bar(10,12)]  [Bar(10,12),Bar(10,22),Bar(10,14)]  [Bar(10,13),Bar(10,23),Bar(10,16)]
 [Bar(11,12),Bar(11,23),Bar(11,13)]  [Bar(11,13),Bar(11,24),Bar(11,15)]  [Bar(11,14),Bar(11,25),Bar(11,17)]
 [Bar(12,13),Bar(12,25),Bar(12,14)]  [Bar(12,14),Bar(12,16)]             [Bar(12,15),Bar(12,18)]
 [Bar(13,14),Bar(13,27),Bar(13,15)]  [Bar(13,15),Bar(13,28),Bar(13,17)]  [Bar(13,16),Bar(13,29),Bar(13,19)]
 [Bar(14,15),Bar(14,29),Bar(14,16)]  [Bar(14,16),Bar(14,30)]             [Bar(14,17)]
 [Bar(15,16),Bar(15,31),Bar(15,17)]  [Bar(15,17),Bar(15,32),Bar(15,19)]  [Bar(15,18),Bar(15,33),Bar(15,21)]
 [Bar(16,17),Bar(16,33),Bar(16,18)]  [Bar(16,18),Bar(16,34)]             [Bar(16,19),Bar(16,35)]
 [Bar(17,18),Bar(17,35),Bar(17,19)]  [Bar(17,19),Bar(17,36),Bar(17,21)]  [Bar(17,20),Bar(17,37),Bar(17,23)]
 [Bar(18,19),Bar(18,37),Bar(18,20)]  [Bar(18,20),Bar(18,38),Bar(18,22)]  [Bar(18,21),Bar(18,39),Bar(18,24)]
 [Bar(19,20),Bar(19,39),Bar(19,21)]  [Bar(19,21),Bar(19,23)]             [Bar(19,22),Bar(19,41),Bar(19,25)]
 [Bar(20,21),Bar(20,22)]             [Bar(20,22),Bar(20,42),Bar(20,24)]  [Bar(20,23),Bar(20,43),Bar(20,26)]
 [Bar(21,22),Bar(21,43)]             [Bar(21,23),Bar(21,44),Bar(21,25)]  [Bar(21,24),Bar(21,45),Bar(21,27)]
 [Bar(22,23),Bar(22,45),Bar(22,24)]  [Bar(22,24),Bar(22,46),Bar(22,26)]  [Bar(22,25),Bar(22,47),Bar(22,28)]
 [Bar(23,24),Bar(23,47),Bar(23,25)]  [Bar(23,25),Bar(23,48),Bar(23,27)]  [Bar(23,26),Bar(23,49),Bar(23,29)]
 [Bar(24,25),Bar(24,49),Bar(24,26)]  [Bar(24,26),Bar(24,50),Bar(24,28)]  [Bar(24,27),Bar(24,51),Bar(24,30)]
 [Bar(25,26),Bar(25,51)]             [Bar(25,27),Bar(25,52),Bar(25,29)]  [Bar(25,28),Bar(25,53),Bar(25,31)]
@StefanKarpinski StefanKarpinski changed the title too easy to accidentally break hashing custom hashing is too easy to accidentally break Jul 17, 2015
@mauro3
Copy link
Contributor

mauro3 commented Jul 17, 2015

@StefanKarpinski
Copy link
Member Author

Yup, linked to that in the middle of that. Probably should have put it at the end.

@mauro3
Copy link
Contributor

mauro3 commented Jul 17, 2015

Oops, sorry, I should have read more carefully.

@StefanKarpinski
Copy link
Member Author

No worries – I should have posted more clearly :-)

@yuyichao
Copy link
Contributor

@StefanKarpinski Is there something more we need to worry about if we just cache the hash or otherwise add additional comparason of the hash as you proposed somewhere in that thread? We could just throw an error in such case.

@yuyichao
Copy link
Contributor

And by such case I mean when they are == but the hash is not.

@StefanKarpinski
Copy link
Member Author

That would fix it, but you would only get the error in the cases where two hashes accidentally "collide". I'm kind of into the idea of defining == and hash generically in terms of some identity_parts function or something like that.

@yuyichao
Copy link
Contributor

That would fix it, but you would only get the error in the cases where two hashes accidentally "collide".

Getting a predictable result (unique retuning 3 items in the case you have above) or a clear error is argueably the best response I can imagine for an undefined behavior.

in terms of some identity_parts function

I just think it might be kind of a headache for type stability for types that allows "missing values".

@yuyichao
Copy link
Contributor

Getting a predictable result (unique retuning 3 items in the case you have above) or a clear error is argueably the best response I can imagine for an undefined behavior.

I agree that it might be hard to hit the error and notice the undefined behavior in certain scenario so I'm not saying this is the best we can do either (e.g. if identity_parts works).

@yuyichao
Copy link
Contributor

I just think it might be kind of a headache for type stability for types that allows "missing values".

Actually since we have such a type. The question is basically, how do you define identity_parts for Nullable.

Feels like it should return a Tuple{Bool, Union{}} for type stability if it's a missing value. Would be funny if it's not type stable for Nullable given it's the whole point of the type......

@JeffBezanson
Copy link
Member

I guess we should just remove the fallback for hash that uses object_id.

@carnaval
Copy link
Contributor

I guess we should just remove the fallback for hash that uses object_id.

Can I frame this ? Many hours wasted in debugging (lack of recompilation + wrong bootstrap order = hashing by object id unexpectedly)

@jakebolewski
Copy link
Member

Horray!

@JeffBezanson
Copy link
Member

Huh? What have you guys been smoking hashing?

@carnaval
Copy link
Contributor

hashing my life away

@jakebolewski
Copy link
Member

Coke

@davidagold
Copy link
Contributor

Question: What is an "identity tuple"? Is the idea that if identity_parts(a) and identity_parts(b) return the same identity tuple then they get the same hash and == returns true?

For Nullables, == currently throws NullException regardless of whether or not the compared values are missing or present. How does that fit into the proposed framework?

If y'all decide to try the identity_parts route, I've got a potential name for the method: haecceity (explanation here). Though I wouldn't want to offend any nominalists/radical empiricists out there. =p

@JeffBezanson
Copy link
Member

That is an awesome vocab word!

Personally I would be against this sort of interface obscurity: "oh, if you want to define ==, you should actually define this other non-obvious function instead."

@tomasaschan
Copy link
Member

One wouldn't be required to define haecceity though - overloading == and hash would still work just fine. The new function would just make it easier to create the overloads for most straightforward cases.

I think one of the reasons many users don't create a hash method for their types is that they don't really know how to. (Should I just hash the components I want to matter? How should I combine them? What if my hashing function is incorrect?) The fear of breaking something actually makes people break that exact thing. Being able to say to those users "hey, just use this function to tell us what parts of your object are important, and we'll figure out the rest for you" is awesome.

However, if == and hash are broken if they don't correspond, would ==(x, y) = ==(hash(x), hash(y)), with some specialized method for ints, be a terrible idea?

@StefanKarpinski
Copy link
Member Author

@JeffBezanson, the weird thing is that both equality and hashing are clearly based on extrapolations of the same question – "which pieces of this thing are essential?" Equality takes two objects and does a pairwise recursive equality check on their essential parts; hashing recursively hashes the essential parts of an object together with some futzing around with types to make sure that hashes don't collide. All of that stuff could be completely automated – telling the language what parts of a thing are important and getting == and isequal and hash for free seems like a good deal to me. As @tlycken points out, you don't need to do it this way, it would just be available to you and quite convenient.

@JeffBezanson
Copy link
Member

I think that's pretty easy to do using

==(a::MyType, b::MyType) = (a.x, a.z) == (b.x, b.z)
hash(a::MyType, h) = hash((a.x, a.z), h)

@StefanKarpinski
Copy link
Member Author

So you're type has hash collisions with tuples of its field values?

@tomasaschan
Copy link
Member

I can't find anything in the manual either about exactly how a best-practice implementation of == and hash should look like for a custom composite type. The closest I get is the paragraph above this headline (sorry for the weird linking, but it was very far to scroll from the next headline above... click the link and scroll up a little) but the only thing mentioned there is that

For other [in context: non-numerical types] types, isequal() defaults to calling ==(), so if you want to define equality for your own types then you only need to add a ==() method. If you define your own equality function, you should probably define a corresponding hash() method to ensure that isequal(x,y) implies hash(x) == hash(y).

Regardless of whether or not it's "really simple" to implement both ==() and hash(), at the very least the correct way to do it needs to be better documented.

And I still feel that providing a shortcut for users that want one is a really nice feature (one that I've missed in many other settings, including Java and C#).

@JeffBezanson
Copy link
Member

My point is that the proposed mechanism is not significantly simpler, and just adds more stuff you have to know about. We could recommend using hash((a.x, a.z), hash(typeof(a), h)).

The get_parts interface has a performance pitfall, in that it's too easy to end up copying a whole collection: parts(x::MyType) = (x...,).

@mbauman
Copy link
Member

mbauman commented Jul 20, 2015

I can't find anything in the manual either about exactly how a best-practice implementation of == and hash should look like for a custom composite type.

I agree: #11794 (comment)

@tomasaschan
Copy link
Member

@mbauman Very nice indeed! I had not seen that PR before, but it makes me very happy :)

@StefanKarpinski
Copy link
Member Author

Ok, so back to the issue at hand. It seems like we should at least remove the default object-identity hashing for mutables. Should we just leave immutables alone?

@JeffBezanson
Copy link
Member

I started experimenting with this, and surprisingly found quite a few places in Base that depend (correctly) on object_id hashing for mutables: Type, Method, LambdaStaticData, Function, Module
So I'm afraid this is a very breaking change. The current approach "just works" for the fairly common case where you're associating data with a particular object. This trades off against the problem in this issue. Our policy has often been that if you're defining a new type, making it work is on you, and all we can really offer here is a no-method error, which might not be worth the breakage this causes.

@yuyichao
Copy link
Contributor

If we still want to keep the default object id hash method, raising error when there's a custom defined == should be fine then?

@StefanKarpinski
Copy link
Member Author

More confusiong due to this: https://discourse.julialang.org/t/how-to-use-unique-function/5266.

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Jan 8, 2018
@JeffBezanson
Copy link
Member

Update: started working on this in #24354. It revealed that there are many uses of Dict that should really be ObjectIdDict, but dropping the type parameters loses a lot of type information (often of value as documentation). Hence #25210 adds IdDict{K,V}, and is near ready to merge.

The primary issue is that you can no longer naively ask whether a random object is in a set or dictionary, which can be pretty annoying. E.g. rewriting x in [1, 2] to x in Set([1, 2]) is no longer safe and might cause an error. We have some functions like unique that internally rely on this.

So I have another (slightly crazy) idea: instead of removing the hash fallback, make it return 0 instead. That way it's correct by default, just inefficient. Then defining hash methods becomes an optimization you do as needed.

@StefanKarpinski
Copy link
Member Author

I think that crazy idea may actually be quite brilliant.

@mauro3
Copy link
Contributor

mauro3 commented Jan 17, 2018

This has the potential to turn the "unique [etc] does not work"-questions into "unique [etc] is super slow"-questions? However, then there is an obvious place for the afflicted to check: "Performance tips".

All in all, this seems a very Julian solution, +1.

@StefanKarpinski
Copy link
Member Author

Here's a different perspective on this (based on discussion with Jeff)... Using a hash table is only one way to implement dictionary and set data structures. Another perfectly valid and often just fine implementation is a linear array of values. We can consider that to be the default implementation, and providing a hash function for a type is a way to tell the system, "hey, this type is hashable, so use a fancy hash table implementation of dict and set instead of the standard linear array implementation" – and the system will magically do that for you. This works because in the case where hash is the constant zero function, a hash table degenerates to a linear array implementation.

@JeffBezanson
Copy link
Member

Best-of-both-worlds idea from Stefan: check whether == for an object calls the default definition, and if so use object_id. Otherwise return a constant. That way we get the benefits of this idea without hurting performance of objects that work now. That depends on our ability to optimize such a check, but that can be added later and will only break code that somehow depends on the current hash function being wrong for their type (which is already broken).

@JeffBezanson JeffBezanson modified the milestones: 1.0, 1.x Jan 18, 2018
@JeffBezanson JeffBezanson added collections Data structures holding multiple items, e.g. sets and removed breaking This change will break code triage This should be discussed on a triage call labels Jan 18, 2018
@StefanKarpinski
Copy link
Member Author

Nice thing about the best-of-both-worlds idea solution is that it's strictly non-breaking: code that was calling hash for whatever reason continues to work – it gets a zero where it would have gotten an objectid value before, but that's fine; code that was actually putting objects into dicts or sets without defining a custom hash function will magically become unbroken, albeit potentially slower.

@mauro3
Copy link
Contributor

mauro3 commented Jan 19, 2018

As a added bonus getting this feature:

That depends on our ability to optimize such a [method-exists] check

sounds super useful to define other fast-dispatching traits in terms of defined methods, yay!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets equality Issues relating to equality relations: ==, ===, isequal hashing
Projects
None yet
Development

No branches or pull requests