custom hashing is too easy to accidentally break #12198

StefanKarpinski · 2015-07-17T20:30:56Z

If you define a custom == for a type and don't define a matching hash method for it, it silently breaks hashing of that type:

julia> type Foo
           x::Int
       end

julia> ==(a::Foo, b::Foo) = a.x == b.x
== (generic function with 109 methods)

julia> [unique([Foo(4), Foo(4), Foo(4)]) for _ = 1:25]
25-element Array{Array{Foo,1},1}:
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]

Whether you get one, two or three values depends on whether the last digits of their address in memory happen to be the same or not. This is because hashing is based on object identity, which is based on address. See discussion here:

Having it this easy to silently break the hashing of a type seems like a bad design. Although the problem is worse for mutable types since people inevitably want to overload == for them, it's not just a problem for mutable objects:

julia> immutable Bar
           x::Int
           y::Int
       end

julia> ==(a::Bar, b::Bar) = a.x == b.x
== (generic function with 110 methods)

julia> [unique([Bar(i,i+j), Bar(i,2i+j), Bar(i,i+2j)]) for i=1:25, j=1:3]
25x3 Array{Array{Bar,1},2}:
 [Bar(1,2),Bar(1,3)]                 [Bar(1,3),Bar(1,4)]                 [Bar(1,4),Bar(1,5),Bar(1,7)]
 [Bar(2,3),Bar(2,5),Bar(2,4)]        [Bar(2,4),Bar(2,6)]                 [Bar(2,5),Bar(2,7),Bar(2,8)]
 [Bar(3,4),Bar(3,7),Bar(3,5)]        [Bar(3,5),Bar(3,8),Bar(3,7)]        [Bar(3,6),Bar(3,9)]
 [Bar(4,5)]                          [Bar(4,6),Bar(4,10),Bar(4,8)]       [Bar(4,7),Bar(4,11),Bar(4,10)]
 [Bar(5,6),Bar(5,11),Bar(5,7)]       [Bar(5,7),Bar(5,12),Bar(5,9)]       [Bar(5,8),Bar(5,13),Bar(5,11)]
 [Bar(6,7),Bar(6,13)]                [Bar(6,8),Bar(6,14),Bar(6,10)]      [Bar(6,9),Bar(6,15)]
 [Bar(7,8),Bar(7,15),Bar(7,9)]       [Bar(7,9),Bar(7,16),Bar(7,11)]      [Bar(7,10),Bar(7,17),Bar(7,13)]
 [Bar(8,9),Bar(8,17),Bar(8,10)]      [Bar(8,10),Bar(8,18),Bar(8,12)]     [Bar(8,11),Bar(8,19),Bar(8,14)]
 [Bar(9,10),Bar(9,19),Bar(9,11)]     [Bar(9,11),Bar(9,20),Bar(9,13)]     [Bar(9,12),Bar(9,21),Bar(9,15)]
 [Bar(10,11),Bar(10,21),Bar(10,12)]  [Bar(10,12),Bar(10,22),Bar(10,14)]  [Bar(10,13),Bar(10,23),Bar(10,16)]
 [Bar(11,12),Bar(11,23),Bar(11,13)]  [Bar(11,13),Bar(11,24),Bar(11,15)]  [Bar(11,14),Bar(11,25),Bar(11,17)]
 [Bar(12,13),Bar(12,25),Bar(12,14)]  [Bar(12,14),Bar(12,16)]             [Bar(12,15),Bar(12,18)]
 [Bar(13,14),Bar(13,27),Bar(13,15)]  [Bar(13,15),Bar(13,28),Bar(13,17)]  [Bar(13,16),Bar(13,29),Bar(13,19)]
 [Bar(14,15),Bar(14,29),Bar(14,16)]  [Bar(14,16),Bar(14,30)]             [Bar(14,17)]
 [Bar(15,16),Bar(15,31),Bar(15,17)]  [Bar(15,17),Bar(15,32),Bar(15,19)]  [Bar(15,18),Bar(15,33),Bar(15,21)]
 [Bar(16,17),Bar(16,33),Bar(16,18)]  [Bar(16,18),Bar(16,34)]             [Bar(16,19),Bar(16,35)]
 [Bar(17,18),Bar(17,35),Bar(17,19)]  [Bar(17,19),Bar(17,36),Bar(17,21)]  [Bar(17,20),Bar(17,37),Bar(17,23)]
 [Bar(18,19),Bar(18,37),Bar(18,20)]  [Bar(18,20),Bar(18,38),Bar(18,22)]  [Bar(18,21),Bar(18,39),Bar(18,24)]
 [Bar(19,20),Bar(19,39),Bar(19,21)]  [Bar(19,21),Bar(19,23)]             [Bar(19,22),Bar(19,41),Bar(19,25)]
 [Bar(20,21),Bar(20,22)]             [Bar(20,22),Bar(20,42),Bar(20,24)]  [Bar(20,23),Bar(20,43),Bar(20,26)]
 [Bar(21,22),Bar(21,43)]             [Bar(21,23),Bar(21,44),Bar(21,25)]  [Bar(21,24),Bar(21,45),Bar(21,27)]
 [Bar(22,23),Bar(22,45),Bar(22,24)]  [Bar(22,24),Bar(22,46),Bar(22,26)]  [Bar(22,25),Bar(22,47),Bar(22,28)]
 [Bar(23,24),Bar(23,47),Bar(23,25)]  [Bar(23,25),Bar(23,48),Bar(23,27)]  [Bar(23,26),Bar(23,49),Bar(23,29)]
 [Bar(24,25),Bar(24,49),Bar(24,26)]  [Bar(24,26),Bar(24,50),Bar(24,28)]  [Bar(24,27),Bar(24,51),Bar(24,30)]
 [Bar(25,26),Bar(25,51)]             [Bar(25,27),Bar(25,52),Bar(25,29)]  [Bar(25,28),Bar(25,53),Bar(25,31)]

The text was updated successfully, but these errors were encountered:

mauro3 · 2015-07-17T20:46:59Z

julia user discussion:
https://groups.google.com/d/msg/julia-users/bi9sBnlTswQ/njsRvLzgd5gJ
https://groups.google.com/d/msg/julia-users/q6F7-eVcAkQ/Csm4zHFH2JQJ

StefanKarpinski · 2015-07-17T20:48:47Z

Yup, linked to that in the middle of that. Probably should have put it at the end.

mauro3 · 2015-07-17T20:53:32Z

Oops, sorry, I should have read more carefully.

StefanKarpinski · 2015-07-17T21:00:56Z

No worries – I should have posted more clearly :-)

yuyichao · 2015-07-17T21:34:06Z

@StefanKarpinski Is there something more we need to worry about if we just cache the hash or otherwise add additional comparason of the hash as you proposed somewhere in that thread? We could just throw an error in such case.

yuyichao · 2015-07-17T21:34:36Z

And by such case I mean when they are == but the hash is not.

StefanKarpinski · 2015-07-17T21:41:48Z

That would fix it, but you would only get the error in the cases where two hashes accidentally "collide". I'm kind of into the idea of defining == and hash generically in terms of some identity_parts function or something like that.

yuyichao · 2015-07-17T21:48:17Z

That would fix it, but you would only get the error in the cases where two hashes accidentally "collide".

Getting a predictable result (unique retuning 3 items in the case you have above) or a clear error is argueably the best response I can imagine for an undefined behavior.

in terms of some identity_parts function

I just think it might be kind of a headache for type stability for types that allows "missing values".

yuyichao · 2015-07-17T21:52:44Z

Getting a predictable result (unique retuning 3 items in the case you have above) or a clear error is argueably the best response I can imagine for an undefined behavior.

I agree that it might be hard to hit the error and notice the undefined behavior in certain scenario so I'm not saying this is the best we can do either (e.g. if identity_parts works).

yuyichao · 2015-07-17T22:13:43Z

I just think it might be kind of a headache for type stability for types that allows "missing values".

Actually since we have such a type. The question is basically, how do you define identity_parts for Nullable.

Feels like it should return a Tuple{Bool, Union{}} for type stability if it's a missing value. Would be funny if it's not type stable for Nullable given it's the whole point of the type......

JeffBezanson · 2015-07-17T23:01:53Z

I guess we should just remove the fallback for hash that uses object_id.

carnaval · 2015-07-17T23:04:40Z

I guess we should just remove the fallback for hash that uses object_id.

Can I frame this ? Many hours wasted in debugging (lack of recompilation + wrong bootstrap order = hashing by object id unexpectedly)

jakebolewski · 2015-07-17T23:06:59Z

Horray!

JeffBezanson · 2015-07-17T23:09:29Z

Huh? What have you guys been ~~smoking~~ hashing?

carnaval · 2015-07-17T23:11:17Z

hashing my life away

jakebolewski · 2015-07-17T23:12:41Z

Coke

davidagold · 2015-07-18T02:31:44Z

Question: What is an "identity tuple"? Is the idea that if identity_parts(a) and identity_parts(b) return the same identity tuple then they get the same hash and == returns true?

For Nullables, == currently throws NullException regardless of whether or not the compared values are missing or present. How does that fit into the proposed framework?

If y'all decide to try the identity_parts route, I've got a potential name for the method: haecceity (explanation here). Though I wouldn't want to offend any nominalists/radical empiricists out there. =p

JeffBezanson · 2015-07-18T03:10:53Z

That is an awesome vocab word!

Personally I would be against this sort of interface obscurity: "oh, if you want to define ==, you should actually define this other non-obvious function instead."

tomasaschan · 2015-07-18T11:41:47Z

One wouldn't be required to define haecceity though - overloading == and hash would still work just fine. The new function would just make it easier to create the overloads for most straightforward cases.

I think one of the reasons many users don't create a hash method for their types is that they don't really know how to. (Should I just hash the components I want to matter? How should I combine them? What if my hashing function is incorrect?) The fear of breaking something actually makes people break that exact thing. Being able to say to those users "hey, just use this function to tell us what parts of your object are important, and we'll figure out the rest for you" is awesome.

However, if == and hash are broken if they don't correspond, would ==(x, y) = ==(hash(x), hash(y)), with some specialized method for ints, be a terrible idea?

StefanKarpinski · 2015-07-20T14:30:57Z

@JeffBezanson, the weird thing is that both equality and hashing are clearly based on extrapolations of the same question – "which pieces of this thing are essential?" Equality takes two objects and does a pairwise recursive equality check on their essential parts; hashing recursively hashes the essential parts of an object together with some futzing around with types to make sure that hashes don't collide. All of that stuff could be completely automated – telling the language what parts of a thing are important and getting == and isequal and hash for free seems like a good deal to me. As @tlycken points out, you don't need to do it this way, it would just be available to you and quite convenient.

JeffBezanson · 2015-07-20T16:11:54Z

I think that's pretty easy to do using

==(a::MyType, b::MyType) = (a.x, a.z) == (b.x, b.z)
hash(a::MyType, h) = hash((a.x, a.z), h)

StefanKarpinski · 2015-07-20T16:14:15Z

So you're type has hash collisions with tuples of its field values?

tomasaschan · 2015-07-20T16:23:29Z

I can't find anything in the manual either about exactly how a best-practice implementation of == and hash should look like for a custom composite type. The closest I get is the paragraph above this headline (sorry for the weird linking, but it was very far to scroll from the next headline above... click the link and scroll up a little) but the only thing mentioned there is that

For other [in context: non-numerical types] types, isequal() defaults to calling ==(), so if you want to define equality for your own types then you only need to add a ==() method. If you define your own equality function, you should probably define a corresponding hash() method to ensure that isequal(x,y) implies hash(x) == hash(y).

Regardless of whether or not it's "really simple" to implement both ==() and hash(), at the very least the correct way to do it needs to be better documented.

And I still feel that providing a shortcut for users that want one is a really nice feature (one that I've missed in many other settings, including Java and C#).

JeffBezanson · 2015-07-20T16:26:48Z

My point is that the proposed mechanism is not significantly simpler, and just adds more stuff you have to know about. We could recommend using hash((a.x, a.z), hash(typeof(a), h)).

The get_parts interface has a performance pitfall, in that it's too easy to end up copying a whole collection: parts(x::MyType) = (x...,).

mbauman · 2015-07-20T16:35:26Z

I can't find anything in the manual either about exactly how a best-practice implementation of == and hash should look like for a custom composite type.

I agree: #11794 (comment)

tomasaschan · 2015-07-20T16:47:29Z

@mbauman Very nice indeed! I had not seen that PR before, but it makes me very happy :)

StefanKarpinski · 2015-07-20T19:00:19Z

Ok, so back to the issue at hand. It seems like we should at least remove the default object-identity hashing for mutables. Should we just leave immutables alone?

JeffBezanson · 2015-07-22T17:49:21Z

I started experimenting with this, and surprisingly found quite a few places in Base that depend (correctly) on object_id hashing for mutables: Type, Method, LambdaStaticData, Function, Module
So I'm afraid this is a very breaking change. The current approach "just works" for the fairly common case where you're associating data with a particular object. This trades off against the problem in this issue. Our policy has often been that if you're defining a new type, making it work is on you, and all we can really offer here is a no-method error, which might not be worth the breakage this causes.

yuyichao · 2015-07-22T17:58:42Z

If we still want to keep the default object id hash method, raising error when there's a custom defined == should be fine then?

StefanKarpinski · 2017-08-16T14:46:51Z

More confusiong due to this: https://discourse.julialang.org/t/how-to-use-unique-function/5266.

JeffBezanson · 2018-01-17T00:52:20Z

Update: started working on this in #24354. It revealed that there are many uses of Dict that should really be ObjectIdDict, but dropping the type parameters loses a lot of type information (often of value as documentation). Hence #25210 adds IdDict{K,V}, and is near ready to merge.

The primary issue is that you can no longer naively ask whether a random object is in a set or dictionary, which can be pretty annoying. E.g. rewriting x in [1, 2] to x in Set([1, 2]) is no longer safe and might cause an error. We have some functions like unique that internally rely on this.

So I have another (slightly crazy) idea: instead of removing the hash fallback, make it return 0 instead. That way it's correct by default, just inefficient. Then defining hash methods becomes an optimization you do as needed.

StefanKarpinski · 2018-01-17T02:50:32Z

I think that crazy idea may actually be quite brilliant.

mauro3 · 2018-01-17T09:32:37Z

This has the potential to turn the "unique [etc] does not work"-questions into "unique [etc] is super slow"-questions? However, then there is an obvious place for the afflicted to check: "Performance tips".

All in all, this seems a very Julian solution, +1.

StefanKarpinski · 2018-01-17T16:10:31Z

Here's a different perspective on this (based on discussion with Jeff)... Using a hash table is only one way to implement dictionary and set data structures. Another perfectly valid and often just fine implementation is a linear array of values. We can consider that to be the default implementation, and providing a hash function for a type is a way to tell the system, "hey, this type is hashable, so use a fancy hash table implementation of dict and set instead of the standard linear array implementation" – and the system will magically do that for you. This works because in the case where hash is the constant zero function, a hash table degenerates to a linear array implementation.

fixes #12198

JeffBezanson · 2018-01-18T20:22:29Z

Best-of-both-worlds idea from Stefan: check whether == for an object calls the default definition, and if so use object_id. Otherwise return a constant. That way we get the benefits of this idea without hurting performance of objects that work now. That depends on our ability to optimize such a check, but that can be added later and will only break code that somehow depends on the current hash function being wrong for their type (which is already broken).

StefanKarpinski · 2018-01-18T20:46:34Z

Nice thing about the best-of-both-worlds idea solution is that it's strictly non-breaking: code that was calling hash for whatever reason continues to work – it gets a zero where it would have gotten an objectid value before, but that's fine; code that was actually putting objects into dicts or sets without defining a custom hash function will magically become unbroken, albeit potentially slower.

mauro3 · 2018-01-19T13:31:24Z

As a added bonus getting this feature:

That depends on our ability to optimize such a [method-exists] check

sounds super useful to define other fast-dispatching traits in terms of defined methods, yay!

StefanKarpinski changed the title ~~too easy to accidentally break hashing~~ custom hashing is too easy to accidentally break Jul 17, 2015

vtjnash mentioned this issue Jul 27, 2017

== for immutables should recursively call == on its fields #4648

Open

JeffBezanson mentioned this issue Aug 15, 2017

State clearly in the documentation that the iteration of a Set occurs in undefined order #23269

Closed

JeffBezanson added a commit that referenced this issue Oct 26, 2017

remove fallback hash method. fixes #12198

38a1db2

JeffBezanson added a commit that referenced this issue Oct 27, 2017

remove fallback hash method. fixes #12198

4aa0205

mauro3 mentioned this issue Dec 9, 2017

WeakKeyDict based on object ids #3002

Open

JeffBezanson added the triage This should be discussed on a triage call label Jan 8, 2018

JeffBezanson added a commit that referenced this issue Jan 18, 2018

make fallback hash a constant instead of based on object id

8792046

fixes #12198

JeffBezanson added a commit that referenced this issue Jan 18, 2018

make fallback hash a constant instead of based on object id

d4b8f16

fixes #12198

JeffBezanson mentioned this issue Jan 18, 2018

make fallback hash a constant instead of based on object id #25621

Closed

JeffBezanson modified the milestones: 1.0, 1.x Jan 18, 2018

JeffBezanson added collections Data structures holding multiple items, e.g. sets and removed breaking This change will break code triage This should be discussed on a triage call labels Jan 18, 2018

mbauman mentioned this issue Feb 13, 2018

Pointer hashing breaks equality #26038

Closed

mbauman mentioned this issue May 6, 2019

unique() behaves strangely for mutables with custom == function #31940

Closed

kalmarek mentioned this issue Jun 27, 2019

unique polynomials oscar-system/Singular.jl#120

Closed

felixcremer mentioned this issue May 7, 2021

Multiple possible axis matches found - mapCube "time" axis error with >1 InDims("time") and OutDims("time") JuliaDataCubes/YAXArrays.jl#60

Closed

DilumAluthge removed this from the 1.x milestone Mar 13, 2022

mbauman mentioned this issue Jul 15, 2024

Impossible values of allunique() and allequal() #55134

Closed

nsajko added hashing equality Issues relating to equality relations: ==, ===, isequal labels Jul 16, 2024

custom hashing is too easy to accidentally break #12198

custom hashing is too easy to accidentally break #12198

Comments

StefanKarpinski commented Jul 17, 2015

mauro3 commented Jul 17, 2015

StefanKarpinski commented Jul 17, 2015

mauro3 commented Jul 17, 2015

StefanKarpinski commented Jul 17, 2015

yuyichao commented Jul 17, 2015

yuyichao commented Jul 17, 2015

StefanKarpinski commented Jul 17, 2015

yuyichao commented Jul 17, 2015

yuyichao commented Jul 17, 2015

yuyichao commented Jul 17, 2015

JeffBezanson commented Jul 17, 2015

carnaval commented Jul 17, 2015

jakebolewski commented Jul 17, 2015

JeffBezanson commented Jul 17, 2015

carnaval commented Jul 17, 2015

jakebolewski commented Jul 17, 2015

davidagold commented Jul 18, 2015

JeffBezanson commented Jul 18, 2015

tomasaschan commented Jul 18, 2015

StefanKarpinski commented Jul 20, 2015

JeffBezanson commented Jul 20, 2015

StefanKarpinski commented Jul 20, 2015

tomasaschan commented Jul 20, 2015

JeffBezanson commented Jul 20, 2015

mbauman commented Jul 20, 2015

tomasaschan commented Jul 20, 2015

StefanKarpinski commented Jul 20, 2015

JeffBezanson commented Jul 22, 2015

yuyichao commented Jul 22, 2015

StefanKarpinski commented Aug 16, 2017

JeffBezanson commented Jan 17, 2018

StefanKarpinski commented Jan 17, 2018

mauro3 commented Jan 17, 2018

StefanKarpinski commented Jan 17, 2018

JeffBezanson commented Jan 18, 2018

StefanKarpinski commented Jan 18, 2018

mauro3 commented Jan 19, 2018