custom hashing is too easy to accidentally break #12198
Comments
Yup, linked to that in the middle of that. Probably should have put it at the end. |
Oops, sorry, I should have read more carefully. |
No worries – I should have posted more clearly :-) |
@StefanKarpinski Is there something more we need to worry about if we just cache the hash or otherwise add an additional comparison of the hash, as you proposed somewhere in that thread? We could just throw an error in such a case. |
And by such case I mean when they are |
That would fix it, but you would only get the error in the cases where two hashes accidentally "collide". I'm kind of into the idea of defining |
Getting a predictable result (
I just think it might be kind of a headache for type stability for types that allow "missing values". |
I agree that it might be hard to hit the error and notice the undefined behavior in certain scenarios, so I'm not saying this is the best we can do either (e.g. if |
Actually since we have such a type. The question is basically, how do you define Feels like it should return a |
I guess we should just remove the fallback for |
Can I frame this ? Many hours wasted in debugging (lack of recompilation + wrong bootstrap order = hashing by object id unexpectedly) |
Hooray! |
Huh? What have you guys been |
hashing my life away |
Question: What is an "identity tuple"? Is the idea that if For If y'all decide to try the |
That is an awesome vocab word! Personally I would be against this sort of interface obscurity: "oh, if you want to define |
One wouldn't be required to define I think one of the reasons many users don't create a However, if |
@JeffBezanson, the weird thing is that both equality and hashing are clearly based on extrapolations of the same question – "which pieces of this thing are essential?" Equality takes two objects and does a pairwise recursive equality check on their essential parts; hashing recursively hashes the essential parts of an object together with some futzing around with types to make sure that hashes don't collide. All of that stuff could be completely automated – telling the language what parts of a thing are important and getting |
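A minimal sketch of the "essential parts" idea described above, assuming a hypothetical hook named `essential_parts` and a hypothetical abstract type `PartsBased` (neither exists in Base). Both `==` and `hash` are derived from the same tuple of parts, so they cannot drift apart:

```julia
# Hypothetical opt-in supertype and hook; not part of Base.
abstract type PartsBased end
function essential_parts end

# Equality compares the essential parts pairwise.
Base.:(==)(a::T, b::T) where {T<:PartsBased} =
    essential_parts(a) == essential_parts(b)

# Hashing mixes the concrete type into the seed, then hashes the same parts.
Base.hash(a::PartsBased, h::UInt) =
    hash(essential_parts(a), hash(typeof(a), h))

mutable struct Account <: PartsBased
    id::Int
    note::String   # treated as non-essential in this sketch
end
essential_parts(a::Account) = (a.id,)

same = Account(1, "first") == Account(1, "second")
n_unique = length(unique([Account(1, "a"), Account(1, "b"), Account(1, "c")]))
```

With this in place, a type only has to say which pieces matter; whether that is worth an extra interface is exactly what the rest of the thread debates.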
I think that's pretty easy to do using
|
So your type has hash collisions with tuples of its field values? |
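The collision asked about here can be reproduced with a short sketch. The type names `Point` and `SaltedPoint` are hypothetical, and using a `Symbol` as the salt is one common convention, not the only option:

```julia
struct Point
    x::Int
    y::Int
end
# Naive delegation: hash exactly like the tuple of fields.
Base.hash(p::Point, h::UInt) = hash((p.x, p.y), h)

struct SaltedPoint
    x::Int
    y::Int
end
# Mix a type-specific value into the seed before hashing the fields.
Base.hash(p::SaltedPoint, h::UInt) = hash((p.x, p.y), hash(:SaltedPoint, h))

# The naive version hashes identically to the plain tuple by construction.
collides = hash(Point(1, 2)) == hash((1, 2))
# Equal values of the salted type still hash equally.
consistent = hash(SaltedPoint(1, 2)) == hash(SaltedPoint(1, 2))
```

A collision like this does not break correctness, since lookups still fall back to `isequal`, but it costs performance when the type and plain tuples share a container.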
I can't find anything in the manual either about exactly how a best-practice implementation of
Regardless of whether or not it's "really simple" to implement both And I still feel that providing a shortcut for users that want one is a really nice feature (one that I've missed in many other settings, including Java and C#). |
My point is that the proposed mechanism is not significantly simpler, and just adds more stuff you have to know about. We could recommend using The get_parts interface has a performance pitfall, in that it's too easy to end up copying a whole collection: |
I agree: #11794 (comment) |
@mbauman Very nice indeed! I had not seen that PR before, but it makes me very happy :) |
Ok, so back to the issue at hand. It seems like we should at least remove the default object-identity hashing for mutables. Should we just leave immutables alone? |
I started experimenting with this, and surprisingly found quite a few places in Base that depend (correctly) on object_id hashing for mutables: Type, Method, LambdaStaticData, Function, Module |
If we still want to keep the default object id hash method, raising error when there's a custom defined |
More confusion due to this: https://discourse.julialang.org/t/how-to-use-unique-function/5266. |
Update: started working on this in #24354. It revealed that there are many uses of The primary issue is that you can no longer naively ask whether a random object is in a set or dictionary, which can be pretty annoying. E.g. rewriting So I have another (slightly crazy) idea: instead of removing the |
I think that crazy idea may actually be quite brilliant. |
This has the potential to turn the "unique [etc] does not work"-questions into "unique [etc] is super slow"-questions? However, then there is an obvious place for the afflicted to check: "Performance tips". All in all, this seems a very Julian solution, +1. |
Here's a different perspective on this (based on discussion with Jeff)... Using a hash table is only one way to implement dictionary and set data structures. Another perfectly valid and often just fine implementation is a linear array of values. We can consider that to be the default implementation, and providing a |
Best-of-both-worlds idea from Stefan: check whether |
Nice thing about the best-of-both-worlds idea solution is that it's strictly non-breaking: code that was calling hash for whatever reason continues to work – it gets a zero where it would have gotten an |
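The linear-search view can be imitated today with a deliberately constant hash. This is only an illustration of the idea, not the mechanism from #24354; the type `Cell` is hypothetical:

```julia
mutable struct Cell
    x::Int
end
Base.:(==)(a::Cell, b::Cell) = a.x == b.x

# Constant contribution: every Cell lands in the same probe sequence,
# so Set/Dict lookups degrade to comparing with `==` one by one.
Base.hash(::Cell, h::UInt) = h

s = Set([Cell(1), Cell(1), Cell(2)])
n = length(s)
found = Cell(1) in s
```

The set is still correct, because every lookup ends in an `isequal` comparison; it is just slow for large collections, which is the trade-off the comments above describe.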
As an added bonus, getting this feature:
sounds super useful to define other fast-dispatching traits in terms of defined methods, yay! |
If you define a custom `==` for a type and don't define a matching `hash` method for it, it silently breaks hashing of that type:
Whether you get one, two or three values depends on whether the last digits of their address in memory happen to be the same or not. This is because hashing is based on object identity, which is based on address. See discussion here:
Having it this easy to silently break the hashing of a type seems like a bad design. Although the problem is worse for mutable types since people inevitably want to overload == for them, it's not just a problem for mutable objects:
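A minimal sketch of the failure mode in current Julia syntax, for both a mutable and an immutable type. The type names `Broken`, `Fixed` and `Wrapper` are hypothetical and chosen only for illustration:

```julia
# Mutable type with a custom == but no matching hash.
mutable struct Broken
    x::Int
end
Base.:(==)(a::Broken, b::Broken) = a.x == b.x

# Same type shape, with hash defined from the same field as ==.
mutable struct Fixed
    x::Int
end
Base.:(==)(a::Fixed, b::Fixed) = a.x == b.x
Base.hash(a::Fixed, h::UInt) = hash(a.x, hash(:Fixed, h))

# Immutable wrapper around a mutable field: the default hash still depends
# on the identity of the wrapped array, not on its contents.
struct Wrapper
    v::Vector{Int}
end
Base.:(==)(a::Wrapper, b::Wrapper) = a.v == b.v

n_broken  = length(unique([Broken(4), Broken(4), Broken(4)]))  # between 1 and 3
n_fixed   = length(unique([Fixed(4), Fixed(4), Fixed(4)]))     # always 1
n_wrapper = length(unique([Wrapper([1]), Wrapper([1])]))       # 1 or 2
```

Only the `Fixed` result is predictable; the other two depend on where the objects happen to live in memory.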