Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

py2/py3: mutable objects and hashes #268

Closed
MartinBasti opened this issue Jul 17, 2017 · 6 comments
Closed

py2/py3: mutable objects and hashes #268

MartinBasti opened this issue Jul 17, 2017 · 6 comments

Comments

@MartinBasti
Copy link
Contributor

Almost all objects in python DNS that have implemented __eq__ have no __hash__ method implemented and even majority of them are mutable. This has several consequences.

Let's assume following class

class HashTest(object):
    def __init__(self, num):
        self.num = num
    def __eq__(self, other):
        return self.num == other.num 

Example 1. __hash__ not implemented

t = HashTest(1)
testset = set((t,))
  • PY2: uses id() as hash (hash may be randomized in newer py versions)
  • PY3: raises error: unhashable type if __eq__ is implemented without __hash__

However due usage of id() as hash Py2 works unexpectedly:

>>> [HashTest(1) in testset for i in range(10)]
[False, False, False, False, False, False, False, False, False, False]
>>> [HashTest(1) == t for i in range(10)]
[True, True, True, True, True, True, True, True, True, True]

Example 2. __hash__ implemented

class HashTest(object):
    def __init__(self, num):
        self.num = num
    def __eq__(self, other):
        return self.num == other.num 
    def __hash__(self):
        return hash(self.num)

t2 = HashTest(1)
testset2 = set((t2,))

Python2/3:

>>> [HashTest(1) in testset2 for i in range(10)]
[True, True, True, True, True, True, True, True, True, True]

However as was said before, majority of objects in dnspython are mutable and change of attribute may cause undesired side effects in cases where hash() has been used.
Python2/3:

>>> t2.num = 5
>>> [HashTest(5) in testset2 for i in range(10)]
[False, False, False, False, False, False, False, False, False, False]
>>> [HashTest(5) == t2 for i in range(10)]
[True, True, True, True, True, True, True, True, True, True]

Summary

  • py2/3 differs in default implementation of hashing Py3 disables default hash() based on id() when __eq__ is implemented ==> safe)
  • py2 allows hashing of objects that have implemented __eq__, but because by default it uses hash() it violates basic assumption a == b --> hash(a) == hash(b) and may result in incorrect results when sets are used (or any functions which rely on hashing) [Example 1]
  • implementing own __hash__ method that respect a == b --> hash(a) == hash(b) solves [Example 1] for immutable objects, but because dnspython uses mainly mutable objects we are hitting [Example 2]. If object used in a set (for example) is changed, then the set is incorrect and must be recreated to work again.

Possible solutions

  • Keep compatibility with python2 by implementing __hash__ to be id() => may result into unexpected behavior
  • Make object unhashable (py3 way) what may result into destroying backward compatibility. (What actually may be good thing because people may use sets with DNS records from dnspython and this may help prevent users to have "hard to debug" issues) [this can be done by deprecation warning in next release and then completely removed in future releases]
  • Combine the best of both worlds by implementing __hash__ for objects that have some immutable attributes (and use only these attributes for counting hash) and disabling __hash__ for objects that have no immutable attributes. [this can be done by deprecation warnings as mentioned before]
  • Use constants as hash for objects and make it work even for mutable objects, but basically this violates the basic principle of hash tables so it will reduce searching speed to zero (maybe even lower than searching in lists).
  • Do nothing and keep py2 compatibility as is, and let py3 to break any usage of sets and force users during migration to py3 to fix their implementations which incorrectly rely on py2 hashes from id().

Is good practice to have implemented custom __hash__ together with custom __eq__ that provides results which fit into a == b --> hash(a) == hash(b) or disable hashing of that objects at all. I think that both py2/py3 should produce the same result and IMO is safer to disable __hash__ for objects that are mutable.

@MartinBasti
Copy link
Contributor Author

Also is possible to make objects immutable, but it will ruin backward compatibility and I'm not sure if result will be usable.

@pspacek
Copy link
Collaborator

pspacek commented Aug 22, 2017

I just got bitten by this... having said that, I vote for Combine the best of both worlds by implementing hash for objects that have some immutable attributes approach.

@rthalley rthalley added the 2.0 label Dec 3, 2018
@rthalley
Copy link
Owner

rthalley commented Jan 7, 2019

There's another problem too, which is that we've got things like rdata which are currently mutable but which have non-trivial hashes. For these, it might be best to just make them immutable. I don't think that's much of an inconvenience in most cases. It might be annoying for people processing zones though. Imagine something supplying a "change every MX target name equal to X to Y". You could write this by direct mutation today, but that code would need to change to do a replace instead of a mutation. I think this is probably OK, but we should definitely agree this is the way to go before anyone writes anything :)

@pspacek
Copy link
Collaborator

pspacek commented Jan 10, 2019

I have no idea how big trouble it would be to make rdata & rrsets immutable... In general I'm in favor of this change but really do not have enough data to judge impact on users.

@rthalley
Copy link
Owner

I'll ask a question about it on dnspython-users. Another alternative is to leave things as-is, define hashes for other structures (like rdatasets and rrsets), and simply say "if you're using an object in a dictionary, you'd better not be changing it". This is (possibly!) pragmatic, in that it allows both the "treat as immutable" and "manipulate rdata and rdatasets by modifying them" usage, but is also dangerous in that inevitably someone will mutate an object that is in a dictionary, changing its hash.

@rthalley
Copy link
Owner

rthalley commented Apr 3, 2020

rdata is now immutable

@rthalley rthalley closed this as completed Apr 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants