-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix hash of Parameter
and ParameterExpression
#10875
Conversation
One or more of the the following people are requested to review this:
|
This fixes the construction paths for `Parameter` and its hash such that it will now correctly hash equal to any `ParameterExpression`s that it compares equal to. This is a requirement of the Python data model for hashmaps and hashsets, which previously we were breaking. In order to achieve this, we slightly modify the hash key such that the `Parameter` instances are no longer a part of the hash of `ParameterExpression`, which means we can use the same hashing strategy for both. This rearrangement has the benefit of removing the requirement for the `__new__` overrides on `Parameter` and `ParameterVectorElement`.
844c1eb
to
4913f11
Compare
Co-authored-by: Matthew Treinish <mtreinish@kortar.org>
same name do not compare equal to help catch shadowing bugs when two circuits | ||
containing the same named parameters are spurious combined. Setting the ``uuid`` | ||
field when creating two parameters to the same thing (along with the same name) | ||
allows them to be equal. This is useful during serialization and deserialization. | ||
""" | ||
self._name = name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be unrelated to this PR, but why is the _name
attribute needed? Isn't that information already included in _symbol_expr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Off the top of my head, I think you're completely right and there's no real problems with dropping this field to reduce a pointer's worth of memory usage per parameter, and slightly reduce the hashing work we need to do.
Because there's a (minor) risk that we break something by inferring name
from the sympy/symegine symbol rather than storing it ourselves, perhaps let's try that in a follow-up PR so we can isolate it if it does happen to cause bugs without losing this bugfix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I created #10880
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice. It makes Parameter
feel a little more normal to me, though I still find the way it subclasses ParameterExpression
but is needed to create a ParameterExpression
weird.
return self._hash | ||
|
||
# We have to manually control the pickling so that the hash is computable before the unpickling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to understand why the unpickling was putting this parameter into a hashmap. The issue is the self._parameter_symbols = {self: symbol}
line in __init__
. I am not sure how pickle avoids an infinite loop there when it pickles the attributes and slots of an object, but it must do something clever. When unpickling, it wants to recreate this dictionary but self
is not fully initialized with a _hash
when it does.
One alternative approach could be to override _parameter_symbols
with a property that returns {self: self._symbol_expr}
so that self
is not in the data of the class instance. I think that would almost be enough to avoid __getstate__
and __setstate__
. However, I think we might still want to override __getstate__
to null out _hash
and have __hash__
recompute it if it is None
. The reason is to avoid pickling in one Python process and unpickling in another with a different hash seed and not having the restored _hash
match what would be calculated for a fresh hash()
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-referential objects or just reference cycles in general aren't all that uncommon in Python - they're handled by all the standard library serialisers etc. Even printing works on standard objects:
>>> a = a[0] = [[]]
>>> a
[[...]]
(where that makes a
a list that contains itself). I think a pretty large number of reference cycles can be handled in pickle
serialisation just by it inserting references to the various objects before it initialises each of them, and keeps a memo list of things it's still got to restore state to to avoid looping. In the cases where there's a particular partial order between an object's attributes, that's when you need __setstate__
.
We could potentially make _parameter_symbols
a property, but as you say, we'd still need to recompute the cached _hash
on pickle restoration, so I'm not certain there's a huge reason to add the property - if nothing else, adding it as a descriptor means it needs to be recalculated each time it's accessed, and there's an overhead just from having the indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps self-reference is not uncommon, but it has the potential to result in code that is hard to understand, and I think that has become the case here. Can we break the cycle by making Parameter no longer inherit from ParameterExpression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's obviously subjective, but personally I thought that this all interacted without much trouble with Python; it was just broken before because I think people weren't aware that there were data-model invariants they were required to uphold. I think it maybe feels more complex because the previous implementation was broken; this PR includes a few additional comments to highlight the invariants.
It's a little unusual to have a concrete class depend on a concrete subclass of itself (but common to have an abstract class depend on a concrete subclass), but it wouldn't be super easy to walk that back now; there's a lot of extant code that depends on being able to retrieve the .parameters
as a set from an expression.
(edit: also, I was saying that "self-referential objects aren't all that uncommon" just in response to Will's surprise that pickle handles them (generally) well, more than saying it in support of the architecture.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for adding all the comments in the code 😉
Summary
This fixes the construction paths for
Parameter
and its hash such that it will now correctly hash equal to anyParameterExpression
s that it compares equal to. This is a requirement of the Python data model for hashmaps and hashsets, which previously we were breaking. In order to achieve this, we slightly modify the hash key such that theParameter
instances are no longer a part of the hash ofParameterExpression
, which means we can use the same hashing strategy for both.This rearrangement has the benefit of removing the requirement for the
__new__
overrides onParameter
andParameterVectorElement
,and the pickle overrides for. edit: Neko caught a case where the vector and index of aParameterVectorElement
ParameterVectorElement
weren't restored after unpickling in the first iteration of this PR, which our internal test suite didn't catch. I've restored the special handling to the pickle there, and added a regression test.Details and comments
With
_hash
still cached inParameter
, I didn't see any meaningful changes to performance in the benchmarks I ran locally (removing the cache of it does cause a noticeable slowdown inQuantumCircuit.assign_parameters
for circuits with many parameters, though). I saw a 4% improvement in the (mostly meaningless) microbenchmarkmost likely due to no longer having the
__new__
call or thesuper().__init__
resolution in the construction path, but that doesn't really mean anything.ParameterVector
construction saw approximately the same improvement (maybe slightly smaller).Fix #9299