Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy circuits and parameters when adding them to the EquivalenceLibrary in Rust. #13532

Closed
wants to merge 5 commits into from

Conversation

raynelfss
Copy link
Contributor

@raynelfss raynelfss commented Dec 5, 2024

Fixes #13504

Summary

Adds a missing call to copy when adding equivalencies in the EquivalenceLibrary, which solves the parallel failures happening with #13504.

Details and comments

When EquivalenceLibrary was ported to rust, we missed the specific calls to copy() when adding an instance of Equivalence into the graph. This although harmless at first could break operability in parallel because, during serialization and de-serialization, the memory slot occupied by said parameter would not be available anymore and, therefore, when the BasisTranslator tries to re-assign this parameter via lookup, the search would fail due to said parameter being linked to a memory address that may not exist in that instance.

The following commits add the missing copy() calls for both the parameters and the instance of CircuitData, but also add a test-case for it in the BasisTranslator. Although the failure happens within the EquivalenceLibrary, we should keep the test there because the BasisTranslator is what detects the dead parameter trying to be replaced.

Tasks

  • Add release note

Co-authored-by: Julien Gacon gaconju@gmail.com

@raynelfss raynelfss added priority: high Changelog: Bugfix Include in the "Fixed" section of the changelog Rust This PR or issue is related to Rust code in the repository mod: transpiler Issues and PRs related to Transpiler labels Dec 5, 2024
@raynelfss raynelfss marked this pull request as ready for review December 5, 2024 18:17
@raynelfss raynelfss requested a review from a team as a code owner December 5, 2024 18:17
@qiskit-bot
Copy link
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core

@jakelishman
Copy link
Member

jakelishman commented Dec 5, 2024

Something seems strange about this. Why does the equivalence lookup mutate any part of the circuit at all, and why does serialisation/deserialisation via pickle not already produce separated instances? Is there any chance that the bug fix of #13482 helped here, before this PR?

Maybe I just don't understand the mechanism of the fix here, but it doesn't seem right to copy a circuit when adding it as an equivalence - adding the equivalence should be transferring ownership, and once it's a part of the rule graph, it absolutely should not be being mutated by anything (edit: so my point is that it shouldn't matter even if the EquivalenceLibrary was in totally shared memory - it still should be thread safe to retrieve rules from it, and if it isn't, that feels like an underlying problem.)

@coveralls
Copy link

coveralls commented Dec 5, 2024

Pull Request Test Coverage Report for Build 12187010225

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • 14 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.02%) to 88.966%

Files with Coverage Reduction New Missed Lines %
crates/qasm2/src/parse.rs 6 97.62%
crates/qasm2/src/lex.rs 8 91.98%
Totals Coverage Status
Change from base Build 12182055039: 0.02%
Covered Lines: 79384
Relevant Lines: 89230

💛 - Coveralls

@raynelfss
Copy link
Contributor Author

raynelfss commented Dec 5, 2024

I don't think anything is being mutated within the circuit here, but we do perform that extra copy when storing things within the EquivalenceLibrary. I'm not quite sure why not copying the parameters breaks serialization (not copying the circuit seems to be fine there). Seems like copying the circuit preserved it, not copying the Params.

As mentioned in the PR details, I suspected it had to do with some stale address being re-used during deserialization, but this shouldn't necessarily break the previous calls to __getstate__ for the Parameter, since you should be able to rebuild it from the state obtained during the calls to dump.

Also, #13482 was what exposed this bug (described in #13504) as some parameters during compose_transforms in BasisTranslator were not being exposed correctly to Python, and we could not see what was happening.

The only change that seemed to have broken things in between was the copying over Parameter instances which, as I said, I had removed before thinking it would increase performance. But it seems to have broken some things, so I re-added it. I'm open to continue this discussion @jakelishman.

Update: Seems like copying the circuit is what makes it work here, not copying the Params.

@raynelfss raynelfss force-pushed the basis-translator-error branch from 4fd00ba to fa1b056 Compare December 5, 2024 19:50
@jakelishman
Copy link
Member

I'm trying to have a little look, but copying the circuit is the more worrying part to me - the only live reference to the circuit after a call to add_equivalence should be the one held by the library itself (assuming the caller just drops the reference, which we do in its construction), so why is copying actually working here? Seems like there's two problems:

  • we must be keeping too many references to some inner objects somehow
  • no circuits referenced by the EquivalenceLibrary should actually ever get mutated.

I think the copy isn't isolating the circuit from the outer environment (which is why we usually copy, but here's there's no extant references in the environment), I think it's somehow just isolating one part of the EquivalenceLibrary from another part of the same equivalence, so now there's two copies of the same circuit in there, and one of them is still getting erroneously mutated in compose_transforms, but that's not the one that's used to rebuild things later.

This might work, but it probably isn't actually the root problem. I think there probably is some mistake lurking in compose_transforms and EquivalenceLibrary that is allowing an owned CircuitData object to have its cache written to, then following that, I suspect that the real root cause here is that CircuitData::clone is cloning references to cached Python objects. We treat the PackedInstruction.py_op_cache field as owning that Python object, but the derived Clone will only clone the reference, so now there'll be two CircuitData objects that both think they "own" the contained Python object - mutating the cache in one of them might invalidate the other.

I don't know 100% precisely, but I have strong suspicions that we have problems around the ownership model in the Python-object cache surrounding Rust-space Clone. I think there's two ways of thinking about it:

  • Clone should produce a fully owned value, akin to Python's deepcopy.
  • PackedInstruction::clone should deep-copy all the Rust-space objects, but treat Python objects like references (there's plenty of precedent in Rust for this: consider Rc, and, of course, Py).

I imagine we quite possibly use both in Qiskit. I think, though, that given that PackedInstruction considers itself to own the Python object it might hold (especially in the case of a StandardGate), then the right solution is to manually impl Clone for PackedInstruction to not propagate the cache for gates that aren't Python-space singletons.

@mtreinish
Copy link
Member

I don't know 100% precisely, but I have strong suspicions that we have problems around the ownership model in the Python-object cache surrounding Rust-space Clone. I think there's two ways of thinking about it:

  • Clone should produce a fully owned value, akin to Python's deepcopy.
  • PackedInstruction::clone should deep-copy all the Rust-space objects, but treat Python objects like references (there's plenty of precedent in Rust for this: consider Rc, and, of course, Py).

I imagine we quite possibly use both in Qiskit. I think, though, that given that PackedInstruction considers itself to own the Python object it might hold (especially in the case of a StandardGate), then the right solution is to manually impl Clone for PackedInstruction to not propagate the cache for gates that aren't Python-space singletons.

I strongly suspect this is the root cause of the issue too or something similar. My assumption around Clone for PackedInstruction it was the second bullet, but I could easily see there being places treating it like the former. My gut feeling on #13482 was we should invalidate the cache instead of update the cached instance because my instinct was that the cache is only safe if the objects are treated immutably from rust (it was also my proposed fix for #13504) since we never knew how many references are out there too (my larger concern was coming from python space mutation of a shared object that's cached in a PackedInstruction) I guess this issue is just another reason for us to work on disabling the py-clone feature longer term because we'd stop having this ambiguity around the use of clone for things containing Py<T>.

@jakelishman
Copy link
Member

Ok, here's a much more minimal reproducer of the problem without this PR:

from qiskit.circuit import QuantumCircuit, StandardEquivalenceLibrary
from qiskit.transpiler.passes import BasisTranslator

rx_key = next(key for key in StandardEquivalenceLibrary.keys() if key.name == "rx")

# The circuit doesn't need to be parametric.
qc = QuantumCircuit(1)
qc.rx(0.5, 0)

BasisTranslator(
    equivalence_library=StandardEquivalenceLibrary,
    target_basis=["cx", "id", "rz", "sx", "x"],
)(qc)

inst = StandardEquivalenceLibrary._get_equivalences(rx_key)[0].circuit.data[0]
print(inst.params, inst.operation.params)

A cached gate's parameters are getting mutated in the equivalence library by the basis translator, which is staying visible in the library.

The CircuitFromPython that's entering add_equivalence isn't kept around by the caller, so Python-space copying it shouldn't matter (in fact, I'm not even 100% sure why we do equivalent_circuit.clone()) - the only thing I can think of is that the way we currently construct the StandardEquivalenceLibrary is old-style Qiskit, where we do stuff like for rule in [(RXGate(alpha), [qr[0]], []), ...]: qc.append(*rule) instead of the far more natural qc.rx(alpha, 0). That means that the StandardEquivalenceLibrary will be full of operators that do have a cached Python op, but the copy added by this PR effective clears the cache (the copy method does correctly invalidate the cache of standard gates if copy_instructions=True).

So my updated suspicions: compose_transforms is taking a Rust-space clone of CircuitData objects owned by the EquivalenceLibrary. It then calls assign_parameters on that, which mutates the Python-op cache (if it exists), which propagates back to the EquivalenceLibrary. This PR happens to work only because it clears out the Python-op caches, not because it copies the incoming circuit. The problem of "mutating py-op caches in a Rust-space clone of CircuitData can propagate backwards" remains, and that's the PackedInstruction::clone problem.

@jakelishman
Copy link
Member

jakelishman commented Dec 5, 2024

Matt: just saw your message now.

I think assign_parameters post #13482 is correct. I think the ownership model should be that the PackedInstruction cache is owned data, not a reference, since it's meant to be entirely in sync with the rest of the PackedInstruction. If you PackedInstruction::clone and then mutate the Rust-space params, that's obviously completely sound within Rust and should work, but if we're treating it as a shallow referent-based copy for Python space, then we're not allowed to update the cache, which makes it data incoherent.

I think we need to impl Clone for PackedInstruction to invalidate the cache if we know it to be unsafe, and to Python-space deep copy if it's a custom Instruction/Gate/etc; that Clone is supposed to replace Python's old Instruction.copy (deep copy).

@jakelishman
Copy link
Member

Hmm, no, actually I can be convinced by your way, Matt - it shakes out cheaper in memory usage, and it's cleaner to never attempt to mutate Python objects from Rust other than clean rebuilds.

I was worried about custom gates needing to be updated, but if we update the Python object via the Gate pointer, and clear the PackedInstruction cache, it localises the Python mutation only to the point it's absolutely necessary.

@jakelishman
Copy link
Member

jakelishman commented Dec 5, 2024

But that also means that PackedInstruction needs to forbid all mutable access to itself other than by controlled setters, or eagerly invalidate the cache on production of a mutable reference to inner data, because all mutations need to invalidate the cache.

@raynelfss raynelfss added the on hold Can not fix yet label Dec 6, 2024
@raynelfss
Copy link
Contributor Author

Right now the closest thing we could do without introducing big changes is to re-implement the Clone trait for the PackedInstruction struct, in which we would not copy over the cached Python gate. This seems to also solve the issue without having to drastically decrease performance by copying over the entire circuit. We can then do a follow up re-balancing PackedInstruction to have setters that also clean up the cache.

Is there anywhere else where we perform similar caching and might want to look at?

@raynelfss
Copy link
Contributor Author

This will be superceded by #13543

@raynelfss raynelfss closed this Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: Bugfix Include in the "Fixed" section of the changelog mod: transpiler Issues and PRs related to Transpiler on hold Can not fix yet priority: high Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Basis translation can fail if run in parallel
5 participants