-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use rust gates for Optimize1QGatesDecomposition #12650
Conversation
This commit moves to using rust gates for the Optimize1QGatesDecomposition transpiler pass. It takes in a sequence of runs (which are a list of DAGOpNodes) from the python side of the transpiler pass which are generated from DAGCircuit.collect_1q_runs() (which in the future should be moved to rust after Qiskit#12550 merges). The rust portion of the pass now iterates over each run, performs the matrix multiplication to compute the unitary of the run, then synthesizes that unitary, computes the estimated error of the circuit synthesis and returns a tuple of the circuit sequence in terms of rust StandardGate enums. The python portion of the code then takes those sequences and does inplace substitution of each run with the sequence returned from rust. Once Qiskit#12550 merges we should be able to move the input collect_1q_runs() call and perform the output node substitions in rust making the full pass execute in the rust domain without any python interaction. Additionally, the OneQubitEulerDecomposer class is updated to use rust for circuit generation instead of doing this python side. The internal changes done to use rust gates in the transpiler pass meant we were half way to this already by emitting rust StandardGates instead of python gate objects. The dag handling is still done in Python however until Qiskit#12550 merges. This also includes an implementation of the r gate, I temporarily added this to unblock this effort as it was the only gate missing needed to complete this. We can rebase this if a standalone implementation of the gate merges before this.
One or more of the following people are relevant to this code:
|
Pull Request Test Coverage Report for Build 9652482316Details
💛 - Coveralls |
qiskit/transpiler/passes/optimization/optimize_1q_decomposition.py
Outdated
Show resolved
Hide resolved
Previously this PR was re-computing the target bases to synthesize with for each run found in the circuit. But in cases where there were multiple runs repeated on a qubit this was unecessary work. Prior to moving this code to rust there was already caching code to make this optimization, but the rust path short circuited around this. This commit fixes this so we're caching the target bases for each qubit and only computing it once.
Pull Request Test Coverage Report for Build 9662366170Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
I took a quick attempt at implementing this with multithreading locally and I could get it to work but it required doing extra collection of all the data from the collect runs into a second vec to strip the |
I threw together a quick benchmark to test this: import statistics
import time
from qiskit.transpiler.passes import Optimize1qGatesDecomposition
from qiskit.circuit.random import random_circuit
from qiskit.converters import circuit_to_dag
seed = 42
n_qubits = 20
depth = 102400
qc = random_circuit(n_qubits, depth, measure=True, seed=seed)
dag = circuit_to_dag(qc)
basis_gates = ['sx', 'x', 'rz', 'cx', 'id']
opt_pass = Optimize1qGatesDecomposition(basis=basis_gates)
times = []
for i in range(10):
start = time.perf_counter()
opt_pass.run(dag)
stop = time.perf_counter()
times.append(stop - start)
print(f"Mean runtime: {statistics.geometric_mean(times)}") and it's showing a pretty noticeable speedup over Qiskit 1.1. With this PR I got:
and with Qiskit 1.1 I got:
Looking at a profile with this script the majority of the time with this PR is spent in collect 1q runs and dag interactions in python. |
Pull Request Test Coverage Report for Build 9666227832Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 9667036450Details
💛 - Coveralls |
To compare this against the current state of main I ran asv for the pass specific benchmarks:
|
Pull Request Test Coverage Report for Build 9678061360Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
This commit moves to use rust gates for the ConsolidateBlocks transpiler pass. Instead of generating the unitary matrices for the gates in a 2q block Python side and passing that list to a rust function this commit switches to passing a list of DAGOpNodes to the rust and then generating the matrices inside the rust function directly. This is similar to what was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster to get the matrix for standard gates, it also reduces the eager construction of Python gate objects which was a significant source of overhead after Qiskit#12459. To that end this builds on the thread of work in the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for other passes to minimize eager gate object construction.
Another note for later. This function
has return type PyResult<OneQubitGateSequence> . But as far as I can tell it is only called in one place, and that is in the same file. The returned value is unwrap ped with no check. In fact the return value is always ok . This is so the function can be called from Python. (maybe in the future, since I don't see this called anywhere) Reading and understanding the unwrap takes a bit of time.
I'd prefer that the Python-side function (if needed) wraps a function that returns This is a quick fix. But probably outside the scope of this PR. |
I left one more suggested simplification edit. Other than that, this looks good to go. |
Pull Request Test Coverage Report for Build 9769535246Details
💛 - Coveralls |
This commit moves to use rust gates for the ConsolidateBlocks transpiler pass. Instead of generating the unitary matrices for the gates in a 2q block Python side and passing that list to a rust function this commit switches to passing a list of DAGOpNodes to the rust and then generating the matrices inside the rust function directly. This is similar to what was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster to get the matrix for standard gates, it also reduces the eager construction of Python gate objects which was a significant source of overhead after Qiskit#12459. To that end this builds on the thread of work in the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for other passes to minimize eager gate object construction.
In the recently merged Qiskit#12650 a new rust function was added for the filter function of the collect_1q_runs() method's internal filter function. As part of this we need to do exclude any non-DAGOpNode dag nodes passed to the filter function. This was previously implemented using the pyo3 extract() method so that if the pyobject can be coerced into a DAGOpNode for later use we continue but if it errors we know the input object is not a DAGOpNode and return false. However, as we don't need to return the error to python we should be using downcast instead of extract (as per the pyo3 performance guide [1]) to avoid the error conversion cost. This was an oversight in Qiskit#12650 and I only used extract() out of habit. [1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
In the recently merged Qiskit#12650 a new rust function was added for the filter function of the collect_1q_runs() method's internal filter function. As part of this we need to do exclude any non-DAGOpNode dag nodes passed to the filter function. This was previously implemented using the pyo3 extract() method so that if the pyobject can be coerced into a DAGOpNode for later use we continue but if it errors we know the input object is not a DAGOpNode and return false. However, as we don't need to return the error to python we should be using downcast instead of extract (as per the pyo3 performance guide [1]) to avoid the error conversion cost. This was an oversight in Qiskit#12650 and I only used extract() out of habit. [1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
In the recently merged Qiskit#12650 a new rust function was added for the filter function of the collect_1q_runs() method's internal filter function. As part of this we need to do exclude any non-DAGOpNode dag nodes passed to the filter function. This was previously implemented using the pyo3 extract() method so that if the pyobject can be coerced into a DAGOpNode for later use we continue but if it errors we know the input object is not a DAGOpNode and return false. However, as we don't need to return the error to python we should be using downcast instead of extract (as per the pyo3 performance guide [1]) to avoid the error conversion cost. This was an oversight in Qiskit#12650 and I only used extract() out of habit. [1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
In the recently merged #12650 a new rust function was added for the filter function of the collect_1q_runs() method's internal filter function. As part of this we need to do exclude any non-DAGOpNode dag nodes passed to the filter function. This was previously implemented using the pyo3 extract() method so that if the pyobject can be coerced into a DAGOpNode for later use we continue but if it errors we know the input object is not a DAGOpNode and return false. However, as we don't need to return the error to python we should be using downcast instead of extract (as per the pyo3 performance guide [1]) to avoid the error conversion cost. This was an oversight in #12650 and I only used extract() out of habit. [1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
This commit moves to use rust gates for the ConsolidateBlocks transpiler pass. Instead of generating the unitary matrices for the gates in a 2q block Python side and passing that list to a rust function this commit switches to passing a list of DAGOpNodes to the rust and then generating the matrices inside the rust function directly. This is similar to what was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster to get the matrix for standard gates, it also reduces the eager construction of Python gate objects which was a significant source of overhead after Qiskit#12459. To that end this builds on the thread of work in the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for other passes to minimize eager gate object construction.
This commit moves to use rust gates for the ConsolidateBlocks transpiler pass. Instead of generating the unitary matrices for the gates in a 2q block Python side and passing that list to a rust function this commit switches to passing a list of DAGOpNodes to the rust and then generating the matrices inside the rust function directly. This is similar to what was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster to get the matrix for standard gates, it also reduces the eager construction of Python gate objects which was a significant source of overhead after Qiskit#12459. To that end this builds on the thread of work in the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for other passes to minimize eager gate object construction.
* Use rust gates for ConsolidateBlocks This commit moves to use rust gates for the ConsolidateBlocks transpiler pass. Instead of generating the unitary matrices for the gates in a 2q block Python side and passing that list to a rust function this commit switches to passing a list of DAGOpNodes to the rust and then generating the matrices inside the rust function directly. This is similar to what was done in #12650 for Optimize1qGatesDecomposition. Besides being faster to get the matrix for standard gates, it also reduces the eager construction of Python gate objects which was a significant source of overhead after #12459. To that end this builds on the thread of work in the two PRs #12692 and #12701 which changed the access patterns for other passes to minimize eager gate object construction. * Add rust filter function for DAGCircuit.collect_2q_runs() * Update crates/accelerate/src/convert_2q_block_matrix.rs --------- Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com>
This commit builds off of what Qiskit#12650 did for the 1q decomposer and moves to using rust gates for the 2q decomposer too. This means that the circuit sequence generation is using rust's StandardGate representation directly instead of relying on mapping strings. For places where circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or or `TwoQubitBasisDecomposer.__call__` without the `use_dag()` flag) the entire circuit is generated in Rust and returned to Python.
This commit builds off of what Qiskit#12650 did for the 1q decomposer and moves to using rust gates for the 2q decomposer too. This means that the circuit sequence generation is using rust's StandardGate representation directly instead of relying on mapping strings. For places where circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or or `TwoQubitBasisDecomposer.__call__` without the `use_dag` flag) the entire circuit is generated in Rust and returned to Python.
* Use Rust gates for 2q unitary synthesis This commit builds off of what #12650 did for the 1q decomposer and moves to using rust gates for the 2q decomposer too. This means that the circuit sequence generation is using rust's StandardGate representation directly instead of relying on mapping strings. For places where circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or or `TwoQubitBasisDecomposer.__call__` without the `use_dag` flag) the entire circuit is generated in Rust and returned to Python. * Run cargo fmt and black post rebase
* Use Rust gates for 2q unitary synthesis This commit builds off of what Qiskit#12650 did for the 1q decomposer and moves to using rust gates for the 2q decomposer too. This means that the circuit sequence generation is using rust's StandardGate representation directly instead of relying on mapping strings. For places where circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or or `TwoQubitBasisDecomposer.__call__` without the `use_dag` flag) the entire circuit is generated in Rust and returned to Python. * Run cargo fmt and black post rebase
* Use rust gates for Optimize1QGatesDecomposition This commit moves to using rust gates for the Optimize1QGatesDecomposition transpiler pass. It takes in a sequence of runs (which are a list of DAGOpNodes) from the python side of the transpiler pass which are generated from DAGCircuit.collect_1q_runs() (which in the future should be moved to rust after Qiskit#12550 merges). The rust portion of the pass now iterates over each run, performs the matrix multiplication to compute the unitary of the run, then synthesizes that unitary, computes the estimated error of the circuit synthesis and returns a tuple of the circuit sequence in terms of rust StandardGate enums. The python portion of the code then takes those sequences and does inplace substitution of each run with the sequence returned from rust. Once Qiskit#12550 merges we should be able to move the input collect_1q_runs() call and perform the output node substitions in rust making the full pass execute in the rust domain without any python interaction. Additionally, the OneQubitEulerDecomposer class is updated to use rust for circuit generation instead of doing this python side. The internal changes done to use rust gates in the transpiler pass meant we were half way to this already by emitting rust StandardGates instead of python gate objects. The dag handling is still done in Python however until Qiskit#12550 merges. This also includes an implementation of the r gate, I temporarily added this to unblock this effort as it was the only gate missing needed to complete this. We can rebase this if a standalone implementation of the gate merges before this. * Cache target decompositions for each qubit Previously this PR was re-computing the target bases to synthesize with for each run found in the circuit. But in cases where there were multiple runs repeated on a qubit this was unecessary work. Prior to moving this code to rust there was already caching code to make this optimization, but the rust path short circuited around this. This commit fixes this so we're caching the target bases for each qubit and only computing it once. * Optimize rust implementation slightly * Avoid extra allocations by inlining matrix multiplication * Remove unnecessary comment * Remove stray code block * Add import path for rust gate * Use rust gate in circuit constructor * Remove duplicated op_name getter and just use existing name getter * Apply suggestions from code review Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com> * Simplify construction of target_basis_vec * Fix rebase issue * Update crates/accelerate/src/euler_one_qubit_decomposer.rs Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com> * Update crates/accelerate/src/euler_one_qubit_decomposer.rs --------- Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com>
In the recently merged Qiskit#12650 a new rust function was added for the filter function of the collect_1q_runs() method's internal filter function. As part of this we need to do exclude any non-DAGOpNode dag nodes passed to the filter function. This was previously implemented using the pyo3 extract() method so that if the pyobject can be coerced into a DAGOpNode for later use we continue but if it errors we know the input object is not a DAGOpNode and return false. However, as we don't need to return the error to python we should be using downcast instead of extract (as per the pyo3 performance guide [1]) to avoid the error conversion cost. This was an oversight in Qiskit#12650 and I only used extract() out of habit. [1] https://pyo3.rs/v0.22.0/performance#extract-versus-downcast
* Use rust gates for ConsolidateBlocks This commit moves to use rust gates for the ConsolidateBlocks transpiler pass. Instead of generating the unitary matrices for the gates in a 2q block Python side and passing that list to a rust function this commit switches to passing a list of DAGOpNodes to the rust and then generating the matrices inside the rust function directly. This is similar to what was done in Qiskit#12650 for Optimize1qGatesDecomposition. Besides being faster to get the matrix for standard gates, it also reduces the eager construction of Python gate objects which was a significant source of overhead after Qiskit#12459. To that end this builds on the thread of work in the two PRs Qiskit#12692 and Qiskit#12701 which changed the access patterns for other passes to minimize eager gate object construction. * Add rust filter function for DAGCircuit.collect_2q_runs() * Update crates/accelerate/src/convert_2q_block_matrix.rs --------- Co-authored-by: John Lapeyre <jlapeyre@users.noreply.github.com>
* Use Rust gates for 2q unitary synthesis This commit builds off of what Qiskit#12650 did for the 1q decomposer and moves to using rust gates for the 2q decomposer too. This means that the circuit sequence generation is using rust's StandardGate representation directly instead of relying on mapping strings. For places where circuits are generated (calling `TwoQubitWeylDecomposition.circuit()` or or `TwoQubitBasisDecomposer.__call__` without the `use_dag` flag) the entire circuit is generated in Rust and returned to Python. * Run cargo fmt and black post rebase
Summary
This commit moves to using rust gates for the Optimize1QGatesDecomposition transpiler pass. It takes in a sequence of runs (which are a list of DAGOpNodes) from the python side of the transpiler pass which are generated from DAGCircuit.collect_1q_runs() (which in the future should be moved to rust after #12550 merges). The rust portion of the pass now iterates over each run, performs the matrix multiplication to compute the unitary of the run, then synthesizes that unitary, computes the estimated error of the circuit synthesis and returns a tuple of the circuit sequence in terms of rust StandardGate enums. The python portion of the code then takes those sequences and does inplace substitution of each run with the sequence returned from rust.
Once #12550 merges we should be able to move the input collect_1q_runs() call and perform the output node substitions in rust making the full pass execute in the rust domain without any python interaction.
Additionally, the OneQubitEulerDecomposer class is updated to use rust for circuit generation instead of doing this python side. The internal changes done to use rust gates in the transpiler pass meant we were half way to this already by emitting rust StandardGates instead of python gate objects. The dag handling is still done in Python however until #12550 merges.
This also includes an implementation of the r gate, I temporarily added this to unblock this effort as it was the only gate missing needed to complete this. We can rebase this if a standalone implementation of the gate merges before this.Rebased on top of: #12662Details and comments
TODO:
DAGCircuit
to Rust #12550 merges so we can do a parallel bridge iterator over runs and avoid having to collect runs generated in python