Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parallel synthesis interface to default unitary synthesis plugin #12007

Open
Tracked by #8774
mtreinish opened this issue Mar 13, 2024 · 1 comment · May be fixed by #13419
Open
Tracked by #8774

Add parallel synthesis interface to default unitary synthesis plugin #12007

mtreinish opened this issue Mar 13, 2024 · 1 comment · May be fixed by #13419
Labels
performance Rust This PR or issue is related to Rust code in the repository synthesis

Comments

@mtreinish
Copy link
Member

mtreinish commented Mar 13, 2024

Once we have #12004 (and also for #12005 too) implemented we can better leverage available CPU resources better by running synthesis in parallel using a rayon iterator. There is no data dependency between doing unitary synthesis for different unitary matrices, it should be fairly straightforward to have a function that takes in a list of unitary matrices with their target qubits and also dict/hashmap of target qubits to the appropriate decomposer class and then have the function return a list of circuit sequences which is computed using a multithreaded rayon iterator. This can be connected #12006 to have the default synthesis plugin prepare a batch of all the 2 qubit unitary matrices to exploit this function.

@mtreinish mtreinish added performance synthesis Rust This PR or issue is related to Rust code in the repository labels Mar 13, 2024
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Mar 14, 2024
This commit adds a new inner rust function that performs 1q euler
decomposition in multiple threads to attempt to improve the performance
of the synthesis. The numerical component of the 1q synthesis has no
data dependency when run with multiple unitary matrices and we can
potentially better leverage multiple CPUs by performing the calculations
in parallel. We are still required to serially manipulate the DAGCircuit
though which will potentially limit the efficacy of this change. Longer
term if we move the core structural representation of DAGCircuit into
qiskit._accelerate we can potentially also parallelize the dag
manipulation (albeit with some locking).

This is the 1q component of Qiskit#12007, we'll need to add similar interfaces
for TwoQubitBasisDecomposer, after Qiskit#12004 is implemented, and
XXDecomposer, after Qiskit#12005 is complete. The benefits of this are
expected to be much smaller compared to the 2q decomposers because the
computational effort in the numerical component of 1q euler
decomposition is very small.
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Mar 21, 2024
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Mar 26, 2024
This commit adds a new inner rust function that performs 1q euler
decomposition in multiple threads to attempt to improve the performance
of the synthesis. The numerical component of the 1q synthesis has no
data dependency when run with multiple unitary matrices and we can
potentially better leverage multiple CPUs by performing the calculations
in parallel. We are still required to serially manipulate the DAGCircuit
though which will potentially limit the efficacy of this change. Longer
term if we move the core structural representation of DAGCircuit into
qiskit._accelerate we can potentially also parallelize the dag
manipulation (albeit with some locking).

This is the 1q component of Qiskit#12007, we'll need to add similar interfaces
for TwoQubitBasisDecomposer, after Qiskit#12004 is implemented, and
XXDecomposer, after Qiskit#12005 is complete. The benefits of this are
expected to be much smaller compared to the 2q decomposers because the
computational effort in the numerical component of 1q euler
decomposition is very small.
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Mar 28, 2024
This commit adds a new transpiler pass for 2q peephole optimization that is
designed to replace the use of `Collect2qBlocks`, `ConsolidateBlocks`,
and `UnitarySynthesis` in the optimization loop of the transpiler with a
new optimized pass Optimize2qBlocks that performs the same basic
functionality. The goal of this new pass is to be more efficient in
runtime and also enable better quality output. The runtime improvements
are achieved by only crossing the python<->rust boundary once and doing
all the heavy lifting in rust and then just returning a list of circuit
sequences for all 2q blocks and then performing inline substitution for
all of those circuits. The actual computation is then potentially
executed in parallel using rust multithreading. The potential quality
improvement is caused by changing the decomposition selection to be
based on projected error rates instead of an estimated number of 2q
basis gates from the decomposition. In the previous triplet we skipped
synthesis if the estimated number of 2q gates from the default
decomposer was greater than or equal to the 2q gates in the block which
was an attempt to estimate the error rate. In this new pass we compare
the estimated fidelity of all the provided synthesis methods and select
the lowest noise decomposition.

Fixes: Qiskit#11659
Fixes: Qiskit#12007
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Mar 28, 2024
This commit adds a new transpiler pass for 2q peephole optimization that is
designed to replace the use of `Collect2qBlocks`, `ConsolidateBlocks`,
and `UnitarySynthesis` in the optimization loop of the transpiler with a
new optimized pass Optimize2qBlocks that performs the same basic
functionality. The goal of this new pass is to be more efficient in
runtime and also enable better quality output. The runtime improvements
are achieved by only crossing the python<->rust boundary once and doing
all the heavy lifting in rust and then just returning a list of circuit
sequences for all 2q blocks and then performing inline substitution for
all of those circuits. The actual computation is then potentially
executed in parallel using rust multithreading. The potential quality
improvement is caused by changing the decomposition selection to be
based on projected error rates instead of an estimated number of 2q
basis gates from the decomposition. In the previous triplet we skipped
synthesis if the estimated number of 2q gates from the default
decomposer was greater than or equal to the 2q gates in the block which
was an attempt to estimate the error rate. In this new pass we compare
the estimated fidelity of all the provided synthesis methods and select
the lowest noise decomposition.

Fixes: Qiskit#11659
Fixes: Qiskit#12007
@mtreinish
Copy link
Member Author

I think this might be superseded by the suggestions in #11659. If we build out a new transpiler pass to do efficient 2q peephole unitary resynthesis we should do what this is proposing. The only case where I think this wouldn't be the case is if we ran UnitarySynthesis for basis translation or unrolling >=3q unitaries. So I'll keep this open just in case we want the parallel interface even though the primary utility I envisioned was for peephole optimization.

mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Sep 18, 2024
This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substituion. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see Qiskit#11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers

Right now this commit has a number of limitations, the largest are:

- Doesn't support builds with the py-cache feature (`OnceCell` for the
  cache can't be used across threads)
- Only supports the target
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer,
  because it's the only one in rust currently.

For plugin handling I left the logic as running the three pass series,
but I'm not sure this is the behavior we want. We could say keep the
synthesis plugins for `UnitarySynthesis` only and then rely on our
built-in methods for physical optimiztion only. But this also seems less
than ideal because the plugin mechanism is how we support synthesizing
to custom basis gates, and also more advanced approximate synthesis
methods. Both of those are things we need to do as part of the synthesis
here.

Additionally, this is currently missing tests and documentation and while
running it manually "works" as in it returns a circuit that looks valid,
I've not done any validation yet. This also likely will need several
rounds of performance optimization and tuning. t this point this is
just a rough proof of concept and will need a lof refinement along with
larger changes to Qiskit's rust code before this is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Oct 24, 2024
This commit ports the consolidate blocks pass to rust. The logic remains
the same and this is just a straight porting. One optimization is that
to remove the amount of python processing the Collect2qBlocks pass is no
longer run as part of the preset pass managers and this is called
directly in rust. This speeds up the pass because it avoids 3 crossing
of the language boundary and also the intermediate creation of DAGNode
python objects. The pass still supports running with Collect2qBlocks for
backwards compatibility and will skip running the pass equivalent
internally the field is present in the property set.

There are potential improvements that can be investigated here such as
avoiding in place dag contraction and moving to rebuilding the dag
iteratively. Also changing the logic around estimated error (see Qiskit#11659)
to be more robust. But these can be left for follow up PRs as they
change the logic.

Realistically we should look at combining ConsolidateBlocks for it's
current two usages with Split2qUnitaries and UnitarySynthesis into
those passes for more efficiency. We can improve the performance and
logic as part of that refactor. See Qiskit#12007 for more details on this
for UnitarySynthesis.

Closes Qiskit#12250
github-merge-queue bot pushed a commit that referenced this issue Nov 6, 2024
* Oxidize the ConsolidateBlocks pass

This commit ports the consolidate blocks pass to rust. The logic remains
the same and this is just a straight porting. One optimization is that
to remove the amount of python processing the Collect2qBlocks pass is no
longer run as part of the preset pass managers and this is called
directly in rust. This speeds up the pass because it avoids 3 crossing
of the language boundary and also the intermediate creation of DAGNode
python objects. The pass still supports running with Collect2qBlocks for
backwards compatibility and will skip running the pass equivalent
internally the field is present in the property set.

There are potential improvements that can be investigated here such as
avoiding in place dag contraction and moving to rebuilding the dag
iteratively. Also changing the logic around estimated error (see #11659)
to be more robust. But these can be left for follow up PRs as they
change the logic.

Realistically we should look at combining ConsolidateBlocks for it's
current two usages with Split2qUnitaries and UnitarySynthesis into
those passes for more efficiency. We can improve the performance and
logic as part of that refactor. See #12007 for more details on this
for UnitarySynthesis.

Closes #12250

* Update test to count consolidate_blocks instead of collect_2q_blocks

* Fix lint

* Fix solovay kitaev test

* Add release note

* Restore 2q block collection for synthesis translation plugin

* Add rust native substitute method

* Fix final test failures

* Remove release note and test change

The test failure fixed by a test change was incorrect and masked a logic
bug that was fixed in a subsequent commit. This commit reverts that
change to the test and removes the release note attempting to document a
fix for a bug that only existed during development of this PR.

* Fix comment leftover from rust-analyzer

* Remove unused code

* Simplify control flow handling

* Remove unnecessary clone from substitute_node

* Preallocate block op names in replace_block_with_py_op

* Remove more unused imports

* Optimize linalg in block collection

This commit reworks the logic to reduce the number of Kronecker products
and 2q matrix multiplications we do as part of computing the unitary of
the block. It now computes the 1q components individually with 1q matrix
multiplications and only calls kron() and a 2q matmul when a 2q gate is
encountered. This reduces the number of more expensive operations we
need to perform and replaces them with a much faster 1q matmul.

* Use static one qubit identity matrix

* Remove unnecessary lifetime annotations

* Add missing docstring to new rust method

* Apply suggestions from code review

Co-authored-by: Kevin Hartman <kevin@hart.mn>

* Fix lint

* Add comment for MAX_2Q_DEPTH constant

* Reuse block_qargs for each block

Co-authored-by: Henry Zou <87874865+henryzou50@users.noreply.github.com>

---------

Co-authored-by: Kevin Hartman <kevin@hart.mn>
Co-authored-by: Henry Zou <87874865+henryzou50@users.noreply.github.com>
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Nov 10, 2024
This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substituion. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see Qiskit#11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers

Right now this commit has a number of limitations, the largest are:

- Only supports the target
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer,
  because it's the only one in rust currently.

For plugin handling I left the logic as running the three pass series,
but I'm not sure this is the behavior we want. We could say keep the
synthesis plugins for `UnitarySynthesis` only and then rely on our
built-in methods for physical optimiztion only. But this also seems less
than ideal because the plugin mechanism is how we support synthesizing
to custom basis gates, and also more advanced approximate synthesis
methods. Both of those are things we need to do as part of the synthesis
here.

Additionally, this is currently missing tests and documentation and while
running it manually "works" as in it returns a circuit that looks valid,
I've not done any validation yet. This also likely will need several
rounds of performance optimization and tuning. t this point this is
just a rough proof of concept and will need a lof refinement along with
larger changes to Qiskit's rust code before this is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Nov 10, 2024
This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substituion. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see Qiskit#11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers

Right now this commit has a number of limitations, the largest are:

- Only supports the target
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer,
  because it's the only one in rust currently.

For plugin handling I left the logic as running the three pass series,
but I'm not sure this is the behavior we want. We could say keep the
synthesis plugins for `UnitarySynthesis` only and then rely on our
built-in methods for physical optimiztion only. But this also seems less
than ideal because the plugin mechanism is how we support synthesizing
to custom basis gates, and also more advanced approximate synthesis
methods. Both of those are things we need to do as part of the synthesis
here.

Additionally, this is currently missing tests and documentation and while
running it manually "works" as in it returns a circuit that looks valid,
I've not done any validation yet. This also likely will need several
rounds of performance optimization and tuning. t this point this is
just a rough proof of concept and will need a lof refinement along with
larger changes to Qiskit's rust code before this is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
@mtreinish mtreinish linked a pull request Nov 10, 2024 that will close this issue
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Rust This PR or issue is related to Rust code in the repository synthesis
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant