Oxidize the ConsolidateBlocks pass #13368

mtreinish · 2024-10-24T16:30:54Z

Summary

This commit ports the ConsolidateBlocks pass to Rust. The logic is unchanged; this is a straight port. One optimization: to reduce the amount of Python processing, the Collect2qBlocks pass is no longer run as part of the preset pass managers, and the equivalent block collection is instead called directly from Rust. This speeds up the pass because it avoids three crossings of the language boundary and the creation of intermediate DAGNode Python objects. For backwards compatibility the pass still supports running after Collect2qBlocks, and it skips the equivalent internal collection if Collect2qBlocks' output field is present in the property set.
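The fallback described above (reuse Collect2qBlocks' output if it exists, otherwise collect in Rust) can be sketched roughly as follows. This is a hypothetical model, not the PR's code: Block, blocks_to_consolidate and the closure parameter are illustrative names, and the property-set lookup is reduced to an Option.

```rust
/// Hypothetical stand-in for one collected 2q block (DAG node indices).
type Block = Vec<usize>;

/// Choose the blocks to consolidate (hypothetical helper, not the PR's API).
///
/// `property_set_blocks` models the list a preceding Collect2qBlocks run left
/// in the property set, or `None` if that pass was not run.
/// `collect_internally` models the Rust-side collection; it is only called
/// when no precomputed list exists, so the default path creates no Python
/// DAGNode objects.
fn blocks_to_consolidate<F>(property_set_blocks: Option<Vec<Block>>, collect_internally: F) -> Vec<Block>
where
    F: FnOnce() -> Vec<Block>,
{
    match property_set_blocks {
        // Backwards-compatible path: reuse what Collect2qBlocks already found.
        Some(blocks) => blocks,
        // New default path: collect the 2q runs directly in Rust.
        None => collect_internally(),
    }
}
```

The match is equivalent to property_set_blocks.unwrap_or_else(collect_internally); passing the collector as a closure guarantees it never runs when the property-set entry already exists.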

There are potential improvements that could be investigated here, such as avoiding in-place DAG contraction and instead rebuilding the DAG iteratively, or making the logic around estimated error (see #11659) more robust. These can be left for follow-up PRs because they change the logic.

Realistically, we should look at merging ConsolidateBlocks into the two passes it is currently used with, Split2qUnitaries and UnitarySynthesis, for more efficiency. We can improve the performance and logic as part of that refactor. See #12007 for more details on this for UnitarySynthesis.

Details and comments

Closes #12250

TODO:

- Fix test failures
- Add release note for support for running without Collect2qBlocks
- Benchmark and tune

qiskit-bot · 2024-10-24T16:30:59Z

One or more of the following people are relevant to this code:

@Qiskit/terra-core
@levbishop

mtreinish · 2024-10-24T16:57:38Z

While there are still 4 tests to fix here, I did a quick asv run to get a feel for the speedup so far and it yielded:

Benchmarks that have improved:

| Change   | Before [2284f192] <consolidate-blocks~1>   | After [ed2b41b5] <consolidate-blocks>   |   Ratio | Benchmark (Parameter)                                                                           |
|----------|--------------------------------------------|-----------------------------------------|---------|-------------------------------------------------------------------------------------------------|
| -        | 1.77±0s                                    | 1.60±0.01s                              |    0.91 | utility_scale.UtilityScaleBenchmarks.time_qft('cz')                                             |
| -        | 530±4ms                                    | 474±5ms                                 |    0.89 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)            |
| -        | 1.77±0.01s                                 | 1.58±0.01s                              |    0.89 | utility_scale.UtilityScaleBenchmarks.time_qft('ecr')                                            |
| -        | 50.1±0.5ms                                 | 44.0±0.3ms                              |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(2)                        |
| -        | 75.5±0.6ms                                 | 66.5±0.4ms                              |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)                        |
| -        | 312±2ms                                    | 271±2ms                                 |    0.87 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(2)            |
| -        | 416±2ms                                    | 359±4ms                                 |    0.86 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cx')                                            |
| -        | 27.0±0.3ms                                 | 22.6±0.3ms                              |    0.84 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(2) |
| -        | 28.2±0.5ms                                 | 23.5±0.4ms                              |    0.83 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3) |
| -        | 1.28±0.01s                                 | 1.06±0.01s                              |    0.83 | utility_scale.UtilityScaleBenchmarks.time_qv('cx')                                              |
| -        | 223±4ms                                    | 185±2ms                                 |    0.83 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cx')                               |
| -        | 649±3ms                                    | 531±8ms                                 |    0.82 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cz')                                            |
| -        | 276±0.5ms                                  | 227±3ms                                 |    0.82 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cz')                               |
| -        | 279±0.9ms                                  | 228±3ms                                 |    0.82 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('ecr')                              |
| -        | 22.3±0.3ms                                 | 18.0±0.2ms                              |    0.81 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)                   |
| -        | 637±4ms                                    | 512±3ms                                 |    0.8  | utility_scale.UtilityScaleBenchmarks.time_qaoa('ecr')                                           |
| -        | 1.74±0.01s                                 | 1.39±0.01s                              |    0.8  | utility_scale.UtilityScaleBenchmarks.time_qv('cz')                                              |
| -        | 1.67±0.01s                                 | 1.33±0.01s                              |    0.8  | utility_scale.UtilityScaleBenchmarks.time_qv('ecr')                                             |
| -        | 21.5±0.5ms                                 | 17.1±0.3ms                              |    0.79 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(2)                   |
| -        | 11.0±0.2ms                                 | 4.24±0.04ms                             |    0.38 | passes.Collect2QPassBenchmarks.time_consolidate_blocks(5, 1024)                                 |
| -        | 29.8±0.09ms                                | 10.2±0.04ms                             |    0.34 | passes.Collect2QPassBenchmarks.time_consolidate_blocks(14, 1024)                                |

Benchmarks that have stayed the same:

| Change   | Before [2284f192] <consolidate-blocks~1>   | After [ed2b41b5] <consolidate-blocks>   | Ratio   | Benchmark (Parameter)                                                                                  |
|----------|--------------------------------------------|-----------------------------------------|---------|--------------------------------------------------------------------------------------------------------|
|          | failed                                     | failed                                  | n/a     | passes.Collect2QPassBenchmarks.time_consolidate_blocks(20, 1024)                                       |
|          | 0                                          | 0                                       | n/a     | utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('cx')                                          |
|          | 0                                          | 0                                       | n/a     | utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('cz')                                          |
|          | 0                                          | 0                                       | n/a     | utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('ecr')                                         |
|          | 3.72±0.02s                                 | 3.90±0.2s                               | 1.05    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cx')                                                |
|          | 9.34±0.06ms                                | 9.74±0.05ms                             | 1.04    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cx')                                                 |
|          | 9.41±0.08ms                                | 9.83±0.2ms                              | 1.04    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cz')                                                 |
|          | 70.9±0.7ms                                 | 72.8±0.3ms                              | 1.03    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(1)                                |
|          | 9.62±0.06ms                                | 9.92±0.1ms                              | 1.03    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('cz')                                        |
|          | 102±1ms                                    | 105±0.4ms                               | 1.03    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('ecr')                                        |
|          | 397                                        | 407                                     | 1.03    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('cz')                                          |
|          | 397                                        | 407                                     | 1.03    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('ecr')                                         |
|          | 73.4±0.5ms                                 | 75.0±0.6ms                              | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(0)                                |
|          | 35.7±0.3ms                                 | 36.5±0.6ms                              | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(1)                               |
|          | 9.45±0.06ms                                | 9.66±0.02ms                             | 1.02    | utility_scale.UtilityScaleBenchmarks.time_bvlike('ecr')                                                |
|          | 9.70±0.02ms                                | 9.88±0.08ms                             | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('cx')                                        |
|          | 103±0.5ms                                  | 105±0.3ms                               | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('cz')                                         |
|          | 33.5±0.2ms                                 | 34.2±0.2ms                              | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cx')                           |
|          | 33.7±0.3ms                                 | 34.2±0.3ms                              | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cz')                           |
|          | 33.5±0.2ms                                 | 34.2±0.1ms                              | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('ecr')                          |
|          | 35.5±0.1ms                                 | 36.0±0.9ms                              | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(1)                          |
|          | 29.5±0.1ms                                 | 29.8±0.2ms                              | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(0)                               |
|          | 9.73±0.07ms                                | 9.82±0.08ms                             | 1.01    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('ecr')                                       |
|          | 103±0.7ms                                  | 104±0.2ms                               | 1.01    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('cx')                                         |
|          | 185±2ms                                    | 185±0.4ms                               | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(0)                   |
|          | 43.0±0.4ms                                 | 43.0±0.3ms                              | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(0)        |
|          | 1404                                       | 1404                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(0)            |
|          | 1403                                       | 1403                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(1)            |
|          | 1323                                       | 1323                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(2)            |
|          | 1296                                       | 1296                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(3)            |
|          | 2705                                       | 2705                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(0)                   |
|          | 2005                                       | 2005                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(1)                   |
|          | 7                                          | 7                                       | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(2)                   |
|          | 7                                          | 7                                       | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(3)                   |
|          | 2705                                       | 2705                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(0) |
|          | 2005                                       | 2005                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(1) |
|          | 7                                          | 7                                       | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(2) |
|          | 7                                          | 7                                       | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(3) |
|          | 465                                        | 465                                     | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(0)                        |
|          | 336                                        | 336                                     | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(1)                        |
|          | 327                                        | 327                                     | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(2)                        |
|          | 272                                        | 272                                     | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(3)                        |
|          | 395                                        | 395                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('cx')                                          |
|          | 300                                        | 300                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('cx')                                         |
|          | 300                                        | 300                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('cz')                                         |
|          | 300                                        | 300                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('ecr')                                        |
|          | 1607                                       | 1607                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('cx')                                            |
|          | 1622                                       | 1622                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('cz')                                            |
|          | 1622                                       | 1622                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('ecr')                                           |
|          | 1954                                       | 1954                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('cx')                                             |
|          | 1954                                       | 1954                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('cz')                                             |
|          | 1954                                       | 1954                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('ecr')                                            |
|          | 2709                                       | 2709                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('cx')                                              |
|          | 2709                                       | 2709                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('cz')                                              |
|          | 2709                                       | 2709                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('ecr')                                             |
|          | 462                                        | 462                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('cx')                               |
|          | 462                                        | 462                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('cz')                               |
|          | 462                                        | 462                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('ecr')                              |
|          | 192±0.7ms                                  | 192±2ms                                 | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(1)                   |
|          | 32.2±0.3ms                                 | 31.8±0.4ms                              | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(0)                          |
|          | 47.6±0.09ms                                | 47.3±0.5ms                              | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(1)        |
|          | 3.88±0.06s                                 | 3.75±0.01s                              | 0.97    | utility_scale.UtilityScaleBenchmarks.time_circSU2('ecr')                                               |
|          | 168±0.9ms                                  | 158±0.9ms                               | 0.94    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cx')                                                 |
|          | 179±2ms                                    | 168±0.9ms                               | 0.94    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cz')                                                 |
|          | 179±1ms                                    | 168±0.7ms                               | 0.94    | utility_scale.UtilityScaleBenchmarks.time_bv_100('ecr')                                                |
|          | 3.93±0.06s                                 | 3.71±0.1s                               | 0.94    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cz')                                                |
|          | 1.41±0.01s                                 | 1.32±0s                                 | 0.94    | utility_scale.UtilityScaleBenchmarks.time_qft('cx')                                                    |

In general, there is only so much we'll be able to do for performance here because we'll be bottlenecked on DAG manipulation and UnitaryGate object creation. I think we can work on fixing those separately in follow-ups. The DAG manipulation overhead could be removed with something like bd43c51 (and an equivalent for Split2qUnitaries), where we rebuild the DAG instead of doing in-place substitution.

mtreinish · 2024-10-24T21:39:05Z

This should be good to review now. There might be some benchmarking/profiling and tuning we want to do, but it's not a blocker.

A test failure that had been "fixed" by changing the test was actually masking a logic bug, which was fixed in a subsequent commit. This commit reverts that test change and removes the release note that attempted to document a fix for a bug that only existed during development of this PR.

coveralls · 2024-10-24T22:09:20Z

Pull Request Test Coverage Report for Build 11693897124

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

535 of 571 (93.7%) changed or added relevant lines in 9 files are covered.
134 unchanged lines in 13 files lost coverage.
Overall coverage increased (+0.05%) to 88.816%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| crates/accelerate/src/convert_2q_block_matrix.rs | 67 | 72 | 93.06% |
| crates/accelerate/src/consolidate_blocks.rs | 239 | 248 | 96.37% |
| crates/circuit/src/dag_circuit.rs | 199 | 221 | 90.05% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| qiskit/transpiler/passes/optimization/consolidate_blocks.py | 1 | 96.15% |
| crates/accelerate/src/target_transpiler/mod.rs | 1 | 82.64% |
| qiskit/circuit/library/generalized_gates/gms.py | 2 | 94.44% |
| qiskit/circuit/library/generalized_gates/rv.py | 2 | 84.62% |
| crates/qasm2/src/lex.rs | 2 | 92.48% |
| qiskit/circuit/library/generalized_gates/permutation.py | 3 | 92.73% |
| qiskit/circuit/library/generalized_gates/diagonal.py | 3 | 95.16% |
| crates/qasm2/src/parse.rs | 6 | 97.62% |
| qiskit/circuit/library/grover_operator.py | 9 | 92.86% |
| qiskit/circuit/library/generalized_gates/linear_function.py | 9 | 84.87% |

Totals
Change from base Build 11683622570: 0.05%
Covered Lines: 77266
Relevant Lines: 86996

💛 - Coveralls

mtreinish · 2024-10-25T10:46:14Z

After the most recent round of changes the overall benchmarking results look like:

Benchmarks that have improved:                      

| Change   | Before [f2e07bc5] <main>               | After [a4229901] <consolidate-blocks>   |   Ratio | Benchmark (Parameter)                                                                           |
|----------|----------------------------------------|-----------------------------------------|---------|-------------------------------------------------------------------------------------------------|      
| -        | 1.77±0.01s                             | 1.57±0s                                 |    0.89 | utility_scale.UtilityScaleBenchmarks.time_qft('cz')                                             |      
| -        | 310±2ms                                | 272±1ms                                 |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(2)            |      
| -        | 49.8±0.7ms                             | 43.8±0.6ms                              |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(2)                        |      
| -        | 1.77±0.01s                             | 1.56±0.01s                              |    0.88 | utility_scale.UtilityScaleBenchmarks.time_qft('ecr')                                            |      
| -        | 535±1ms                                | 463±3ms                                 |    0.87 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)            |      
| -        | 27.8±1ms                               | 23.4±0.4ms                              |    0.84 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(2) |      
| -        | 76.9±2ms                               | 64.6±2ms                                |    0.84 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)                        |      
| -        | 1.26±0.01s                             | 1.07±0.01s                              |    0.84 | utility_scale.UtilityScaleBenchmarks.time_qv('cx')                                              |      
| -        | 28.2±2ms                               | 23.4±0.3ms                              |    0.83 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3) |      
| -        | 417±3ms                                | 347±0.9ms                               |    0.83 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cx')                                            |      
| -        | 222±2ms                                | 183±3ms                                 |    0.83 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cx')                               |      
| -        | 279±4ms                                | 229±3ms                                 |    0.82 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('ecr')                              |      
| -        | 1.72±0.01s                             | 1.38±0.02s                              |    0.81 | utility_scale.UtilityScaleBenchmarks.time_qv('cz')                                              |      
| -        | 276±2ms                                | 224±3ms                                 |    0.81 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cz')                               |      
| -        | 1.66±0.02s                             | 1.34±0.01s                              |    0.8  | utility_scale.UtilityScaleBenchmarks.time_qv('ecr')                                             |      
| -        | 21.1±0.2ms                             | 16.8±0.1ms                              |    0.79 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(2)                   |      
| -        | 643±4ms                                | 508±2ms                                 |    0.79 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cz')                                            |      
| -        | 635±5ms                                | 500±9ms                                 |    0.79 | utility_scale.UtilityScaleBenchmarks.time_qaoa('ecr')                                           |
| -        | 24.0±0.7ms                             | 18.2±0.2ms                              |    0.76 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)                   |
| -        | 11.3±0.2ms                             | 3.68±0.02ms                             |    0.33 | passes.Collect2QPassBenchmarks.time_consolidate_blocks(5, 1024)                                 |
| -        | 30.5±0.2ms                             | 9.90±0.04ms                             |    0.32 | passes.Collect2QPassBenchmarks.time_consolidate_blocks(14, 1024)                                |

Benchmarks that have stayed the same:

| Change   | Before [f2e07bc5] <main>               | After [a4229901] <consolidate-blocks>   | Ratio   | Benchmark (Parameter)                                                                                  |
|----------|----------------------------------------|-----------------------------------------|---------|--------------------------------------------------------------------------------------------------------|
|          | failed                                 | failed                                  | n/a     | passes.Collect2QPassBenchmarks.time_consolidate_blocks(20, 1024)                                       |
|          | 3.71±0.02s                             | 3.79±0.1s                               | 1.02    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cx')                                                |
|          | 9.18±0.1ms                             | 9.27±0.1ms                              | 1.01    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cx')                                                 |
|          | 9.21±0.05ms                            | 9.24±0.05ms                             | 1.00    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cz')                                                 |
|          | 9.23±0.06ms                            | 9.25±0.07ms                             | 1.00    | utility_scale.UtilityScaleBenchmarks.time_bvlike('ecr')                                                |
|          | 4.00±0.05s                             | 3.92±0.07s                              | 0.98    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cz')                                                |
|          | 35.0±1ms                               | 34.2±0.2ms                              | 0.98    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cz')                           |
|          | 35.0±1ms                               | 34.2±0.3ms                              | 0.98    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('ecr')                          |
|          | 3.71±0.1s                              | 3.61±0.2s                               | 0.97    | utility_scale.UtilityScaleBenchmarks.time_circSU2('ecr')                                               |
|          | 166±2ms                                | 155±2ms                                 | 0.93    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cx')                                                 |
|          | 177±1ms                                | 165±0.8ms                               | 0.93    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cz')                                                 |
|          | 1.40±0.01s                             | 1.31±0s                                 | 0.93    | utility_scale.UtilityScaleBenchmarks.time_qft('cx')                                                    |
|          | 179±0.9ms                              | 165±1ms                                 | 0.92    | utility_scale.UtilityScaleBenchmarks.time_bv_100('ecr')                                                |

It's marginally faster than the previous run now.

This commit reworks the logic to reduce the number of Kronecker products and 2q (4x4) matrix multiplications we do when computing the unitary of the block. It now accumulates the 1q gates on each qubit separately with 1q (2x2) matrix multiplications, and only calls kron() and a 2q matmul when a 2q gate is encountered. This cuts down the number of expensive operations we perform and replaces them with much cheaper 1q matmuls.
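The reordering is valid because of the Kronecker mixed-product property, kron(A1, A0) · kron(B1, B0) = kron(A1 · B1, A0 · B0): consecutive 1q gates on the two qubits can be multiplied per qubit first and lifted to 4x4 once. Below is a minimal pure-Python sketch of the idea, not the Rust code from this PR. It assumes little-endian ordering (qubit 0 is the least significant, as in Qiskit) and that every 2q matrix is already expressed in the block's qubit order; the helper names are hypothetical.

```python
import cmath
import math


def identity(n):
    """n x n identity matrix stored as a list of rows."""
    return [[1.0 + 0j if i == j else 0j for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Product a @ b of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def kron(a, b):
    """Kronecker product a (x) b of two square matrices."""
    na, nb = len(a), len(b)
    return [
        [a[i // nb][j // nb] * b[i % nb][j % nb] for j in range(na * nb)]
        for i in range(na * nb)
    ]


def naive_block_unitary(gates):
    """Lift every gate to 4x4: one kron plus one 4x4 matmul per 1q gate."""
    unitary = identity(4)
    for mat, qubits in gates:
        if qubits == (0,):
            full = kron(identity(2), mat)  # qubit 0 is the low-order factor
        elif qubits == (1,):
            full = kron(mat, identity(2))
        else:
            full = mat  # assumed already in the block's (q1, q0) basis
        unitary = matmul(full, unitary)
    return unitary


def fused_block_unitary(gates):
    """Accumulate 1q gates per qubit with 2x2 matmuls; kron only at 2q gates."""
    unitary = identity(4)
    pending = [identity(2), identity(2)]  # pending[q]: 1q product on qubit q
    for mat, qubits in gates:
        if len(qubits) == 1:
            q = qubits[0]
            pending[q] = matmul(mat, pending[q])  # cheap 2x2 matmul
        else:
            # Fold the pending 1q layer, then the 2q gate, into the running unitary.
            layer = kron(pending[1], pending[0])
            unitary = matmul(mat, matmul(layer, unitary))
            pending = [identity(2), identity(2)]
    # Flush any 1q gates that trail the last 2q gate.
    return matmul(kron(pending[1], pending[0]), unitary)


s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]
CX = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # control q0, target q1
block = [(H, (0,)), (T, (1,)), (H, (0,)), (CX, (0, 1)), (T, (0,)), (H, (1,))]

naive = naive_block_unitary(block)
fused = fused_block_unitary(block)
max_diff = max(abs(naive[i][j] - fused[i][j]) for i in range(4) for j in range(4))
print(max_diff)  # expected to be at floating-point noise level
```

Both functions produce the same matrix; the fused version replaces three 4x4 lifts in the example's first 1q stretch with 2x2 products and one kron.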

mtreinish · 2024-10-25T19:12:09Z

I ran the pgo scripts under a profiler to see where the pass is spending most of its time after 62df015. The top 4 contributors to runtime are: ~42% in TwoQubitBasisDecomposer.num_basis_gates(), ~10% in DAGCircuit.replace_block_with_op(), and ~8.3% each in DAGCircuit.collect_2q_runs() and is_supported(). So I'm not sure there is a ton of extra tuning we can do without changing the behavior of the pass. I think refactoring this into something like bd43c51 is going to be a better path to improving performance moving forward. The other thing I think we will want to look at is using nalgebra for its fixed-size 2x2 and 4x4 matrix types, which are stack allocated and should be faster than ndarray and faer for all of our use cases in this code path.
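One way to see why there is limited headroom is a back-of-the-envelope Amdahl's law estimate using the profile fractions above. This is an illustrative sketch only: it treats the profiled percentages as exact and independent, and `amdahl_speedup` is a hypothetical helper, not part of Qiskit.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the runtime becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Ceiling: everything except num_basis_gates() (~42% of runtime) made free.
upper_bound = amdahl_speedup(0.58, float("inf"))

# Making replace_block_with_op() (~10% of runtime) twice as fast.
replace_block_2x = amdahl_speedup(0.10, 2.0)

print(f"{upper_bound:.2f}x overall ceiling, {replace_block_2x:.3f}x from a 2x replace_block_with_op")
```

Under these assumptions, even eliminating all non-synthesis overhead caps the gain at roughly 2.4x, and speeding up any single smaller component yields only a few percent, which is consistent with looking at behavioral refactors instead.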

kevinhartman

This looks good to me. Just a few comments / questions.

I'll avoid signing off since I believe @henryzou50 is also planning to look.

releasenotes/notes/rust-consolidation-a791a00380fc78b8.yaml

qiskit/transpiler/passes/optimization/consolidate_blocks.py

kevinhartman · 2024-11-04T16:20:56Z

qiskit/transpiler/passes/optimization/consolidate_blocks.py

@@ -195,38 +130,15 @@ def _handle_control_flow_ops(self, dag):
        pass_manager = PassManager()
        if "run_list" in self.property_set:
            pass_manager.append(Collect1qRuns())
-        if "block_list" in self.property_set:


Can block_list be specified when run_list is not?

It can be; that's arguably the normal invocation of the pass. My thinking here was that if block_list isn't specified, we'll implicitly run the equivalent of Collect2qBlocks on the control flow block regardless of the rest of the property set. That's the new feature this PR adds to the pass. So we don't need to manually populate the property set with Collect2qBlocks unless run_list is set, because populating that will preclude the 2q blocks.
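The fallback described in this reply (and in the PR summary: reuse an existing `block_list`, otherwise collect 2q blocks internally) can be sketched as below. This is a hypothetical illustration: `select_blocks` and `fake_collect` are invented names, a plain dict stands in for the transpiler's property set, and the real collection happens in Rust.

```python
def select_blocks(property_set, collect_2q_blocks):
    """Hypothetical helper: reuse Collect2qBlocks output if present, else collect internally."""
    if "block_list" in property_set:
        # Backwards-compatible path: a prior Collect2qBlocks run already populated it.
        return property_set["block_list"]
    # New default path: compute the 2q blocks directly without a separate pass.
    return collect_2q_blocks()


calls = []


def fake_collect():
    """Stand-in for the internal 2q block collection; records that it ran."""
    calls.append("collected")
    return [["cx_block"]]


reused = select_blocks({"block_list": [["precomputed"]]}, fake_collect)
computed = select_blocks({}, fake_collect)
```

The collector only runs when `block_list` is missing, so callers that still schedule Collect2qBlocks keep their existing behavior.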

kevinhartman · 2024-11-04T16:24:09Z

qiskit/transpiler/passes/optimization/consolidate_blocks.py

-                node.op.replace_blocks(pass_manager.run(block) for block in node.op.blocks),
-                propagate_condition=False,
-            )
+        control_flow_nodes = dag.control_flow_op_nodes()


I must say, I wish that control_flow_op_nodes just returned an empty list rather than None, but that is beyond the scope of this PR.
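Given the return behavior described in this comment, callers have to guard against `None` before iterating. A minimal sketch of one common idiom, `or []`, is shown below; `control_flow_op_nodes_stub` is a hypothetical stand-in that mimics the described behavior, not the real `DAGCircuit` method.

```python
def control_flow_op_nodes_stub(has_control_flow):
    """Hypothetical stand-in: returns None (not []) when there are no control flow ops."""
    return ["if_else_node"] if has_control_flow else None


def visit_control_flow(has_control_flow):
    visited = []
    # `or []` turns a None result into an empty iterable so the loop is always safe.
    for node in control_flow_op_nodes_stub(has_control_flow) or []:
        visited.append(node)
    return visited
```

An explicit `if nodes is not None:` check is equivalent; returning an empty list from the method itself would make either guard unnecessary.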

crates/accelerate/src/convert_2q_block_matrix.rs