Oxidize the ConsolidateBlocks pass #13368

Merged: 26 commits, Nov 6, 2024

@mtreinish mtreinish commented Oct 24, 2024

Summary

This commit ports the ConsolidateBlocks pass to Rust. The logic remains the same; this is a straight port. One optimization: to reduce the amount of Python processing, the Collect2qBlocks pass is no longer run as part of the preset pass managers, and block collection is instead called directly in Rust. This speeds up the pass because it avoids three crossings of the language boundary and also the intermediate creation of DAGNode Python objects. The pass still supports running with Collect2qBlocks for backwards compatibility, and will skip running the equivalent collection internally if the field is present in the property set.
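
The fallback described above can be modelled with a small, self-contained sketch. This is a toy model and not Qiskit's implementation: gates are plain `(name, qubits)` tuples, `collect_2q_blocks` is a deliberately simplified greedy collector, and `blocks_to_consolidate` is a hypothetical helper name. The only detail taken from Qiskit is the `block_list` property-set key that Collect2qBlocks writes.

```python
# Toy model of "use the stored block list if present, otherwise collect internally".
# Gate, collect_2q_blocks and blocks_to_consolidate are illustrative names only.
Gate = tuple[str, tuple[int, ...]]  # (name, qubits), a stand-in for a DAG op node


def collect_2q_blocks(gates: list[Gate]) -> list[list[int]]:
    """Greedily group consecutive gates that fit inside one pair of qubits."""
    blocks: list[list[int]] = []
    current: list[int] = []
    qubits: set[int] = set()
    for index, (_name, gate_qubits) in enumerate(gates):
        fits = len(gate_qubits) <= 2 and len(qubits | set(gate_qubits)) <= 2
        if fits:
            current.append(index)
            qubits |= set(gate_qubits)
            continue
        # The gate does not fit: close the running block.
        if current:
            blocks.append(current)
        if len(gate_qubits) <= 2:
            # Start a new block with this gate.
            current, qubits = [index], set(gate_qubits)
        else:
            # Gates on more than two qubits never join a block.
            current, qubits = [], set()
    if current:
        blocks.append(current)
    return blocks


def blocks_to_consolidate(gates: list[Gate], property_set: dict) -> list[list[int]]:
    """Prefer a block list left by an earlier pass; otherwise collect here."""
    stored = property_set.get("block_list")
    if stored is not None:
        return stored  # backwards-compatible path: a collection pass already ran
    return collect_2q_blocks(gates)  # standalone path: no separate pass needed


gates = [("h", (0,)), ("cx", (0, 1)), ("rz", (1,)), ("cx", (1, 2)), ("x", (2,))]
print(blocks_to_consolidate(gates, {}))                         # [[0, 1, 2], [3, 4]]
print(blocks_to_consolidate(gates, {"block_list": [[1, 2]]}))   # [[1, 2]]
```

In this model the second call never runs the collector, which mirrors the backwards-compatible path when the field is already in the property set.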

There are potential improvements that can be investigated here, such as avoiding in-place DAG contraction and moving to rebuilding the DAG iteratively, and changing the logic around estimated error (see #11659) to be more robust. These can be left for follow-up PRs since they change the logic.

Realistically we should look at combining ConsolidateBlocks, for its current two usages with Split2qUnitaries and UnitarySynthesis, into those passes for more efficiency. We can improve the performance and logic as part of that refactor. See #12007 for more details on this for UnitarySynthesis.

Details and comments

Closes #12250

TODO:

  • Fix test failures
  • Add release note for support for running without Collect2qBlocks
  • Benchmark and tune

@mtreinish mtreinish added the on hold, performance, Changelog: New Feature, Rust, and mod: transpiler labels Oct 24, 2024
@mtreinish mtreinish added this to the 1.3.0 milestone Oct 24, 2024
@mtreinish mtreinish requested a review from a team as a code owner October 24, 2024 16:30
@qiskit-bot
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core
  • @levbishop

@mtreinish
Member Author

mtreinish commented Oct 24, 2024

While there are still 4 tests to fix here, I did a quick asv run to get a feel for the speedup so far and it yielded:

Benchmarks that have improved:

| Change   | Before [2284f192] <consolidate-blocks~1>   | After [ed2b41b5] <consolidate-blocks>   |   Ratio | Benchmark (Parameter)                                                                           |
|----------|--------------------------------------------|-----------------------------------------|---------|-------------------------------------------------------------------------------------------------|
| -        | 1.77±0s                                    | 1.60±0.01s                              |    0.91 | utility_scale.UtilityScaleBenchmarks.time_qft('cz')                                             |
| -        | 530±4ms                                    | 474±5ms                                 |    0.89 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)            |
| -        | 1.77±0.01s                                 | 1.58±0.01s                              |    0.89 | utility_scale.UtilityScaleBenchmarks.time_qft('ecr')                                            |
| -        | 50.1±0.5ms                                 | 44.0±0.3ms                              |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(2)                        |
| -        | 75.5±0.6ms                                 | 66.5±0.4ms                              |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)                        |
| -        | 312±2ms                                    | 271±2ms                                 |    0.87 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(2)            |
| -        | 416±2ms                                    | 359±4ms                                 |    0.86 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cx')                                            |
| -        | 27.0±0.3ms                                 | 22.6±0.3ms                              |    0.84 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(2) |
| -        | 28.2±0.5ms                                 | 23.5±0.4ms                              |    0.83 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3) |
| -        | 1.28±0.01s                                 | 1.06±0.01s                              |    0.83 | utility_scale.UtilityScaleBenchmarks.time_qv('cx')                                              |
| -        | 223±4ms                                    | 185±2ms                                 |    0.83 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cx')                               |
| -        | 649±3ms                                    | 531±8ms                                 |    0.82 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cz')                                            |
| -        | 276±0.5ms                                  | 227±3ms                                 |    0.82 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cz')                               |
| -        | 279±0.9ms                                  | 228±3ms                                 |    0.82 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('ecr')                              |
| -        | 22.3±0.3ms                                 | 18.0±0.2ms                              |    0.81 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)                   |
| -        | 637±4ms                                    | 512±3ms                                 |    0.8  | utility_scale.UtilityScaleBenchmarks.time_qaoa('ecr')                                           |
| -        | 1.74±0.01s                                 | 1.39±0.01s                              |    0.8  | utility_scale.UtilityScaleBenchmarks.time_qv('cz')                                              |
| -        | 1.67±0.01s                                 | 1.33±0.01s                              |    0.8  | utility_scale.UtilityScaleBenchmarks.time_qv('ecr')                                             |
| -        | 21.5±0.5ms                                 | 17.1±0.3ms                              |    0.79 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(2)                   |
| -        | 11.0±0.2ms                                 | 4.24±0.04ms                             |    0.38 | passes.Collect2QPassBenchmarks.time_consolidate_blocks(5, 1024)                                 |
| -        | 29.8±0.09ms                                | 10.2±0.04ms                             |    0.34 | passes.Collect2QPassBenchmarks.time_consolidate_blocks(14, 1024)                                |
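
As a quick sanity check on how to read the Ratio column, the sketch below recomputes it as after/before from the rounded means printed in two of the rows above. The helper names are hypothetical, and since asv derives its ratio from unrounded measurements, a value recomputed from the printed cells can differ in the last digit.

```python
import re

# Multipliers to seconds for the unit suffixes that appear in the table.
_UNITS = {"s": 1.0, "ms": 1e-3}


def parse_asv_time(cell: str) -> float:
    """Return the mean of a timing cell such as '29.8±0.09ms', in seconds."""
    match = re.fullmatch(r"([\d.]+)±[\d.]+(ms|s)", cell.strip())
    if match is None:
        raise ValueError(f"unrecognized timing cell: {cell!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def ratio(before: str, after: str) -> float:
    """After/before, so values below 1.0 mean the new code is faster."""
    return parse_asv_time(after) / parse_asv_time(before)


# time_consolidate_blocks(14, 1024) row from the table above.
r = ratio("29.8±0.09ms", "10.2±0.04ms")
print(f"ratio={r:.2f} speedup={1 / r:.1f}x")  # ratio=0.34 speedup=2.9x
```

Read this way, the isolated pass benchmarks run roughly three times faster, while the full transpile benchmarks in the same table improve by about 10 to 20 percent.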

Benchmarks that have stayed the same:

| Change   | Before [2284f192] <consolidate-blocks~1>   | After [ed2b41b5] <consolidate-blocks>   | Ratio   | Benchmark (Parameter)                                                                                  |
|----------|--------------------------------------------|-----------------------------------------|---------|--------------------------------------------------------------------------------------------------------|
|          | failed                                     | failed                                  | n/a     | passes.Collect2QPassBenchmarks.time_consolidate_blocks(20, 1024)                                       |
|          | 0                                          | 0                                       | n/a     | utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('cx')                                          |
|          | 0                                          | 0                                       | n/a     | utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('cz')                                          |
|          | 0                                          | 0                                       | n/a     | utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('ecr')                                         |
|          | 3.72±0.02s                                 | 3.90±0.2s                               | 1.05    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cx')                                                |
|          | 9.34±0.06ms                                | 9.74±0.05ms                             | 1.04    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cx')                                                 |
|          | 9.41±0.08ms                                | 9.83±0.2ms                              | 1.04    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cz')                                                 |
|          | 70.9±0.7ms                                 | 72.8±0.3ms                              | 1.03    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(1)                                |
|          | 9.62±0.06ms                                | 9.92±0.1ms                              | 1.03    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('cz')                                        |
|          | 102±1ms                                    | 105±0.4ms                               | 1.03    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('ecr')                                        |
|          | 397                                        | 407                                     | 1.03    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('cz')                                          |
|          | 397                                        | 407                                     | 1.03    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('ecr')                                         |
|          | 73.4±0.5ms                                 | 75.0±0.6ms                              | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(0)                                |
|          | 35.7±0.3ms                                 | 36.5±0.6ms                              | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(1)                               |
|          | 9.45±0.06ms                                | 9.66±0.02ms                             | 1.02    | utility_scale.UtilityScaleBenchmarks.time_bvlike('ecr')                                                |
|          | 9.70±0.02ms                                | 9.88±0.08ms                             | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('cx')                                        |
|          | 103±0.5ms                                  | 105±0.3ms                               | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('cz')                                         |
|          | 33.5±0.2ms                                 | 34.2±0.2ms                              | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cx')                           |
|          | 33.7±0.3ms                                 | 34.2±0.3ms                              | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cz')                           |
|          | 33.5±0.2ms                                 | 34.2±0.1ms                              | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('ecr')                          |
|          | 35.5±0.1ms                                 | 36.0±0.9ms                              | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(1)                          |
|          | 29.5±0.1ms                                 | 29.8±0.2ms                              | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(0)                               |
|          | 9.73±0.07ms                                | 9.82±0.08ms                             | 1.01    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('ecr')                                       |
|          | 103±0.7ms                                  | 104±0.2ms                               | 1.01    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('cx')                                         |
|          | 185±2ms                                    | 185±0.4ms                               | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(0)                   |
|          | 43.0±0.4ms                                 | 43.0±0.3ms                              | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(0)        |
|          | 1404                                       | 1404                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(0)            |
|          | 1403                                       | 1403                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(1)            |
|          | 1323                                       | 1323                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(2)            |
|          | 1296                                       | 1296                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(3)            |
|          | 2705                                       | 2705                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(0)                   |
|          | 2005                                       | 2005                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(1)                   |
|          | 7                                          | 7                                       | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(2)                   |
|          | 7                                          | 7                                       | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(3)                   |
|          | 2705                                       | 2705                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(0) |
|          | 2005                                       | 2005                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(1) |
|          | 7                                          | 7                                       | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(2) |
|          | 7                                          | 7                                       | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(3) |
|          | 465                                        | 465                                     | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(0)                        |
|          | 336                                        | 336                                     | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(1)                        |
|          | 327                                        | 327                                     | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(2)                        |
|          | 272                                        | 272                                     | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(3)                        |
|          | 395                                        | 395                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('cx')                                          |
|          | 300                                        | 300                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('cx')                                         |
|          | 300                                        | 300                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('cz')                                         |
|          | 300                                        | 300                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('ecr')                                        |
|          | 1607                                       | 1607                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('cx')                                            |
|          | 1622                                       | 1622                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('cz')                                            |
|          | 1622                                       | 1622                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('ecr')                                           |
|          | 1954                                       | 1954                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('cx')                                             |
|          | 1954                                       | 1954                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('cz')                                             |
|          | 1954                                       | 1954                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('ecr')                                            |
|          | 2709                                       | 2709                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('cx')                                              |
|          | 2709                                       | 2709                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('cz')                                              |
|          | 2709                                       | 2709                                    | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('ecr')                                             |
|          | 462                                        | 462                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('cx')                               |
|          | 462                                        | 462                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('cz')                               |
|          | 462                                        | 462                                     | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('ecr')                              |
|          | 192±0.7ms                                  | 192±2ms                                 | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(1)                   |
|          | 32.2±0.3ms                                 | 31.8±0.4ms                              | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(0)                          |
|          | 47.6±0.09ms                                | 47.3±0.5ms                              | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(1)        |
|          | 3.88±0.06s                                 | 3.75±0.01s                              | 0.97    | utility_scale.UtilityScaleBenchmarks.time_circSU2('ecr')                                               |
|          | 168±0.9ms                                  | 158±0.9ms                               | 0.94    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cx')                                                 |
|          | 179±2ms                                    | 168±0.9ms                               | 0.94    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cz')                                                 |
|          | 179±1ms                                    | 168±0.7ms                               | 0.94    | utility_scale.UtilityScaleBenchmarks.time_bv_100('ecr')                                                |
|          | 3.93±0.06s                                 | 3.71±0.1s                               | 0.94    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cz')                                                |
|          | 1.41±0.01s                                 | 1.32±0s                                 | 0.94    | utility_scale.UtilityScaleBenchmarks.time_qft('cx')                                                    |

In general there is only so much we'll be able to do on the performance here because we'll be bottlenecked on the DAG manipulation and UnitaryGate object creation. I think we can work on fixing those in follow-ups separately. The DAG manipulation can be removed as part of something like: bd43c51 (and an equivalent for Split2qUnitaries) where we rebuild the DAG instead of doing in-place substitution.
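
To make the distinction between in-place substitution and rebuilding concrete, here is a toy sketch on a flat gate list rather than a real DAG. It assumes each block is a contiguous run of indices, uses hypothetical function names, and says nothing about the actual cost of either approach in Qiskit; it only shows that both strategies arrive at the same circuit.

```python
# Toy illustration only: a flat list of (name, qubits) tuples stands in for the DAG.
Gate = tuple[str, tuple[int, ...]]


def _block_qubits(gates: list[Gate], block: list[int]) -> tuple[int, ...]:
    """Sorted qubits touched by any gate in the block."""
    return tuple(sorted({q for i in block for q in gates[i][1]}))


def contract_in_place(gates: list[Gate], blocks: list[list[int]]) -> list[Gate]:
    """Mutate `gates`, replacing each contiguous block with one 'unitary' op."""
    # Work from the last block backwards so earlier indices stay valid.
    for block in sorted(blocks, key=lambda b: b[0], reverse=True):
        qubits = _block_qubits(gates, block)
        gates[block[0] : block[-1] + 1] = [("unitary", qubits)]
    return gates


def rebuild(gates: list[Gate], blocks: list[list[int]]) -> list[Gate]:
    """Leave `gates` untouched and emit a fresh list in a single pass."""
    block_of = {i: block for block in blocks for i in block}
    out: list[Gate] = []
    for index, gate in enumerate(gates):
        block = block_of.get(index)
        if block is None:
            out.append(gate)  # not part of any block: copy through
        elif index == block[0]:
            out.append(("unitary", _block_qubits(gates, block)))
        # Remaining block members are absorbed into the op emitted above.
    return out


gates = [("h", (0,)), ("cx", (0, 1)), ("rz", (1,)),
         ("ccx", (0, 1, 2)), ("cx", (1, 2)), ("x", (2,))]
blocks = [[0, 1, 2], [4, 5]]
print(rebuild(gates, blocks))
print(rebuild(gates, blocks) == contract_in_place(list(gates), blocks))  # True
```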

@mtreinish mtreinish changed the title from "[WIP] Oxidize the ConsolidateBlocks pass" to "Oxidize the ConsolidateBlocks pass" Oct 24, 2024
@mtreinish mtreinish removed the on hold label Oct 24, 2024
@mtreinish
Member Author

mtreinish commented Oct 24, 2024

This should be good to review now. There might be some benchmarking/profiling and tuning we want to do, but it's not a blocker.

The test change made to fix a test failure was incorrect and masked a
logic bug that was fixed in a subsequent commit. This commit reverts
that change to the test and removes the release note that attempted to
document a fix for a bug that only existed during development of this PR.
@coveralls

coveralls commented Oct 24, 2024

Pull Request Test Coverage Report for Build 11693897124

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 535 of 571 (93.7%) changed or added relevant lines in 9 files are covered.
  • 134 unchanged lines in 13 files lost coverage.
  • Overall coverage increased (+0.05%) to 88.816%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|--------------------------|---------------|---------------------|---|
| crates/accelerate/src/convert_2q_block_matrix.rs | 67 | 72 | 93.06% |
| crates/accelerate/src/consolidate_blocks.rs | 239 | 248 | 96.37% |
| crates/circuit/src/dag_circuit.rs | 199 | 221 | 90.05% |

| Files with Coverage Reduction | New Missed Lines | % |
|-------------------------------|------------------|---|
| qiskit/transpiler/passes/optimization/consolidate_blocks.py | 1 | 96.15% |
| crates/accelerate/src/target_transpiler/mod.rs | 1 | 82.64% |
| qiskit/circuit/library/generalized_gates/gms.py | 2 | 94.44% |
| qiskit/circuit/library/generalized_gates/rv.py | 2 | 84.62% |
| crates/qasm2/src/lex.rs | 2 | 92.48% |
| qiskit/circuit/library/generalized_gates/permutation.py | 3 | 92.73% |
| qiskit/circuit/library/generalized_gates/diagonal.py | 3 | 95.16% |
| crates/qasm2/src/parse.rs | 6 | 97.62% |
| qiskit/circuit/library/grover_operator.py | 9 | 92.86% |
| qiskit/circuit/library/generalized_gates/linear_function.py | 9 | 84.87% |

| Totals | |
|--------|---|
| Change from base Build 11683622570 | 0.05% |
| Covered Lines | 77266 |
| Relevant Lines | 86996 |

💛 - Coveralls

@mtreinish
Member Author

After the most recent round of changes the overall benchmarking results look like:

Benchmarks that have improved:                      

| Change   | Before [f2e07bc5] <main>               | After [a4229901] <consolidate-blocks>   |   Ratio | Benchmark (Parameter)                                                                           |
|----------|----------------------------------------|-----------------------------------------|---------|-------------------------------------------------------------------------------------------------|      
| -        | 1.77±0.01s                             | 1.57±0s                                 |    0.89 | utility_scale.UtilityScaleBenchmarks.time_qft('cz')                                             |      
| -        | 310±2ms                                | 272±1ms                                 |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(2)            |      
| -        | 49.8±0.7ms                             | 43.8±0.6ms                              |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(2)                        |      
| -        | 1.77±0.01s                             | 1.56±0.01s                              |    0.88 | utility_scale.UtilityScaleBenchmarks.time_qft('ecr')                                            |      
| -        | 535±1ms                                | 463±3ms                                 |    0.87 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)            |      
| -        | 27.8±1ms                               | 23.4±0.4ms                              |    0.84 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(2) |      
| -        | 76.9±2ms                               | 64.6±2ms                                |    0.84 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)                        |      
| -        | 1.26±0.01s                             | 1.07±0.01s                              |    0.84 | utility_scale.UtilityScaleBenchmarks.time_qv('cx')                                              |      
| -        | 28.2±2ms                               | 23.4±0.3ms                              |    0.83 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3) |      
| -        | 417±3ms                                | 347±0.9ms                               |    0.83 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cx')                                            |      
| -        | 222±2ms                                | 183±3ms                                 |    0.83 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cx')                               |      
| -        | 279±4ms                                | 229±3ms                                 |    0.82 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('ecr')                              |      
| -        | 1.72±0.01s                             | 1.38±0.02s                              |    0.81 | utility_scale.UtilityScaleBenchmarks.time_qv('cz')                                              |      
| -        | 276±2ms                                | 224±3ms                                 |    0.81 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cz')                               |      
| -        | 1.66±0.02s                             | 1.34±0.01s                              |    0.8  | utility_scale.UtilityScaleBenchmarks.time_qv('ecr')                                             |      
| -        | 21.1±0.2ms                             | 16.8±0.1ms                              |    0.79 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(2)                   |      
| -        | 643±4ms                                | 508±2ms                                 |    0.79 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cz')                                            |      
| -        | 635±5ms                                | 500±9ms                                 |    0.79 | utility_scale.UtilityScaleBenchmarks.time_qaoa('ecr')                                           |
| -        | 24.0±0.7ms                             | 18.2±0.2ms                              |    0.76 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)                   |
| -        | 11.3±0.2ms                             | 3.68±0.02ms                             |    0.33 | passes.Collect2QPassBenchmarks.time_consolidate_blocks(5, 1024)                                 |
| -        | 30.5±0.2ms                             | 9.90±0.04ms                             |    0.32 | passes.Collect2QPassBenchmarks.time_consolidate_blocks(14, 1024)                                |

Benchmarks that have stayed the same:

| Change   | Before [f2e07bc5] <main>               | After [a4229901] <consolidate-blocks>   | Ratio   | Benchmark (Parameter)                                                                                  |
|----------|----------------------------------------|-----------------------------------------|---------|--------------------------------------------------------------------------------------------------------|
|          | failed                                 | failed                                  | n/a     | passes.Collect2QPassBenchmarks.time_consolidate_blocks(20, 1024)                                       |
|          | 3.71±0.02s                             | 3.79±0.1s                               | 1.02    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cx')                                                |
|          | 9.18±0.1ms                             | 9.27±0.1ms                              | 1.01    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cx')                                                 |
|          | 9.21±0.05ms                            | 9.24±0.05ms                             | 1.00    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cz')                                                 |
|          | 9.23±0.06ms                            | 9.25±0.07ms                             | 1.00    | utility_scale.UtilityScaleBenchmarks.time_bvlike('ecr')                                                |
|          | 4.00±0.05s                             | 3.92±0.07s                              | 0.98    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cz')                                                |
|          | 35.0±1ms                               | 34.2±0.2ms                              | 0.98    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cz')                           |
|          | 35.0±1ms                               | 34.2±0.3ms                              | 0.98    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('ecr')                          |
|          | 3.71±0.1s                              | 3.61±0.2s                               | 0.97    | utility_scale.UtilityScaleBenchmarks.time_circSU2('ecr')                                               |
|          | 166±2ms                                | 155±2ms                                 | 0.93    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cx')                                                 |
|          | 177±1ms                                | 165±0.8ms                               | 0.93    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cz')                                                 |
|          | 1.40±0.01s                             | 1.31±0s                                 | 0.93    | utility_scale.UtilityScaleBenchmarks.time_qft('cx')                                                    |
|          | 179±0.9ms                              | 165±1ms                                 | 0.92    | utility_scale.UtilityScaleBenchmarks.time_bv_100('ecr')                                                |

It's marginally faster than the previous run now.

This commit reworks the logic to reduce the number of Kronecker products
and 2q matrix multiplications we do as part of computing the unitary of
the block. It now computes the 1q components individually with 1q matrix
multiplications and only calls kron() and a 2q matmul when a 2q gate is
encountered. This reduces the number of more expensive operations we
need to perform and replaces them with a much faster 1q matmul.
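
The idea in this commit message can be sketched roughly as follows. This is a minimal pure-Python illustration, not the Rust implementation from the PR: nested lists stand in for the real matrix types, every helper name (`matmul`, `kron`, `block_matrix_naive`, `block_matrix_lazy`) is hypothetical, and it assumes 2q gates are always given in qubit order (0, 1) with qubit 0 as the least significant qubit.

```python
# Illustrative sketch only: nested lists stand in for the matrix types used
# by the real pass, and all helper names here are hypothetical.

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(a, b):
    """Kronecker product of two 2x2 matrices, giving a 4x4 matrix."""
    return [[a[i // 2][j // 2] * b[i % 2][j % 2] for j in range(4)]
            for i in range(4)]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def block_matrix_naive(ops):
    """Expand every 1q gate to 4x4 with kron() and do a 4x4 matmul per gate."""
    out = identity(4)
    for mat, qubits in ops:
        if qubits == (0,):
            mat = kron(identity(2), mat)  # qubit 0 is the least significant
        elif qubits == (1,):
            mat = kron(mat, identity(2))
        out = matmul(mat, out)
    return out


def block_matrix_lazy(ops):
    """Accumulate 1q gates per qubit; call kron() only when a 2q gate arrives."""
    out = identity(4)
    pending = [identity(2), identity(2)]  # accumulated 1q matrix per qubit
    dirty = False                         # True when pending holds any gate
    for mat, qubits in ops:
        if len(qubits) == 1:
            q = qubits[0]
            pending[q] = matmul(mat, pending[q])  # cheap 2x2 matmul
            dirty = True
            continue
        if dirty:
            # Fold the accumulated 1q gates in with a single kron().
            out = matmul(kron(pending[1], pending[0]), out)
            pending = [identity(2), identity(2)]
            dirty = False
        out = matmul(mat, out)
    if dirty:  # flush any trailing 1q gates
        out = matmul(kron(pending[1], pending[0]), out)
    return out


X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
S = [[1, 0], [0, 1j]]
CX = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]

ops = [(X, (0,)), (S, (1,)), (Z, (0,)), (CX, (0, 1)), (S, (0,)), (X, (1,))]
assert block_matrix_lazy(ops) == block_matrix_naive(ops)
```

For this six-gate block the naive version performs five `kron()` calls and six 4x4 multiplications, while the lazy version performs two `kron()` calls and three 4x4 multiplications, with the remaining work done as 2x2 multiplications.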
@mtreinish
Member Author

I ran the PGO scripts under a profiler to see where the pass spends most of its time after 62df015. The top four components by runtime are: ~42% in TwoQubitBasisDecomposer.num_basis_gates(), ~10% in DAGCircuit.replace_block_with_op(), and ~8.3% each in DAGCircuit.collect_2q_runs() and is_supported(). So I'm not sure there is a ton of extra tuning we can do without changing the behavior of the pass. I think refactoring this into something like bd43c51 is going to be a better path to improving performance moving forward. The other thing I think we will want to look at is using nalgebra for its fixed-size Matrix2 and Matrix4 types, which are stack allocated and should be faster for all of our use cases than ndarray and faer in this code path.

Contributor

@kevinhartman kevinhartman left a comment

This looks good to me. Just a few comments / questions.

I'll avoid signing off since I believe @henryzou50 is also planning to look.

@@ -195,38 +130,15 @@ def _handle_control_flow_ops(self, dag):
pass_manager = PassManager()
if "run_list" in self.property_set:
    pass_manager.append(Collect1qRuns())
if "block_list" in self.property_set:
Contributor

Can block_list be specified when run_list is not?

Member Author

It can be; that's arguably the normal invocation of the pass. My thinking here was that if block_list isn't specified, we'll implicitly run the equivalent of Collect2qBlocks on the control flow block regardless of the property set. That's the new feature this PR adds to the pass. So we don't need to manually populate the property set with Collect2qBlocks unless run_list is set, because populating run_list will preclude the implicit 2q block collection.
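
The selection rule described in this reply can be sketched in a few lines of Python. This is only an illustration of the decision order, not code from the PR: `select_blocks` and the `collect_2q_runs` callable are hypothetical names, and `None` stands for a property-set entry that is absent.

```python
def select_blocks(blocks, runs, collect_2q_runs):
    """Pick the 2q blocks to consolidate.

    blocks / runs: values of the "block_list" / "run_list" property-set
    entries, or None when the entry is absent.
    collect_2q_runs: hypothetical zero-argument callable standing in for the
    DAG's built-in 2q collection.
    """
    if blocks is not None:
        # An external Collect2qBlocks run already populated the property set.
        return blocks
    if runs is not None:
        # Legacy configuration: only 1q runs were collected externally, so
        # 2q blocks are not collected implicitly.
        return []
    # Default path: collect directly, with no Collect2qBlocks pass needed.
    return collect_2q_runs()
```

Under this rule, the control flow handler only needs to schedule Collect2qBlocks explicitly when the outer property set carries a `block_list`; with neither entry present, the inner run collects its own blocks.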

node.op.replace_blocks(pass_manager.run(block) for block in node.op.blocks),
propagate_condition=False,
)
control_flow_nodes = dag.control_flow_op_nodes()
Contributor

I must say, I wish that control_flow_op_nodes just returned an empty list rather than None, but that is beyond the scope of this PR.

.collect();
let circuit_data = CircuitData::from_packed_operations(
    py,
    block_qargs.len() as u32,
Contributor

This is perhaps more evidence that we ought to just use usize in any public interface of the circuit crate rather than the internal representation.

(no action requested)

Contributor

@henryzou50 henryzou50 left a comment

Overall, this looks great to me! Thanks to Kevin for also reviewing. I’ve left a few suggestions for improving efficiency and memory usage, along with some additional comments and questions. Excellent work, and thanks for all the work put into this!

Comment on lines +67 to +93
let blocks = match blocks {
    Some(runs) => runs
        .into_iter()
        .map(|run| {
            run.into_iter()
                .map(NodeIndex::new)
                .collect::<Vec<NodeIndex>>()
        })
        .collect(),
    // If runs are specified but blocks are none we're in a legacy configuration where external
    // collection passes are being used. In this case don't collect blocks because it's
    // unexpected.
    None => match runs {
        Some(_) => vec![],
        None => dag.collect_2q_runs().unwrap(),
    },
};

let runs: Option<Vec<Vec<NodeIndex>>> = runs.map(|runs| {
    runs.into_iter()
        .map(|run| {
            run.into_iter()
                .map(NodeIndex::new)
                .collect::<Vec<NodeIndex>>()
        })
        .collect()
});
Contributor

To improve memory efficiency here, we could simplify the handling of blocks and runs by unifying them with or_else and unwrap_or_else. This approach would reduce unnecessary temporary Vec allocations and leverage iterators for more efficient memory usage.

For instance, something like:

Suggested change
// If `blocks` is `None` but `runs` is specified, `blocks` is set to `runs.clone()`.
// If both are `None`, it defaults to collecting 2q runs from the DAG circuit.
let blocks: Vec<Vec<NodeIndex>> = blocks
    .or_else(|| runs.clone())
    .unwrap_or_else(|| {
        dag.collect_2q_runs()
            .unwrap()
            .into_iter()
            .map(|run| {
                run.into_iter()
                    .map(|node_index| node_index.index())
                    .collect()
            })
            .collect()
    })
    .into_iter()
    .map(|run| run.into_iter().map(NodeIndex::new).collect())
    .collect();
let runs: Option<Vec<Vec<NodeIndex>>> = runs.map(|runs| {
    runs.into_iter()
        .map(|run| run.into_iter().map(NodeIndex::new).collect())
        .collect()
});

Member Author

I'll see if I can come up with something for this. The behavior of cloning runs like this doesn't feel correct and I think probably wouldn't work correctly because it'll try to call blocks_to_matrix with a single qubit matrix. But I'll see if I can come up with something to avoid temporary vecs. I think I did this to make the borrow checker happy, but it was long enough ago I'm not 100% sure.

Member Author

I'm not sure I see a path here. The conflict is the type mismatch between NodeIndex and the usize input. From a typing perspective Rust won't be happy, because collect_2q_runs() returns a Vec of NodeIndex and we can't take that as an input. The best idea I had to avoid an extra allocation was to take NodeIndex on the input Vec, but we can't define the FromPyObject trait for a type not defined in qiskit.

Your suggestion here doesn't actually avoid any allocations except when blocks is Some(_), and that comes at the cost of a second allocation on the blocks is None path. The blocks is None path is the more common one, though, because that's what we run in the preset pass managers, so I'd rather stick with the bare collect_2q_runs call and pay the conversion cost from the blocks list only if it's specified.

Contributor

Oh yes, I understand now, and that makes sense. You're correct about the typing conflict and the allocation cost trade-offs, especially since the blocks is None path is more common in the preset pass managers. Let’s stick with what you initially suggested with the bare collect_2q_runs call, and handle the conversion cost from the blocks list if it's specified. Thanks for clarifying!

mtreinish and others added 2 commits November 5, 2024 17:29
Co-authored-by: Henry Zou <87874865+henryzou50@users.noreply.github.com>
@mtreinish mtreinish requested a review from henryzou50 November 5, 2024 23:23
Contributor

@henryzou50 henryzou50 left a comment

Looks good to me, Matt! Thanks for the latest changes!

@henryzou50 henryzou50 added this pull request to the merge queue Nov 6, 2024
Merged via the queue into Qiskit:main with commit 1b35e8b Nov 6, 2024
17 checks passed
@mtreinish mtreinish deleted the consolidate-blocks branch November 10, 2024 15:13
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Nov 20, 2024
The ConsolidateBlocks pass was ported to rust in Qiskit#13368 and as part of
that implementation a small behavior difference between the rust and
python interfaces was causing the pass to not work correctly with non-CX
gates. The internal 2q decomposer interface stores a sentinel string for
the kak gate which is used to tell the python space constructor to use the
python defined gate object. However in the pass code we weren't
factoring this difference in, and for non-CX gates we were evaluating
the basis count as the number of gates with that sentinel value name
(which is almost always zero) and this was preventing the pass from
consolidating many blocks that should have been. This commit fixes this
issue by taking the name from python space and passing it through to the
rust portion of the code and using that for the comparison.

Fixes Qiskit#13459
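
A rough Python sketch of the failure mode described in this commit message follows. It is a simplification under stated assumptions, not the actual pass logic: the sentinel value `"USER_GATE"`, the function names, and the single "fewer gates after resynthesis" condition are all hypothetical stand-ins, and the real pass considers other conditions as well.

```python
# Hypothetical placeholder for the sentinel string stored by the Rust
# decomposer interface; the real value is not given in this conversation.
SENTINEL_NAME = "USER_GATE"


def count_basis_gates(block_gate_names, basis_gate_name):
    """Count how many gates in the block carry the basis gate's name."""
    return sum(1 for name in block_gate_names if name == basis_gate_name)


def should_consolidate(block_gate_names, basis_gate_name, gates_after_resynthesis):
    """Simplified rule: consolidate when resynthesis needs fewer basis gates."""
    return gates_after_resynthesis < count_basis_gates(block_gate_names, basis_gate_name)


block = ["cz", "rz", "cz", "sx", "cz"]

# Before the fix: comparing against the sentinel name almost always counts 0,
# so the block is never consolidated.
assert count_basis_gates(block, SENTINEL_NAME) == 0
assert not should_consolidate(block, SENTINEL_NAME, 2)

# After the fix: the real gate name is passed through from Python space.
assert count_basis_gates(block, "cz") == 3
assert should_consolidate(block, "cz", 2)
```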
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2024
Fixes #13459
mergify bot pushed a commit that referenced this pull request Nov 20, 2024
Fixes #13459

(cherry picked from commit c792426)
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2024
Fixes #13459

(cherry picked from commit c792426)

Co-authored-by: Matthew Treinish <mtreinish@kortar.org>
Labels
Changelog: New Feature (Include in the "Added" section of the changelog)
mod: transpiler (Issues and PRs related to Transpiler)
performance
Rust (This PR or issue is related to Rust code in the repository)
Development

Successfully merging this pull request may close these issues.

Port ConsolidateBlocks to Rust
5 participants