Reimplement DenseLayout as a Parallel Algorithm in Rust #7740

Merged
merged 21 commits into Qiskit:main from cthulhu_mckee on Mar 10, 2022

Conversation

@mtreinish
Member

@mtreinish commented Mar 5, 2022

Summary

This commit re-implements the core of the DenseLayout transpiler pass
in Rust so that it runs in multiple threads. Previously this algorithm used
scipy sparse matrices to create a CSR sparse matrix representation of
the coupling graph and iterated over that to find a densely connected
subgraph in the coupling graph. This performs and scales well for
modest-sized circuits and coupling graphs, but as the size of the coupling
graphs and circuits grows, running the algorithm by iterating over the
sparse matrix in Python starts to hit a limit. The underlying traversal
can be executed efficiently in parallel in Rust, since the algorithm
iterates over the coupling graph in BFS order from each node to try to
find the best subgraph: we can do the BFS traversals in parallel and
then iteratively compare the results in parallel until the best is
found. This greatly speeds up the execution of the pass; for example,
running on a 1081 qubit quantum volume circuit on a 1081 qubit
heavy hexagon coupling graph takes ~135 seconds with the previous
version and ~0.14 seconds after this commit (on my local workstation
with 32 physical cores and 64 logical cores; scaling likely won't be as
good on smaller systems).
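
A minimal sketch of the parallel structure described above, assuming the rayon crate; `subset_from_bfs`, `SubsetResult`, and the scoring are illustrative stand-ins, not the actual qiskit._accelerate code. Each physical qubit is used as a BFS starting point in parallel, and the per-start candidates are then combined pairwise (also in parallel) until only the best subset remains.

```rust
use rayon::prelude::*;

/// One candidate layout: the subset of physical qubits reached from `start`
/// plus a connectivity score (higher is better in this toy version).
#[derive(Clone)]
struct SubsetResult {
    start: usize,
    score: f64,
    subset: Vec<usize>,
}

/// Placeholder for the real per-start work: walk the coupling graph in BFS
/// order from `start`, keep the first `num_qubits` qubits reached, and score
/// how densely connected (and how low-error) that subset is.
fn subset_from_bfs(start: usize, num_qubits: usize, adjacency: &[Vec<u8>]) -> SubsetResult {
    let end = (start + num_qubits).min(adjacency.len());
    let subset: Vec<usize> = (start..end).collect();
    let score = subset.len() as f64;
    SubsetResult { start, score, subset }
}

/// Try every physical qubit as a BFS starting point in parallel, then reduce
/// the candidates pairwise in parallel, keeping the better of each pair.
fn best_subset(num_qubits: usize, adjacency: &[Vec<u8>]) -> Option<SubsetResult> {
    (0..adjacency.len())
        .into_par_iter()
        .map(|start| subset_from_bfs(start, num_qubits, adjacency))
        .reduce_with(|a, b| if b.score > a.score { b } else { a })
}
```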

The tradeoff here is slightly increased memory consumption: to have a shared
representation of the adjacency matrix (and the error matrix) between Python
and Rust we use numpy arrays, as they can be passed by reference between the
languages. In practice this will not matter much until the graphs get truly
large (e.g. representing a 10000 qubit adjacency matrix and error matrix would
require ~1.6 GB of memory), and if it does become an issue (either for memory
or runtime performance) we can add a shared compressed sparse matrix
representation to Qiskit for use in both Python and Rust.
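
A hedged sketch of how a NumPy array can be handed to Rust by reference, assuming the pyo3 and numpy crates; the function name and exact signature are illustrative only, not the actual qiskit._accelerate interface.

```rust
use numpy::PyReadonlyArray2;
use pyo3::prelude::*;

/// Accepting NumPy arrays as read-only views means the adjacency and error
/// matrices built on the Python side are borrowed, not copied, when the call
/// crosses into Rust.
#[pyfunction]
fn best_subset_demo(
    num_qubits: usize,
    adjacency: PyReadonlyArray2<f64>,
    error_matrix: PyReadonlyArray2<f64>,
) -> PyResult<Vec<usize>> {
    // `as_array()` borrows the underlying NumPy buffer without copying it.
    let adj = adjacency.as_array();
    let err = error_matrix.as_array();
    // ... run the parallel subset search over `adj` and `err` here ...
    let _ = (adj, err, num_qubits);
    Ok(Vec::new())
}
```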

Details and comments

  • Update 4 dense layout tests for slightly different output
  • Add release notes
  • Add rust docstrings
  • Additional benchmarking and tuning

@mtreinish added the performance, Changelog: New Feature, and Rust labels Mar 5, 2022
@mtreinish added this to the 0.20 milestone Mar 5, 2022
@mtreinish requested a review from a team as a code owner March 5, 2022 19:55
This commit fixes the 4 remaining test failures. The results from the
Rust version of the pass were correct but different from the results of
the Python version, because the parallel reduce() was comparing in a
different order and therefore returning a different subgraph. This commit
reverses the argument order to correct this, so the behavior should be
identical to the previous implementation.
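
To make the tie-breaking concrete, here is an illustrative comparison (not the actual code): with a strict `>`, the first operand wins ties, so the order in which the parallel reduction passes its two operands determines which of two equally-scored subgraphs survives.

```rust
/// Keeps `a` on a tie, matching a left-to-right serial scan; calling this as
/// `better(b, a)` instead keeps `b` on ties and can return a different
/// (still valid) subgraph.
fn better(a: (usize, f64), b: (usize, f64)) -> (usize, f64) {
    if b.1 > a.1 { b } else { a }
}
```
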
@coveralls

coveralls commented Mar 6, 2022

Pull Request Test Coverage Report for Build 1965542553

  • 29 of 29 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.005%) to 83.461%

Totals Coverage Status
  • Change from base Build 1965461459: -0.005%
  • Covered Lines: 52394
  • Relevant Lines: 62777

💛 - Coveralls

The error matrix building was not working because it was comparing a
list of qubits to a tuple. This logic was used prior to the Rust rewrite as
well, so we probably were not actually checking noise properties prior to
this commit.
In an earlier commit we fixed the noise awareness of the layout pass;
doing this had a side effect of changing a test that was looking
explicitly for the layout found by the pass. Since the pass is now
correctly using error rates, the best layout found is different. This
commit updates the tests to account for this.
@mtreinish changed the title from "[WIP] Reimplement DenseLayout as a Parallel Algorithm in Rust" to "Reimplement DenseLayout as a Parallel Algorithm in Rust" Mar 7, 2022
@mtreinish
Member Author

I ran a sweep with QV circuits matching the width of a heavy hex coupling map from 19 qubits to 1081 qubits on two systems:

[benchmark plots: 32core_linear, 6core_linear]

(the 6 physical core machine is my laptop, so there might have been some thermal throttling going on too)

So I'm not sure how much extra value there is in further tuning here with numbers like this.

@kevinhartman self-assigned this Mar 7, 2022
@ewinston
Contributor

ewinston commented Mar 8, 2022

Here is the scaling performance I saw for square random circuits run on 16 cores:

[plot: parallel version scaling on 16 cores]

For the non-parallel version:

[plot: non-parallel version scaling]

The parallel version is a couple of orders of magnitude faster; however, the scaling is a bit worse, which appears to agree with the downturn in the performance ratio you saw.

@mtreinish
Member Author

mtreinish commented Mar 8, 2022

The parallel version is a couple of orders of magnitude faster; however, the scaling is a bit worse, which appears to agree with the downturn in the performance ratio you saw.

We should do some profiling to see if we can figure out where the scaling limits in the Rust code are coming from and whether there is a way to address them. I don't think it'll be an issue in the near term, because getting a 2-3 order of magnitude improvement out to thousands of qubits will get us pretty far, but if it's something that's easy to address it's probably better to do it now than to debug a slowdown when we're working with 100k qubits.

Co-authored-by: Kevin Hartman <kevin@hart.mn>
This commit reduces the overhead of the internal loop over the BFS-sorted
nodes used to create the subgraph by making two changes. First, instead
of passing around the full BFS-sorted list everywhere, this commit
truncates it to just the nodes we're going to use. We never use any
qubits beyond the first num_qubits entries of the BFS sort, so we can just
truncate the Vec up front and not have to worry about limiting to the first
num_qubits elements at every use. The second change is that instead of
traversing the BFS nodes to check whether a node is in the subgraph, we
create an intermediate set to do a constant-time lookup for the
membership check.
This commit adjusts the truncation logic around the BFS sort. In the
previous commit we truncated the BFS result to just the first n qubits
(where n is the number of qubits in the circuit) after performing a full
traversal and generating the full list. Instead of doing that, we can stop
the search once we've found n qubits and save ourselves the extra work of
continuing to traverse the adjacency matrix.
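
A minimal sketch of the two optimizations described in these commit messages, not the actual qiskit._accelerate code (the real implementation uses an IndexSet; a std HashSet stands in here): the BFS terminates as soon as num_qubits nodes have been collected, and visited-node membership is checked against a set rather than by scanning the BFS list.

```rust
use std::collections::{HashSet, VecDeque};

/// BFS from `start` over an adjacency matrix, stopping as soon as
/// `num_qubits` device qubits have been collected, with constant-time
/// membership checks via a set instead of a linear scan of the BFS list.
fn truncated_bfs(start: usize, num_qubits: usize, adjacency: &[Vec<u8>]) -> Vec<usize> {
    let mut order: Vec<usize> = Vec::with_capacity(num_qubits);
    let mut seen: HashSet<usize> = HashSet::with_capacity(num_qubits);
    let mut queue: VecDeque<usize> = VecDeque::new();
    queue.push_back(start);
    seen.insert(start);
    while let Some(node) = queue.pop_front() {
        order.push(node);
        // Early exit: no need to keep traversing the adjacency matrix once
        // enough qubits have been found.
        if order.len() == num_qubits {
            break;
        }
        for (nbr, &connected) in adjacency[node].iter().enumerate() {
            if connected != 0 && !seen.contains(&nbr) {
                seen.insert(nbr);
                queue.push_back(nbr);
            }
        }
    }
    order
}
```
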
@mtreinish
Member Author

After the most recent changes based on the suggestions by @kevinhartman, the overall performance improved quite substantially, although it looks like we still hit a scaling wall at roughly the same point:

[plot: 32core_linear benchmark]

@mtreinish
Member Author

mtreinish commented Mar 9, 2022

I took a quick look at the profile of a 3000 qubit example and am not seeing an obvious hot spot. In parallel most of the work is in the parallel closure context and when I modified the iterator to be serially executed (changing into_par_iter() into into_iter() and adjusting things to compile) the perf profile just shows that ~95% of the time is in qiskit_accelerate::dense_layout::best_subset{{closure}} and everything below that is opaque. Which isn't super helpful.

Edit: ignore this I forgot to build with debug symbols in the binary, that's what I get for doing this before my morning coffee. Looking at it again it's spending ~87% of the time in the bfs_sort function (which gets inlined so there isn't much more useful detail than that in the profile besides ~1% of the time is on IndexSet insert and ~2% on enumerate iteration over the adjacency matrix)

@mtreinish
Member Author

Thinking about this a bit more and talking to @georgios-ts, I think the next step to address the scaling problems is potentially to add an implementation of this function to retworkx and leverage the better, more efficient graph representation retworkx uses (the scaling limits seem to be mostly around doing the BFS on the adjacency matrix). The coupling graph is already a retworkx graph internally, so we could write this same basic algorithm directly on the graph data structure. Ideally we'd be able to do this in qiskit._accelerate directly and call retworkx's graph directly because both are Rust. But unfortunately we're not able to share PyO3/Rust Python objects between Rust libraries (see: PyO3/pyo3#1444 for more details), so working directly off of the retworkx graph isn't an option yet.

As an intermediate step we could rewrite this algorithm using petgraph (which is the upstream library that retworkx uses for the graph representation) in qiskit._accelerate. But for this PR I'm quite happy with this implementation as a first step: it provides a huge speedup with minimal changes to the code (at the cost of increased memory consumption), and a petgraph implementation would be a larger change. My thinking is we can leave either rewriting this in petgraph or adding a function to retworkx as a follow-on step after this initial implementation merges.

This commit adds env-variable-based switching to prevent us from running
dense layout in parallel when we're already running under parallel_map,
so that we don't overload the system.
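
A hedged sketch of that switching, assuming an environment variable set by the Python-side parallel_map (the variable name and value below are an assumption of this example, not a confirmed detail of the commit): when the process is already one of several workers, fall back to a serial scan instead of spawning rayon threads on top of it.

```rust
use rayon::prelude::*;

/// Compute per-start scores, serially when nested inside parallel_map workers
/// and in parallel otherwise, to avoid oversubscribing the machine.
fn candidate_scores(num_starts: usize) -> Vec<f64> {
    // QISKIT_IN_PARALLEL is assumed here to be set to "TRUE" inside workers.
    let already_parallel = std::env::var("QISKIT_IN_PARALLEL")
        .map(|v| v == "TRUE")
        .unwrap_or(false);
    let score = |start: usize| -> f64 {
        // Placeholder for the per-start BFS and scoring work.
        start as f64
    };
    if already_parallel {
        (0..num_starts).map(score).collect()
    } else {
        (0..num_starts).into_par_iter().map(score).collect()
    }
}
```
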
@mtreinish requested a review from kevinhartman March 10, 2022 21:27
Contributor

@kevinhartman left a comment


LGTM.

@jakelishman
Member

jakelishman commented Mar 10, 2022

I feel like we should at least acknowledge that Matthew called this branch cthulhu_mckee before it merges 🦑🐉 (that's the closest Cthulhu approximation I could make...)

@mergify bot merged commit ec2a2a6 into Qiskit:main Mar 10, 2022
@mtreinish deleted the cthulhu_mckee branch March 11, 2022 12:28
@mtreinish
Member Author

Hah, I'm glad someone noticed. I just kept misreading reverse_cuthill_mckee() as reverse_cthulhu_mckee() when I was developing this and named the branch as an inside joke to myself :)
