Reimplement DenseLayout as a Parallel Algorithm in Rust #7740

Merged
merged 21 commits into Qiskit:main from cthulhu_mckee on Mar 10, 2022

Conversation

@mtreinish
Member

@mtreinish commented Mar 5, 2022

Summary

This commit re-implements the core of the DenseLayout transpiler pass
in Rust so that it runs in multiple threads. Previously this algorithm used
scipy sparse matrices to create a CSR sparse matrix representation of
the coupling graph and iterated over that to find a densely connected
subgraph in the coupling graph. This performs and scales well for
modest-sized circuits and coupling graphs, but as the size of the coupling
graphs and circuits grows, running the algorithm by iterating over the
sparse matrix in Python starts to hit a limit. The underlying traversal
can be executed efficiently in parallel in Rust, since the algorithm
iterates over the coupling graph in BFS order from each node to try to
find the best subgraph: we can do the BFS traversals in parallel and
then iteratively compare the results in parallel until the best is
found. This greatly speeds up the execution of the pass; for example,
running on a 1081 qubit quantum volume circuit on a 1081 qubit
heavy hexagon coupling graph takes ~135 seconds with the previous
version and ~0.14 seconds after this commit (on my local workstation
with 32 physical cores and 64 logical cores; scaling likely won't be as
good on smaller systems).
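
A minimal sketch of the parallel structure described above, assuming the rayon crate; `subset_from_bfs`, `SubsetResult`, and the scoring are illustrative stand-ins, not the actual qiskit._accelerate code. Each physical qubit is used as a BFS starting point in parallel, and the per-start candidates are then combined pairwise (also in parallel) until only the best subset remains.

```rust
use rayon::prelude::*;

/// One candidate layout: the subset of physical qubits reached from `start`
/// plus a connectivity score (higher is better in this toy version).
#[derive(Clone)]
struct SubsetResult {
    start: usize,
    score: f64,
    subset: Vec<usize>,
}

/// Placeholder for the real per-start work: walk the coupling graph in BFS
/// order from `start`, keep the first `num_qubits` qubits reached, and score
/// how densely connected (and how low-error) that subset is.
fn subset_from_bfs(start: usize, num_qubits: usize, adjacency: &[Vec<u8>]) -> SubsetResult {
    let end = (start + num_qubits).min(adjacency.len());
    let subset: Vec<usize> = (start..end).collect();
    let score = subset.len() as f64;
    SubsetResult { start, score, subset }
}

/// Try every physical qubit as a BFS starting point in parallel, then reduce
/// the candidates pairwise in parallel, keeping the better of each pair.
fn best_subset(num_qubits: usize, adjacency: &[Vec<u8>]) -> Option<SubsetResult> {
    (0..adjacency.len())
        .into_par_iter()
        .map(|start| subset_from_bfs(start, num_qubits, adjacency))
        .reduce_with(|a, b| if b.score > a.score { b } else { a })
}
```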

The tradeoff here is slightly increased memory consumption: to have a shared
representation of the adjacency matrix (and the error matrix) between Python
and Rust we use numpy arrays, as they can be passed by reference between the
languages. In practice this will not matter much until the graphs get truly
large (e.g. representing a 10000 qubit adjacency matrix and error matrix would
require ~1.6 GB of memory), and if it does become an issue (either for memory
or runtime performance) we can add a shared compressed sparse matrix
representation to Qiskit for use in both Python and Rust.
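
A hedged sketch of how a NumPy array can be handed to Rust by reference, assuming the pyo3 and numpy crates; the function name and exact signature are illustrative only, not the actual qiskit._accelerate interface.

```rust
use numpy::PyReadonlyArray2;
use pyo3::prelude::*;

/// Accepting NumPy arrays as read-only views means the adjacency and error
/// matrices built on the Python side are borrowed, not copied, when the call
/// crosses into Rust.
#[pyfunction]
fn best_subset_demo(
    num_qubits: usize,
    adjacency: PyReadonlyArray2<f64>,
    error_matrix: PyReadonlyArray2<f64>,
) -> PyResult<Vec<usize>> {
    // `as_array()` borrows the underlying NumPy buffer without copying it.
    let adj = adjacency.as_array();
    let err = error_matrix.as_array();
    // ... run the parallel subset search over `adj` and `err` here ...
    let _ = (adj, err, num_qubits);
    Ok(Vec::new())
}
```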

Details and comments

  • Update 4 dense layout tests for slightly different output
  • Add release notes
  • Add rust docstrings
  • Additional benchmarking and tuning

@mtreinish added the performance, Changelog: New Feature, and Rust labels Mar 5, 2022
@mtreinish added this to the 0.20 milestone Mar 5, 2022
@mtreinish requested a review from a team as a code owner March 5, 2022 19:55
This commit fixes the 4 remaining test failures. The results from the
Rust version of the pass were correct but different from the results of
the Python version, because the parallel reduce() was comparing in a
different order and therefore returning a different subgraph. This commit
reverses the argument order to correct this, so the behavior should be
identical to the previous implementation.
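
To make the tie-breaking concrete, here is an illustrative comparison (not the actual code): with a strict `>`, the first operand wins ties, so the order in which the parallel reduction passes its two operands determines which of two equally-scored subgraphs survives.

```rust
/// Keeps `a` on a tie, matching a left-to-right serial scan; calling this as
/// `better(b, a)` instead keeps `b` on ties and can return a different
/// (still valid) subgraph.
fn better(a: (usize, f64), b: (usize, f64)) -> (usize, f64) {
    if b.1 > a.1 { b } else { a }
}
```
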
@coveralls

coveralls commented Mar 6, 2022

Pull Request Test Coverage Report for Build 1965542553

  • 29 of 29 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.005%) to 83.461%

Totals Coverage Status
  • Change from base Build 1965461459: -0.005%
  • Covered Lines: 52394
  • Relevant Lines: 62777

💛 - Coveralls

The error matrix building was not working because it was comparing a
list of qubits to a tuple. This logic was used prior to the Rust rewrite as
well, so we probably were not actually checking noise properties prior to
this commit.
In an earlier commit we fixed the noise awareness of the layout pass;
doing this had a side effect of changing a test that was looking
explicitly for the layout found by the pass. Since the pass is now
correctly using error rates, the best layout found is different. This
commit updates the tests to account for this.
@mtreinish changed the title from "[WIP] Reimplement DenseLayout as a Parallel Algorithm in Rust" to "Reimplement DenseLayout as a Parallel Algorithm in Rust" Mar 7, 2022
@mtreinish
Member Author

I ran a sweep with QV circuits matching the width of a heavy hex coupling map from 19 qubits to 1081 qubits on two systems:

[benchmark plots: 32core_linear, 6core_linear]

(the 6 physical core machine is my laptop, so there might have been some thermal throttling going on too)

So I'm not sure how much extra value there is in further tuning here with numbers like this.

@kevinhartman self-assigned this Mar 7, 2022
@ewinston
Contributor

ewinston commented Mar 8, 2022

Here is the scaling performance I saw for square random circuits run on 16 cores:

[plot: parallel version scaling on 16 cores]

For the non-parallel version:

[plot: non-parallel version scaling]

The parallel version is a couple of orders of magnitude faster; however, the scaling is a bit worse, which appears to agree with the downturn in the performance ratio you saw.

@mtreinish
Member Author

mtreinish commented Mar 8, 2022

The parallel version is a couple of orders of magnitude faster; however, the scaling is a bit worse, which appears to agree with the downturn in the performance ratio you saw.

We should do some profiling to see if we can figure out where the scaling limits in the Rust code are coming from and whether there is a way to address them. I don't think it'll be an issue in the near term, because getting a 2-3 order of magnitude improvement out to thousands of qubits will get us pretty far, but if it's something that's easy to address it's probably better to do it now than to debug a slowdown when we're working with 100k qubits.

Co-authored-by: Kevin Hartman <kevin@hart.mn>
This commit reduces the overhead of the internal loop over the BFS-sorted
nodes used to create the subgraph by making two changes. First, instead
of passing around the full BFS-sorted list everywhere, this commit
truncates it to just the nodes we're going to use. We never use any
qubits beyond the first num_qubits entries of the BFS sort, so we can just
truncate the Vec up front and not have to worry about limiting to the first
num_qubits elements at every use. The second change is that instead of
traversing the BFS nodes to check whether a node is in the subgraph, we
create an intermediate set to do a constant-time lookup for the
membership check.
This commit adjusts the truncation logic around the BFS sort. In the
previous commit we truncated the BFS result to just the first n qubits
(where n is the number of qubits in the circuit) after performing a full
traversal and generating the full list. Instead of doing that, we can stop
the search once we've found n qubits and save ourselves the extra work of
continuing to traverse the adjacency matrix.
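
A minimal sketch of the two optimizations described in these commit messages, not the actual qiskit._accelerate code (the real implementation uses an IndexSet; a std HashSet stands in here): the BFS terminates as soon as num_qubits nodes have been collected, and visited-node membership is checked against a set rather than by scanning the BFS list.

```rust
use std::collections::{HashSet, VecDeque};

/// BFS from `start` over an adjacency matrix, stopping as soon as
/// `num_qubits` device qubits have been collected, with constant-time
/// membership checks via a set instead of a linear scan of the BFS list.
fn truncated_bfs(start: usize, num_qubits: usize, adjacency: &[Vec<u8>]) -> Vec<usize> {
    let mut order: Vec<usize> = Vec::with_capacity(num_qubits);
    let mut seen: HashSet<usize> = HashSet::with_capacity(num_qubits);
    let mut queue: VecDeque<usize> = VecDeque::new();
    queue.push_back(start);
    seen.insert(start);
    while let Some(node) = queue.pop_front() {
        order.push(node);
        // Early exit: no need to keep traversing the adjacency matrix once
        // enough qubits have been found.
        if order.len() == num_qubits {
            break;
        }
        for (nbr, &connected) in adjacency[node].iter().enumerate() {
            if connected != 0 && !seen.contains(&nbr) {
                seen.insert(nbr);
                queue.push_back(nbr);
            }
        }
    }
    order
}
```
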
@mtreinish
Member Author

After the most recent changes based on the suggestions by @kevinhartman, the overall performance improved quite substantially, although it looks like we still hit a scaling wall at roughly the same point:

[plot: 32core_linear benchmark]

@mtreinish
Member Author

mtreinish commented Mar 9, 2022

I took a quick look at the profile of a 3000 qubit example and am not seeing an obvious hot spot. In parallel most of the work is in the parallel closure context and when I modified the iterator to be serially executed (changing into_par_iter() into into_iter() and adjusting things to compile) the perf profile just shows that ~95% of the time is in qiskit_accelerate::dense_layout::best_subset{{closure}} and everything below that is opaque. Which isn't super helpful.

Edit: ignore this I forgot to build with debug symbols in the binary, that's what I get for doing this before my morning coffee. Looking at it again it's spending ~87% of the time in the bfs_sort function (which gets inlined so there isn't much more useful detail than that in the profile besides ~1% of the time is on IndexSet insert and ~2% on enumerate iteration over the adjacency matrix)

@mtreinish
Member Author

Thinking about this a bit more and talking to @georgios-ts, I think the next step to address the scaling problems is potentially to add an implementation of this function to retworkx and leverage the better, more efficient graph representation retworkx uses (the scaling limits seem to be mostly around doing the BFS on the adjacency matrix). The coupling graph is already a retworkx graph internally, so we could write this same basic algorithm directly on the graph data structure. Ideally we'd be able to do this in qiskit._accelerate directly and call retworkx's graph directly because both are Rust. But unfortunately we're not able to share PyO3/Rust Python objects between Rust libraries (see: PyO3/pyo3#1444 for more details), so working directly off of the retworkx graph isn't an option yet.

As an intermediate step we could rewrite this algorithm using petgraph (which is the upstream library that retworkx uses for the graph representation) in qiskit._accelerate. But for this PR I'm quite happy with this implementation as a first step: it provides a huge speedup with minimal changes to the code (at the cost of increased memory consumption), and a petgraph implementation would be a larger change. My thinking is we can leave either rewriting this in petgraph or adding a function to retworkx as a follow-on step after this initial implementation merges.

This commit adds env-variable-based switching to prevent us from running
dense layout in parallel when we're already running under parallel_map,
so that we don't overload the system.
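
A hedged sketch of that switching, assuming an environment variable set by the Python-side parallel_map (the variable name and value below are an assumption of this example, not a confirmed detail of the commit): when the process is already one of several workers, fall back to a serial scan instead of spawning rayon threads on top of it.

```rust
use rayon::prelude::*;

/// Compute per-start scores, serially when nested inside parallel_map workers
/// and in parallel otherwise, to avoid oversubscribing the machine.
fn candidate_scores(num_starts: usize) -> Vec<f64> {
    // QISKIT_IN_PARALLEL is assumed here to be set to "TRUE" inside workers.
    let already_parallel = std::env::var("QISKIT_IN_PARALLEL")
        .map(|v| v == "TRUE")
        .unwrap_or(false);
    let score = |start: usize| -> f64 {
        // Placeholder for the per-start BFS and scoring work.
        start as f64
    };
    if already_parallel {
        (0..num_starts).map(score).collect()
    } else {
        (0..num_starts).into_par_iter().map(score).collect()
    }
}
```
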
@mtreinish requested a review from kevinhartman March 10, 2022 21:27
Contributor

@kevinhartman left a comment


LGTM.

@jakelishman
Member

jakelishman commented Mar 10, 2022

I feel like we should at least acknowledge that Matthew called this branch cthulhu_mckee before it merges 🦑🐉 (that's the closest Cthulhu approximation I could make...)

@mergify bot merged commit ec2a2a6 into Qiskit:main Mar 10, 2022
@mtreinish deleted the cthulhu_mckee branch March 11, 2022 12:28
@mtreinish
Member Author

Hah, I'm glad someone noticed. I just kept misreading reverse_cuthill_mckee() as reverse_cthulhu_mckee() when I was developing this and named the branch as an inside joke to myself :)
