Reimplement DenseLayout as a Parallel Algorithm in Rust #7740
Conversation
This commit reimplements the core of the DenseLayout transpiler pass in Rust so that it runs in multiple threads. Previously this algorithm used scipy sparse matrices to create a CSR sparse matrix representation of the coupling graph and iterated over that to find a densely connected subgraph of the coupling graph. This performs and scales well for modest sized circuits and coupling graphs, but as the size of the coupling graphs and circuits grows, running the algorithm by iterating over the sparse matrix in Python starts to hit a limit.

The underlying traversal can be executed efficiently in parallel in Rust, since the algorithm does a BFS of the coupling graph from each node to try to find the best subgraph. We can do the BFS traversals in parallel and then iteratively compare the results in parallel until the best is found. This greatly speeds up the execution of the pass: for example, running on a 1081 qubit quantum volume circuit on a 1081 qubit heavy hexagon coupling graph takes ~134 seconds with the previous implementation and ~0.144 seconds after this commit (on my local workstation with 32 physical cores and 64 logical cores; scaling likely won't be as good on smaller systems).

The tradeoff here is slightly increased memory consumption: to have a shared representation of the adjacency matrix (and the error matrix) between Python and Rust we use numpy arrays, as they can be passed by reference between the languages. In practice this will not matter much until the graphs get truly large (e.g. representing a 10000 qubit adjacency matrix and error matrix would require ~1.6 GB of memory), and if it does become an issue (either for memory or runtime performance) we can add a shared compressed sparse matrix representation to Qiskit for use in both Python and Rust.
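Roughly, the approach looks like the sketch below. This is illustrative only, not the code added by the commit: it assumes a plain `Vec<Vec<f64>>` adjacency matrix and a simple internal-edge-count score, whereas the real pass shares numpy arrays with Python via rust-numpy and also folds in error rates.

```rust
// Illustrative sketch of the strategy: one BFS per start node, run in
// parallel with rayon, then a parallel reduction to pick the densest subset.
use rayon::prelude::*;
use std::collections::VecDeque;

/// BFS over a dense adjacency matrix starting from `start`, returning the
/// nodes in the order they were reached.
fn bfs_order(adj: &[Vec<f64>], start: usize) -> Vec<usize> {
    let mut visited = vec![false; adj.len()];
    let mut order = Vec::new();
    let mut queue = VecDeque::new();
    visited[start] = true;
    queue.push_back(start);
    while let Some(node) = queue.pop_front() {
        order.push(node);
        for (next, &weight) in adj[node].iter().enumerate() {
            if weight != 0.0 && !visited[next] {
                visited[next] = true;
                queue.push_back(next);
            }
        }
    }
    order
}

/// Score a candidate subset by counting the edges internal to it
/// (denser subgraphs score higher).
fn score(adj: &[Vec<f64>], subset: &[usize]) -> usize {
    subset
        .iter()
        .map(|&u| subset.iter().filter(|&&v| v != u && adj[u][v] != 0.0).count())
        .sum::<usize>()
        / 2
}

/// Run one BFS per start node in parallel, then reduce to the densest subset.
fn best_subset(adj: &[Vec<f64>], num_qubits: usize) -> Vec<usize> {
    (0..adj.len())
        .into_par_iter()
        .map(|start| {
            let subset: Vec<usize> =
                bfs_order(adj, start).into_iter().take(num_qubits).collect();
            (score(adj, &subset), subset)
        })
        // Keep the better-scoring candidate; prefer the left-hand argument on
        // ties so the result matches a serial left-to-right scan.
        .reduce_with(|a, b| if b.0 > a.0 { b } else { a })
        .map(|best| best.1)
        .unwrap_or_default()
}

fn main() {
    // Toy 4-node path graph 0-1-2-3; pick the densest 3-qubit subset.
    let adj = vec![
        vec![0.0, 1.0, 0.0, 0.0],
        vec![1.0, 0.0, 1.0, 0.0],
        vec![0.0, 1.0, 0.0, 1.0],
        vec![0.0, 0.0, 1.0, 0.0],
    ];
    println!("{:?}", best_subset(&adj, 3));
}
```

Keeping each per-start-node BFS independent inside the `map` is what lets rayon spread the work across all available cores; only the final comparison needs to see more than one candidate.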
This commit fixes the 4 remaining test failures. The results from the Rust version of the pass were correct but different from the results of the Python version. This is because the parallel reduce() was comparing candidates in a different order and therefore returning a different subgraph. This commit reverses the argument order of the comparison so the behavior should be identical to the previous implementation.
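As a toy illustration of why the reduction's argument order matters (this is not the pass's actual comparison function): when two candidates tie on score, the argument the closure prefers decides which subgraph wins, so flipping the order flips the result.

```rust
use rayon::prelude::*;

fn main() {
    // Two candidates with equal scores; only the tie-breaking side differs.
    let candidates = vec![(3usize, "subgraph from node 0"), (3, "subgraph from node 5")];

    // `if b.0 > a.0 { b } else { a }` keeps the left-hand candidate on ties,
    // matching what a serial left-to-right scan would return.
    let keep_left = candidates
        .clone()
        .into_par_iter()
        .reduce_with(|a, b| if b.0 > a.0 { b } else { a });

    // Swapping the comparison keeps the right-hand candidate on ties:
    // same score, but a different subgraph wins.
    let keep_right = candidates
        .into_par_iter()
        .reduce_with(|a, b| if a.0 > b.0 { a } else { b });

    println!("{:?} vs {:?}", keep_left, keep_right);
}
```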
Pull Request Test Coverage Report for Build 1965542553
💛 - Coveralls
The error matrix building was not working because it was comparing a list of qubits to a tuple. This code predates the Rust rewrite, so we probably were not actually checking noise properties prior to this commit.
In an earlier commit we fixed the noise awareness of the layout pass. Doing this had a side effect of changing a test that was looking explicitly for the layout found by the pass: since the pass is now correctly using error rates, the best layout found is different. This commit updates the tests to account for this.
I ran a sweep with QV circuits matching the width of a heavy hex coupling map from 19 qubits to 1081 qubits on two systems (the 6 physical core machine is my laptop, so there might have been some thermal throttling going on too). So I'm not sure how much extra value there is in further tuning here with numbers like this.
We should do some profiling to see if we can figure out where the scaling limits in the Rust code are coming from and whether there is a way to address them. I don't think it'll be an issue in the near term, because getting a 2-3 order of magnitude improvement into thousands of qubits will get us pretty far, but if it's something that's easy to address it's probably better to do it now than trying to debug a slowdown when we are working with 100k qubits.
Co-authored-by: Kevin Hartman <kevin@hart.mn>
This commit reduces the overhead of the internal loop over the BFS sorted nodes used to create the subgraph by making 2 changes. First, instead of passing around the full BFS sorted list everywhere, this commit truncates it to just the nodes we're going to use. We never use any qubits in the BFS sort beyond num_qubits, so we can just truncate the Vec and not have to worry about limiting to the first num_qubits elements everywhere else. The second change is that instead of traversing the BFS nodes to check whether a node is in the subgraph, we create an intermediate set to do a constant time lookup for the membership check.
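In sketch form (a hypothetical helper, not the commit's actual code), the two changes amount to truncating the Vec once and building a set alongside it for membership checks:

```rust
use std::collections::HashSet;

/// Illustrative only: keep just the first `num_qubits` nodes of the BFS
/// order, and build a HashSet from them so "is this node in the subgraph?"
/// is an O(1) lookup instead of a scan over the BFS list.
fn truncate_and_index(mut bfs_order: Vec<usize>, num_qubits: usize) -> (Vec<usize>, HashSet<usize>) {
    bfs_order.truncate(num_qubits);
    let members: HashSet<usize> = bfs_order.iter().copied().collect();
    (bfs_order, members)
}

fn main() {
    let (order, members) = truncate_and_index(vec![4, 2, 7, 1, 9, 3], 4);
    assert_eq!(order, vec![4, 2, 7, 1]);
    assert!(members.contains(&7) && !members.contains(&9));
    println!("kept {:?}", order);
}
```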
This commit adjusts the truncation logic around the BFS sort. In the previous commit we truncated the BFS result to just the first n qubits (where n is the number of qubits in the circuit) after performing a full traversal and generating the full list. Instead of doing that, we can stop the search once we've found n qubits and save ourselves the extra work of continuing to traverse the adjacency matrix.
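A minimal sketch of the early exit, again illustrative rather than the commit's actual code, assuming the same dense adjacency matrix representation as the earlier sketch:

```rust
use std::collections::VecDeque;

/// Illustrative only: BFS over a dense adjacency matrix that stops as soon
/// as `num_qubits` nodes have been found, rather than traversing the whole
/// graph and truncating afterwards.
fn bfs_first_n(adj: &[Vec<f64>], start: usize, num_qubits: usize) -> Vec<usize> {
    let mut visited = vec![false; adj.len()];
    let mut order = Vec::with_capacity(num_qubits);
    let mut queue = VecDeque::new();
    visited[start] = true;
    queue.push_back(start);
    while let Some(node) = queue.pop_front() {
        order.push(node);
        if order.len() == num_qubits {
            break; // stop early: no need to visit the rest of the graph
        }
        for (next, &weight) in adj[node].iter().enumerate() {
            if weight != 0.0 && !visited[next] {
                visited[next] = true;
                queue.push_back(next);
            }
        }
    }
    order
}

fn main() {
    // 5-node path graph 0-1-2-3-4; ask for the first 3 nodes from node 0.
    let mut adj = vec![vec![0.0; 5]; 5];
    for i in 0..4 {
        adj[i][i + 1] = 1.0;
        adj[i + 1][i] = 1.0;
    }
    assert_eq!(bfs_first_n(&adj, 0, 3), vec![0, 1, 2]);
}
```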
After the most recent changes based on the suggestions by @kevinhartman the overall performance improved quite substantially, although it looks like we still hit a scaling wall at roughly the same point.
I took a quick look at the profile of a 3000 qubit example and am not seeing an obvious hot spot. In parallel, most of the work is in the parallel closure context, and when I modified the iterator to be serially executed (changing …)

Edit: ignore this, I forgot to build with debug symbols in the binary; that's what I get for doing this before my morning coffee. Looking at it again, it's spending ~87% of the time in …
Thinking about this a bit more and talking to @georgios-ts, I think the next step to address the scaling problems is potentially to add an implementation of this function to retworkx and leverage the better and more efficient graph representation retworkx uses (the scaling limits seem mostly to be around doing the BFS on the adjacency matrix). The coupling graph is already a retworkx graph internally, so we could write this same basic algorithm directly on the graph data structure. Ideally we'd be able to do this in …

As an intermediate step we could rewrite this algorithm using petgraph (which is the upstream library that retworkx uses for the graph representation) in …
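For a sense of what that intermediate step could look like, here is a rough petgraph-based sketch (illustrative only, assuming petgraph's `UnGraph` and `Bfs` visitor; this is not code from retworkx or from this PR):

```rust
use petgraph::graph::{NodeIndex, UnGraph};
use petgraph::visit::Bfs;

fn main() {
    // Build a small undirected coupling graph directly, instead of going
    // through a dense adjacency matrix.
    let mut graph: UnGraph<(), ()> = UnGraph::new_undirected();
    let nodes: Vec<NodeIndex> = (0..5).map(|_| graph.add_node(())).collect();
    for w in nodes.windows(2) {
        graph.add_edge(w[0], w[1], ());
    }

    // BFS from each node using petgraph's visitor; the adjacency-list
    // representation only touches actual neighbors instead of scanning a
    // full matrix row per step.
    for &start in &nodes {
        let mut bfs = Bfs::new(&graph, start);
        let mut order = Vec::new();
        while let Some(node) = bfs.next(&graph) {
            order.push(node.index());
        }
        println!("BFS from {}: {:?}", start.index(), order);
    }
}
```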
This commit adds env variable based switching so that we don't run dense layout in parallel when we're already running under parallel_map, to avoid overloading the system.
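In outline, the switching pattern looks something like the sketch below (illustrative only; the exact environment variable name, `QISKIT_IN_PARALLEL`, and the helper's shape are assumptions here, not the code added by the commit):

```rust
use rayon::prelude::*;
use std::env;

/// Sketch of env-variable based switching: if the process is already running
/// under Python-side multiprocessing (parallel_map is assumed to set
/// QISKIT_IN_PARALLEL=TRUE in the environment), fall back to a serial
/// iterator so we don't oversubscribe the machine.
fn run_scores(starts: &[usize], score: impl Fn(usize) -> usize + Sync) -> Vec<usize> {
    let in_parallel = env::var("QISKIT_IN_PARALLEL")
        .map(|v| v == "TRUE")
        .unwrap_or(false);
    if in_parallel {
        starts.iter().map(|&s| score(s)).collect()
    } else {
        starts.par_iter().map(|&s| score(s)).collect()
    }
}

fn main() {
    let starts: Vec<usize> = (0..8).collect();
    let scores = run_scores(&starts, |s| s * s); // toy scoring function
    println!("{:?}", scores);
}
```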
LGTM.
I feel like we should at least acknowledge that Matthew called this branch …
Hah, I'm glad someone noticed. I just kept misreading …