Rebalance CircuitInstruction and PackedInstruction (Qiskit#12730)

* Rebalance `CircuitInstruction` and `PackedInstruction` This is a large overhaul of how circuit instructions are both stored in Rust (`PackedInstruction`) and how they are presented to Python (`CircuitInstruction`). In summary: * The old `OperationType` enum is now collapsed into a manually managed `PackedOperation`. This is logically equivalent, but stores a `PyGate`/`PyInstruction`/`PyOperation` indirectly through a boxed pointer, and stores a `StandardGate` inline. As we expect the vast majority of gates to be standard, this hugely reduces the memory usage. The enumeration is manually compressed to a single pointer, hiding the discriminant in the low, alignment-required bytes of the pointer. * `PackedOperation::view()` unpacks the operation into a proper reference-like enumeration `OperationRef<'a>`, which implements `Operation` (though there is also a `try_standard_gate` method to get the gate without unpacking the whole enumeration). * Both `PackedInstruction` and `CircuitInstruction` use this `PackedOperation` as the operation storage. * `PackedInstruction` is now completely the Rust-space format for data, and `CircuitInstruction` is purely for communication with Python. On my machine, this commit brings the utility-scale benchmarks to within 10% of the runtime of 1.1.0 (and some to parity), despite all the additional overhead. Changes to accepting and building Python objects ------------------------------------------------ * A `PackedInstruction` is created by copy constructor from a `CircuitInstruction` by `CircuitData::pack`. There is no `pack_owned` (really, there never was - the previous method didn't take ownership) because there's never owned `CircuitInstruction`s coming in; they're Python-space interop, so we never own them (unless we clone them) other than when we're unpacking them. * `PackedInstruction` is currently just created manually when not coming from a `CircuitInstruction`. It's not hard, and makes it easier to re-use known intern indices than to waste time re-interning them. There is no need to go via `CircuitInstruction`. * `CircuitInstruction` now has two separated Python-space constructors: the old one, which is the default and takes `(operation, qubits, clbits)` (and extracts the information), and a new fast-path `from_standard` which asks only for the standard gate, qubits and params, avoiding operator construction. * To accept a Python-space operation, extract a Python object to `OperationFromPython`. This extracts the components that are separate in Rust space, but joined in Python space (the operation, params and extra attributes). This replaces `OperationInput` and `OperationTypeConstruct`, being more efficient at the extraction, including providing the data in the formats needed for `PackedInstruction` or `CircuitInstruction`. * To retrieve the Python-space operation, use `CircuitInstruction::get_operation` or `PackedInstruction::unpack_py_op` as appropriate. Both will cache and reuse the op, if `cache_pygates` is active. (Though note that if the op is created by `CircuitInstruction`, it will not propagate back to a `PackedInstruction`.) Avoiding operation creation --------------------------- The `_raw_op` field of `CircuitInstruction` is gone, because `PyGate`, `PyInstruction` and `PyOperation` are no longer pyclasses and no longer exposed to Python. Instead, we avoid operation creation by: * having an internal `DAGNode::_to_circuit_instruction`, which returns a copy of the internal `CircuitInstruction`, which can then be used with `CircuitInstruction.replace`, etc. * having `CircuitInstruction::is_standard_gate` to query from Python space if we should bother to create the operator. * changing `CircuitData::map_ops` to `map_nonstandard_ops`, and having it only call the Python callback function if the operation is not an unconditional standard gate. Memory usage ------------ Given the very simple example construction script: ```python from qiskit.circuit import QuantumCircuit qc = QuantumCircuit(1_000) for _ in range(3_000): for q in qc.qubits: qc.rz(0.0, q) for q in qc.qubits: qc.rx(0.0, q) for q in qc.qubits: qc.rz(0.0, q) for a, b in zip(qc.qubits[:-1], qc.qubits[1:]): qc.cx(a, b) ``` This uses 1.5GB in max resident set size on my Macbook (note that it's about 12 million gates) on both 1.1.0 and with this commit, so we've undone our memory losses. The parent of this commit uses 2GB. However, we're in a strong position to beat 1.1.0 in the future now; there are two obvious large remaining costs: * There are 16 bytes per `PackedInstruction` for the Python-operation caching (worth about 180MB in this benchmark, since no Python operations are actually created). * There is also significant memory wastage in the current `SmallVec<[Param; 3]>` storage of the parameters; for all standard gates, we know statically how many parameters are / should be stored, and we never need to increase the capacity. Further, the `Param` enum is 16 bytes wide per parameter, of which nearly 8 bytes is padding, but for all our current use cases, we only care if _all_ the parameters or floats (for everything else, we're going to have to defer to Python). We could move the discriminant out to the level of the parameters structure, and save a large amount of padding. Further work ------------ There's still performance left on the table here: * We still copy-in and copy-out of `CircuitInstruction` too much right now; we might want to make all the `CircuitInstruction` fields nullable and have `CircuitData::append` take them by _move_ rather than by copy. * The qubits/clbits interner requires owned arrays going in, but most interning should return an existing entry. We probably want to switch to have the interner take references/iterators by default, and clone when necessary. We could have a small circuit optimisation where the intern contexts reserve the first n entries to use for an all-to-all connectivity interning for up to (say) 8 qubits, since the transpiler will want to create a lot of ephemeral small circuits. * The `Param` vectors are too heavy at the moment; `SmallVec<[Param; 3]>` is 56 bytes wide, despite the vast majority of gates we care about having at most one single float (8 bytes). Dead padding is a large chunk of the memory use currently. * Fix clippy in no-gate-cache mode * Fix pylint unused-import complaints * Fix broken assumptions around the gate model The `compose` test had a now-broken assumption, because the Python-space `is` check is no longer expected to return an identical object when a standard gate is moved from one circuit to another and has its components remapped as part of the `compose` operation. This doesn't constitute the unpleasant deep-copy that that test is preventing. A custom gate still satisfies that, however, so we can just change the test. `DAGNode::set_name` could cause problems if it was called for the first time on a `CircuitInstruction` that was for a standard gate; these would be created as immutable instances. Given the changes in operator extraction to Rust space, it can now be the case that a standard gate that comes in as mutable is unpacked into Rust space, the cache is some time later invalidated, and then the operation is recreated immutably. * Fix lint * Fix minor documentation
Procatv · Aug 1, 2024 · 7adc18c · 7adc18c
1 parent 60f05b4
commit 7adc18c
Show file tree

Hide file tree

Showing 25 changed files with 1,371 additions and 1,473 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -14,6 +14,7 @@ license = "Apache-2.0"
 #
 # Each crate can add on specific features freely as it inherits.
 [workspace.dependencies]
+bytemuck = "1.16"
 indexmap.version = "2.2.6"
 hashbrown.version = "0.14.0"
 num-complex = "0.4"

diff --git a/crates/accelerate/src/convert_2q_block_matrix.rs b/crates/accelerate/src/convert_2q_block_matrix.rs
@@ -23,33 +23,32 @@ use numpy::{IntoPyArray, PyArray2, PyReadonlyArray2};
 use smallvec::SmallVec;
 
 use qiskit_circuit::bit_data::BitData;
-use qiskit_circuit::circuit_instruction::{operation_type_to_py, CircuitInstruction};
+use qiskit_circuit::circuit_instruction::CircuitInstruction;
 use qiskit_circuit::dag_node::DAGOpNode;
 use qiskit_circuit::gate_matrix::ONE_QUBIT_IDENTITY;
 use qiskit_circuit::imports::QI_OPERATOR;
-use qiskit_circuit::operations::{Operation, OperationType};
+use qiskit_circuit::operations::{Operation, OperationRef};
 
 use crate::QiskitError;
 
 fn get_matrix_from_inst<'py>(
     py: Python<'py>,
     inst: &'py CircuitInstruction,
 ) -> PyResult<Array2<Complex64>> {
-    match inst.operation.matrix(&inst.params) {
-        Some(mat) => Ok(mat),
-        None => match inst.operation {
-            OperationType::Standard(_) => Err(QiskitError::new_err(
-                "Parameterized gates can't be consolidated",
-            )),
-            OperationType::Gate(_) => Ok(QI_OPERATOR
-                .get_bound(py)
-                .call1((operation_type_to_py(py, inst)?,))?
-                .getattr(intern!(py, "data"))?
-                .extract::<PyReadonlyArray2<Complex64>>()?
-                .as_array()
-                .to_owned()),
-            _ => unreachable!("Only called for unitary ops"),
-        },
+    if let Some(mat) = inst.op().matrix(&inst.params) {
+        Ok(mat)
+    } else if inst.operation.try_standard_gate().is_some() {
+        Err(QiskitError::new_err(
+            "Parameterized gates can't be consolidated",
+        ))
+    } else {
+        Ok(QI_OPERATOR
+            .get_bound(py)
+            .call1((inst.get_operation(py)?,))?
+            .getattr(intern!(py, "data"))?
+            .extract::<PyReadonlyArray2<Complex64>>()?
+            .as_array()
+            .to_owned())
     }
 }
 
@@ -127,34 +126,20 @@ pub fn change_basis(matrix: ArrayView2<Complex64>) -> Array2<Complex64> {
 
 #[pyfunction]
 pub fn collect_2q_blocks_filter(node: &Bound<PyAny>) -> Option<bool> {
-    match node.downcast::<DAGOpNode>() {
-        Ok(bound_node) => {
-            let node = bound_node.borrow();
-            match &node.instruction.operation {
-                OperationType::Standard(gate) => Some(
-                    gate.num_qubits() <= 2
-                        && node
-                            .instruction
-                            .extra_attrs
-                            .as_ref()
-                            .and_then(|attrs| attrs.condition.as_ref())
-                            .is_none()
-                        && !node.is_parameterized(),
-                ),
-                OperationType::Gate(gate) => Some(
-                    gate.num_qubits() <= 2
-                        && node
-                            .instruction
-                            .extra_attrs
-                            .as_ref()
-                            .and_then(|attrs| attrs.condition.as_ref())
-                            .is_none()
-                        && !node.is_parameterized(),
-                ),
-                _ => Some(false),
-            }
-        }
-        Err(_) => None,
+    let Ok(node) = node.downcast::<DAGOpNode>() else { return None };
+    let node = node.borrow();
+    match node.instruction.op() {
+        gate @ (OperationRef::Standard(_) | OperationRef::Gate(_)) => Some(
+            gate.num_qubits() <= 2
+                && node
+                    .instruction
+                    .extra_attrs
+                    .as_ref()
+                    .and_then(|attrs| attrs.condition.as_ref())
+                    .is_none()
+                && !node.is_parameterized(),
+        ),
+        _ => Some(false),
     }
 }
 

diff --git a/crates/accelerate/src/euler_one_qubit_decomposer.rs b/crates/accelerate/src/euler_one_qubit_decomposer.rs
@@ -743,7 +743,7 @@ pub fn compute_error_list(
         .iter()
         .map(|node| {
             (
-                node.instruction.operation.name().to_string(),
+                node.instruction.op().name().to_string(),
                 smallvec![], // Params not needed in this path
             )
         })
@@ -988,11 +988,10 @@ pub fn optimize_1q_gates_decomposition(
                 .iter()
                 .map(|node| {
                     if let Some(err_map) = error_map {
-                        error *=
-                            compute_error_term(node.instruction.operation.name(), err_map, qubit)
+                        error *= compute_error_term(node.instruction.op().name(), err_map, qubit)
                     }
                     node.instruction
-                        .operation
+                        .op()
                         .matrix(&node.instruction.params)
                         .expect("No matrix defined for operation")
                 })
@@ -1046,24 +1045,16 @@ fn matmul_1q(operator: &mut [[Complex64; 2]; 2], other: Array2<Complex64>) {
 
 #[pyfunction]
 pub fn collect_1q_runs_filter(node: &Bound<PyAny>) -> bool {
-    let op_node = node.downcast::<DAGOpNode>();
-    match op_node {
-        Ok(bound_node) => {
-            let node = bound_node.borrow();
-            node.instruction.operation.num_qubits() == 1
-                && node.instruction.operation.num_clbits() == 0
-                && node
-                    .instruction
-                    .operation
-                    .matrix(&node.instruction.params)
-                    .is_some()
-                && match &node.instruction.extra_attrs {
-                    None => true,
-                    Some(attrs) => attrs.condition.is_none(),
-                }
+    let Ok(node) = node.downcast::<DAGOpNode>() else { return false };
+    let node = node.borrow();
+    let op = node.instruction.op();
+    op.num_qubits() == 1
+        && op.num_clbits() == 0
+        && op.matrix(&node.instruction.params).is_some()
+        && match &node.instruction.extra_attrs {
+            None => true,
+            Some(attrs) => attrs.condition.is_none(),
         }
-        Err(_) => false,
-    }
 }
 
 #[pymodule]

diff --git a/crates/circuit/Cargo.toml b/crates/circuit/Cargo.toml
@@ -10,6 +10,7 @@ name = "qiskit_circuit"
 doctest = false
 
 [dependencies]
+bytemuck.workspace = true
 hashbrown.workspace = true
 num-complex.workspace = true
 ndarray.workspace = true