idea
Assume that our GPU is capable of executing several kernels in parallel (running each on a different stream), or that we have access to multiple GPUs whose computational power we want to utilize.
The limiting factor is that we currently apply the gates of the circuit one by one, which rules out any parallelism. However, if two gates do not touch any common memory, they can safely run in parallel; for example, a gate acting only on qubit 1 and a gate acting only on qubits 5 and 6 write to disjoint memory and can run concurrently.
Here, I suggest an algorithm to partition our circuit (consisting of $U_1, U_2, \ldots, U_n$) into $k$ batches such that the gates within each batch do not touch any shared memory, so all the gates of a batch can run in parallel. Effectively, applying $U_1, \ldots, U_n$ is then equivalent to applying $Batch_1, Batch_2, \ldots, Batch_k$ one after another (with each batch executed in parallel).
algorithm
Consider the directed graph with an edge from $i$ to $j$ (for $i < j$) whenever $U_i$ and $U_j$ affect some common memory (in our case, each gate acts on some columns of the Tableau, so we have to check whether the two gates share a column).
Since every edge goes from a smaller index to a larger one, the order $1, \ldots, n$ is already a topological order of this graph. We just need to partition it into independent sets, which can be done greedily in $O(n)$ (a self-contained sketch follows the loop below):
- While any node remains:
  - put all nodes with no incoming edge into one batch, then delete them and their outgoing edges from the graph.
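For concreteness, here is a minimal self-contained sketch of this peeling loop that builds the conflict graph explicitly. The name `peel_batches` and the `conflicts` predicate are placeholders for illustration ("the two gates touch some common memory"); it costs $O(n^2)$ conflict checks, whereas the actual implementation further below never builds the graph at all:

```julia
# Sketch of the layer-peeling loop. `conflicts(a, b)` is a placeholder
# predicate for "gates a and b touch some common memory".
function peel_batches(gates, conflicts)
    n = length(gates)
    indeg = zeros(Int, n)            # number of earlier conflicting gates not yet peeled
    succ = [Int[] for _ in 1:n]      # outgoing edges i -> j with i < j
    for j in 2:n, i in 1:j-1
        if conflicts(gates[i], gates[j])
            push!(succ[i], j)
            indeg[j] += 1
        end
    end
    remaining = Set(1:n)
    batches = Vector{Vector{Int}}()  # each batch is a list of gate indices
    while !isempty(remaining)
        batch = sort!([j for j in remaining if indeg[j] == 0])
        for i in batch, j in succ[i]
            indeg[j] -= 1            # delete the peeled nodes' outgoing edges
        end
        setdiff!(remaining, batch)
        push!(batches, batch)
    end
    batches
end
```

On this conflict relation, each peeled layer is exactly one batch, and the batch index of a gate equals the length of the longest conflict chain ending at it, which is the quantity the linear-time implementation computes directly.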
The code below implements this idea; instead of building the graph, it tracks, for each memory chunk, the last batch that touched it:
```julia
function partition_circuit(::Type{T}, circ) where {T <: Unsigned}
    # time complexity: O(gates * qubits/64) for UInt64
    circ = if eltype(circ) <: QuantumClifford.CompactifiedGate
        circ
    else
        compactify_circuit(circ)
    end
    qmax = maximum((maximum(affectedqubits(g)) for g in circ))
    columns_cnt = QuantumClifford._div(T, qmax) + 1
    partitions = []
    partition_idx = []
    # last_column_occupier[i] = batch id of the last gate that occupies chunk i
    last_column_occupier = [0 for _ in 1:columns_cnt]
    for g in circ
        # indices of the T-sized chunks this gate touches
        column_indices = [QuantumClifford.getbigindex(T, qbit) for qbit in affectedqubits(g)]
        # one past the latest batch that conflicts with this gate
        my_partition_idx = maximum(last_column_occupier[column_indices]) + 1
        push!(partition_idx, my_partition_idx)
        if my_partition_idx > length(partitions)
            push!(partitions, [g])
        else
            push!(partitions[my_partition_idx], g)
        end
        last_column_occupier[column_indices] .= my_partition_idx
    end
    partitions
end
```
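As a quick illustration of the chunk granularity, here is a hypothetical invocation (`sHadamard` and `sCNOT` are the symbolic gates of QuantumClifford.jl; the expected batching follows from the chunk arithmetic above):

```julia
using QuantumClifford

# With T = UInt64, a column conflict is detected per 64-qubit chunk:
# qubit 1 falls in chunk 1, qubit 100 in chunk 2, qubits 2 and 130 in chunks 1 and 3.
circ = [sHadamard(1), sHadamard(100), sCNOT(2, 130)]
parts = partition_circuit(UInt64, circ)
# Expected: 2 batches -- sHadamard(1) and sHadamard(100) share no chunk,
# while sCNOT(2, 130) shares chunk 1 with sHadamard(1) and must wait.
```

Note that gates whose qubits share a 64-qubit chunk end up in different batches even though they act on different qubits, since they would write to the same machine words.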
And then we would use this partitioning like this:
```julia
function pftrajectories(state::PauliFrameGPU{T}, circuit) where {T <: Unsigned}
    for gates in partition_circuit(T, circuit)
        # launch every gate of the batch asynchronously, then wait for all of them
        @sync begin
            for gate in gates
                @CUDA.async apply!(state, gate)
            end
        end
    end
    return state
end
```
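For comparison, the same batched loop can be written with plain Julia tasks, relying on CUDA.jl's task-local streams (each task gets its own stream). This is an untested sketch (the name `pftrajectories_tasks` is made up), and, as noted under "issue" below, queueing work on distinct streams does not by itself guarantee concurrent execution:

```julia
using CUDA

function pftrajectories_tasks(state::PauliFrameGPU{T}, circuit) where {T <: Unsigned}
    for gates in partition_circuit(T, circuit)
        @sync for gate in gates
            Threads.@spawn begin
                apply!(state, gate)   # queued on this task's own stream
                CUDA.synchronize()    # wait for this task's stream to drain
            end
        end
    end
    return state
end
```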
issue
My GPU was not able to run two tasks in parallel (it would initiate a different stream for each async call, but the kernels would eventually run sequentially), so this showed no improvement.