
Parallel execution of gates #175

Open
Shayan-P opened this issue Aug 27, 2023 · 0 comments
Labels
experiment experimental work GPU related to new gpu support addition

Comments

@Shayan-P
Collaborator

idea

Assume that our GPU is capable of executing kernels of gates in parallel (running each on a different stream), or we have access to multiple GPUs and we want to utilize their computation power.

The limiting factor is that we currently apply the gates of our circuit one by one, which leaves no room for parallelism. However, if gate1 and gate2 do not affect any common memory, we can safely run them in parallel.

Here, I suggest an algorithm to partition our circuit (consisting of $U_1, U_2, ..., U_n$) into $k$ batches in such a way that the gates within each batch do not affect any shared memory, so we can run all the gates of a batch in parallel. Effectively, applying $U_1, ..., U_n$ is then equivalent to applying $Batch_1, Batch_2, ..., Batch_k$ one after another (with each batch executed in parallel).

algorithm

Assume there is a directed graph with a directed edge from $i$ to $j$ ($i < j$) whenever $U_i$ and $U_j$ affect some common memory (in our case, each gate acts on some columns of the Tableau, so we have to check whether the two gates share any column).

By construction this directed graph is already topologically sorted (edges only go from lower to higher gate indices). We just need to partition it into independent sets, which can trivially be done in $O(n)$:

  • While any node remains:
  • Put the nodes that have no incoming edge into one batch, then delete them (together with their outgoing edges).
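The layering step above can be sketched on a generic dependency graph. This is a hypothetical helper, not part of the codebase; it assumes the conflict edges $(i, j)$ with $i < j$ have already been computed. Since edges only point forward, the batch of gate $j$ is simply one more than the largest batch among its predecessors:

```julia
# Partition gates 1..n into batches given conflict edges (i -> j, i < j).
# batch[j] = 1 + max over predecessors i of batch[i] (0 if none).
function layer_dag(n::Int, edges::Vector{Tuple{Int,Int}})
    batch = zeros(Int, n)
    for j in 1:n
        batch[j] = 1 + maximum((batch[i] for (i, jj) in edges if jj == j); init=0)
    end
    # Group gate indices by their batch number.
    k = maximum(batch; init=0)
    batches = [Int[] for _ in 1:k]
    for j in 1:n
        push!(batches[batch[j]], j)
    end
    batches
end
```

For example, with four gates and conflicts `(1,2)`, `(2,4)`, `(3,4)`, gates 1 and 3 can share a batch, while 2 and then 4 must wait: `layer_dag(4, [(1,2),(2,4),(3,4)])` gives `[[1,3],[2],[4]]`.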

The code below is the implementation of the idea above:

function partition_circuit(::Type{T}, circ) where {T <: Unsigned}
    # time complexity: O(gates * qubits/64) for UInt64

    circ = if eltype(circ) <: QuantumClifford.CompactifiedGate
        circ
    else
        compactify_circuit(circ)
    end

    qmax = maximum((maximum(affectedqubits(g)) for g in circ))
    columns_cnt = QuantumClifford._div(T, qmax) + 1

    partitions = []
    partition_idx = []
    last_column_occupier = [0 for _ in 1:columns_cnt] # last_column_occupier[i] = batch index of the last gate that occupies column chunk i

    for g in circ
        column_indices = [QuantumClifford.getbigindex(T, qbit) for qbit in affectedqubits(g)]
        my_partition_idx = maximum(last_column_occupier[column_indices]) + 1
        push!(partition_idx, my_partition_idx)
        if my_partition_idx > length(partitions)
            push!(partitions, [g])
        else
            push!(partitions[my_partition_idx], g)
        end
        last_column_occupier[column_indices] .= my_partition_idx
    end
    partitions
end
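To illustrate the same greedy scan without the QuantumClifford machinery, here is a self-contained toy version: gates are represented just by the list of qubit indices they act on, and conflicts are checked per qubit rather than per packed Tableau column chunk. `toy_partition` is a hypothetical name for this sketch, not an API of the library:

```julia
# Greedy single pass: a gate goes into the batch right after the latest
# batch that touched any of its qubits (same idea as last_column_occupier).
function toy_partition(gates::Vector{Vector{Int}})
    nqubits = maximum(maximum.(gates))
    last_occupier = zeros(Int, nqubits)  # batch id of the last gate touching each qubit
    partitions = Vector{Vector{Int}}()   # gate indices per batch
    for (gi, qubits) in enumerate(gates)
        my_idx = maximum(last_occupier[qubits]) + 1
        my_idx > length(partitions) ? push!(partitions, [gi]) : push!(partitions[my_idx], gi)
        last_occupier[qubits] .= my_idx
    end
    partitions
end
```

For example, gates on qubits `[1,2]`, `[3,4]`, and `[2,3]` partition into two batches: the first two gates are disjoint and share a batch, while the third conflicts with both, so `toy_partition([[1,2],[3,4],[2,3]])` returns `[[1,2],[3]]`.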

And then, we would use this partitioning like this:

function pftrajectories(state::PauliFrameGPU{T}, circuit) where {T <: Unsigned}
    for gates in partition_circuit(T, circuit)
        @sync begin
            for gate in gates
                @CUDA.async apply!(state, gate)
            end
        end
    end
    return state
end
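The same batched loop also works with CPU task parallelism instead of CUDA streams. Below is a hedged, self-contained sketch of the pattern, with a trivial in-place update standing in for `apply!`; it relies on the partitioner's guarantee that gates within one batch touch disjoint data, so the concurrent writes cannot race:

```julia
# Run each batch concurrently with Threads.@spawn; @sync waits for the
# whole batch before the next one starts. Indices within a batch are
# disjoint, so the concurrent increments are safe.
function run_batches!(data::Vector{Int}, batches::Vector{Vector{Int}})
    for batch in batches
        @sync for i in batch
            Threads.@spawn (data[i] += 1)  # stand-in for apply!(state, gate)
        end
    end
    data
end
```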

issue

My GPU was not able to run two tasks in parallel (it would initiate a different stream for each async call, but they would eventually execute sequentially), so this showed no improvement.

@Shayan-P Shayan-P added GPU related to new gpu support addition experiment experimental work labels Aug 27, 2023
@Shayan-P Shayan-P changed the title Parallel execration of gates Parallel execution of gates Aug 27, 2023