idea
Assume that our GPU is capable of executing several kernels in parallel (running each on a different stream), or that we have access to multiple GPUs whose computational power we want to utilize.
The limiting factor is that we currently apply the gates of the circuit one by one, which rules out any parallelism. However, if two gates do not touch any common memory, they can safely run in parallel; for example, a gate acting only on qubit 1 and a gate acting only on qubits 5 and 6 write to disjoint memory and can run concurrently.
Here, I suggest an algorithm to partition our circuit (consisting of $U_1, U_2, \ldots, U_n$) into $k$ batches such that the gates within each batch do not touch any shared memory, so all the gates of a batch can run in parallel. Effectively, applying $U_1, \ldots, U_n$ is then equivalent to applying $Batch_1, Batch_2, \ldots, Batch_k$ one after another (with each batch executed in parallel).
algorithm
Consider the directed graph with an edge from $i$ to $j$ (for $i < j$) whenever $U_i$ and $U_j$ affect some common memory (in our case, each gate acts on some columns of the Tableau, so we have to check whether the two gates share a column).
Since every edge goes from a smaller index to a larger one, the order $1, \ldots, n$ is already a topological order of this graph. We just need to partition it into independent sets, which can be done greedily in $O(n)$ (a self-contained sketch follows the loop below):
- While any node remains:
  - put all nodes with no incoming edge into one batch, then delete them and their outgoing edges from the graph.
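For concreteness, here is a minimal self-contained sketch of this peeling loop that builds the conflict graph explicitly. The name `peel_batches` and the `conflicts` predicate are placeholders for illustration ("the two gates touch some common memory"); it costs $O(n^2)$ conflict checks, whereas the actual implementation further below never builds the graph at all:

```julia
# Sketch of the layer-peeling loop. `conflicts(a, b)` is a placeholder
# predicate for "gates a and b touch some common memory".
function peel_batches(gates, conflicts)
    n = length(gates)
    indeg = zeros(Int, n)            # number of earlier conflicting gates not yet peeled
    succ = [Int[] for _ in 1:n]      # outgoing edges i -> j with i < j
    for j in 2:n, i in 1:j-1
        if conflicts(gates[i], gates[j])
            push!(succ[i], j)
            indeg[j] += 1
        end
    end
    remaining = Set(1:n)
    batches = Vector{Vector{Int}}()  # each batch is a list of gate indices
    while !isempty(remaining)
        batch = sort!([j for j in remaining if indeg[j] == 0])
        for i in batch, j in succ[i]
            indeg[j] -= 1            # delete the peeled nodes' outgoing edges
        end
        setdiff!(remaining, batch)
        push!(batches, batch)
    end
    batches
end
```

On this conflict relation, each peeled layer is exactly one batch, and the batch index of a gate equals the length of the longest conflict chain ending at it, which is the quantity the linear-time implementation computes directly.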
The code below implements this idea; instead of building the graph, it tracks, for each memory chunk, the last batch that touched it:
```julia
function partition_circuit(::Type{T}, circ) where {T <: Unsigned}
    # time complexity: O(gates * qubits/64) for UInt64
    circ = if eltype(circ) <: QuantumClifford.CompactifiedGate
        circ
    else
        compactify_circuit(circ)
    end
    qmax = maximum((maximum(affectedqubits(g)) for g in circ))
    columns_cnt = QuantumClifford._div(T, qmax) + 1
    partitions = []
    partition_idx = []
    # last_column_occupier[i] = batch id of the last gate that occupies chunk i
    last_column_occupier = [0 for _ in 1:columns_cnt]
    for g in circ
        # indices of the T-sized chunks this gate touches
        column_indices = [QuantumClifford.getbigindex(T, qbit) for qbit in affectedqubits(g)]
        # one past the latest batch that conflicts with this gate
        my_partition_idx = maximum(last_column_occupier[column_indices]) + 1
        push!(partition_idx, my_partition_idx)
        if my_partition_idx > length(partitions)
            push!(partitions, [g])
        else
            push!(partitions[my_partition_idx], g)
        end
        last_column_occupier[column_indices] .= my_partition_idx
    end
    partitions
end
```
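As a quick illustration of the chunk granularity, here is a hypothetical invocation (`sHadamard` and `sCNOT` are the symbolic gates of QuantumClifford.jl; the expected batching follows from the chunk arithmetic above):

```julia
using QuantumClifford

# With T = UInt64, a column conflict is detected per 64-qubit chunk:
# qubit 1 falls in chunk 1, qubit 100 in chunk 2, qubits 2 and 130 in chunks 1 and 3.
circ = [sHadamard(1), sHadamard(100), sCNOT(2, 130)]
parts = partition_circuit(UInt64, circ)
# Expected: 2 batches -- sHadamard(1) and sHadamard(100) share no chunk,
# while sCNOT(2, 130) shares chunk 1 with sHadamard(1) and must wait.
```

Note that gates whose qubits share a 64-qubit chunk end up in different batches even though they act on different qubits, since they would write to the same machine words.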
And then we would use this partitioning like this:
```julia
function pftrajectories(state::PauliFrameGPU{T}, circuit) where {T <: Unsigned}
    for gates in partition_circuit(T, circuit)
        # launch every gate of the batch asynchronously, then wait for all of them
        @sync begin
            for gate in gates
                @CUDA.async apply!(state, gate)
            end
        end
    end
    return state
end
```
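For comparison, the same batched loop can be written with plain Julia tasks, relying on CUDA.jl's task-local streams (each task gets its own stream). This is an untested sketch (the name `pftrajectories_tasks` is made up), and, as noted under "issue" below, queueing work on distinct streams does not by itself guarantee concurrent execution:

```julia
using CUDA

function pftrajectories_tasks(state::PauliFrameGPU{T}, circuit) where {T <: Unsigned}
    for gates in partition_circuit(T, circuit)
        @sync for gate in gates
            Threads.@spawn begin
                apply!(state, gate)   # queued on this task's own stream
                CUDA.synchronize()    # wait for this task's stream to drain
            end
        end
    end
    return state
end
```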
issue
My GPU was not able to run two tasks in parallel (it would initiate a different stream for each async call, but the kernels would eventually run sequentially), so this showed no improvement.