Significant overhead when using GPU and ITensors #101
Hi! It would be really interesting if you could share a profiler run of what you are trying to achieve. In principle, KrylovKit should be compatible with GPU-based backends, with the limitation that it still performs CPU linear algebra on the smaller matrices that characterize the Krylov subspaces. Depending on the interplay of these packages, there might accidentally be a large number of transfers between CPU and GPU memory that hinder performance, something else might be going wrong, or there may simply be a fair amount of overhead that cannot be avoided.
Thanks for the answer; it made me check my statement more closely. At the end of this post, you can find code built on example no. 1 from ITensorMPS.jl. I have not run a profiler yet, but I will get to that soon. It would be interesting to hear an opinion from someone more experienced with these considerations: for a given problem size, say n=20 and maxbondim=100, does the GPU backend have the expected speedup over the CPU? On my laptop, the GPU run took about 0.75 of the CPU time per time step (3 s compared with 4 s). Here's the code:
Unrelated to this issue, @XingyuZhang2018 has also been looking at the performance of KrylovKit with CUDA / GPU data and found that it does not meet expectations. It thus probably has nothing to do with ITensors, and I would like to reorient this issue towards discussing GPU performance.
A first attempted explanation for the possibly suboptimal performance was the observation that `dot` gets slower with every level of array nesting, roughly doubling each time:

```julia
julia> @testset "dot $atype" for atype in [Array, CuArray]
           Random.seed!(100)
           N = 10^2
           A = atype(rand(ComplexF64, N,N))
           B = [A]
           C = [B]
           D = [C]
           @btime CUDA.@sync dot($A, $A)
           @btime CUDA.@sync dot($B, $B)
           @btime CUDA.@sync dot($C, $C)
           @btime CUDA.@sync dot($D, $D)
       end
  2.876 μs (0 allocations: 0 bytes)
  5.111 μs (0 allocations: 0 bytes)
  9.600 μs (0 allocations: 0 bytes)
  18.579 μs (0 allocations: 0 bytes)
Test Summary: | Total  Time
dot Array     |     0  21.6s
  16.030 μs (18 allocations: 304 bytes)
  36.399 μs (36 allocations: 608 bytes)
  75.128 μs (72 allocations: 1.19 KiB)
  161.297 μs (144 allocations: 2.38 KiB)
Test Summary: | Total  Time
dot CuArray   |     0  29.9s
```

However, running the same benchmark with `inner` from VectorInterface gives timings that stay flat across nesting levels:

```julia
julia> using VectorInterface

julia> @testset "dot $atype" for atype in [Array, CuArray]
           Random.seed!(100)
           N = 10^2
           A = atype(rand(ComplexF64, N,N))
           B = [A]
           C = [B]
           D = [C]
           @btime CUDA.@sync inner($A, $A)
           @btime CUDA.@sync inner($B, $B)
           @btime CUDA.@sync inner($C, $C)
           @btime CUDA.@sync inner($D, $D)
       end
  3.021 μs (0 allocations: 0 bytes)
  3.073 μs (2 allocations: 80 bytes)
  3.106 μs (4 allocations: 160 bytes)
  3.134 μs (6 allocations: 240 bytes)
Test Summary: | Total  Time
dot Array     |     0  15.0s
  16.381 μs (18 allocations: 304 bytes)
  15.587 μs (20 allocations: 384 bytes)
  15.949 μs (22 allocations: 464 bytes)
  15.470 μs (24 allocations: 544 bytes)
Test Summary: | Total  Time
dot CuArray   |     0  27.2s
2-element Vector{Any}:
 Test.DefaultTestSet("dot Array", Any[], 0, false, false, true, 1.732188137110498e9, 1.732188152078698e9, false, "REPL[11]")
 Test.DefaultTestSet("dot CuArray", Any[], 0, false, false, true, 1.732188152078942e9, 1.732188179328664e9, false, "REPL[11]")
```
Actually, it would still be interesting to see where the allocations are coming from in those cases, but they do not seem to affect timings significantly.

I found a presentation that is potentially interesting. It discusses the case where BLAS level 2 operations are used to do the orthogonalisation of the Krylov vectors. KrylovKit.jl doesn't even use BLAS level 2 but only level 1, because of how it stores vectors in general, but it might be useful to specialise to different behaviour for specific vector types. However, as indicated in these slides, even the BLAS level 2 operations are suboptimal. It might be that we want to write some custom kernels for CuArray vectors, but that is not something I have experience with.
The problem can be seen directly in this easy example:

```julia
@testset "eigsolve $atype" for atype in [Array, CuArray]
    Random.seed!(100)
    N = 10^3
    A = [atype(rand(ComplexF64, N, N)) for i in 1:4]
    v0 = [atype(rand(ComplexF64, N)) for i in 1:4]
    linearmap(v) = A .* v
    @btime CUDA.@sync inner($v0, $v0)
    @btime CUDA.@sync $linearmap($v0)
    @btime CUDA.@sync λs, vs = eigsolve(v -> $linearmap(v), $v0, 1, :LM)
end
  855.385 ns (2 allocations: 128 bytes)
  613.000 μs (14 allocations: 62.84 KiB)
  23.762 ms (3015 allocations: 4.40 MiB)
Test Summary:  | Total  Time
eigsolve Array |     0  25.3s
  142.500 μs (58 allocations: 1.06 KiB)
  72.000 μs (94 allocations: 2.03 KiB)
  4.707 s (124214 allocations: 2.22 MiB)
Test Summary:    | Total  Time
eigsolve CuArray |     0  47.3s
```

The
I guess that
That is a great test case to focus our attention on! Is the fact that you have several vectors wrapped in a list essential to this, or is it already manifest with just a single
Yes, single
I think the reason is this: a vector-vector inner product at N ~ 10^3 is too cheap for the GPU to pay off, so the CPU can be faster than the GPU in these cases. It is not that "Array of Array is slow". In my benchmark, A*v takes nearly the same time as [A].*[v].
My suggestion is to initialize U, the unitary matrix that spans the Krylov space, with zeros and to update it in place, which will eliminate all allocations. When performing Gram-Schmidt, just call v = v-U^T*(U*v), which uses BLAS level 2 instead of BLAS level 1 and will be much friendlier to the GPU.
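To make the suggestion above concrete, here is a minimal CPU-only sketch that contrasts the two orthogonalisation styles. It is not KrylovKit's implementation: the function names `orthogonalize_level1!` and `orthogonalize_level2!`, the column-wise storage in `Q`, and the scratch buffer `coeffs` are all assumptions made for illustration. The basis vectors are stored as columns here, so the projection reads `v - Q*(Q'*v)` with a conjugate transpose rather than the row-wise `U^T*(U*v)` written above. Note also that the level 1 loop is modified Gram-Schmidt while the level 2 version is classical Gram-Schmidt; they agree for an orthonormal basis up to rounding, but their numerical stability differs.

```julia
using LinearAlgebra

# Level 1 style: the basis is a list of separate vectors, so every projection
# costs one dot and one axpy! call (2k small calls for k basis vectors).
function orthogonalize_level1!(v, basis)
    for q in basis
        axpy!(-dot(q, v), q, v)   # v -= <q, v> * q
    end
    return v
end

# Level 2 style: the basis vectors are the first k columns of a preallocated
# matrix Q, so all projections together are two matrix-vector products.
function orthogonalize_level2!(v, Q, k, coeffs)
    Qk = view(Q, :, 1:k)
    c = view(coeffs, 1:k)
    mul!(c, Qk', v)               # c = Qk' * v
    mul!(v, Qk, c, -1, 1)         # v = v - Qk * c, in place
    return v
end

# Small check on the CPU (hypothetical sizes).
n, k, kmax = 200, 5, 10
Q = zeros(ComplexF64, n, kmax)                     # preallocated, zero-filled
Q[:, 1:k] = Matrix(qr(rand(ComplexF64, n, k)).Q)   # k orthonormal columns
v = rand(ComplexF64, n)

v1 = orthogonalize_level1!(copy(v), [Q[:, j] for j in 1:k])
v2 = orthogonalize_level2!(copy(v), Q, k, zeros(ComplexF64, kmax))
```

Both results should agree up to rounding and be orthogonal to the first `k` columns of `Q`. Whether the level 2 form actually helps on a GPU would still need to be measured, for example with the `@btime CUDA.@sync` pattern used earlier in this thread.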
Hi,
I am simulating a quantum dynamical system using the great ITensorMPS.jl package.
(https://github.com/ITensor/ITensorMPS.jl)
Without getting into details about this package and the specific computation, I want to point out a seemingly significant overhead when running a subroutine of the 1TDVP algorithm that uses "exponentiate" with a GPU backend.
(It is used here: https://github.com/ITensor/ITensorMPS.jl/blob/main/src/solvers/tdvp.jl)
This algorithm solves the (time-dependent) Schrödinger equation by projecting the Hamiltonian onto a single MPS site. The "exponentiate" function is then used to solve the single-site effective equation. In principle, any ODE solver can solve the effective equation, but in practice "exponentiate" is usually used.
Is the significant overhead with the GPU backend compared with the CPU backend expected or known?