Too much specialization with permutedims: Use dynamic shared memory #375

Closed
GiggleLiu opened this issue Sep 16, 2021 · 3 comments · Fixed by #383

GiggleLiu commented Sep 16, 2021

Each distinct permutation costs ~0.6 s on my device (mostly compilation time), which is far too slow for contracting a tensor network. This is because the host function specializes on, and unrolls, the permutation order:

gpu_call(permutedims_kernel, dest, src, Val(perm))
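
For context, a minimal sketch (not the actual GPUArrays code) of why this is costly: Val((2,3,1)), Val((3,1,2)), ... are all distinct types, so each new permutation triggers a fresh method/kernel specialization and hence a fresh compilation. The name kernel_like is just an illustration.

# Each Val{perm} is a distinct type, so Julia compiles a new specialization
# (and, via gpu_call, a new GPU kernel) for every permutation it encounters.
kernel_like(::Val{perm}) where {perm} = perm
kernel_like(Val((2, 3, 1)))  # first call: compiles a specialization for (2, 3, 1)
kernel_like(Val((3, 1, 2)))  # different permutation: compiles again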

The following version is compiler-friendly, but not as efficient at runtime.

using CUDA, Random
using CUDA: @cartesianidx, AbstractGPUArray, gpu_call
using LinearAlgebra: permutedims!

# Generates `dest[I[perm[1]], ..., I[perm[N]]] = src[I]`: only the rank N is a
# compile-time constant; the permutation itself remains a runtime value.
@inline @generated function map_index(dest, src, I, perm::NTuple{N,T}) where {N,T}
    Expr(:(=), Expr(:ref, :dest, [:(@inbounds I[perm[$i]]) for i in 1:N]...), Expr(:ref, :src, :I))
end

function mypermutedims!(dest::AbstractGPUArray, src::AbstractGPUArray, perm::NTuple)
    Base.checkdims_perm(dest, src, perm)
    function permutedims_kernel(ctx, dest, src, perm)
        I = @cartesianidx src
        map_index(dest, src, I, perm)
        return
    end
    # Pass `perm` as a runtime value rather than `Val(perm)`, so the kernel is
    # not recompiled for every new permutation.
    gpu_call(permutedims_kernel, dest, src, perm)
    return dest
end

using BenchmarkTools
x = CUDA.randn(fill(2, 18)...);
y = zero(x);
p = (randperm(18)...,)
@btime CUDA.@sync permutedims!($y, $x, $p);
  142.905 μs (97 allocations: 3.41 KiB)
@btime CUDA.@sync mypermutedims!($y, $x, $p);
  400.497 μs (413 allocations: 13.14 KiB)  # too bad


x = CUDA.randn(80, 80, 80);
y = zero(x);
p = (2,3,1)
@btime CUDA.@sync permutedims!($y, $x, $p);
  100.064 μs (53 allocations: 1.91 KiB)
@btime CUDA.@sync mypermutedims!($y, $x, $p);
  130.855 μs (217 allocations: 7.02 KiB)   # this does not look too bad

Any advice on how to improve the permutedims implementation would be appreciated.

GiggleLiu commented Sep 17, 2021

Found a related PR: #338. Static or dynamic, that is the question. 🤔

Do you think this could be a solution:

If the tensor rank is >= 5, use the dynamic version; otherwise, use the static version.
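
A minimal sketch of that heuristic, reusing the mypermutedims! defined above (the function name and the rank-5 cutoff are just this proposal, not an existing API):

# Hypothetical rank-based dispatch: specialize on the permutation only for low
# ranks, where few distinct permutations can occur; for higher ranks, pass the
# permutation dynamically to avoid recompiling a kernel per permutation.
function permutedims_by_rank!(dest, src, perm::NTuple{N,Int}) where {N}
    if N >= 5
        mypermutedims!(dest, src, perm)   # dynamic: kernel not specialized on perm
    else
        permutedims!(dest, src, perm)     # static: existing Val(perm)-specialized path
    end
    return dest
end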

maleadt commented Oct 4, 2021

The problem is that the dynamic version isn't supported by all GPUArrays back-ends. But the recent KernelState work should make it possible, at least with OpenCL-style back-ends like oneAPI.jl.
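
For reference, a minimal sketch of dynamically-allocated shared memory at the CUDA.jl level (using CuDynamicSharedArray and the shmem launch keyword in recent CUDA.jl versions); this is just the raw CUDA feature, not the GPUArrays-level abstraction discussed here:

using CUDA

# The shared buffer size is a launch-time parameter (`shmem`), not a constant
# baked into the compiled kernel.
function copy_through_shmem!(out, inp)
    buf = CuDynamicSharedArray(Float32, blockDim().x)
    i = threadIdx().x
    @inbounds buf[i] = inp[i]
    sync_threads()
    @inbounds out[i] = buf[i]
    return
end

inp = CUDA.rand(Float32, 64)
out = CUDA.zeros(Float32, 64)
@cuda threads=64 shmem=64 * sizeof(Float32) copy_through_shmem!(out, inp)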

@jpsamaroo does AMDGPU.jl support dynamically-allocated shared memory?

@maleadt maleadt changed the title Slow compiling permutedims in tensor network applications Too much specialization with permutedims: Use dynamic shared memory Oct 4, 2021
jpsamaroo (Member) commented

Nope, last I checked, the LLVM backend doesn't currently support dynamic shared allocations (and I'm not sure if it ever will). Of course, whenever I get around to implementing device-side kernel launch, we could probably use that to work around that limitation, but it's not a high priority for me right now.
