Too much specialization with permutedims: Use dynamic shared memory #375
See the related PR #338: static or dynamic, that is the question. 🤔 Do you think this could be a solution: if the tensor rank is >= 5, use the dynamic version; otherwise, use the static version? (A sketch of this heuristic follows below.)
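A minimal sketch of that heuristic in Julia might look like the following. All names here (`permutedims_select!`, `permutedims_static!`, `permutedims_dynamic!`) are hypothetical, standing in for the two kernel variants discussed in this thread; this is not GPUArrays API.

```julia
# Hypothetical rank-based dispatch (illustrative names, not GPUArrays API).
# For rank >= 5 there are already 5! = 120 possible orders, so specializing
# a kernel per permutation becomes prohibitively expensive to compile.
function permutedims_select!(dest, src, perm::NTuple{N,Int}) where {N}
    if N >= 5
        permutedims_dynamic!(dest, src, perm)      # one compile per rank
    else
        permutedims_static!(dest, src, Val(perm))  # one compile per permutation
    end
end
```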
The problem is that the dynamic version isn't supported by all GPUArrays back-ends. But the recent KernelState work should make it possible, at least with OpenCL-style back-ends like oneAPI.jl. @jpsamaroo, does AMDGPU.jl support dynamically-allocated shared memory?
Nope, last I checked the LLVM backend doesn't currently support dynamic shared allocations (and I'm not sure if it ever will). Of course, whenever I get around to implementing device-side kernel launch, we could probably use that to work around the limitation, but it's not a high priority for me right now.
Each distinct permutation costs ~0.6 s on my device, which is too costly when contracting a tensor network. This is because the host function unrolls the permutation order:
GPUArrays.jl/src/host/linalg.jl, line 195 (commit e1856fe)
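For context, the specializing pattern looks roughly like this simplified CPU-side sketch (not the actual GPUArrays.jl source, which runs the loop body as a device kernel):

```julia
# Simplified sketch of the Val-specialized pattern: `perm` is lifted into
# the type domain, so the index computation unrolls at compile time, and
# every distinct permutation order compiles a brand-new method instance.
function permutedims_static!(dest, src, ::Val{perm}) where {perm}
    @inbounds for I in CartesianIndices(src)
        J = CartesianIndex(map(p -> I[p], perm))  # dest index = permuted src index
        dest[J] = src[I]
    end
    return dest
end
```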
The following version is compiler friendly but not runtime efficient.
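The snippet originally posted here isn't reproduced above; as a hypothetical stand-in, a runtime-permutation variant with that trade-off could look like this:

```julia
# Hedged sketch (not the author's posted code): `perm` stays a runtime
# value, so only one method instance is compiled per rank N, but the
# index computation can no longer unroll, trading compile time for
# per-element run-time cost.
function permutedims_dynamic!(dest, src, perm::NTuple{N,Int}) where {N}
    @inbounds for I in CartesianIndices(src)
        J = CartesianIndex(ntuple(d -> I[perm[d]], Val(N)))
        dest[J] = src[I]
    end
    return dest
end
```

Since `perm` is an ordinary argument rather than a `Val`, calling this with many different permutations of the same rank reuses one compiled method, which is exactly what helps when contracting a tensor network.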
I'd appreciate any advice on improving the `permutedims` implementation.