hi, @merrymercy
I am working on Winograd convolution on CUDA.
I found that the batched MM in your Winograd implementation is slow on the NVIDIA architecture. I guess this is because when C is large, it cannot use the parallel power of the GPU.
Do you have any idea about this part?
Thanks
The schedule for the Mali GPU cannot be used for an NVIDIA GPU. The main difference is the usage of shared memory. You should implement a totally different schedule for both the transformation and the batched MM. For batched GEMM, see https://github.com/dmlc/tvm/tree/master/topi/recipe/gemm for an example.
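To make the shared-memory point concrete, here is a minimal sketch of a batched GEMM schedule for CUDA in the old dmlc/tvm API, following the blocking and cooperative-fetching pattern of the gemm recipe linked above. It is not the schedule from this repo; the shapes and tile sizes are made-up values for illustration only.

```python
"""Sketch of a batched GEMM CUDA schedule (assumed shapes/tiles, old tvm API)."""
import tvm

batch, n = 196, 64            # assumed problem size, not from the winograd code
num_thread = 8                # assumed tile parameter

A = tvm.placeholder((batch, n, n), name="A")
B = tvm.placeholder((batch, n, n), name="B")
k = tvm.reduce_axis((0, n), name="k")
C = tvm.compute((batch, n, n),
                lambda b, y, x: tvm.sum(A[b, y, k] * B[b, k, x], axis=k),
                name="C")

s = tvm.create_schedule(C.op)
AS = s.cache_read(A, "shared", [C])   # stage input tiles through shared memory
BS = s.cache_read(B, "shared", [C])
CL = s.cache_write(C, "local")        # accumulate in registers

block_x = tvm.thread_axis("blockIdx.x")
block_y = tvm.thread_axis("blockIdx.y")
block_z = tvm.thread_axis("blockIdx.z")
thread_x = tvm.thread_axis((0, num_thread), "threadIdx.x")
thread_y = tvm.thread_axis((0, num_thread), "threadIdx.y")

b, y, x = s[C].op.axis
by, yi = s[C].split(y, factor=num_thread * 4)
bx, xi = s[C].split(x, factor=num_thread * 4)
ty, yi = s[C].split(yi, nparts=num_thread)
tx, xi = s[C].split(xi, nparts=num_thread)
s[C].reorder(b, by, bx, ty, tx, yi, xi)
s[C].bind(b, block_z)                 # one block per batch slice
s[C].bind(by, block_y)
s[C].bind(bx, block_x)
s[C].bind(ty, thread_y)
s[C].bind(tx, thread_x)

s[CL].compute_at(s[C], tx)
ko, ki = s[CL].split(s[CL].op.reduce_axis[0], factor=8)
_, yo, xo = s[CL].op.axis
s[CL].reorder(ko, ki, yo, xo)
s[AS].compute_at(s[CL], ko)
s[BS].compute_at(s[CL], ko)

# cooperative fetching: all threads of a block load the shared tiles together
for load in [AS, BS]:
    _, yl, xl = s[load].op.axis
    tyl, yl = s[load].split(yl, nparts=num_thread)
    txl, xl = s[load].split(xl, nparts=num_thread)
    s[load].bind(tyl, thread_y)
    s[load].bind(txl, thread_x)

print(tvm.lower(s, [A, B, C], simple_mode=True))
```

The key differences from the Mali schedule are the explicit shared-memory staging of the A/B tiles and the register-level accumulator; the actual tile sizes would need tuning per GPU and per layer shape.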
For NVIDIA GPUs, if we want to get the best performance, we cannot re-layout the data several times like we do on Mali, because some stages would become memory-bound on an NVIDIA GPU (NVIDIA GPU vs. Mali GPU: peak FLOPS is about 50-200x higher, but memory bandwidth is only about 10x higher). According to the original paper, we should fuse the transform and the batched GEMM into one block.
Actually, I cannot figure out how to fuse them to get the best performance. The open-source code from that paper (the neon library) is in assembly and I cannot read it. For now I only have some preliminary results: for inference, if we do the kernel transformation in advance, our kernel can beat cuDNN's best Winograd when the kernel tensor is large (such as in the last few layers of ResNet).
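For reference, "doing the kernel transformation in advance" just means applying the Winograd kernel transform U = G g G^T offline, so at inference time only the data transform and the batched GEMM remain. A small numpy sketch, assuming F(2x2, 3x3) tiles and a hypothetical (4, 4, CO, CI) output layout (neither is taken from this repo):

```python
import numpy as np

# Winograd F(2x2, 3x3) kernel-transform matrix G from Lavin & Gray.
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

def transform_kernel(weight):
    """Pre-transform a (CO, CI, 3, 3) kernel to a (4, 4, CO, CI) tensor offline.
    The output layout here is an illustrative choice, not the repo's layout."""
    co, ci, _, _ = weight.shape
    out = np.empty((4, 4, co, ci), dtype=weight.dtype)
    for o in range(co):
        for i in range(ci):
            out[:, :, o, i] = G.dot(weight[o, i]).dot(G.T)  # U = G g G^T
    return out
```

Since the weights are fixed at inference time, this transform costs nothing per inference call, which is why the comparison against cuDNN above assumes it is done ahead of time.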
What is your background in CUDA? It would help a lot if your team could contribute a fast (fused) Winograd kernel for CUDA.