
about Winograd batched MM performance #7

Open

janboeye opened this issue Apr 21, 2018 · 1 comment


@janboeye

Hi @merrymercy,
I am working on Winograd on CUDA.
I found that the batched MM in your Winograd implementation is slow on the NVIDIA architecture (roughly the computation sketched below). I guess this is because, when C is large, it cannot exploit the parallelism of the GPU.
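
For concreteness, here is a tiny NumPy sketch of the stage I mean; the shapes (alpha, C, K, P) are illustrative guesses, not the values from your code:

```python
import numpy as np

# Illustrative shapes for F(2x2, 3x3) Winograd (assumptions):
# the transformed tile is alpha x alpha with alpha = 4, so there are
# alpha*alpha = 16 independent GEMMs; C is the (large) channel dim.
alpha, C, K, P = 4, 512, 512, 196

U = np.random.randn(alpha * alpha, K, C)  # transformed kernel
V = np.random.randn(alpha * alpha, C, P)  # transformed input tiles

# The batched MM stage: 16 independent (K x C) @ (C x P) products.
# C is the reduction axis, so it is traversed serially per output
# element; with only 16 batches, parallelism must come from within
# each GEMM rather than across the batch dimension.
M = np.einsum('bkc,bcp->bkp', U, V)  # shape (16, K, P)
```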

Do you have any idea about this part?

Thanks

@merrymercy (Owner) commented Apr 21, 2018

I am also working on CUDA Winograd.

  1. The schedule for the Mali GPU cannot be used on an NVIDIA GPU. The main difference is the usage of shared memory. You should implement totally different schedules for both the transformations and the batched MM. For batched GEMM, see https://github.com/dmlc/tvm/tree/master/topi/recipe/gemm for an example; a minimal sketch in the same style follows this list.
  2. For NVIDIA GPUs, if we want the best performance, we cannot re-layout the data several times like we do on Mali, because some stages become memory-bound on NVIDIA GPUs. (Comparing an NVIDIA GPU with a Mali GPU, peak FLOPS is about 50~200x higher, but memory bandwidth is only about 10x higher.) According to the original paper, we should fuse the transforms and the batched GEMM into a single block.
  3. Actually, I cannot figure out how to fuse them to get the best performance. The open-source code from that paper (the neon library) is in assembly, and I cannot read it. For now, I only have some preliminary results: for inference, if we do the kernel transformation in advance, our kernel can beat cuDNN's best Winograd when the kernel tensor is large (such as in the last few layers of ResNet).
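
To make point 1 concrete, here is a minimal sketch of a shared-memory batched GEMM schedule for NVIDIA GPUs, in the style of the linked TVM gemm recipe. All shapes and tiling factors are illustrative assumptions, not tuned values, and the code targets the 2018-era `tvm` API:

```python
import tvm

# Illustrative shapes (assumptions): B independent GEMMs
# (alpha*alpha tiles after the transforms), each (N x K) @ (K x M).
B, N, M, K = 16, 128, 128, 128
TILE = 16  # untuned tiling factor

A = tvm.placeholder((B, N, K), name="A")
W = tvm.placeholder((B, K, M), name="W")
k = tvm.reduce_axis((0, K), name="k")
C = tvm.compute((B, N, M),
                lambda b, i, j: tvm.sum(A[b, i, k] * W[b, k, j], axis=k),
                name="C")

s = tvm.create_schedule(C.op)

# Stage operand tiles through shared memory -- the main difference
# from the Mali schedule mentioned in point 1.
AS = s.cache_read(A, "shared", [C])
WS = s.cache_read(W, "shared", [C])

b, i, j = s[C].op.axis
bi, ti = s[C].split(i, factor=TILE)
bj, tj = s[C].split(j, factor=TILE)
s[C].reorder(b, bi, bj, ti, tj)
s[C].bind(b, tvm.thread_axis("blockIdx.z"))   # one batch per z-block
s[C].bind(bi, tvm.thread_axis("blockIdx.y"))
s[C].bind(bj, tvm.thread_axis("blockIdx.x"))
s[C].bind(ti, tvm.thread_axis("threadIdx.y"))
s[C].bind(tj, tvm.thread_axis("threadIdx.x"))

# Cooperatively load each block's operand tiles into shared memory.
for load in [AS, WS]:
    s[load].compute_at(s[C], bj)
    _, y, x = s[load].op.axis
    ty, _ = s[load].split(y, nparts=TILE)
    tx, _ = s[load].split(x, nparts=TILE)
    s[load].bind(ty, tvm.thread_axis("threadIdx.y"))
    s[load].bind(tx, tvm.thread_axis("threadIdx.x"))

print(tvm.build(s, [A, W, C], "cuda").imported_modules[0].get_source())
```

This only covers the standalone batched GEMM; fusing it with the transforms (points 2 and 3) is the part that is still open.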

What's your background in CUDA? It would help a lot if your team could contribute a fast (fused) Winograd kernel for CUDA.
