[CUTLASS] Support batch_matmul #9439
Merged
Conversation
commit cfacfa2 (Masahiro Masuda <masahi129@gmail.com>, Mon Nov 1 15:57:49 2021 +0900): change is_constant pattern to wildcard in gelu pattern
commit 84da943 (Masahiro Masuda <masahi129@gmail.com>, Mon Nov 1 05:41:11 2021 +0900): fixed batch stride C
commit 66e5779 (Masahiro Masuda <masahi129@gmail.com>, Sun Oct 31 20:47:16 2021 +0900): refactoring codegen
commit 561daea (Masahiro Masuda <masahi129@gmail.com>, Sun Oct 31 20:05:20 2021 +0900): generated kernel compiled and result match
commit a5740bc (Masahiro Masuda <masahi129@gmail.com>, Sun Oct 31 19:36:53 2021 +0900): partitioning looks good
commit 59112fd (Masahiro Masuda <masahi129@gmail.com>, Sun Oct 31 19:01:47 2021 +0900): [WIP] cutlass batch matmul support
masahi requested review from anijain2305, areusch, comaniac, icemelon, jroesch, junrushao, jwfromm, manupak, MarisaKirisame, mbaret, mbrookhart, merrymercy, slyubomirsky, tqchen, trevor-m, vinx13, wweic, yzhliu, zhiics, and ZihengJiang as code owners on November 3, 2021 19:55.
comaniac approved these changes on Nov 3, 2021:
I like the detailed benchmarking. LGTM.
Thanks @comaniac
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request on Dec 1, 2021:
* Import batched gemm change (squash of commits cfacfa2 through 59112fd listed above)
* fixed test
* refactoring
* gelu test fixed
* more refactor
* batch_matmul fp32 accum working
* dynamic batch matmul working
* black
* remove doc TODO
Batch GEMM is nice to have, but it is a bit dated now. :) CUTLASS grouped GEMM was released with 2.8, and it is the way to go.
ylc pushed a commit to ylc/tvm that referenced this pull request on Jan 7, 2022.
ylc pushed a commit to ylc/tvm that referenced this pull request on Jan 13, 2022.
Adds support for offloading batch_matmul via the CUTLASS GemmBatched kernel. Dynamic shapes are also supported. I didn't add a profiler specifically for the GemmBatched kernel; I piggyback on the dense profiler, because CUTLASS GemmBatched uses the same kernel as a normal GEMM, parallelized over the batch dimension (the grid Z dimension).
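To make the profiler-reuse argument concrete: each batch slice of a batch_matmul is an ordinary (M, N, K) GEMM, so a tile configuration that profiles well for the single-GEMM dense case carries over to the batched case. Below is a minimal NumPy sketch of the semantics, assuming TVM's batch_matmul convention in which the second operand is stored transposed (shape (batch, N, K)); the helper name batch_matmul_ref is purely illustrative.

```python
import numpy as np

def batch_matmul_ref(A, B):
    """Reference semantics for a transposed-B batch_matmul:
    A: (batch, M, K), B: (batch, N, K) -> out: (batch, M, N).
    Each batch slice is an ordinary (M, N, K) GEMM; a batched kernel such as
    CUTLASS GemmBatched runs this same GEMM once per batch entry, with the
    batch index mapped to the CUDA grid Z dimension."""
    batch, M, _ = A.shape
    _, N, _ = B.shape
    out = np.empty((batch, M, N), dtype=A.dtype)
    for b in range(batch):        # the loop GemmBatched parallelizes over grid Z
        out[b] = A[b] @ B[b].T    # one ordinary GEMM per batch slice
    return out

# Quick self-check against einsum.
A = np.random.rand(8, 128, 64).astype("float32")
B = np.random.rand(8, 96, 64).astype("float32")
assert np.allclose(batch_matmul_ref(A, B), np.einsum("bik,bjk->bij", A, B), atol=1e-5)
```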
This allows me to test the cutlass BYOC flow on Huggingface BERT-large end to end. I also have a number for TensorRT, measured using their BERT demo, but note that the TensorRT result uses Google's own implementation of BERT. This is the current result comparing cutlass offload, AutoTVM native, and TensorRT, all using tensor cores on an RTX 3070. The input size is (8, 128) and the numbers are in milliseconds.

Here is the detailed nvprof output from cutlass and AutoTVM:
As you can see, for now activation fusion is enabled only for GeLU. There are other activations or elementwise ops that could be fused in principle; together, they account for about 15% of total execution time, which is a shame. To fuse them, we need to overcome all of the following blockers (a rough sketch of the GeLU matching pattern is included at the end of this description):

- reshape between dense and activations ([Topi] Allow relay.nn.dense support arbitrary number dimensions #8412)
- fusing the cast op into the activation (Allowing "source" (bias tensor) and output tensor to have different data type NVIDIA/cutlass#352)

The nvprof output also shows that softmax is a huge bottleneck, accounting for 24% of end-to-end time. Apparently TVM's CUDA softmax is 2x slower than cuDNN's; here is the result if we use cuDNN's softmax:
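For reference, here is a rough, hedged sketch of how a dense followed by a decomposed GeLU can be matched with Relay's dataflow pattern language for CUTLASS partitioning. The helper name make_dense_gelu_pattern is illustrative, and the actual pattern registered in python/tvm/relay/op/contrib/cutlass.py is likely more involved (bias handling, fp16 casts around erf), so treat this only as a sketch; the wildcard() arguments reflect the is_constant-to-wildcard change mentioned in the commit log.

```python
from tvm.relay.dataflow_pattern import is_op, wildcard

def make_dense_gelu_pattern():
    """Illustrative pattern for nn.dense followed by a decomposed GeLU:
        gelu(x) = x * (0.5 + 0.5 * erf(x / sqrt(2)))
    The scalar factors are matched with wildcard() rather than is_constant(),
    so the pattern still fires when they are not plain Constant nodes."""
    dense = is_op("nn.dense")(wildcard(), wildcard())
    scaled = is_op("multiply")(dense, wildcard())     # x * (1 / sqrt(2))
    erf = is_op("erf")(scaled)                        # erf(x / sqrt(2))
    half_erf = is_op("multiply")(erf, wildcard())     # 0.5 * erf(...)
    inner = is_op("add")(half_erf, wildcard())        # 0.5 + 0.5 * erf(...)
    return is_op("multiply")(dense, inner)            # x * (0.5 + 0.5 * erf(...))
```

A pattern like this would then be registered in the pattern table consumed by MergeComposite / partition_for_cutlass, so the whole dense plus GeLU chain is handed to the CUTLASS codegen as a single composite function.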
cc @comaniac @Laurawly @zhiics @hwu36