[CUTLASS] Support batch_matmul #9439

Merged
merged 9 commits into from
Nov 4, 2021

Conversation

masahi (Member) commented Nov 3, 2021

Adds support for offloading batch_matmul to the CUTLASS GemmBatched kernel. Dynamic shapes are also supported.
I didn't add a profiler specifically for the GemmBatched kernel. Instead, I piggyback on the dense profiler, because CUTLASS GemmBatched uses the same kernel as the normal GEMM, parallelized over the batch dimension (grid Z dim).
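
To illustrate why the dense profiler can be reused, here is a minimal pure-Python sketch of a batched GEMM written as the same GEMM run once per batch entry, with each operand offset by its batch stride. The function names (`gemm_nt`, `batched_gemm_nt`) are hypothetical and not part of TVM or CUTLASS, and the NT layout (second operand stored as N x K and consumed transposed) as well as the dense packing of the buffers are assumptions made for the example.

```python
def gemm_nt(a, b, m, n, k, a_off=0, b_off=0):
    """Row-major C = A @ B^T, reading A (m x k) and B (n x k) from flat lists."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[a_off + i * k + p] * b[b_off + j * k + p]
            c[i * n + j] = acc
    return c


def batched_gemm_nt(a, b, batch, m, n, k):
    """Run the same GEMM per batch entry, offset by per-operand batch strides."""
    # Batch strides in elements, assuming densely packed operands.
    stride_a, stride_b, stride_c = m * k, n * k, m * n
    c = [0.0] * (batch * stride_c)
    for z in range(batch):  # stands in for the grid Z index
        c_z = gemm_nt(a, b, m, n, k, z * stride_a, z * stride_b)
        c[z * stride_c:(z + 1) * stride_c] = c_z
    return c


# Two 2x2 problems: identity @ B^T, then (2 * identity) @ B^T.
a = [1.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 2.0]
b = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
out = batched_gemm_nt(a, b, batch=2, m=2, n=2, k=2)
print(out)
```

The inner `gemm_nt` never sees the batch dimension, so its per-problem tile choices depend only on M, N and K. That is the sense in which a profile of the dense kernel can plausibly stand in for the batched one.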

This allows me to test CUTLASS BYOC on Huggingface BERT-large end to end. I also have a number for TensorRT, measured using their BERT demo, but note that the TensorRT demo uses Google's own implementation of BERT rather than the Huggingface one.

This is the current result comparing CUTLASS offload, native AutoTVM, and TensorRT, all using tensor cores on an RTX 3070. The input size is (8, 128) and all numbers are in milliseconds.

| CUTLASS | AutoTVM | TensorRT (Google's implementation) |
| --- | --- | --- |
| 23.0551 | 23.6729 | 14.0 |

Here is the detailed nvprof output from cutlass and autotvm:

As you can see, activation fusion is currently enabled only for GeLU. There are other activations or elementwise ops that can be fused in principle, such as

Together, they account for about 15% of total execution time, which is a shame. To fuse them, we need to overcome all of the following blockers:

The nvprof output also shows that softmax is a huge bottleneck, accounting for 24% of end-to-end time. Apparently TVM's CUDA softmax is 2x slower than cuDNN's. Here is the result if we use cuDNN's softmax instead:

| CUTLASS | AutoTVM | TensorRT (Google's implementation) |
| --- | --- | --- |
| 20.2517 | 21.1869 | 14.0 |
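
As a quick sanity check on these numbers, the following sketch computes how much the cuDNN softmax saves and how far each result remains from the TensorRT figure. The helper name `summarize` is hypothetical. The "expected saving" line assumes the 24% softmax share applies to the CUTLASS run, which the description does not state explicitly.

```python
# End-to-end latencies in milliseconds, copied from the two tables above.
tvm_softmax = {"CUTLASS": 23.0551, "AutoTVM": 23.6729}
cudnn_softmax = {"CUTLASS": 20.2517, "AutoTVM": 21.1869}
tensorrt_ms = 14.0


def summarize(before, after, reference):
    """Per-backend saving in ms and percent, plus the ratio to the reference."""
    rows = {}
    for name in before:
        saved = before[name] - after[name]
        rows[name] = {
            "saved_ms": round(saved, 4),
            "saved_pct": round(100.0 * saved / before[name], 1),
            "ratio_to_reference": round(after[name] / reference, 2),
        }
    return rows


summary = summarize(tvm_softmax, cudnn_softmax, tensorrt_ms)

# If softmax is 24% of the run and becomes 2x faster, half of that share
# should disappear (assumption: the 24% refers to the CUTLASS run).
expected_saving_ms = round(0.24 * tvm_softmax["CUTLASS"] / 2, 2)

for name, row in summary.items():
    print(name, row)
print("expected CUTLASS saving (ms):", expected_saving_ms)
```

Under that assumption the expected saving of roughly 2.77 ms is close to the measured 2.80 ms for CUTLASS, so the 24% and 2x figures are consistent with the second table.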

cc @comaniac @Laurawly @zhiics @hwu36

commit cfacfa2
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Mon Nov 1 15:57:49 2021 +0900

    change is_constant pattern to wildcard in gelu pattern

commit 84da943
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Mon Nov 1 05:41:11 2021 +0900

    fixed batch stride C

commit 66e5779
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Sun Oct 31 20:47:16 2021 +0900

    refactoring codegen

commit 561daea
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Sun Oct 31 20:05:20 2021 +0900

    generated kernel compiled and result match

commit a5740bc
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Sun Oct 31 19:36:53 2021 +0900

    partitioning looks good

commit 59112fd
Author: Masahiro Masuda <masahi129@gmail.com>
Date:   Sun Oct 31 19:01:47 2021 +0900

    [WIP] cutlass batch matmul support

comaniac (Contributor) left a comment

I like the detailed benchmarking. LGTM.

@masahi masahi merged commit c7a01a4 into apache:main Nov 4, 2021

masahi (Member, Author) commented Nov 4, 2021

Thanks @comaniac

mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Dec 1, 2021
* Import batched gemm change

* fixed test

* refactoring

* gelu test fixed

* more refactor

* batch_matmul fp32 accum working

* dynamic batch matmul working

* black

* remove doc TODO

hwu36 commented Dec 20, 2021

batch gemm is nice to have, but it is a bit yesterday now. :)

cutlass group gemm is released with 2.8 and it is the way to go.

ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022