[CUTLASS] Support batch_matmul #9439
Merged
Conversation
commit cfacfa2 (Masahiro Masuda <masahi129@gmail.com>, Mon Nov 1 15:57:49 2021 +0900): change is_constant pattern to wildcard in gelu pattern
commit 84da943 (Masahiro Masuda <masahi129@gmail.com>, Mon Nov 1 05:41:11 2021 +0900): fixed batch stride C
commit 66e5779 (Masahiro Masuda <masahi129@gmail.com>, Sun Oct 31 20:47:16 2021 +0900): refactoring codegen
commit 561daea (Masahiro Masuda <masahi129@gmail.com>, Sun Oct 31 20:05:20 2021 +0900): generated kernel compiled and result match
commit a5740bc (Masahiro Masuda <masahi129@gmail.com>, Sun Oct 31 19:36:53 2021 +0900): partitioning looks good
commit 59112fd (Masahiro Masuda <masahi129@gmail.com>, Sun Oct 31 19:01:47 2021 +0900): [WIP] cutlass batch matmul support
masahi requested review from anijain2305, areusch, comaniac, icemelon, jroesch, junrushao, jwfromm, manupak, MarisaKirisame, mbaret, mbrookhart, merrymercy, slyubomirsky, tqchen, trevor-m, vinx13, wweic, yzhliu, zhiics, and ZihengJiang as code owners on November 3, 2021 19:55.
comaniac approved these changes on Nov 3, 2021:
I like the detailed benchmarking. LGTM.
Thanks @comaniac
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request on Dec 1, 2021:
* Import batched gemm change (squash of commits cfacfa2 through 59112fd listed above)
* fixed test
* refactoring
* gelu test fixed
* more refactor
* batch_matmul fp32 accum working
* dynamic batch matmul working
* black
* remove doc TODO
Batch GEMM is nice to have, but it is a bit dated now. :) CUTLASS grouped GEMM was released with 2.8, and it is the way to go.
ylc pushed a commit to ylc/tvm that referenced this pull request on Jan 7, 2022.
ylc pushed a commit to ylc/tvm that referenced this pull request on Jan 13, 2022.
Adds support for offloading batch_matmul via the CUTLASS GemmBatched kernel. Dynamic shapes are also supported. I didn't add a profiler specifically for the GemmBatched kernel; I piggyback on the dense profiler, because CUTLASS GemmBatched uses the same kernel as a normal GEMM, parallelized over the batch dimension (the grid Z dimension).
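To make the profiler-reuse argument concrete: each batch slice of a batch_matmul is an ordinary (M, N, K) GEMM, so a tile configuration that profiles well for the single-GEMM dense case carries over to the batched case. Below is a minimal NumPy sketch of the semantics, assuming TVM's batch_matmul convention in which the second operand is stored transposed (shape (batch, N, K)); the helper name batch_matmul_ref is purely illustrative.

```python
import numpy as np

def batch_matmul_ref(A, B):
    """Reference semantics for a transposed-B batch_matmul:
    A: (batch, M, K), B: (batch, N, K) -> out: (batch, M, N).
    Each batch slice is an ordinary (M, N, K) GEMM; a batched kernel such as
    CUTLASS GemmBatched runs this same GEMM once per batch entry, with the
    batch index mapped to the CUDA grid Z dimension."""
    batch, M, _ = A.shape
    _, N, _ = B.shape
    out = np.empty((batch, M, N), dtype=A.dtype)
    for b in range(batch):        # the loop GemmBatched parallelizes over grid Z
        out[b] = A[b] @ B[b].T    # one ordinary GEMM per batch slice
    return out

# Quick self-check against einsum.
A = np.random.rand(8, 128, 64).astype("float32")
B = np.random.rand(8, 96, 64).astype("float32")
assert np.allclose(batch_matmul_ref(A, B), np.einsum("bik,bjk->bij", A, B), atol=1e-5)
```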
This allows me to test the cutlass BYOC flow on Huggingface BERT-large end to end. I also have a number for TensorRT, measured using their BERT demo, but note that the TensorRT result uses Google's own implementation of BERT. This is the current result comparing cutlass offload, AutoTVM native, and TensorRT, all using tensor cores on an RTX 3070. The input size is (8, 128) and the numbers are in milliseconds.

Here is the detailed nvprof output from cutlass and AutoTVM:
As you can see, for now activation fusion is enabled only for GeLU. There are other activations or elementwise ops that could be fused in principle; together, they account for about 15% of total execution time, which is a shame. To fuse them, we need to overcome all of the following blockers (a rough sketch of the GeLU matching pattern is included at the end of this description):

- reshape between dense and activations ([Topi] Allow relay.nn.dense support arbitrary number dimensions #8412)
- fusing the cast op into the activation (Allowing "source" (bias tensor) and output tensor to have different data type NVIDIA/cutlass#352)

The nvprof output also shows that softmax is a huge bottleneck, accounting for 24% of end-to-end time. Apparently TVM's CUDA softmax is 2x slower than cuDNN's; here is the result if we use cuDNN's softmax:
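For reference, here is a rough, hedged sketch of how a dense followed by a decomposed GeLU can be matched with Relay's dataflow pattern language for CUTLASS partitioning. The helper name make_dense_gelu_pattern is illustrative, and the actual pattern registered in python/tvm/relay/op/contrib/cutlass.py is likely more involved (bias handling, fp16 casts around erf), so treat this only as a sketch; the wildcard() arguments reflect the is_constant-to-wildcard change mentioned in the commit log.

```python
from tvm.relay.dataflow_pattern import is_op, wildcard

def make_dense_gelu_pattern():
    """Illustrative pattern for nn.dense followed by a decomposed GeLU:
        gelu(x) = x * (0.5 + 0.5 * erf(x / sqrt(2)))
    The scalar factors are matched with wildcard() rather than is_constant(),
    so the pattern still fires when they are not plain Constant nodes."""
    dense = is_op("nn.dense")(wildcard(), wildcard())
    scaled = is_op("multiply")(dense, wildcard())     # x * (1 / sqrt(2))
    erf = is_op("erf")(scaled)                        # erf(x / sqrt(2))
    half_erf = is_op("multiply")(erf, wildcard())     # 0.5 * erf(...)
    inner = is_op("add")(half_erf, wildcard())        # 0.5 + 0.5 * erf(...)
    return is_op("multiply")(dense, inner)            # x * (0.5 + 0.5 * erf(...))
```

A pattern like this would then be registered in the pattern table consumed by MergeComposite / partition_for_cutlass, so the whole dense plus GeLU chain is handed to the CUTLASS codegen as a single composite function.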
cc @comaniac @Laurawly @zhiics @hwu36