[CUDA]batch_matmul tensorcore schedule #7146
Conversation
python/tvm/relay/op/strategy/cuda.py
Outdated
if ((M % 8 == 0 and K % 16 == 0 and N % 32 == 0) or \
    (M % 16 == 0 and K % 16 == 0 and N % 16 == 0) or \
    (M % 32 == 0 and K % 16 == 0 and N % 8 == 0)):
Would it be better to also add a data type check here, or to expose some user-defined option?
TensorCore computation has to be done in float16, and I'm not sure whether silently transforming all float32 batch_matmul ops to compute in lower precision introduces a loss of precision.
Besides, TensorCore also supports data types like int8 on newer CUDA versions.
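For illustration, a minimal standalone sketch (not the PR's code) of the condition under discussion: the shape test from the diff above combined with the dtype gate suggested here. `use_tensorcore` is a hypothetical helper name.

```python
# Hypothetical helper: mirror the shape condition from the diff above and add
# the dtype gate, so float32 workloads are not silently downcast to float16.
def use_tensorcore(m, n, k, in_dtype):
    shape_ok = (
        (m % 8 == 0 and k % 16 == 0 and n % 32 == 0)
        or (m % 16 == 0 and k % 16 == 0 and n % 16 == 0)
        or (m % 32 == 0 and k % 16 == 0 and n % 8 == 0)
    )
    # int8 (and int4) WMMA needs a newer compute capability, so a real check
    # would also consult the device or target.
    dtype_ok = in_dtype in ("float16", "int8")
    return shape_ok and dtype_ok

assert use_tensorcore(16, 16, 16, "float16")
assert not use_tensorcore(16, 16, 16, "float32")
```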
I just kept it the same as the code for dense_tensorcore: https://github.com/apache/tvm/blob/main/python/tvm/relay/op/strategy/cuda.py#L679
I think it is a bug in dense_tensorcore. We should not follow that.
shared_shedule(BS, BS_align)

shape = (wmma_m, wmma_n, wmma_k)
in_dtype = 'float16'
Same concerns about the data type as above.
It's fine for this PR, but it would be better to add more checks, or at least a comment saying that TensorCore requires a specific data type, so that if someone runs into trouble they know what to check.
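For reference, a small sketch of the kind of guard/comment being suggested, reusing the names from the diff above; this is not the PR's final code, just an illustration.

```python
# WMMA fragments in this schedule are declared in float16, so inputs reaching
# it must already be (or be cast to) float16.
wmma_m, wmma_n, wmma_k = 16, 16, 16  # one of the three supported fragment shapes
shape = (wmma_m, wmma_n, wmma_k)
in_dtype = "float16"
assert in_dtype == "float16", (
    "batch_matmul_tensorcore computes in float16; int8/int4 WMMA would need "
    "different fragment declarations and a CUDA version check"
)
```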
"The shape of (M, K, N) must be multiple of (16, 16, 16) or (32, 16, 8) or (8, 16, 32) for now" | ||
|
||
x_16 = te.compute((batch, M, K), lambda b, i, k: x[b, i, k].astype('float16')) | ||
y_16 = te.compute((batch, N, K), lambda b, j, k: y[b, j, k].astype('float16')) |
ditto
@Meteorix, many thanks for your PR! The code looks good to me.
The code is fine. Please fix the CI problem.
cc @merrymercy
@Meteorix out of curiosity, can you share some of your benchmarking results? I'd love to know how much faster this performs than cublas.
def verify_batch_matmul(x_batch, y_batch, M, N, K):
    x = te.placeholder((x_batch, M, K), name="x")
    y = te.placeholder((y_batch, N, K), name="y")
    dtype = x.dtype
It may be worth testing other datatypes as well, especially float16.
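As a rough, self-contained sketch of that suggestion (using a plain CPU schedule rather than the TensorCore one, so it runs anywhere), the test above could take the dtype as a parameter instead of relying on the placeholder default:

```python
import numpy as np
import tvm
from tvm import te

def verify_batch_matmul(batch, M, N, K, dtype="float32"):
    # Placeholders carry the dtype under test; accumulation stays in float32.
    x = te.placeholder((batch, M, K), name="x", dtype=dtype)
    y = te.placeholder((batch, N, K), name="y", dtype=dtype)
    k = te.reduce_axis((0, K), name="k")
    out = te.compute(
        (batch, M, N),
        lambda b, i, j: te.sum(
            x[b, i, k].astype("float32") * y[b, j, k].astype("float32"), axis=k
        ),
        name="batch_matmul",
    )
    s = te.create_schedule(out.op)
    f = tvm.build(s, [x, y, out], "llvm")

    x_np = np.random.uniform(size=(batch, M, K)).astype(dtype)
    y_np = np.random.uniform(size=(batch, N, K)).astype(dtype)
    ref = np.matmul(x_np.astype("float32"), y_np.transpose(0, 2, 1).astype("float32"))

    out_tvm = tvm.nd.array(np.zeros((batch, M, N), dtype="float32"))
    f(tvm.nd.array(x_np), tvm.nd.array(y_np), out_tvm)
    # Looser tolerance for float16 inputs, since they are rounded before the matmul.
    tol = 1e-3 if dtype == "float16" else 1e-5
    np.testing.assert_allclose(out_tvm.asnumpy(), ref, rtol=tol, atol=tol)

verify_batch_matmul(2, 16, 16, 16, "float32")
verify_batch_matmul(2, 16, 16, 16, "float16")
```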
We should fix the type issue mentioned by @jcf94.
The existing dense_tensorcore is buggy in my view. We should fix it instead of following it.
This small bug can lead to potential accuracy loss that is very hard to debug.
@jwfromm here are some of the benchmark results (after tuning 1000 times). This schedule beats cublas on some shapes, which is also why I added the cublas autotune in this PR.
@merrymercy I see your point. Maybe we can discuss it with the other tensor core maintainers and file another PR to resolve this issue?
I agree with @merrymercy and think we should fix the type issue that we overlooked before. We can either fix it in this PR or in a separate parallel PR. I'd like to help with that.
Thanks! @Laurawly @merrymercy @Meteorix If we're not going to finish these here, you can add some TODO comments in the code and create a new issue for tracking. Please fix the CI problem and we can merge this.
python/tvm/relay/op/strategy/cuda.py
Outdated
@@ -657,6 +657,23 @@ def batch_matmul_strategy_cuda(attrs, inputs, out_type, target):
            name="batch_matmul_cublas.cuda",
            plevel=15,
        )
    if target.kind.name == "cuda" and nvcc.have_tensorcore(tvm.gpu(0).compute_version):
It's better to use nvcc.have_tensorcore(target=target) here, since tvm.gpu(0) might not exist.
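A small standalone sketch of the suggested call: passing the target lets the TensorCore check work even on hosts without a visible GPU. The target string below is only an example.

```python
import tvm
from tvm.contrib import nvcc

# Query TensorCore support from the target's arch attribute rather than
# probing tvm.gpu(0); result depends on the arch (sm_70 and newer have it).
target = tvm.target.Target("cuda -arch=sm_70")
print(nvcc.have_tensorcore(target=target))
```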
LGTM
@jcf94 @merrymercy @Laurawly the CI finally passed. I have also fixed the dtype check for batch_matmul. Please review this PR again.
@Meteorix Thanks! LGTM.
Let's dismiss this temporarily and continue in #7147.
* add batch_matmul_tensorcore
* add bmm cublas autotune
* add bmm tests
* out_shape for bmm_tensorcore
* fix comments
* code format
* add todos for tensorcore datatype checking
* fix lint
* fix have_tensorcore
* add dtype check for batch_matmul_tensorcore
Created #7277 to track the issue.
Add a batch_matmul TensorCore schedule for BERT inference. It shows better performance than the cublas batch_matmul kernel.
@jcf94 @merrymercy could you help review this PR?
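For context, a hedged usage sketch (not part of the PR) of a BERT-style batched matmul whose dtype and shapes satisfy the constraints discussed above, so the new TensorCore implementation is eligible to be picked on a Volta-or-newer GPU:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# BERT-base self-attention scores: 12 heads, seq_len 128, head_dim 64.
# Every dimension is a multiple of 16 and inputs are float16.
batch, M, N, K = 12, 128, 128, 64
x = relay.var("x", shape=(batch, M, K), dtype="float16")
y = relay.var("y", shape=(batch, N, K), dtype="float16")
mod = tvm.IRModule.from_expr(relay.Function([x, y], relay.nn.batch_matmul(x, y)))

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda")

dev = tvm.gpu(0)
rt = graph_runtime.GraphModule(lib["default"](dev))
rt.set_input("x", np.random.uniform(size=(batch, M, K)).astype("float16"))
rt.set_input("y", np.random.uniform(size=(batch, N, K)).astype("float16"))
rt.run()
out = rt.get_output(0).asnumpy()  # shape (batch, M, N)
```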