[CUDA]batch_matmul tensorcore schedule #7146
Conversation
python/tvm/relay/op/strategy/cuda.py
Outdated
if ((M % 8 == 0 and K % 16 == 0 and N % 32 == 0) or \
    (M % 16 == 0 and K % 16 == 0 and N % 16 == 0) or \
    (M % 32 == 0 and K % 16 == 0 and N % 8 == 0)):
Would it be better to also add a data type check here, or to expose some user-defined option?
TensorCore computation has to be done in float16, and I'm not sure whether silently transforming all float32 batch_matmul ops to compute in lower precision introduces a loss of precision.
Besides, TensorCore also supports data types like int8 on newer CUDA versions.
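For illustration, a minimal standalone sketch (not the PR's code) of the condition under discussion: the shape test from the diff above combined with the dtype gate suggested here. `use_tensorcore` is a hypothetical helper name.

```python
# Hypothetical helper: mirror the shape condition from the diff above and add
# the dtype gate, so float32 workloads are not silently downcast to float16.
def use_tensorcore(m, n, k, in_dtype):
    shape_ok = (
        (m % 8 == 0 and k % 16 == 0 and n % 32 == 0)
        or (m % 16 == 0 and k % 16 == 0 and n % 16 == 0)
        or (m % 32 == 0 and k % 16 == 0 and n % 8 == 0)
    )
    # int8 (and int4) WMMA needs a newer compute capability, so a real check
    # would also consult the device or target.
    dtype_ok = in_dtype in ("float16", "int8")
    return shape_ok and dtype_ok

assert use_tensorcore(16, 16, 16, "float16")
assert not use_tensorcore(16, 16, 16, "float32")
```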
I just kept it the same as the code for dense_tensorcore: https://github.com/apache/tvm/blob/main/python/tvm/relay/op/strategy/cuda.py#L679
I think it is a bug in dense_tensorcore. We should not follow that.
shared_shedule(BS, BS_align)

shape = (wmma_m, wmma_n, wmma_k)
in_dtype = 'float16'
Same concerns about the data type as above.
It's fine for this PR, but it would be better to add more checks, or at least a comment saying that TensorCore requires a specific data type, so that if someone runs into trouble they know what to check.
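For reference, a small sketch of the kind of guard/comment being suggested, reusing the names from the diff above; this is not the PR's final code, just an illustration.

```python
# WMMA fragments in this schedule are declared in float16, so inputs reaching
# it must already be (or be cast to) float16.
wmma_m, wmma_n, wmma_k = 16, 16, 16  # one of the three supported fragment shapes
shape = (wmma_m, wmma_n, wmma_k)
in_dtype = "float16"
assert in_dtype == "float16", (
    "batch_matmul_tensorcore computes in float16; int8/int4 WMMA would need "
    "different fragment declarations and a CUDA version check"
)
```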
"The shape of (M, K, N) must be multiple of (16, 16, 16) or (32, 16, 8) or (8, 16, 32) for now" | ||
|
||
x_16 = te.compute((batch, M, K), lambda b, i, k: x[b, i, k].astype('float16')) | ||
y_16 = te.compute((batch, N, K), lambda b, j, k: y[b, j, k].astype('float16')) |
ditto
@Meteorix, many thanks for your PR! The code looks good to me.
The code is fine. Please fix the CI problem.
cc @merrymercy
@Meteorix out of curiosity, can you share some of your benchmarking results? I'd love to know how much faster this performs than cublas.
def verify_batch_matmul(x_batch, y_batch, M, N, K):
    x = te.placeholder((x_batch, M, K), name="x")
    y = te.placeholder((y_batch, N, K), name="y")
    dtype = x.dtype
It may be worth testing other datatypes as well, especially float16.
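As a rough, self-contained sketch of that suggestion (using a plain CPU schedule rather than the TensorCore one, so it runs anywhere), the test above could take the dtype as a parameter instead of relying on the placeholder default:

```python
import numpy as np
import tvm
from tvm import te

def verify_batch_matmul(batch, M, N, K, dtype="float32"):
    # Placeholders carry the dtype under test; accumulation stays in float32.
    x = te.placeholder((batch, M, K), name="x", dtype=dtype)
    y = te.placeholder((batch, N, K), name="y", dtype=dtype)
    k = te.reduce_axis((0, K), name="k")
    out = te.compute(
        (batch, M, N),
        lambda b, i, j: te.sum(
            x[b, i, k].astype("float32") * y[b, j, k].astype("float32"), axis=k
        ),
        name="batch_matmul",
    )
    s = te.create_schedule(out.op)
    f = tvm.build(s, [x, y, out], "llvm")

    x_np = np.random.uniform(size=(batch, M, K)).astype(dtype)
    y_np = np.random.uniform(size=(batch, N, K)).astype(dtype)
    ref = np.matmul(x_np.astype("float32"), y_np.transpose(0, 2, 1).astype("float32"))

    out_tvm = tvm.nd.array(np.zeros((batch, M, N), dtype="float32"))
    f(tvm.nd.array(x_np), tvm.nd.array(y_np), out_tvm)
    # Looser tolerance for float16 inputs, since they are rounded before the matmul.
    tol = 1e-3 if dtype == "float16" else 1e-5
    np.testing.assert_allclose(out_tvm.asnumpy(), ref, rtol=tol, atol=tol)

verify_batch_matmul(2, 16, 16, 16, "float32")
verify_batch_matmul(2, 16, 16, 16, "float16")
```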
We should fix the type issue mentioned by @jcf94.
The existing dense_tensorcore is buggy in my view. We should fix it instead of following it.
This small bug can lead to potential accuracy loss that is very hard to debug.
@jwfromm here are some of the benchmark results (after tuning 1000 times). This schedule beats cublas on some shapes, which is also why I added the cublas autotune in this PR.
@merrymercy I see your point. Maybe we can discuss it with the other tensor core maintainers and file another PR to resolve this issue?
I agree with @merrymercy and think we should fix the type issue that we overlooked before. We can either fix it in this PR or in a separate parallel PR. I'd like to help with that.
Thanks! @Laurawly @merrymercy @Meteorix If we're not going to finish these here, you can add some TODO comments in the code and create a new issue for tracking. Please fix the CI problem and we can merge this.
python/tvm/relay/op/strategy/cuda.py
Outdated
@@ -657,6 +657,23 @@ def batch_matmul_strategy_cuda(attrs, inputs, out_type, target):
            name="batch_matmul_cublas.cuda",
            plevel=15,
        )
    if target.kind.name == "cuda" and nvcc.have_tensorcore(tvm.gpu(0).compute_version):
It's better to use nvcc.have_tensorcore(target=target) here, since tvm.gpu(0) might not exist.
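A small standalone sketch of the suggested call: passing the target lets the TensorCore check work even on hosts without a visible GPU. The target string below is only an example.

```python
import tvm
from tvm.contrib import nvcc

# Query TensorCore support from the target's arch attribute rather than
# probing tvm.gpu(0); result depends on the arch (sm_70 and newer have it).
target = tvm.target.Target("cuda -arch=sm_70")
print(nvcc.have_tensorcore(target=target))
```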
LGTM
@jcf94 @merrymercy @Laurawly the CI finally passed. I have also fixed the dtype check for batch_matmul. Please review this PR again.
@Meteorix Thanks! LGTM.
Let's dismiss this temporarily and continue in #7147.
* add batch_matmul_tensorcore
* add bmm cublas autotune
* add bmm tests
* out_shape for bmm_tensorcore
* fix comments
* code format
* add todos for tensorcore datatype checking
* fix lint
* fix have_tensorcore
* add dtype check for batch_matmul_tensorcore
Created #7277 to track the issue.
Add a batch_matmul TensorCore schedule for BERT inference. It shows better performance than the cublas batch_matmul kernel.
@jcf94 @merrymercy could you help review this PR?
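For context, a hedged usage sketch (not part of the PR) of a BERT-style batched matmul whose dtype and shapes satisfy the constraints discussed above, so the new TensorCore implementation is eligible to be picked on a Volta-or-newer GPU:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# BERT-base self-attention scores: 12 heads, seq_len 128, head_dim 64.
# Every dimension is a multiple of 16 and inputs are float16.
batch, M, N, K = 12, 128, 128, 64
x = relay.var("x", shape=(batch, M, K), dtype="float16")
y = relay.var("y", shape=(batch, N, K), dtype="float16")
mod = tvm.IRModule.from_expr(relay.Function([x, y], relay.nn.batch_matmul(x, y)))

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda")

dev = tvm.gpu(0)
rt = graph_runtime.GraphModule(lib["default"](dev))
rt.set_input("x", np.random.uniform(size=(batch, M, K)).astype("float16"))
rt.set_input("y", np.random.uniform(size=(batch, N, K)).astype("float16"))
rt.run()
out = rt.get_output(0).asnumpy()  # shape (batch, M, N)
```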