[CUDA] Improve injective schedule to enable half2 #8457
Conversation
Thanks!
I'm wondering whether loop partitioning can solve this problem? @comaniac @junrushao1994
For example, with a dimension of 301, partition it into a 300 loop and a single 1 loop, then vectorize the 300 loop.
Looks like this can be more easily implemented in TensorIR's block scope.
Right. Both loop partitioning and input padding can resolve this issue as well.
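As a rough sketch of the loop-partition idea above (plain NumPy, not the actual TVM pass), assuming a simple element-wise op: split the length-301 loop into a 300-element body that is processed two elements at a time, half2-style, plus a scalar tail, so the body needs no bounds check.

```python
import numpy as np

def add_one_partitioned(x):
    """Conceptual loop partitioning for a non-divisible extent (e.g., 301)."""
    n = x.shape[0]
    vec = 2                    # half2-style vector width
    body = (n // vec) * vec    # 300: largest prefix divisible by vec
    one = np.float16(1.0)
    out = np.empty_like(x)

    # Partitioned body: processed in pairs, no per-element guard needed.
    for i in range(0, body, vec):
        out[i:i + vec] = x[i:i + vec] + one

    # Scalar tail: the leftover iteration(s), handled separately.
    for i in range(body, n):
        out[i] = x[i] + one
    return out

x = np.random.rand(301).astype("float16")
np.testing.assert_allclose(add_one_partitioned(x), x + np.float16(1.0))
```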
* [CUDA] Improve injective schedule to enable half2
* lint
* fix
* trigger ci
Per discussion in https://discuss.tvm.apache.org/t/cuda-enable-half2-in-cuda-injective-schedule/10441, this PR improves the CUDA injective schedule to benefit more from `half2` when working on `float16`.

The background is that although the CUDA injective schedule does vectorize the innermost loop when working on float16, the vectorization may fail due to the if-conditions introduced by non-dividable workload and block/thread sizes. Formally, vectorization requires `prod(output_shape) % block % thread % vector_width == 0`. To make sure vectorization is effective, this PR adjusts the block and thread sizes accordingly (see the code change for details).
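To make the failure mode concrete, here is a conceptual Python stand-in for the generated loop nest (the thread count, vector width, and loop structure are illustrative, not the actual emitted CUDA): when the flattened extent is not a multiple of `thread * vector_width`, every would-be `half2` lane carries a bounds check, so the two lanes cannot be fused into a single vector access.

```python
import math

extent = 301 * 5      # any flattened output size not divisible by thread * vec
thread, vec = 64, 2   # illustrative thread count and vector width
blocks = math.ceil(extent / (thread * vec))

for bx in range(blocks):          # blockIdx.x
    for tx in range(thread):      # threadIdx.x
        base = (bx * thread + tx) * vec
        for lane in range(vec):            # the would-be half2 lanes
            if base + lane < extent:       # guard from the non-exact split
                pass                       # out[base + lane] = op(inp[base + lane])
```

If `extent % (thread * vec) == 0`, the guard disappears and the two lanes can be emitted as one `half2` load/store.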
On the other hand, when the output shapes are weird (e.g., prime numbers), the selected block and thread sizes may be too small. For example, if the output shape is `(311, 3814)`, then the factors are `(1, 2, 311, 1907, 3814)`. As a result, we may select `(block, thread) = (2, 311)` given the maximum `(block, thread) = (256, 1024)`. In this case, we don't utilize the compute resources well even when `half2` is enabled.
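A rough sketch of where that factor set and the small `(block, thread)` pair come from (the selection rule below is a hypothetical illustration, not the exact logic in the code change):

```python
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

# Combined factors of the two output dimensions: 311 is prime and
# 3814 = 2 * 1907, which gives exactly {1, 2, 311, 1907, 3814}.
shape = (311, 3814)
candidates = sorted({d for dim in shape for d in divisors(dim)})
print(candidates)          # [1, 2, 311, 1907, 3814]

# Hypothetical selection: the largest candidates that still fit under the
# maxima quoted above.
MAX_BLOCK, MAX_THREAD = 256, 1024
block = max(d for d in candidates if d <= MAX_BLOCK)     # 2
thread = max(d for d in candidates if d <= MAX_THREAD)   # 311
print(block, thread)       # 2 311 -- far too few threads to fill the GPU
```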
Ideally, we should pad the output so that the factors are always powers of two, but that is too complicated and may introduce other issues. Accordingly, another heuristic introduced by this PR is that when `(select_block * select_thread) / (max_block * max_thread) < R`, we don't apply the change and let the vectorization fail.
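The threshold check itself can be sketched as follows (illustrative only; `worth_vectorizing` and the module-level constants are placeholder names, not identifiers from the actual code):

```python
MAX_BLOCK, MAX_THREAD = 256, 1024   # maxima quoted above
R = 0.7                             # threshold used in the evaluation below

def worth_vectorizing(select_block, select_thread, r=R):
    # Apply the adjusted (block, thread) sizes only if they still occupy a
    # reasonable fraction of the maximum launch configuration; otherwise
    # keep the original schedule and accept that vectorization fails.
    utilization = (select_block * select_thread) / (MAX_BLOCK * MAX_THREAD)
    return utilization >= r

print(worth_vectorizing(256, 1024))  # True  (full utilization)
print(worth_vectorizing(2, 311))     # False (the (311, 3814) example above)
```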
Here are the evaluation results when `R=0.7`. For each platform, I show the worst, the best, and the average speedup of all workloads over the current upstream.
cc @vinx13 @wpan11nv @Laurawly @masahi