[CUDA] Improve injective schedule to enable half2 #8457
Conversation
Thanks!
I'm wondering whether loop partitioning can solve this problem? @comaniac @junrushao1994
For example, with a dimension of 301, partition it into a 300 loop and a single 1 loop, then vectorize the 300 loop.
Looks like this can be more easily implemented in TensorIR's block scope.
Right. Both loop partitioning and input padding can resolve this issue as well.
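As a rough sketch of the loop-partition idea above (plain NumPy, not the actual TVM pass), assuming a simple element-wise op: split the length-301 loop into a 300-element body that is processed two elements at a time, half2-style, plus a scalar tail, so the body needs no bounds check.

```python
import numpy as np

def add_one_partitioned(x):
    """Conceptual loop partitioning for a non-divisible extent (e.g., 301)."""
    n = x.shape[0]
    vec = 2                    # half2-style vector width
    body = (n // vec) * vec    # 300: largest prefix divisible by vec
    one = np.float16(1.0)
    out = np.empty_like(x)

    # Partitioned body: processed in pairs, no per-element guard needed.
    for i in range(0, body, vec):
        out[i:i + vec] = x[i:i + vec] + one

    # Scalar tail: the leftover iteration(s), handled separately.
    for i in range(body, n):
        out[i] = x[i] + one
    return out

x = np.random.rand(301).astype("float16")
np.testing.assert_allclose(add_one_partitioned(x), x + np.float16(1.0))
```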
* [CUDA] Improve injective schedule to enable half2
* lint
* fix
* trigger ci
Per discussion in https://discuss.tvm.apache.org/t/cuda-enable-half2-in-cuda-injective-schedule/10441, this PR improves the CUDA injective schedule to benefit more from `half2` when working on `float16`.

The background is that although the CUDA injective schedule does vectorize the innermost loop when working on float16, the vectorization may fail due to the if-conditions introduced by non-dividable workload and block/thread sizes. Formally, vectorization requires `prod(output_shape) % block % thread % vector_width == 0`. To make sure vectorization is effective, this PR adjusts the block and thread sizes accordingly (see the code change for details).
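To make the failure mode concrete, here is a conceptual Python stand-in for the generated loop nest (the thread count, vector width, and loop structure are illustrative, not the actual emitted CUDA): when the flattened extent is not a multiple of `thread * vector_width`, every would-be `half2` lane carries a bounds check, so the two lanes cannot be fused into a single vector access.

```python
import math

extent = 301 * 5      # any flattened output size not divisible by thread * vec
thread, vec = 64, 2   # illustrative thread count and vector width
blocks = math.ceil(extent / (thread * vec))

for bx in range(blocks):          # blockIdx.x
    for tx in range(thread):      # threadIdx.x
        base = (bx * thread + tx) * vec
        for lane in range(vec):            # the would-be half2 lanes
            if base + lane < extent:       # guard from the non-exact split
                pass                       # out[base + lane] = op(inp[base + lane])
```

If `extent % (thread * vec) == 0`, the guard disappears and the two lanes can be emitted as one `half2` load/store.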
On the other hand, when the output shapes are weird (e.g., prime numbers), the selected block and thread sizes may be too small. For example, if the output shape is `(311, 3814)`, then the factors are `(1, 2, 311, 1907, 3814)`. As a result, we may select `(block, thread) = (2, 311)` given the maximum `(block, thread) = (256, 1024)`. In this case, we don't utilize the compute resources well even when `half2` is enabled.
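A rough sketch of where that factor set and the small `(block, thread)` pair come from (the selection rule below is a hypothetical illustration, not the exact logic in the code change):

```python
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

# Combined factors of the two output dimensions: 311 is prime and
# 3814 = 2 * 1907, which gives exactly {1, 2, 311, 1907, 3814}.
shape = (311, 3814)
candidates = sorted({d for dim in shape for d in divisors(dim)})
print(candidates)          # [1, 2, 311, 1907, 3814]

# Hypothetical selection: the largest candidates that still fit under the
# maxima quoted above.
MAX_BLOCK, MAX_THREAD = 256, 1024
block = max(d for d in candidates if d <= MAX_BLOCK)     # 2
thread = max(d for d in candidates if d <= MAX_THREAD)   # 311
print(block, thread)       # 2 311 -- far too few threads to fill the GPU
```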
Ideally, we should pad the output so that the factors are always powers of two, but that is too complicated and may introduce other issues. Accordingly, another heuristic introduced by this PR is that when `(select_block * select_thread) / (max_block * max_thread) < R`, we don't apply the change and let the vectorization fail.
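The threshold check itself can be sketched as follows (illustrative only; `worth_vectorizing` and the module-level constants are placeholder names, not identifiers from the actual code):

```python
MAX_BLOCK, MAX_THREAD = 256, 1024   # maxima quoted above
R = 0.7                             # threshold used in the evaluation below

def worth_vectorizing(select_block, select_thread, r=R):
    # Apply the adjusted (block, thread) sizes only if they still occupy a
    # reasonable fraction of the maximum launch configuration; otherwise
    # keep the original schedule and accept that vectorization fails.
    utilization = (select_block * select_thread) / (MAX_BLOCK * MAX_THREAD)
    return utilization >= r

print(worth_vectorizing(256, 1024))  # True  (full utilization)
print(worth_vectorizing(2, 311))     # False (the (311, 3814) example above)
```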
Here are the evaluation results when `R=0.7`. For each platform, I show the worst, the best, and the average speedup of all workloads over the current upstream.
cc @vinx13 @wpan11nv @Laurawly @masahi