[Target] Add __launch_bounds__ directive as part of the CUDA code generation #8678
Conversation
I am very much in favor of this approach. Thanks Bojian!
The syntax should be However, in the failing test, I am seeing:
Would you like to double check? @ArmageddonKnight
@junrushao1994 Thanks for letting me know. I had a look into this issue. The problem is caused by assigning I ran the test case again and it works locally.
Thanks Bojian! It makes perfect sense to me, which reminds me of a similar bug we encountered in TensorIR lowering :-)
@vinx13 @comaniac @junrushao1994 FYI, the patch has passed all the checks. Could you please merge it if possible?
Thanks Bojian! It's really a super useful improvement in CUDA codegen
Thanks @junrushao1994
Short Summary
Sometimes, when executing CUDA kernels, we might encounter the error CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES (e.g., here). This happens because the nvcc compiler allocates too many registers per thread. If we then launch the CUDA kernel with too many threads, the GPU notices that the kernel requests more registers than are available on the chip and refuses to launch it. This implies that we need a way of telling nvcc what to expect in terms of the number of threads per block. Luckily, the __launch_bounds__ directive can help us achieve exactly that. In this patch, we add __launch_bounds__ as part of the CUDA code generation procedure. __launch_bounds__ will be printed automatically whenever the number of threads per block is detected to be a constant integer value. Passing this information to nvcc allows it to spill registers if needed, which might hurt performance but is still better than having a CUDA kernel that does not run at all.

Q & A
Q: Would this affect the AutoTVM and the auto-scheduler submodules?
A: No. Although in those cases the number of threads keeps changing at each trial, the number is fixed to a constant by the time code generation happens. Furthermore, when the number of threads per block is not a constant, __launch_bounds__ is simply not printed.

Any feedback on this patch is appreciated. @comaniac @icemelon @yzhliu @yidawang
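For illustration, a kernel emitted with the directive would look roughly like the sketch below. The kernel name, signature, body, and the thread count of 256 are all hypothetical; only the placement of `__launch_bounds__` (between the `__global__` qualifier and the kernel name) reflects standard CUDA syntax:

```cuda
// Hypothetical sketch of a generated kernel when the number of threads per
// block is known at codegen time to be the constant 256.
// __launch_bounds__(256) tells nvcc that the kernel will never be launched
// with more than 256 threads per block, so it can cap per-thread register
// usage (spilling if necessary) and avoid
// CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES at launch time.
extern "C" __global__ void __launch_bounds__(256)
    my_kernel(float* __restrict__ out, const float* __restrict__ in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i] * 2.0f;
  }
}
```

Without the directive, nvcc has to assume the maximum block size the architecture allows when allocating registers, which is exactly the situation that can make an aggressive schedule unlaunchable.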