[Target] Add __launch_bounds__ directive as part of the CUDA code generation #8678
Conversation
I am very much in favor of this approach. Thanks Bojian!
The syntax should be However, in the failing test, I am seeing:
Would you like to double check? @ArmageddonKnight
@junrushao1994 Thanks for letting me know. I had a look into this issue. The problem is caused by assigning I ran the test case again and it works locally.
Thanks Bojian! It makes perfect sense to me, which reminds me of a similar bug we encountered in TensorIR lowering :-)
@vinx13 @comaniac @junrushao1994 FYI, the patch has passed all the checks. Could you please merge it if possible?
Thanks Bojian! It's really a super useful improvement in CUDA codegen
Thanks @junrushao1994
Short Summary
Sometimes, when executing CUDA kernels, we might encounter the error CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES (e.g., here). This happens because the nvcc compiler allocates too many registers per thread. If we then launch the CUDA kernel with too many threads, the GPU notices that the kernel requests more registers than are available on the chip and refuses to launch it. This implies that we need a way of telling nvcc what to expect in terms of the number of threads per block. Luckily, the __launch_bounds__ directive can help us achieve exactly that. In this patch, we add __launch_bounds__ as part of the CUDA code generation procedure. __launch_bounds__ will be printed automatically whenever the number of threads per block is detected to be a constant integer value. Passing this information to nvcc allows it to spill registers if needed, which might hurt performance but is still better than having a CUDA kernel that does not run at all.

Q & A
Q: Would this affect the AutoTVM and the auto-scheduler submodules?
A: No. Although in those cases the number of threads keeps changing at each trial, the number is fixed to a constant by the time code generation happens. Furthermore, when the number of threads per block is not a constant, __launch_bounds__ is simply not printed.

Any feedback on this patch is appreciated. @comaniac @icemelon @yzhliu @yidawang
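For illustration, a kernel emitted with the directive would look roughly like the sketch below. The kernel name, signature, body, and the thread count of 256 are all hypothetical; only the placement of `__launch_bounds__` (between the `__global__` qualifier and the kernel name) reflects standard CUDA syntax:

```cuda
// Hypothetical sketch of a generated kernel when the number of threads per
// block is known at codegen time to be the constant 256.
// __launch_bounds__(256) tells nvcc that the kernel will never be launched
// with more than 256 threads per block, so it can cap per-thread register
// usage (spilling if necessary) and avoid
// CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES at launch time.
extern "C" __global__ void __launch_bounds__(256)
    my_kernel(float* __restrict__ out, const float* __restrict__ in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i] * 2.0f;
  }
}
```

Without the directive, nvcc has to assume the maximum block size the architecture allows when allocating registers, which is exactly the situation that can make an aggressive schedule unlaunchable.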