-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OpenCL] Add vectorization to cuda conv2d_nhwc schedule #8636
[OpenCL] Add vectorization to cuda conv2d_nhwc schedule #8636
Conversation
dbbc678
to
5fa169a
Compare
5fa169a
to
1a931a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echuraev Thanks!
Another thing is that you're doing test on a OpenCL device? While the schedule is in a topi/cuda
dir. It would be very nice of you if you can help move this schedule to topi/gpu
dir
@jcf94 thank you for your comments. I applied your comments and moved conv2d_nhwc to
Yes, I run topology from #8146 on my Samsung Galaxy A71 and got about 6-7x performance boost. |
a44c81e
to
ae90d88
Compare
@jcf94 also, I suppose that we should update statistic in tophub and rename |
CC-ing @masahi - would this also benefit Vulkan FP16 codegen? And in general do we want to maintain TOPI definitions for both CUDA and GPU, or rename it all to GPU, to show that it encompasses more than just Nvidia backends? That would I believe be the topic of a larger RFC. @echuraev could you confirm that making those changes to the schedule templates doesn't lead to regressions on the CUDA codegen on Nvidia GPUs? Thank you! |
Thanks @echuraev, this is really interesting. With the new commit, I was able to generate a packed float16x2 instruction for AMD Vega or newer GPUs, which is supposed to be 2x faster than normal fp32 or fp16 instructions https://gist.github.com/masahi/2de1a7dc87e2068ffb50ba6135273f95#file-conv2d_nhwc_float16x2-s-L495-L496. I'm going to experiment further. |
CC @mei-ye on the above comment - investigating if using vector types can lead to improved FP16 codegen on AMD GPUs |
e19c67a
to
256f524
Compare
@mbrookhart I wasn't able to reproduce the regression on NVidia 3070. As a possible solution, I can create a separate
|
256f524
to
6d9d210
Compare
@jcf94, @masahi, @tmoreau89 could you please take a look one again on this PR? |
So it appears @mbrookhart has encountered a regression and that regression wasn't reproduced by @echuraev on a slightly different GPU. The possible solution in order to be extra safe would be to introduce a new set of TOPI schedules for vectorized GPU operator implementations. WDYT @jwfromm and @mbrookhart |
Sorry, my mistake. I didn't write an update. @mbrookhart has remeasured autotvm tuning on the latest code from my branch and he got the same numbers as he got on the main branch. Now the table can be updated:
So, we don't have regression on the latest code. |
I think it is good to go as long as there would be no concern for regression. Does the new schedule subsume the old schedule in its search space? |
Yes, it does. When vectorization factor is equal to 1, then generated code should be the same as it was with old schedule. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jcf94
@jcf94 can you please take another look at this? It's currently being blocked by your request for changes. |
@jcf94 Could you please take a look one again? |
@jcf94 - please update the PR review based on latest comments and updates to the PR. I will dismiss your review in 24hrs so we can unblock this PR given that we haven't heard back from you. |
Adding vectorization significantly improved performance. About 6-7x boost.
6d9d210
to
828cdfa
Compare
Change requested occurred on Aug 5th by jcf96. On 6th Egor addressed the changes. We then pinged 23, 17, 10, 6, and 3 days ago but no response. As such in order to unlock this PR I will dismiss the review.
I've dismissed the review that was blocking progress. In my view we've provided sufficient ping to notify the reviewer who requested changes. @jwfromm I'll leave it to you to merge the PR when convenient. |
Thanks @echuraev, this is now merged. |
* main: (80 commits) Introduce centralised name transformation functions (apache#9088) [OpenCL] Add vectorization to cuda conv2d_nhwc schedule (apache#8636) [6/6] Arm(R) Ethos(TM)-U NPU codegen integration with `tvmc` (apache#8854) [microTVM] Add wrapper for creating project using a MLF (apache#9090) Fix typo (apache#9156) [Hotfix][Testing] Wait for RPCServer to be established (apache#9150) Update find cublas so it search default path if needed. (apache#9149) [TIR][LowerMatchBuffer] Fix lowering strides when source region has higher dimension than the buffer (apache#9145) Fix flaky NMS test by making sure scores are unique (apache#9140) [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038) [LLVM] Make changes needed for opaque pointers (apache#9138) Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849) [CI] Split Integration tests out of first phase of pipeline (apache#9128) [Meta Schedule][M3b] Runner (apache#9111) Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141) [TIR] add loop partition hint pragma (apache#9121) fix things (apache#9146) [Meta Schedule][M3a] SearchStrategy (apache#9132) [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133) [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143) ...
* Add vectorization to cuda conv2d_nhwc schedule Adding vectorization significantly improved performance. About 6-7x boost. * Apply comment * Move schedule to topi/gpu dir * Add vectorization to inner loop * Update values of vectorization factor
* Add vectorization to cuda conv2d_nhwc schedule Adding vectorization significantly improved performance. About 6-7x boost. * Apply comment * Move schedule to topi/gpu dir * Add vectorization to inner loop * Update values of vectorization factor
Adding vectorization significantly improved performance. About 6-7x
boost.
Thanks for contributing to TVM! Please refer to guideline https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.