[PERF] Moving GPU softmax to RTC and optimizations #19905
Hey @ptrendx, thanks for submitting the PR.
CI supported jobs: [unix-cpu, centos-cpu, unix-gpu, windows-gpu, clang, website, centos-gpu, edge, miscellaneous, sanity, windows-cpu]
A few minor things.
Apart from that, I think vectorization should be independent of RTC.
Even without RTC, it would be nice to have a vectorization mechanism where each input/output could create a vectorization object if required, with its own parameters, alignment, etc.
About vectorization being independent of RTC: generally I agree with you, and the first approach to vectorization actually predates the introduction of RTC. The problem was that it produced a large number of kernels, which bloated the library size and increased GPU memory usage (see PR #17767 and then issue #18280). That is why I reintroduced it as part of the RTC effort, to make sure that only the needed kernels get compiled.
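To illustrate the "only the needed kernels get compiled" point, here is a minimal host-side sketch of an on-demand kernel cache. It is not MXNet's RTC code: `KernelCache`, the string keys, and the `std::function` stand-in for a compiled kernel are all hypothetical, and "compilation" is simulated by constructing a lambda. The idea it shows is that a kernel variant is built the first time a given type/vector-width combination is requested, instead of every combination being instantiated ahead of time.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a runtime-compiled kernel: given an element
// count, it reports how many vectorized iterations are needed to cover it.
using Kernel = std::function<int(int)>;

class KernelCache {
 public:
  // Returns the kernel registered under `key`, "compiling" it on first use.
  // A later request with the same key reuses the cached entry.
  const Kernel& get(const std::string& key, int vector_width) {
    assert(vector_width > 0);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      ++compile_count_;  // a real RTC path would invoke the compiler here
      it = cache_.emplace(key, [vector_width](int n) {
        return (n + vector_width - 1) / vector_width;  // ceil(n / width)
      }).first;
    }
    return it->second;
  }

  // Number of distinct kernels that were actually built.
  int compile_count() const { return compile_count_; }

 private:
  std::unordered_map<std::string, Kernel> cache_;
  int compile_count_ = 0;
};
```

With this shape, requesting the same key twice builds one kernel, and a combination that is never requested costs neither library size nor GPU memory.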
@mxnet-bot run ci [centos-gpu, unix-cpu]
Jenkins CI successfully triggered: [centos-gpu, unix-cpu]
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered: [unix-cpu]
Thanks @ptrendx, looks good to me.
Description
This PR moves the GPU softmax implementation (not yet the masked softmax implementation) to use RTC and adds multiple optimizations to it to improve performance.
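For readers unfamiliar with the operation being optimized, the following is a CPU reference sketch of a numerically stable softmax over a single row. It is not the kernel from this PR; `softmax_row` is a hypothetical helper written only to show the computation (subtract the row maximum, exponentiate, normalize) that a GPU implementation has to reproduce.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over one row (illustrative CPU reference).
// Subtracting the row maximum keeps exp() from overflowing on large inputs
// without changing the result.
std::vector<float> softmax_row(const std::vector<float>& x) {
  assert(!x.empty());
  const float row_max = *std::max_element(x.begin(), x.end());

  std::vector<float> out(x.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = std::exp(x[i] - row_max);
    sum += out[i];
  }
  for (float& v : out) {
    v /= sum;  // normalize so the row sums to 1
  }
  return out;
}
```

The row needs one pass for the maximum, one for the sum of exponentials, and one for normalization, which is why memory access patterns tend to dominate the cost of a GPU version.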
Checklist
Essentials
Changes
type_util::mixed_type<DType, DType2> instead of the previous verbose typename type_util::mixed_type<DType, DType2>::type, and an arbitrary number of types can be passed as template arguments)
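The difference between the two spellings can be sketched with a variadic alias template. This is an assumption-laden illustration, not MXNet's `type_util`: the `sketch` namespace, `mixed_type_trait`, `add_old`, and `add_new` are hypothetical, and `std::common_type` stands in for whatever mixing rule the real utility applies.

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {  // hypothetical namespace, not MXNet's type_util

// Old style: a trait struct whose result is read through a nested ::type,
// which forces callers in templates to write `typename ...::type`.
template <typename T, typename U>
struct mixed_type_trait {
  using type = typename std::common_type<T, U>::type;
};

// New style: a variadic alias template. It needs no `typename`/`::type`
// at the call site and accepts any number of types.
template <typename... Ts>
using mixed_type = typename std::common_type<Ts...>::type;

}  // namespace sketch

// Verbose spelling at the call site.
template <typename DType, typename DType2>
typename sketch::mixed_type_trait<DType, DType2>::type add_old(DType a, DType2 b) {
  return a + b;
}

// Shorter spelling with the alias.
template <typename DType, typename DType2>
sketch::mixed_type<DType, DType2> add_new(DType a, DType2 b) {
  return a + b;
}

static_assert(std::is_same<sketch::mixed_type<int, float>, float>::value,
              "two types mix to float under std::common_type");
static_assert(std::is_same<sketch::mixed_type<int, float, double>, double>::value,
              "three types are accepted by the variadic alias");
```

Both functions return the same type for the same arguments; only the spelling in the signature differs.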