This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[PERF] Moving GPU softmax to RTC and optimizations #19905

Merged: 6 commits, Apr 26, 2021


ptrendx
Member

@ptrendx ptrendx commented Feb 16, 2021

Description

This PR moves the GPU softmax implementation (not yet the masked softmax implementation) to use RTC and adds multiple optimizations to it to improve performance.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Moved both stride1 and non-stride1 versions of the softmax kernels to use RTC
  • The performance of the non-stride1 version was improved by running multiple rows per block and coalescing memory accesses. Benchmarks show ~4x improvement in time for the typical case, and much more (up to ~40x) when the size of the row over which the summation happens is very small.
  • The performance of the stride1 kernel was improved by having the entire block collectively load multiple rows into shared memory and by increasing the amount of work per thread (including the ability for an entire row to be summed by a single thread, down from the minimum of 1 full warp per row in the previous version).
  • The vectorization requirements of the previous implementation were eliminated, resulting in an especially large speedup for cases where the row length is odd.
  • The stride1 kernel can now be used when the type of the output does not match the type of input (e.g. float16 input, float32 output)
  • Overall, the speedup of the stride1 kernel ranges from 1.1x for BERT-like shapes (12 * 32, 128, 128), through ~2x for typical sizes with even row length and ~4x for typical sizes with odd row length, to >20x for sizes with very small row length.
  • Performance improvements quoted in the previous points are for the forward pass, but backward has similar (albeit slightly smaller) performance improvements.
  • Improved the mixed_type utility for RTC kernels (one can now write type_util::mixed_type<DType, DType2> instead of the previous, more verbose typename type_util::mixed_type<DType, DType2>::type, and an arbitrary number of types can be passed as template arguments)
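
The last two ideas in the list (mixed input/output types and the shorter mixed_type spelling) can be illustrated with a small host-side sketch. This is not the code from this PR: the names softmax_sketch, mixed_type_impl and softmax_strided are hypothetical, std::common_type stands in for whatever promotion rule the real type_util::mixed_type applies, and the loop is a plain CPU reference rather than a GPU kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

namespace softmax_sketch {

// Trait form: callers would have to write
// `typename mixed_type_impl<A, B>::type`.
template <typename... Ts>
struct mixed_type_impl {
  // Assumption: std::common_type is only a stand-in promotion rule.
  using type = std::common_type_t<Ts...>;
};

// Alias form: callers write `mixed_type<A, B>` with any number of types.
template <typename... Ts>
using mixed_type = typename mixed_type_impl<Ts...>::type;

// CPU reference softmax over `length` elements spaced `stride` apart.
// The input and output element types may differ; accumulation happens in
// the mixed type of both (at least float). Requires length > 0.
template <typename IType, typename OType>
void softmax_strided(const IType* in, OType* out,
                     std::size_t length, std::size_t stride) {
  using AType = mixed_type<IType, OType, float>;

  // Subtract the maximum first so that exp() cannot overflow.
  AType max_val = static_cast<AType>(in[0]);
  for (std::size_t i = 1; i < length; ++i) {
    const AType v = static_cast<AType>(in[i * stride]);
    if (v > max_val) max_val = v;
  }

  AType sum = static_cast<AType>(0);
  for (std::size_t i = 0; i < length; ++i) {
    sum += std::exp(static_cast<AType>(in[i * stride]) - max_val);
  }

  for (std::size_t i = 0; i < length; ++i) {
    const AType e = std::exp(static_cast<AType>(in[i * stride]) - max_val);
    out[i * stride] = static_cast<OType>(e / sum);
  }
}

}  // namespace softmax_sketch
```

Calling softmax_sketch::softmax_strided(in, out, 3, 2) with a float input and a double output normalizes elements 0, 2 and 4 and leaves the interleaved elements untouched, which mirrors the non-stride1 access pattern on a single row.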

@ptrendx ptrendx requested a review from DickJC123 February 16, 2021 23:57
@mxnet-bot

Hey @ptrendx , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, centos-cpu, unix-gpu, windows-gpu, clang, website, centos-gpu, edge, miscellaneous, sanity, windows-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 16, 2021
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 18, 2021
@ptrendx ptrendx requested a review from MoisesHer March 1, 2021 20:16
Contributor

@MoisesHer MoisesHer left a comment


A few minor things.
Apart from that, I think vectorization should be independent of RTC.
Even without RTC, it would be nice to have a vectorization mechanism where each input/output could create a vectorization object if required, with its own parameters, alignment, etc.

@ptrendx
Member Author

ptrendx commented Apr 22, 2021

Regarding vectorization being independent of RTC: generally I agree with you, and the first approach to vectorization actually predates the introduction of RTC. The problem, however, was that it produced quite a lot of kernels, bloating the library size and increasing GPU memory usage (see PR #17767 and then issue #18280). That is why I reintroduced it as part of the RTC effort, to make sure that only the needed kernels get compiled.
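
The trade-off described above can be sketched on the host side as a lazily populated kernel cache: instead of instantiating every (input type, output type, vector width) combination ahead of time, a variant is built only when it is first requested. Everything here is hypothetical (rtc_sketch, KernelKey, LazyKernelCache), and the "compile" step is a placeholder for a real runtime compilation call, not MXNet's RTC API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <tuple>

namespace rtc_sketch {

// Hypothetical key: input dtype name, output dtype name, vector width.
using KernelKey = std::tuple<std::string, std::string, int>;

class LazyKernelCache {
 public:
  using Kernel = std::function<void()>;

  // Returns the kernel for `key`, "compiling" it on first use only.
  const Kernel& get(const KernelKey& key) {
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      // Placeholder for runtime compilation of exactly this variant.
      ++compile_count_;
      it = cache_.emplace(key, Kernel([] {})).first;
    }
    return it->second;
  }

  // Number of variants that were actually built.
  std::size_t compiled() const { return compile_count_; }

 private:
  std::map<KernelKey, Kernel> cache_;
  std::size_t compile_count_ = 0;
};

}  // namespace rtc_sketch
```

With this structure, the number of built variants grows with the combinations a workload actually uses rather than with the full cross product of types and vector widths.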

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Apr 23, 2021
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Apr 23, 2021
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Apr 24, 2021
@ptrendx
Member Author

ptrendx commented Apr 24, 2021

@mxnet-bot run ci [centos-gpu, unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-gpu, unix-cpu]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Apr 24, 2021
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Apr 24, 2021
@ptrendx
Member Author

ptrendx commented Apr 26, 2021

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Apr 26, 2021
@MoisesHer
Contributor

Thanks @ptrendx, looks good to me.

@MoisesHer MoisesHer merged commit c692770 into apache:master Apr 26, 2021