[PERF] Moving GPU softmax to RTC and optimizations #19905

ptrendx · 2021-02-16T23:57:24Z

Description

This PR moves the GPU softmax implementation (not yet the masked softmax implementation) to use RTC and adds multiple optimizations to it to improve performance.

Checklist

Essentials

PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Changes

Moved both stride1 and non-stride1 versions of the softmax kernels to use RTC
Improved the performance of the non-stride1 version by processing multiple rows per block and coalescing memory accesses. Benchmarks show a ~4x reduction in time for the typical case, and much more (up to ~40x) when the row over which the summation happens is very short.
Improved the performance of the stride1 kernel by having the entire block collectively load multiple rows into shared memory and by increasing the amount of work per thread (an entire row can now be summed by a single thread, down from the minimum of 1 full warp per block in the previous version).
Eliminated the vectorization requirements of the previous implementation, which gives an especially large speedup when the row length is odd.
The stride1 kernel can now be used when the type of the output does not match the type of input (e.g. float16 input, float32 output)
Overall, the speedup of the stride1 kernel ranges from 1.1x for BERT-like shapes (12 * 32, 128, 128), through ~2x for typical sizes with even row length and ~4x for typical sizes with odd row length, to >20x for sizes with very small row length.
Performance improvements quoted in the previous points are for the forward pass, but backward has similar (albeit slightly smaller) performance improvements.
Improved the mixed_type utility for RTC kernels (one can now write type_util::mixed_type<DType, DType2> instead of the previous, more verbose typename type_util::mixed_type<DType, DType2>::type, and an arbitrary number of types can be passed as template arguments).
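
To make the terms in the list above concrete, here is a minimal CPU-side sketch in plain C++ (not the RTC kernel from this PR). It treats the tensor as (outer, M, stride), where stride == 1 corresponds to the stride1 case and stride > 1 to the non-stride1 case, and it lets the output type differ from the input type. The `sketch::mixed_type` alias is a hypothetical stand-in built on `std::common_type_t`; it only illustrates the shorter alias-style, variadic usage and is an assumption, not the actual `type_util::mixed_type` implementation. The function name `softmax_reference` is likewise made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

namespace sketch {

// Hypothetical stand-in for type_util::mixed_type (an assumption, not the
// MXNet code): a variadic alias, so no `typename ...::type` is needed.
template <typename... Ts>
using mixed_type = std::common_type_t<Ts...>;

// Reference softmax over the middle axis of a tensor viewed as
// (outer, M, stride). Elements of one row are `stride` apart in memory.
template <typename IType, typename OType>
std::vector<OType> softmax_reference(const std::vector<IType>& in,
                                     std::size_t outer, std::size_t M,
                                     std::size_t stride) {
  // Accumulate in the widest of the input, output and float types.
  using AType = mixed_type<IType, OType, float>;
  assert(M > 0 && in.size() == outer * M * stride);
  std::vector<OType> out(in.size());
  for (std::size_t o = 0; o < outer; ++o) {
    for (std::size_t s = 0; s < stride; ++s) {
      const std::size_t base = o * M * stride + s;
      // Pass 1: row maximum, subtracted later for numerical stability.
      AType row_max = static_cast<AType>(in[base]);
      for (std::size_t i = 1; i < M; ++i) {
        row_max = std::max(row_max, static_cast<AType>(in[base + i * stride]));
      }
      // Pass 2: sum of exponentials.
      AType sum = 0;
      for (std::size_t i = 0; i < M; ++i) {
        sum += std::exp(static_cast<AType>(in[base + i * stride]) - row_max);
      }
      // Pass 3: normalize and convert to the output type.
      for (std::size_t i = 0; i < M; ++i) {
        const AType e =
            std::exp(static_cast<AType>(in[base + i * stride]) - row_max);
        out[base + i * stride] = static_cast<OType>(e / sum);
      }
    }
  }
  return out;
}

}  // namespace sketch
```

With stride == 1 the inner loops walk contiguous memory. With stride > 1, neighbouring rows (consecutive values of `s`) are adjacent in memory, which is why handling several rows together allows coalesced accesses on a GPU.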

mxnet-bot · 2021-02-16T23:57:28Z

Hey @ptrendx , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, centos-cpu, unix-gpu, windows-gpu, clang, website, centos-gpu, edge, miscellaneous, sanity, windows-cpu]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

MoisesHer

A few minor things.
Apart from that, I think vectorization should be independent of RTC.
Even without RTC, it would be nice to have a vectorization mechanism where each input/output could create a vectorization object if required, with its own parameters, alignment, etc.
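
As a rough illustration of what such a per-tensor vectorization object might carry, the sketch below computes, for one buffer, how many leading elements need scalar handling before the address becomes vector-aligned, how many full vectors follow, and how many trailing elements remain. The `VectorizationPlan` struct and `plan_vectorization` function are hypothetical names invented for this example; they are not part of MXNet. The sketch assumes the pointer is at least aligned to `sizeof(T)`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-tensor description of how a buffer can be vectorized.
struct VectorizationPlan {
  std::size_t head;  // scalar elements before the first aligned vector
  std::size_t body;  // number of full vectors of `nvec` elements
  std::size_t tail;  // scalar elements after the last full vector
};

// Splits `n` elements starting at `ptr` into an unaligned head, an aligned
// vectorized body and a scalar tail, for vectors of `nvec` elements each.
// Assumes `ptr` is aligned to sizeof(T) and nvec > 0.
template <typename T>
VectorizationPlan plan_vectorization(const T* ptr, std::size_t n,
                                     std::size_t nvec) {
  const std::size_t vec_bytes = nvec * sizeof(T);
  const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
  const std::size_t misalign = addr % vec_bytes;
  // Elements to skip until the address is a multiple of vec_bytes.
  std::size_t head = misalign == 0 ? 0 : (vec_bytes - misalign) / sizeof(T);
  if (head > n) head = n;  // buffer ends before the first aligned address
  const std::size_t remaining = n - head;
  return {head, remaining / nvec, remaining % nvec};
}
```

Each input and output could build its own plan, so tensors with different alignments or element types would not need to share one set of parameters.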

src/operator/nn/softmax.cu

ptrendx · 2021-04-22T21:43:53Z

About the vectorization being independent from RTC: generally I agree with you, and the first approach to vectorization actually predates the introduction of RTC. There was a problem, however: it produced quite a lot of kernels, bloating the library size and increasing the GPU memory usage (see PR #17767 and then issue #18280). That is why I reintroduced it as part of the RTC effort, to make sure that only the needed kernels get compiled.

ptrendx · 2021-04-24T04:10:38Z

@mxnet-bot run ci [centos-gpu, unix-cpu]

mxnet-bot · 2021-04-24T04:10:42Z

Jenkins CI successfully triggered : [centos-gpu, unix-cpu]

ptrendx · 2021-04-26T05:01:48Z

@mxnet-bot run ci [unix-cpu]

mxnet-bot · 2021-04-26T05:01:54Z

Jenkins CI successfully triggered : [unix-cpu]

MoisesHer · 2021-04-26T16:16:00Z

thanks @ptrendx , looks good to me

Moving softmax to RTC

c648e6f

ptrendx requested a review from DickJC123 February 16, 2021 23:57

lanking520 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-awaiting-testing label on Feb 16, 2021

Fix from rebase

ba2c3b4

lanking520 added the pr-awaiting-testing and pr-awaiting-review labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Feb 18, 2021

ptrendx requested a review from MoisesHer March 1, 2021 20:16

MoisesHer reviewed Mar 11, 2021


Fixes from review

9a81c17

Merge branch 'upstream' into pr_softmax_rtc_opt

20fd2e0

mseth10 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-awaiting-review and pr-awaiting-testing labels on Apr 23, 2021

Fix

783bb3c

mseth10 added the pr-awaiting-testing label and removed the pr-work-in-progress label on Apr 23, 2021

Fix FPE in the softmax grad.

0009dec

mseth10 added the pr-work-in-progress label and removed the pr-awaiting-testing label on Apr 24, 2021

mseth10 added the pr-awaiting-testing label and removed the pr-work-in-progress label on Apr 24, 2021

mseth10 added the pr-work-in-progress label and removed the pr-awaiting-testing label on Apr 24, 2021

mseth10 added the pr-awaiting-testing and pr-awaiting-review labels and removed the pr-work-in-progress and pr-awaiting-testing labels on Apr 26, 2021

MoisesHer merged commit c692770 into apache:master Apr 26, 2021
