
Performance improvement in ToTensor GPU Kernel #14099

Merged

Conversation

sandeep-krishnamurthy
Contributor

@sandeep-krishnamurthy sandeep-krishnamurthy commented Feb 8, 2019

Description

Earlier, we used the Kernel Launch/Map mechanism to write common CPU and GPU code for the ToTensor operator. However, I observed that this launches far too many threads and blocks, with a significant performance impact.

To overcome this, I wrote a separate CUDA kernel for the GPU path and moved ToTensor out of Kernel Launch/Map.
Benchmarks are below, after a short sketch of the approach.
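For context, here is a minimal, self-contained sketch of the idea: one thread per output element, with the grid sized directly from the element count, converting an HWC uint8 image into a CHW float tensor scaled to [0, 1]. The kernel name and launch configuration are illustrative assumptions, not the exact kernel merged in this PR (the kernels added here cover both 3D and 4D inputs and differ in details).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of a dedicated ToTensor kernel for a single (H, W, C) image:
// reads HWC uint8 input and writes CHW float output scaled to [0, 1].
__global__ void ToTensor3DKernel(const unsigned char* in, float* out,
                                 int h, int w, int c) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int total = h * w * c;
  if (idx >= total) return;
  // Decompose the flat CHW output index.
  int ch  = idx / (h * w);
  int row = (idx / w) % h;
  int col = idx % w;
  // Gather from the HWC input layout and normalize.
  out[idx] = in[(row * w + col) * c + ch] * (1.0f / 255.0f);
}

int main() {
  const int h = 512, w = 512, c = 3;
  const int total = h * w * c;
  unsigned char* d_in;
  float* d_out;
  cudaMalloc(&d_in, total * sizeof(unsigned char));
  cudaMalloc(&d_out, total * sizeof(float));
  cudaMemset(d_in, 128, total);  // dummy image data

  // Grid sized from the element count rather than a generic
  // per-element launch helper.
  const int threads = 256;
  const int blocks = (total + threads - 1) / threads;
  ToTensor3DKernel<<<blocks, threads>>>(d_in, d_out, h, w, c);
  cudaDeviceSynchronize();
  printf("launched %d blocks of %d threads\n", blocks, threads);

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```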

Benchmarks

Ran 1000 ToTensor operations on a (512, 512, 3) input.

GPU
Before

('Average time per ToTensor 512,512,3 - ', 39.17948246002197)

After

('Average time per ToTensor 512,512,3 - ', 0.44632863998413086)

CPU

Before

('Average time per ToTensor 512,512,3 - ', 3.7258052825927734)

After

('Average time per ToTensor 512,512,3 - ', 1.8473007678985596)
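The benchmark script itself is not included in this PR description; the numbers above are presumably per-call averages over the 1000 iterations (units are not stated). As a rough, hypothetical way to measure a per-launch average at the CUDA level, one can bracket repeated launches with CUDA events, as in the sketch below; `DummyKernel` is only a stand-in for the real operator and will not reproduce the numbers above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in workload; a real benchmark would call the ToTensor operator.
__global__ void DummyKernel(float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] *= (1.0f / 255.0f);
}

int main() {
  const int n = 512 * 512 * 3, iters = 1000;
  float* d;
  cudaMalloc(&d, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Time 1000 launches and report the per-launch average.
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i)
    DummyKernel<<<(n + 255) / 256, 256>>>(d, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("Average time per launch: %f ms\n", ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d);
  return 0;
}
```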

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Remove Kernel Launch/Map for ToTensor operator
  • Make an independent kernel for CPU ToTensor
  • Add 2 separate CUDA kernels for the ToTensor operator (3D and 4D inputs); see the sketch below.
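A rough sketch of how the batched 4D (NHWC to NCHW) variant might extend the 3D sketch above: the batch index is carried in the second grid dimension so the per-image indexing stays identical. This is an illustrative assumption, not the exact kernel from this PR.

```cuda
#include <cuda_runtime.h>

// Hypothetical batched variant: blockIdx.y selects the image in the batch,
// blockIdx.x / threadIdx.x select the element within that image.
__global__ void ToTensor4DKernel(const unsigned char* in, float* out,
                                 int n, int h, int w, int c) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int batch = blockIdx.y;
  int per_image = h * w * c;
  if (idx >= per_image || batch >= n) return;
  int ch  = idx / (h * w);
  int row = (idx / w) % h;
  int col = idx % w;
  out[batch * per_image + idx] =
      in[batch * per_image + (row * w + col) * c + ch] * (1.0f / 255.0f);
}

int main() {
  const int n = 4, h = 512, w = 512, c = 3;
  const int per_image = h * w * c;
  unsigned char* d_in;
  float* d_out;
  cudaMalloc(&d_in, (size_t)n * per_image);
  cudaMalloc(&d_out, (size_t)n * per_image * sizeof(float));
  cudaMemset(d_in, 0, (size_t)n * per_image);  // dummy batch data

  const int threads = 256;
  dim3 grid((per_image + threads - 1) / threads, n);  // x: elements, y: batch
  ToTensor4DKernel<<<grid, threads>>>(d_in, d_out, n, h, w, c);
  cudaDeviceSynchronize();

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```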

@zhreshold @stu1130

@vandanavk
Contributor

@mxnet-label-bot add [pr-work-in-progress, Operator, Performance]

@sandeep-krishnamurthy sandeep-krishnamurthy changed the title [WIP] Performance improvement in ToTensor GPU Kernel Performance improvement in ToTensor GPU Kernel Feb 9, 2019
@sandeep-krishnamurthy sandeep-krishnamurthy added pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress labels Feb 9, 2019
@zhreshold
Member

LGTM now, thanks for the efforts!

@sandeep-krishnamurthy sandeep-krishnamurthy merged commit ab5a0cf into apache:master Feb 11, 2019
stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019
* CPU implementation without Kernel launch/map

* Optimal CUDA support for 3D ToTensor operator

* Add CUDA kernel for 4D inputs

* Fix failing CPU tests for totensor

* disable warning on windows

* try fix in instance norm windows build failure

* Guard omp parallel collapse for windows

* Remove warning supression to check if it is ok

* fix lint issues

* Address code review comments
jessr92 pushed a commit to jessr92/incubator-mxnet that referenced this pull request Feb 19, 2019
drivanov pushed a commit to drivanov/incubator-mxnet that referenced this pull request Mar 4, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019