This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Performance improvement in Normalize GPU Kernel #14139

Merged

Conversation

sandeep-krishnamurthy
Contributor

Description

Similar to the performance improvements made to the ToTensor GPU kernel in PR #14099, this PR adds a dedicated CUDA kernel for the GPU path of Normalize, moving it out of the generic Kernel Launch/Map.

Benchmarks below.

Benchmarks
Ran 500 Normalize operations on a (3, 512, 512) sample input.

GPU

Before: ('Average time per Normalize 3,512,512 - ', 38.19581985473633)
After: ('Average time per Normalize 3,512,512 - ', 0.5398507118225098)

CPU

Before: ('Average time per Normalize 3,512,512 - ', 1.8209707736968994)
After: ('Average time per Normalize 3,512,512 - ', 1.2644755840301514)

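The averages above come from timing a loop of repeated operator calls. A minimal sketch of such a harness (the `benchmark` helper and the pure-Python `normalize` stand-in are hypothetical illustrations, not the actual MXNet operator or benchmark script):

```python
import time

def benchmark(fn, data, iterations=500):
    """Time `fn` over `iterations` runs and return the average per-call time."""
    start = time.time()
    for _ in range(iterations):
        fn(data)
    total = time.time() - start
    return total / iterations

# Hypothetical stand-in for the Normalize operator under test:
# shift by a mean and scale by a standard deviation.
def normalize(img):
    mean, std = 0.5, 0.2
    return [[(v - mean) / std for v in row] for row in img]

sample = [[0.1, 0.9], [0.5, 0.5]]
avg = benchmark(normalize, sample, iterations=500)
print('Average time per Normalize - ', avg)
```

Measuring a GPU operator this way would additionally require synchronizing the device before stopping the clock, since kernel launches are asynchronous.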

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Remove Kernel Launch/Map for the Normalize operator
  • Make an independent kernel for CPU Normalize
  • Add two separate CUDA kernels for the Normalize forward operator (3D and 4D inputs)
  • Add two separate CUDA kernels for the Normalize backward operator (3D and 4D inputs)
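For reference, the element-wise computation both the CPU and CUDA kernels implement is a per-channel shift and scale: out[c][h][w] = (in[c][h][w] - mean[c]) / std[c]. A minimal pure-Python sketch of the 3D (C, H, W) case (illustrative only; the actual kernels operate on raw device pointers, and the function name here is hypothetical):

```python
def normalize_3d(data, mean, std):
    """Per-channel normalization of a (C, H, W) input:
    out[c][h][w] = (data[c][h][w] - mean[c]) / std[c]."""
    return [
        [[(v - mean[c]) / std[c] for v in row] for row in channel]
        for c, channel in enumerate(data)
    ]

# 2 channels, 1x2 spatial dims
img = [[[2.0, 4.0]], [[10.0, 20.0]]]
out = normalize_3d(img, mean=[2.0, 10.0], std=[2.0, 10.0])
print(out)  # [[[0.0, 1.0]], [[0.0, 1.0]]]
```

The 4D (N, C, H, W) case is the same computation repeated over the batch dimension, which is why the PR adds a separate kernel for each input rank.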

@stu1130 @zhreshold

@szha szha requested a review from zhreshold February 12, 2019 23:53
@sandeep-krishnamurthy sandeep-krishnamurthy added Operator pr-awaiting-review PR is waiting for code review labels Feb 12, 2019
src/operator/image/image_random.cu
ToTensorCudaKernel<gpu, DType>
-    <<<blocks, dim3(32, 32), 0, stream>>>(input, output,
+    <<<blocks, dim3(H, cuda::kMaxThreadsPerBlock / H), 0, stream>>>(input, output,
Member

please fix ToTensor similarly in a separate PR.

Contributor Author

Sure, that is already work in progress. Should this PR wait till then, or should I revert dim3(H, cuda::kMaxThreadsPerBlock / H) back to dim3(32, 32)? Let me know the next steps for this PR. Thanks!

Member

Back to (32, 32); we can address it later.
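The two launch configurations under discussion arrange threads within a block differently but can yield the same total count. A quick check of the arithmetic (assuming cuda::kMaxThreadsPerBlock is 1024, the usual CUDA per-block limit, and H = 512 as in the benchmark input):

```python
K_MAX_THREADS_PER_BLOCK = 1024  # assumed value of cuda::kMaxThreadsPerBlock
H = 512                         # image height from the benchmark input

# dim3(32, 32): fixed 32 x 32 layout
fixed = 32 * 32

# dim3(H, kMaxThreadsPerBlock / H): 512 x (1024 // 512) = 512 x 2,
# with the block's x-dimension matching the image height directly
adaptive = H * (K_MAX_THREADS_PER_BLOCK // H)

print(fixed, adaptive)  # 1024 1024
```

Both layouts fill a block with 1024 threads here; the adaptive form simplifies row indexing but only launches a full block when H divides the per-block limit, which is part of why it was deferred for ToTensor.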

@zhreshold
Member

Basically LGTM; please make the minor revision, and once CI passes we can merge.

@sandeep-krishnamurthy
Contributor Author

Basically LGTM; please make the minor revision, and once CI passes we can merge.

Done. Will create the ToTensor refactoring PR in a day or two. Thanks again for your time and fast turnaround on all PR reviews.

@sandeep-krishnamurthy sandeep-krishnamurthy merged commit dad33b5 into apache:master Feb 14, 2019
stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019
* New CPU kernel for normalize

* New GPU kernel for Normalize

* Add launch bounds and increase threads to 32*32

* do not hardcode number of threads

* Try fix windows build failure

* make channels as int to fix windows build issues with omp

* Simplify cuda kernels with 1 D thread block

* Minor refactoring

* Revert thread dim for ToTensor operator
jessr92 pushed a commit to jessr92/incubator-mxnet that referenced this pull request Feb 19, 2019
drivanov pushed a commit to drivanov/incubator-mxnet that referenced this pull request Mar 4, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019