Performance improvement in Normalize GPU Kernel #14139
Conversation
src/operator/image/image_random.cu (outdated diff)

  ToTensorCudaKernel<gpu, DType>
-     <<<blocks, dim3(32, 32), 0, stream>>>(input, output,
+     <<<blocks, dim3(H, cuda::kMaxThreadsPerBlock / H), 0, stream>>>(input, output,
please fix ToTensor similarly in a separate PR.
Sure, that is already a work in progress. Should this PR wait until then, or should I revert dim3(H, cuda::kMaxThreadsPerBlock / H) back to dim3(32, 32)? Let me know the next steps for this PR. Thanks!
Back to (32, 32); we can address it later.
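(For context, a minimal standalone sketch contrasting the two launch configurations discussed above. The kernel and all names here are hypothetical, not the PR's code, and the shape-derived variant assumes H does not exceed the device's max threads per block:)

// Hypothetical sketch: fixed 32x32 block vs. a shape-derived block for an
// HxW image kernel. Illustrates the diff above; not the PR's actual code.
#include <cuda_runtime.h>

__global__ void NormalizeSketch(const float* in, float* out,
                                int H, int W, float mean, float std) {
  int h = blockIdx.x * blockDim.x + threadIdx.x;  // row index
  int w = blockIdx.y * blockDim.y + threadIdx.y;  // column index
  if (h < H && w < W) {
    out[h * W + w] = (in[h * W + w] - mean) / std;
  }
}

int main() {
  const int H = 512, W = 512;
  const int kMaxThreadsPerBlock = 1024;  // typical limit; query the device in real code
  float *in, *out;
  cudaMalloc(&in, H * W * sizeof(float));
  cudaMalloc(&out, H * W * sizeof(float));

  // Fixed (32, 32) configuration, kept for ToTensor for now.
  dim3 fixed_block(32, 32);
  dim3 fixed_grid((H + 31) / 32, (W + 31) / 32);
  NormalizeSketch<<<fixed_grid, fixed_block>>>(in, out, H, W, 0.5f, 0.5f);

  // Shape-derived configuration from the diff: one block spans all H rows,
  // so it silently assumes H <= kMaxThreadsPerBlock -- one reason to defer
  // this change to a follow-up PR.
  dim3 adaptive_block(H, kMaxThreadsPerBlock / H);  // (512, 2) here
  dim3 adaptive_grid(1, (W + adaptive_block.y - 1) / adaptive_block.y);
  NormalizeSketch<<<adaptive_grid, adaptive_block>>>(in, out, H, W, 0.5f, 0.5f);

  cudaDeviceSynchronize();
  cudaFree(in);
  cudaFree(out);
  return 0;
}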
Basically LGTM. Please make the minor revision, and once CI passes we can merge.
Force-pushed from 9ca1aec to 26a0532.
Done. Will create the ToTensor refactoring PR in a day or two. Thanks again for your time and the fast turnaround on all PR reviews.
Force-pushed from cf05dd7 to f595dff.
* New CPU kernel for Normalize
* New GPU kernel for Normalize
* Add launch bounds and increase threads to 32*32 (sketched after this list)
* Do not hardcode the number of threads
* Try to fix Windows build failure
* Make channels an int to fix Windows build issues with OMP
* Simplify CUDA kernels with a 1-D thread block
* Minor refactoring
* Revert thread dim for ToTensor operator
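(One commit above adds launch bounds. A minimal sketch of how __launch_bounds__ is typically applied, using hypothetical names rather than the PR's actual kernel:)

// Hypothetical sketch of __launch_bounds__: it promises the compiler the
// kernel will never be launched with more than kMaxThreads threads per
// block, letting it budget registers for better occupancy.
#include <cuda_runtime.h>

constexpr int kMaxThreads = 32 * 32;  // matches the 32x32 block above

__global__ void __launch_bounds__(kMaxThreads)
NormalizeBoundedSketch(const float* in, float* out, int n,
                       float mean, float std) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = (in[i] - mean) / std;
  }
}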
Description
Similar to the perf improvements to the ToTensor GPU kernel in PR #14099, in this PR I wrote a dedicated CUDA kernel for Normalize on GPU and moved it out of the generic Kernel launch/map machinery.
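(As an illustration of the approach, a minimal sketch of a dedicated per-element Normalize kernel over a (C, H, W) tensor. Names are hypothetical; the real kernel lives in src/operator/image/image_random.cu:)

// Hypothetical sketch of a dedicated Normalize kernel: one thread per
// spatial location, looping over channels. Illustrative only; not the
// actual kernel in src/operator/image/image_random.cu.
#include <cuda_runtime.h>

__global__ void NormalizeDedicatedSketch(const float* __restrict__ in,
                                         float* __restrict__ out,
                                         const float* __restrict__ mean,
                                         const float* __restrict__ std,
                                         int C, int HW) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // index into the H*W plane
  if (i < HW) {
    for (int c = 0; c < C; ++c) {
      out[c * HW + i] = (in[c * HW + i] - mean[c]) / std[c];
    }
  }
}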
Benchmarks below.
Benchmarks
Ran 500 Normalize operations on a (3, 512, 512) sample input.
GPU
Before: ('Average time per Normalize 3,512,512 - ', 38.19581985473633)
After: ('Average time per Normalize 3,512,512 - ', 0.5398507118225098)
CPU
Before: ('Average time per Normalize 3,512,512 - ', 1.8209707736968994)
After: ('Average time per Normalize 3,512,512 - ', 1.2644755840301514)
@stu1130 @zhreshold