Rework occupancy, re-enable grid-stride broadcast #367

maleadt · 2021-07-22T12:45:17Z

The occupancy API (launch_heuristic, launch_configuration) needed an update because oneAPI's occupancy API wants to know the total width of the launch, which wasn't passed to launch_configuration. I've taken that opportunity to rename total_threads to elements.

I also came across the disabled support for grid-stride broadcast, where each thread performs several iterations in a loop instead of just launching more blocks. I didn't notice a performance improvement when I initially added this, but more careful benchmarking shows that it resolves the performance issue of launching as many threads as the CUDA occupancy API suggests. My hypothesis is that launching many large blocks is expensive, which is why we capped the block size at 256; with grid-stride loops it is instead possible to keep the blocks large but launch fewer of them.
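For readers unfamiliar with the pattern, here is a minimal host-side sketch of a grid-stride loop, written in plain Python rather than as a GPU kernel. The names `grid_stride_indices` and `simulate_broadcast` are hypothetical and are not part of GPUArrays.jl; the sketch only illustrates how a fixed number of threads can cover an arbitrary number of elements.

```python
def grid_stride_indices(thread_id, total_threads, elements):
    """Return the 0-based element indices one thread visits in a grid-stride loop.

    Each thread starts at its own global index and advances by the total
    number of launched threads until it runs past the end of the data.
    """
    return range(thread_id, elements, total_threads)


def simulate_broadcast(f, xs, blocks, threads_per_block):
    """Apply f to every element of xs, mimicking a grid-stride kernel.

    On a GPU the outer loop would run in parallel; here it runs sequentially.
    """
    total_threads = blocks * threads_per_block
    out = [None] * len(xs)
    for thread_id in range(total_threads):
        for i in grid_stride_indices(thread_id, total_threads, len(xs)):
            out[i] = f(xs[i])
    return out
```

With 4 threads and 10 elements, thread 1 would handle indices 1, 5 and 9, so every element is still processed exactly once even though fewer threads than elements were launched.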

Benchmark results for broadcasting sin over a 1024x1024 Float32 array illustrate this:

- 4096 blocks of 256 threads (the current situation): 25 us
- 1024 blocks of 1024 threads (what the occupancy API suggests): 30 us
- 48 blocks (as suggested by the occupancy API) of 256 threads (i.e. the current block size), each thread processing 86 elements: 25 us. This is no improvement, which is why I didn't enable this functionality originally.
- 48 blocks of 1024 threads (both as suggested by the occupancy API), each thread processing 22 elements: 25 us
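The per-thread element counts in these configurations follow from a ceiling division of the element count by the number of launched threads. A small sketch of that arithmetic, where `elements_per_thread` is a hypothetical helper and not a GPUArrays.jl function:

```python
def elements_per_thread(elements, blocks, threads_per_block):
    """Ceiling division: iterations each thread needs to cover all elements."""
    total_threads = blocks * threads_per_block
    return -(-elements // total_threads)


elements = 1024 * 1024  # the 1024x1024 Float32 array from the benchmark

# (blocks, threads per block) for the four benchmarked configurations
configs = [(4096, 256), (1024, 1024), (48, 256), (48, 1024)]
for blocks, threads in configs:
    print(blocks, threads, elements_per_thread(elements, blocks, threads))
```

The first two configurations launch one thread per element, while the 48-block configurations need 86 and 22 iterations per thread respectively, matching the figures quoted in the list.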

Even though the combination of a grid-stride loop with larger blocks doesn't yield a performance improvement, it does stay closer to the NVIDIA-recommended configuration, so I'm inclined to use that configuration instead.


maleadt added 2 commits July 22, 2021 14:21

Rework launch heuristic API.

f742727

- enable grid-stride feature
- always pass number of elements
- rename total_threads to elements

Use a flooring division to ensure we launch the required amount of threads.

f299182

maleadt requested a review from vchuravy July 22, 2021 12:45

vchuravy approved these changes Jul 22, 2021


maleadt merged commit bb9ca6d into master Jul 22, 2021

bors bot deleted the tb/occupancy branch July 22, 2021 13:42

maleadt mentioned this pull request Jul 22, 2021

Adapt to GPUArrays changes. JuliaGPU/CUDA.jl#1061

Merged

maleadt mentioned this pull request Aug 12, 2021

Try to use the heuristic's block configuration when using grid-stride kernels. #372

Merged

vchuravy mentioned this pull request Oct 26, 2021

Array + Diagonal failure JuliaGPU/AMDGPU.jl#165

Closed

awadell1 added a commit to awadell1/GPUArrays.jl that referenced this pull request Mar 18, 2022

Swap total_threads with elements

c5a16b2

gpu_call signature changed by JuliaGPU#367 to rename "total_threads" to "elements"

maleadt pushed a commit to awadell1/GPUArrays.jl that referenced this pull request May 17, 2022

Swap total_threads with elements

92141a7

gpu_call signature changed by JuliaGPU#367 to rename "total_threads" to "elements"

awadell1 added a commit to awadell1/GPUArrays.jl that referenced this pull request Jul 4, 2022

Swap total_threads with elements

170e91a

gpu_call signature changed by JuliaGPU#367 to rename "total_threads" to "elements"
