Rework occupancy, re-enable grid-stride broadcast #367

Merged
merged 2 commits into from
Jul 22, 2021

maleadt
Member

@maleadt maleadt commented Jul 22, 2021

The occupancy API (launch_heuristic, launch_configuration) needed an update because oneAPI's occupancy API wants to know the total width of the launch, which wasn't passed to launch_configuration. I've taken that opportunity to rename total_threads to elements.

I also came across the disabled support for grid-stride broadcast, where each thread performs a couple of iterations in a loop instead of us just launching more blocks. I didn't notice performance improvements when I initially added this, but more careful benchmarking shows that it resolves the performance issue of launching as many threads as the CUDA occupancy API suggests. My hypothesis is that it's expensive to launch many large blocks, which is why we capped the block size at 256; with grid-stride loops it's instead possible to keep the blocks large but launch fewer of them.
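
For readers unfamiliar with the pattern, here is a minimal CPU-side sketch of a grid-stride loop in plain Python. It is not the GPUArrays.jl kernel: the function name `grid_stride_broadcast` and its parameters are hypothetical, and the nested loops only simulate what GPU threads would do in parallel. Each simulated thread starts at its global index and advances by the total number of threads in the grid until it runs past the end of the array.

```python
import math

def grid_stride_broadcast(f, src, blocks, threads):
    """Apply f to every element of src using a simulated grid-stride loop.

    Hypothetical helper for illustration only; on a GPU the two outer
    loops would be the hardware's blocks and threads running in parallel.
    """
    n = len(src)
    dest = [None] * n
    stride = blocks * threads  # total number of simulated threads in the grid
    for block in range(blocks):
        for thread in range(threads):
            i = block * threads + thread  # first element handled by this thread
            while i < n:                  # grid-stride loop
                dest[i] = f(src[i])
                i += stride               # jump ahead by the whole grid
    return dest

# Far fewer simulated threads (3 * 64 = 192) than elements (10,000).
data = [0.001 * k for k in range(10_000)]
result = grid_stride_broadcast(math.sin, data, blocks=3, threads=64)
```

Because the loop bound is the number of elements rather than the number of threads, the launch size can be chosen independently of the array size, which is what makes it possible to keep blocks large while launching fewer of them.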

Benchmark results for broadcasting sin over a 1024x1024 Float32 array illustrate this:

  • 4096 blocks of 256 threads (the current situation): 25 us
  • 1024 blocks of 1024 threads (what the occupancy API suggests): 30 us
  • 48 blocks (as suggested by the occupancy API) of 256 threads (i.e. a grid-stride loop on top of the current situation), each thread doing 86 elements: 25 us, so no performance improvement, which is why I didn't enable this functionality originally
  • 48 blocks of 1024 threads (as suggested by the occupancy API), each thread doing 22 elements: 25 us
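
The per-thread element counts in the list above follow from dividing the array size by the number of launched threads and rounding up. A small Python check, where the helper name `elements_per_thread` is an assumption made for illustration and not part of any API:

```python
ELEMENTS = 1024 * 1024  # 1024x1024 Float32 array

def elements_per_thread(blocks, threads, elements=ELEMENTS):
    """Iterations each thread needs so that all elements are covered."""
    total_threads = blocks * threads
    return -(-elements // total_threads)  # integer ceiling division

configs = [(4096, 256), (1024, 1024), (48, 256), (48, 1024)]
per_thread = {cfg: elements_per_thread(*cfg) for cfg in configs}

for (blocks, threads), count in per_thread.items():
    print(f"{blocks} blocks x {threads} threads: {count} element(s) per thread")
# 4096 blocks x 256 threads: 1 element(s) per thread
# 1024 blocks x 1024 threads: 1 element(s) per thread
# 48 blocks x 256 threads: 86 element(s) per thread
# 48 blocks x 1024 threads: 22 element(s) per thread
```

The first two configurations launch exactly one thread per element, while the two 48-block configurations rely on the grid-stride loop to cover the remainder.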

Even though the combination of a grid-stride loop with larger blocks doesn't yield a performance improvement, it does stay closer to the NVIDIA-recommended configuration, so I'm inclined to use that configuration instead.

maleadt added 2 commits July 22, 2021 14:21
- enable grid-stride feature
- always pass number of elements
- rename total_threads to elements
@maleadt maleadt requested a review from vchuravy July 22, 2021 12:45
@maleadt maleadt merged commit bb9ca6d into master Jul 22, 2021
@bors bors bot deleted the tb/occupancy branch July 22, 2021 13:42
awadell1 added a commit to awadell1/GPUArrays.jl that referenced this pull request Mar 18, 2022
gpu_call signature changed by JuliaGPU#367 to rename "total_threads" to "elements"
maleadt pushed a commit to awadell1/GPUArrays.jl that referenced this pull request May 17, 2022
gpu_call signature changed by JuliaGPU#367 to rename "total_threads" to "elements"
awadell1 added a commit to awadell1/GPUArrays.jl that referenced this pull request Jul 4, 2022
gpu_call signature changed by JuliaGPU#367 to rename "total_threads" to "elements"