Rework occupancy, re-enable grid-stride broadcast #367
The occupancy API (`launch_heuristic`, `launch_configuration`) needed an update because oneAPI's occupancy API wants to know the total width of the launch, which wasn't passed to `launch_configuration`. I've taken that opportunity to rename `total_threads` to `elements`.

I also came across the disabled support for grid-stride broadcast, where each thread performs a couple of iterations in a loop instead of just launching more blocks. I didn't notice performance improvements when initially adding this, but more careful benchmarking shows that it resolves the performance issue of launching as many threads as the CUDA occupancy API suggests. My hypothesis is that it's expensive to launch many large blocks, which is why we capped the block size to 256; with grid-stride loops it's instead possible to keep the blocks large but launch fewer of them.
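For readers unfamiliar with the pattern, here is a minimal CPU-side sketch of grid-stride indexing in Python. It is not the kernel from this PR: the function names `grid_stride_indices` and `simulate_broadcast` are hypothetical, indices are 0-based, and the "threads" are simulated sequentially. It only illustrates that a fixed number of threads can cover every element when each one strides by the total thread count.

```python
import math


def grid_stride_indices(block_idx, thread_idx, threads_per_block, num_blocks, elements):
    """Indices one simulated thread visits in a grid-stride loop (0-based)."""
    start = block_idx * threads_per_block + thread_idx  # global thread id
    stride = threads_per_block * num_blocks             # total threads in the launch
    return list(range(start, elements, stride))


def simulate_broadcast(f, xs, threads_per_block, num_blocks):
    """Apply f elementwise, visiting elements the way a grid-stride kernel would."""
    out = [None] * len(xs)
    for b in range(num_blocks):
        for t in range(threads_per_block):
            for i in grid_stride_indices(b, t, threads_per_block, num_blocks, len(xs)):
                out[i] = f(xs[i])
    return out


# 10 elements handled by only 2 blocks of 2 threads (4 threads in total).
xs = [0.1 * i for i in range(10)]
result = simulate_broadcast(math.sin, xs, threads_per_block=2, num_blocks=2)
```

With 4 simulated threads and 10 elements, each thread performs two or three iterations, and every element is written exactly once.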
Here is the above in benchmark results of broadcasting `sin` over a 1024x1024 Float32 array:

Even though the combination of a grid-stride loop with larger blocks doesn't yield a performance improvement, it does stay closer to the NVIDIA-recommended configuration, so I'm inclined to use that configuration instead.
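To make the two launch shapes concrete, the arithmetic can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: `launch_shape` is a hypothetical helper, the block size of 1024 and the cap of 64 blocks are made-up stand-ins for whatever an occupancy query would suggest on a given device, and only the 256-thread cap and the 1024x1024 problem size come from the description above.

```python
def launch_shape(elements, threads_per_block, max_blocks):
    """Return (blocks, iterations_per_thread) for a grid-stride launch.

    max_blocks is a hypothetical cap standing in for an occupancy-based suggestion.
    """
    blocks_needed = -(-elements // threads_per_block)       # ceiling division
    blocks = min(blocks_needed, max_blocks)
    iterations = -(-elements // (blocks * threads_per_block))
    return blocks, iterations


elements = 1024 * 1024

# Blocks capped at 256 threads, one element per thread (no grid-stride loop).
capped = launch_shape(elements, 256, max_blocks=elements)

# Larger blocks with a grid-stride loop; 1024 threads and 64 blocks are assumed values.
strided = launch_shape(elements, 1024, max_blocks=64)

print(capped)   # (4096, 1)
print(strided)  # (64, 16)
```

Under these assumed numbers, the capped configuration launches 4096 small blocks, while the grid-stride configuration launches 64 large blocks whose threads each loop 16 times.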