Use select for masked single element load #1770
Closed
As part of my investigation into #1721, I found an opportunity to remove branches from certain kernels by replacing a predicated load with a load+select. The NVIDIA path does something similar; the difference is that it can emit a masked load instruction. On the sample kernel from #1721, this patch removes quite a bit of branching and reduces runtime from 0.090ms to 0.067ms.
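To make the load+select idea concrete, here is a minimal Python model of the two lowering strategies. It is a sketch of the semantics only, not the code from this patch: the function names `predicated_load` and `load_then_select`, the list standing in for memory, and the `other` fallback value are all hypothetical, and the sketch assumes the address is safe to read even when the mask is false.

```python
# Illustrative model only; every name here is hypothetical, not from the patch.


def predicated_load(memory, index, mask, other):
    """Branchy form: the load runs only when the mask is set."""
    if mask:
        return memory[index]
    return other


def load_then_select(memory, index, mask, other):
    """Branch-free form: load unconditionally, then pick the result.

    Assumption: memory[index] is safe to read even when mask is False.
    """
    loaded = memory[index]  # always executed, no control flow around the load
    return loaded if mask else other  # stands in for a select instruction


# Both forms should agree for every index and mask value.
memory = [10, 20, 30, 40]
for idx in range(len(memory)):
    for mask in (True, False):
        expected = predicated_load(memory, idx, mask, -1)
        assert load_then_select(memory, idx, mask, -1) == expected
```

The two functions return the same value for every input; the difference is only that the second form has no branch guarding the load, which is the property the description above relies on.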
I intend to do more testing and to look for other optimizations in similar kernels, so I am marking this as a draft for now, but I am putting it up so it can run through CI (my local environment is not set up for testing against IPEX).