@gpu.itersPerThread-cyclic bug fix #26218
Merged
This PR fixes a bug where, in some cases, the same loop iteration was executed by multiple GPU threads when the attribute `@gpu.itersPerThread` was used with its `cyclic` argument set to `true`. The test itersPerThread.chpl is now beefed up to detect this buggy behavior, should it occur again.

Semantics
This PR upholds the original intention of cyclic `itersPerThread`: the loop iterations are mapped onto the smallest number of GPU threads such that each thread executes at most `itersPerThread` loop iterations. In a discussion within the group, we leaned against mapping the loop iterations onto ALL the threads that the GPU will fire up for the corresponding kernel, if this number is different.

For example, consider a loop with 12 iterations and `itersPerThread=4`. The iterations are mapped cyclically over 12/4=3 threads, so each thread executes 4 iterations.

The mapping could be different if the loop is also annotated with `@gpu.blockSize(2)`. In this case, the GPU will execute ceil(3/2)=2 blocks and therefore 2*2=4 threads, so we could map the iterations in a cyclic manner to all 4 threads, with each thread executing 3 iterations. We chose against this option because it can change the number of threads that the user expects, which is undesirable in GPU programming. So in this example only 3 threads will execute loop iterations, regardless of how many threads the GPU will fire.
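To make the two candidate mappings concrete, here is a small Python simulation of the example above. It is only a sketch, not the Chapel implementation: it assumes 0-based iteration and thread indices and that "cyclic" means iteration `i` goes to thread `i % numThreads`. The names `cyclic_mapping` and `ceil_div` are hypothetical helpers introduced for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def cyclic_mapping(num_iters, num_threads):
    """Assign iteration i to thread i % num_threads (assumed round-robin)."""
    mapping = {t: [] for t in range(num_threads)}
    for i in range(num_iters):
        mapping[i % num_threads].append(i)
    return mapping


num_iters = 12
iters_per_thread = 4
block_size = 2

# Chosen semantics: the smallest number of threads such that each thread
# executes at most iters_per_thread iterations.
needed_threads = ceil_div(num_iters, iters_per_thread)   # 12/4 = 3
chosen = cyclic_mapping(num_iters, needed_threads)

# Rejected alternative: spread the iterations over every thread the GPU
# launches once the block size rounds the thread count up.
num_blocks = ceil_div(needed_threads, block_size)        # ceil(3/2) = 2
launched_threads = num_blocks * block_size               # 2*2 = 4
rejected = cyclic_mapping(num_iters, launched_threads)

for name, mapping in (("chosen", chosen), ("rejected", rejected)):
    print(f"{name} ({len(mapping)} threads)")
    for thread, iters in mapping.items():
        print(f"  thread {thread}: {iters}")
```

Under these assumptions, the chosen mapping gives thread 0 iterations 0, 3, 6, 9 and leaves the fourth launched thread idle, while the rejected alternative gives thread 0 iterations 0, 4, 8 and puts all four threads to work.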
Testing: paratest, gpu=amd, nvidia.