
Improve block and thread calculations and invoke only if in range #76

Merged 1 commit into JuliaORNL:main on Apr 18, 2024

Conversation

PhilipFackler (Collaborator):
This only applies to the 1d parallel_for in CUDA. If this is acceptable to you all, a similar fix should be made wherever else it's applicable.

PhilipFackler (Collaborator, Author) commented Apr 18, 2024:

See #57. This is a problem in 1d as well. It's not actually "wrong"; it's just that if your N is greater than the threads per block and not divisible by it, you pass indices to the kernel that are out of bounds. Adding the conditional doesn't seem to affect performance, which makes sense, since all threads except those in the last block take the true branch.
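The arithmetic behind that comment can be sketched quickly. This is Python purely for illustration (the PR itself is Julia/CUDA.jl), and N and the 256-thread limit are made-up example values, not numbers from the PR:

```python
import math

# Hypothetical sizes: N work items, a device limit of 256 threads per
# block (in the real code this comes from CUDA.maxthreads).
N = 1000
threads = 256
blocks = math.ceil(N / threads)        # rounds up: 4 blocks

launched = blocks * threads            # 1024 global indices are generated
out_of_range = launched - N            # 24 of them satisfy i >= N

print(blocks, launched, out_of_range)  # 4 1024 24

# The fix under discussion: each thread checks its index before using it,
# so only the N in-bounds indices ever touch memory.
processed = [i for i in range(launched) if i < N]
assert len(processed) == N
```

Since 744 of the 1024 launched threads land in full blocks and another 232 in the last block still satisfy the guard, only the final 24 take the false branch, which is why the conditional is essentially free.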

williamfgc (Collaborator):
Thanks @PhilipFackler. We will eventually revisit this when doing performance testing with the apps we are targeting.

williamfgc (Collaborator):
Test this please

maxPossibleThreads = CUDA.maxthreads(parallel_kernel)  # per-kernel thread limit reported by CUDA.jl
threads = min(N, maxPossibleThreads)                    # never launch more threads per block than work items
blocks = ceil(Int, N / threads)                         # round up so all N indices are covered
parallel_kernel(parallel_kargs...; threads=threads, blocks=blocks)
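A CPU-side sketch of how this launch configuration interacts with the in-range guard, again in Python purely for illustration (launch_1d and saxpy_kernel are hypothetical stand-ins, assuming a flat global index i = block * threads + t as in a 1d CUDA launch):

```python
import math

def launch_1d(kernel, N, max_threads, *args):
    """Mimic the launch config above: cap threads at N, round blocks up."""
    threads = min(N, max_threads)
    blocks = math.ceil(N / threads)
    for block in range(blocks):          # sequential stand-in for the GPU grid
        for t in range(threads):
            i = block * threads + t      # 0-based global index
            kernel(i, N, *args)

def saxpy_kernel(i, N, a, x, y):
    # The guard from the PR: skip indices past the end of the arrays.
    if i < N:
        y[i] += a * x[i]

N = 1000
x = [1.0] * N
y = [0.0] * N
launch_1d(saxpy_kernel, N, 256, 2.0, x, y)
assert all(v == 2.0 for v in y)          # every element updated exactly once
```

Without the `if i < N` guard, the last block's trailing threads would index past the end of x and y, which is exactly the out-of-bounds access the PR fixes.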
Collaborator:

Is this guaranteed to be synchronized?

PhilipFackler (Collaborator, Author):

I don't know :)

williamfgc (Collaborator):

Also, are we missing the same for AMDGPU.jl?

@williamfgc merged commit 6a07365 into JuliaORNL:main on Apr 18, 2024. 6 checks passed.
PhilipFackler (Collaborator, Author):

> Also, are we missing the same for AMDGPU.jl?

Yes, but I don't have a way to check that out locally.
