Improve block and thread calculations and invoke only if in range #76
Conversation
See #57. This is a problem in 1d also. It's actually not "wrong". It's just that if your …
Thanks @PhilipFackler. Eventually we will revisit this when doing performance testing with the apps we are targeting.
Test this please
```julia
maxPossibleThreads = CUDA.maxthreads(parallel_kernel)
threads = min(N, maxPossibleThreads)
blocks = ceil(Int, N / threads)
parallel_kernel(parallel_kargs...; threads=threads, blocks=blocks)
```
Is this guaranteed to be synchronized?
I don't know :)
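For context (a sketch, not part of this PR): CUDA.jl kernel launches are asynchronous with respect to the host, so if callers of `parallel_for` expect blocking semantics, an explicit `CUDA.synchronize()` (or `CUDA.@sync`) would be needed after the launch. The wrapper name below is hypothetical:

```julia
using CUDA

# Hypothetical blocking wrapper around the launch shown above.
function parallel_for_sync(parallel_kernel, parallel_kargs, N)
    threads = min(N, CUDA.maxthreads(parallel_kernel))
    blocks = cld(N, threads)
    parallel_kernel(parallel_kargs...; threads=threads, blocks=blocks)
    CUDA.synchronize()  # block the host until the kernel has completed
end
```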
Also, are we missing the same for AMDGPU.jl?
Yes, but I don't have a way to check that out locally.
This only applies to 1d parallel_for in CUDA. If this is acceptable to you all, a similar thing should be done wherever else it's applicable.
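To illustrate why the "invoke only if in range" part matters (a sketch with a hypothetical saxpy kernel, not code from this PR): since `blocks = cld(N, threads)` rounds up, the last block can contain threads whose global index exceeds `N`, so each thread must guard its index before touching memory.

```julia
using CUDA

# Hypothetical 1d kernel showing the in-range guard.
function saxpy_kernel!(y, x, a, N)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= N  # skip the excess threads in the (rounded-up) last block
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

function parallel_saxpy!(y, x, a)
    N = length(y)
    kernel = @cuda launch=false saxpy_kernel!(y, x, a, N)
    threads = min(N, CUDA.maxthreads(kernel))
    blocks = cld(N, threads)
    kernel(y, x, a, N; threads=threads, blocks=blocks)
end
```

Without the `i <= N` check, the extra threads in the final block would read and write past the end of the arrays.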