Unify and simplify batch functionality: Multivector #1651
In case you haven't noticed already, there is a naming inconsistency between the single-batch matrix-apply kernels of the reference backend and those of the cuda/hip backends.
LGTM, I'm wondering if we can get rid of some of the GKO_DEVICE_NAMESPACE:: annotations though - is there an ambiguity issue when removing them?
@MarcelKoch, yes, I plan to unify those when refactoring the matrix format kernels.
@@ -20,8 +58,7 @@ __device__ __forceinline__ void scale(

 template <typename ValueType, typename Mapping>
-__global__
-__launch_bounds__(default_block_size, sm_oversubscription) void scale_kernel(
+__global__ __launch_bounds__(default_block_size) void scale_kernel(
You have used 4 for sm_oversubscription on both CUDA and HIP.
I assume the CUDA value is the correct one and HIP just reuses it.
If you want a more accurate mapping, HIP should use (min_blocks_multiprocessor (4) * max_threads_per_block (256)) / 32 = 32.
You will need a macro to distinguish the two.
I haven't fully benchmarked that yet. I agree that this will have to be a macro specialized for CUDA and HIP, but that will be done in a future PR. It has already been noted in #1376.
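A minimal sketch of what such a macro could look like, assuming the conversion quoted in the review comment above. All names here (`hip_min_warps`, `example_launch_bounds_second_arg`, `scale_kernel_sketch`) are hypothetical and not part of this PR, and the values 4, 256 and 32 are taken from that comment rather than from any benchmark:

```cpp
// Values quoted in the review comment; not benchmarked.
constexpr int default_block_size = 256;  // max threads per block
constexpr int sm_oversubscription = 4;   // min blocks per multiprocessor (CUDA meaning)

// Conversion from the review comment:
// (min_blocks_multiprocessor * max_threads_per_block) / 32
constexpr int hip_min_warps(int min_blocks, int max_threads)
{
    return (min_blocks * max_threads) / 32;
}

// Select the second __launch_bounds__ argument per platform.
#if defined(__HIP_PLATFORM_AMD__)
constexpr int example_launch_bounds_second_arg =
    hip_min_warps(sm_oversubscription, default_block_size);
#else
constexpr int example_launch_bounds_second_arg = sm_oversubscription;
#endif

static_assert(hip_min_warps(4, 256) == 32,
              "matches the value computed in the review comment");

// The kernel is only visible to a CUDA or HIP device compiler.
#if defined(__CUDACC__) || defined(__HIPCC__)
template <typename ValueType>
__global__ __launch_bounds__(default_block_size,
                             example_launch_bounds_second_arg) void
scale_kernel_sketch(ValueType alpha, ValueType* x, int n)
{
    // One thread scales one entry.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= alpha;
    }
}
#endif
```

Whether the second argument is worth keeping at all on either platform is the open benchmarking question; the sketch only shows how the two values could be kept apart in a single source file.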
format!
This PR is the first part of the batched kernel refactoring. It removes the .inc-style files and moves their contents to .hpp headers, making the functions available in a common namespace. It also uses a single source for both HIP and CUDA.
TODO
The `__launch_bounds__` difference between CUDA and HIP needs to be resolved.