Updated hip kernels #139
Thanks for the PR. In general it looks OK, but I would like more details on the benefit of manually strip-mining the loop in the nstream kernel. I don't think we do this for any other model - why do we need to do it for HIP / AMD GPUs?
Thanks for your comment!
It's not strictly necessary. It was mostly the result of experimenting with ways to generate vector load instructions from device memory, in an attempt to boost the kernel's bandwidth. It gave a consistent, albeit minor, increase in bandwidth for n-stream. It was mainly useful during experimentation, and I was hesitant to modify the kernel any further, as I recall from our last conversation that we want to avoid significantly "ninja'd" code in the main repo. The change was more impactful in older versions of ROCm; since 5.2 we're not seeing a huge benefit from it. This is probably for the best, since our compilers are now generating better instructions from simpler code. There's still some room for improvement, and this kernel in particular is one I'm looking at including in our HBM stressors/benchmarks. I'm happy to leave it mostly the way it is. Will update this PR accordingly.
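For anyone following along who hasn't seen the term, here is a minimal host-side C++ sketch of what strip-mining a loop means. It is not the HIP kernel from this PR: the update `a[i] += b[i] + scalar * c[i]`, the function names, and the strip width `kStrip = 4` are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical strip width for illustration; not taken from the PR.
constexpr std::size_t kStrip = 4;

// Plain loop: one element per iteration.
void nstream_simple(std::vector<double>& a, const std::vector<double>& b,
                    const std::vector<double>& c, double scalar) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] += b[i] + scalar * c[i];
}

// Strip-mined loop: the iteration space is split into fixed-size strips.
// The inner loop has a compile-time trip count, which gives a compiler
// the chance to merge the contiguous accesses into wider loads/stores.
void nstream_strip_mined(std::vector<double>& a, const std::vector<double>& b,
                         const std::vector<double>& c, double scalar) {
  const std::size_t n = a.size();
  const std::size_t full = n - n % kStrip;  // elements covered by whole strips

  for (std::size_t base = 0; base < full; base += kStrip) {
    for (std::size_t j = 0; j < kStrip; ++j)
      a[base + j] += b[base + j] + scalar * c[base + j];
  }

  // Remainder loop for sizes that are not a multiple of kStrip.
  for (std::size_t i = full; i < n; ++i)
    a[i] += b[i] + scalar * c[i];
}
```

Both functions compute the same result; the only difference is the loop structure handed to the compiler, which is consistent with the observation above that the gain shrinks once the compiler does this well on its own.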
Co-authored-by: Nick Curtis <arghdos@users.noreply.github.com>
45d63ea to 85d8091
Thanks for contributing this, happy to merge this now. Thanks for the effort!
PR Summary:
hipHostMalloc, which allocates in a device-visible page. Memory transfer occurs asynchronously and, as a result, requires a hipDeviceSynchronize after the kernel is called.
Happy to discuss how much of this you want upstream. For reference (before/after) numbers, here are some quick results in double precision:
Reference (baseline - develop): Ran on an MI-210 with an array size of 2^28 elements, using the following arguments:
-s $((2**28))
Updated dot kernel: