Updated hip kernels #139

thomasgibson · 2022-08-15T13:29:22Z

PR Summary:

The number of blocks launched for the dot kernel has been modified.
The partial sums for the dot-kernel is allocated using hipHostMalloc, which allocates in a device-visible page. Memory transfer occurs asynchronously and, as a result, requires a hipDeviceSynchronize after the kernel is called.
A number of other minor kernel updates using recommended syntax.

Happy to discuss how much of this you want upstream. For reference (before/after) numbers, here are some quick results in double precision:

Reference (baseline - develop): Ran on an MI-210 with the following arguments with an array size of 2^28 elements: -s $((2**28))

Function    MBytes/sec  Min (sec)   Max         Average     
Copy        1406620.269 0.00305     0.00322     0.00308     
Mul         1404059.445 0.00306     0.00316     0.00312     
Add         1279936.374 0.00503     0.00522     0.00515     
Triad       1272501.462 0.00506     0.00524     0.00516     
Dot         745985.464  0.00576     0.00687     0.00626

Updated dot kernel:

Function    MBytes/sec  Min (sec)   Max         Average     
Copy        1406789.357 0.00305     0.00310     0.00307     
Mul         1407020.248 0.00305     0.00622     0.00315     
Add         1279350.509 0.00504     0.00522     0.00515     
Triad       1286205.887 0.00501     0.00525     0.00517     
Dot         1297750.809 0.00331     0.00344     0.00338

tomdeakin

Thanks for the PR. In general it looks OK, but I would like more details on the benefit for manually strip-mining the loop in the nstream kernel. I don't think we do this for any other model - why do we need to do it for HIP / AMD GPUs?

thomasgibson · 2022-09-26T14:04:43Z

Thanks for your comment!

Thanks for the PR. In general it looks OK, but I would like more details on the benefit for manually strip-mining the loop in the nstream kernel. I don't think we do this for any other model - why do we need to do it for HIP / AMD GPUs?

It's not strictly necessary, this was mostly a result of experimentation to generated vector load instructions from device memory in an attempt to boost the bandwidth performance of the kernel. This had a consistent, albeit minor, increase in bandwidth performance for n-stream. This was mainly useful during experimentation, but was hesitant to modify it any further as I recall from our last conversation that we want to avoid significantly "ninja'd" code in the main repo.

This was more impactful in older versions of ROCm, but since 5.2 we're not seeing a huge benefit from doing this. This is probably for the best, since our compilers are generating better instructions from simpler code. There's still some room for improvement and this kernel in particular is one I'm looking at including in our HBM stressors/benchmarks. I'm happy to leave it mostly the way it is. Will update this PR accordingly.

Co-authored-by: Nick Curtis <arghdos@users.noreply.github.com>

src/hip/HIPStream.cpp

tomdeakin

Thanks for contributing this, happy to merge this now. Thanks for the effort!

tomdeakin reviewed Sep 20, 2022

View reviewed changes

thomasgibson and others added 6 commits October 10, 2022 21:32

Add tuned benchmark kernels

a075455

Co-authored-by: Nick Curtis <arghdos@users.noreply.github.com>

Clean up kernels and drop unneeded modifications

bcf8708

Use triple-chevron syntax for hip kernel launching

f98aedf

Add clarifying comment and further clean-up

de93c06

Roll back modifications for copy, mul, add, and triad

f44cd6f

Simplify/roll back unneeded modifications

85d8091

thomasgibson force-pushed the updated-hip-kernels branch from 45d63ea to 85d8091 Compare October 11, 2022 02:42

thomasgibson requested a review from tomdeakin October 11, 2022 02:47

tomdeakin requested changes Jan 30, 2023

View reviewed changes

src/hip/HIPStream.cpp Outdated Show resolved Hide resolved

src/hip/HIPStream.cpp Show resolved Hide resolved

Round up dot_num_blocks and remove extra check

696ff6a

tomdeakin approved these changes Mar 23, 2023

View reviewed changes

tomdeakin merged commit ebb1176 into UoB-HPC:develop May 2, 2023

tomdeakin mentioned this pull request Jun 12, 2023

Additional SYCL USM (device pointer explicit copy) and CUDA tuning for DOT #122

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated hip kernels #139

Updated hip kernels #139

thomasgibson commented Aug 15, 2022 •

edited

Loading

tomdeakin left a comment

thomasgibson commented Sep 26, 2022

tomdeakin left a comment

Updated hip kernels #139

Updated hip kernels #139

Conversation

thomasgibson commented Aug 15, 2022 • edited Loading

tomdeakin left a comment

Choose a reason for hiding this comment

thomasgibson commented Sep 26, 2022

tomdeakin left a comment

Choose a reason for hiding this comment

thomasgibson commented Aug 15, 2022 •

edited

Loading