AOMP Release 17.0-1
These are the release notes for AOMP 17.0-1. AOMP uses AMD developer modifications to the upstream LLVM development trunk. These differences are managed in a branch called "amd-stg-open", which lives in a mirror of upstream LLVM at https://github.com/RadeonOpenCompute/llvm-project. The amd-stg-open branch changes constantly as AMD merges the upstream development trunk with its internal open development efforts. The AMD modifications are experimental and/or contributions under review for the upstream trunk. AOMP uses a snapshot of amd-stg-open at the commit IDs and dates listed below. AOMP also includes builds of related ROCm components. We call AOMP a "standalone" build because it does not use or require ROCm, with the exception of the kernel module (dkms) and libdrm, which are often part of the Linux distribution. AOMP is isolated from any ROCm installation by installing into /usr/lib/aomp and by using RPATH for its runtime libraries.
For AOMP 17.0-1, the last trunk commit is 3712dd73a1d50b76624ee6a520be2b1ca94c02ee on April 11th, 2023. The last amd-only commit is 1d8def5772d16c64652d68daac1b12af99fe3770 on April 12th, 2023. These commits form a frozen branch now called "aomp-17.0-1". See https://github.com/RadeonOpenCompute/llvm-project/tree/aomp-17.0-1.
The integrated ROCm components for this AOMP release were built with ROCm 5.4.4 sources.
This is the 2nd AOMP release based on LLVM 17 development.
The changes from 17.0-0 to 17.0-1 include:
- Switch to the nextgen plugin as the default. This has shown significant performance improvements. To revert to the old plugin, set LIBOMPTARGET_NEXTGEN_PLUGINS=OFF.
- Switch from hostrpc to hostexec. hostexec is a significant rewrite of hostrpc. The device function hostexec_invoke is now written in OpenMP for portability to other platforms. The names of the wrappers (stubs) that run a host function have changed to hostexec() and hostexec_<ReturnType>(). hostexec also uses a global variable to find the transfer payload buffer instead of AMD implicit kernel args. This will support portability of hostexec, printf, and fprintf to other platforms. This device global is updated through the global variable services in the nextgen plugin.
- An example of using hostexec to run MPI_Send and MPI_Recv in a target region is provided. This example demonstrates how library owners can build a supplemental header file that enables transparent host execution of selected library functions within OpenMP target regions, using the same interface as on the host. This eliminates the need for source changes in user code when host execution from a target region is desired. Before hostexec, users would typically have to end their target region, execute a host-only function, then start another target region. This feature significantly increases the general-purpose computing capabilities of OpenMP on GPGPU platforms.
- OMPT target support is incomplete with the nextgen plugin. To use OMPT, set the environment variable LIBOMPTARGET_NEXTGEN_PLUGINS=OFF.
- Set GPU_MAX_HW_QUEUES to 1 in gpurun when there are multiple ranks per GPU. This limits GPU concurrency when the GPU is already shared. gpurun should only set this value if the caller (of gpurun or mpirun) has not already set it; in other words, it should trust a value set by the user. This will be fixed in the next release. Also, the OpenMP nextgen plugin does not use GPU_MAX_HW_QUEUES. It uses the environment variable LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES.
- Critical regions created via the critical directive are now more efficient: by relaxing the semantics of locks and combining that with the use of acquire and release fences we can limit the flushing of the GPU caches to every time the lock is acquired instead of at every lock check.
- When inlining functions called from the kernel, move the allocas for their arguments into the kernel entry block instead of leaving them at the launch point.
- Respect the environment variable that forces synchronous target region execution. Enable it with OMPX_FORCE_SYNC_REGIONS=1.
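For readers unfamiliar with the critical directive mentioned in the list above, the following is a minimal sketch in C of a critical region inside a target region. It is not taken from the AOMP sources, and the function name count_with_critical is hypothetical. It uses a single team (target parallel for) on the assumption that critical only provides mutual exclusion among threads of the same contention group, so it would not protect the counter across teams.

```c
/* Hypothetical example: count n increments of a shared counter,
 * guarding each increment with a critical region.
 * Built with OpenMP offloading, the loop runs on the device.
 * Built without OpenMP, the pragmas are ignored and the loop runs
 * serially on the host, giving the same result. */
int count_with_critical(int n)
{
    int counter = 0;

    /* Single team: all threads share one contention group. */
    #pragma omp target parallel for map(tofrom: counter)
    for (int i = 0; i < n; i++) {
        /* Only one thread at a time executes this block. */
        #pragma omp critical
        {
            counter++;
        }
    }
    return counter;
}
```

Calling count_with_critical(1000) should return 1000 regardless of how many threads execute the loop, since every increment is serialized by the critical region.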
Errata:
- The smoke test "schedule" occasionally fails with a memory fault or wrong ordering.
- AMD code object version 5 (cov5) does not work with the nextgen plugin. When testing cov5, use LIBOMPTARGET_NEXTGEN_PLUGINS=OFF.
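The GPU_MAX_HW_QUEUES item in the change list describes an intended rule: limit the queue count to 1 when several ranks share a GPU, but never override a value the caller already set. A minimal sketch of that decision in C might look like the following. This is an illustration only and not the gpurun implementation; the function name choose_max_hw_queues and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: decide which value a launcher should export
 * for GPU_MAX_HW_QUEUES.
 *   user_value    - value already present in the environment, or NULL
 *   ranks_per_gpu - number of ranks sharing one GPU
 * Returns the value to export, or NULL to leave the variable unset. */
const char *choose_max_hw_queues(const char *user_value, int ranks_per_gpu)
{
    if (user_value != NULL)
        return user_value;      /* trust the caller's setting */
    if (ranks_per_gpu > 1)
        return "1";             /* GPU is shared: limit concurrency */
    return NULL;                /* single rank: leave the default */
}
```

A launcher could pass getenv("GPU_MAX_HW_QUEUES") as user_value, so a value set by the caller of gpurun or mpirun always takes precedence.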