diff --git a/README.md b/README.md index 04189246f2..9deb054361 100644 --- a/README.md +++ b/README.md @@ -7,16 +7,16 @@ Key features include: * HIP is very thin and has little or no performance impact over coding directly in CUDA mode. * HIP allows coding in a single-source C++ programming language including features such as templates, C++11 lambdas, classes, namespaces, and more. * HIP allows developers to use the "best" development environment and tools on each target platform. -* The [HIPIFY](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/master/README.md) tools automatically convert source from CUDA to HIP. +* The [HIPIFY](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/README.md) tools automatically convert source from CUDA to HIP. * Developers can specialize for the platform (CUDA or AMD) to tune for performance or handle tricky cases. New projects can be developed directly in the portable HIP C++ language and can run on either NVIDIA or AMD platforms. Additionally, HIP provides porting tools which make it easy to port existing CUDA codes to the HIP layer, with no loss of performance as compared to the original CUDA application. HIP is not intended to be a drop-in replacement for CUDA, and developers should expect to do some manual coding and performance tuning work to complete the port. ## DISCLAIMER -The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard versionchanges, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated.AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.THIS INFORMATION IS PROVIDED ‘AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. +The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. 
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard versionchanges, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated.AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.THIS INFORMATION IS PROVIDED 'AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. -© 2023 Advanced Micro Devices, Inc. All Rights Reserved. +(C) 2023 Advanced Micro Devices, Inc. All Rights Reserved. ## Repository branches: diff --git a/docs/.doxygen/mainpage.md b/docs/.doxygen/mainpage.md index 65f03dc619..e187a69345 100644 --- a/docs/.doxygen/mainpage.md +++ b/docs/.doxygen/mainpage.md @@ -11,6 +11,7 @@ This is the full HIP Runtime API reference. The API is organized into - @ref Error - @ref Stream - @ref StreamM +- @ref Event - @ref Memory - @ref External - @ref MemoryM @@ -29,4 +30,4 @@ This is the full HIP Runtime API reference. The API is organized into - @ref Graph - @ref Virtual - @ref GL -- [Surface Object](#Surface) +- @ref Surface diff --git a/docs/.sphinx/_toc.yml.in b/docs/.sphinx/_toc.yml.in index e0cb4610fc..11fe58caf5 100644 --- a/docs/.sphinx/_toc.yml.in +++ b/docs/.sphinx/_toc.yml.in @@ -5,7 +5,7 @@ subtrees: - file: user_guide/programming_manual - file: user_guide/hip_rtc - file: user_guide/faq - - file: user_guide/hip_porting_guide + - file: user_guide/hip_porting_guide - file: user_guide/hip_porting_driver_api - caption: How to Guides entries: @@ -23,4 +23,4 @@ subtrees: entries: - file: developer_guide/build - file: developer_guide/logging - - file: developer_guide/contributing.md \ No newline at end of file + - file: developer_guide/contributing.md diff --git a/docs/developer_guide/build.md b/docs/developer_guide/build.md index 51117be207..7102a9d896 100755 --- a/docs/developer_guide/build.md +++ b/docs/developer_guide/build.md @@ -13,6 +13,7 @@ sudo apt install mesa-common-dev sudo apt install clang sudo apt install comgr sudo apt-get -y install rocm-dkms +sudo apt-get install -y libelf-dev ``` ### NVIDIA platform @@ -22,14 +23,14 @@ Install Nvidia driver and pre-build packages (see HIP Installation Guide at http ### Branch of repository Before get HIP source code, set the expected branch of repository at the variable `ROCM_BRANCH`. 
-For example, for ROCm5.0 release branch, set +For example, for ROCm5.7 release branch, set ```shell -export ROCM_BRANCH=rocm-5.0.x +export ROCM_BRANCH=rocm-5.7.x ``` -ROCm5.4 release branch, set +ROCm5.6 release branch, set ```shell -export ROCM_BRANCH=rocm-5.4.x +export ROCM_BRANCH=rocm-5.6.x ``` Similiar format for future branches. @@ -42,38 +43,46 @@ Similiar format for future branches. ### Get HIP source code ```shell -git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hipamd.git +git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/clr.git git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hip.git -git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/ROCclr.git -git clone -b "$ROCM_BRANCH" https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime.git +git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/HIPCC.git ``` ### Set the environment variables ```shell -export HIPAMD_DIR="$(readlink -f hipamd)" +export CLR_DIR="$(readlink -f clr)" export HIP_DIR="$(readlink -f hip)" +export HIPCC_DIR="$(readlink -f hipcc)" ``` -ROCclr is defined on AMD platform that HIP use Radeon Open Compute Common Language Runtime (ROCclr), which is a virtual device interface that HIP runtimes interact with different backends. -See https://github.com/ROCm-Developer-Tools/ROCclr +Note, starting from the ROCm 5.6 release, clr is a new repository that includes the previous ROCclr, HIPAMD and OpenCL repositories. +On the AMD platform, HIP uses the Radeon Open Compute Common Language Runtime (ROCclr), a virtual device interface through which the HIP runtime interacts with different backends. +HIPAMD provides the implementation specifically for the AMD platform. +OpenCL provides headers that the ROCclr runtime currently depends on. -HIPAMD repository provides implementation specifically for AMD platform. -See https://github.com/ROCm-Developer-Tools/hipamd +### Build the HIPCC runtime + +```shell +cd "$HIPCC_DIR" +mkdir -p build; cd build +cmake .. +make -j4 +``` ### Build HIP ```shell -cd "$HIPAMD_DIR" +cd "$CLR_DIR" mkdir -p build; cd build -cmake -DHIP_COMMON_DIR=$HIP_DIR -DCMAKE_PREFIX_PATH="/" -DCMAKE_INSTALL_PREFIX=$PWD/install .. +cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIPCC_BIN_DIR=$HIPCC_DIR/build -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF .. + make -j$(nproc) sudo make install ``` -::::{note} -If you don't specify `CMAKE_INSTALL_PREFIX`, hip runtime will be installed to `/hip`. -By default, release version of AMDHIP is built. -:::: + +Note, if `CMAKE_INSTALL_PREFIX` is not specified, the HIP runtime will be installed to `/hip`. +By default, the release version of HIP is built. ### Default paths and environment variables @@ -120,9 +129,9 @@ Developers can build HIP directed tests right after build HIP commands, sudo make install make -j$(nproc) build_tests ``` -By default, all HIP directed tests will be built and generated under the folder `$HIPAMD_DIR/build/`directed_tests. +By default, all HIP directed tests will be built and generated under the folder `$CLR_DIR/build/hipamd/directed_tests`. Take HIP directed device APIs tests, as an example, all available test applications will have executable files generated under, -`$HIPAMD_DIR/build/directed_tests/runtimeApi/device`. +`$CLR_DIR/build/hipamd/directed_tests/runtimeApi/device`. 
Run all HIP directed_tests, use the command, @@ -138,7 +147,7 @@ Build and run a single directed test, use the follow command as an example, ```shell make directed_tests.texture.hipTexObjPitch -cd $HIPAMD_DIR/build/directed_tests/texcture +cd $CLR_DIR/build/hipamd/directed_tests/texture ./hipTexObjPitch ``` Please note, the integrated HIP directed tests, will be deprecated in future release. @@ -156,20 +165,20 @@ git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hip-tests.gi ##### Build HIP tests from source ```shell -export HIP_TESTS_DIR="$(readlink -f hip-tests)" -cd "$HIP_TESTS_DIR" +export HIPTESTS_DIR="$(readlink -f hip-tests)" +cd "$HIPTESTS_DIR" mkdir -p build; cd build -export HIP_PATH=$HIPAMD_DIR/build/install (or any path where HIP is installed, for example, /opt/rocm) +export HIP_PATH=$CLR_DIR/build/install (or any path where HIP is installed, for example, /opt/rocm) cmake ../catch/ -DHIP_PLATFORM=amd make -j$(nproc) build_tests ctest # run tests ``` -HIP catch tests are built under the folder $HIP_TESTS_DIR/build. +HIP catch tests are built under the folder $HIPTESTS_DIR/build. To run any single catch test, the following is an example, ```shell -cd $HIP_TESTS_DIR/build/catch_tests/unit/texture +cd $HIPTESTS_DIR/build/catch_tests/unit/texture ./TextureTest ``` @@ -178,8 +187,8 @@ cd $HIP_TESTS_DIR/build/catch_tests/unit/texture HIP Catch2 supports build a standalone test, for example, ```shell -cd "$HIP_TESTS_DIR" -hipcc $HIP_TESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc -I ./catch/include ./catch/hipTestMain/standalone_main.cc -I ./catch/external/Catch2 -o hipPointerGetAttributes +cd "$HIPTESTS_DIR" +hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc -I ./catch/include ./catch/hipTestMain/standalone_main.cc -I ./catch/external/Catch2 -o hipPointerGetAttributes ./hipPointerGetAttributes ... @@ -193,22 +202,33 @@ All tests passed ```shell git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hip.git -git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hipamd.git +git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/clr.git +git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/HIPCC.git ``` ### Set the environment variables ```shell export HIP_DIR="$(readlink -f hip)" -export HIPAMD_DIR="$(readlink -f hipamd)" +export CLR_DIR="$(readlink -f clr)" +export HIPCC_DIR="$(readlink -f hipcc)" ``` -### Build HIP +### Build the HIPCC runtime + +```shell +cd "$HIPCC_DIR" +mkdir -p build; cd build +cmake .. +make -j4 +``` + +### Build HIP ```shell -cd "$HIPAMD_DIR" +cd "$CLR_DIR" mkdir -p build; cd build -cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install .. +cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIPCC_BIN_DIR=$HIPCC_DIR/build -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF .. make -j$(nproc) sudo make install ``` @@ -218,5 +238,5 @@ Build HIP tests commands on NVIDIA platform are basically the same as AMD, excep ## Run HIP -Compile and run the [square sample](https://github.com/ROCm-Developer-Tools/HIP/tree/rocm-5.0.x/samples/0_Intro/square). +Compile and run the [square sample](https://github.com/ROCm-Developer-Tools/hip-tests/tree/rocm-5.5.x/samples/0_Intro/square). 
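A minimal smoke-test sketch for the installed toolchain is shown below; the clone location, the sample file name (square.cpp), and the /opt/rocm install path are assumptions for illustration rather than part of the build guide:

```shell
# Hypothetical smoke test: clone the samples, build the square sample with hipcc, and run it.
# Adjust ROCM_BRANCH, paths, and file names to your actual setup.
export PATH=/opt/rocm/bin:$PATH
git clone -b "$ROCM_BRANCH" https://github.com/ROCm-Developer-Tools/hip-tests.git
cd hip-tests/samples/0_Intro/square
hipcc square.cpp -o square   # hipcc selects the AMD or NVIDIA backend automatically
./square
```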
diff --git a/docs/developer_guide/logging.md b/docs/developer_guide/logging.md index b712a40d7c..08c2d0f7c8 100644 --- a/docs/developer_guide/logging.md +++ b/docs/developer_guide/logging.md @@ -85,8 +85,7 @@ ClPrint(amd::LOG_INFO, amd::LOG_INIT, "Initializing HSA stack."); ## HIP Logging Example: -Below is an example to enable HIP logging and get logging information during -execution of hipinfo, +Below is an example to enable HIP logging and get logging information during execution of hipinfo on Linux, ```console user@user-test:~/hip/bin$ export AMD_LOG_LEVEL=4 @@ -136,22 +135,7 @@ concurrentKernels: 1 cooperativeLaunch: 0 cooperativeMultiDeviceLaunch: 0 arch.hasGlobalInt32Atomics: 1 -arch.hasGlobalFloatAtomicExch: 1 -arch.hasSharedInt32Atomics: 1 -arch.hasSharedFloatAtomicExch: 1 -arch.hasFloatAtomicAdd: 1 -arch.hasGlobalInt64Atomics: 1 -arch.hasSharedInt64Atomics: 1 -arch.hasDoubles: 1 -arch.hasWarpVote: 1 -arch.hasWarpBallot: 1 -arch.hasWarpShuffle: 1 -arch.hasFunnelShift: 0 -arch.hasThreadFenceSystem: 1 -arch.hasSyncThreadsExt: 0 -arch.hasSurfaceFuncs: 0 -arch.has3dGrid: 1 -arch.hasDynamicParallelism: 0 +... gcnArch: 1012 isIntegrated: 0 maxTexture1D: 65536 @@ -178,6 +162,54 @@ memInfo.total: 7.98 GB memInfo.free: 7.98 GB (100%) ``` +On Windows, AMD_LOG_LEVEL can be set via environment variable from advanced system setting, or from Command prompt run as administrator, as shown below as an example, which shows some debug log information calling backend runtime on Windows. +```console +C:\hip\bin>set AMD_LOG_LEVEL=4 +C:\hip\bin>hipinfo +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\comgrctx.cpp:33 : 605413686305 us: 29864: [tid:0x9298] Loading COMGR library. +:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\platform\runtime.cpp:83 : 605413869411 us: 29864: [tid:0x9298] init +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_context.cpp:47 : 605413869502 us: 29864: [tid:0x9298] Direct Dispatch: 0 +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:543 : 605413870553 us: 29864: [tid:0x9298] hipGetDeviceCount: Returned hipSuccess : +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:556 : 605413870631 us: 29864: [tid:0x9298] <-[32m hipSetDevice ( 0 ) <-[0m +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:561 : 605413870848 us: 29864: [tid:0x9298] hipSetDevice: Returned hipSuccess : +-------------------------------------------------------------------------------- +device# 0 +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device.cpp:346 : 605413871623 us: 29864: [tid:0x9298] <-[32m hipGetDeviceProperties ( 0000008AEBEFF8C8, 0 ) <-[0m +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device.cpp:348 : 605413871695 us: 29864: [tid:0x9298] hipGetDeviceProperties: Returned hipSuccess : +Name: AMD Radeon(TM) Graphics +pciBusID: 3 +pciDeviceID: 0 +pciDomainID: 0 +multiProcessorCount: 7 +maxThreadsPerMultiProcessor: 2560 +isMultiGpuBoard: 0 +clockRate: 1600 Mhz +memoryClockRate: 1333 Mhz +memoryBusWidth: 0 +totalGlobalMem: 12.06 GB +totalConstMem: 2147483647 +sharedMemPerBlock: 64.00 KB +... 
+gcnArchName: gfx90c:xnack- +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:541 : 605413924779 us: 29864: [tid:0x9298] <-[32m hipGetDeviceCount ( 0000008AEBEFF8A4 ) <-[0m +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_device_runtime.cpp:543 : 605413925075 us: 29864: [tid:0x9298] hipGetDeviceCount: Returned hipSuccess : +peers: :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:176 : 605413928643 us: 29864: [tid:0x9298] <-[32m hipDeviceCanAccessPeer ( 0000008AEBEFF890, 0, 0 ) <-[0m +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:177 : 605413928743 us: 29864: [tid:0x9298] hipDeviceCanAccessPeer: Returned hipSuccess : +non-peers: :3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:176 : 605413930830 us: 29864: [tid:0x9298] <-[32m hipDeviceCanAccessPeer ( 0000008AEBEFF890, 0, 0 ) <-[0m +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_peer.cpp:177 : 605413930882 us: 29864: [tid:0x9298] hipDeviceCanAccessPeer: Returned hipSuccess : +device#0 +... +:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:430 : 605414517802 us: 29864: [tid:0x9298] Free-: 8000 bytes, VM[ 3007c8000, 3007d0000] +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\devprogram.cpp:2979: 605414517893 us: 29864: [tid:0x9298] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferToImage +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\devprogram.cpp:2979: 605414518259 us: 29864: [tid:0x9298] For Init/Fini: Kernel Name: __amd_rocclr_copyBuffer +... +:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:206 : 605414523422 us: 29864: [tid:0x9298] Alloc: 100000 bytes, ptr[00000003008D0000-00000003009D0000], obj[00000003007D0000-00000003047D0000] +:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:206 : 605414523767 us: 29864: [tid:0x9298] Alloc: 100000 bytes, ptr[00000003009D0000-0000000300AD0000], obj[00000003007D0000-00000003047D0000] +:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_memory.cpp:681 : 605414524092 us: 29864: [tid:0x9298] hipMemGetInfo: Returned hipSuccess : +memInfo.total: 12.06 GB +memInfo.free: 11.93 GB (99%) +``` + ## HIP Logging Tips: - HIP logging works for both release and debug version of HIP application. @@ -191,3 +223,4 @@ memInfo.free: 7.98 GB (100%) user@user-test:~/hip/bin$ ./hipinfo > ~/hip_log.txt ``` + diff --git a/docs/how_to_guides/debugging.md b/docs/how_to_guides/debugging.md index 556a51c106..ec9652a7f2 100644 --- a/docs/how_to_guides/debugging.md +++ b/docs/how_to_guides/debugging.md @@ -100,7 +100,7 @@ Reading symbols from ./hipTexObjPitch... (gdb) break main Breakpoint 1 at 0x4013d1: file /home/test/hip/tests/src/texture/hipTexObjPitch.cpp, line 98. (gdb) run -Starting program: /home/test/hip/build/directed_tests/texture/hipTexObjPitch +Starting program: /home/test/hip/build/directed_tests/texture/hipTexObjPitch [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". @@ -112,11 +112,11 @@ Breakpoint 1, main () ``` ### Other Debugging Tools -There are also other debugging tools available online developers can google and choose the one best suits the debugging requirements. 
+There are also other debugging tools available online; developers can search for and choose the one that best suits their debugging requirements. For example, Microsoft Visual Studio and WinDbg are options on Windows. ## Debugging HIP Applications -Below is an example to show how to get useful information from the debugger while running a simple memory copy test, which caused an issue of segmentation fault. +Below is an example on Linux showing how to get useful information from the debugger while running a simple memory copy test that causes a segmentation fault. ```console test: simpleTest2 numElements=4194304 sizeElements=4194304 bytes @@ -176,11 +176,13 @@ Thread 1 "hipMemcpy_simpl" received signal SIGSEGV, Segmentation fault. ... ``` +On Windows, debugging HIP applications in an IDE such as Microsoft Visual Studio is more informative and convenient: developers can step through code, inspect variables, add watches, and examine the call stacks. + ## Useful Environment Variables -HIP provides some environment variables which allow HIP, hip-clang, or HSA driver to disable some feature or optimization. +HIP provides some environment variables which allow HIP, hip-clang, or the HSA driver on Linux to disable some features or optimizations. These are not intended for production but can be useful diagnose synchronization problems in the application (or driver). -Some of the most useful environment variables are described here. They are supported on the ROCm path. +Some of the most useful environment variables are described here. They are supported on the ROCm path on both Linux and Windows. ### Kernel Enqueue Serialization Developers can control kernel command serialization from the host using the environment variable, @@ -201,7 +203,7 @@ So HIP runtime can wait for GPU idle before/after any GPU command depending on t ### Making Device visible For system with multiple devices, it's possible to make only certain device(s) visible to HIP via setting environment variable, -HIP_VISIBLE_DEVICES, only devices whose index is present in the sequence are visible to HIP. +HIP_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES on the Nvidia platform); only devices whose index is present in the sequence are visible to HIP. For example, ```console @@ -221,8 +223,8 @@ if (totalDeviceNum > 2) { ### Dump code object Developers can dump code object to analyze compiler related issues via setting environment variable, GPU_DUMP_CODE_OBJECT -### HSA related environment variables -HSA provides some environment variables help to analyze issues in driver or hardware, for example, +### HSA related environment variables on Linux +On Linux with the open source stack, HSA provides some environment variables that help to analyze issues in the driver or hardware, for example, HSA_ENABLE_SDMA=0 It causes host-to-device and device-to-host copies to use compute shader blit kernels rather than the dedicated DMA copy engines. @@ -241,19 +243,19 @@ The following is the summary of the most useful environment variables in HIP. | -------------------------------------------------------------------------------------------------------------------| ----------------- | --------- | | AMD_LOG_LEVEL
Enable HIP log on different Level. | 0 | 0: Disable log.
1: Enable log on error level.
2: Enable log on warning and below levels.
0x3: Enable log on information and below levels.
0x4: Decode and display AQL packets. | | AMD_LOG_MASK
Enable HIP log on different Level. | 0x7FFFFFFF | 0x1: Log API calls.
0x02: Kernel and Copy Commands and Barriers.
0x4: Synchronization and waiting for commands to finish.
0x8: Enable log on information and below levels.
0x20: Queue commands and queue contents.
0x40:Signal creation, allocation, pool.
0x80: Locks and thread-safety code.
0x100: Copy debug.
0x200: Detailed copy debug.
0x400: Resource allocation, performance-impacting events.
0x800: Initialization and shutdown.
0x1000: Misc debug, not yet classified.
0x2000: Show raw bytes of AQL packet.
0x4000: Show code creation debug.
0x8000: More detailed command info, including barrier commands.
0x10000: Log message location.
0xFFFFFFFF: Log always even mask flag is zero. | -| HIP_VISIBLE_DEVICES
Only devices whose index is present in the sequence are visible to HIP. | | 0,1,2: Depending on the number of devices on the system. | +| HIP_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES)
Only devices whose index is present in the sequence are visible to HIP. | | 0,1,2: Depending on the number of devices on the system. | | GPU_DUMP_CODE_OBJECT
Dump code object. | 0 | 0: Disable.
1: Enable. | | AMD_SERIALIZE_KERNEL
Serialize kernel enqueue. | 0 | 1: Wait for completion before enqueue.
2: Wait for completion after enqueue.
3: Both. | | AMD_SERIALIZE_COPY
Serialize copies. | 0 | 1: Wait for completion before enqueue.
2: Wait for completion after enqueue.
3: Both. | | HIP_HOST_COHERENT
Coherent memory in hipHostMalloc. | 0 | 0: memory is not coherent between host and GPU.
1: memory is coherent with host. | -| AMD_DIRECT_DISPATCH
Enable direct kernel dispatch. | 1 | 0: Disable.
1: Enable. | +| AMD_DIRECT_DISPATCH
Enable direct kernel dispatch (currently for Linux, under development on Windows). | 1 | 0: Disable.
1: Enable. | | GPU_MAX_HW_QUEUES
The maximum number of hardware queues allocated per device. | 4 | The variable controls how many independent hardware queues HIP runtime can create per process, per device. If application allocates more HIP streams than this number, then HIP runtime will reuse the same hardware queues for the new streams in round robin manner. Please note, this maximum number does not apply to either hardware queues that are created for CU masked HIP streams, or cooperative queue for HIP Cooperative Groups (there is only one single queue per device). | ## General Debugging Tips - 'gdb --args' can be used to conveniently pass the executable and arguments to gdb. -- From inside GDB, you can set environment variables "set env". Note the command does not use an '=' sign: +- From inside GDB on Linux, you can set environment variables "set env". Note the command does not use an '=' sign: -``` +```console (gdb) set env AMD_SERIALIZE_KERNEL 3 ``` - The fault will be caught by the runtime but was actually generated by an asynchronous command running on the GPU. So, the GDB backtrace will show a path in the runtime. diff --git a/docs/reference/deprecated_api_list.md b/docs/reference/deprecated_api_list.md index e9e32ec1ba..1266ff392d 100644 --- a/docs/reference/deprecated_api_list.md +++ b/docs/reference/deprecated_api_list.md @@ -80,3 +80,4 @@ Should use roctracer/rocTX instead ### hipTexRefSetMipmapLevelBias ### hipTexRefSetMipmapLevelClamp ### hipTexRefSetMipmappedArray +### hipBindTextureToMipmappedArray diff --git a/docs/reference/kernel_language.md b/docs/reference/kernel_language.md index 3cd4d55007..4edbb057ae 100644 --- a/docs/reference/kernel_language.md +++ b/docs/reference/kernel_language.md @@ -6,7 +6,7 @@ HIP provides a C++ syntax that is suitable for compiling most code that commonly - Math functions resembling those in the "math.h" header included with standard C++ compilers - Built-in functions for accessing specific GPU hardware capabilities -This section describes the built-in variables and functions accessible from the HIP kernel. It’s intended for readers who are familiar with Cuda kernel syntax and want to understand how HIP is different. +This section describes the built-in variables and functions accessible from the HIP kernel. It's intended for readers who are familiar with Cuda kernel syntax and want to understand how HIP is different. Features are marked with one of the following keywords: - **Supported**---HIP supports the feature with a Cuda-equivalent function @@ -26,11 +26,10 @@ Supported `__global__` functions are - Executed on the device - Called ("launched") from the host -HIP `__global__` functions must have a `void` return type, and the first parameter to a HIP `__global__` function must have the type `hipLaunchParm`. See [Kernel-Launch Example](#kernel-launch-example). +HIP `__global__` functions must have a `void` return type, and the first parameter to a HIP `__global__` function must have the type `hipLaunchParm`. See [Kernel-Launch Example](#kernel-launch-example). HIP lacks dynamic-parallelism support, so `__global__ ` functions cannot be called from the device. -(host_attr)= ### `__host__` Supported `__host__` functions are - Executed on the host @@ -68,7 +67,7 @@ MyKernel<<>> (a,b,c,n); ``` -The hipLaunchKernelGGL macro always starts with the five parameters specified above, followed by the kernel arguments. 
HIPIFY tools optionally convert Cuda launch syntax to hipLaunchKernelGGL, including conversion of optional arguments in <<< >>> to the five required hipLaunchKernelGGL parameters. The dim3 constructor accepts zero to three arguments and will by default initialize unspecified dimensions to 1. See [dim3](#dim3). The kernel uses the coordinate built-ins (thread*, block*, grid*) to determine coordinate index and coordinate bounds of the work item that’s currently executing. See {ref}`coordinate_builtins`. +The hipLaunchKernelGGL macro always starts with the five parameters specified above, followed by the kernel arguments. HIPIFY tools optionally convert Cuda launch syntax to hipLaunchKernelGGL, including conversion of optional arguments in <<< >>> to the five required hipLaunchKernelGGL parameters. The dim3 constructor accepts zero to three arguments and will by default initialize unspecified dimensions to 1. See [dim3](#dim3). The kernel uses the coordinate built-ins (thread*, block*, grid*) to determine coordinate index and coordinate bounds of the work item that's currently executing. See [Coordinate Built-Ins](#Coordinate-Built-Ins). Please note, HIP does not support kernel launch with total work items defined in dimension with size gridDim x blockDim >= 2^32. @@ -126,22 +125,21 @@ The `__restrict__` keyword tells the compiler that the associated memory pointer ## Built-In Variables -(coordinate_builtins)= ### Coordinate Built-Ins Built-ins determine the coordinate of the active work item in the execution grid. They are defined in amd_hip_runtime.h (rather than being implicitly defined by the compiler). In HIP, built-ins coordinate variable definitions are the same as in Cuda, for instance: threadIdx.x, blockIdx.y, gridDim.y, etc. The products gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32. +Coordinate built-ins are implemented as structures for better performance. When used with printf, they need to be cast to integer types explicitly. ### warpSize -The warpSize variable is of type int and contains the warp size (in threads) for the target device. Note that all current Nvidia devices return 32 for this variable, and all current AMD devices return 64. Device code should use the warpSize built-in to develop portable wave-aware code. +The warpSize variable is of type int and contains the warp size (in threads) for the target device. Note that all current Nvidia devices return 32 for this variable, and current AMD devices return 64 for gfx9 and 32 for gfx10 and above. The warpSize variable should only be used in device functions. Device code should use the warpSize built-in to develop portable wave-aware code. ## Vector Types Note that these types are defined in hip_runtime.h and are not automatically provided by the compiler. - ### Short Vector Types Short vector types derive from the basic integer and floating-point types. They are structures defined in hip_vector_types.h. The first, second, third and fourth components of the vector are accessible through the ```x```, ```y```, ```z``` and ```w``` fields, respectively. All the short vector types support a constructor function of the form ```make_()```. For example, ```float4 make_float4(float x, float y, float z, float w)``` creates a vector of type ```float4``` and value ```(x,y,z,w)```. 
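The notes above can be combined into one small kernel. The sketch below is illustrative only and is not taken from the HIP documentation; it shows the explicit integer casts of the coordinate built-ins for device-side printf, the warpSize built-in, and a short vector constructor:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void coords_kernel() {
  // Coordinate built-ins are structure members; cast them explicitly for printf.
  printf("block %d, thread %d of %d, warp size %d\n",
         static_cast<int>(blockIdx.x), static_cast<int>(threadIdx.x),
         static_cast<int>(blockDim.x), warpSize);  // warpSize is meaningful in device code only

  float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);  // short vector constructor from hip_vector_types.h
  (void)v.x;
}

int main() {
  // The five required hipLaunchKernelGGL parameters, followed by the kernel arguments (none here).
  hipLaunchKernelGGL(coords_kernel, dim3(1), dim3(4), 0, 0);
  hipDeviceSynchronize();
  return 0;
}
```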
diff --git a/docs/user_guide/faq.md b/docs/user_guide/faq.md index ab67cc745f..81ea40ecf2 100644 --- a/docs/user_guide/faq.md +++ b/docs/user_guide/faq.md @@ -148,6 +148,9 @@ ROCclr (Radeon Open Compute Common Language Runtime) is a virtual device interfa ## What is HIPAMD? HIPAMD is a repository branched out from HIP, mainly the implementation for AMD GPU. +## Can I get HIP open source repository for Windows? +No, there is no publicly open HIP repository for Windows. + ## Can a HIP binary run on both AMD and Nvidia platforms? HIP is a source-portable language that can be compiled to run on either AMD or NVIDIA platform. HIP tools don't create a "fat binary" that can run on either platform, however. @@ -223,7 +226,7 @@ If you have compiled the application yourself, make sure you have given the corr If you have a precompiled application/library (like rocblas, tensorflow etc) which gives you such error, there are one of two possibilities. - The application/library does not ship code object bundles for *all* of your device(s): in this case you need to recompile the application/library yourself with correct `--offload-arch`. - - The application/library does not ship code object bundles for *some* of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. For this you can set the environment variable `HIP_VISIBLE_DEVICES` to only enable GPUs for which code object is available. This will limit the GPUs visible to your application and allow it to run. + - The application/library does not ship code object bundles for *some* of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. For this you can set the environment variable `HIP_VISIBLE_DEVICES` or `CUDA_VISIBLE_DEVICES` on the Nvidia platform, to only enable GPUs for which code object is available. This will limit the GPUs visible to your application and allow it to run. ## How to use per-thread default stream in HIP? @@ -232,11 +235,16 @@ The per-thread default stream is an implicit stream local to both the thread and The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program. In ROCm, a compilation option should be added in order to compile the translation unit with per-thread default stream enabled. -“-fgpu-default-stream=per-thread”. +"-fgpu-default-stream=per-thread". Once source is compiled with per-thread default stream enabled, all APIs will be executed on per thread default stream, hence there will not be any implicit synchronization with other streams. Besides, per-thread default stream be enabled per translation unit, users can compile some files with feature enabled and some with feature disabled. Feature enabled translation unit will have default stream as per thread and there will not be any implicit synchronization done but other modules will have legacy default stream which will do implicit synchronization. +## Can I develop applications with HIP APIs on Windows the same as on Linux? + +Yes, HIP APIs are available to use on both Linux and Windows. +Because Windows and Linux work differently, HIP APIs call the corresponding lower level backend runtime libraries and kernel drivers for each OS in order to control execution on the GPU hardware. There might be a few differences in the related backend software and driver support, which might affect the usage of HIP APIs. 
See OS support details in the HIP API document. + ## How can I know the version of HIP? HIP version definition has been updated since ROCm 4.2 release as the following: diff --git a/docs/user_guide/hip_porting_guide.md b/docs/user_guide/hip_porting_guide.md index be0f793d35..b49a609129 100644 --- a/docs/user_guide/hip_porting_guide.md +++ b/docs/user_guide/hip_porting_guide.md @@ -89,7 +89,7 @@ directory names. | CUB | rocPRIM | Low Level Optimized Parallel Primitives | cuDNN | MIOpen | Deep learning Solver Library | cuRAND | rocRAND | Random Number Generator Library -| EIGEN | EIGEN – HIP port | C++ template library for linear algebra: matrices, vectors, numerical solvers, +| EIGEN | EIGEN - HIP port | C++ template library for linear algebra: matrices, vectors, numerical solvers, | NCCL | RCCL | Communications Primitives Library based on the MPI equivalents @@ -333,7 +333,7 @@ CPPFLAGS += $(shell $(HIP_PATH)/bin/hipconfig --cpp_config) nvcc includes some headers by default. However, HIP does not include default headers, and instead all required files must be explicitly included. Specifically, files that call HIP run-time APIs or define HIP kernels must explicitly include the appropriate HIP headers. -If the compilation process reports that it cannot find necessary APIs (for example, "error: identifier ‘hipSetDevice’ is undefined"), +If the compilation process reports that it cannot find necessary APIs (for example, "error: identifier hipSetDevice is undefined"), ensure that the file includes hip_runtime.h (or hip_runtime_api.h, if appropriate). The hipify-perl script automatically converts "cuda_runtime.h" to "hip_runtime.h," and it converts "cuda_runtime_api.h" to "hip_runtime_api.h", but it may miss nested headers or macros. @@ -461,11 +461,12 @@ In this case, memory type translation for hipPointerGetAttributes needs to be ha So in any HIP applications which use HIP APIs involving memory types, developers should use #ifdef in order to assign the correct enum values depending on Nvidia or AMD platform. -As an example, please see the code from the link, -github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/catch/unit/memory/hipMemcpyParam2D.cc#L77-L96. +As an example, please see the code from the [link](https://github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/catch/unit/memory/hipMemcpyParam2D.cc). With the #ifdef condition, HIP APIs work as expected on both AMD and NVIDIA platforms. +Note, cudaMemoryTypeUnregistered is currently not supported in the hipMemoryType enum, due to HIP functionality backward compatibility. + ## threadfence_system Threadfence_system makes all device memory writes, all writes to mapped host memory, and all writes to peer memory visible to CPU and other GPU devices. Some implementations can provide this behavior by flushing the GPU L2 cache. diff --git a/docs/user_guide/programming_manual.md b/docs/user_guide/programming_manual.md index 012a6a24f0..65a9ffc892 100644 --- a/docs/user_guide/programming_manual.md +++ b/docs/user_guide/programming_manual.md @@ -14,10 +14,9 @@ GPU can directly access the host memory over the CPU/GPU interconnect, without n There are flags parameter which can specify options how to allocate the memory, for example, hipHostMallocPortable, the memory is considered allocated by all contexts, not just the one on which the allocation is made. hipHostMallocMapped, will map the allocation into the address space for the current device, and the device pointer can be obtained with the API hipHostGetDevicePointer(). 
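As a hedged illustration of the flags described above (a sketch, not code from the programming manual), hipHostMalloc can combine hipHostMallocPortable and hipHostMallocMapped, and hipHostGetDevicePointer() then returns the device-side pointer for the same allocation:

```cpp
#include <hip/hip_runtime.h>

int main() {
  float* host_ptr = nullptr;
  // Pinned host allocation, visible to all contexts and mapped into the device address space.
  hipHostMalloc(reinterpret_cast<void**>(&host_ptr), 1024 * sizeof(float),
                hipHostMallocPortable | hipHostMallocMapped);

  float* dev_ptr = nullptr;
  // Device-side pointer to the same allocation; kernels could access the host memory through it.
  hipHostGetDevicePointer(reinterpret_cast<void**>(&dev_ptr), host_ptr, 0);

  hipHostFree(host_ptr);
  return 0;
}
```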
-hipHostMallocNumaUser is the flag to allow host memory allocation to follow numa policy by user. -All allocation flags are independent, and can be used in any combination without restriction, for instance, hipHostMalloc can be called with both hipHostMallocPortable and hipHostMallocMapped flags set. Both usage models described above use the same allocation flags, and the difference is in how the surrounding code uses the host memory. +hipHostMallocNumaUser is the flag to allow host memory allocation to follow Numa policy by user. Please note this flag is currently only applicable on Linux, under development on Windows. -See the hipHostMalloc API for more information. +All allocation flags are independent, and can be used in any combination without restriction, for instance, hipHostMalloc can be called with both hipHostMallocPortable and hipHostMallocMapped flags set. Both usage models described above use the same allocation flags, and the difference is in how the surrounding code uses the host memory. ### Numa-aware host memory allocation Numa policy determines how memory is allocated. @@ -25,51 +24,14 @@ Target of Numa policy is to select a CPU that is closest to each GPU. Numa distance is the measurement of how far between GPU and CPU devices. By default, each GPU selects a Numa CPU node that has the least Numa distance between them, that is, host memory will be automatically allocated closest on the memory pool of Numa node of the current GPU device. Using hipSetDevice API to a different GPU will still be able to access the host allocation, but can have longer Numa distance. - -### Managed memory allocation -Managed memory, including the `__managed__` keyword, is supported in HIP combined host/device compilation. - -Managed memory, via unified memory allocation, allows data be shared and accessible to both the CPU and GPU using a single pointer. -The allocation will be managed by AMD GPU driver using the linux HMM (Heterogeneous Memory Management) mechanism, the user can call managed memory API hipMallocManaged to allocate a large chuch of HMM memory, execute kernels on device and fetch data between the host and device as needed. - -In HIP application, It is recommend to do the capability check before calling the managed memory APIs. For example: - -``` -int managed_memory = 0; -HIPCHECK(hipDeviceGetAttribute(&managed_memory, - hipDeviceAttributeManagedMemory,p_gpuDevice)); - -if (!managed_memory ) { - printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice); -} -else { - HIPCHECK(hipSetDevice(p_gpuDevice)); - HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T))); -. . . -} -``` -Please note, the managed memory capability check may not be necessary, but if HMM is not supported, then managed malloc will fall back to using system memory and other managed memory API calls will have undefined behavior. -For more details on managed memory APIs, please refer to the documentation HIP-API.pdf, and the application at (https://github.com/ROCm-Developer-Tools/HIP/blob/rocm-4.5.x/tests/src/runtimeApi/memory/hipMallocManaged.cpp) is a sample usage. - -### HIP Stream Memory Operations - -HIP supports Stream Memory Operations to enable direct synchronization between Network Nodes and GPU. Following new APIs are added, - hipStreamWaitValue32 - hipStreamWaitValue64 - hipStreamWriteValue32 - hipStreamWriteValue64 - -Note, CPU access to the semaphore's memory requires volatile keyword to disable CPU compiler's optimizations on memory access. 
-For more details, please check the documentation HIP-API.pdf. - -Please note, HIP stream does not gurantee concurrency on AMD hardware for the case of multiple (at least 6) long running streams executing concurrently, using hipStreamSynchronize(nullptr) for synchronization. +Note, Numa policy is so far implemented on Linux, and under development on Windows. ### Coherency Controls ROCm defines two coherency options for host memory: -- Coherent memory : Supports fine-grain synchronization while the kernel is running.  For example, a kernel can perform atomic operations that are visible to the host CPU or to other (peer) GPUs.  Synchronization instructions include threadfence_system and C++11-style atomic operations. +- Coherent memory : Supports fine-grain synchronization while the kernel is running. For example, a kernel can perform atomic operations that are visible to the host CPU or to other (peer) GPUs. Synchronization instructions include threadfence_system and C++11-style atomic operations. In order to achieve this fine-grained coherence, many AMD GPUs use a limited cache policy, such as leaving these allocations uncached by the GPU, or making them read-only. -- Non-coherent memory : Can be cached by GPU, but cannot support synchronization while the kernel is running.  Non-coherent memory can be optionally synchronized only at command (end-of-kernel or copy command) boundaries.  This memory is appropriate for high-performance access when fine-grain synchronization is not required. +- Non-coherent memory : Can be cached by GPU, but cannot support synchronization while the kernel is running. Non-coherent memory can be optionally synchronized only at command (end-of-kernel or copy command) boundaries. This memory is appropriate for high-performance access when fine-grain synchronization is not required. HIP provides the developer with controls to select which type of memory is used via allocation flags passed to hipHostMalloc and the HIP_HOST_COHERENT environment variable. By default, the environment variable HIP_HOST_COHERENT is set to 0 in HIP. The control logic in the current version of HIP is as follows: @@ -96,10 +58,10 @@ Non-coherent ### hipEventSynchronize Developers can control the release scope for hipEvents: -- By default, the GPU performs a device-scope acquire and release operation with each recorded event.  This will make host and device memory visible to other commands executing on the same device. +- By default, the GPU performs a device-scope acquire and release operation with each recorded event. This will make host and device memory visible to other commands executing on the same device. A stronger system-level fence can be specified when the event is created with hipEventCreateWithFlags: -- hipEventReleaseToSystem : Perform a system-scope release operation when the event is recorded.  This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing.  Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use hipEventReleaseToSystem. +- hipEventReleaseToSystem : Perform a system-scope release operation when the event is recorded. This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing. 
Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use hipEventReleaseToSystem. - hipEventDisableTiming: Events created with this flag would not record profiling data and provide best performance if used for synchronization. ### Summary and Recommendations: @@ -107,8 +69,49 @@ A stronger system-level fence can be specified when the event is created with hi - Coherent host memory is the default and is the easiest to use since the memory is visible to the CPU at typical synchronization points. This memory allows in-kernel synchronization commands such as threadfence_system to work transparently. - HIP/ROCm also supports the ability to cache host memory in the GPU using the "Non-Coherent" host memory allocations. This can provide performance benefit, but care must be taken to use the correct synchronization. +### Managed memory allocation +Managed memory, including the `__managed__` keyword, is supported in HIP combined host/device compilation on Linux, but not yet on Windows (under development). + +Managed memory, via unified memory allocation, allows data to be shared and accessible to both the CPU and GPU using a single pointer. +The allocation will be managed by the AMD GPU driver using the Linux HMM (Heterogeneous Memory Management) mechanism; the user can call the managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on a device, and fetch data between the host and device as needed. + +In a HIP application, it is recommended to do the capability check before calling the managed memory APIs. For example: + +``` +int managed_memory = 0; +HIPCHECK(hipDeviceGetAttribute(&managed_memory, + hipDeviceAttributeManagedMemory,p_gpuDevice)); + +if (!managed_memory ) { + printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice); +} +else { + HIPCHECK(hipSetDevice(p_gpuDevice)); + HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T))); +. . . +} +``` +Please note, the managed memory capability check may not be necessary, but if HMM is not supported, then managed malloc will fall back to using system memory and other managed memory API calls will have undefined behavior. +For more details on managed memory APIs, please refer to the documentation HIP-API.pdf, and the application at (https://github.com/ROCm-Developer-Tools/HIP/blob/rocm-4.5.x/tests/src/runtimeApi/memory/hipMallocManaged.cpp) is a sample usage. + +Note, managed memory management is implemented on Linux, not supported on Windows yet. + +### HIP Stream Memory Operations + +HIP supports Stream Memory Operations to enable direct synchronization between Network Nodes and GPU. The following new APIs are added, + hipStreamWaitValue32 + hipStreamWaitValue64 + hipStreamWriteValue32 + hipStreamWriteValue64 + +Note, CPU access to the semaphore's memory requires the volatile keyword to disable the CPU compiler's optimizations on memory access. +For more details, please check the documentation HIP-API.pdf. + +Please note, HIP stream does not guarantee concurrency on AMD hardware for the case of multiple (at least 6) long running streams executing concurrently, using hipStreamSynchronize(nullptr) for synchronization. + ## Direct Dispatch -HIP runtime has Direct Dispatch enabled by default in ROCM 4.4. With this feature we move away from our conventional producer-consumer model where the runtime creates a worker thread(consumer) for each HIP Stream, and the host thread(producer) enqueues commands to a command queue(per stream). 
+HIP runtime has Direct Dispatch enabled by default in ROCm 4.4 on Linux. +With this feature we move away from our conventional producer-consumer model where the runtime creates a worker thread (consumer) for each HIP Stream, and the host thread (producer) enqueues commands to a command queue (per stream). For Direct Dispatch, HIP runtime would directly enqueue a packet to the AQL queue (user mode queue on GPU) on the Dispatch API call from the application. That has shown to reduce the latency to launch the first wave on the idle GPU and total time of tiny dispatches synchronized with the host. @@ -117,14 +120,15 @@ In addition, eliminating the threads in runtime has reduced the variance in the This feature can be disabled via setting the following environment variable, AMD_DIRECT_DISPATCH=0 +Note, Direct Dispatch is implemented on Linux. It is currently not supported on Windows. + ## HIP Runtime Compilation HIP now supports runtime compilation (hipRTC), the usage of which will provide the possibility of optimizations and performance improvement compared with other APIs via regular offline static compilation. hipRTC APIs accept HIP source files in character string format as input parameters and create handles of programs by compiling the HIP source files without spawning separate processes. -For more details on hipRTC APIs, refer to HIP-API.pdf in GitHub (https://github.com/RadeonOpenCompute/ROCm). - -The link here(https://github.com/ROCm-Developer-Tools/HIP/blob/main/tests/src/hiprtc/saxpy.cpp) shows an example how to program HIP application using runtime compilation mechanism, and detail hipRTC programming guide is also available in Github (https://github.com/ROCm-Developer-Tools/HIP/blob/main/docs/markdown/hip_rtc.md). +For more details on hipRTC APIs, refer to the HIP API documentation (https://docs.amd.com/category/api_documentation). +For Linux developers, the link here (https://github.com/ROCm-Developer-Tools/hip-tests/blob/develop/samples/2_Cookbook/23_cmake_hiprtc/saxpy.cpp) shows an example of how to program a HIP application using the runtime compilation mechanism, and a detailed hipRTC programming guide is also available on GitHub (https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/user_guide/hip_rtc.md). ## HIP Graph HIP graph is supported. For more details, refer to the HIP API Guide. @@ -141,9 +145,9 @@ The test codes in the link (https://github.com/ROCm-Developer-Tools/HIP/blob/dev The per-thread default stream is supported in HIP. It is an implicit stream local to both the thread and the current device. This means that the command issued to the per-thread default stream by the thread does not implicitly synchronize with other streams (like explicitly created streams), or default per-thread stream on other threads. The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program. The per-thread default stream can be enabled via adding a compilation option, -“-fgpu-default-stream=per-thread”. +"-fgpu-default-stream=per-thread". -And users can explicitly use "hipStreamPerThread" as per-thread default stream handle as input in API commands. There are test codes as examples in the link (https://github.com/ROCm-Developer-Tools/HIP/tree/develop/tests/catch/unit/streamperthread). +And users can explicitly use "hipStreamPerThread" as per-thread default stream handle as input in API commands. There are test codes as examples in the link (https://github.com/ROCm-Developer-Tools/hip-tests/tree/develop/catch/unit/streamperthread). 
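A minimal sketch of the compile option and the explicit handle is shown below; the file name async_copy.cpp and the buffer size are assumptions for illustration:

```cpp
// Built with: hipcc -fgpu-default-stream=per-thread async_copy.cpp -o async_copy
// so commands issued without an explicit stream go to the per-thread default stream.
#include <hip/hip_runtime.h>

int main() {
  float* dev = nullptr;
  float* host = nullptr;
  hipMalloc(reinterpret_cast<void**>(&dev), 1024 * sizeof(float));
  hipHostMalloc(reinterpret_cast<void**>(&host), 1024 * sizeof(float), hipHostMallocDefault);

  // hipStreamPerThread can also be passed explicitly as the stream handle.
  hipMemcpyAsync(dev, host, 1024 * sizeof(float), hipMemcpyHostToDevice, hipStreamPerThread);
  hipStreamSynchronize(hipStreamPerThread);

  hipHostFree(host);
  hipFree(dev);
  return 0;
}
```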
## Use of Long Double Type @@ -188,4 +192,4 @@ Here is an example to create and use static libraries: hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out ``` -For more information, please see samples/2_Cookbook/15_static_library/host_functions and samples/2_Cookbook/15_static_library/device_functions. +For more information, please see [HIP samples](https://github.com/ROCm-Developer-Tools/hip-tests/tree/rocm-5.5.x/samples/2_Cookbook/15_static_library/host_functions) and [samples](https://github.com/ROCm-Developer-Tools/hip-tests/tree/rocm-5.5.x/samples/2_Cookbook/15_static_library/device_functions). diff --git a/include/hip/hip_runtime_api.h b/include/hip/hip_runtime_api.h index dc2edb0273..8bdfe6eb98 100644 --- a/include/hip/hip_runtime_api.h +++ b/include/hip/hip_runtime_api.h @@ -164,6 +164,13 @@ typedef struct hipDeviceProp_t { /** * hipMemoryType (for pointer attributes) * + * @note hipMemoryType enum values are different from cudaMemoryType enum values. + * In this case, memory type translation for hipPointerGetAttributes needs to be handled properly + * on nvidia platform to get the correct memory type in CUDA. Developers should use '#ifdef' in order + * to assign the correct enum values depending on Nvidia or AMD platform. + * + * @note cudaMemoryTypeUnregistered is currently not supported due to HIP functionality backward + * compatibility. */ typedef enum hipMemoryType { hipMemoryTypeHost = 0, ///< Memory is physically located on host @@ -2616,12 +2623,17 @@ hipError_t hipPointerSetAttribute(const void* value, hipPointer_attribute attrib * @param [out] attributes attributes for the specified pointer * @param [in] ptr pointer to get attributes for * - * Note: To get pointer's memory type, the parameter attributes has 'type' as member variable. - * The 'type' indicates input pointer is allocated on device or host. + * @note To get pointer's memory type, the parameter attributes has 'type' as member variable. + * The 'type' indicates input pointer is allocated on device or host. That means the input + * pointer must be returned or passed through an HIP API such as hipHostMalloc, hipMallocManaged, + * hipHostRegister, etc. Otherwise, the pointer couldn't be handled by this API and attributes + * returned hipErrorInvalidValue, due to the hipMemoryType enums values, unrecognized memory type + * is currently not supported due to HIP functionality backward compatibility. * * @return #hipSuccess, #hipErrorInvalidDevice, #hipErrorInvalidValue * * @see hipPointerGetAttribute + * */ hipError_t hipPointerGetAttributes(hipPointerAttribute_t* attributes, const void* ptr); /** @@ -2797,7 +2809,7 @@ hipError_t hipExtMallocWithFlags(void** ptr, size_t sizeBytes, unsigned int flag * * @return #hipSuccess, #hipErrorOutOfMemory * - * @deprecated use hipHostMalloc() instead + * @warning This API is deprecated use hipHostMalloc() instead */ DEPRECATED("use hipHostMalloc instead") hipError_t hipMallocHost(void** ptr, size_t size); @@ -2811,7 +2823,7 @@ hipError_t hipMallocHost(void** ptr, size_t size); * * @return #hipSuccess, #hipErrorOutOfMemory * - * @deprecated use hipHostMalloc() instead + * @warning This API is deprecated, use hipHostMalloc() instead */ DEPRECATED("use hipHostMalloc instead") hipError_t hipMemAllocHost(void** ptr, size_t size); @@ -2863,7 +2875,7 @@ hipError_t hipHostMalloc(void** ptr, size_t size, unsigned int flags); /** * @brief Allocates memory that will be automatically managed by HIP. 
* - * This API is used for managed memory, allows data be shared and accessible to both the CPU and + * This API is used for managed memory, allows data be shared and accessible to both CPU and * GPU using a single pointer. * * The API returns the allocation pointer, managed by HMM, can be used further to execute kernels @@ -2902,12 +2914,18 @@ hipError_t hipMemPrefetchAsync(const void* dev_ptr, * @brief Advise about the usage of a given memory range to HIP. * * @param [in] dev_ptr pointer to memory to set the advice for - * @param [in] count size in bytes of the memory range, it should be 4KB alligned. + * @param [in] count size in bytes of the memory range, it should be CPU page size alligned. * @param [in] advice advice to be applied for the specified memory range * @param [in] device device to apply the advice for * * @returns #hipSuccess, #hipErrorInvalidValue * + * This HIP API advises about the usage to be applied on unified memory allocation in the + * range starting from the pointer address devPtr, with the size of count bytes. The memory range + * must refer to managed memory allocated via the API hipMallocManaged, and the range will be + * handled with proper round down and round up respectively in the driver to be aligned to + * CPU page size. + * * @note This API is implemented on Linux, under development on Windows. */ hipError_t hipMemAdvise(const void* dev_ptr, @@ -3543,13 +3561,13 @@ hipError_t hipMemAllocPitch(hipDeviceptr_t* dptr, size_t* pitch, size_t widthInB */ hipError_t hipFree(void* ptr); /** - * @brief Free memory allocated by the hcc hip host memory allocation API. [Deprecated] + * @brief Free memory allocated by the hcc hip host memory allocation API [Deprecated] * * @param[in] ptr Pointer to memory to be freed - * @return #hipSuccess, - * #hipErrorInvalidValue (if pointer is invalid, including device pointers allocated with - hipMalloc) - * @deprecated use hipHostFree() instead + * @return #hipSuccess, #hipErrorInvalidValue (if pointer is invalid, including device pointers + * allocated with hipMalloc) + * + * @warning This API is deprecated, use hipHostFree() instead */ DEPRECATED("use hipHostFree instead") hipError_t hipFreeHost(void* ptr); @@ -4282,6 +4300,8 @@ hipError_t hipMemcpy2DToArrayAsync(hipArray* dst, size_t wOffset, size_t hOffset * * @see hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, * hipMemcpyAsync + * + * @warning This API is deprecated. */ DEPRECATED(DEPRECATED_MSG) hipError_t hipMemcpyToArray(hipArray* dst, size_t wOffset, size_t hOffset, const void* src, @@ -4291,7 +4311,7 @@ hipError_t hipMemcpyToArray(hipArray* dst, size_t wOffset, size_t hOffset, const * * @param[in] dst Destination memory address * @param[in] srcArray Source memory address - * @param[in] woffset Source starting X offset + * @param[in] wOffset Source starting X offset * @param[in] hOffset Source starting Y offset * @param[in] count Size in bytes to copy * @param[in] kind Type of transfer @@ -4300,6 +4320,8 @@ hipError_t hipMemcpyToArray(hipArray* dst, size_t wOffset, size_t hOffset, const * * @see hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, * hipMemcpyAsync + * + * @warning This API is deprecated. 
*/ DEPRECATED(DEPRECATED_MSG) hipError_t hipMemcpyFromArray(void* dst, hipArray_const_t srcArray, size_t wOffset, size_t hOffset, @@ -4537,7 +4559,7 @@ hipError_t hipMemcpyPeerAsync(void* dst, int dstDeviceId, const void* src, int s * This section describes the deprecated context management functions of HIP runtime API. */ /** - * @brief Create a context and set it as current/default context + * @brief Create a context and set it as current/default context. * * @param [out] ctx Context to create * @param [in] flags Context creation flags @@ -4581,7 +4603,7 @@ hipError_t hipCtxDestroy(hipCtx_t ctx); DEPRECATED(DEPRECATED_MSG) hipError_t hipCtxPopCurrent(hipCtx_t* ctx); /** - * @brief Push the context to be set as current/ default context + * @brief Push the context to be set as current/ default context. * * @param [in] ctx * @@ -4595,7 +4617,7 @@ hipError_t hipCtxPopCurrent(hipCtx_t* ctx); DEPRECATED(DEPRECATED_MSG) hipError_t hipCtxPushCurrent(hipCtx_t ctx); /** - * @brief Set the passed context as current/default + * @brief Set the passed context as current/default. * * @param [in] ctx * @@ -4609,7 +4631,7 @@ hipError_t hipCtxPushCurrent(hipCtx_t ctx); DEPRECATED(DEPRECATED_MSG) hipError_t hipCtxSetCurrent(hipCtx_t ctx); /** - * @brief Get the handle of the current/ default context + * @brief Get the handle of the current/ default context. * * @param [out] ctx * @@ -4623,7 +4645,7 @@ hipError_t hipCtxSetCurrent(hipCtx_t ctx); DEPRECATED(DEPRECATED_MSG) hipError_t hipCtxGetCurrent(hipCtx_t* ctx); /** - * @brief Get the handle of the device associated with current/default context + * @brief Get the handle of the device associated with current/default context. * * @param [out] device * @@ -4658,7 +4680,7 @@ hipError_t hipCtxGetDevice(hipDevice_t* device); DEPRECATED(DEPRECATED_MSG) hipError_t hipCtxGetApiVersion(hipCtx_t ctx, int* apiVersion); /** - * @brief Get Cache configuration for a specific function + * @brief Get Cache configuration for a specific function. * * @param [out] cacheConfig Cache configuration * @@ -4763,7 +4785,7 @@ hipError_t hipCtxGetFlags(unsigned int* flags); * accessible from the current device until a call to hipDeviceDisablePeerAccess or hipDeviceReset. * * - * @param [in] peerCtx Peer context + * @param [in] peerCtx Peer context * @param [in] flags flags, need to set as 0 * * @returns #hipSuccess, #hipErrorInvalidDevice, #hipErrorInvalidValue, @@ -5216,16 +5238,16 @@ hipError_t hipOccupancyMaxPotentialBlockSize(int* gridSize, int* blockSize, */ // TODO - expand descriptions: /** - * @brief Start recording of profiling information + * @brief Start recording of profiling information. * When using this API, start the profiler with profiling disabled. (--startdisabled) - * @warning : hipProfilerStart API is under development. + * @warning hipProfilerStart API is deprecated, use roctracer/rocTX instead. */ DEPRECATED("use roctracer/rocTX instead") hipError_t hipProfilerStart(); /** * @brief Stop recording of profiling information. * When using this API, start the profiler with profiling disabled. (--startdisabled) - * @warning : hipProfilerStop API is under development. + * @warning hipProfilerStop API is deprecated, use roctracer/rocTX instead. */ DEPRECATED("use roctracer/rocTX instead") hipError_t hipProfilerStop(); @@ -8182,7 +8204,7 @@ inline hipError_t hipExtLaunchMultiKernelMultiDevice(hipLaunchParams* launchPara } /** * @brief Binds a memory area to a texture. - * + * @ingroup TextureD * @param [in] offset Offset in bytes. * @param [in] tex Texture to bind. 
* @param [in] devPtr Pointer of memory on the device. @@ -8199,7 +8221,7 @@ static inline hipError_t hipBindTexture(size_t* offset, const struct texture