
Commit

WIP
neon60 committed May 24, 2024
1 parent acdd566 commit 91cd84d
Showing 4 changed files with 34 additions and 26 deletions.
10 changes: 9 additions & 1 deletion .wordlist.txt
@@ -1,4 +1,12 @@
+ALU
+ALUs
+APU
AQL
+builtins
+Builtins
-NDRange
+clr
+GPGPU
+multicore
+NDRange
SIMT
SYCL
38 changes: 19 additions & 19 deletions docs/how-to/programming_manual.md
@@ -4,22 +4,22 @@

### Introduction

-hipHostMalloc allocates pinned host memory which is mapped into the address space of all GPUs in the system, the memory can be accessed directly by the GPU device, and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().
+`hipHostMalloc` allocates pinned host memory which is mapped into the address space of all GPUs in the system. This memory can be accessed directly by the GPU device and can be read or written with much higher bandwidth than pageable memory obtained with functions such as `malloc()`.
There are two use cases for this host memory:

* Faster `HostToDevice` and `DeviceToHost` Data Transfers:
-The runtime tracks the hipHostMalloc allocations and can avoid some of the setup required for regular unpinned memory. For exact measurements on a specific system, experiment with `--unpinned` and `--pinned` switches for the `hipBusBandwidth` tool.
+The runtime tracks the `hipHostMalloc` allocations and can avoid some of the setup required for regular unpinned memory. For exact measurements on a specific system, experiment with the `--unpinned` and `--pinned` switches of the `hipBusBandwidth` tool.
* Zero-Copy GPU Access:
The GPU can directly access the host memory over the CPU/GPU interconnect, with no need to copy the data. This avoids the copy, but each memory access during kernel execution must traverse the interconnect, which can be tens of times slower than accessing the GPU's local device memory. Zero-copy memory can be a good choice when the memory accesses are infrequent (perhaps only once). Zero-copy memory is typically "Coherent" and thus not cached by the GPU, but this can be overridden if desired.
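A rough sketch of the first use case (a pinned staging buffer for faster transfers), assuming a single visible device and eliding error handling:

```cpp
#include <hip/hip_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    // Pinned host buffer: HostToDevice copies from here avoid the
    // staging overhead that pageable malloc() memory incurs.
    float* pinned = nullptr;
    hipHostMalloc(reinterpret_cast<void**>(&pinned), bytes, hipHostMallocDefault);

    float* dev = nullptr;
    hipMalloc(reinterpret_cast<void**>(&dev), bytes);
    hipMemcpy(dev, pinned, bytes, hipMemcpyHostToDevice);

    hipFree(dev);
    hipHostFree(pinned);
    return 0;
}
```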

### Memory allocation flags

There is a flags parameter which can specify options for how to allocate the memory, for example:
-hipHostMallocPortable, the memory is considered allocated by all contexts, not just the one on which the allocation is made.
-hipHostMallocMapped, will map the allocation into the address space for the current device, and the device pointer can be obtained with the API hipHostGetDevicePointer().
-hipHostMallocNumaUser is the flag to allow host memory allocation to follow Numa policy by user. Please note this flag is currently only applicable on Linux, under development on Windows.
+`hipHostMallocPortable`: the memory is considered allocated by all contexts, not just the one on which the allocation is made.
+`hipHostMallocMapped`: maps the allocation into the address space of the current device; the device pointer can be obtained with the API `hipHostGetDevicePointer()`.
+`hipHostMallocNumaUser`: allows the host memory allocation to follow the user's NUMA policy. Please note this flag is currently only applicable on Linux, and under development on Windows.

-All allocation flags are independent, and can be used in any combination without restriction, for instance, hipHostMalloc can be called with both hipHostMallocPortable and hipHostMallocMapped flags set. Both usage models described above use the same allocation flags, and the difference is in how the surrounding code uses the host memory.
+All allocation flags are independent and can be used in any combination without restriction; for instance, `hipHostMalloc` can be called with both the `hipHostMallocPortable` and `hipHostMallocMapped` flags set. Both usage models described above use the same allocation flags; the difference is in how the surrounding code uses the host memory.
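As a hedged illustration of combining flags, the sketch below allocates with both `hipHostMallocPortable` and `hipHostMallocMapped` and then retrieves the device-side pointer; it assumes the allocation succeeds and omits error checks:

```cpp
#include <hip/hip_runtime.h>

int main() {
    int* host_ptr = nullptr;
    // Portable: usable from all contexts. Mapped: also mapped into the
    // current device's address space.
    hipHostMalloc(reinterpret_cast<void**>(&host_ptr), sizeof(int),
                  hipHostMallocPortable | hipHostMallocMapped);

    // Device pointer for the same physical allocation.
    int* dev_ptr = nullptr;
    hipHostGetDevicePointer(reinterpret_cast<void**>(&dev_ptr), host_ptr, 0);

    hipHostFree(host_ptr);
    return 0;
}
```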

### Numa-aware host memory allocation

@@ -58,10 +58,10 @@ Non-coherent

| HIP API | Synchronization Effect | Fence | Coherent Host Memory Visibility | Non-Coherent Host Memory Visibility |
| --- | --- | --- | --- | --- |
-| hipStreamSynchronize | host waits for all commands in the specified stream to complete | system-scope release | yes | yes |
-| hipDeviceSynchronize | host waits for all commands in all streams on the specified device to complete | system-scope release | yes | yes |
-| hipEventSynchronize | host waits for the specified event to complete | device-scope release | yes | depends - see below |
-| hipStreamWaitEvent | stream waits for the specified event to complete | none | yes | no |
+| `hipStreamSynchronize` | host waits for all commands in the specified stream to complete | system-scope release | yes | yes |
+| `hipDeviceSynchronize` | host waits for all commands in all streams on the specified device to complete | system-scope release | yes | yes |
+| `hipEventSynchronize` | host waits for the specified event to complete | device-scope release | yes | depends - see below |
+| `hipStreamWaitEvent` | stream waits for the specified event to complete | none | yes | no |
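To illustrate the first row of the table, a minimal sketch (error handling elided; assumes the default coherent pinned allocation is device-accessible): once `hipStreamSynchronize` returns, the kernel's store to coherent host memory is visible to the host.

```cpp
#include <hip/hip_runtime.h>
#include <cassert>

__global__ void set_flag(int* flag) { *flag = 1; }

int main() {
    int* flag = nullptr;
    hipHostMalloc(reinterpret_cast<void**>(&flag), sizeof(int), hipHostMallocDefault);
    *flag = 0;

    hipStream_t stream;
    hipStreamCreate(&stream);
    set_flag<<<1, 1, 0, stream>>>(flag);

    // System-scope release: once this returns, the store is host-visible.
    hipStreamSynchronize(stream);
    assert(*flag == 1);

    hipStreamDestroy(stream);
    hipHostFree(flag);
    return 0;
}
```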

### hipEventSynchronize

@@ -110,10 +110,10 @@ Note, managed memory management is implemented on Linux, not supported on Windows
### HIP Stream Memory Operations
HIP supports Stream Memory Operations to enable direct synchronization between network nodes and the GPU. The following new APIs are added:
-hipStreamWaitValue32
-hipStreamWaitValue64
-hipStreamWriteValue32
-hipStreamWriteValue64
+`hipStreamWaitValue32`
+`hipStreamWaitValue64`
+`hipStreamWriteValue32`
+`hipStreamWriteValue64`
Note: CPU access to the semaphore's memory requires the `volatile` keyword to disable the CPU compiler's optimizations on memory access.
For more details, please check the documentation HIP-API.pdf.
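A hedged sketch of the wait/write pair (the flag constant `hipStreamWaitValueGte` is assumed from the HIP runtime headers; error handling is elided):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

int main() {
    // Semaphore in pinned host memory; volatile so CPU-side accesses
    // are not optimized away.
    volatile uint32_t* sem = nullptr;
    hipHostMalloc((void**)&sem, sizeof(uint32_t), hipHostMallocDefault);
    *sem = 0;

    hipStream_t stream;
    hipStreamCreate(&stream);

    // Stream blocks until *sem >= 1, then writes 2 as a completion marker.
    hipStreamWaitValue32(stream, (void*)sem, 1, hipStreamWaitValueGte, 0xFFFFFFFFu);
    hipStreamWriteValue32(stream, (void*)sem, 2, 0);

    *sem = 1;                      // CPU (or a NIC) signals the semaphore
    hipStreamSynchronize(stream);  // returns after the wait fires and the write lands

    hipStreamDestroy(stream);
    hipHostFree((void*)sem);
    return 0;
}
```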
@@ -158,7 +158,7 @@ This implementation does not require the use of `hipDeviceSetLimit(hipLimitMallo
The per-thread default stream is supported in HIP. It is an implicit stream local to both the thread and the current device. This means that commands issued to the per-thread default stream by a thread do not implicitly synchronize with other streams (such as explicitly created streams) or with the per-thread default streams of other threads.
The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
The per-thread default stream can be enabled by adding the compilation option
"-fgpu-default-stream=per-thread".
`-fgpu-default-stream=per-thread`.
Users can explicitly pass `hipStreamPerThread` as the per-thread default stream handle in API calls. Example test code is available at the [link](https://github.com/ROCm/hip-tests/tree/develop/catch/unit/streamperthread).
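For example, a small sketch of using the handle explicitly (assuming the file is compiled with `hipcc -fgpu-default-stream=per-thread`):

```cpp
#include <hip/hip_runtime.h>

__global__ void work() {}

int main() {
    // Launches issued by this host thread go to its own default stream.
    work<<<1, 1, 0, hipStreamPerThread>>>();
    hipStreamSynchronize(hipStreamPerThread);
    return 0;
}
```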
@@ -172,10 +172,10 @@ If a host function is to be used between clang (or hipcc) and gcc for x86_64, i.
## FMA and contractions
-By default HIP-Clang assumes -ffp-contract=fast-honor-pragmas.
-Users can use '#pragma clang fp contract(on|off|fast)' to control fp contraction of a block of code.
+By default HIP-Clang assumes `-ffp-contract=fast-honor-pragmas`.
+Users can use `#pragma clang fp contract(on|off|fast)` to control FP contraction for a block of code.
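For instance, a brief sketch of scoping the pragma to individual blocks (illustrative only):

```cpp
float contract_demo(float a, float b, float c) {
    {
#pragma clang fp contract(fast)
        c = a * b + c;  // may be contracted into a single FMA
    }
    {
#pragma clang fp contract(off)
        c = a * b + c;  // multiply and add stay separate, with rounding in between
    }
    return c;
}
```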
For x86_64, FMA is off by default since the generic x86_64 target does not
-support FMA by default. To turn on FMA on x86_64, either use -mfma or -march=native
+support FMA by default. To turn on FMA on x86_64, either use `-mfma` or `-march=native`
on CPUs supporting FMA.
When contractions are enabled and the CPU has not enabled FMA instructions, the
@@ -194,7 +194,7 @@ In addition, the first type of library contains host objects with device code em
Here is an example of creating and using static libraries:
-* Type 1 using --emit-static-lib:
+* Type 1 using `--emit-static-lib`:
```bash
hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
10 changes: 5 additions & 5 deletions docs/understand/glossary.md
@@ -4,7 +4,7 @@
* **default device**: Each host thread maintains a default device.
Most HIP runtime APIs (including memory allocation, copy commands, kernel launches) do not accept an explicit device
argument but instead implicitly use the default device.
-The default device can be set with ```hipSetDevice```.
+The default device can be set with `hipSetDevice`.
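A minimal sketch of the implicit use of the default device (assumes a system with at least two devices; error handling elided):

```cpp
#include <hip/hip_runtime.h>

int main() {
    hipSetDevice(1);           // device 1 becomes this thread's default device

    void* buf = nullptr;
    hipMalloc(&buf, 1 << 20);  // no device argument: allocates on device 1

    hipFree(buf);
    return 0;
}
```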

* **active host thread** - the thread which is running the HIP APIs.

@@ -13,12 +13,12 @@ The default device can be set with ```hipSetDevice```.
* **clr** - the repository for the AMD Common Language Runtime; it contains the source code for AMD's compute language runtimes: HIP and OpenCL.
clr (https://github.com/ROCm/clr) contains the following three parts:

-* ```hipamd```: contains implementation of HIP language on AMD platform.
-* ```rocclr```: contains common runtime used in HIP and OpenCL, which provides virtual device interfaces that compute runtimes interact with different backends such as ROCr on Linux or PAL on Windows.
-* ```opencl```: contains implementation of OpenCL on AMD platform.
+* `hipamd`: contains the implementation of the HIP language on the AMD platform.
+* `rocclr`: contains the common runtime used by HIP and OpenCL, which provides the virtual device interfaces through which the compute runtimes interact with backends such as ROCr on Linux or PAL on Windows.
+* `opencl`: contains the implementation of OpenCL on the AMD platform.

* **hipify tools** - tools to convert CUDA code to portable C++ code (https://github.com/ROCm/HIPIFY).

* **hipconfig** - tool to report various configuration properties of the target platform.

-* **nvcc** - NVIDIA CUDA ```nvcc``` compiler, do not capitalize.
+* **nvcc** - the NVIDIA CUDA `nvcc` compiler; do not capitalize.
2 changes: 1 addition & 1 deletion docs/understand/programming_model.rst
@@ -84,7 +84,7 @@ identical instructions over the available SIMD engines.

Consider the following kernel:

-.. code:: cu
+.. code:: cpp
__global__ void k(float4* a, const float4* b)
{
