
Commit

WIP
neon60 committed May 24, 2024
1 parent acdd566 commit 91cd84d
Showing 4 changed files with 34 additions and 26 deletions.
10 changes: 9 additions & 1 deletion .wordlist.txt
@@ -1,4 +1,12 @@
+ALU
+ALUs
+APU
AQL
+builtins
+Builtins
-NDRange
+clr
+GPGPU
+multicore
+NDRange
SIMT
SYCL
38 changes: 19 additions & 19 deletions docs/how-to/programming_manual.md
@@ -4,22 +4,22 @@

### Introduction

-hipHostMalloc allocates pinned host memory which is mapped into the address space of all GPUs in the system, the memory can be accessed directly by the GPU device, and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().
+`hipHostMalloc` allocates pinned host memory which is mapped into the address space of all GPUs in the system. This memory can be accessed directly by the GPU device and can be read or written with much higher bandwidth than pageable memory obtained with functions such as `malloc()`.
There are two use cases for this host memory:

* Faster `HostToDevice` and `DeviceToHost` Data Transfers:
-The runtime tracks the hipHostMalloc allocations and can avoid some of the setup required for regular unpinned memory. For exact measurements on a specific system, experiment with `--unpinned` and `--pinned` switches for the `hipBusBandwidth` tool.
+The runtime tracks the `hipHostMalloc` allocations and can avoid some of the setup required for regular unpinned memory. For exact measurements on a specific system, experiment with the `--unpinned` and `--pinned` switches of the `hipBusBandwidth` tool.
* Zero-Copy GPU Access:
The GPU can directly access the host memory over the CPU/GPU interconnect, with no need to copy the data. This avoids the copy, but each memory access during kernel execution must traverse the interconnect, which can be tens of times slower than accessing the GPU's local device memory. Zero-copy memory can be a good choice when the memory accesses are infrequent (perhaps only once). Zero-copy memory is typically "Coherent" and thus not cached by the GPU, but this can be overridden if desired.
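A rough sketch of the first use case (a pinned staging buffer for faster transfers), assuming a single visible device and eliding error handling:

```cpp
#include <hip/hip_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    // Pinned host buffer: HostToDevice copies from here avoid the
    // staging overhead that pageable malloc() memory incurs.
    float* pinned = nullptr;
    hipHostMalloc(reinterpret_cast<void**>(&pinned), bytes, hipHostMallocDefault);

    float* dev = nullptr;
    hipMalloc(reinterpret_cast<void**>(&dev), bytes);
    hipMemcpy(dev, pinned, bytes, hipMemcpyHostToDevice);

    hipFree(dev);
    hipHostFree(pinned);
    return 0;
}
```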

### Memory allocation flags

There is a flags parameter which can specify options for how to allocate the memory, for example:
-hipHostMallocPortable, the memory is considered allocated by all contexts, not just the one on which the allocation is made.
-hipHostMallocMapped, will map the allocation into the address space for the current device, and the device pointer can be obtained with the API hipHostGetDevicePointer().
-hipHostMallocNumaUser is the flag to allow host memory allocation to follow Numa policy by user. Please note this flag is currently only applicable on Linux, under development on Windows.
+`hipHostMallocPortable`: the memory is considered allocated by all contexts, not just the one on which the allocation is made.
+`hipHostMallocMapped`: maps the allocation into the address space of the current device; the device pointer can be obtained with the API `hipHostGetDevicePointer()`.
+`hipHostMallocNumaUser`: allows the host memory allocation to follow the user's NUMA policy. Please note this flag is currently only applicable on Linux, and under development on Windows.

-All allocation flags are independent, and can be used in any combination without restriction, for instance, hipHostMalloc can be called with both hipHostMallocPortable and hipHostMallocMapped flags set. Both usage models described above use the same allocation flags, and the difference is in how the surrounding code uses the host memory.
+All allocation flags are independent and can be used in any combination without restriction; for instance, `hipHostMalloc` can be called with both the `hipHostMallocPortable` and `hipHostMallocMapped` flags set. Both usage models described above use the same allocation flags; the difference is in how the surrounding code uses the host memory.
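As a hedged illustration of combining flags, the sketch below allocates with both `hipHostMallocPortable` and `hipHostMallocMapped` and then retrieves the device-side pointer; it assumes the allocation succeeds and omits error checks:

```cpp
#include <hip/hip_runtime.h>

int main() {
    int* host_ptr = nullptr;
    // Portable: usable from all contexts. Mapped: also mapped into the
    // current device's address space.
    hipHostMalloc(reinterpret_cast<void**>(&host_ptr), sizeof(int),
                  hipHostMallocPortable | hipHostMallocMapped);

    // Device pointer for the same physical allocation.
    int* dev_ptr = nullptr;
    hipHostGetDevicePointer(reinterpret_cast<void**>(&dev_ptr), host_ptr, 0);

    hipHostFree(host_ptr);
    return 0;
}
```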

### Numa-aware host memory allocation

@@ -58,10 +58,10 @@ Non-coherent

| HIP API | Synchronization Effect | Fence | Coherent Host Memory Visibility | Non-Coherent Host Memory Visibility |
| --- | --- | --- | --- | --- |
-| hipStreamSynchronize | host waits for all commands in the specified stream to complete | system-scope release | yes | yes |
-| hipDeviceSynchronize | host waits for all commands in all streams on the specified device to complete | system-scope release | yes | yes |
-| hipEventSynchronize | host waits for the specified event to complete | device-scope release | yes | depends - see below |
-| hipStreamWaitEvent | stream waits for the specified event to complete | none | yes | no |
+| `hipStreamSynchronize` | host waits for all commands in the specified stream to complete | system-scope release | yes | yes |
+| `hipDeviceSynchronize` | host waits for all commands in all streams on the specified device to complete | system-scope release | yes | yes |
+| `hipEventSynchronize` | host waits for the specified event to complete | device-scope release | yes | depends - see below |
+| `hipStreamWaitEvent` | stream waits for the specified event to complete | none | yes | no |
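To illustrate the first row of the table, a minimal sketch (error handling elided; assumes the default coherent pinned allocation is device-accessible): once `hipStreamSynchronize` returns, the kernel's store to coherent host memory is visible to the host.

```cpp
#include <hip/hip_runtime.h>
#include <cassert>

__global__ void set_flag(int* flag) { *flag = 1; }

int main() {
    int* flag = nullptr;
    hipHostMalloc(reinterpret_cast<void**>(&flag), sizeof(int), hipHostMallocDefault);
    *flag = 0;

    hipStream_t stream;
    hipStreamCreate(&stream);
    set_flag<<<1, 1, 0, stream>>>(flag);

    // System-scope release: once this returns, the store is host-visible.
    hipStreamSynchronize(stream);
    assert(*flag == 1);

    hipStreamDestroy(stream);
    hipHostFree(flag);
    return 0;
}
```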

### hipEventSynchronize

@@ -110,10 +110,10 @@ Note, managed memory management is implemented on Linux, not supported on Windows
### HIP Stream Memory Operations
HIP supports Stream Memory Operations to enable direct synchronization between network nodes and the GPU. The following new APIs are added:
-hipStreamWaitValue32
-hipStreamWaitValue64
-hipStreamWriteValue32
-hipStreamWriteValue64
+`hipStreamWaitValue32`
+`hipStreamWaitValue64`
+`hipStreamWriteValue32`
+`hipStreamWriteValue64`
Note: CPU access to the semaphore's memory requires the `volatile` keyword to disable the CPU compiler's optimizations on memory access.
For more details, please check the documentation HIP-API.pdf.
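A hedged sketch of the wait/write pair (the flag constant `hipStreamWaitValueGte` is assumed from the HIP runtime headers; error handling is elided):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

int main() {
    // Semaphore in pinned host memory; volatile so CPU-side accesses
    // are not optimized away.
    volatile uint32_t* sem = nullptr;
    hipHostMalloc((void**)&sem, sizeof(uint32_t), hipHostMallocDefault);
    *sem = 0;

    hipStream_t stream;
    hipStreamCreate(&stream);

    // Stream blocks until *sem >= 1, then writes 2 as a completion marker.
    hipStreamWaitValue32(stream, (void*)sem, 1, hipStreamWaitValueGte, 0xFFFFFFFFu);
    hipStreamWriteValue32(stream, (void*)sem, 2, 0);

    *sem = 1;                      // CPU (or a NIC) signals the semaphore
    hipStreamSynchronize(stream);  // returns after the wait fires and the write lands

    hipStreamDestroy(stream);
    hipHostFree((void*)sem);
    return 0;
}
```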
@@ -158,7 +158,7 @@ This implementation does not require the use of `hipDeviceSetLimit(hipLimitMallo
The per-thread default stream is supported in HIP. It is an implicit stream local to both the thread and the current device. This means that commands issued to the per-thread default stream by a thread do not implicitly synchronize with other streams (such as explicitly created streams) or with the per-thread default streams of other threads.
The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
The per-thread default stream can be enabled by adding the compilation option
"-fgpu-default-stream=per-thread".
`-fgpu-default-stream=per-thread`.
Users can explicitly pass `hipStreamPerThread` as the per-thread default stream handle in API calls. Example test code is available at the [link](https://github.com/ROCm/hip-tests/tree/develop/catch/unit/streamperthread).
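For example, a small sketch of using the handle explicitly (assuming the file is compiled with `hipcc -fgpu-default-stream=per-thread`):

```cpp
#include <hip/hip_runtime.h>

__global__ void work() {}

int main() {
    // Launches issued by this host thread go to its own default stream.
    work<<<1, 1, 0, hipStreamPerThread>>>();
    hipStreamSynchronize(hipStreamPerThread);
    return 0;
}
```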
@@ -172,10 +172,10 @@ If a host function is to be used between clang (or hipcc) and gcc for x86_64, i.
## FMA and contractions
-By default HIP-Clang assumes -ffp-contract=fast-honor-pragmas.
-Users can use '#pragma clang fp contract(on|off|fast)' to control fp contraction of a block of code.
+By default HIP-Clang assumes `-ffp-contract=fast-honor-pragmas`.
+Users can use `#pragma clang fp contract(on|off|fast)` to control FP contraction for a block of code.
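For instance, a brief sketch of scoping the pragma to individual blocks (illustrative only):

```cpp
float contract_demo(float a, float b, float c) {
    {
#pragma clang fp contract(fast)
        c = a * b + c;  // may be contracted into a single FMA
    }
    {
#pragma clang fp contract(off)
        c = a * b + c;  // multiply and add stay separate, with rounding in between
    }
    return c;
}
```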
For x86_64, FMA is off by default since the generic x86_64 target does not
-support FMA by default. To turn on FMA on x86_64, either use -mfma or -march=native
+support FMA by default. To turn on FMA on x86_64, either use `-mfma` or `-march=native`
on CPUs supporting FMA.
When contractions are enabled and the CPU has not enabled FMA instructions, the
@@ -194,7 +194,7 @@ In addition, the first type of library contains host objects with device code em
Here is an example of creating and using static libraries:
-* Type 1 using --emit-static-lib:
+* Type 1 using `--emit-static-lib`:
```bash
hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
10 changes: 5 additions & 5 deletions docs/understand/glossary.md
@@ -4,7 +4,7 @@
* **default device**: Each host thread maintains a default device.
Most HIP runtime APIs (including memory allocation, copy commands, kernel launches) do not accept an explicit device
argument but instead implicitly use the default device.
-The default device can be set with ```hipSetDevice```.
+The default device can be set with `hipSetDevice`.
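A minimal sketch of the implicit use of the default device (assumes a system with at least two devices; error handling elided):

```cpp
#include <hip/hip_runtime.h>

int main() {
    hipSetDevice(1);           // device 1 becomes this thread's default device

    void* buf = nullptr;
    hipMalloc(&buf, 1 << 20);  // no device argument: allocates on device 1

    hipFree(buf);
    return 0;
}
```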

* **active host thread** - the thread which is running the HIP APIs.

@@ -13,12 +13,12 @@ The default device can be set with ```hipSetDevice```.
* **clr** - the repository for the AMD Common Language Runtime; it contains the source code for AMD's compute language runtimes: HIP and OpenCL.
clr (https://github.com/ROCm/clr) contains the following three parts:

-* ```hipamd```: contains implementation of HIP language on AMD platform.
-* ```rocclr```: contains common runtime used in HIP and OpenCL, which provides virtual device interfaces that compute runtimes interact with different backends such as ROCr on Linux or PAL on Windows.
-* ```opencl```: contains implementation of OpenCL on AMD platform.
+* `hipamd`: contains the implementation of the HIP language on the AMD platform.
+* `rocclr`: contains the common runtime used by HIP and OpenCL, which provides the virtual device interfaces through which the compute runtimes interact with backends such as ROCr on Linux or PAL on Windows.
+* `opencl`: contains the implementation of OpenCL on the AMD platform.

* **hipify tools** - tools to convert CUDA code to portable C++ code (https://github.com/ROCm/HIPIFY).

* **hipconfig** - tool to report various configuration properties of the target platform.

-* **nvcc** - NVIDIA CUDA ```nvcc``` compiler, do not capitalize.
+* **nvcc** - the NVIDIA CUDA `nvcc` compiler; do not capitalize.
2 changes: 1 addition & 1 deletion docs/understand/programming_model.rst
@@ -84,7 +84,7 @@ identical instructions over the available SIMD engines.

Consider the following kernel:

-.. code:: cu
+.. code:: cpp
__global__ void k(float4* a, const float4* b)
{
