its own distinct memory. Kernels execute mainly on device memory; the runtime
offers functions for allocating, deallocating, and copying device memory, along
with transferring data between host and device memory.

Device memory
================================================================================

Device memory exists on the device, for example on GPUs in the video random
access memory (VRAM), and is accessible by the kernels operating on the device.
Recent architectures use graphics double data rate (GDDR) synchronous dynamic
random-access memory (SDRAM) such as GDDR6, or high-bandwidth memory (HBM) such
as HBM2e. Device memory can be allocated as global, constant, texture or
surface memory.

Global memory
--------------------------------------------------------------------------------

Read-write storage visible to all threads on a given device. There are
specialized versions of global memory with different usage semantics which are
typically backed by the same hardware, but can use different caching paths.
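
A minimal sketch of allocating global memory, copying data into it, and
freeing it again might look like the following (error handling is omitted for
brevity):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   #include <vector>

   int main() {
       constexpr size_t count = 1024;
       std::vector<float> host(count, 1.0f);

       // Allocate global memory on the current device.
       float* device = nullptr;
       hipMalloc(&device, count * sizeof(float));

       // Copy data from the host into the allocation; kernels can now read
       // and write it as ordinary global memory.
       hipMemcpy(device, host.data(), count * sizeof(float),
                 hipMemcpyHostToDevice);

       hipFree(device);
       return 0;
   }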

Constant memory
--------------------------------------------------------------------------------

Read-only storage visible to all threads on a given device. It is a limited
segment backed by device memory with queryable size. It needs to be set by the
host before kernel execution. Constant memory provides the best performance
benefit when all threads within a warp access the same address.
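
As an illustrative sketch, constant memory is declared with the
``__constant__`` qualifier and set by the host with
:cpp:func:`hipMemcpyToSymbol` before the kernel runs (the names below are
placeholders):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __constant__ float coefficients[16];  // lives in the constant segment

   __global__ void scale(float* data) {
       // All threads in a warp read the same address, which is the
       // best-performing access pattern for constant memory.
       data[threadIdx.x] *= coefficients[0];
   }

   int main() {
       // Set the constant segment from the host before kernel execution.
       float host_coefficients[16] = {2.0f};
       hipMemcpyToSymbol(HIP_SYMBOL(coefficients), host_coefficients,
                         sizeof(host_coefficients));

       float* data = nullptr;
       hipMalloc(&data, 16 * sizeof(float));
       scale<<<1, 16>>>(data);
       hipDeviceSynchronize();
       hipFree(data);
       return 0;
   }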

Texture memory
--------------------------------------------------------------------------------

Read-only storage visible to all threads on a given device and accessible
through additional APIs. It originates from graphics APIs and provides
performance benefits when accessing memory in a pattern where the addresses
are close to each other in a 2D representation of the memory.

The texture management module of the HIP runtime API contains the functions
for working with texture memory.
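
A sketch of creating and sampling a texture object (assuming a small 2D array
of ``float`` values; error handling is omitted for brevity):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   #include <vector>

   __global__ void sample(hipTextureObject_t tex, float* out, int width) {
       int x = threadIdx.x;
       int y = blockIdx.x;
       // Neighboring threads read neighboring 2D coordinates, which is the
       // access pattern texture memory is optimized for.
       out[y * width + x] = tex2D<float>(tex, x, y);
   }

   int main() {
       constexpr int width = 64, height = 64;
       std::vector<float> host(width * height, 1.0f);

       // Allocate a 2D array on the device and fill it from the host.
       hipChannelFormatDesc channel = hipCreateChannelDesc<float>();
       hipArray_t array = nullptr;
       hipMallocArray(&array, &channel, width, height);
       hipMemcpy2DToArray(array, 0, 0, host.data(), width * sizeof(float),
                          width * sizeof(float), height,
                          hipMemcpyHostToDevice);

       // Describe the resource and the sampling behavior.
       hipResourceDesc resource{};
       resource.resType = hipResourceTypeArray;
       resource.res.array.array = array;

       hipTextureDesc texture{};
       texture.addressMode[0] = hipAddressModeClamp;
       texture.addressMode[1] = hipAddressModeClamp;
       texture.filterMode = hipFilterModePoint;
       texture.readMode = hipReadModeElementType;

       hipTextureObject_t tex = 0;
       hipCreateTextureObject(&tex, &resource, &texture, nullptr);

       float* out = nullptr;
       hipMalloc(&out, width * height * sizeof(float));
       sample<<<height, width>>>(tex, out, width);
       hipDeviceSynchronize();

       hipDestroyTextureObject(tex);
       hipFreeArray(array);
       hipFree(out);
       return 0;
   }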

Surface memory
--------------------------------------------------------------------------------

A read-write version of texture memory, which can be useful for applications
that require direct manipulation of 1D, 2D, or 3D ``hipArray_t`` data.

The surface objects module of the HIP runtime API contains the functions to
create, destroy, read, and write surface memory.
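
As an illustrative sketch, a surface object is created from a ``hipArray_t``
much like a texture object, but it can also be written from kernels:

.. code-block:: cpp

   // Assumes `array` is a valid hipArray_t, e.g. created with hipMallocArray().
   hipResourceDesc resource{};
   resource.resType = hipResourceTypeArray;
   resource.res.array.array = array;

   hipSurfaceObject_t surface = 0;
   hipCreateSurfaceObject(&surface, &resource);

   // Inside a kernel the surface can be read and written; the x coordinate
   // is given in bytes:
   //
   //   float value;
   //   surf2Dread(&value, surface, x * sizeof(float), y);
   //   surf2Dwrite(2.0f * value, surface, x * sizeof(float), y);

   hipDestroySurfaceObject(surface);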

Host memory
================================================================================

The following example shows how to use pinned memory in HIP.
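
The sketch below is a minimal illustration of the pattern; error handling is
omitted for brevity.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       constexpr size_t count = 1 << 20;
       const size_t bytes = count * sizeof(float);

       // Allocate pinned (page-locked) host memory instead of using malloc().
       float* pinned = nullptr;
       hipHostMalloc(&pinned, bytes);

       float* device = nullptr;
       hipMalloc(&device, bytes);

       // Transfers from pinned host memory are typically faster than
       // transfers from pageable host memory.
       hipMemcpy(device, pinned, bytes, hipMemcpyHostToDevice);

       hipFree(device);
       hipHostFree(pinned);  // pinned allocations are freed with hipHostFree()
       return 0;
   }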
The pinned memory allocation can be modified with different flags, the details
of which are described at :ref:`memory_allocation_flags`.


.. _memory_allocation_flags:

Memory allocation flags of pinned memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following flags can be used to modify the behavior of the allocated pinned
host memory:

* ``hipHostMallocPortable``: The allocated memory is accessible from all GPUs,
  not just the one on which the allocation is made.
* ``hipHostMallocMapped``: Map the allocation into the address space of the
  current device; the device pointer can be obtained with
  :cpp:func:`hipHostGetDevicePointer`.
* ``hipHostMallocNumaUser``: The flag to allow the host memory allocation to
  follow the NUMA policy set by the user. The target of the NUMA policy is to
  select the CPU closest to each GPU. The NUMA distance is the measurement of
  how far apart a GPU and a CPU device are.
* ``hipHostMallocWriteCombined``: Allocates the memory as write-combined. On
  some system configurations, write-combined allocations may be transferred
  faster across the PCI Express bus; however, they can have low read
  efficiency on most CPUs. This is a good option for data transfer from host
  to device via mapped pinned memory.
* ``hipHostMallocCoherent``: Allocate fine-grained memory. Overrides the
  ``HIP_HOST_COHERENT`` environment variable for the specific allocation. For
  further details, check :ref:`coherency_controls`.
* ``hipHostMallocNonCoherent``: Allocate coarse-grained memory. Overrides the
  ``HIP_HOST_COHERENT`` environment variable for the specific allocation. For
  further details, check :ref:`coherency_controls`.

All allocation flags are independent and can be used in most combinations;
for instance, :cpp:func:`hipHostMalloc` can be called with both
``hipHostMallocPortable`` and ``hipHostMallocMapped`` flags set. Both usage
models described above use the same allocation flags, and the difference is in
how the surrounding code uses the host memory.
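
For example, a sketch of a portable, mapped allocation whose device pointer is
queried with :cpp:func:`hipHostGetDevicePointer` (error handling is omitted
for brevity):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       // Portable: accessible from all GPUs. Mapped: mapped into the address
       // space of the current device.
       float* host = nullptr;
       hipHostMalloc(&host, 1024 * sizeof(float),
                     hipHostMallocPortable | hipHostMallocMapped);

       // Obtain the device pointer of the mapped allocation; kernels can
       // access the host memory through this pointer.
       float* device = nullptr;
       hipHostGetDevicePointer(reinterpret_cast<void**>(&device), host, 0);

       // ... launch kernels that read and write through `device` ...

       hipHostFree(host);
       return 0;
   }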

NUMA-aware host memory allocation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The NUMA policy determines how memory is allocated.

.. note::

   By default, each GPU selects a NUMA CPU node that has the least NUMA
   distance between them; that is, host memory is automatically allocated on
   the memory pool of the NUMA node of the current GPU device. Using the
   :cpp:func:`hipSetDevice` API to switch to a different GPU still allows
   access to the host allocation, but it can have a longer NUMA distance.

   The NUMA policy is implemented on Linux and is under development on
   Microsoft Windows.

.. _coherency_controls:

Coherency controls
================================================================================

AMD GPUs can have two different types of memory coherence:

* **Coarse-grained coherence** means that memory is only considered up to date
  at kernel boundaries, which can be enforced through
  :cpp:func:`hipDeviceSynchronize`, :cpp:func:`hipStreamSynchronize`, or any
  blocking operation that acts on the null stream (e.g.
  :cpp:func:`hipMemcpy`). For example, cacheable memory is a type of
  coarse-grained memory where an up-to-date copy of the data can be stored
  elsewhere (e.g. in an L2 cache).
* **Fine-grained coherence** means the coherence is supported while a CPU/GPU
  kernel is running. This can be useful if both host and device are operating
  on the same data space using system-scope atomic operations (e.g. updating
  an error code or flag in a buffer). Fine-grained memory implies that
  up-to-date data may be made visible to others regardless of kernel
  boundaries as discussed above.

.. note::

   In order to achieve this fine-grained coherence, many AMD GPUs use a
   limited cache policy, such as leaving these allocations uncached by the
   GPU, or making them read-only.
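
As an illustrative sketch, fine-grained host memory lets a running kernel and
the host communicate through system-scope atomics before the kernel finishes
(this assumes the system provides fine-grained pinned memory):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   #include <cstdio>

   __global__ void producer(int* data, int* flag) {
       *data = 42;              // write the payload first
       __threadfence_system();  // make the write visible to the whole system
       atomicExch(flag, 1);     // then release the flag
   }

   int main() {
       // hipHostMallocCoherent requests fine-grained host memory.
       int* data = nullptr;
       int* flag = nullptr;
       hipHostMalloc(&data, sizeof(int), hipHostMallocCoherent);
       hipHostMalloc(&flag, sizeof(int), hipHostMallocCoherent);
       *data = 0;
       *flag = 0;

       producer<<<1, 1>>>(data, flag);

       // The host can observe the update while the kernel is still running;
       // the acquire load keeps the compiler from caching the value.
       while (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 0) {}
       std::printf("data = %d\n", *data);  // prints 42

       hipDeviceSynchronize();
       hipHostFree(data);
       hipHostFree(flag);
       return 0;
   }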

.. TODO: Is this still valid? What about Mi300?

Developers should use coarse-grained coherence where they can to reduce
host-device interconnect communication; additionally, the hardware-based
floating point instructions of MI200 accelerators work on coarse-grained
memory regions.

The availability of fine- and coarse-grained memory pools can be checked with
``rocminfo``.

.. list-table:: Memory coherence control
   :widths: 25, 35, 20, 20
   :header-rows: 1
   :align: center

   * - API
     - Flag
     - :cpp:func:`hipMemAdvise` call with argument
     - Coherence
   * - ``hipHostMalloc``
     - ``hipHostMallocDefault``
     -
     - Fine-grained
   * - ``hipHostMalloc``
     - ``hipHostMallocNonCoherent`` :sup:`1`
     -
     - Coarse-grained
   * - ``hipExtMallocWithFlags``
     - ``hipDeviceMallocDefault``
     -
     - Coarse-grained
   * - ``hipExtMallocWithFlags``
     - ``hipDeviceMallocFinegrained``
     -
     - Fine-grained
   * - ``hipMallocManaged``
     -
     -
     - Fine-grained
   * - ``hipMallocManaged``
     -
     - ``hipMemAdviseSetCoarseGrain``
     - Coarse-grained
   * - ``malloc``
     -
     -
     - Fine-grained
   * - ``malloc``
     -
     - ``hipMemAdviseSetCoarseGrain``
     - Coarse-grained

:sup:`1` The coherence mode of a :cpp:func:`hipHostMalloc` allocation can be
affected by the ``HIP_HOST_COHERENT`` environment variable if
``hipHostMallocCoherent=0``, ``hipHostMallocNonCoherent=0`` and
``hipHostMallocMapped=0`` are set, and one of the other flags is set to 1. In
this case, if ``HIP_HOST_COHERENT`` is not defined or is defined as 0, the
host memory allocation is coarse-grained.

.. note::

   * When ``hipHostMallocMapped=1`` is set, the allocated host memory is
     fine-grained and the ``hipHostMallocNonCoherent`` flag is ignored.
   * Setting both ``hipHostMallocCoherent=1`` and
     ``hipHostMallocNonCoherent=1`` is illegal.
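
For example, a sketch of switching a managed allocation to coarse-grained
coherence with :cpp:func:`hipMemAdvise`, as listed in the table above:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       int device = 0;
       hipSetDevice(device);

       float* managed = nullptr;
       const size_t bytes = 1024 * sizeof(float);
       hipMallocManaged(&managed, bytes);  // fine-grained by default

       // Advise the runtime to treat the range as coarse-grained: the memory
       // is then only guaranteed to be up to date at kernel boundaries.
       hipMemAdvise(managed, bytes, hipMemAdviseSetCoarseGrain, device);

       // ... launch kernels that use `managed` ...

       hipFree(managed);
       return 0;
   }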

Visibility of synchronization functions
--------------------------------------------------------------------------------

Fine-grained memory is visible at synchronization points; with coarse-grained
memory, visibility depends on the synchronization function used. The effects
and visibility of the synchronization functions on the different memory
coherence types are collected in the following table.

.. list-table:: HIP API

   * -
     - :cpp:func:`hipStreamSynchronize`
     - :cpp:func:`hipDeviceSynchronize`
     - :cpp:func:`hipEventSynchronize`
     - :cpp:func:`hipStreamWaitEvent`
   * - Fence
     - system-scope release
     - system-scope release
     - device-scope release
     - none
   * - Fine-grained host memory visibility
     - yes
     - yes
     - yes
     - yes
   * - Coarse-grained host memory visibility
     - yes
     - yes
     - depends on the event used
     - no

``hipEventSynchronize``
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Developers can control the release scope for HIP events:

* By default, the GPU performs a device-scope acquire and release operation
  with each recorded event. This makes host and device memory visible to
  other commands executing on the same device.
A stronger system-level fence can be specified when the event is created with
:cpp:func:`hipEventCreateWithFlags`:

* :cpp:func:`hipEventReleaseToSystem`: Perform a system-scope release operation
  when the event is recorded. This will make **both fine-grained and
  coarse-grained host memory visible to other agents in the system**, but may
  involve heavyweight operations such as cache flushing. Fine-grained memory
  will typically use lighter-weight in-kernel synchronization mechanisms, such
  as an atomic operation, and thus does not need to use
  :cpp:func:`hipEventReleaseToSystem`.
* :cpp:func:`hipEventDisableTiming`: Events created with this flag will not
  record profiling data, and provide the best performance if used for
  synchronization.
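
As a sketch, an event that performs a system-scope release and skips timing
can be created and used like this:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       hipStream_t stream;
       hipStreamCreate(&stream);

       // hipEventReleaseToSystem: system-scope release when the event is
       // recorded. hipEventDisableTiming: no profiling data, which gives the
       // best performance for synchronization-only events.
       hipEvent_t event;
       hipEventCreateWithFlags(&event,
                               hipEventReleaseToSystem | hipEventDisableTiming);

       // ... enqueue kernels on `stream` that write coarse-grained memory ...

       hipEventRecord(event, stream);
       hipEventSynchronize(event);  // the written memory is now visible
                                    // system-wide

       hipEventDestroy(event);
       hipStreamDestroy(stream);
       return 0;
   }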