Memory Usage in Kernel

This page collects findings and problems we have encountered with memory usage in OpenCL kernels.

NOTE: On Windows, an out-of-resources error can occur easily when an NVIDIA device is used, because TDR (Timeout Detection and Recovery, controlled by the TdrLevel registry value) is enabled. Turning it off fixes the problem.

Local memory :

On a CPU device, local memory is regular RAM, the same as global memory.
On a GPU device, it is a very fast, on-chip, software-controllable cache.

To find out the device's local memory size:

import pyopencl as cl
from pyopencl import device_info as di
# dev is the target cl.Device instance.
local_memory_size = dev.get_info(di.LOCAL_MEM_SIZE)
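
The CPU/GPU difference described above also shows up in the device's `LOCAL_MEM_TYPE` property, which can be queried alongside the size. The following is a minimal sketch, assuming pyopencl is installed and at least one OpenCL platform is present. `format_size` and `local_memory_report` are hypothetical helper names used only for illustration; they are not part of pyopencl.

```python
def format_size(num_bytes):
    """Format a byte count like the device tables on this page (e.g. 32.00KB)."""
    if num_bytes < 1024:
        return f"{num_bytes}B"
    value = float(num_bytes)
    for unit in ("KB", "MB", "GB"):
        value /= 1024.0
        if value < 1024.0:
            break
    return f"{value:.2f}{unit}"


def local_memory_report():
    """Return one text line per OpenCL device: name, local memory size, type."""
    # Imported lazily so that format_size stays usable without pyopencl.
    import pyopencl as cl

    lines = []
    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            mem_type = dev.local_mem_type
            if mem_type == cl.device_local_mem_type.LOCAL:
                kind = "LOCAL"    # dedicated local memory
            elif mem_type == cl.device_local_mem_type.GLOBAL:
                kind = "GLOBAL"   # local memory is carved out of regular RAM
            else:
                kind = "OTHER"    # anything else the driver reports
            size = format_size(dev.local_mem_size)
            lines.append(f"{dev.name}: {size} ({kind})")
    return lines
```

Calling `print("\n".join(local_memory_report()))` on a machine with OpenCL drivers should list each device with its size and type, comparable to the "Local memory size" and "Local memory type" rows in the device tables on this page.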

Memory caching implementation on Intel architecture

How to allocate local memory ?

Two ways

a) In-kernel allocation, e.g.

#define LM_SIZE 1024
__kernel void test_1(...)
{
  __local int localArray[LM_SIZE];
}
__kernel void test_2(...)
{
  __local int localArray[1024];
}

b) Host-side allocation, e.g. allocating 32 KB of local memory.
- Python
```
prog.test_input_local(queue, global_work_items, local_work_items, cl.LocalMemory(32*1024)).wait()
```
- Kernel
```
__kernel void test_input_local(local int* localArray)
{}
```
NOTE: With host-side allocation, the amount of local memory used cannot be determined when the kernel is compiled, so an out-of-resources error may only show up at run time.
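
Since a host-side allocation is not visible to the compiler, one option is to check the request on the host before launching the kernel. The sketch below is one possible approach, not a guaranteed safeguard: drivers may reserve some local memory for themselves, so a request that passes this check can still fail. `local_bytes_needed`, `check_local_allocation` and `launch_test_input_local` are hypothetical helper names; the pyopencl calls assume a program `prog` that contains the `test_input_local` kernel shown above.

```python
import ctypes


def local_bytes_needed(num_elements, elem_size=ctypes.sizeof(ctypes.c_int32)):
    """Bytes required for a local array of num_elements items (default: int)."""
    return int(num_elements) * int(elem_size)


def check_local_allocation(requested_bytes, device_limit_bytes, static_bytes=0):
    """Raise ValueError if the dynamic plus static local memory exceeds the limit.

    static_bytes is the local memory the kernel already declares itself
    (in-kernel __local arrays). Returns the total on success.
    """
    total = requested_bytes + static_bytes
    if total > device_limit_bytes:
        raise ValueError(
            f"local memory request of {total} bytes exceeds "
            f"device limit of {device_limit_bytes} bytes"
        )
    return total


def launch_test_input_local(prog, queue, dev, global_size, local_size, num_ints):
    """Check the local memory request, then launch test_input_local."""
    import pyopencl as cl  # imported lazily; the helpers above do not need it

    knl = cl.Kernel(prog, "test_input_local")
    # Local memory the kernel uses on its own, as reported by the driver.
    static_bytes = knl.get_work_group_info(
        cl.kernel_work_group_info.LOCAL_MEM_SIZE, dev
    )
    nbytes = local_bytes_needed(num_ints)
    check_local_allocation(nbytes, dev.local_mem_size, static_bytes)
    knl(queue, global_size, local_size, cl.LocalMemory(nbytes)).wait()
```

For example, 8192 `int` elements need 32 KB, which is exactly the local memory size reported by the CPU devices in the tables on this page, so any in-kernel `__local` array on top of that would push the total over the limit.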

Optimization for local memory ?

Intel devices

Private memory :

TBD

When is private memory used ?

Private memory is used when a) register spilling happens, or b) a private array is used.
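
As an illustration of case b), the sketch below embeds a small kernel that declares a private array and asks the driver how much private memory the kernel uses per work-item. It is a hedged example, not a measurement from this page: the kernel `private_array_demo` and the helper `private_mem_bytes` are made up for illustration, and the value returned by `PRIVATE_MEM_SIZE` is implementation-dependent, so different drivers may report different numbers for the same kernel.

```python
KERNEL_SRC = """
__kernel void private_array_demo(__global int* out)
{
    /* An array declared inside the kernel without an address-space
       qualifier is private: every work-item gets its own copy. */
    int scratch[64];
    int gid = get_global_id(0);
    for (int i = 0; i < 64; ++i) {
        scratch[i] = gid + i;
    }
    /* A data-dependent index makes it harder for the compiler to keep
       the whole array in registers. */
    out[gid] = scratch[gid % 64];
}
"""


def private_mem_bytes(ctx, dev, source=KERNEL_SRC, name="private_array_demo"):
    """Build the kernel and return the private memory size reported for dev."""
    import pyopencl as cl  # imported lazily; KERNEL_SRC is usable without it

    prog = cl.Program(ctx, source).build()
    knl = cl.Kernel(prog, name)
    return knl.get_work_group_info(
        cl.kernel_work_group_info.PRIVATE_MEM_SIZE, dev
    )
```

Comparing the reported value for different array sizes on one device gives a rough idea of when the array stops fitting in registers on that device.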

TBD

Example

TBD

Tested Device Information 1

   Device name                             : Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
   Device type                             :                                      CPU
   Device version                          :                    OpenCL 2.1 (Build 18)
   Device Profile                          :                             FULL_PROFILE
=====================================================================================
   Global memory cache line size           :                                      64B
   Global memory cache size                :                                 256.00KB
   Global memory cache type                :                         READ_WRITE_CACHE
   Global memory size                      :                                   7.71GB
   Max preferred size of global variables  :                                  64.00KB
   Local memory size                       :                                  32.00KB
   Local memory type                       :                                   GLOBAL
   Max constant arguments count            :                                      480
   Max size of a constant buffer           :                                 128.00KB
   Max global variable size                :                                  64.00KB
   Max size of memory object allocation    :                                   1.93GB
   Max parameter size                      :                                   3.75KB
   Max pipe objects                        :                                       16
   max work group size                     :                                     8192
   max work item dimensions                :                                        3
   max work item size                      :                       [8192, 8192, 8192]
   base address align                      :                                     1024
   Local memory size                       :                                  32.00KB
   The max size of the device queue        :                                   4.00GB
   The size of the device queue            :                                   4.00GB
   Compute Units                           :                                        4

Tested Device Information 2

   Device name                             :                         GeForce GTX 950M
   Device type                             :                                      GPU
   Device version                          :                          OpenCL 1.2 CUDA
   Device Profile                          :                             FULL_PROFILE
=====================================================================================
   Global memory cache line size           :                                     128B
   Global memory cache size                :                                  80.00KB
   Global memory cache type                :                         READ_WRITE_CACHE
   Global memory size                      :                                   2.00GB
   Max preferred size of global variables  :                  Not available (version)
   Local memory size                       :                                  48.00KB
   Local memory type                       :                                    LOCAL
   Max constant arguments count            :                                        9
   Max size of a constant buffer           :                                  64.00KB
   Max global variable size                :                  Not available (version)
   Max size of memory object allocation    :                                 512.00MB
   Max parameter size                      :                                   4.25KB
   Max pipe objects                        :                  Not available (version)
   max work group size                     :                                     1024
   max work item dimensions                :                                        3
   max work item size                      :                         [1024, 1024, 64]
   base address align                      :                                     4096
   Local memory size                       :                                  48.00KB
   The max size of the device queue        :                  Not available (version)
   The size of the device queue            :                  Not available (version)
   Device command-queue properties         :                  Not available (version)
   Host command-queue properties           :            OUT_OF_ORDER_EXEC_MODE_ENABLE
                                           :                         PROFILING_ENABLE

Tested Device Information 3

   Device name                             :                 Intel(R) HD Graphics 530
   Device type                             :                                      GPU
   Device version                          :                              OpenCL 2.0
   Device Profile                          :                             FULL_PROFILE
=====================================================================================
   Global memory cache line size           :                                      64B
   Global memory cache size                :                                 512.00KB
   Global memory cache type                :                         READ_WRITE_CACHE
   Global memory size                      :                                   3.15GB
   Max preferred size of global variables  :                                   2.00GB
   Local memory size                       :                                  64.00KB
   Local memory type                       :                                    LOCAL
   Max constant arguments count            :                                        8
   Max size of a constant buffer           :                                   2.00GB
   Max global variable size                :                                  64.00KB
   Max size of memory object allocation    :                                   2.00GB
   Max parameter size                      :                                   1.00KB
   Max pipe objects                        :                                       1
   max work group size                     :                                      256
   max work item dimensions                :                                        3
   max work item size                      :                          [256, 256, 256]
   base address align                      :                                     1024
   Local memory size                       :                                  64.00KB
   The max size of the device queue        :                                  64.00MB
   The size of the device queue            :                                 128.00KB
   Device command-queue properties         :            OUT_OF_ORDER_EXEC_MODE_ENABLE
                                           :                         PROFILING_ENABLE
   Host command-queue properties           :            OUT_OF_ORDER_EXEC_MODE_ENABLE
                                           :                         PROFILING_ENABLE

Tested Device Information 4

   Device name                             : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
   Device type                             :                                      CPU
   Device version                          :                   OpenCL 2.0 (Build 359)
   Device Profile                          :                             FULL_PROFILE
=====================================================================================
   Global memory cache line size           :                                      64B
   Global memory cache size                :                                 256.00KB
   Global memory cache type                :                         READ_WRITE_CACHE
   Global memory size                      :                                   7.89GB
   Max preferred size of global variables  :                                  64.00KB
   Local memory size                       :                                  32.00KB
   Local memory type                       :                                   GLOBAL
   Max constant arguments count            :                                      480
   Max size of a constant buffer           :                                 128.00KB
   Max global variable size                :                                  64.00KB
   Max size of memory object allocation    :                                   1.97GB
   Max parameter size                      :                                   3.75KB
   Max pipe objects                        :                                       16
   max work group size                     :                                     8192
   max work item dimensions                :                                        3
   max work item size                      :                       [8192, 8192, 8192]
   base address align                      :                                     1024
   Local memory size                       :                                  32.00KB
   The max size of the device queue        :                                   4.00GB
   The size of the device queue            :                                   4.00GB
   Device command-queue properties         :            OUT_OF_ORDER_EXEC_MODE_ENABLE
                                           :                         PROFILING_ENABLE
   Host command-queue properties           :            OUT_OF_ORDER_EXEC_MODE_ENABLE
                                           :                         PROFILING_ENABLE
