diff --git a/ext/cl_exp_tensor.asciidoc b/ext/cl_exp_tensor.asciidoc
new file mode 100644
index 000000000..5f8ac60b3
--- /dev/null
+++ b/ext/cl_exp_tensor.asciidoc
@@ -0,0 +1,811 @@
// Copyright 2023 The Khronos Group. This work is licensed under a
// Creative Commons Attribution 4.0 International License; see
// http://creativecommons.org/licenses/by/4.0/

= cl_exp_tensor

:source-highlighter: coderay

[[cl_exp_tensor]]
== Tensor Data Type

This extension provides a new opaque OpenCL datatype called
`cl_tensor`. It is used for storing N-dimensional tensor data in an
implementation-defined memory layout which may be optimized based on
the tensor's use cases. The datatype is designed to be used
efficiently with the `cl_khr_command_buffer` extension to capture
task graphs which can use tensors as input, output and temporary
storage.

=== General information

==== Name Strings

`cl_exp_tensor`

==== Version history

[cols="1,1,3",options="header",]
|====
| *Date* | *Version* | *Description*
| 2023-11-XX | 0.1.0 | First assigned version.
|====

==== Dependencies

This extension is written against the OpenCL Specification version 3.0.14.

This extension requires OpenCL 1.2 or later.

==== Contributors

Henry Linjamäki, Intel.

Pekka Jääskeläinen, Intel and Tampere University.

Ben Ashbaugh, Intel.


=== Overview

The new tensor object enables applications to describe N-dimensional
arrays whose memory layout is opaque to applications. The goals
of this extension are the following:

* Give implementations the freedom to place the tensors' data so as
  to improve the performance of the kernels which use them. The
  extension is designed so that implementations can determine optimal
  memory layouts for the tensors based on their use cases, for
  example by analyzing kernels' access patterns or, in the case of
  built-in kernels, by inspecting the tensor arguments they operate
  on.

* Reduce the details and boilerplate needed for performance-portable
  applications by making them less dependent on platform or device
  specifics of the memory layouts / data arrangements which matter
  for performance. Such specifics may include:

** alignment of data (e.g. for avoiding misaligned memory accesses)

** arrangement of data required by kernels (column-major vs row-major
   for matrix multiplication, NHWC vs NCHW for neural network
   convolution)

** arrangement of the data into tiles (or "packing") for improving
   cache and TLB hit rates

** arrangement of data into specific tiles in order to exploit complex
   HW operations such as matrix multiplications (Intel AMX, AMD matrix
   cores)

** arrangement of data into rows separated by a stride in order to
   avoid bank conflicts in GPUs.

The tensor data type is designed to be used efficiently together with
command buffers (cl_khr_command_buffer) and built-in kernels,
including kernels to be provided by the Defined Built-in Kernels
(cl_khr_defined_builtin_kernels) extension that is being prepared
together with this extension.

=== Modifications to OpenCL

==== New Section: 5.x Tensor Objects

A tensor object stores an N-dimensional array of elements. The memory
layout of the tensor is opaque to the application. When a tensor
object is created, it is initially not associated with any storage for
the tensor elements. Storage is bound to a tensor by creating a memory
buffer with CL_MEM_BIND_TO_TENSOR. Tensor objects without storage can
be set as kernel arguments for kernels which accept them. Tensors
passed as kernel arguments must have storage assigned to them prior to
enqueuing the kernels for execution.
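As a non-normative illustration, the following sketch shows the
intended flow using the API introduced in the rest of this section.
The `ctx` context and the `kernel` taking a tensor parameter are
assumed to exist, and the encoding of the tensor handle as a buffer
property value is an assumption of this sketch:

[source,c]
----
// Non-normative sketch: create a 2-D tensor, bind storage to it, and
// pass it to a kernel that declares a tensor parameter.
cl_int err;
size_t shape[2] = {128, 256};
cl_tensor t = clCreateTensor(ctx, NULL, 2, shape, CL_TENSOR_FLOAT, &err);

// Bind storage. The size argument must be zero for CL_MEM_BIND_TO_TENSOR;
// the implementation chooses the actual allocation size and layout.
cl_mem_properties props[] = {
  CL_MEM_BIND_TO_TENSOR, (cl_mem_properties)(uintptr_t)t, 0};
cl_mem t_mem = clCreateBufferWithProperties(
  ctx, props, CL_MEM_READ_WRITE, 0, NULL, &err);

// The tensor handle itself is passed as the kernel argument.
clSetKernelArg(kernel, 0, sizeof(cl_tensor), &t);
----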
==== New OpenCL Functions added to Tensor Objects section

To create a tensor, use:

[source,c]
----
cl_tensor clCreateTensor(
    cl_context context,
    const cl_tensor_properties *properties,
    size_t rank,
    const size_t* shape,
    cl_tensor_datatype dtype,
    cl_int *errcode_ret);
----

* _context_ is a valid OpenCL context used to create the tensor object.

* _properties_ is an optional list of properties for the tensor object
  and their corresponding values. The list is terminated with the
  special property 0. If no properties are required, _properties_ may
  be NULL. This extension does not define any optional properties for
  tensors.

* _rank_ is the number of dimensions. A zero value creates a "scalar"
  tensor which has no dimensions but has storage for one element.

* _shape_ is a list of the sizes of the dimensions. The length of the
  list must be _rank_ elements. _shape_ can be NULL if the _rank_
  value is zero. The first _rank_ values in the list must all be
  non-zero.

* _dtype_ is the element type of the tensor. Refer to the
  <<TensorDtypes>> table for the supported types.

* _errcode_ret_ may return an appropriate error code. If _errcode_ret_
  is NULL, no error code is returned.

The *clCreateTensor* function creates a _rank_-dimensional tensor with
`shape[0] * shape[1] * ... * shape[rank-1]` elements of _dtype_
type. At creation time the tensor does not have storage. Storage is
assigned to the tensor by calling clCreateBufferWithProperties() with
CL_MEM_BIND_TO_TENSOR.

A tensor referred to by a command must be bound to a valid buffer
object before the command is enqueued or recorded.

*clCreateTensor* returns a valid non-zero tensor object and
_errcode_ret_ is set to CL_SUCCESS if the tensor object is created
successfully. Otherwise, it returns a NULL value with one of the
following error values returned in _errcode_ret_:

* CL_INVALID_CONTEXT if _context_ is not a valid context.

* CL_INVALID_PROPERTY if a property name in _properties_ is not a
  supported property name, if the value specified for a supported
  property name is not valid, or if the same property name is
  specified more than once.

* CL_INVALID_VALUE if the value specified in _dtype_ is invalid.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.

.Tensor element types. The API type indicates the corresponding type for copying elements from a host allocation / buffer object to a tensor or vice versa.
[cols="1,1,1",stripes=even]
[#TensorDtypes]
|===
| *Tensor element data type* | *Description* | *API type*

| CL_TENSOR_BOOL | 1-bit signless integer. |
cl_uchar. footnote:[Only the least significant bit is considered when
writing data to a tensor. When reading data from a tensor, the boolean
value is written as 0 or 1. The boolean values in the tensor may be
packed densely.]
| CL_TENSOR_INT8 | 8-bit signed integer. | cl_char.
| CL_TENSOR_INT16 | 16-bit signed integer. | cl_short.
| CL_TENSOR_INT32 | 32-bit signed integer. | cl_int.
| CL_TENSOR_INT64 | 64-bit signed integer. | cl_long.
| CL_TENSOR_UINT8 | 8-bit unsigned integer. | cl_uchar.
| CL_TENSOR_UINT16 | 16-bit unsigned integer. | cl_ushort.
| CL_TENSOR_UINT32 | 32-bit unsigned integer. | cl_uint.
| CL_TENSOR_UINT64 | 64-bit unsigned integer. | cl_ulong.
| CL_TENSOR_HALF | Half precision floating-point. | cl_half.
| CL_TENSOR_BFLOAT16 | 16-bit brain floating-point. | cl_ushort.
| CL_TENSOR_FLOAT | Single precision floating-point. | cl_float.
| CL_TENSOR_DOUBLE | Double precision floating-point. | cl_double.
| CL_TENSOR_COMPLEX64 | 64-bit complex floating-point with
  32-bit real and imaginary parts. | cl_float2.
| CL_TENSOR_COMPLEX128 | 128-bit complex floating-point with
  64-bit real and imaginary parts. | cl_double2.
|===

To retain a tensor object, call the function

[source,c]
----
cl_int clRetainTensorObject(cl_tensor tensor);
----

* _tensor_ is the tensor object to be retained.

The _tensor_ reference count is incremented.

*clRetainTensorObject* returns CL_SUCCESS if the function is executed
successfully. Otherwise, it returns one of the following errors:

* CL_INVALID_TENSOR if _tensor_ is not a valid tensor object.

To release a tensor object, call the function

[source,c]
----
cl_int clReleaseTensorObject(cl_tensor tensor);
----

* _tensor_ is the tensor object to be released.

The _tensor_ reference count is decremented.

The tensor object is deleted once the number of instances that are
retained to _tensor_ becomes zero and the tensor object is no longer
needed by any enqueued or recorded commands that use _tensor_. Using
this function to release a reference that was not obtained by creating
the object or by calling *clRetainTensorObject* causes undefined
behavior.

*clReleaseTensorObject* returns CL_SUCCESS if the function is executed
successfully. Otherwise, it returns one of the following errors:

* CL_INVALID_TENSOR if _tensor_ is not a valid tensor object.

// TODO: add clSetTensorObjectDestructorCallback?

To return information about a tensor object, call the function

[source,c]
----
cl_int clGetTensorInfo(
    cl_tensor tensor,
    cl_tensor_info param_name,
    size_t param_value_size,
    void* param_value,
    size_t* param_value_size_ret);
----

* _tensor_ specifies the tensor object being queried.

* _param_name_ specifies the information to query. The list of
  supported _param_name_ values and the information returned in
  _param_value_ by clGetTensorInfo is described in the
  <<TensorObjectQueries>> table.

* _param_value_ is a pointer to memory where the appropriate result
  being queried is returned. If _param_value_ is NULL, it is ignored.

* _param_value_size_ is used to specify the size in bytes of memory
  pointed to by _param_value_. This size must be ≥ the size of the
  return type as described in the <<TensorObjectQueries>> table.

* _param_value_size_ret_ returns the actual size in bytes of data
  being queried by _param_name_. If _param_value_size_ret_ is NULL, it
  is ignored.

*clGetTensorInfo* returns CL_SUCCESS if the function is executed
successfully. Otherwise, it returns one of the following errors:

* CL_INVALID_TENSOR if _tensor_ is not a valid tensor object.

[#TensorObjectQueries]
.List of supported param_names by clGetTensorInfo
[cols="2,1,2",stripes=odd]
|===
| CL_TENSOR_RANK | size_t | Return the tensor rank.
| CL_TENSOR_SHAPE | size_t[] | Return the tensor shape.
| CL_TENSOR_DTYPE | cl_tensor_datatype | Return the tensor data type.

| CL_TENSOR_BOUND_TO_BUFFER | cl_bool | Return true if the tensor is
bound to a buffer.

| CL_TENSOR_BUFFER | cl_mem a| If CL_TENSOR_BOUND_TO_BUFFER is true,
return the buffer object the tensor is bound to. Otherwise, the
clGetTensorInfo call returns:

* CL_INVALID_MEM_OBJECT if the tensor is not bound to a buffer object.

* CL_INVALID_PROPERTY otherwise.

| CL_TENSOR_CONTEXT | cl_context | Return the context specified when
the tensor object was created.

| CL_TENSOR_REFERENCE_COUNT | cl_uint | Return the tensor reference
count.
|===
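As a non-normative illustration, the rank and then the shape of a
tensor could be queried as follows (error handling and includes
omitted; `tensor` is assumed to be a valid handle):

[source,c]
----
// Non-normative sketch: query a tensor's rank, then its shape.
size_t rank = 0;
clGetTensorInfo(tensor, CL_TENSOR_RANK, sizeof(rank), &rank, NULL);

size_t *shape = (size_t *)malloc(rank * sizeof(size_t));
clGetTensorInfo(tensor, CL_TENSOR_SHAPE, rank * sizeof(size_t), shape, NULL);
/* ... use rank and shape ... */
free(shape);
----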
The following functions read data from a tensor into host memory or a
buffer object, or write data to a tensor object from host memory or a
buffer object.

[source,c]
----
cl_int clEnqueueImportFromTensor(
    cl_command_queue command_queue,
    cl_tensor tensor,
    cl_bool blocking_command,
    const size_t* tensor_origin,
    const size_t* mem_origin,
    const size_t* region,
    const size_t* mem_pitch,
    cl_mem buffer,
    void* host_ptr,
    cl_uint num_events_in_wait_list,
    const cl_event* event_wait_list,
    cl_event* event);
----

[source,c]
----
cl_int clEnqueueExportToTensor(
    cl_command_queue command_queue,
    cl_tensor tensor,
    cl_bool blocking_command,
    const size_t* tensor_origin,
    const size_t* mem_origin,
    const size_t* region,
    const size_t* mem_pitch,
    cl_mem buffer,
    const void* host_ptr,
    cl_uint num_events_in_wait_list,
    const cl_event* event_wait_list,
    cl_event* event);
----

* _command_queue_ is a valid host command-queue in which the read /
  write command will be queued. _command_queue_ and _tensor_ must be
  created with the same OpenCL context.

* _tensor_ refers to a valid tensor object which is bound to a buffer.

* _blocking_command_ indicates whether the read and write operations
  are blocking or non-blocking (see below).

* _tensor_origin_ defines the offset coordinates in _tensor_ for the
  start of the region to read / write. The length of the array must
  be at least the rank of the _tensor_.

* _mem_origin_ defines the offset coordinates in the memory region
  pointed to by _buffer_ or _host_ptr_, expressed in elements of the
  _tensor_ data type. The length of the array must be at least the
  rank of the _tensor_.

* _region_ defines the region being read or written, expressed in
  elements of the _tensor_ data type. The length of the array must be
  at least the rank of the _tensor_. If _region_ is NULL then the
  _tensor_'s shape is used as the region.

* _mem_pitch_ defines the length of each dimension, in elements, to
  be used for the memory region of _buffer_ or _host_ptr_. The length
  of the array must be at least the rank of _tensor_ minus one. If
  _mem_pitch_ is NULL or _mem_pitch_[i] is zero, _mem_pitch_[i] is
  computed as _region_[i + 1].

* _buffer_ and _host_ptr_ refer to a valid buffer object / host
  allocation where data is to be read into or to be written from.
  Exactly one of _buffer_ and _host_ptr_ must be non-NULL; the
  non-NULL argument is used as the operand for the operation.

* _event_wait_list_ and _num_events_in_wait_list_ specify events that
  need to complete before this particular command can be executed. If
  _event_wait_list_ is NULL, then this particular command does not
  wait on any event to complete. If _event_wait_list_ is NULL,
  _num_events_in_wait_list_ must be 0. If _event_wait_list_ is not
  NULL, the list of events pointed to by _event_wait_list_ must be
  valid and _num_events_in_wait_list_ must be greater than 0. The
  events specified in _event_wait_list_ act as synchronization
  points. The context associated with events in _event_wait_list_ and
  _command_queue_ must be the same. The memory associated with
  _event_wait_list_ can be reused or freed after the function returns.
* _event_ returns an event object that identifies this read / write
  command and can be used to query or queue a wait for this command to
  complete. If _event_ is NULL or the enqueue is unsuccessful, no
  event will be created and therefore it will not be possible to query
  the status of this command or to wait for this command to
  complete. If _event_wait_list_ and _event_ are not NULL, _event_
  must not refer to an element of the _event_wait_list_ array.

The *clEnqueueExportToTensor* function copies the contents of the
buffer object / host allocation to the tensor's storage in an
implementation-defined, opaque memory layout. The
*clEnqueueImportFromTensor* function copies data from the tensor's
storage to the buffer object / host allocation.

The elements of the buffer object / host allocation are mapped to
tensor coordinates, and vice versa, as follows in pseudo C code:

[source,c]
----
tensor_element(
  tensor_origin[0] + i[0],
  tensor_origin[1] + i[1],
  ...,
  tensor_origin[N-2] + i[N-2],
  tensor_origin[N-1] + i[N-1]) ==
((TENSOR_DATATYPE *)buffer_or_host_ptr)[
  (mem_origin[0] + i[0]) * pitch(0) +
  (mem_origin[1] + i[1]) * pitch(1) +
  ... +
  (mem_origin[N-2] + i[N-2]) * pitch(N-2) +
  (mem_origin[N-1] + i[N-1])];
----

Where `N` is the tensor rank, `i[X]` is a tensor coordinate in the
inclusive range `0..region[X]-1`, and `pitch` is computed as follows
in pseudo C code:

[source,c]
----
size_t pitch(size_t dim) {
  size_t pitch = 1;
  for (size_t i = dim; i < tensor_rank - 1; i++)
    pitch *=
      (mem_pitch != NULL && mem_pitch[i] != 0) ? mem_pitch[i] : region[i + 1];
  return pitch;
}
----

For `dim` in `0..(tensor_rank - 1)`. The `tensor_element()` function
represents an abstract accessor of a tensor element in its storage at
the given coordinates. How the coordinates translate to tensor storage
addresses is unspecified.
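For example, the following non-normative sketch exports a 4×8 region
from a host array whose rows are padded to 16 elements, using
_mem_pitch_ to describe the padded row length. The command queue
`cmd_q` and the rank-2 CL_TENSOR_FLOAT tensor `t`, already bound to a
buffer, are assumed to exist:

[source,c]
----
// Non-normative sketch: the first 8 columns of each padded 16-element
// host row map to tensor elements (row, col) per the formula above,
// since pitch(0) == mem_pitch[0] == 16 and pitch(1) == 1.
float host_data[4][16] = {{0}};      // only columns 0..7 are exported
size_t tensor_origin[2] = {0, 0};
size_t mem_origin[2]    = {0, 0};
size_t region[2]        = {4, 8};
size_t mem_pitch[1]     = {16};      // row length of the host layout
                                     // (defaults to region[1] = 8 if zero)

clEnqueueExportToTensor(cmd_q, t, CL_TRUE, tensor_origin, mem_origin,
                        region, mem_pitch, NULL, host_data,
                        0, NULL, NULL);
----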
*clEnqueueImportFromTensor* and *clEnqueueExportToTensor* return
CL_SUCCESS if the function is executed successfully. Otherwise, they
return one of the following errors:

* CL_INVALID_COMMAND_QUEUE if _command_queue_ is not a valid host
  command-queue.

* CL_INVALID_CONTEXT if the context associated with _command_queue_
  and _buffer_ are not the same or if the context associated with
  _command_queue_ and events in _event_wait_list_ are not the same.

* CL_INVALID_MEM_OBJECT if _buffer_ is not a valid buffer object.

* CL_INVALID_VALUE if _tensor_origin_ or _mem_origin_ is NULL.

* CL_INVALID_VALUE if the region being read or written specified by
  (_mem_origin_, _region_, _mem_pitch_) is out of bounds.

* CL_INVALID_VALUE if any _region_ array element is 0.

* CL_INVALID_VALUE if _mem_pitch_ is not NULL and _mem_pitch_[i] is
  not 0 and _mem_pitch_[i] is less than _region_[i].

* CL_INVALID_VALUE if _buffer_ and _host_ptr_ are both NULL or both
  non-NULL.

* CL_INVALID_EVENT_WAIT_LIST if _event_wait_list_ is NULL and
  _num_events_in_wait_list_ > 0, or _event_wait_list_ is not NULL and
  _num_events_in_wait_list_ is 0, or if event objects in
  _event_wait_list_ are not valid events.

* CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST if the read and write
  operations are blocking and the execution status of any of the
  events in _event_wait_list_ is a negative integer value.

* CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
  memory for the data store associated with the memory object the
  _tensor_ is bound to.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources
  required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.

// TODO: add clEnqueueCopyTensor

// TODO: add clEnqueueFillTensor?

If *cl_khr_command_buffer* is supported, then the following
command-buffer counterparts of the *clEnqueueImportFromTensor* and
*clEnqueueExportToTensor* commands are available.

[source,c]
----
cl_int clCommandImportFromTensorKHR(
    cl_command_buffer_khr command_buffer,
    cl_command_queue command_queue,
    cl_tensor tensor,
    const size_t* tensor_origin,
    const size_t* mem_origin,
    const size_t* region,
    const size_t* mem_pitch,
    cl_mem buffer,
    void* host_ptr,
    cl_uint num_sync_points_in_wait_list,
    const cl_sync_point_khr* sync_point_wait_list,
    cl_sync_point_khr* sync_point,
    cl_mutable_command_khr* mutable_handle);
----

[source,c]
----
cl_int clCommandExportToTensorKHR(
    cl_command_buffer_khr command_buffer,
    cl_command_queue command_queue,
    cl_tensor tensor,
    const size_t* tensor_origin,
    const size_t* mem_origin,
    const size_t* region,
    const size_t* mem_pitch,
    cl_mem buffer,
    const void* host_ptr,
    cl_uint num_sync_points_in_wait_list,
    const cl_sync_point_khr* sync_point_wait_list,
    cl_sync_point_khr* sync_point,
    cl_mutable_command_khr* mutable_handle);
----

* _command_buffer_ refers to a valid command-buffer object.

* For the _command_queue_, _tensor_, _tensor_origin_, _mem_origin_,
  _region_, _mem_pitch_, _buffer_ and _host_ptr_ parameters, refer to
  *clEnqueueImportFromTensor*.

* For the _num_sync_points_in_wait_list_, _sync_point_wait_list_,
  _sync_point_ and _mutable_handle_ parameters, refer to
  *clCommandCopyBufferKHR*.

*clCommandImportFromTensorKHR* and *clCommandExportToTensorKHR* return
CL_SUCCESS if the function is executed successfully. Otherwise, they
return one of the following errors:

* CL_INVALID_COMMAND_QUEUE if _command_queue_ is not NULL.

* CL_INVALID_COMMAND_BUFFER_KHR if _command_buffer_ is not a valid
  command-buffer.

* CL_INVALID_CONTEXT if the context associated with _command_queue_
  and _command_buffer_ is not the same.

* CL_INVALID_OPERATION if _command_buffer_ has been finalized.

* CL_INVALID_VALUE if _mutable_handle_ is not NULL.

* CL_INVALID_SYNC_POINT_WAIT_LIST_KHR if _sync_point_wait_list_ is
  NULL and _num_sync_points_in_wait_list_ is > 0, or
  _sync_point_wait_list_ is not NULL and _num_sync_points_in_wait_list_
  is 0, or if synchronization-point objects in _sync_point_wait_list_
  are not valid synchronization-points.

* CL_OUT_OF_RESOURCES if there is a failure to allocate resources
  required by the OpenCL implementation on the device.

* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources
  required by the OpenCL implementation on the host.

==== Add New Buffer Property in Section 5.2.1

[cols="2,1,2",stripes=odd]
|===
| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool
a| This property can be set if the *cl_khr_command_buffer* extension
is supported.

NOTE: This property temporarily lives here and will be moved to
a separate extension proposal.

If the value is true, create a "temporary" buffer object that can only
be used by commands recorded in command buffers. Non-recording
command enqueue functions must return CL_INVALID_OPERATION if the
command refers to a temporary buffer object.
The temporary buffer objects are managed by command buffers. When a
temporary buffer object is used by multiple command buffers, the
object receives disjoint storage for each command buffer.

// Consequently, data may not be exchanged between command buffers through
// temporary buffers.

Storage for the temporary buffer objects may be allocated on an
on-demand basis. At times when the buffer is not needed, OpenCL
implementations may reuse its storage for other tasks within the
command buffer.

Contents of the temporary buffers are not guaranteed to be preserved
across command buffer executions.

| CL_MEM_BIND_TO_TENSOR | cl_tensor a| Use the created buffer as
storage for the given valid tensor. For the buffer creation to
succeed, the target tensor must not already have storage and the
_size_ argument of clCreateBufferWithProperties() must be zero.

The size of the memory buffer is implementation-defined and can be
queried with clGetTensorInfo().

The memory layout of the tensor in the created memory buffer is
implementation-defined, opaque to applications, and may change at
unspecified points. Implementations may use non-contiguous allocations
to store the tensor data and may store auxiliary data within the
allocations. Therefore, reading from or writing to the memory buffer
directly through the cl_mem handle leads to undefined behavior.

If the tensor is already bound to a buffer object, the
clCreateBufferWithProperties call returns the
CL_TENSOR_BOUND_TO_BUFFER error code.
|===

==== Add New Memory Object Query in Section 5.5.5

[cols="2,1,2",stripes=odd]
|===
| CL_MEM_COMMAND_BUFFER_TEMPORARY | cl_bool | This property can be
queried if the *cl_khr_command_buffer* extension is supported.

Return true if _memobj_ is a temporary buffer object for command
buffers.
|===

==== Add New Error Codes in Appendix F

[cols="2,3", stripes=odd]
|===
| CL_TENSOR_BOUND_TO_BUFFER | Returned when attempting to bind a
  buffer object to a tensor which has already been bound to the same
  or another buffer object.
| CL_INVALID_TENSOR | Returned when the specified tensor is not a
  valid tensor object.
|===

=== Sample Codes

Helper functions used in the following tensor code samples:

[source,c]
----
cl_kernel create_matmul_kernel(
  cl_context ctx, std::span<cl_device_id> device_span,
  cl_tensor lhs, cl_tensor rhs, cl_tensor out) {
  // A hypothetical matmul kernel signature in pseudo OpenCL C for
  // illustrative purposes:
  //
  //   kernel void matmul(global read_only tensor_t, global read_only tensor_t,
  //                      global write_only tensor_t);

  cl_kernel matmul_kernel = /* Omitted. */;
  clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor), &lhs);
  clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor), &rhs);
  clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor), &out);
  return matmul_kernel;
}

cl_kernel create_add_kernel(
  cl_context ctx, std::span<cl_device_id> device_span,
  cl_tensor lhs, cl_tensor rhs, cl_tensor out) {
  // A hypothetical add kernel signature in pseudo OpenCL C for illustrative
  // purposes:
  //
  //   kernel void add(global read_only tensor_t, global read_only tensor_t,
  //                   global write_only tensor_t);

  cl_kernel add_kernel = /* Omitted. */;
  clSetKernelArg(add_kernel, 0, sizeof(cl_tensor), &lhs);
  clSetKernelArg(add_kernel, 1, sizeof(cl_tensor), &rhs);
  clSetKernelArg(add_kernel, 2, sizeof(cl_tensor), &out);
  return add_kernel;
}
----

An example usage of tensors on a command queue:

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;

cl_int err;
cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, &err);
cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, &err);
cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor t0  = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);

cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0);
cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out);

// Allocate storage for the tensors. The buffer size must be set to
// zero when the buffer is bound to a tensor. The OpenCL implementation
// may determine the optimal data layout and the storage needed for it,
// based on the tensor's uses (the 'matmul' and 'add' kernels in this
// sample) so far.
cl_mem in0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in0, 0}, CL_MEM_READ_ONLY,
  0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */, nullptr, &err);
cl_mem in1_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in1, 0}, CL_MEM_READ_ONLY,
  0, nullptr, &err);
cl_mem in2_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, in2, 0}, CL_MEM_READ_ONLY,
  0, nullptr, &err);
cl_mem t0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, t0, 0}, CL_MEM_READ_WRITE,
  0, nullptr, &err);
cl_mem out_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_BIND_TO_TENSOR, out, 0}, CL_MEM_WRITE_ONLY,
  0, nullptr, &err);

std::vector<float> in0_data = ...;
std::vector<float> in1_data = ...;
std::vector<float> out_data(b * m * n);

// Copies data into the in0 tensor while possibly rearranging the data
// to the optimal data layout.
clEnqueueExportToTensor(
  cmd_q, in0, false, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, in0_data.data(), 0, nullptr, nullptr);
clEnqueueExportToTensor(
  cmd_q, in1, false, {0, 0, 0}, {0, 0, 0}, {b, k, n},
  nullptr, nullptr, in1_data.data(), 0, nullptr, nullptr);
clEnqueueNDRangeKernel(
  cmd_q, matmul_kernel, 3, nullptr, matmul_grid, nullptr, 0, nullptr, nullptr);
clEnqueueNDRangeKernel(
  cmd_q, add_kernel, 3, nullptr, add_grid, nullptr, 0, nullptr, nullptr);
clEnqueueImportFromTensor(
  cmd_q, out, false, {0, 0, 0}, {0, 0, 0}, {b, m, n},
  nullptr, nullptr, out_data.data(), 0, nullptr, nullptr);
----

An example use of tensors in a command buffer when the
cl_khr_command_buffer extension is supported:

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;

cl_int err;
cl_tensor in0 = clCreateTensor(ctx, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, &err);
cl_tensor in1 = clCreateTensor(ctx, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, &err);
cl_tensor in2 = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor t0  = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);
cl_tensor out = clCreateTensor(ctx, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, &err);

cl_kernel matmul_kernel = create_matmul_kernel(ctx, device_span, in0, in1, t0);
cl_kernel add_kernel = create_add_kernel(ctx, device_span, t0, in2, out);

// Bind command-buffer-managed storage to the tensors.
//
// NOTE: the same temporary tensor handle used in multiple command buffers
//       will have separate storage. In other words, command buffers may not
//       exchange data with each other via temporary buffers.
cl_mem in0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in0, 0},
  CL_MEM_READ_ONLY, 0 /* must be zero for CL_MEM_BIND_TO_TENSOR. */,
  nullptr, &err);
cl_mem in1_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in1, 0},
  CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem in2_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, in2, 0},
  CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem t0_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, t0, 0},
  CL_MEM_READ_WRITE, 0, nullptr, &err);
cl_mem out_mem = clCreateBufferWithProperties(
  ctx, {CL_MEM_COMMAND_BUFFER_TEMPORARY, true, CL_MEM_BIND_TO_TENSOR, out, 0},
  CL_MEM_WRITE_ONLY, 0, nullptr, &err);

std::vector<float> in0_data = ...;
std::vector<float> in1_data = ...;
std::vector<float> out_data(b * m * n);

cl_command_buffer_khr cmd_b =
  clCreateCommandBufferKHR(num_queues, queue_list, nullptr, &err);

cl_sync_point_khr in0_syncp, in1_syncp, matmul_syncp, add_syncp;
clCommandExportToTensorKHR(
  cmd_b, cmd_q, in0, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, in0_data.data(), 0, nullptr, &in0_syncp, nullptr);
clCommandExportToTensorKHR(
  cmd_b, cmd_q, in1, {0, 0, 0}, {0, 0, 0}, {b, k, n},
  nullptr, nullptr, in1_data.data(), 0, nullptr, &in1_syncp, nullptr);
clCommandNDRangeKernelKHR(
  cmd_b, cmd_q, nullptr, matmul_kernel, 3, nullptr, matmul_grid, nullptr,
  2, {in0_syncp, in1_syncp}, &matmul_syncp, nullptr);
clCommandNDRangeKernelKHR(
  cmd_b, cmd_q, nullptr, add_kernel, 3, nullptr, add_grid, nullptr,
  1, {matmul_syncp}, &add_syncp, nullptr);
clCommandImportFromTensorKHR(
  cmd_b, cmd_q, out, {0, 0, 0}, {0, 0, 0}, {b, m, n},
  nullptr, nullptr, out_data.data(), 1, {add_syncp}, nullptr, nullptr);

// Finalize the command buffer. At this point the OpenCL
// implementation may reserve enough storage for all the tensor
// temporaries. Temporary tensors might be eliminated - for example,
// the OpenCL implementation could use the 'out' tensor to store the
// result of matmul_kernel, thus eliminating the need for the 't0'
// tensor.
clFinalizeCommandBufferKHR(cmd_b);

// Temporary tensors used in a command buffer can't be read or written
// into. A hypothetical reason is that the finalized command buffer
// might not use some of the tensors.
assert(clEnqueueImportFromTensor(..., t0, ...) == CL_INVALID_OPERATION);
----

=== Open Questions

. Should we have support for tensors with undefined shape and tensors
  with unknown / symbolic dimension sizes like in ONNX?
+
--
// https://onnx.ai/onnx/repo-docs/ShapeInference.html
*UNRESOLVED*
--

. Should we define OpenCL C language features for accessing tensors?
+
--
*RESOLVED*: OpenCL C support for tensors can be introduced later in a
separate extension. Built-in kernels may benefit from this extension
as it is.
--