diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_bindless_images.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_bindless_images.asciidoc new file mode 100644 index 0000000000000..98f3e213eed40 --- /dev/null +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_bindless_images.asciidoc @@ -0,0 +1,1973 @@ +# sycl_ext_oneapi_bindless_images + +:source-highlighter: coderay +:coderay-linenums-mode: table +:dpcpp: pass:[DPC++] + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en + +:blank: pass:[ +] +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +[%hardbreaks] +Copyright (C) Codeplay. All rights reserved. + +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by +permission by Khronos. + + +== Dependencies + +This extension is written against the SYCL 2020 revision 6 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + +== Status + +This is an experimental extension specification, intended to provide early +access to features and gather community feedback. Interfaces defined in this +specification are implemented in {dpcpp}, but they are not finalized and may +change incompatibly in future versions of {dpcpp} without prior notice. +*Shipping software products should not rely on APIs defined in this +specification.* + + +== Backend support status + +This extension is currently implemented in {dpcpp} only for GPU devices and +only when using the CUDA backend. Attempting to use this extension in +kernels that run on other devices or backends will not work. +Be aware that the compiler may not be able to issue a diagnostic to +warn you if this happens. + +== Overview + +Images in SYCL 1.2.1 were designed to work with OpenCL. SYCL 2020 tried to make +them more versatile by splitting the image type into sampled and unsampled +images. This enabled SYCL 2020 images to work better with other backends. +However, SYCL 2020 images still didn't quite meet user expectations. +There was feedback about various use cases where the current model falls short +(see examples at the end of this document for some of the use cases). + +One of the key issues is requesting access to arbitrary images through handles, +and not accessors. Accessing images through handles instead of accessors grants +much more flexibility to the user, at the expense of automatic data dependency +tracking. Bypassing accessors allows users to implement programs where the +number of images is not known at compile-time, such as a texture atlas where one +image holds references to other images. This kind of feature is impossible to +implement with the accessor model outlined in the core specification. + +These shortcomings are why we propose a new extension for SYCL 2020 images. +Per our proposal, users would be able to separate memory allocation for the +image from the actual image creation. Images will be represented by opaque +handle types that can be passed directly into a kernel without requesting +access. In many ways, this model more closely resembles the USM model when +accessing data on the device, but it's specialized for dealing with images. + +The proposed model does not replace SYCL 2020 images, +it is instead meant as building blocks for implementing SYCL 2020 images on +top of it. + +In addition to bindless images, this document also proposes an interoperability +extension providing functionality to allow users to import external memory and +semaphore objects from other APIs, such as Vulkan or DirectX. + +Importing memory allows it to be shared between APIs without the need to +duplicate allocations and perform multiple copies between host and device to +ensure that said memory is kept uniform across those APIs at all times. + +Importing semaphores will also allow SYCL to schedule command groups and queue +operations that depend on completion of GPU commands submitted by external APIs. + +[NOTE] +==== +The interoperability outlined in this document concerns only the importing of +external API objects into the SYCL runtime. We do not expose exportation of SYCL +objects to external APIs. Interoperability capabilities vary between APIs. For +example, CUDA allows the import of external memory and semaphores, but does not +allow export of its own resources. +==== + +== Specification + +=== Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification. An implementation supporting this extension must predefine the +macro `SYCL_EXT_ONEAPI_BINDLESS_IMAGES` to one of the values defined in the +table below. Applications can test for the existence of this macro to +determine if the implementation supports this feature, or applications can test +the macro's value to determine which of the extension's features the +implementation supports. + +[frame="none",options="header"] +|====================== +|Rev |Description +|1 |Initial draft of the proposal +|2 |Second revision of the proposal +|3 |Third revision of the proposal +|4 |Fourth revision of the proposal +|====================== + +See the revision history at the bottom of this document for features added in +each revision. + +=== Querying bindless image support + +We provide the following device queries to retrieve information on whether a +SYCL implementation provides support for various bindless image features. + +The device aspects for these queries are: + +[frame="none",options="header"] +|====================== +|Device descriptor |Description +|`aspect::ext_oneapi_bindless_images` | Indicates if the device supports +creation of bindless images backed by the `image_mem` and `image_mem_handle` +APIs. +|`aspect::ext_oneapi_bindless_images_shared_usm` | Indicates if the device +supports the creation of bindless images backed by shared USM memory. +|`aspect::ext_oneapi_bindless_images_1d_usm` | Indicates if the device supports +creation of 1D bindless images backed by USM. +|`aspect::ext_oneapi_bindless_images_2d_usm` | Indicates if the device supports +creation of 2D bindless images backed by USM. +|====================== + +[NOTE] +==== +Not all SYCL backends may provide support for bindless images constructed from +USM memory with all dimensions. As an example, CUDA does not have +native support for 3D image resources constructed from USM. In the future, some +backends may support this, and this proposal may be updated to allow creation +of 3D USM images. +==== + +=== Image descriptor + +```cpp +namespace sycl::ext::oneapi::experimental { + +enum class image_channel_order : /* unspecified */ { + a, + r, + rx, + rg, + rgx, + ra, + rgb, + rgbx, + rgba, + argb, + bgra, + intensity, + luminance, + abgr, +}; + +enum class image_channel_type : /* unspecified */ { + snorm_int8, + snorm_int16, + unorm_int8, + unorm_int16, + unorm_short_565, + unorm_short_555, + unorm_int_101010, + signed_int8, + signed_int16, + signed_int32, + unsigned_int8, + unsigned_int16, + unsigned_int32, + fp16, + fp32, +}; + +enum class image_type : /* unspecified */ { + standard, + mipmap, + interop, +}; + +struct image_descriptor { + size_t width; + size_t height; + size_t depth; + image_channel_type channel_type; + image_channel_order channel_order; + image_type type; + unsigned int num_levels; + + image_descriptor(sycl::range<1> dims, image_channel_order channel_order, + image_channel_type channel_type, + image_type type = image_type::standard, + unsigned int num_levels = 1); + + image_descriptor(sycl::range<2> dims, image_channel_order channel_order, + image_channel_type channel_type, + image_type type = image_type::standard, + unsigned int num_levels = 1); + + image_descriptor(sycl::range<3> dims, image_channel_order channel_order, + image_channel_type channel_type, + image_type type = image_type::standard, + unsigned int num_levels = 1); + + image_descriptor get_mip_level_desc(unsigned int level) const; +}; + +} +``` + +The image descriptor represents the image dimensions, channel type, and channel +order. An `image_type` member is also present to allow for implementation of +mipmapped and interop images. + +The `image_descriptor` shall be default constructible and follow by-value +semantics. + +[NOTE] +==== +Additional future `image_type`s _may_ include "layered" and/or "cubemap". +==== + +Note that `image_channel_type` and `image_channel_order` existed in SYCL 1.2.1, +but were removed in SYCL 2020 in favor of a single, unified enum class. +We propose separating them again to enable better flexibility +and to avoid combinatorial complexity. + +For the `standard` image type, the value of `num_levels` must be `1`. + +The `type` member will inform the implementation of the type of image to +create, allocate, or free. + +Only mipmap image types support more than one level. For mipmap images, the +member function `get_mip_level_desc` will return an `image_descriptor` for +a given level of a mipmap, with valid dimension values for that level, and the +type of the returned `image_descriptor` will be `image_type::standard`. + +=== Allocating image memory + +The process of creating an image is two-fold: +allocate an image's memory, then create an image handle from the allocation. +Allocation of image memory can be achieved in two ways. + +==== Allocating non-USM image memory + +```cpp +namespace sycl::ext::oneapi::experimental { + +struct image_mem_handle { + using raw_handle_type = /* implementation defined */; + raw_handle_type raw_handle; +} + +class image_mem { +public: + image_mem(); + image_mem(const image_mem &rhs); + image_mem(image_mem &&rhs) noexcept; + + image_mem(const image_descriptor &imageDesc, + const sycl::device &syclDevice, + const sycl::context &syclContext); + image_mem(const image_descriptor &imageDesc, + const sycl::queue &syclQueue); + + ~image_mem(); + + image_mem &operator=(image_mem &rhs); + image_mem &operator=(image_mem &&rhs) noexcept; + + bool operator==(const image_mem &rhs) const; + bool operator!=(const image_mem &rhs) const; + + image_mem_handle get_handle() const; + image_descriptor get_descriptor() const; + sycl::device get_device() const; + sycl::context get_context() const; + + sycl::range<3> get_range() const; + sycl::image_channel_type get_image_channel_type() const; + sycl::image_channel_type get_image_channel_order() const; + unsigned int get_image_num_channels() const; + image_type get_type() const; + + image_mem_handle get_mip_level_mem_handle(unsigned int level) const; +}; + +image_mem_handle alloc_image_mem(const image_descriptor &imageDesc, + const sycl::device &syclDevice, + const sycl::context &syclContext); +image_mem_handle alloc_image_mem(const image_descriptor &imageDesc, + const sycl::queue &syclQueue); + +void free_image_mem(image_mem_handle memHandle, + image_type imageType, + const sycl::device &syclDevice, + const sycl::context &syclContext); +void free_image_mem(image_mem_handle memHandle, + image_type imageType, + const sycl::queue &syclQueue); +} +``` + +The first method of allocating device memory for images is through +`alloc_image_mem`. This takes a `sycl::device`, `sycl::context`, +and `image_descriptor` to allocate device memory, appropriately sized +based on the `image_descriptor`. Alternatively, we can also pass a +`sycl::queue` instead of both `sycl::device` and `sycl::context`. + +Memory allocated in this way requires the user to free that memory after all +operations using the memory are completed and no more operations operating on +the memory will be scheduled. This is done using `free_image_mem`. An +`image_type` should be passed to `free_image_mem` to inform the implementation +of the type of memory to be freed. + +The second method involves the `image_mem` class, which is a RAII class wrapper +that performs allocation and deallocation of device memory. + +The default constructor does not allocate any memory on the device and the +resulting `image_mem` object is in an uninitialized state. + +the constructor is a wrapper for `alloc_image_mem` functionality. +The destructor is a wrapper for `free_image_mem` functionality. + +`image_mem` also provides some functions to get various properties of the +image memory allocation such as the image range, channel type, channel order, +number of channels, number of levels, and image type. + +In the case where a mipmap has been allocated, `get_mip_level_mem_handle` can +be used to return an `image_mem_handle` to a specific level of the mipmap. This +can then be used to copy data to that specific level or create an image handle +based on that level. + +Note that the handle type `image_mem_handle::raw_handle_type` is an opaque type, +and the handle cannot be dereferenced on the host. The layout of the memory is +backend-specific, and may be an optimized layout, e.g. tile swizzle patterns. + +The `image_mem` class must follow Common Reference Semantics as outlined by the +core SYCL 2020 specification. + +The `image_mem` class is not a valid kernel argument. + +If the construction of the `image_mem` class fails, a +`sycl::exception` with error code `sycl::errc::memory_allocation` will be +thrown. + +Similarly, if `alloc_image_mem` or `free_image_mem` fail, a `sycl::exception` +with error code `sycl::errc::memory_allocation` will be thrown. + +[NOTE] +==== +In the DPC++ CUDA backend, `image_mem` will allocate/deallocate a +`CUarray` type (or `CUmipmappedArray` in the case of mipmap images). +==== + +===== Getting image information from `image_mem_handle` + +Extension functions are provided to retrieve information about images allocated +using the `image_mem_alloc` function. These are similar to the member functions +provided by `image_mem`. However, since the `image_mem_handle` is a minimal +struct representing just the opaque handle the underlying memory object, there +is some information that we cannot retrieve from it, namely the `image_type`, +`image_channel_order`, the `sycl::context` or `sycl::device` the memory was +allocated in, and the `image_descriptor` used to allocate the memory. + +```cpp +namespace sycl::ext::oneapi { + +sycl::range<3> get_image_range(const image_mem_handle memHandle, + const sycl::device &syclDevice, + const sycl::context &syclContext); + +sycl::image_channel_type +get_image_channel_type(const image_mem_handle memHandle, + const sycl::device &syclDevice, + const sycl::context &syclContext); + +unsigned int get_image_num_channels(const image_mem_handle memHandle, + const sycl::device &syclDevice, + const sycl::context &syclContext); + +image_mem_handle get_mip_level_mem_handle(const image_mem_handle mipMemHandle, + unsigned int level, + const sycl::device &syclDevice, + const sycl::context &syclContext); +} +``` + +For `get_image_range` where the underlying image memory was allocated with one +or two dimensions, the returned `sycl::range<3>` will contain zero values for +the dimensions unused by the underlying image memory object. + +==== Allocating USM image memory + +The second way to allocate image memory is to use USM allocations. SYCL already +provides a number of USM allocation functions. This proposal would add another, +pitched memory allocation, through `pitched_alloc_device`. + +```cpp +namespace sycl::ext::oneapi::experimental { + +void *pitched_alloc_device(size_t *retRowPitch, + size_t widthInBytes, size_t height, + unsigned int elementSizeBytes, + const sycl::queue &syclQueue); +void *pitched_alloc_device(size_t *retRowPitch + size_t widthInBytes, size_t height, + unsigned int elementSizeBytes, + const sycl::device &syclDevice, + const sycl::context &syclContext); + +void *pitched_alloc_device(size_t *resultPitch, + const image_descriptor &desc, + const sycl::queue &queue); + +void *pitched_alloc_device(size_t *resultPitch, + const image_descriptor &desc, + const sycl::device &syclDevice, + const sycl::context &syclContext); +} +``` + +This function will allocate a memory region aimed to be used for +two-dimensional images. It allocates memory that is guaranteed to +adhere to the device's alignment requirements for USM images. + +If the user does not wish to use `pitched_alloc_device` to allocate +two-dimensional USM images, but prefers to use another USM allocation +function instead, then that allocation must adhere to some alignment +restrictions. These restrictions are device specific, and queries for them can +be found in the "Pitch alignment restrictions and queries" section below. + +If the allocation of pitched memory fails, `pitched_alloc_device` will throw a +`sycl::exception` with error code `sycl::errc::memory_allocation`. + +=== Pitch alignment restrictions and queries + +For the purposes of this document, the row pitch of an image memory allocation +is the distance in bytes between the first elements of adjacent rows of the +image. Some devices may require two-dimensional USM images to be allocated with +specific alignments for their width and pitch values. The `pitched_alloc_device` +API intends to make allocation of USM memory adhering to these restrictions +easy, returning the appropriate pitch value to the user. However, if a user +wishes to use another USM allocation function, they must be aware of these +restrictions, and query the device to ensure the allocations they wish to use +adhere to those restrictions. + +This proposal provides a number of additional device queries that enable the +user to allocate appropriate pitched USM memory for two-dimensional +images. One-dimensional images do not require any pitch values. + +The device information descriptors for these queries are: + +[frame="none",options="header"] +|====================== +|Device descriptor |Return type |Description +|`ext::oneapi::experimental::info::device::image_row_pitch_align` |`uint32_t` | +Returns the required alignment of the pitch between two rows of an image in +bytes for images allocated using USM. +|`ext::oneapi::experimental::info::device::max_image_linear_width` |`size_t` | +Returns the maximum linear width allowed for images allocated using USM. +|`ext::oneapi::experimental::info::device::max_image_linear_height` |`size_t` | +Returns the maximum linear height allowed for images allocated using USM. +|`ext::oneapi::experimental::info::device::max_image_linear_row_pitch` +|`size_t` | Returns the maximum linear row pitch allowed for images allocated +using USM. +|====================== + +=== Obtaining a handle to the image + +The next step is to create the image, and obtain the handle. + +```cpp +namespace sycl::ext::oneapi::experimental { + +/// Opaque unsampled image handle type. +struct unsampled_image_handle { + using raw_image_handle_type = /* Implementation defined */; + raw_image_handle_type image_handle; +}; + +/// Opaque sampled image handle type. +struct sampled_image_handle { + using raw_image_handle_type = /* Implementation defined */; + using raw_sampler_handle_type = /* Implementation defined */ + + raw_image_handle_type image_handle; + raw_sampler_handle_type sampler_handle; +}; + +// Creating an unsampled image from an `image_mem_handle` +unsampled_image_handle create_image(image_mem_handle memHandle, + const image_descriptor &desc, + const sycl::device &syclDevice, + const sycl::context &syclContext); +unsampled_image_handle create_image(image_mem_handle memHandle, + const image_descriptor &desc, + const sycl::queue &syclQueue); + +// Creating a sampled image from an `image_mem_handle` +sampled_image_handle create_image(image_mem_handle memHandle, + const image_descriptor &desc, + const bindless_image_sampler &sampler, + const sycl::device &syclDevice, + const sycl::context &syclContext); +sampled_image_handle create_image(image_mem_handle memHandle, + const image_descriptor &desc, + const bindless_image_sampler &sampler, + const sycl::queue &syclQueue); + +// Creating an unsampled image from an `image_mem` object +unsampled_image_handle create_image(const image_mem &memHandle, + const image_descriptor &desc, + const sycl::device &syclDevice, + const sycl::context &syclContext); +unsampled_image_handle create_image(const image_mem &memHandle, + const image_descriptor &desc, + const sycl::queue &syclQueue); + +// Creating a sampled image from an `image_mem` object +sampled_image_handle create_image(const image_mem &memHandle, + const image_descriptor &desc, + const bindless_image_sampler &sampler, + const sycl::device &syclDevice, + const sycl::context &syclContext); +sampled_image_handle create_image(const image_mem &memHandle, + const image_descriptor &desc, + const bindless_image_sampler &sampler, + const sycl::queue &syclQueue); + +// Creating a sampled image from a USM allocation and pitch +sampled_image_handle create_image(const void *usmPtr, size_t pitch, + const image_descriptor &desc, + const bindless_image_sampler &sampler, + const sycl::device &syclDevice, + const sycl::context &syclContext); +sampled_image_handle create_image(const void *usmPtr, size_t pitch, + const image_descriptor &desc, + const bindless_image_sampler &sampler, + const sycl::queue &syclQueue); + +// Destroying an image handle +void destroy_image_handle(sampled_image_handle &imageHandle, + const sycl::device &syclDevice, + const sycl::context &syclContext); +void destroy_image_handle(sampled_image_handle &imageHandle, + const sycl::queue &syclQueue); + +void destroy_image_handle(unsampled_image_handle &imageHandle, + const sycl::device &syclDevice, + const sycl::context &syclContext); +void destroy_image_handle(unsampled_image_handle &imageHandle, + const sycl::queue &syclQueue); +} +``` + +Once we have allocated memory, we can pass it into the `create_image` function +to obtain a `sampled_image_handle` or `unsampled_image_handle`. +These objects are opaque types that represent an image object. +They can be captured by value into a SYCL kernel, or they can be passed in a +buffer as a dynamic array of images (see examples at the bottom of this +document). + +We can either provide a `bindless_image_sampler` (defined in section below) or +not when creating the image. Doing so will create a `sampled_image_handle`, +where otherwise an `unsampled_image_handle` would be returned. A +`sampled_image_handle` should contain a raw sampler handle that will be used +when sampling an image. + +Whether an `image_descriptor` or `void *` USM allocation was passed to +`create_image`, it must have been allocated in the same context and on the same +device as the one passed to `create_image`. + +If we choose to create a 2D image from a USM allocation by passing a `void *`, +we must also pass the pitch of the memory allocation. If the memory was +allocated using `pitched_alloc_device`, the pitch passed must be the one which +was returned by `pitched_alloc_device`. If the user did not use +`pitched_alloc_device` to allocate this memory, then that memory must still +adhere to device specific alignment restrictions. These restrictions and their +queries are outlined in the section "Pitch alignment restrictions and queries" +below. + +The pitch is ignored for 1D USM images. + +If the creation of an image fails, `create_image` will throw a `sycl::exception` +with error code `sycl::errc::runtime`. + +The `unsampled_image_handle` and `sampled_image_handle` types shall be +default-constructible, copy-constructible, and device-copyable. + +[NOTE] +==== +In the DPC++ CUDA backend a sampled image will correspond to a CUDA texture, +whereas an unsampled image will correspond to a CUDA surface. +==== + +After we're done with the image, we need to destroy the handle using +`destroy_image_handle`. Destroying an image handle does not deallocate the +underlying image memory. The user is responsible for deallocation, either +through `free_image_mem`, or destroying the `image_mem` object, if one was used. + +=== Image sampler struct + +The `bindless_image_sampler` struct shown below is used to set the sampling +properties of `sampled_images` upon image creation. It can be used to set +sampling properties that exist in the SYCL 2020 `image_sampler` as well as +extra properties used for sampling mipmaps including level-of-detail (LOD) and +anisotropic filtering. + +```cpp +namespace sycl::ext::oneapi::experimental { + +struct bindless_image_sampler { + + bindless_image_sampler(sycl::addressing_mode addressing, + sycl::coordinate_normalization_mode coordinate, + sycl::filtering_mode filtering); + + bindless_image_sampler(sycl::addressing_mode addressing, + sycl::coordinate_normalization_mode coordinate, + sycl::filtering_mode filtering, + sycl::filtering_mode mipFiltering, + float minMipmapLevelClamp, float maxMipmapLevelClamp, + float maxAnisotropy); + + sycl::addressing_mode addressing; + sycl::coordinate_normalization_mode coordinate; + sycl::filtering_mode filtering; + sycl::filtering_mode mipmapFiltering; + float minMipmapLevelClamp; + float maxMipmapLevelClamp; + float maxAnisotropy; +}; + +} +``` + +`mipmapFiltering` dictates the method in which sampling between mipmap +levels is performed. + +`minMipmapLevelClamp` defines the minimum mipmap level from which we can sample, +with the minimum value being 0. + +`maxMipmapLevelClamp` defines the maximum mipmap level from which we can sample. +This cannot value cannot be higher than the number of allocated levels. + +`maxAnisotropy` dictates the anisotropic ratio used when samplling the mipmap +with anisotropic filtering. + +=== Explicit copies + +```cpp +namespace sycl { + +class handler { +public: + + // Simple host to device copy + void ext_oneapi_copy( + void *Src, + ext::oneapi::experimental::image_mem_handle Dest, + const ext::oneapi::experimental:image_descriptor &DestImgDesc); + + // Host to device copy with offsets and extent + void ext_oneapi_copy( + void *Src, + sycl::range<3> SrcOffset, + sycl::range<3> SrcExtent, + ext::oneapi::experimental::image_mem_handle Dest, + sycl::range<3> DestOffset, + const ext::oneapi::experimental::image_descriptor &DestImgDesc, + sycl::range<3> CopyExtent); + + // Simple device to host copy + void ext_oneapi_copy( + ext::oneapi::experimental::image_mem_handle Src, + void *Dest, + const ext::oneapi::experimental::image_descriptor &SrcImgDesc); + + // Device to host copy with offsets and extent + void ext_oneapi_copy( + ext::oneapi::experimental::image_mem_handle Src, + sycl::range<3> SrcOffset, + const ext::oneapi::experimental::image_descriptor &SrcImgDesc, + void *Dest, + sycl::range<3> DestOffset, + sycl::range<3> DestExtent, + sycl::range<3> CopyExtent); + + // Simple HtoD or DtoH copy with USM device memory + void ext_oneapi_copy(void *Src, + void *Dest, + const ext::oneapi::experimental::image_descriptor &Desc, + size_t DeviceRowPitch); + + // HtoD or DtoH copy with USM device memory, using offsets, extent + void ext_oneapi_copy( + void *Src, + sycl::range<3> SrcOffset, + void *Dest, + sycl::range<3> DestOffset, + const ext::oneapi::experimental::image_descriptor &DeviceImgDesc, + size_t DeviceRowPitch, + sycl::range<3> HostExtent, + sycl::range<3> CopyExtent); +}; + +class queue { +public: + + // Simple host to device copy + event ext_oneapi_copy(void *Src, + ext::oneapi::experimental::image_mem_handle Dest, + const ext::oneapi::experimental::image_descriptor &DestImgDesc); + event ext_oneapi_copy(void *Src, + ext::oneapi::experimental::image_mem_handle Dest, + const ext::oneapi::experimental::image_descriptor &DestImgDesc, + event DepEvent); + event ext_oneapi_copy(void *Src, + ext::oneapi::experimental::image_mem_handle Dest, + const ext::oneapi::experimental::image_descriptor &DestImgDesc, + const std::vector &DepEvents); + + // Host to device copy with offsets and extent + event ext_oneapi_copy( + void *Src, + range<3> SrcOffset, + range<3> SrcExtent, + ext::oneapi::experimental::image_mem_handle Dest, + range<3> DestOffset, + const ext::oneapi::experimental::image_descriptor &DestImgDesc, + range<3> CopyExtent); + event ext_oneapi_copy( + void *Src, + range<3> SrcOffset, + range<3> SrcExtent, + ext::oneapi::experimental::image_mem_handle Dest, + range<3> DestOffset, + const ext::oneapi::experimental::image_descriptor &DestImgDesc, + range<3> Extent, event DepEvent); + event ext_oneapi_copy( + void *Src, + range<3> SrcOffset, + range<3> SrcExtent, + ext::oneapi::experimental::image_mem_handle Dest, + range<3> DestOffset, + const ext::oneapi::experimental::image_descriptor &DestImgDesc, + range<3> CopyExtent, const std::vector &DepEvents); + + // Simple device to host copy + event ext_oneapi_copy( + ext::oneapi::experimental::image_mem_handle Src, + void *Dest, + const ext::oneapi::experimental::image_descriptor &Desc); + event ext_oneapi_copy( + ext::oneapi::experimental::image_mem_handle Src, + void *Dest, + const ext::oneapi::experimental::image_descriptor &Desc, + event DepEvent); + event ext_oneapi_copy( + ext::oneapi::experimental::image_mem_handle Src, + void *Dest, + const ext::oneapi::experimental::image_descriptor &Desc, + const std::vector &DepEvents); + + // Device to host copy with offsets and extent + event ext_oneapi_copy( + ext::oneapi::experimental::image_mem_handle Src, + range<3> SrcOffset, + const ext::oneapi::experimental::image_descriptor &SrcImgDesc, + void *Dest, + range<3> DestOffset, + range<3> DestExtent, + range<3> CopyExtent); + event ext_oneapi_copy( + ext::oneapi::experimental::image_mem_handle Src, + range<3> SrcOffset, + const ext::oneapi::experimental::image_descriptor &SrcImgDesc, + void *Dest, + range<3> DestOffset, + range<3> DestExtent, + range<3> CopyExtent, event DepEvent); + event ext_oneapi_copy( + ext::oneapi::experimental::image_mem_handle Src, + range<3> SrcOffset, + const ext::oneapi::experimental::image_descriptor &SrcImgDesc, + void *Dest, + range<3> DestOffset, + range<3> DestExtent, + range<3> CopyExtent, const std::vector &DepEvents); + + // Host to device OR device to host using USM device memory + event ext_oneapi_copy( + void *Src, void *Dest, + const ext::oneapi::experimental::image_descriptor &DeviceImgDesc, + size_t DeviceRowPitch); + event ext_oneapi_copy( + void *Src, void *Dest, + const ext::oneapi::experimental::image_descriptor &DeviceImgDesc, + size_t DeviceRowPitch, + event DepEvent); + event ext_oneapi_copy( + void *Src, void *Dest, + const ext::oneapi::experimental::image_descriptor &DeviceImgDesc, + size_t DeviceRowPitch, + const std::vector &DepEvents); + + // Host to device OR device to host using USM device memory, + // with control over sub-region + event ext_oneapi_copy( + void *Src, sycl::range<3> SrcOffset, + void *Dest, sycl::range<3> DestOffset, + const ext::oneapi::experimental::image_descriptor &DeviceImgDesc, + size_t DeviceRowPitch, + sycl::range<3> HostExtent, + sycl::range<3> CopyExtent); + event ext_oneapi_copy( + void *Src, sycl::range<3> SrcOffset, + void *Dest, sycl::range<3> DestOffset, + const ext::oneapi::experimental::image_descriptor &DeviceImgDesc, + size_t DeviceRowPitch, + sycl::range<3> HostExtent, + sycl::range<3> CopyExtent); + event ext_oneapi_copy( + void *Src, sycl::range<3> SrcOffset, + void *Dest, sycl::range<3> DestOffset, + const ext::oneapi::experimental::image_descriptor &DeviceImgDesc, + size_t DeviceRowPitch, + sycl::range<3> HostExtent, + sycl::range<3> CopyExtent); +}; +} +``` + +To enable the copying of images an `ext_oneapi_copy` function is proposed as a +method of the queue and handler. It can be used to copy image memory, whether +allocated through USM or using an `image_mem_handle`, from host to +device, or device to host. For the `ext_oneapi_copy` variants that do not take +offsets and extents, the image descriptor passed to the `ext_oneapi_copy` API +is used to determine the pixel size, dimensions, and extent in memory of the +image to copy. If performing sub-region copy, the size of the memory region is +also determined by the offsets and extent passed. + +For images allocated using USM, existing SYCL functionality can be used to +copy their memory, but we also provide `ext_oneapi_copy` functions that take +USM pointers. If the image memory was allocated using `pitched_alloc_device`, +then the source and destination, row pitch parameter passed must match that +which was returned from `pitched_alloc_device`. If the user opted to +use another allocation function then the device pitch parameters must adhere to +the alignment restrictions outlined in the +"Pitch alignment restrictions and queries" section. + +Unless performing a sub-region copy, the user must ensure that the memory +regions accessed through `Dest` and `Src` have the same capacity. + +Whether copying image memory to the device through a USM `Dest` pointer, or an +`image_mem_handle`, the host memory is always assumed to be tightly packed. +Similarly, the host memory is assumed to be packed when copying from device to +host. + +For the functions that take an `image_mem_handle`, the handle must have been +allocated within the same context and device of the `queue`. + +For the forms that take a USM pointer, the image memory must also have been +allocated within the same context and device of the `queue`. The USM memory +must be accessible on the queue's device. + +The `ext_oneapi_copy` function variants that don't take offsets and extents may +fail in the following scenarios: + +1. The `Src` and `Dest` memory was not allocated on the same device and +context of the queue. + +2. The `Src` and `Dest` memory regions, where `Src` or `Dest` can be either +on the host or device, do not have the same memory capacity, where the capacity +is calculate from the `width`, `height`, `depth`, `channel_order`, and +`channel_type` members of the `image_descriptor` parameter. + +The `ext_oneapi_copy` function variants that do take offsets and extents may +fail in the following scenarios: + +1. The `Src` and `Dest` memory was not allocated on the same device and +context of the queue. + +2. The image descriptor passed does not match the image descriptor used to +allocate the image on the device. + +3. the `CopyExtent` describes a memory region larger than that which was +allocated on either the host or the device. + +4. The `HostExtent` describes a memory region larger than that which was +allocated on the host. + +5. The `SrcExtent` describes a memory region larger than that which was +allocated, where `Src` can be either the host or device. + +6. The `DestExtent` describes a memory region larger than that which was +allocated, where `Dest` can be either the host or device. + +7. If `SrcOffset + CopyExtent` moves the memory sub-region outside the bounds +of the memory described by `Src`, irrespective of whether `Src` is on the host +or the device. + +8. If `DestOffset + CopyExtent` moves the memory sub-region outside the bounds +of the memory described by `Dest`, irrespective of whether `Dest` is on the +host or the device. + +9. The `DeviceRowPitch` does not adhere to the alignment requirements +outlined in section "Pitch alignment restrictions and queries" + +10. The value of `DeviceRowPitch` is smaller than the width of the image on +the device. + +If copying of an image fails, `ext_oneapi_copy` will throw a `sycl::exception` +with error code `sycl::errc::invalid`, and relay an error message back to the +user through `sycl::exception::what()`, describing which of the scenarios +listed above caused the failure. + +=== Reading and writing inside the kernel + +```cpp +namespace sycl::ext::oneapi::experimental { + +template +DataT read_image(const unsampled_image_handle &ImageHandle, + const CoordT &Coords); +template +DataT read_image(const sampled_image_handle &ImageHandle, + const CoordT &Coords); + +template +void write_image(unsampled_image_handle &ImageHandle, + const CoordT &Coords, const DataT &Color); +} +``` + +Inside a kernel, it's possible to read an image via `read_image`, passing +the image handle. For the form that takes `unsampled_image_handle`, image data +will be fetched exactly as is in device memory. For the form that takes a +`sampled_image_handle`, the image will be sampled according to the +`bindless_image_sampler` that was passed to the image upon construction. The +sampler handle is included in the `sampled_image_handle` as +`sampled_image_handle::raw_sampler_handle`. + +The returned data will be of templated type `DataT`, which is specified by the +user, and should map to the type that the image was created with (a combination +of `image_channel_type` and `image_channel_order`). An example of a type +resulting from these two is shown below. + +For reads or writes, `DataT` must correspond to the type specified in the +`image_descriptor` when the image was created. For multi-channel types, the +resultant `DataT` should be a `sycl::vec` type. E.g., for a channel order of +`image_channel_order::rg` and channel type of `image_channel_type::fp16`, the +resultant `DataT` should be `sycl::vec`. + +It's possible to write to an image via `write_image` passing the +handle of the image to be written to, along with the coordinates to write to and +the data. + +Sampled images cannot be written to using `write_image`. + +For unsampled images, coordinates are specified by `int`, `sycl::vec`, +and `sycl::vec` for 1D, 2D, and 3D images respectively. + +Sampled image reads take `float`, `sycl::vec`, and +`sycl::vec` coordinate types for 1D, 2D, and 3D images respectively. + +In the case of 3D reads or writes, the fourth element in the coordinate vector +is ignored. + +Note that coordinates for 3D images take a vector of size 4, not 3. + +Note also that all images must be used in either read-only or write-only fashion +within a single kernel invocation; read/write images are not supported. + +Note also that read-after-write functionality is not supported. Unsampled +images may be read from and written back to within the same kernel, however, +reading from that same image again will result in undefined behaviour. A new +kernel must be submitted for the written data to be accessible. + +== Mipmapped images + +So far, we have described how to create and operate on standard bindless images. +Another type of image we propose support for is a mipmapped image. Mipmapped +images are an image type with multiple levels. Each consecutive dimension of a +mipmapped image level is smaller than the previous level. The dimensions of a +succeeding mip level is half that of the preceding level. As an example, a +two-dimensional mipmapped image where the top-most level (`level==0`) image has +a `width==16` and `height==16`, the succeeding level (`level==1`) in the mipmap +will have sizes `width==8` and `height==8`. This pattern continues until either +the final level has sizes of `width==1` and `height==1`, or the user-specified +maximum mip level has been reached (described by the `num_levels` member of +`image_descriptor`). + +=== Querying mipmap support + +We provide the following device queries to retrieve information on a SYCL +implementation of various mipmap features. + +The device aspect descriptors for these queries are: + +[frame="none",options="header"] +|====================== +|Device descriptor |Description +|`aspect::ext_oneapi_mipmap` | Indicates if the device supports allocating +mipmap resources. +|`aspect::ext_oneapi_mipmap_anisotropy` | Indicates if the device supports +sampling mipmap images with anisotropic filtering +|`aspect::ext_oneapi_mipmap_level_reference` | Indicates if the device supports +using images created from individual mipmap levels +|====================== + +The device information descriptors for these queries are: + +[frame="none",options="header"] +|====================== +|Device descriptor |Return type |Description +|`ext::oneapi::experimental::info::device::mipmap_max_anisotropy` |`float` | +Return the maximum anisotropic ratio supported by the device +|====================== + +=== Allocation of mipmapped images + +Mipmaps are allocated in a similar manner to standard images, however, mipmaps +do not support USM backed memory. + +Mipmap memory is allocated through `alloc_image_mem`. The user should populate +the `image_descriptor` with the image type of `image_type::mipmap`, and provide +the number of mipmaps levels they wish to allocate. The value of `num_levels` +must be greater than `1`. + +Mipmap memory allocated this way requires the user to free that memory after all +operations using the memory are completed and no more operations operating on +the memory will be scheduled. This is done using `free_image_mem`, passing +`image_type::mipmap`. Importantly, individual levels of a mipmap must not be +freed before calling `free_image_mem`. + +The RAII class `image_mem` may also be used to perform allocation and +deallocation of mipmap device memory. The constructor and destructor act as a +wrapper for the functions `alloc_image_mem` and `free_image_mem` respectively. + +When the underlying memory of `image_mem` is a mipmap, +`get_mip_level_mem_handle` can be used to return an `image_mem_handle` to a +specific level of the mipmap. This can then be used to copy data to that +specific level or create an image based on that level. + +=== Obtaining a handle to a mipmap image + +A handle to a mipmap image is acquired in the same way as a +`sampled_image_handle`. Mipmaps can only be sampled image types. We can create a +`sampled_image_handle` to the allocated mipmap through the `create_image` +functions which take a `bindless_image_sampler`. To sample a mipmap correctly, +the mipmap attributes of this sampler must be defined. + +Attempting to create an `unsampled_image_handle` to a mipmap will result in a +`sycl::exception` with error code `sycl::errc::runtime` being thrown. + +=== Copying mipmap image data + +In order to copy to or from mipmaps, the user should retrieve an individual +level's `image_mem_handle` through `image_mem::get_mip_level_mem_handle`, which +can then be passed to `ext_oneapi_copy`. The user must ensure that the image +descriptor passed to `ext_oneapi_copy` is dimensioned correctly for the mip +level being copied to/from. The provided `image_descriptor::get_mip_level_desc` +allows the user to retrieve a correctly dimensioned image descriptor for any +level of a given top-level descriptor. + +=== Reading a mipmap + +Inside the kernel, it's possible to read a mipmap via `read_image`, passing the +`sampled_image_handle`, the coordinates, and either the level or anisotropic +gradient values. + +The method of sampling a mipmap is different based on which `read_image` +function is used, and the sampler attributes passed upon creation of the +mipmap. + +```c++ +// Nearest/linear filtering between mip levels +template +DataT read_image(const sampled_image_handle &ImageHandle, + const CoordT &Coords, + const float Level); + +// Anisotropic filtering +template +DataT read_image(const sampled_image_handle &ImageHandle, + const CoordT &Coords, + const CoordT &Dx, const CoordT &Dy); +``` + +== Interoperability + +=== Querying interoperability support + +We provide the following device queries to retrieve information on whether a +SYCL implementation provides support for various interoperability features. + +The device aspect descriptors for these queries are: + +[frame="none",options="header"] +|====================== +|Device descriptor |Description +|`aspect::ext_oneapi_interop_memory_import` | Indicates if the device supports +importing external memory resources. +|`aspect::ext_oneapi_interop_memory_export` | Indicates if the device supports +exporting internal memory resources. +|`aspect::ext_oneapi_interop_semaphore_import`` | Indicates if the device +supports importing external semaphore resources. +|`aspect::ext_oneapi_interop_semaphore_export` | Indicates if the device +supports exporting internal event resources. +|====================== + + +[NOTE] +==== +Not all SYCL backends may provide support for importing or exporting native +memory or semaphore objects. CUDA for example only supports importation of +external memory and semaphores, but provides no support for their exportation. +==== + +=== Importing external memory objects + +In order to import a memory object, an external API must provide an appropriate +handle to that memory. The exact structure and type of this handle can depend on +the external API, and the operating system the application is running on. + +In order to facilitate a number of different external memory handle types, we +propose the following structures. + +[NOTE] +==== +We only show two examples of external memory handle types here, but the +`external_mem_descriptor` struct could be templated by any number of handle +types, provided that the SYCL implementation provides support for them. +==== + +```cpp +namespace sycl::ext::oneapi::experimental { + +// POSIX file descriptor memory handle type +struct external_mem_fd { + int file_descriptor; +}; + +// Windows NT memory handle type +struct external_mem_win32 { + void *handle; + const void *name; +}; + +// Descriptor templated on specific external memory handle type +template +struct external_mem_handle_type { + external_mem_handle_type external_handle; + size_t size_in_bytes; +}; + +} +``` + +The user should create an `external_mem_descriptor` templated on the appropriate +handle type for their purposes, e.g. `external_mem_fd` to describe a POSIX file +descriptor resource on Linux systems, or an `external_mem_win32` for Windows NT +resource handles. + +Once the user populates the `external_mem_descriptor` with the appropriate +`external_mem_handle_type` values, and the size of the external memory in bytes, +they can then import that memory into SYCL through `import_external_memory`. + +```cpp +namespace sycl::ext::oneapi::experimental { + +struct interop_mem_handle { + using raw_handle_type = /* Implementation defined */; + raw_handle_type raw_handle; +}; + +template +interop_mem_handle import_external_memory( + external_mem_descriptor externalMemDescriptor, + const sycl::device &syclDevice, + const sycl::context &syclContext); + +template +interop_mem_handle import_external_memory( + external_mem_descriptor externalMemDescriptor, + const sycl::queue &syclQueue); + +image_mem_handle map_external_memory_array( + interop_mem_handle interopMemHandle, + const image_descriptor &imageDescriptor, + const sycl::device &syclDevice, + const sycl::context &syclContext); +image_mem_handle map_external_memory_array( + interop_mem_handle interopMemHandle, + const image_descriptor &imageDescriptor, + const sycl::queue &syclQueue); + +void *map_external_memory_buffer( + interop_mem_handle interopMemHandle, + const sycl::device &syclDevice, + const sycl::context &syclContext); +void *map_external_memory_buffer( + interop_mem_handle interopMemHandle, + const sycl::queue &syclQueue); +} +``` + +The resulting `interop_mem_handle` can then be mapped to two different memory +types, a buffer, or an array. + +If mapped to a buffer, the resulting `void*` can be used in USM like fashion. +It can be passed as an argument to a kernel and therein manipulated. + +If mapped to an array, the resulting type is an `image_mem_handle`, which can be +used to construct images in the same way as memory allocated through +`alloc_image_mem`. The `ext_oneapi_copy` operations also work with imported +memory mapped to `image_mem_handle` types. + +When calling `create_image` with an `image_mem_handle` mapped from an external +memory object, the user must ensure that the image descriptor they pass to +`create_image` has members that match or map to those of the external API. +A mismatch between any of the `width`, `height`, `depth`, `image_channel_type`, +or `image_channel_order` members will result in undefined behavior. The +`image_type` member must be set to `image_type::interop`. + +Once a user has finished operating on imported memory, they must ensure that +they destroy the imported memory handle through `destroy_external_memory`. + +`destroy_external_memory` can only accept `interop_mem_handles` that were +created through `import_external_memory`. + +```cpp +namespace sycl::ext::oneapi::experimental { + +void destroy_external_memory(interop_mem_handle interopMem, + const sycl::device &syclDevice, + const sycl::context &syclContext); +void destroy_external_memory(interop_mem_handle interopMem, + const sycl::queue &syclQueue); +} +``` + +Destroying or freeing any imported memory through `image_mem_free` or +`sycl::free` will result in undefined behavior. + +=== Importing external semaphores + +In addition to proposing importation of external memory resources, we also +propose importation of synchronization primitives. Just like the sharing of +memory between APIs described above, any external APIs must provide a valid a +handle to a valid semaphore resource they wish to share, and just as external +memory resources handles can take different forms of structure and type +depending on the API and operating system, so do external semaphore resource +handles. + +In order to facilitate a number of different external semaphore handle types, we +propose the following structures. + +[NOTE] +==== +We only show two examples of external semaphore resource handle types here, but +the `external_semaphore_descriptor` struct could be templated by any number of +handle types, provided that the SYCL implementation provides support for them. +==== + +```cpp +namespace sycl::ext::oneapi::experimental { + +// POSIX file descriptor semaphore handle +struct external_semaphore_fd { + int file_descriptor; +}; + +// Windows NT semaphore handle +struct external_semaphore_win32 { + void *handle; + const void *name; +}; + +// Descriptor templated on specific external semaphore handle type +template +struct external_semaphore_descriptor { + external_semaphore_handle_type external_handle; +}; + +} +``` + +The user should create an `external_semaphore_descriptor` templated on the +appropriate handle type for their purposes, e.g. `external_semaphore_fd` to +describe a POSIX file descriptor resource on Linux systems, or an +`external_mem_win32` for Windows NT resource handles. + +Once the user populates the `external_semaphore_descriptor` with the appropriate +`external_semaphore_handle_type` values, they can then import that semaphore +into SYCL through `import_external_semaphore`. + +```cpp +namespace sycl::ext::oneapi::experimental { + +struct interop_semaphore_handle { + using raw_handle_type = /* Implementation defined */; + raw_handle_type raw_handle; +}; + +template +interop_semaphore_handle import_external_semaphore( + external_semaphore_descriptor + externalSemaphoreDescriptor, + const sycl::device &syclDevice, + const sycl::context &syclContext); +} + +template +interop_semaphore_handle import_external_semaphore( + external_semaphore_descriptor + externalSemaphoreDescriptor, + const sycl::queue &syclQueue); +} +``` + +The resulting `interop_semaphore_handle` can then be used in a SYCL command +group, to either wait until the semaphore is in the signaled state, or set the +semaphore to a signaled state. + +We propose to extend the SYCL queue and handler classes with semaphore waiting +and signalling operations. + +```cpp +namespace sycl { + +class handler { +public: + void ext_oneapi_wait_external_semaphore( + ext::oneapi::experimental::interop_semaphore_handle + interop_semaphore_handle); + + void ext_oneapi_signal_external_semaphore( + ext::oneapi::experimental::interop_semaphore_handle + interop_semaphore_handle); +}; + +class queue { +public: + event ext_oneapi_wait_external_semaphore( + ext::oneapi::experimental::interop_semaphore_handle + interop_semaphore_handle); + event ext_oneapi_wait_external_semaphore( + ext::oneapi::experimental::interop_semaphore_handle + interop_semaphore_handle, + event DepEvent); + event ext_oneapi_wait_external_semaphore( + ext::oneapi::experimental::interop_semaphore_handle + interop_semaphore_handle, + const std::vector &DepEvents); + + event ext_oneapi_signal_external_semaphore( + ext::oneapi::experimental::interop_semaphore_handle + interop_semaphore_handle); + event ext_oneapi_signal_external_semaphore( + ext::oneapi::experimental::interop_semaphore_handle + interop_semaphore_handle, + event DepEvent); + event ext_oneapi_signal_external_semaphore( + ext::oneapi::experimental::interop_semaphore_handle + interop_semaphore_handle, + const std::vector &DepEvents); +}; +} +``` + +Any operations submitted to the queue after a +`ext_oneapi_wait_external_semaphore` call will not begin until the imported +semaphore is in a signaled state. + +When `ext_oneapi_signal_external_semaphore` is called, the external semaphore +will be set to the signaled state after all commands submitted to the queue +prior to the `ext_oneapi_signal_external_semaphore` call complete. + +`ext_oneapi_wait_external_semaphore` and `ext_oneapi_signal_external_semaphore` +are non-blocking, asynchronous operations. + +The user must ensure to destroy all external semaphore objects once they are no +longer required through `destroy_external_semaphore`. + +```cpp +namespace sycl::ext::oneapi::experimental { + +void destroy_external_semaphore(interop_semaphore_handle semaphoreHandle, + const sycl::device &syclDevice, + const sycl::context &syclContext); + +void destroy_external_semaphore(interop_semaphore_handle semaphoreHandle, + const sycl::queue &syclQueue); + +} +``` + +== Examples + +=== 1D image read/write + +```cpp +// Set up device, queue, and context +sycl::device device; +sycl::queue queue(device); +sycl::context context = queue.get_context(); + +// Initialize input data +constexpr size_t width = 512; +std::vector dataIn(width); +std::vector dataOut(width); +for (int i = 0; i < width; i++) { + dataIn[i] = static_cast(i); +} + +// Image descriptor - can use the same for both images +sycl::ext::oneapi::experimental::image_descriptor desc( + sycl::range{width}, sycl::ext::oneapi::experimental::image_channel_order::r, + sycl::ext::oneapi::experimental::image_channel_type::fp32); + +try { + // Extension: returns the device pointer to the allocated memory + sycl::ext::oneapi::experimental::image_mem imgMemoryIn(desc, queue); + sycl::ext::oneapi::experimental::image_mem imgMemoryOut(desc, queue); + + // Extension: create the image and return the handle + sycl::ext::oneapi::experimental::unsampled_image_handle imgIn = + sycl::ext::oneapi::experimental::create_image(imgMemoryIn, desc, queue); + sycl::ext::oneapi::experimental::unsampled_image_handle imgOut = + sycl::ext::oneapi::experimental::create_image(imgMemoryOut, desc, queue); + + // Extension: copy over data to device + q.ext_oneapi_copy(dataIn.data(), imgMemoryIn, desc); + + // Bindless images require manual synchronization + // Wait for copy operation to finish + q.wait_and_throw(); + + q.submit([&](sycl::handler &cgh) { + // No need to request access, handles captured by value + + cgh.parallel_for(width, [=](sycl::id<1> id) { + // Extension: read image data from handle + float pixel = sycl::ext::oneapi::experimental::read_image( + imgIn, int(id[0])); + + // Extension: write to image data using handle + sycl::ext::oneapi::experimental::write_image(imgOut, int(id[0]), pixel); + }); + }); + + // Using image handles requires manual synchronization + q.wait_and_throw(); + + // Copy data written to imgOut to host + q.ext_oneapi_copy(imgMemoryOut, dataOut.data(), desc); + + // Cleanup + sycl::ext::oneapi::experimental::destroy_image_handle(imgIn, queue); + sycl::ext::oneapi::experimental::destroy_image_handle(imgOut, queue); +} catch (sycl::exception e) { + std::cerr << "SYCL exception caught: " << e.what(); + exit(-1); +} + +// Validate that `dataIn` correctly transferred to `dataOut` +bool validated = (dataIn == dataOut); +``` + +=== Reading from a dynamically sized array of 2D images + + +```cpp +// Set up device, queue, and context +sycl::device device; +sycl::queue queue(device); +sycl::context context = queue.get_context(); + +// declare image data +size_t numImages = 5; +size_t width = 8; +size_t height = 8; +size_t numPixels = width * height; +std::vector dataIn(numPixels); +std::vector dataOut(numPixels); +std::vector dataExpected(numPixels); +for (int i = 0; i < width; i++) { + for (int j = 0; j < height; j++) { + int index = j + (height * i); + dataIn[index] = index; + dataExpected[index] = index * numImages; + } +} + +// Image descriptor - can use the same for all images +sycl::ext::oneapi::experimental::image_descriptor desc( + {width, height}, sycl::ext::oneapi::experimental::image_channel_order::r, + sycl::ext::oneapi::experimental::image_channel_type::fp32); + +try { + + // Allocate each image and save the handles + std::vector imgAllocations; + for (int i = 0; i < numImages; i++) { + // Extension: move-construct device allocated memory + imgAllocations.emplace_back( + sycl::ext::oneapi::experimental::image_mem{desc, queue}); + } + + // Copy over data to device for each image + for (int i = 0; i < numImages; i++) { + // Extension: copy over data to device + q.ext_oneapi_copy(dataIn.data(), imgAllocations[i], desc); + } + + // Wait for copy operations to finish + q.wait_and_throw(); + + // Create the images and return the handles + std::vector + imgHandles; + for (int i = 0; i < numImages; i++) { + // Extension: create the image and return the handle + sycl::ext::oneapi::experimental::unsampled_image_handle imgHandle = + sycl::ext::oneapi::experimental::create_image(imgAllocations[i], + desc, queue); + imgHandles.push_back(imgHandle); + } + + sycl::buffer outBuf{dataOut.data(), sycl::range{height, width}}; + sycl::buffer imgHandlesBuf{imgHandles.data(), sycl::range{numImages}}; + q.submit([&](sycl::handler &cgh) { + sycl::accessor outAcc{outBuf, cgh, sycl::write_only}; + sycl::accessor imgHandleAcc{imgHandlesBuf, cgh, sycl::read_only}; + + cgh.parallel_for( + sycl::nd_range<2>{{width, height}, {width, height}}, + [=](sycl::nd_item<2> it) { + size_t dim0 = it.get_local_id(0); + size_t dim1 = it.get_local_id(1); + + // Sum each image by reading via its handle + float sum = 0; + for (int i = 0; i < numImages; i++) { + // Extension: read image data from handle + sum += (sycl::ext::oneapi::experimental::read_image( + imgHandleAcc[i], sycl::vec(dim0, dim1))); + } + outAcc[sycl::id{dim1, dim0}] = sum; + }); + }); + + // Using image handles requires manual synchronization + q.wait_and_throw(); + + // Cleanup + for (int i = 0; i < numImages; i++) { + sycl::ext::oneapi::experimental::destroy_image_handle(imgHandles[i], queue); + } +} catch (sycl::exception e) { + std::cerr << "SYCL exception caught: " << e.what(); + exit(-1); +} + +// Validate that `dataOut` is correct +bool validated = (dataOut == dataExpected); +``` + +=== Reading a 1D mipmap with anisotropic filtering and levels +```cpp +// Set up device, queue, and context +sycl::device device; +sycl::queue queue(device); +sycl::context context = q.get_context(); + +// declare image data +constexpr size_t width = 16; +unsigned int num_levels = 2; +std::vector dataIn1(width); +std::vector dataIn2(width / 2); +std::vector dataOut(width); +std::vector dataExpected(width); +int j = 0; +for (int i = 0; i < width; i++) { + dataExpected[i] = static_cast(i + (j + 10)); + if (i % 2) + j++; + dataIn1[i] = static_cast(i); + if (i < (N / 2)) + dataIn2[i] = static_cast(i + 10); +} + +try { + + // Image descriptor -- number of levels + sycl::ext::oneapi::experimental::image_descriptor desc( + {width}, sycl::ext::oneapi::experimental::image_channel_order::r, + sycl::ext::oneapi::experimental::image_channel_type::fp32, + sycl::ext::oneapi::experimental::image_type::mipmap, num_levels); + + // Allocate the mipmap + sycl::ext::oneapi::experimental::image_mem mip_mem(desc, queue); + + // Retrieve level 0 + sycl::ext::oneapi::experimental::image_mem_handle img_mem1 = + mip_mem.get_mip_level_mem_handle(0) + + // Copy over data to level 0 + q.ext_oneapi_copy(dataIn1.data(), img_mem1, desc); + + // Copy over data to level 1 + q.ext_oneapi_copy(dataIn2.data(), mip_mem.get_mip_level_mem_handle(1), + desc.get_mip_level_desc(1)); + q.wait_and_throw(); + + // Extended sampler object to take in mipmap attributes + sycl::ext::oneapi::experimental::bindless_image_sampler samp( + addressing_mode::mirrored_repeat, + coordinate_normalization_mode::normalized, filtering_mode::nearest, + mipmap_filtering_mode::nearest, 0.0f, (float)num_levels, 8.0f); + + // Create a sampled image handle to represent the mipmap + sycl::ext::oneapi::experimental::sampled_image_handle mipHandle = + sycl::ext::oneapi::experimental::create_image(mip_mem, samp, desc, queue); + q.wait_and_throw(); + + sycl::buffer buf((float *)dataOut.data(), width); + q.submit([&](handler &cgh) { + auto outAcc = buf.get_access(cgh, width); + + cgh.parallel_for(width, [=](id<1> id) { + float sum = 0; + float x = (static_cast(id[0]) + 0.5f) / static_cast(width); + // Read mipmap level 0 with anisotropic filtering + // and level 1 with level filtering + float px1 = sycl::ext::oneapi::experimental::read_image( + mipHandle, x, 0.0f, 0.0f); + float px2 = sycl::ext::oneapi::experimental::read_image( + mipHandle, x, 1.0f); + + sum = px1 + px2; + outAcc[id] = sum; + }); + }); + + q.wait_and_throw(); + + // Cleanup + sycl::ext::oneapi::experimental::destroy_image_handle(mipHandle, queue); + +} catch (sycl::exception e) { + std::cerr << "SYCL exception caught! : " << e.what() << "\n"; + exit(-1); +} catch (...) { + std::cerr << "Unknown exception caught!\n"; + exit(-1); +} + +// Validate that `dataOut` is correct +bool validated = (dataOut == dataExpected); +``` + +=== Using imported memory and semaphore objects + +```c++ +// Set up device, queue, and context +sycl::device device; +sycl::queue queue(device); +sycl::context context = queue.get_context(); + +size_t width = /* passed from external API */; +size_t height = /* passed from external API */; + +sycl::ext::oneapi::experimental::image_channel_order channel_order = + /* mapped from external API */ + /* we assume sycl::image_channel_order::r */; + +sycl::ext::oneapi::experimental::image_channel_type channel_type = + /* mapped from external API */ + /* we assume sycl::image_channel_type::unsigned_int32 */; + +// Image descriptor - mapped to external API image layout +// with `image_type::interop` +sycl::ext::oneapi::experimental::image_descriptor desc( + {width, height}, channel_order, channel_type, + sycl::ext::oneapi::experimental::image_type::interop); + +size_t img_size_in_bytes = width * height * sizeof(uint32_t); + +int external_input_image_file_descriptor = /* passed from external API */ +int external_output_image_file_descriptor = /* passed from external API */ + +// Extension: populate external memory descriptors +sycl::ext::oneapi::experimental::external_mem_descriptor< + sycl::ext::oneapi::experimental::external_mem_fd> + input_ext_mem_desc{external_input_image_file_descriptor, + img_size_in_bytes}; + +sycl::ext::oneapi::experimental::external_mem_descriptor< + sycl::ext::oneapi::experimental::external_mem_fd> + output_ext_mem_desc{external_output_image_file_descriptor, + img_size_in_bytes}; + +// An external API semaphore will signal this semaphore before our SYCL commands +// can begin execution +int wait_semaphore_file_descriptor = /* passed from external API */; + +// An external API will wait on this semaphore to be signalled by us before it +// can execute some commands +int done_semaphore_file_descriptor = /* passed from external API */; + +// Extension: populate external semaphore descriptor. +// We assume POSIX file descriptor resource types +sycl::ext::oneapi::experimental::external_semaphore_descriptor< + sycl::ext::oneapi::experimental::external_semaphore_fd> + wait_external_semaphore_desc{wait_semaphore_file_descriptor}; + +sycl::ext::oneapi::experimental::external_semaphore_descriptor< + sycl::ext::oneapi::experimental::external_semaphore_fd> + done_external_semaphore_desc{done_semaphore_file_descriptor}; + +try { + // Extension: import external semaphores + sycl::ext::oneapi::experimental::interop_semaphore_handle + wait_interop_semaphore_handle = + sycl::ext::oneapi::experimental::import_external_semaphore( + wait_external_semaphore_desc, queue); + + sycl::ext::oneapi::experimental::interop_semaphore_handle + done_interop_semaphore_handle = + sycl::ext::oneapi::experimental::import_external_semaphore( + done_external_semaphore_desc, queue); + + // Extension: import external memory from descriptors + sycl::ext::oneapi::experimental::interop_mem_handle + input_interop_mem_handle = + sycl::ext::oneapi::experimental::import_external_memory( + input_ext_mem_desc, queue); + + sycl::ext::oneapi::experimental::interop_mem_handle + output_interop_mem_handle = + sycl::ext::oneapi::experimental::import_external_memory( + output_ext_mem_desc, queue); + + // Extension: map imported external memory to image memory + sycl::ext::oneapi::experimental::image_mem_handle input_mapped_mem_handle = + sycl::ext::oneapi::experimental::map_external_memory_array( + input_interop_mem_handle, desc, queue); + sycl::ext::oneapi::experimental::image_mem_handle output_mapped_mem_handle = + sycl::ext::oneapi::experimental::map_external_memory_array( + output_interop_mem_handle, desc, queue); + + // Extension: create images from mapped memory and return the handles + sycl::ext::oneapi::experimental::unsampled_image_handle img_input = + sycl::ext::oneapi::experimental::create_image( + input_mapped_mem_handle, desc, queue); + sycl::ext::oneapi::experimental::unsampled_image_handle img_output = + sycl::ext::oneapi::experimental::create_image( + output_mapped_mem_handle, desc, queue); + + // Extension: wait for imported semaphore + q.ext_oneapi_wait_external_semaphore(wait_interop_semaphore_handle) + + // Submit our kernel that depends on imported "wait_semaphore_file_descriptor" + q.submit([&](sycl::handler &cgh) { + cgh.parallel_for<>( + sycl::nd_range<2>{{width, height}, {32, 32}}, + [=](sycl::nd_item<2> it) { + size_t dim0 = it.get_global_id(0); + size_t dim1 = it.get_global_id(1); + + // Extension: read image data from handle to imported image + uint32_t pixel = + sycl::ext::oneapi::experimental::read_image( + img_input, sycl::vec(dim0, dim1)); + + // Modify the data before writing back + pixel *= 10; + + // Extension: write image data using handle to imported image + sycl::ext::oneapi::experimental::write_image( + img_output, sycl::vec(dim0, dim1), pixel); + }); + }); + + // Extension: signal imported semaphore + q.ext_oneapi_signal_external_semaphore(done_interop_semaphore_handle) + + // The external API can now use the semaphore it exported to + // "done_semaphore_file_descriptor" to schedule its own command submissions + + q.wait_and_throw(); + + // Extension: destroy all external resources + sycl::ext::oneapi::experimental::destroy_external_memory( + input_interop_mem_handle, queue); + sycl::ext::oneapi::experimental::destroy_external_memory( + output_interop_mem_handle, queue); + sycl::ext::oneapi::experimental::destroy_external_semaphore( + wait_interop_semaphore_handle, queue); + sycl::ext::oneapi::experimental::destroy_external_semaphore( + done_interop_semaphore_handle, queue); + sycl::ext::oneapi::experimental::destroy_image_handle(img_input, queue); + sycl::ext::oneapi::experimental::destroy_image_handle(img_output, queue); +} catch (sycl::exception e) { + std::cerr << "SYCL exception caught! : " << e.what() << "\n"; + exit(-1); +} catch (...) { + std::cerr << "Unknown exception caught!\n"; + exit(-1); +} +``` + +== Implementation notes + +The current DPC++ prototype only implements the proposal for the CUDA backend, +however we are actively exploring Level Zero with SPIR-V. +We are looking at other backend as well in order to ensure the extension can +work across different backends. + +== Issues + +=== No dependency tracking + +Because this extension allows images to work in a USM-like model, +there are similar limitations to using USM for non-images, +mainly the lack of dependency tracking and the need for users to manually +synchronize operations. + +=== Limitations when using USM as image memory + +There are dimension specific limitations: + +* 1D - Linear interpolation not possible in the CUDA backend. + A workaround is to allocate 2D pitched memory with a height of 1. +* 2D - There are some alignment restrictions. See the "Pitch alignment + restrictions and queries" section, or use `pitched_alloc_device` to + allocate 2D USM image memory. +* 3D - No support at the moment. Possible support in non CUDA backends in the + future. + +=== Not supported yet + +These features still need to be handled: + +* Level Zero and SPIR-V support +* Layered images +* Cubemap images + +== Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Changes +|1 |2023-02-03 | Initial draft +|2 |2023-02-23 | - Added `image_mem_handle` for image memory allocated with + `allocate_image` + + - Added ability to create images from USM + + - Added new way to copy images, removed requirement for copy + direction + + - Added image memory information getters to reflect + `cuArray3DGetDescriptor` functionality +|3 |2023-03-30 | - Some text clarifications. + + - Unsampled images can no longer be created from USM. + + - Added SYCL 1.2.1 `image_channel_order` and + `image_channel_type` structs. + + - Added `image_type` to enable construction of layered, + mipmap, and cubemap images in the future. + + - Added device information descriptors for querying pitched + allocation size and alignment requirement. + + - Added `ext_oneapi_copy` methods for the `sycl::handler`. + + - `ext_oneapi_copy` functions now take the `Src` as the first + parameter. + + - Created `image_mem` as a RAII style class. + + - Renamed `allocate_image` to `alloc_image_mem` + + - `pitched_alloc_device` can now take an `image_descriptor`. + + - Added interoperability features + + - Added support to query bindless image and interoperability + capabilities + + - Added mipmap support +|4 |2023-06-23 | - Added `sycl::device` parameter to multiple functions to + clarify that images must be created and used on the same + device. + + - Changed naming and order of some parameters to be consistent + throughout the proposal and with core SYCL. + + - Added variants of functions that take a `sycl::queue` + instead of both `sycl::device` and `sycl::context`. + + - Removed standalone wait and signal semaphore functions. These + should always go through the queue or handler methods. + + - Removed `get_image_handle` and `get_sampler_handle` functions + from sampled and unsampled image handle structs. The structs + have public handle members that can be retrieved without + getters. + + - Made all enum types and values unspecified + + - Moved support queries to device aspects, improved naming of + queries for better consistency, and moved device info queries + to the experimental namespace. + + - Added `get_mip_level_desc` member function to + `image_descriptor` + + - Fixed `get_mip_level_mem_handle` prototype in `image_mem`, + and added a standalone function. + + - Removed `ext_oneapi_copy` variants that take `image_mem`, + the user should retrieve the raw handle and pass that + themselves. + + - Removed `ext_oneapi_copy` variants that take a mip level, + the user should retrieve individual mip level image handles + themselves and pass that. + + - Added `ext_oneapi_copy` variants that take offsets and the + extent, to enable sub-region copy. + + - Created a list of failure scenarios for `ext_oneapi_copy`, + changed the failure error code to `errc::invalid`, and + specified that the implementation should relay the reason + for the failure back to the user. + + - Added a `bindless_image_sampler` struct. + + - Specified that `image_mem` must follow Common Reference + Semantics. + + - Updated code samples. +|======================