Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
19 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,157 @@
= sycl_ext_intel_queue_immediate_command_list

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

[%hardbreaks]
Copyright (C) 2023-2023 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 7 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.


== Status

This extension is implemented and fully supported by {dpcpp}.

However, the immediate command list feature (which is exposed by this extension)
has been well-tested only on Intel (R) Data Center Max Series GPUs (aka PVC).
Use of this extension to specify immediate command lists is not recommended
for other Intel GPUs.


== Backend support status

The properties added by this extension are a hint, which all backends accept.
However, in the current {dpcpp} implementation, the hint is only meaningful
on the Level Zero backend.

== Overview

When commands are submitted to a SYCL queue that uses the Level Zero backend,
those commands can be submitted to the hardware in one of two ways:
either through an immediate command list or through a standard command queue.
Commands submitted through an immediate command list are immediately submitted
to the device while commands submitted through a standard command queue may be
batched with other commands before they are submitted. By default the
implementation chooses a method that works best for most workloads.

In most cases, applications should rely on the default behavior.
However, sometimes it is advantageous for the application to choose one method
or the other. This extension provides a way for applications to select either
of these two methods via a queue property.

For example, when kernel runtimes are very short, the submission time on the
host may be as long or longer than the actual runtime of the kernel. In this
case, doing batched submissions may be preferable so that the submission
overhead is amortized over a number of kernel executions.


== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_INTEL_QUEUE_IMMEDIATE_COMMAND_LIST` to one of the values
defined in the table below. Applications can test for the existence of this
macro to determine if the implementation supports this feature, or
applications can test the macro's value to determine which of the
extension's features the implementation supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|Initial version of this extension.
|===

=== New queue properties
This extension adds the following new properties that can be used when
constructing a queue object.

```c++
namespace sycl::ext::intel::property::queue {

struct immediate_command_list {};
struct no_immediate_command_list {};

} // namespace sycl::ext::intel::property::queue

```


Both properties are hints, which are ignored unless the backend is Level Zero.

The property `immediate_command_list` requests that the implementation use an
immediate command list when commands are submitted to this queue. As a result,
these commands are submitted immediately to the device.

The property `no_immediate_command_list` requests that the implementation use
a standard command queue instead of an immediate command list. As a result,
commands submitted to this queue may be batched with other commands before
being submitted to the device.

These two properties are mutually exclusive. Constructing a queue with both
properties causes the constructor to throw a synchronous exception with
the `errc::invalid` error code.

== Example
```c++
#include <sycl/sycl.hpp>

namespace syclintel = sycl::ext::intel;

int main() {
// Use immediate command lists
sycl::queue q1{syclintel::property::queue::immediate_command_list{}};
...

// Do not use immediate command lists
sycl::queue q2{syclintel::property::queue::no_immediate_command_list{}};
...
}
```

== Interaction with the SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS environment variable

{dpcpp} supports an environment variable named
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS which also controls
the use of immediate command lists in SYCL queues. When that
environment variable is used in conjunction with the properties in this
extension, the properties take precedence. The environment variable has
no effect on queues constructed with one of these properties, however it
still affects queues that were not constructed with either of these properties.
2 changes: 2 additions & 0 deletions sycl/include/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -662,6 +662,8 @@ constexpr pi_queue_properties PI_QUEUE_FLAG_ON_DEVICE_DEFAULT = (1 << 3);
constexpr pi_queue_properties PI_EXT_ONEAPI_QUEUE_FLAG_DISCARD_EVENTS = (1 << 4);
constexpr pi_queue_properties PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_LOW = (1 << 5);
constexpr pi_queue_properties PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_HIGH = (1 << 6);
constexpr pi_queue_properties PI_EXT_QUEUE_FLAG_SUBMISSION_NO_IMMEDIATE = (1 << 7);
constexpr pi_queue_properties PI_EXT_QUEUE_FLAG_SUBMISSION_IMMEDIATE = (1 << 8);
// clang-format on

typedef enum {
Expand Down
4 changes: 3 additions & 1 deletion sycl/include/sycl/detail/property_helper.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -44,8 +44,10 @@ enum DataLessPropKind {
QueuePriorityLow = 17,
QueuePriorityHigh = 18,
GraphNoCycleCheck = 19,
QueueSubmissionBatched = 20,
QueueSubmissionImmediate = 21,
// Indicates the last known dataless property.
LastKnownDataLessPropKind = 19,
LastKnownDataLessPropKind = 21,
// Exceeding 32 may cause ABI breaking change on some of OSes.
DataLessPropKindSize = 32
};
Expand Down
4 changes: 4 additions & 0 deletions sycl/include/sycl/properties/queue_properties.def
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,10 @@ __SYCL_DATA_LESS_PROP(ext::oneapi::property::queue, priority_low,
QueuePriorityLow)
__SYCL_DATA_LESS_PROP(ext::oneapi::property::queue, priority_high,
QueuePriorityHigh)
__SYCL_DATA_LESS_PROP(ext::intel::property::queue, no_immediate_command_list,
QueueSubmissionBatched)
__SYCL_DATA_LESS_PROP(ext::intel::property::queue, immediate_command_list,
QueueSubmissionImmediate)

__SYCL_DATA_LESS_PROP(ext::oneapi::cuda::property::queue, use_default_stream,
UseDefaultStream)
Expand Down
8 changes: 7 additions & 1 deletion sycl/plugins/unified_runtime/pi2ur.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -1455,7 +1455,9 @@ inline pi_result piextQueueCreate(pi_context Context, pi_device Device,
PI_QUEUE_FLAG_ON_DEVICE_DEFAULT |
PI_EXT_ONEAPI_QUEUE_FLAG_DISCARD_EVENTS |
PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_LOW |
PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_HIGH)),
PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_HIGH |
PI_EXT_QUEUE_FLAG_SUBMISSION_NO_IMMEDIATE |
PI_EXT_QUEUE_FLAG_SUBMISSION_IMMEDIATE)),
PI_ERROR_INVALID_VALUE);

PI_ASSERT(Context, PI_ERROR_INVALID_CONTEXT);
Expand All @@ -1482,6 +1484,10 @@ inline pi_result piextQueueCreate(pi_context Context, pi_device Device,
UrProperties.flags |= UR_QUEUE_FLAG_SYNC_WITH_DEFAULT_STREAM;
if (Properties[1] & __SYCL_PI_CUDA_USE_DEFAULT_STREAM)
UrProperties.flags |= UR_QUEUE_FLAG_USE_DEFAULT_STREAM;
if (Properties[1] & PI_EXT_QUEUE_FLAG_SUBMISSION_NO_IMMEDIATE)
UrProperties.flags |= UR_QUEUE_FLAG_SUBMISSION_BATCHED;
if (Properties[1] & PI_EXT_QUEUE_FLAG_SUBMISSION_IMMEDIATE)
UrProperties.flags |= UR_QUEUE_FLAG_SUBMISSION_IMMEDIATE;

ur_queue_index_properties_t IndexProperties{};
IndexProperties.stype = UR_STRUCTURE_TYPE_QUEUE_INDEX_PROPERTIES;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -583,7 +583,7 @@ ur_result_t ur_context_handle_t_::getAvailableCommandList(
bool UseCopyEngine, bool AllowBatching,
ze_command_queue_handle_t *ForcedCmdQueue) {
// Immediate commandlists have been pre-allocated and are always available.
if (Queue->Device->ImmCommandListUsed) {
if (Queue->UsingImmCmdLists) {
CommandList = Queue->getQueueGroup(UseCopyEngine).getImmCmdList();
if (CommandList->second.EventList.size() >
ImmCmdListsEventCleanupThreshold) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -971,10 +971,6 @@ ur_result_t ur_device_handle_t_::initialize(int SubSubDeviceOrdinal,

ImmCommandListUsed = this->useImmediateCommandLists();

if (ImmCommandListUsed == ImmCmdlistMode::NotUsed) {
ZeEventsScope = DeviceEventsSetting;
}

uint32_t numQueueGroups = 0;
ZE2UR_CALL(zeDeviceGetCommandQueueGroupProperties,
(ZeDevice, &numQueueGroups, nullptr));
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -135,11 +135,6 @@ struct ur_device_handle_t_ : _ur_object {
// Returns whether immediate command lists are used on this device.
ImmCmdlistMode ImmCommandListUsed{};

// Scope of events used for events on the device
// Can be adjusted with UR_L0_DEVICE_SCOPE_EVENTS
// for non-immediate command lists
EventsScope ZeEventsScope = AllHostVisible;

bool isSubDevice() { return RootDevice != nullptr; }

// Is this a Data Center GPU Max series (aka PVC)?
Expand Down
15 changes: 7 additions & 8 deletions sycl/plugins/unified_runtime/ur/adapters/level_zero/event.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueEventsWait(
}
}

if (!Queue->Device->ImmCommandListUsed) {
if (!Queue->UsingImmCmdLists) {
std::unique_lock<ur_shared_mutex> Lock(Queue->Mutex);
resetCommandLists(Queue);
}
Expand Down Expand Up @@ -270,7 +270,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueEventsWaitWithBarrier(
for (auto &QueueGroup : QueueMap) {
bool UseCopyEngine =
QueueGroup.second.Type != ur_queue_handle_t_::queue_type::Compute;
if (Queue->Device->ImmCommandListUsed) {
if (Queue->UsingImmCmdLists) {
// If immediate command lists are being used, each will act as their own
// queue, so we must insert a barrier into each.
for (auto &ImmCmdList : QueueGroup.second.ImmCmdLists)
Expand Down Expand Up @@ -498,7 +498,7 @@ ur_result_t ur_event_handle_t_::getOrCreateHostVisibleEvent(
this->Mutex);

if (!HostVisibleEvent) {
if (UrQueue->Device->ZeEventsScope != OnDemandHostVisibleProxy)
if (UrQueue->ZeEventsScope != OnDemandHostVisibleProxy)
die("getOrCreateHostVisibleEvent: missing host-visible event");

// Submit the command(s) signalling the proxy event to the queue.
Expand Down Expand Up @@ -538,8 +538,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEventWait(
///< events to wait for completion
) {
for (uint32_t I = 0; I < NumEvents; I++) {
if (EventWaitList[I]->UrQueue->Device->ZeEventsScope ==
OnDemandHostVisibleProxy) {
if (EventWaitList[I]->UrQueue->ZeEventsScope == OnDemandHostVisibleProxy) {
// Make sure to add all host-visible "proxy" event signals if needed.
// This ensures that all signalling commands are submitted below and
// thus proxy events can be waited without a deadlock.
Expand Down Expand Up @@ -587,7 +586,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEventWait(
}
}
if (auto Q = Event->UrQueue) {
if (Q->Device->ImmCommandListUsed && Q->isInOrderQueue())
if (Q->UsingImmCmdLists && Q->isInOrderQueue())
// Use information about waited event to cleanup completed events in
// the in-order queue.
CleanupEventsInImmCmdLists(
Expand Down Expand Up @@ -1029,7 +1028,7 @@ ur_result_t _ur_ze_event_list_t::createAndRetainUrZeEventList(
this->UrEventList = nullptr;

if (CurQueue->isInOrderQueue() && CurQueue->LastCommandEvent != nullptr) {
if (CurQueue->Device->ImmCommandListUsed) {
if (CurQueue->UsingImmCmdLists) {
if (ReuseDiscardedEvents && CurQueue->isDiscardEvents()) {
// If queue is in-order with discarded events and if
// new command list is different from the last used command list then
Expand Down Expand Up @@ -1158,7 +1157,7 @@ ur_result_t _ur_ze_event_list_t::createAndRetainUrZeEventList(
//
// Make sure that event1.wait() will wait for a host-visible
// event that is signalled before the command2 is enqueued.
if (CurQueue->Device->ZeEventsScope != AllHostVisible) {
if (CurQueue->ZeEventsScope != AllHostVisible) {
CurQueue->executeAllOpenCommandLists();
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -209,7 +209,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueKernelLaunch(
if (IndirectAccessTrackingEnabled)
Queue->KernelsToBeSubmitted.push_back(Kernel);

if (Queue->Device->ImmCommandListUsed && IndirectAccessTrackingEnabled) {
if (Queue->UsingImmCmdLists && IndirectAccessTrackingEnabled) {
// If using immediate commandlists then gathering of indirect
// references and appending to the queue (which means submission)
// must be done together.
Expand Down
Loading