Skip to content

Commit 2f2560f

Browse files
authored
[SYCL] Add queue properties to select submission mode (#9554)
This change adds the queue properties immediate_command_list and no_immediate_command_list to enable overriding the queue submission defaults.
1 parent 7c8932a commit 2f2560f

File tree

18 files changed

+301
-46
lines changed

18 files changed

+301
-46
lines changed
Lines changed: 157 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,157 @@
1+
= sycl_ext_intel_queue_immediate_command_list
2+
3+
:source-highlighter: coderay
4+
:coderay-linenums-mode: table
5+
6+
// This section needs to be after the document title.
7+
:doctype: book
8+
:toc2:
9+
:toc: left
10+
:encoding: utf-8
11+
:lang: en
12+
:dpcpp: pass:[DPC++]
13+
14+
// Set the default source code type in this document to C++,
15+
// for syntax highlighting purposes. This is needed because
16+
// docbook uses c++ and html5 uses cpp.
17+
:language: {basebackend@docbook:c++:cpp}
18+
19+
20+
== Notice
21+
22+
[%hardbreaks]
23+
Copyright (C) 2023-2023 Intel Corporation. All rights reserved.
24+
25+
Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
26+
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
27+
permission by Khronos.
28+
29+
30+
== Contact
31+
32+
To report problems with this extension, please open a new issue at:
33+
34+
https://github.com/intel/llvm/issues
35+
36+
37+
== Dependencies
38+
39+
This extension is written against the SYCL 2020 revision 7 specification. All
40+
references below to the "core SYCL specification" or to section numbers in the
41+
SYCL specification refer to that revision.
42+
43+
44+
== Status
45+
46+
This extension is implemented and fully supported by {dpcpp}.
47+
48+
However, the immediate command list feature (which is exposed by this extension)
49+
has been well-tested only on Intel (R) Data Center Max Series GPUs (aka PVC).
50+
Use of this extension to specify immediate command lists is not recommended
51+
for other Intel GPUs.
52+
53+
54+
== Backend support status
55+
56+
The properties added by this extension are a hint, which all backends accept.
57+
However, in the current {dpcpp} implementation, the hint is only meaningful
58+
on the Level Zero backend.
59+
60+
== Overview
61+
62+
When commands are submitted to a SYCL queue that uses the Level Zero backend,
63+
those commands can be submitted to the hardware in one of two ways:
64+
either through an immediate command list or through a standard command queue.
65+
Commands submitted through an immediate command list are immediately submitted
66+
to the device while commands submitted through a standard command queue may be
67+
batched with other commands before they are submitted. By default the
68+
implementation chooses a method that works best for most workloads.
69+
70+
In most cases, applications should rely on the default behavior.
71+
However, sometimes it is advantageous for the application to choose one method
72+
or the other. This extension provides a way for applications to select either
73+
of these two methods via a queue property.
74+
75+
For example, when kernel runtimes are very short, the submission time on the
76+
host may be as long or longer than the actual runtime of the kernel. In this
77+
case, doing batched submissions may be preferable so that the submission
78+
overhead is amortized over a number of kernel executions.
79+
80+
81+
== Specification
82+
83+
=== Feature test macro
84+
85+
This extension provides a feature-test macro as described in the core SYCL
86+
specification. An implementation supporting this extension must predefine the
87+
macro `SYCL_EXT_INTEL_QUEUE_IMMEDIATE_COMMAND_LIST` to one of the values
88+
defined in the table below. Applications can test for the existence of this
89+
macro to determine if the implementation supports this feature, or
90+
applications can test the macro's value to determine which of the
91+
extension's features the implementation supports.
92+
93+
[%header,cols="1,5"]
94+
|===
95+
|Value
96+
|Description
97+
98+
|1
99+
|Initial version of this extension.
100+
|===
101+
102+
=== New queue properties
103+
This extension adds the following new properties that can be used when
104+
constructing a queue object.
105+
106+
```c++
107+
namespace sycl::ext::intel::property::queue {
108+
109+
struct immediate_command_list {};
110+
struct no_immediate_command_list {};
111+
112+
} // namespace sycl::ext::intel::property::queue
113+
114+
```
115+
116+
117+
Both properties are hints, which are ignored unless the backend is Level Zero.
118+
119+
The property `immediate_command_list` requests that the implementation use an
120+
immediate command list when commands are submitted to this queue. As a result,
121+
these commands are submitted immediately to the device.
122+
123+
The property `no_immediate_command_list` requests that the implementation use
124+
a standard command queue instead of an immediate command list. As a result,
125+
commands submitted to this queue may be batched with other commands before
126+
being submitted to the device.
127+
128+
These two properties are mutually exclusive. Constructing a queue with both
129+
properties causes the constructor to throw a synchronous exception with
130+
the `errc::invalid` error code.
131+
132+
== Example
133+
```c++
134+
#include <sycl/sycl.hpp>
135+
136+
namespace syclintel = sycl::ext::intel;
137+
138+
int main() {
139+
// Use immediate command lists
140+
sycl::queue q1{syclintel::property::queue::immediate_command_list{}};
141+
...
142+
143+
// Do not use immediate command lists
144+
sycl::queue q2{syclintel::property::queue::no_immediate_command_list{}};
145+
...
146+
}
147+
```
148+
149+
== Interaction with the SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS environment variable
150+
151+
{dpcpp} supports an environment variable named
152+
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS which also controls
153+
the use of immediate command lists in SYCL queues. When that
154+
environment variable is used in conjunction with the properties in this
155+
extension, the properties take precedence. The environment variable has
156+
no effect on queues constructed with one of these properties, however it
157+
still affects queues that were not constructed with either of these properties.

sycl/include/sycl/detail/pi.h

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -662,6 +662,8 @@ constexpr pi_queue_properties PI_QUEUE_FLAG_ON_DEVICE_DEFAULT = (1 << 3);
662662
constexpr pi_queue_properties PI_EXT_ONEAPI_QUEUE_FLAG_DISCARD_EVENTS = (1 << 4);
663663
constexpr pi_queue_properties PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_LOW = (1 << 5);
664664
constexpr pi_queue_properties PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_HIGH = (1 << 6);
665+
constexpr pi_queue_properties PI_EXT_QUEUE_FLAG_SUBMISSION_NO_IMMEDIATE = (1 << 7);
666+
constexpr pi_queue_properties PI_EXT_QUEUE_FLAG_SUBMISSION_IMMEDIATE = (1 << 8);
665667
// clang-format on
666668

667669
typedef enum {

sycl/include/sycl/detail/property_helper.hpp

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -44,8 +44,10 @@ enum DataLessPropKind {
4444
QueuePriorityLow = 17,
4545
QueuePriorityHigh = 18,
4646
GraphNoCycleCheck = 19,
47+
QueueSubmissionBatched = 20,
48+
QueueSubmissionImmediate = 21,
4749
// Indicates the last known dataless property.
48-
LastKnownDataLessPropKind = 19,
50+
LastKnownDataLessPropKind = 21,
4951
// Exceeding 32 may cause ABI breaking change on some of OSes.
5052
DataLessPropKindSize = 32
5153
};

sycl/include/sycl/properties/queue_properties.def

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,10 @@ __SYCL_DATA_LESS_PROP(ext::oneapi::property::queue, priority_low,
1717
QueuePriorityLow)
1818
__SYCL_DATA_LESS_PROP(ext::oneapi::property::queue, priority_high,
1919
QueuePriorityHigh)
20+
__SYCL_DATA_LESS_PROP(ext::intel::property::queue, no_immediate_command_list,
21+
QueueSubmissionBatched)
22+
__SYCL_DATA_LESS_PROP(ext::intel::property::queue, immediate_command_list,
23+
QueueSubmissionImmediate)
2024

2125
__SYCL_DATA_LESS_PROP(ext::oneapi::cuda::property::queue, use_default_stream,
2226
UseDefaultStream)

sycl/plugins/unified_runtime/pi2ur.hpp

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1455,7 +1455,9 @@ inline pi_result piextQueueCreate(pi_context Context, pi_device Device,
14551455
PI_QUEUE_FLAG_ON_DEVICE_DEFAULT |
14561456
PI_EXT_ONEAPI_QUEUE_FLAG_DISCARD_EVENTS |
14571457
PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_LOW |
1458-
PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_HIGH)),
1458+
PI_EXT_ONEAPI_QUEUE_FLAG_PRIORITY_HIGH |
1459+
PI_EXT_QUEUE_FLAG_SUBMISSION_NO_IMMEDIATE |
1460+
PI_EXT_QUEUE_FLAG_SUBMISSION_IMMEDIATE)),
14591461
PI_ERROR_INVALID_VALUE);
14601462

14611463
PI_ASSERT(Context, PI_ERROR_INVALID_CONTEXT);
@@ -1482,6 +1484,10 @@ inline pi_result piextQueueCreate(pi_context Context, pi_device Device,
14821484
UrProperties.flags |= UR_QUEUE_FLAG_SYNC_WITH_DEFAULT_STREAM;
14831485
if (Properties[1] & __SYCL_PI_CUDA_USE_DEFAULT_STREAM)
14841486
UrProperties.flags |= UR_QUEUE_FLAG_USE_DEFAULT_STREAM;
1487+
if (Properties[1] & PI_EXT_QUEUE_FLAG_SUBMISSION_NO_IMMEDIATE)
1488+
UrProperties.flags |= UR_QUEUE_FLAG_SUBMISSION_BATCHED;
1489+
if (Properties[1] & PI_EXT_QUEUE_FLAG_SUBMISSION_IMMEDIATE)
1490+
UrProperties.flags |= UR_QUEUE_FLAG_SUBMISSION_IMMEDIATE;
14851491

14861492
ur_queue_index_properties_t IndexProperties{};
14871493
IndexProperties.stype = UR_STRUCTURE_TYPE_QUEUE_INDEX_PROPERTIES;

sycl/plugins/unified_runtime/ur/adapters/level_zero/context.cpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -583,7 +583,7 @@ ur_result_t ur_context_handle_t_::getAvailableCommandList(
583583
bool UseCopyEngine, bool AllowBatching,
584584
ze_command_queue_handle_t *ForcedCmdQueue) {
585585
// Immediate commandlists have been pre-allocated and are always available.
586-
if (Queue->Device->ImmCommandListUsed) {
586+
if (Queue->UsingImmCmdLists) {
587587
CommandList = Queue->getQueueGroup(UseCopyEngine).getImmCmdList();
588588
if (CommandList->second.EventList.size() >
589589
ImmCmdListsEventCleanupThreshold) {

sycl/plugins/unified_runtime/ur/adapters/level_zero/device.cpp

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -971,10 +971,6 @@ ur_result_t ur_device_handle_t_::initialize(int SubSubDeviceOrdinal,
971971

972972
ImmCommandListUsed = this->useImmediateCommandLists();
973973

974-
if (ImmCommandListUsed == ImmCmdlistMode::NotUsed) {
975-
ZeEventsScope = DeviceEventsSetting;
976-
}
977-
978974
uint32_t numQueueGroups = 0;
979975
ZE2UR_CALL(zeDeviceGetCommandQueueGroupProperties,
980976
(ZeDevice, &numQueueGroups, nullptr));

sycl/plugins/unified_runtime/ur/adapters/level_zero/device.hpp

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -135,11 +135,6 @@ struct ur_device_handle_t_ : _ur_object {
135135
// Returns whether immediate command lists are used on this device.
136136
ImmCmdlistMode ImmCommandListUsed{};
137137

138-
// Scope of events used for events on the device
139-
// Can be adjusted with UR_L0_DEVICE_SCOPE_EVENTS
140-
// for non-immediate command lists
141-
EventsScope ZeEventsScope = AllHostVisible;
142-
143138
bool isSubDevice() { return RootDevice != nullptr; }
144139

145140
// Is this a Data Center GPU Max series (aka PVC)?

sycl/plugins/unified_runtime/ur/adapters/level_zero/event.cpp

Lines changed: 7 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -115,7 +115,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueEventsWait(
115115
}
116116
}
117117

118-
if (!Queue->Device->ImmCommandListUsed) {
118+
if (!Queue->UsingImmCmdLists) {
119119
std::unique_lock<ur_shared_mutex> Lock(Queue->Mutex);
120120
resetCommandLists(Queue);
121121
}
@@ -270,7 +270,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueEventsWaitWithBarrier(
270270
for (auto &QueueGroup : QueueMap) {
271271
bool UseCopyEngine =
272272
QueueGroup.second.Type != ur_queue_handle_t_::queue_type::Compute;
273-
if (Queue->Device->ImmCommandListUsed) {
273+
if (Queue->UsingImmCmdLists) {
274274
// If immediate command lists are being used, each will act as their own
275275
// queue, so we must insert a barrier into each.
276276
for (auto &ImmCmdList : QueueGroup.second.ImmCmdLists)
@@ -498,7 +498,7 @@ ur_result_t ur_event_handle_t_::getOrCreateHostVisibleEvent(
498498
this->Mutex);
499499

500500
if (!HostVisibleEvent) {
501-
if (UrQueue->Device->ZeEventsScope != OnDemandHostVisibleProxy)
501+
if (UrQueue->ZeEventsScope != OnDemandHostVisibleProxy)
502502
die("getOrCreateHostVisibleEvent: missing host-visible event");
503503

504504
// Submit the command(s) signalling the proxy event to the queue.
@@ -538,8 +538,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEventWait(
538538
///< events to wait for completion
539539
) {
540540
for (uint32_t I = 0; I < NumEvents; I++) {
541-
if (EventWaitList[I]->UrQueue->Device->ZeEventsScope ==
542-
OnDemandHostVisibleProxy) {
541+
if (EventWaitList[I]->UrQueue->ZeEventsScope == OnDemandHostVisibleProxy) {
543542
// Make sure to add all host-visible "proxy" event signals if needed.
544543
// This ensures that all signalling commands are submitted below and
545544
// thus proxy events can be waited without a deadlock.
@@ -587,7 +586,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEventWait(
587586
}
588587
}
589588
if (auto Q = Event->UrQueue) {
590-
if (Q->Device->ImmCommandListUsed && Q->isInOrderQueue())
589+
if (Q->UsingImmCmdLists && Q->isInOrderQueue())
591590
// Use information about waited event to cleanup completed events in
592591
// the in-order queue.
593592
CleanupEventsInImmCmdLists(
@@ -1029,7 +1028,7 @@ ur_result_t _ur_ze_event_list_t::createAndRetainUrZeEventList(
10291028
this->UrEventList = nullptr;
10301029

10311030
if (CurQueue->isInOrderQueue() && CurQueue->LastCommandEvent != nullptr) {
1032-
if (CurQueue->Device->ImmCommandListUsed) {
1031+
if (CurQueue->UsingImmCmdLists) {
10331032
if (ReuseDiscardedEvents && CurQueue->isDiscardEvents()) {
10341033
// If queue is in-order with discarded events and if
10351034
// new command list is different from the last used command list then
@@ -1158,7 +1157,7 @@ ur_result_t _ur_ze_event_list_t::createAndRetainUrZeEventList(
11581157
//
11591158
// Make sure that event1.wait() will wait for a host-visible
11601159
// event that is signalled before the command2 is enqueued.
1161-
if (CurQueue->Device->ZeEventsScope != AllHostVisible) {
1160+
if (CurQueue->ZeEventsScope != AllHostVisible) {
11621161
CurQueue->executeAllOpenCommandLists();
11631162
}
11641163
}

sycl/plugins/unified_runtime/ur/adapters/level_zero/kernel.cpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -209,7 +209,7 @@ UR_APIEXPORT ur_result_t UR_APICALL urEnqueueKernelLaunch(
209209
if (IndirectAccessTrackingEnabled)
210210
Queue->KernelsToBeSubmitted.push_back(Kernel);
211211

212-
if (Queue->Device->ImmCommandListUsed && IndirectAccessTrackingEnabled) {
212+
if (Queue->UsingImmCmdLists && IndirectAccessTrackingEnabled) {
213213
// If using immediate commandlists then gathering of indirect
214214
// references and appending to the queue (which means submission)
215215
// must be done together.

0 commit comments

Comments
 (0)