[SYCL] Do not store last event for in-order queues #18277

igchor · 2025-04-30T20:56:29Z

unless Host Tasks are used.

Without Host Tasks, we can just rely on UR for ordering. Having no last event means that ext_oneapi_get_last_event() needs to submit a barrier to return an event to the user. Similarly, ext_oneapi_submit_barrier() now always submits a barrier, even for in-order queues.

Whenever Host Tasks are used we need to start recording all events. This is needed because of how kernel submission synchronizes with Host Tasks. Consider the following scenario:

q.host_task();
q.submit_kernel();
q.host_task();

The kernel won't even be submitted to UR until the first Host Task completes. To properly synchronize the second Host Task we need to keep the event describing kernel submission.
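
To make the ordering hazard concrete, here is a minimal toy model in standard C++. It is not SYCL or UR code: `Command`, `run`, and the two helper functions are hypothetical names invented for this sketch. The toy scheduler deliberately picks the most recently submitted command that is ready, standing in for a scheduler that is free to run ready host tasks in any order. If the second host task depends only on the first host task, it can run before the kernel; recording the kernel's event and depending on it restores the in-order behaviour.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy command: a name plus the indices of earlier commands it waits for.
struct Command {
  std::string name;
  std::vector<int> deps;
};

// Adversarial toy scheduler: at each step it runs the most recently
// submitted command whose dependencies have all completed.
std::vector<std::string> run(const std::vector<Command> &cmds) {
  std::vector<bool> done(cmds.size(), false);
  std::vector<std::string> order;
  while (order.size() < cmds.size()) {
    bool progressed = false;
    for (int i = static_cast<int>(cmds.size()) - 1; i >= 0; --i) {
      if (done[i])
        continue;
      bool ready = true;
      for (int d : cmds[i].deps)
        ready = ready && done[d];
      if (!ready)
        continue;
      done[i] = true;
      order.push_back(cmds[i].name);
      progressed = true;
      break;
    }
    if (!progressed)
      break; // only reachable with cyclic dependencies
  }
  return order;
}

// The kernel's event is recorded, so the second host task depends on it.
std::vector<std::string> order_with_kernel_event() {
  std::vector<Command> cmds;
  cmds.push_back(Command{"host_task_1", {}});
  cmds.push_back(Command{"kernel", {0}});
  cmds.push_back(Command{"host_task_2", {1}});
  return run(cmds);
}

// Assumption for illustration: no event is kept for the kernel, so the
// second host task can only depend on the first host task.
std::vector<std::string> order_without_kernel_event() {
  std::vector<Command> cmds;
  cmds.push_back(Command{"host_task_1", {}});
  cmds.push_back(Command{"kernel", {0}});
  cmds.push_back(Command{"host_task_2", {0}});
  return run(cmds);
}
```

In this model, `order_without_kernel_event()` lets `host_task_2` overtake `kernel`, which is the violation the description is guarding against.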

Pennycook · 2025-05-14T09:21:45Z

> Whenever Host Tasks are used we need to start recording all events.

I think I'm missing something. Is this so we can make sure that the second host task in your example depends only on specific previous commands?

Why isn't it sufficient to just insert a barrier in the queue before we launch the host task, store the event associated with that barrier inside the host task implementation, and wait on it before the host task executes? Wouldn't that ensure that everything previously submitted to the queue was complete before the host task executed?

sycl/source/detail/queue_impl.hpp

sycl/source/detail/queue_impl.cpp

steffenlarsen · 2025-05-14T10:34:31Z

> Why isn't it sufficient to just insert a barrier in the queue before we launch the host task, store the event associated with that barrier inside the host task implementation, and wait on it before the host task executes? Wouldn't that ensure that everything previously submitted to the queue was complete before the host task executed?

I second that, though the proper event would be a "marker" (in OpenCL terminology) instead of a barrier. For in-order queues they are the same, but for out-of-order (which I don't know if this applies to, but better be safe) the marker won't block other work from starting before it finishes, while a barrier would.

For reference:
urEnqueueEventsWait -> Marker.
urEnqueueEventsWaitWithBarrier -> Barrier.

queue_impl::insertMarkerEvent is always helpful for the former. 😉
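
The marker/barrier distinction can be sketched as a toy dependency model in standard C++. It does not call UR; `Kind` and `wait_sets` are hypothetical names, and the rules encoded are an assumption based only on the description in this comment: both a marker and a barrier wait for everything submitted before them, but only a barrier also holds back work submitted after it on an out-of-order queue.

```cpp
#include <cassert>
#include <set>
#include <vector>

enum class Kind { Work, Marker, Barrier };

// For each command in a toy out-of-order queue, compute the set of earlier
// command indices it must wait for.
std::vector<std::set<int>> wait_sets(const std::vector<Kind> &cmds) {
  std::vector<std::set<int>> waits(cmds.size());
  int last_barrier = -1; // index of the most recent barrier, if any
  for (int i = 0; i < static_cast<int>(cmds.size()); ++i) {
    if (cmds[i] == Kind::Work) {
      // Ordinary work is only held back by a preceding barrier.
      if (last_barrier >= 0)
        waits[i].insert(last_barrier);
    } else {
      // Markers and barriers complete after everything submitted before them.
      for (int j = 0; j < i; ++j)
        waits[i].insert(j);
      // Only a barrier blocks commands that come after it.
      if (cmds[i] == Kind::Barrier)
        last_barrier = i;
    }
  }
  return waits;
}
```

For the sequence work, marker, work, the last command has an empty wait set; replacing the marker with a barrier makes it wait on the barrier.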

igchor · 2025-05-14T15:15:47Z

> > Why isn't it sufficient to just insert a barrier in the queue before we launch the host task, store the event associated with that barrier inside the host task implementation, and wait on it before the host task executes? Wouldn't that ensure that everything previously submitted to the queue was complete before the host task executed?

> I second that, though the proper event would be a "marker" (in OpenCL terminology) instead of a barrier. For in-order queues they are the same, but for out-of-order (which I don't know if this applies to, but better be safe) the marker won't block other work from starting before it finishes, while a barrier would.

@steffenlarsen @Pennycook that's what we are doing right now; however, this is not enough. The problem is with all the kernels submitted after the host task. With the scenario I mentioned in the first comment, what actually happens is this:

host task is enqueued - if there were any operations on the device prior to that, insert a barrier and wait on it
kernel is submitted to the queue - we cannot enqueue it to UR until we know host task is completed (if we do, both will execute concurrently), the only way to wait on host_task is to call depends_on() on host task event

For any operation that comes after that, we still need to make sure they are synchronized with the kernel we just submitted. Since the kernel might not yet be enqueued to UR, we need to call depends_on() and rely on the scheduler.

In my patch I implement a way to 'go back' to the eventless mode by checking if the LastEvent is completed. If it is, we don't need any calls to depends_on() anymore.
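
The switch between the two modes can be sketched as a small state machine in standard C++. This is a toy model, not the `queue_impl` code from the patch: `Event`, `InOrderQueueModel`, and its methods are hypothetical names, and the sketch leaves out the barrier inserted before a host task as well as graph recording. Each submit function returns the event the new command would have to pass to `depends_on()`, or a null pointer when backend ordering alone is enough.

```cpp
#include <cassert>
#include <memory>

// Toy event: just a completion flag.
struct Event {
  bool complete = false;
};

class InOrderQueueModel {
public:
  // Returns the event the kernel must depend on, or nullptr in eventless mode.
  std::shared_ptr<Event> submit_kernel() {
    std::shared_ptr<Event> dep = pending_last_event();
    if (!dep)
      return nullptr; // eventless mode: nothing is recorded
    // The kernel may not have reached the backend yet, so later commands
    // must be able to depend on it explicitly.
    last_event_ = std::make_shared<Event>();
    return dep;
  }

  // Host tasks always record an event, which starts event tracking.
  std::shared_ptr<Event> submit_host_task() {
    std::shared_ptr<Event> dep = pending_last_event();
    last_event_ = std::make_shared<Event>();
    return dep;
  }

  std::shared_ptr<Event> last_event() const { return last_event_; }

  // True while a last event is stored. The completion check is lazy, so this
  // only flips back to false on the next submission.
  bool tracking() const { return last_event_ != nullptr; }

private:
  // Drops the stored event once it has completed, which returns the model
  // to eventless mode.
  std::shared_ptr<Event> pending_last_event() {
    if (last_event_ && last_event_->complete)
      last_event_.reset();
    return last_event_;
  }

  std::shared_ptr<Event> last_event_;
};
```

The design choice illustrated here is that the completion check happens at submission time, so no background bookkeeping is needed to leave the tracking mode.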

Pennycook · 2025-05-14T15:53:49Z

> host task is enqueued - if there were any operations on the device prior to that, insert a barrier and wait on it

> kernel is submitted to the queue - we cannot enqueue it to UR until we know host task is completed (if we do, both will execute concurrently), the only way to wait on host_task is to call depends_on() on host task event

Instead of calling depends_on, couldn't you insert another barrier/marker? You'd have to insert another barrier before every kernel (until we detect that any host tasks are completed), but that might be faster than creating events.

Or does that not work because those barriers (and the kernels that follow them) would actually need to be enqueued to UR?

igchor · 2025-05-14T16:25:22Z

> > host task is enqueued - if there were any operations on the device prior to that, insert a barrier and wait on it

> > kernel is submitted to the queue - we cannot enqueue it to UR until we know host task is completed (if we do, both will execute concurrently), the only way to wait on host_task is to call depends_on() on host task event

> Instead of calling depends_on, couldn't you insert another barrier/marker? You'd have to insert another barrier before every kernel (until we detect that any host tasks are completed), but that might be faster than creating events.

> Or does that not work because those barriers (and the kernels that follow them) would actually need to be enqueued to UR?

Yes, we can only submit barriers to UR, but since UR does not know anything about host tasks, the barriers would have no effect. The only way to synchronize with host_task (as far as I'm aware) is to use the SYCL event. The SYCL scheduler can execute the host tasks in any order (assuming that dependencies are completed), and there is no specific handling in the scheduler for host tasks originating from an in-order queue.

I think we won't be able to solve this until we have host task implementation in UR.

steffenlarsen · 2025-05-15T04:43:22Z

> Yes, we can only submit barriers to UR, but since UR does not know anything about host tasks, the barriers would have no effect. The only way to synchronize with host_task (as far as I'm aware) is to use the SYCL event. The SYCL scheduler can execute the host tasks in any order (assuming that dependencies are completed), and there is no specific handling in the scheduler for host tasks originating from an in-order queue.

> I think we won't be able to solve this until we have host task implementation in UR.

Ah, thank you for clarifying. I wonder (and maybe this would be a follow-up) but could we have host-task register its event as an "external event" like with set_external_event? Since any command will clear the external event when it is enqueued, we are guaranteed that it won't conflict with any set by the user, and if the user sets one after then they must guarantee that it finishes after the host-task anyway. That way we could get rid of the last event for good, right?
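
The "clears itself after first use" behaviour that this suggestion relies on can be sketched as a one-shot slot in standard C++. This is only an illustration of the idea, not the `set_external_event` implementation: `Event`, `OneShotEventSlot`, `set`, and `take` are hypothetical names. A host task would store its event in the slot, and the next enqueued command would take it as a dependency, leaving the slot empty for every command after that.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Toy event: just a completion flag.
struct Event {
  bool complete = false;
};

class OneShotEventSlot {
public:
  // Store an event that the next enqueued command must wait for.
  void set(std::shared_ptr<Event> e) { slot_ = std::move(e); }

  // Called when a command is enqueued: hands over the pending event, if any,
  // and leaves the slot empty (moving from a shared_ptr empties it).
  std::shared_ptr<Event> take() { return std::move(slot_); }

  bool empty() const { return slot_ == nullptr; }

private:
  std::shared_ptr<Event> slot_;
};
```

Only the first command enqueued after `set()` receives the event; later calls to `take()` return a null pointer until something is stored again.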

sycl/source/detail/queue_impl.hpp

sycl/test-e2e/InorderQueue/in_order_ext_oneapi_submit_barrier.cpp

sycl/unittests/Extensions/EnqueueFunctionsEvents.cpp

sycl/unittests/scheduler/InOrderQueueHostTaskDeps.cpp

igchor · 2025-05-15T18:38:26Z

> > Yes, we can only submit barriers to UR, but since UR does not know anything about host tasks, the barriers would have no effect. The only way to synchronize with host_task (as far as I'm aware) is to use the SYCL event. The SYCL scheduler can execute the host tasks in any order (assuming that dependencies are completed), and there is no specific handling in the scheduler for host tasks originating from an in-order queue.
> > I think we won't be able to solve this until we have host task implementation in UR.

> Ah, thank you for clarifying. I wonder (and maybe this would be a follow-up) but could we have host-task register its event as an "external event" like with set_external_event? Since any command will clear the external event when it is enqueued, we are guaranteed that it won't conflict with any set by the user, and if the user sets one after then they must guarantee that it finishes after the host-task anyway. That way we could get rid of the last event for good, right?

That's an interesting idea, however, I think handling graphs could be problematic. Right now, we have a separate LastEvent for regular submissions and for the recording mode. I would need to think if we can somehow unify this.

Also, we were actually thinking about deprecating set_external_event extension as it seems a bit redundant - you can achieve the same thing by calling depends_on or by submitting a barrier with non-empty wait-list (although as you pointed out, barrier might not be the most suitable name for in-order queues).

For opencl, always store the last event to support queue_empty(), just don't use it for synchronization

steffenlarsen · 2025-05-16T05:05:45Z

> That's an interesting idea, however, I think handling graphs could be problematic. Right now, we have a separate LastEvent for regular submissions and for the recording mode. I would need to think if we can somehow unify this.

I'm fine with either. The main reason I talked about external event here is that it clears itself after first use, which seems like something we could also do for LastEvent, now that it is only needed for certain cases. That removes the need for two event trackers doing remotely the same thing.

> Also, we were actually thinking about deprecating set_external_event extension as it seems a bit redundant - you can achieve the same thing by calling depends_on or by submitting a barrier with non-empty wait-list (although as you pointed out, barrier might not be the most suitable name for in-order queues).

If we need to keep the "last event" for the sake of host_task, I can't help but wonder if we might as well keep it. It seems like the logic should be the same between it and the tracked event from host_task. From what I remember of the use-case for set_external_event, it was not only used as an inter-queue dependency tracking mechanism, but also a way for the user to set an event and then query the "last" event later to compare and check if it was still the same event at the top. Both depends_on and barriers were insufficient, so if it comes at a minimal maintenance cost it might be worth just keeping it around.

slawekptak reviewed May 14, 2025

sycl/source/detail/queue_impl.hpp (resolved)

sycl/source/detail/queue_impl.cpp (outdated, resolved)

Fix queue_empty() for opencl

f63ce1e

slawekptak reviewed May 15, 2025

add checks to test

1abe040

restore some of the unittests

6982df9

Rework logic for openCL

3aea08d

For opencl, always store the last event to support queue_empty(), just don't use it for synchronization
