Skip to content

[SYCL] Do not store last event for in-order queues #18277

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: sycl
Choose a base branch
from

Conversation

igchor
Copy link
Member

@igchor igchor commented Apr 30, 2025

unless Host Tasks are used.

Without Host Tasks, we can just rely on UR for ordering. Having no last event means that ext_oneapi_get_last_event() needs to submit a barrier to return an event to the user. Similarly, ext_oneapi_submit_barrier() now always submits a barrier, even for in-order queues.

Whenever Host Tasks are used we need to start recording all events. This is needed because of how kernel submission synchronizes with Host Tasks. With a following scenario:

q.host_task();
q.submit_kernel();
q.host_task():

The kernel won't even be submitted to UR until the first Host Task completes. To properly synchronize the second Host Task we need to keep the event describing kernel submission.

@igchor igchor temporarily deployed to WindowsCILock April 30, 2025 21:28 — with GitHub Actions Inactive
@igchor igchor force-pushed the in_order_queue_no_event branch from ce2652e to dac0398 Compare May 1, 2025 01:00
@igchor igchor force-pushed the in_order_queue_no_event branch from dac0398 to 375b895 Compare May 1, 2025 01:01
@igchor igchor temporarily deployed to WindowsCILock May 1, 2025 01:01 — with GitHub Actions Inactive
@igchor igchor temporarily deployed to WindowsCILock May 1, 2025 01:49 — with GitHub Actions Inactive
@igchor igchor force-pushed the in_order_queue_no_event branch from a7b84a1 to ce4ac8a Compare May 1, 2025 20:44
@igchor igchor temporarily deployed to WindowsCILock May 1, 2025 21:22 — with GitHub Actions Inactive
@igchor igchor force-pushed the in_order_queue_no_event branch from ce4ac8a to 7ce77ca Compare May 1, 2025 21:41
@igchor igchor force-pushed the in_order_queue_no_event branch from 7ce77ca to 39c5740 Compare May 1, 2025 21:42
@igchor igchor temporarily deployed to WindowsCILock May 1, 2025 21:43 — with GitHub Actions Inactive
@igchor igchor temporarily deployed to WindowsCILock May 1, 2025 22:21 — with GitHub Actions Inactive
@igchor igchor force-pushed the in_order_queue_no_event branch from 39c5740 to 1e2bf93 Compare May 2, 2025 01:35
@igchor igchor temporarily deployed to WindowsCILock May 2, 2025 02:07 — with GitHub Actions Inactive
@igchor igchor temporarily deployed to WindowsCILock May 2, 2025 02:17 — with GitHub Actions Inactive
@Pennycook
Copy link
Contributor

Whenever Host Tasks are used we need to start recording all events.

I think I'm missing something. Is this so we can make sure that the second host task in your example depends only on specific previous commands?

Why isn't it sufficient to just insert a barrier in the queue before we launch the host task, store the event associated with that barrier inside the host task implementation, and wait on it before the host task executes? Wouldn't that ensure that everything previously submitted to the queue was complete before the host task executed?

@steffenlarsen
Copy link
Contributor

Why isn't it sufficient to just insert a barrier in the queue before we launch the host task, store the event associated with that barrier inside the host task implementation, and wait on it before the host task executes? Wouldn't that ensure that everything previously submitted to the queue was complete before the host task executed?

I second that, though the proper event would be a "marker" (in OpenCL terminology) instead of a barrier. For in-order queues they are the same, but for out-of-order (which I don't know if this applies to, but better be safe) the marker won't block other work from starting before it finishes, while a barrier would.

For reference:
urEnqueueEventsWait -> Marker.
urEnqueueEventsWaitWithBarrier -> Barrier.

queue_impl::insertMarkerEvent is always helpful for the former. 😉

@igchor
Copy link
Member Author

igchor commented May 14, 2025

Why isn't it sufficient to just insert a barrier in the queue before we launch the host task, store the event associated with that barrier inside the host task implementation, and wait on it before the host task executes? Wouldn't that ensure that everything previously submitted to the queue was complete before the host task executed?

I second that, though the proper event would be a "marker" (in OpenCL terminology) instead of a barrier. For in-order queues they are the same, but for out-of-order (which I don't know if this applies to, but better be safe) the marker won't block other work from starting before it finishes, while a barrier would.

@steffenlarsen @Pennycook that's what we are doing right now, however, this is not enough - the problem is with all the kernels submitted after the host task. With the scenario I mentioned in the first comment what actually happens is this:

  1. host task is enqueued - if there were any operations on the device prior to that, insert a barrier and wait on it
  2. kernel is submitted to the queue - we cannot enqueue it to UR until we know host task is completed (if we do, both will execute concurrently), the only way to wait on host_task is to call depends_on() on host task event

For any operation that comes after that, we still need to make sure they are synchronized with the kernel we just submitted. Since the kernel might not yet be enqueued to UR, we need to call depends_on() and rely on the scheduler.

In my patch I implement a way to 'go back' to the eventless mode by checking if the LastEvent is completed. If it is, we don't need any calls to depends_on() anymore.

@igchor igchor force-pushed the in_order_queue_no_event branch from 8405663 to e96b4d6 Compare May 14, 2025 15:40
@igchor igchor temporarily deployed to WindowsCILock May 14, 2025 15:40 — with GitHub Actions Inactive
@Pennycook
Copy link
Contributor

  1. host task is enqueued - if there were any operations on the device prior to that, insert a barrier and wait on it
  2. kernel is submitted to the queue - we cannot enqueue it to UR until we know host task is completed (if we do, both will execute concurrently), the only way to wait on host_task is to call depends_on() on host task event

Instead of calling depends_on, couldn't you insert another barrier/marker? You'd have to insert another barrier before every kernel (until we detect that any host tasks are completed), but that might be faster than creating events.

Or does that not work because those barriers (and the kernels that follow them) would actually need to be enqueued to UR?

@igchor igchor temporarily deployed to WindowsCILock May 14, 2025 16:09 — with GitHub Actions Inactive
@igchor igchor temporarily deployed to WindowsCILock May 14, 2025 16:09 — with GitHub Actions Inactive
@igchor igchor force-pushed the in_order_queue_no_event branch from e96b4d6 to f63ce1e Compare May 14, 2025 16:20
@igchor igchor temporarily deployed to WindowsCILock May 14, 2025 16:20 — with GitHub Actions Inactive
@igchor
Copy link
Member Author

igchor commented May 14, 2025

  1. host task is enqueued - if there were any operations on the device prior to that, insert a barrier and wait on it
  2. kernel is submitted to the queue - we cannot enqueue it to UR until we know host task is completed (if we do, both will execute concurrently), the only way to wait on host_task is to call depends_on() on host task event

Instead of calling depends_on, couldn't you insert another barrier/marker? You'd have to insert another barrier before every kernel (until we detect that any host tasks are completed), but that might be faster than creating events.

Or does that not work because those barriers (and the kernels that follow them) would actually need to be enqueued to UR?

Yes, we can only submit barrier to UR but since UR does not know anything about the host tasks, they would have no effect. The only way to synchronize with host_task (as far as I'm aware) is to use the SYCL event. The SYCL scheduler can execute the host tasks in any order (assuming that dependencies are completed), and there is no specific handling for host tasks originating from in-order queue in the scheduler.

I think we won't be able to solve this until we have host task implementation in UR.

@igchor igchor temporarily deployed to WindowsCILock May 14, 2025 16:43 — with GitHub Actions Inactive
@igchor igchor temporarily deployed to WindowsCILock May 14, 2025 16:43 — with GitHub Actions Inactive
@steffenlarsen
Copy link
Contributor

Yes, we can only submit barrier to UR but since UR does not know anything about the host tasks, they would have no effect. The only way to synchronize with host_task (as far as I'm aware) is to use the SYCL event. The SYCL scheduler can execute the host tasks in any order (assuming that dependencies are completed), and there is no specific handling for host tasks originating from in-order queue in the scheduler.

I think we won't be able to solve this until we have host task implementation in UR.

Ah, thank you for clarifying. I wonder (and maybe this would be a follow-up) but could we have host-task register its event as an "external event" like with set_external_event? Since any command will clear the external event when it is enqueued, we are guaranteed that it won't conflict with any set by the user, and if the user sets one after then they must guarantee that it finishes after the host-task anyway. That way we could get rid of the last event for good, right?

@igchor
Copy link
Member Author

igchor commented May 15, 2025

Yes, we can only submit barrier to UR but since UR does not know anything about the host tasks, they would have no effect. The only way to synchronize with host_task (as far as I'm aware) is to use the SYCL event. The SYCL scheduler can execute the host tasks in any order (assuming that dependencies are completed), and there is no specific handling for host tasks originating from in-order queue in the scheduler.
I think we won't be able to solve this until we have host task implementation in UR.

Ah, thank you for clarifying. I wonder (and maybe this would be a follow-up) but could we have host-task register its event as an "external event" like with set_external_event? Since any command will clear the external event when it is enqueued, we are guaranteed that it won't conflict with any set by the user, and if the user sets one after then they must guarantee that it finishes after the host-task anyway. That way we could get rid of the last event for good, right?

That's an interesting idea, however, I think handling graphs could be problematic. Right now, we have a separate LastEvent for regular submissions and for the recording mode. I would need to think if we can somehow unify this.

Also, we were actually thinking about deprecating set_external_event extension as it seems a bit redundant - you can achieve the same thing by calling depends_on or by submitting a barrier with non-empty wait-list (although as you pointed out, barrier might not be the most suitable name for in-order queues).

@igchor igchor temporarily deployed to WindowsCILock May 15, 2025 20:24 — with GitHub Actions Inactive
@igchor igchor temporarily deployed to WindowsCILock May 15, 2025 20:48 — with GitHub Actions Inactive
@igchor igchor force-pushed the in_order_queue_no_event branch from e1c4ba6 to 3aea08d Compare May 15, 2025 21:02
@igchor igchor temporarily deployed to WindowsCILock May 15, 2025 21:02 — with GitHub Actions Inactive
@igchor igchor temporarily deployed to WindowsCILock May 15, 2025 21:28 — with GitHub Actions Inactive
@igchor igchor temporarily deployed to WindowsCILock May 15, 2025 21:28 — with GitHub Actions Inactive
For opencl, always store the last event to support queue_empty(),
just don't use it for synchronization
@steffenlarsen
Copy link
Contributor

That's an interesting idea, however, I think handling graphs could be problematic. Right now, we have a separate LastEvent for regular submissions and for the recording mode. I would need to think if we can somehow unify this.

I'm fine with either. The main reason I talked about external event here is that it clears itself after first use, which seems like something we could also do for LastEvent, now that it is only needed for certain cases. That removes the need for two event trackers doing remotely the same thing.

Also, we were actually thinking about deprecating set_external_event extension as it seems a bit redundant - you can achieve the same thing by calling depends_on or by submitting a barrier with non-empty wait-list (although as you pointed out, barrier might not be the most suitable name for in-order queues).

If we need to keep the "last event" for the sake of host_task, I can't help but wonder if we might as well keep it. It seems like the logic should be the same between it and the tracked event from host_task. From what I remember of the use-case for set_external_event, it was not only used as an inter-queue dependency tracking mechanism, but also a way for the user to set an event and then query the "last" event later to compare and check if it was still the same event at the top. Both depends_on and barriers were insufficient, so if it comes at a minimal maintenance cost it might be worth just keeping it around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants