Skip to content

fix queue locking behavior when creating event lists#1546

Merged
kbenzie merged 1 commit intooneapi-src:mainfrom
pbalcer:fix-queue-locking
May 3, 2024
Merged

fix queue locking behavior when creating event lists#1546
kbenzie merged 1 commit intooneapi-src:mainfrom
pbalcer:fix-queue-locking

Conversation

@pbalcer
Copy link
Contributor

@pbalcer pbalcer commented Apr 25, 2024

This fixes a race when creating queue events without acquiring the appropriate locks, and also fixes a deadlock when acquiring multiple queue locks.

The deadlock scenario is roughly this:
queue1 = ...;
queue2 = ...;

for (;;) {
queue1_event = urEnqueueKernelLaunch(queue1, ...);
// lock order: queue2->Mutex, queue1->Mutex
urEnqueueEventsWait(queue2, 1, [queue1_event], nullptr);
}

T2:

for (;;) {
queue2_event = urEnqueueKernelLaunch(queue2, ...);
// lock order: queue1->Mutex, queue2->Mutex
urEnqueueEventsWait(queue1, 1, [queue2_event], nullptr);
}

The solution in this patch is the simplest possible fix I managed to figure out, but I strongly believe this code requires a substantial refactor to better structure the synchronization logic. Right now it's a minefield.

This fixes a race when creating queue events without acquiring
the appropriate locks, and also fixes a deadlock when acquiring
multiple queue locks.

The deadlock scenario is roughly this:
queue1 = ...;
queue2 = ...;

for (;;) {
  queue1_event = urEnqueueKernelLaunch(queue1, ...);
  // lock order: queue2->Mutex, queue1->Mutex
  urEnqueueEventsWait(queue2, 1, [queue1_event], nullptr);
}

T2:

for (;;) {
  queue2_event = urEnqueueKernelLaunch(queue2, ...);
  // lock order: queue1->Mutex, queue2->Mutex
  urEnqueueEventsWait(queue1, 1, [queue2_event], nullptr);
}

The solution in this patch is the simplest possible fix I managed
to figure out, but I strongly believe this code requires a substantial
refactor to better structure the synchronization logic. Right now
it's a minefield.
@pbalcer pbalcer requested a review from a team as a code owner April 25, 2024 09:42
@pbalcer pbalcer added the v0.9.x Include in the v0.9.x release label Apr 25, 2024
@pbalcer
Copy link
Contributor Author

pbalcer commented Apr 25, 2024

intel/llvm#13564

@github-actions github-actions bot added the level-zero L0 adapter specific issues label Apr 25, 2024
@pbalcer
Copy link
Contributor Author

pbalcer commented Apr 25, 2024

I've been looking at this code more closely, and I'm not sure whether unlocking the CurQueue like that is a good idea, since something else can modify the state of the queue in the middle of an operation. So if an upper-level function stores some queue state on the stack or some other allocation, and then calls createAndRetainUrZeEventList, that state might change. This is the case here.

Now, I don't think there's any bug here now, but this is such a massive footgun that it might be a good idea to consider some other solution (everything I came up so far requires a very substantial refactor, where the risk of making it this late in the release cycle might be non-trivial).

Copy link
Contributor

@nrspruit nrspruit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was similar to what I tried, but as you said, you are dropping the lock to the current queue given the Queue in the event list is not the same.

Thanks for finding that we needed to move the lock scope to one step above.

@nrspruit nrspruit added the ready to merge Added to PR's which are ready to merge label Apr 26, 2024
@kbenzie kbenzie merged commit a44e81b into oneapi-src:main May 3, 2024
kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request May 3, 2024
fix queue locking behavior when creating event lists
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

level-zero L0 adapter specific issues ready to merge Added to PR's which are ready to merge v0.9.x Include in the v0.9.x release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants