fix queue locking behavior when creating event lists #1546
kbenzie merged 1 commit into oneapi-src:main
Conversation
This fixes a race when creating queue events without acquiring
the appropriate locks, and also fixes a deadlock when acquiring
multiple queue locks.
The deadlock scenario is roughly this:

queue1 = ...;
queue2 = ...;

T1:
for (;;) {
    queue1_event = urEnqueueKernelLaunch(queue1, ...);
    // lock order: queue2->Mutex, queue1->Mutex
    urEnqueueEventsWait(queue2, 1, [queue1_event], nullptr);
}

T2:
for (;;) {
    queue2_event = urEnqueueKernelLaunch(queue2, ...);
    // lock order: queue1->Mutex, queue2->Mutex
    urEnqueueEventsWait(queue1, 1, [queue2_event], nullptr);
}
The solution in this patch is the simplest possible fix I managed
to figure out, but I strongly believe this code requires a substantial
refactor to better structure the synchronization logic. Right now
it's a minefield.
I've been looking at this code more closely, and I'm not sure whether unlocking the CurQueue like that is a good idea, since something else can modify the state of the queue in the middle of an operation. So if an upper-level function stores some queue state on the stack or in some other allocation and then calls a function that temporarily drops the lock, that cached state can silently go stale. Now, I don't think there's any bug here now, but this is such a massive footgun that it might be a good idea to consider some other solution. Everything I've come up with so far requires a very substantial refactor, where the risk of making it this late in the release cycle might be non-trivial.
nrspruit left a comment
This was similar to what I tried, but as you said, you are dropping the lock on the current queue when the queue in the event list is not the same.
Thanks for finding that we needed to move the lock scope up one step.