Skip to content

stop batching multiple CmdList execution when waiting for barriers#1555

Merged
kbenzie merged 1 commit intooneapi-src:mainfrom
pbalcer:fix-enqueue-wait-barrier
May 3, 2024
Merged

stop batching multiple CmdList execution when waiting for barriers#1555
kbenzie merged 1 commit intooneapi-src:mainfrom
pbalcer:fix-enqueue-wait-barrier

Conversation

@pbalcer
Copy link
Contributor

@pbalcer pbalcer commented Apr 30, 2024

The current implementation of queue command batching appears to assume that getAvailableCommandList is followed by a single executeCommandList, and it's not allowed to, for example, create a list of CmdLists for later execution. And there's an assert to check for this.

This makes sense, since it doesn't really make sense to batch things across many different CmdLists. However, the implementation of urEnqueueEventsWaitWithBarrier was doing exactly that.

This patch disables batching in executeCommandList, fixing a assert failure during urEnqueueEventsWaitWithBarrier when many potentially open command lists are in use.

@pbalcer pbalcer requested a review from a team as a code owner April 30, 2024 09:25
@pbalcer pbalcer added the v0.8.x Include in the v0.8.x release label Apr 30, 2024
@github-actions github-actions bot added the level-zero L0 adapter specific issues label Apr 30, 2024
@pbalcer
Copy link
Contributor Author

pbalcer commented Apr 30, 2024

intel/llvm#13599

@pbalcer pbalcer added v0.9.x Include in the v0.9.x release and removed v0.8.x Include in the v0.8.x release labels Apr 30, 2024
@kbenzie
Copy link
Contributor

kbenzie commented May 2, 2024

@oneapi-src/unified-runtime-level-zero-write would be great to get some reviews on this so I can go ahead with merging everything and adding it to #1547

@igchor igchor added the ready to merge Added to PR's which are ready to merge label May 2, 2024
The current implementation of queue command batching appears to
assume that getAvailableCommandList is followed by a single
executeCommandList, and it's not allowed to, for example,
create a list of CmdLists for later execution. And there's
an assert to check for this.

This makes sense, since it doesn't really make sense to batch
things across many different CmdLists. However, the implementation
of urEnqueueEventsWaitWithBarrier was doing exactly that.

This patch disables batching in executeCommandList, fixing
a assert failure during urEnqueueEventsWaitWithBarrier
when many potentially open command lists are in use.
@kbenzie kbenzie force-pushed the fix-enqueue-wait-barrier branch from 6015a32 to 112089f Compare May 3, 2024 09:28
@kbenzie kbenzie mentioned this pull request May 3, 2024
5 tasks
@kbenzie kbenzie merged commit 9990d44 into oneapi-src:main May 3, 2024
kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request May 3, 2024
stop batching multiple CmdList execution when waiting for barriers
@kbenzie
Copy link
Contributor

kbenzie commented May 3, 2024

I've removed this from v0.9.1rc due to these issues

@kbenzie kbenzie removed the v0.9.x Include in the v0.9.x release label May 3, 2024
pbalcer added a commit to pbalcer/unified-runtime that referenced this pull request May 7, 2024
This is another attempt to fix an issue where the L0 Adapter
crashes when urEnqueueEventsWaitWithBarrier in presence of
multiple active command queues with batched cmdlists.

The core issue is that the current queue implementation only
allows for two command lists to be active open batches, one
for copy and one for compute. If that assumption doesn't hold,
the getAvailableCommandList function, when executed multiple times
for a command queue of the same type, will override the active
open command batch. So only the last retrieved command list can
actually be batched. After this, when the code attempts to execute
all the command lists it collected, with batching enabled. And this
is where we hit an assert because the active open command list
doesn't match what is being used.

The proper fix here is to allow open command batches for each
command queue. But that's a fairly risky change to do this late
in the release cycle.

My previous attempt at a fix simply disabled batching
for queue-wide barriers (oneapi-src#1555). That introduced regressions
in tests that assumed that batching happens. It might also been
a performance regression.

Instead, this patch fixes getAvailableCommandList when batching
is enabled and specific command queue is required, and disables
batching only for cases where the open batch cmdlist is different
than the one we are executing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

level-zero L0 adapter specific issues ready to merge Added to PR's which are ready to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants