Fix adding event to queue cache #1549
This is a draft for now, as I'd like to add some tests for this.
This is a regression after #1324. We should target this for 0.9. Given the severity of the impact, it makes me wonder whether the discard events are used by anyone and are worth keeping around in their current form. They add a lot of complexity to the codebase.
d2bfd73 to 337697f
I've added one test for the event caches, but it's a bit hacky (I'm using the ZeCallCount map from the L0 adapter). If you have a better idea for how to test the number of zeEventCreate calls, I'm open to suggestions. The testcase that verifies the issue fixed by this PR is Also, there are some problems I found:
using FlagsTupleType = std::tuple<ur_queue_flags_t, ur_queue_flags_t,
                                  ur_queue_flags_t, ur_queue_flags_t>;

#define ASSERT_SUCCESS_OR_UNSUPPORTED(ret) \
can this be moved to some common location?
for (int j = 0; j < numEnqueues; j++) {
    enqueueWork(nullptr, i * numEnqueues + j);
}
ASSERT_SUCCESS_OR_UNSUPPORTED(urQueueFinish(queue));
It's odd that UNSUPPORTED is returned by urQueueFinish. Did you investigate what specifically is returning that?
It seems that zeCommandListHostSynchronize returns this. pfnHostSynchronize is set to false on the machine where this fails. This happens for UR_QUEUE_FLAG_SUBMISSION_IMMEDIATE.
On the other hand, when I run with UR_QUEUE_FLAG_SUBMISSION_BATCHED I get a segfault and a "corrupted double-linked list" error, coming from zeCommandListReset I believe.
// This will count the calls to Level-Zero
#ifndef UR_L0_CALL_COUNT_IN_TESTS
std::map<std::string, int> *ZeCallCount = nullptr;
I really don't like this. #1454 is meant to give programmatic access to internals like this in tests. So please add a TODO.
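For readers unfamiliar with the call-counting approach being discussed, here is a minimal sketch of the idea: a global pointer to a name-to-count map that a test can install, plus a wrapper that records each call before forwarding it. Everything in it (`CallCount`, `COUNTED_CALL`, `recordCall`, `fakeEventCreate`, `countEventCreates`) is a hypothetical stand-in and not the adapter's actual code; only the shape of the `std::map<std::string, int> *` counter comes from the diff above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the adapter's counter: null means "not counting".
static std::map<std::string, int> *CallCount = nullptr;

// Records one call under the given name if a test has installed a map.
static void recordCall(const char *name) {
    if (CallCount) {
        ++(*CallCount)[name];
    }
}

// Hypothetical wrapper: count the call by name, then forward the arguments.
#define COUNTED_CALL(fn, ...) (recordCall(#fn), fn(__VA_ARGS__))

// Stand-in for a driver entry point such as an event-creation call.
static int fakeEventCreate(int *outHandle) {
    *outHandle = 1;
    return 0;
}

// Installs a local map, performs n wrapped calls, and returns the count seen.
int countEventCreates(int n) {
    std::map<std::string, int> counts;
    CallCount = &counts;
    for (int i = 0; i < n; ++i) {
        int handle = 0;
        (void)COUNTED_CALL(fakeEventCreate, &handle);
    }
    CallCount = nullptr; // stop counting before the map goes out of scope
    return counts["fakeEventCreate"];
}
```

A test built this way asserts on the recorded count rather than on any observable API result, which is why it reads as hacky: it depends on a mutable global that the code under test must cooperate with.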
::testing::Combine(
    testing::Values(0, UR_QUEUE_FLAG_DISCARD_EVENTS),
    testing::Values(0, UR_QUEUE_FLAG_OUT_OF_ORDER_EXEC_MODE_ENABLE),
    // TODO: why the test fails with UR_QUEUE_FLAG_SUBMISSION_BATCHED?
With what combination of the flags? The discard events path doesn't seem that well tested, so it's not inconceivable that it's broken when a few different flags are set.
Well... every combination. Now I'm getting a segfault from zeCommandListReset (similar to the comment above).
I reran this on the latest driver and everything works fine there, so I guess we can just uncomment this once we switch our CI to the newer driver.
7690a5f to f3d7960
Fix adding event to queue cache
in single-device scenario.
When an event is an internal event (created when using a queue with the discard events property), its UrQueue is set to nullptr. This means that inside addEventToContextCache the event will be added to the multi-device cache, which is wrong. This causes unbounded growth of the multi-device event cache if no multi-device events are requested from the cache.
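The failure mode described in the commit message can be illustrated with a small model. This is only a sketch under assumptions: `Device`, `Queue`, `Event`, `Context`, the `device` field on the event, and both `addToCache*` functions are hypothetical stand-ins, and the actual fix in this PR may select the cache differently. The model shows that when the cache is chosen from the queue alone, an event with a null queue lands in the multi-device cache, where single-device requests never find it.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-ins; none of these are the adapter's real types.
struct Device { int id; };
struct Queue { Device *device; };
struct Event {
    Queue *queue;   // nullptr for internal events on a discard-events queue
    Device *device; // assumption: the device is also recorded on the event
};

struct Context {
    std::vector<std::list<Event *>> perDeviceCache; // indexed by Device::id
    std::list<Event *> multiDeviceCache;

    explicit Context(std::size_t numDevices) : perDeviceCache(numDevices) {}

    // Buggy selection: the device is derived from the queue only, so a null
    // queue is indistinguishable from a genuine multi-device event.
    void addToCacheBuggy(Event *e) {
        Device *d = e->queue ? e->queue->device : nullptr;
        cacheFor(d).push_back(e);
    }

    // One possible fix: fall back to the device stored on the event itself.
    void addToCacheFixed(Event *e) {
        Device *d = e->queue ? e->queue->device : e->device;
        cacheFor(d).push_back(e);
    }

    // A single-device request only ever looks in that device's cache.
    Event *getFromCache(Device *d) {
        std::list<Event *> &cache = cacheFor(d);
        if (cache.empty()) return nullptr;
        Event *e = cache.front();
        cache.pop_front();
        return e;
    }

private:
    std::list<Event *> &cacheFor(Device *d) {
        return d ? perDeviceCache[d->id] : multiDeviceCache;
    }
};

// Requests and releases an internal event `rounds` times on one device and
// returns how many events ended up in the multi-device cache.
std::size_t multiDeviceCacheSizeAfter(bool useFix, int rounds) {
    Device dev{0};
    Context ctx(1);
    std::vector<Event> storage;
    storage.reserve(rounds); // keeps pointers into storage stable
    for (int i = 0; i < rounds; ++i) {
        Event *e = ctx.getFromCache(&dev);
        if (!e) { // cache miss: a new event has to be created
            storage.push_back(Event{nullptr, &dev});
            e = &storage.back();
        }
        if (useFix) ctx.addToCacheFixed(e);
        else ctx.addToCacheBuggy(e);
    }
    return ctx.multiDeviceCache.size();
}
```

In this model the buggy path creates a fresh event on every round and parks each one in the multi-device cache, so its size grows with the number of rounds; with the fallback, the first event is recycled through the per-device cache and the multi-device cache stays empty.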