Skip to content

Fix adding event to queue cache#1549

Merged
kbenzie merged 2 commits intooneapi-src:mainfrom
igchor:event_fix
May 3, 2024
Merged

Fix adding event to queue cache#1549
kbenzie merged 2 commits intooneapi-src:mainfrom
igchor:event_fix

Conversation

@igchor
Copy link
Contributor

@igchor igchor commented Apr 25, 2024

in single-device scenario.

When event is an internal event (when using queue with discard events property) the UrQueue is set to nullptr. This means that inside addEventToContextCache the event will be added to a multi-device cache which is wrong. This causes unbounded growth of the multi-device event cache if there are no multi-device events requested from the cache.

@github-actions github-actions bot added the level-zero L0 adapter specific issues label Apr 25, 2024
@igchor
Copy link
Contributor Author

igchor commented Apr 25, 2024

This is draft for now as I'd like to add some tests for this.

@pbalcer
Copy link
Contributor

pbalcer commented Apr 26, 2024

This is a regression after #1324. We should target this for 0.9.

Given the severity of the impact, it makes me wonder whether the discard events are used by anyone and worth keeping around in their current form. They are adding a lot of complexity to the codebase.

@pbalcer pbalcer added the v0.9.x Include in the v0.9.x release label Apr 26, 2024
@pbalcer pbalcer requested a review from raiyanla April 26, 2024 12:55
@igchor igchor force-pushed the event_fix branch 5 times, most recently from d2bfd73 to 337697f Compare April 26, 2024 22:19
@igchor
Copy link
Contributor Author

igchor commented Apr 26, 2024

I've added one test for the event caches but it's a bit hacky (I'm using ZeCallCount map from L0 adapter) - if you have a better idea how to test number of zeEventCreateCalls then I'm open for suggestions. The testcase that verifies the issue fixed by this PR is eventsReuseNoVisibleEvent with UR_QUEUE_FLAG_DISCARD_EVENTS.

Also, there are some problems I found:

  • the test seems to fail with UR_RESULT_ERROR_FEATURE_UNSUPPORTED (this is returned from either urQueueFinish or urQeueuRelease) for some machines on our CI (with older driver?). For now, I've added a GTEST_SKIP() if this error is returned but it requires further investigation
  • test does not work with UR_QUEUE_FLAG_SUBMISSION_BATCHED - I'm not sure why
  • in eventsReuseNoVisibleEvent I would expect the number of zeEventCreate to never be as high as numIters * numEnqueues but that number is equal or higher unless UR_QUEUE_FLAG_DISCARD_EVENTS is set and UR_QUEUE_FLAG_OUT_OF_ORDER_EXEC_MODE_ENABLE is NOT set

@igchor igchor marked this pull request as ready for review April 26, 2024 22:28
@igchor igchor requested review from a team as code owners April 26, 2024 22:28
@igchor igchor requested review from pbalcer and raiyanla April 26, 2024 22:28
using FlagsTupleType = std::tuple<ur_queue_flags_t, ur_queue_flags_t,
ur_queue_flags_t, ur_queue_flags_t>;

#define ASSERT_SUCCESS_OR_UNSUPPORTED(ret) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be moved to some common location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done.

for (int j = 0; j < numEnqueues; j++) {
enqueueWork(nullptr, i * numEnqueues + j);
}
ASSERT_SUCCESS_OR_UNSUPPORTED(urQueueFinish(queue));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's odd that the UNSUPPORTED is returned by urQueueFinish. Did you investigate what specifically is returning that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that zeCommandListHostSynchronize returns this. pfnHostSynchronize is set to false on the machine that this fails. This happens for UR_QUEUE_FLAG_SUBMISSION_IMMEDIATE.

On the other hand, when I run with UR_QUEUE_FLAG_SUBMISSION_BATCHED I get a segfault and corrupted double-linked list coming from zeCommandListReset I believe.


// This will count the calls to Level-Zero
#ifndef UR_L0_CALL_COUNT_IN_TESTS
std::map<std::string, int> *ZeCallCount = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this. #1454 is meant to give programmatic access to internals like this in tests. So please add a TODO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

::testing::Combine(
testing::Values(0, UR_QUEUE_FLAG_DISCARD_EVENTS),
testing::Values(0, UR_QUEUE_FLAG_OUT_OF_ORDER_EXEC_MODE_ENABLE),
// TODO: why the test fails with UR_QUEUE_FLAG_SUBMISSION_BATCHED?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with what combination of the flags? The discard events path doesn't seem that well tested, so it's not inconceivable that's its broken when a few different flags are set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... every combination. Now, I'm getting a segfault from zeCommandListReset (similarly as in the comment above).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rerun this on the latest driver and everything works fine there so I guess we can just uncomment this once we switch our CI to the newer driver.

@github-actions github-actions bot added the conformance Conformance test suite issues. label Apr 29, 2024
@igchor igchor force-pushed the event_fix branch 2 times, most recently from 7690a5f to f3d7960 Compare April 29, 2024 20:05
igchor added 2 commits May 2, 2024 16:50
in single-device scenario.

When event is an internal event (when using queue with discard
events property) the UrQueue is set to nullptr. This means that
inside addEventToContextCache the event will be added to a
multi-device cache which is wrong. This causes unbounded growth
of the multi-device event cache if there are no multi-device
events requested from the cache.
@kbenzie kbenzie added the ready to merge Added to PR's which are ready to merge label May 2, 2024
@kbenzie kbenzie mentioned this pull request May 3, 2024
5 tasks
@kbenzie kbenzie merged commit fb342f0 into oneapi-src:main May 3, 2024
kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request May 3, 2024
sarnex pushed a commit to intel/llvm that referenced this pull request May 3, 2024
@igchor igchor deleted the event_fix branch May 4, 2024 02:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

conformance Conformance test suite issues. level-zero L0 adapter specific issues ready to merge Added to PR's which are ready to merge v0.9.x Include in the v0.9.x release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants