[L0] do not ignore returned values from zeHostSynchronize #1259
Codecov Report
All modified and coverable lines are covered by tests ✅

Coverage diff, main vs. #1259 (no change in any metric):
Coverage: 15.57% -> 15.57%
Files: 233
Lines: 32088
Branches: 3638
Hits: 4999
Misses: 27038
Partials: 51
This PR fixes a specific issue, but I've noticed that some other functions are also called without checking the return value, e.g. here: https://github.com/oneapi-src/unified-runtime/blob/536f31a8eeb9ac60314149e88e8772d8b5249058/source/adapters/level_zero/queue.cpp#L712C5-L712C22 and in some other places. Are some of those by design, or is this just an oversight? If it's the latter, then perhaps we should mark every (or most?) function that returns ur_result_t as [[nodiscard]]?
pbalcer
left a comment
+1 to adding nodiscard, but I'm not sure if we can just add it en masse without then having to spend weeks trying to solve issues.
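To make the [[nodiscard]] suggestion concrete, here is a minimal sketch of what the attribute buys. Everything in it is a stand-in: `result_t`, `Queue`, and `release` are hypothetical names for illustration, not the adapter's actual types, and the real ur_result_t comes from the UR headers.

```cpp
// Hypothetical stand-in for ur_result_t (assumption, not the real enum).
enum result_t { RESULT_SUCCESS = 0, RESULT_ERROR_DEVICE_LOST = 1 };

// Hypothetical queue type used only for this illustration.
struct Queue {
  bool hung = false;

  // With [[nodiscard]], a call whose result is dropped produces a
  // compiler warning instead of silently compiling.
  [[nodiscard]] result_t synchronize() const {
    return hung ? RESULT_ERROR_DEVICE_LOST : RESULT_SUCCESS;
  }
};

result_t release(const Queue &q) {
  // q.synchronize();  // would warn: discarded [[nodiscard]] return value
  if (result_t r = q.synchronize(); r != RESULT_SUCCESS)
    return r; // propagate the failure to the caller
  return RESULT_SUCCESS;
}
```

Adding the attribute function by function, rather than everywhere at once, would let the resulting warnings be handled in small batches.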
  // Make sure all commands get executed.
- Queue->synchronize();
+ UR_CALL(Queue->synchronize());
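For readers unfamiliar with this style of error propagation, the following is a rough sketch of how a check-and-return macro behaves. `CHECK_CALL`, `result_t`, `Queue`, and `queue_release` are hypothetical names made up for this example; this is not the adapter's actual UR_CALL definition, which may do more (such as tracing).

```cpp
// Hypothetical stand-in for ur_result_t (assumption, not the real enum).
enum result_t { RESULT_SUCCESS = 0, RESULT_ERROR_DEVICE_LOST = 1 };

// Hypothetical macro: evaluate the call once and return early on failure.
#define CHECK_CALL(call)                                                       \
  do {                                                                         \
    result_t check_call_result_ = (call);                                      \
    if (check_call_result_ != RESULT_SUCCESS)                                  \
      return check_call_result_;                                               \
  } while (0)

// Hypothetical queue type used only for this illustration.
struct Queue {
  bool hung = false;
  result_t synchronize() const {
    return hung ? RESULT_ERROR_DEVICE_LOST : RESULT_SUCCESS;
  }
};

result_t queue_release(const Queue &q, bool &cleaned_up) {
  // Make sure all commands get executed; bail out if that fails.
  CHECK_CALL(q.synchronize());
  cleaned_up = true; // skipped when the macro returns early
  return RESULT_SUCCESS;
}
```

Note that any cleanup placed after the macro is skipped on the early return, which is what raises the question of leaks in this kind of code.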
I'm not sure if this isn't leaking the queue object. But other similar checks also don't do anything with it so dunno :)
Yeah, I didn't see any cleanup in other parts of the code either.
If we actually end up failing here (on this call or any other in this function), it will lead to an abort at the SYCL level: urQueueRelease is called inside a queue dtor, and PI translates UR errors into exceptions (the dtor is noexcept), so we don't need to worry about leaks. I don't know if this is desired, but that's how it is right now.
However, I guess it would be nice to have some tests at the UR level, perhaps with error injection, so we could check for leaks.
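The abort described above follows from standard C++ rules: a destructor is implicitly noexcept, so an exception escaping it calls std::terminate. A small sketch under stated assumptions, where `throw_on_error` and `QueueOwner` are hypothetical stand-ins for the PI translation layer and the SYCL queue, not the real code:

```cpp
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-in for ur_result_t (assumption, not the real enum).
enum result_t { RESULT_SUCCESS = 0, RESULT_ERROR_DEVICE_LOST = 1 };

// Hypothetical translation layer: turns an error code into an exception.
inline void throw_on_error(result_t r) {
  if (r != RESULT_SUCCESS)
    throw std::runtime_error("runtime call failed");
}

// Hypothetical owner whose destructor releases the queue.
struct QueueOwner {
  result_t release_result = RESULT_SUCCESS;

  // No exception specification is written, so this destructor is
  // implicitly noexcept. If throw_on_error throws here, the exception
  // cannot leave the destructor and std::terminate is called.
  ~QueueOwner() { throw_on_error(release_result); }
};

// The translation function may throw, while the destructor may not.
static_assert(!noexcept(throw_on_error(RESULT_SUCCESS)));
static_assert(std::is_nothrow_destructible_v<QueueOwner>);
```

Constructing a `QueueOwner` with `release_result` set to an error and letting it go out of scope would terminate the process, which is why leaks are moot on that path.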
nrspruit
left a comment
LGTM. I think this might be related to another bug we encountered, where the error was delayed until the next kernel after the previous event failed. Please merge this.
Instead, return the error immediately. Ignoring the errors resulted in silent failures and unexpected errors from subsequent operations.
Rebased and created a testing PR: intel/llvm#12427
[L0] do not ignore returned values from zeHostSynchronize
oneapi-src/unified-runtime#1259 --------- Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Instead, return the error immediately. Ignoring the errors resulted in silent failures and unexpected errors from subsequent operations.
This issue manifested when the driver detected a hang in a kernel (with hangcheck enabled). Instead of being reported on queue wait, the error was reported only when the next kernel was run.
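The difference between the old and new behavior can be illustrated with a toy model. This is only a sketch: `FakeDevice`, `wait_ignoring_errors`, and `wait_checked` are hypothetical names, and the "sticky error" behavior is an assumption made for the example rather than a description of the Level Zero driver.

```cpp
// Hypothetical stand-in for ur_result_t (assumption, not the real enum).
enum result_t { RESULT_SUCCESS = 0, RESULT_ERROR_DEVICE_LOST = 1 };

// Toy device: once a hang is detected, every later call reports the error.
struct FakeDevice {
  bool hang_detected = false;

  result_t host_synchronize() const {
    return hang_detected ? RESULT_ERROR_DEVICE_LOST : RESULT_SUCCESS;
  }
  result_t launch_kernel() const {
    return hang_detected ? RESULT_ERROR_DEVICE_LOST : RESULT_SUCCESS;
  }
};

// Old behavior: the synchronize result is dropped, so the wait "succeeds".
result_t wait_ignoring_errors(const FakeDevice &d) {
  (void)d.host_synchronize();
  return RESULT_SUCCESS;
}

// New behavior: the synchronize result is returned to the caller.
result_t wait_checked(const FakeDevice &d) { return d.host_synchronize(); }
```

With `hang_detected` set, `wait_ignoring_errors` reports success and the failure first shows up in `launch_kernel`, whereas `wait_checked` reports it at the wait itself.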