Skip to content

Conversation

@igchor
Copy link
Contributor

@igchor igchor commented Jan 17, 2024

Instead, return the error immediately. Ignoring the errors, resulted in silent failure and unexpected errors from next operations.

This issue manifested when a driver detected a hang in the kernel (when hangcheck is enabled). Instead of reporting an error on queue wait, the error was being reported only when a next kernel was run.

@codecov-commenter
Copy link

codecov-commenter commented Jan 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (92f44da) 15.57% compared to head (f9ad3d4) 15.57%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1259   +/-   ##
=======================================
  Coverage   15.57%   15.57%           
=======================================
  Files         233      233           
  Lines       32088    32088           
  Branches     3638     3638           
=======================================
  Hits         4999     4999           
  Misses      27038    27038           
  Partials       51       51           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@igchor igchor marked this pull request as ready for review January 18, 2024 05:54
@igchor igchor requested a review from a team as a code owner January 18, 2024 05:54
@igchor
Copy link
Contributor Author

igchor commented Jan 18, 2024

This PR fixes a specific issue but I've noticed that some other functions are also called without checking return value, eg. here: https://github.com/oneapi-src/unified-runtime/blob/536f31a8eeb9ac60314149e88e8772d8b5249058/source/adapters/level_zero/queue.cpp#L712C5-L712C22 and in some other places. Are some of those by design or is this just an oversight? If it's the latter then perhaps we should mark every (or most?) functions that return ur_result_t as [[nodiscard]]?

Copy link
Contributor

@pbalcer pbalcer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to adding nodiscard, but I'm not sure if we can just add it en masse without then having to spend weeks trying to solve issues.


// Make sure all commands get executed.
Queue->synchronize();
UR_CALL(Queue->synchronize());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this isn't leaking the queue object. But other similar checks also don't do anything with it so dunno :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I didn't see any cleanup in other parts of the code as well.

If we actually end up failing here (on this call or any other in this function) it will lead to abort on SYCL level because urQueueRelease is called inside a queue dtor and PI translates UR errors into exceptions (dtor is noexcept) so we don't need to worry about leaks. I don't know is this is desired but that's how it is right now.

However, I guess it would nice to have some tests on UR level with error injection perhaps so we could check for leaks.

Copy link
Contributor

@nrspruit nrspruit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I think this might be related to another bug being encountered that delayed the error to the next kernel after the previous event failed. Please merge this.

@pbalcer pbalcer added level-zero L0 adapter specific issues v0.8.x Include in the v0.8.x release labels Jan 18, 2024
@pbalcer
Copy link
Contributor

pbalcer commented Jan 18, 2024

@igchor please create an intel/llvm testing PR (after #1260 is merged).

@kbenzie kbenzie mentioned this pull request Jan 18, 2024
8 tasks
Instead, return the error immediately. Ignoring the errors,
resulted in silent failure and unexpected errors from next operations.
@igchor
Copy link
Contributor Author

igchor commented Jan 19, 2024

@igchor please create an intel/llvm testing PR (after #1260 is merged).

Rebased and created a testing PR: intel/llvm#12427

@kbenzie kbenzie merged commit 6032f6f into oneapi-src:main Jan 22, 2024
kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request Jan 22, 2024
[L0] do not ignore returned values from zeHostSynchronize
kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request Jan 22, 2024
[L0] do not ignore returned values from zeHostSynchronize
steffenlarsen pushed a commit to intel/llvm that referenced this pull request Jan 22, 2024
oneapi-src/unified-runtime#1259

---------

Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
kbenzie added a commit to intel/llvm that referenced this pull request Jan 22, 2024
oneapi-src/unified-runtime#1259

---------

Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

level-zero L0 adapter specific issues v0.8.x Include in the v0.8.x release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants