Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UR][L0] Fix L0 teardown checks for stability #17818

Merged
merged 11 commits into from
Apr 8, 2025

Conversation

nrspruit
Copy link
Contributor

@nrspruit nrspruit commented Apr 2, 2025

  • Address the race conditions with L0 Loader teardown timing such that L0 teardown is verified before handle destruction in all cases and uses a L0 loader api to verify stability.

Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
nrspruit added 2 commits April 2, 2025 16:15
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@nrspruit nrspruit force-pushed the fix_L0_loader_teardown_checks branch from 36f9559 to c33beab Compare April 3, 2025 15:42
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@nrspruit nrspruit marked this pull request as ready for review April 3, 2025 19:07
@nrspruit nrspruit requested review from a team as code owners April 3, 2025 19:07
@nrspruit nrspruit requested a review from a team as a code owner April 3, 2025 19:07
@nrspruit nrspruit requested a review from reble April 3, 2025 19:07
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@nrspruit
Copy link
Contributor Author

nrspruit commented Apr 4, 2025

@intel/bindless-images-reviewers / @intel/sycl-graphs-reviewers please review the minor changes to your code paths. SYCL and oneAPI customers tend to try and release SYCL/L0 objects after atexit has begun, this results in random cases where the memory for L0 will have been released by the OS and we encounter application crash.

This change ensures that during resource release we always check to ensure the L0 loader and driver is stable and we don't allow a crash for customers. This utilizes a check added to the Loader to verify the stability of the L0 ecosystem before destroying a handle.

NOTE: these changes don't result in memory leaks. When this issue occurs, the OS has already released the memory for these handles. All other times, this check will result in a pass and the release will continue unblocked.

@nrspruit
Copy link
Contributor Author

nrspruit commented Apr 4, 2025

@intel/llvm-gatekeepers , please let me know if we can move ahead and merge this change. Without this, we have race conditions with various users of SYCL where the application will abort at teardown.

@aelovikov-intel
Copy link
Contributor

image
image

@nrspruit
Copy link
Contributor Author

nrspruit commented Apr 4, 2025

@intel/bindless-images-reviewers , please review the small changes around image destroy needed to prevent use after free crashes.

@nrspruit
Copy link
Contributor Author

nrspruit commented Apr 7, 2025

ping @intel/bindless-images-reviewers , please review your section to unblock this review, the changes are small, but this prevents potential crashes during teardown when memory is being freed by SYCL and SYCL users.

@nrspruit
Copy link
Contributor Author

nrspruit commented Apr 8, 2025

@intel/llvm-gatekeepers please merge when available.

@kbenzie kbenzie merged commit 507001c into intel:sycl Apr 8, 2025
63 of 69 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants