[SYCL] Detect ze call leaking in E2E tests #19710

jiezzhang · 2025-08-05T05:44:20Z

After #19328, UR_L0_LEAKS_DEBUG stopped throwing exceptions when leaks are detected, so LIT can't report failures. Add a leak check in format.py so that "--param ur_l0_leaks_debug=1" keeps working as before.

cperkinsintel · 2025-08-05T17:36:04Z

sycl/test-e2e/format.py

+        def check_leak(output):
+            keyword_found = False
+            for line in output.splitlines():
+                if keyword_found and "LEAK" in line:


Just FYI, we already have a lit variable, %{l0_leak_check}, which a bunch of the tests still use (instead of UR_L0_LEAKS_DEBUG). It gets replaced by the env var in the final invocation, but I'm not sure whether this Python script will detect it.

I think that l0_leak_check directive is no longer needed and probably could be replaced by the actual env var. It's in hundreds of tests right now.

I think the tests that set the env var (either directly or by using %{l0_leak_check}) already check for absence of "LEAK" (--implicit-check-not=LEAK) so we don't have to check it again here. As far as I understand this is only needed when the user decides to run all the tests with leak checking (e.g. passing --param ur_l0_leaks_debug=1 to lit).

jiezzhang · 2025-08-08T06:50:10Z

Pretty sure the failed test is not caused by this PR. @cperkinsintel is it a known flaky issue?
SYCL :: Reduction/reduction_internal_nd_range_1dim.cpp

igchor · 2025-08-14T20:29:32Z

Pretty sure the failed test is not caused by this PR. @cperkinsintel is it a known flaky issue? SYCL :: Reduction/reduction_internal_nd_range_1dim.cpp

I don't know if it's a known issue, but it seems unrelated to the PR.

igchor · 2025-08-14T20:29:58Z

@intel/llvm-reviewers-runtime could you please take a look at the PR?

aelovikov-intel · 2025-08-14T20:34:40Z

After #19328, UR_L0_LEAKS_DEBUG stop throwing exceptions

Why does "unification" cause that?

igchor · 2025-08-14T21:13:59Z

After #19328, UR_L0_LEAKS_DEBUG stop throwing exceptions

Why does "unification" cause that?

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

aelovikov-intel · 2025-08-14T21:17:37Z

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

igchor · 2025-08-14T21:34:49Z

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

Mostly because it's done in the library destructor which means we don't have an entry point from which we could return an error (we had urAdapterTeardown when leak checking was done in UR).

We could abort() but leaks are not really a critical failure so I think just parsing the output in tests is a better option.
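To illustrate the point: a process whose teardown code only prints a leak report still exits with status 0, so the harness has to look at the output. This is a minimal sketch with a hypothetical child process standing in for a test binary; the `atexit` hook is only an analogy for the library destructor, and none of this is the actual loader or lit code.

```python
import subprocess
import sys

# Hypothetical stand-in for a test binary: it prints a leak line to stderr
# at exit but still returns 0, because nothing reports the leak as an error.
CHILD = r'''
import atexit, sys
atexit.register(lambda: print("zeMemFree = 22 ---> LEAK = 7", file=sys.stderr))
'''

def run_child():
    """Run the stand-in binary and capture its exit code and output."""
    return subprocess.run([sys.executable, "-c", CHILD],
                          capture_output=True, text=True)

result = run_child()
exit_code_ok = result.returncode == 0    # the process "succeeds"
leak_reported = "LEAK" in result.stderr  # the leak is visible only in the output
print(exit_code_ok, leak_reported)
```

The exit code alone gives the harness nothing to fail on; only the captured stderr carries the leak information.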

aelovikov-intel · 2025-08-14T21:41:53Z

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

Mostly because it's done in the library destructor which means we don't have an entry point from which we could return an error (we had urAdapterTeardown when leak checking was done in UR).

We could abort() but leaks are not really a critical failure so I think just parsing the output in tests is a better option.

Can we have one more env variable control to request that abort? I think that would still be much better than parsing output (that might be redirected and not available for parsing).

igchor · 2025-08-14T21:47:04Z

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

Mostly because it's done in the library destructor which means we don't have an entry point from which we could return an error (we had urAdapterTeardown when leak checking was done in UR).
We could abort() but leaks are not really a critical failure so I think just parsing the output in tests is a better option.

Can we have one more env variable control to request that abort? I think that would still be much better than parsing output (that might be redirected and not available for parsing).

@nrspruit What do you think about calling abort? Do you think there is any other way to report the leaks?

nrspruit · 2025-08-19T18:56:26Z

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

Mostly because it's done in the library destructor which means we don't have an entry point from which we could return an error (we had urAdapterTeardown when leak checking was done in UR).
We could abort() but leaks are not really a critical failure so I think just parsing the output in tests is a better option.

Can we have one more env variable control to request that abort? I think that would still be much better than parsing output (that might be redirected and not available for parsing).

@nrspruit What do you think about calling abort? Do you think there is any other way to report the leaks?

So, the layers in the loader are meant to gracefully catch errors; by convention we avoid any and all aborts within the L0 loader and drivers unless it is an unrecoverable error.

In this case, even if you changed the validation layer, you will not get that update in the CI to fix this problem until you have a driver with that loader. The CI does not use a different loader than what is in your driver so you would not see that change for a couple of months unless one changed the loader only.

However, I recommend adding the "abort" handling for leaks only in llvm-lit, which already scrapes the logs to determine whether a test passes. Because of that, stdout/stderr in llvm-lit will never be redirected (otherwise many tests would fail), so I see that as a non-issue: no one redirects stdout/stderr in llvm-lit testing unless they want all the tests to fail, since the logs are already being read.

aelovikov-intel · 2025-08-19T19:34:29Z

Where is it proven that this patch works as intended?

jiezzhang · 2025-08-20T03:50:17Z

Where is it proven that this patch works as intended?

For example, run a USM test with its memory-release calls commented out.

With this patch:

env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /localdisk2/zhangji4/ws1/build/USM/Output/fill.cpp.tmp1.out
# executed command: env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /localdisk2/zhangji4/ws1/build/USM/Output/fill.cpp.tmp1.out
# .---command stderr------------
# | Check balance of create/destroy calls
# | ----------------------------------------------------------
# |                zeContextCreate = 1     \--->              zeContextDestroy = 1
# |           zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
# |                 zeModuleCreate = 21    \--->               zeModuleDestroy = 21
# |                 zeKernelCreate = 21    \--->               zeKernelDestroy = 21
# |              zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
# |   zeCommandListCreateImmediate = 1     |
# |            zeCommandListCreate = 1     \--->          zeCommandListDestroy = 2
# |                  zeEventCreate = 2     \--->                zeEventDestroy = 2
# |                  zeFenceCreate = 1     \--->                zeFenceDestroy = 1
# |                  zeImageCreate = 0     |
# |           zeImageViewCreateExt = 0     \--->                zeImageDestroy = 0
# |                zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
# |               zeMemAllocDevice = 5     |
# |                 zeMemAllocHost = 6     |
# |               zeMemAllocShared = 18    \--->                     zeMemFree = 22
# |                                        \--->                  zeMemFreeExt = 0     ---> LEAK = 7
# `-----------------------------

--

********************
********************
Failed Tests (1):
  SYCL :: USM/fill.cpp


Testing Time: 6.30s

Total Discovered Tests: 1
  Failed: 1 (100.00%)

Without this patch:

env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /localdisk2/zhangji4/ws1/build/USM/Output/fill.cpp.tmp1.out
# executed command: env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /localdisk2/zhangji4/ws1/build/USM/Output/fill.cpp.tmp1.out
# .---command stderr------------
# | Check balance of create/destroy calls
# | ----------------------------------------------------------
# |                zeContextCreate = 1     \--->              zeContextDestroy = 1
# |           zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
# |                 zeModuleCreate = 21    \--->               zeModuleDestroy = 21
# |                 zeKernelCreate = 21    \--->               zeKernelDestroy = 21
# |              zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
# |   zeCommandListCreateImmediate = 1     |
# |            zeCommandListCreate = 1     \--->          zeCommandListDestroy = 2
# |                  zeEventCreate = 2     \--->                zeEventDestroy = 2
# |                  zeFenceCreate = 1     \--->                zeFenceDestroy = 1
# |                  zeImageCreate = 0     |
# |           zeImageViewCreateExt = 0     \--->                zeImageDestroy = 0
# |                zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
# |               zeMemAllocDevice = 5     |
# |                 zeMemAllocHost = 6     |
# |               zeMemAllocShared = 18    \--->                     zeMemFree = 22
# |                                        \--->                  zeMemFreeExt = 0     ---> LEAK = 7
# `-----------------------------

--

********************

Testing Time: 6.22s

Total Discovered Tests: 1
  Passed: 1 (100.00%)
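For reference, a minimal sketch of the kind of scan that turns the report above into a failure. The name `find_leaks`, the header match and the regex are illustrative assumptions based on the log format shown here, not the code in format.py.

```python
import re

# Assumed markers, taken from the log format shown above.
REPORT_HEADER = "Check balance of create/destroy calls"
LEAK_RE = re.compile(r"--->\s*LEAK\s*=\s*(\d+)")

def find_leaks(output):
    """Return the LEAK counts found after the report header, as ints."""
    header_seen = False
    counts = []
    for line in output.splitlines():
        if REPORT_HEADER in line:
            header_seen = True
            continue
        if header_seen:
            match = LEAK_RE.search(line)
            if match:
                counts.append(int(match.group(1)))
    return counts

# Shortened copy of the report from the log above.
sample = r"""Check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1
              zeMemAllocShared = 18    \--->                     zeMemFree = 22
                                       \--->                  zeMemFreeExt = 0     ---> LEAK = 7
"""
print(find_leaks(sample))  # [7]
```

Requiring the header first avoids flagging an unrelated "LEAK" string that a test might print on its own.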

aelovikov-intel · 2025-08-20T14:39:11Z

For example, run USM cases by commenting its memory releasing

Can you please do the same, but also modify its RUN line like this:

// RUN: %{run} %t1.out 2>&1 | FileCheck %s
// CHECK-NOT: abracadabra

or something similar to verify that any pipes inside the test don't interfere with this approach?
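The concern can be reproduced outside lit. This is a minimal sketch that emulates `producer 2>&1 | consumer` with two hypothetical Python child processes; it is not how lit runs commands internally and only shows that the leak text ends up in the pipe rather than in the output of the last command.

```python
import subprocess
import sys

# Hypothetical producer: writes a leak line to stderr, like a test binary would.
PRODUCER = 'import sys; print("---> LEAK = 7", file=sys.stderr)'
# Hypothetical consumer: stands in for a checker that reads stdin and prints nothing.
CONSUMER = 'import sys; sys.stdin.read()'

def run_pipeline():
    """Emulate `producer 2>&1 | consumer`; return the consumer's stdout/stderr."""
    producer = subprocess.Popen([sys.executable, "-c", PRODUCER],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)  # this is the 2>&1 part
    consumer = subprocess.Popen([sys.executable, "-c", CONSUMER],
                                stdin=producer.stdout,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                text=True)
    producer.stdout.close()  # the consumer now owns the read end of the pipe
    out, err = consumer.communicate()
    producer.wait()
    return out, err

out, err = run_pipeline()
print(repr(out), repr(err))
```

Both captured strings are empty, so a scan that only inspects the final command's output would miss the leak in a piped RUN line.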

jiezzhang · 2025-08-22T07:54:46Z

For example, run USM cases by commenting its memory releasing

Can you please do the same but also modifying its RUN line like
// RUN: %{run} %t1.out 2>&1 | FileCheck %s
// CHECK-NOT: abracadabra
or something similar to verify that any pipes inside the test don't interfere with this approach?

Good suggestion! I found an invalid case while validating and pushed a commit to fix that scenario:

// RUN: %{run} %t.out
// RUN: %if level_zero %{%{l0_leak_check} %{run} %t.out 2>&1 | FileCheck %s --implicit-check-not=LEAK %}

aelovikov-intel · 2025-08-22T15:34:38Z

Please paste the logs (-a) of that run.

jiezzhang · 2025-08-25T08:13:18Z

@aelovikov-intel please check the output below. The problem has been fixed.

env UR_L0_LEAKS_DEBUG=1 env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /localdisk2/zhangji4/ws/build/Graph/RecordReplay/Output/dotp_in_order_with_empty_nodes.cpp.tmp.out 2>&1 | /rdrive/ref/lit/tools/Linux/FileCheck /localdisk2/zhangji4/ws/llvm/sycl/test-e2e/Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp --implicit-check-not=LEAK
# executed command: env UR_L0_LEAKS_DEBUG=1 env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /localdisk2/zhangji4/ws/build/Graph/RecordReplay/Output/dotp_in_order_with_empty_nodes.cpp.tmp.out
# note: command had no output on stdout or stderr
# executed command: /rdrive/ref/lit/tools/Linux/FileCheck /localdisk2/zhangji4/ws/llvm/sycl/test-e2e/Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp --implicit-check-not=LEAK
# note: command had no output on stdout or stderr

--

********************

Testing Time: 3.54s

Total Discovered Tests: 1
  Passed: 1 (100.00%)

aelovikov-intel · 2025-08-25T14:46:30Z

Why "PASS"? I asked for a modified case (with a leak) that uses a pipe (both stdout and stderr) to see if your approach works for those tests. The logs (-a) should clearly show a failure with a leak in such a scenario.

jiezzhang · 2025-09-01T08:56:13Z

Done. With the new method it fails with the expected errors:

# RUN: at line 10
env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /localdisk2/zhangji4/ws/build/USM/Output/fill.cpp.tmp1.out
# executed command: env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /localdisk2/zhangji4/ws/build/USM/Output/fill.cpp.tmp1.out
# .---command stderr------------
# | Check balance of create/destroy calls
# | ----------------------------------------------------------
# |                zeContextCreate = 1     \--->              zeContextDestroy = 1
# |           zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
# |                 zeModuleCreate = 21    \--->               zeModuleDestroy = 21
# |                 zeKernelCreate = 21    \--->               zeKernelDestroy = 21
# |              zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
# |   zeCommandListCreateImmediate = 1     |
# |            zeCommandListCreate = 1     \--->          zeCommandListDestroy = 2
# |                  zeEventCreate = 2     \--->                zeEventDestroy = 2
# |                  zeFenceCreate = 1     \--->                zeFenceDestroy = 1
# |                  zeImageCreate = 0     |
# |           zeImageViewCreateExt = 0     \--->                zeImageDestroy = 0
# |                zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
# |               zeMemAllocDevice = 5     |
# |                 zeMemAllocHost = 6     |
# |               zeMemAllocShared = 18    \--->                     zeMemFree = 22
# |                                        \--->                  zeMemFreeExt = 0     ---> LEAK = 7
# `-----------------------------

--

********************
********************
Failed Tests (1):
  SYCL :: USM/fill.cpp

myler

LGTM.

[SYCL] Detect ze call leaking in E2E tests

a6ff3da

jiezzhang requested a review from a team as a code owner August 5, 2025 05:44

jiezzhang requested a review from cperkinsintel August 5, 2025 05:44

jiezzhang had a problem deploying to WindowsCILock August 5, 2025 17:31 — with GitHub Actions Failure

cperkinsintel reviewed Aug 5, 2025

jiezzhang had a problem deploying to WindowsCILock August 5, 2025 18:07 — with GitHub Actions Failure

igchor reviewed Aug 5, 2025

Fix runfail when ur_l0_leaks_debug is not used

7036961

igchor approved these changes Aug 7, 2025

jiezzhang temporarily deployed to WindowsCILock August 7, 2025 14:04 — with GitHub Actions Inactive

jiezzhang temporarily deployed to WindowsCILock August 7, 2025 14:27 — with GitHub Actions Inactive

match leak logs by regex

30f19a0

myler approved these changes Sep 3, 2025

kswiecicki approved these changes Sep 8, 2025

jiezzhang temporarily deployed to WindowsCILock September 8, 2025 09:01 — with GitHub Actions Inactive

jiezzhang temporarily deployed to WindowsCILock September 8, 2025 09:22 — with GitHub Actions Inactive

7 participants
