Skip to content

Conversation

jiezzhang
Copy link
Contributor

After #19328, UR_L0_LEAKS_DEBUG stop throwing exceptions when leaks are detected so LIT can't report failures. Add a leak checking in format.py to keep "--param ur_l0_leaks_debug=1" work as before.

@jiezzhang jiezzhang requested a review from a team as a code owner August 5, 2025 05:44
@jiezzhang jiezzhang requested a review from cperkinsintel August 5, 2025 05:44
def check_leak(output):
keyword_found = False
for line in output.splitlines():
if keyword_found and "LEAK" in line:
Copy link
Contributor

@cperkinsintel cperkinsintel Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, we already have a lit var %{l0_leak_check} which a bunch of the tests still use (instead of UR_L0_LEAKS_DEBUG ). It gets replaced by the env var in the final invocation, but I'm not sure if this python script here will detect.

I think that l0_leak_check directive is no longer needed and probably could be replaced by the actual env var. It's in hundreds of tests right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests that set the env var (either directly or by using %{l0_leak_check}) already check for absence of "LEAK" (--implicit-check-not=LEAK) so we don't have to check it again here. As far as I understand this is only needed when the user decides to run all the tests with leak checking (e.g. passing --param ur_l0_leaks_debug=1 to lit).

@jiezzhang
Copy link
Contributor Author

Pretty sure the failed test is not caused by this PR. @cperkinsintel is it a known flaky issue?
SYCL :: Reduction/reduction_internal_nd_range_1dim.cpp

@igchor
Copy link
Contributor

igchor commented Aug 14, 2025

Pretty sure the failed test is not caused by this PR. @cperkinsintel is it a known flaky issue? SYCL :: Reduction/reduction_internal_nd_range_1dim.cpp

I don't know if it;s a known issue but it seems unrelated to the PR

@igchor
Copy link
Contributor

igchor commented Aug 14, 2025

@intel/llvm-reviewers-runtime could you please take a look the PR?

@aelovikov-intel
Copy link
Contributor

After #19328, UR_L0_LEAKS_DEBUG stop throwing exceptions

Why does "unification" cause that?

@igchor
Copy link
Contributor

igchor commented Aug 14, 2025

After #19328, UR_L0_LEAKS_DEBUG stop throwing exceptions

Why does "unification" cause that?

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

@aelovikov-intel
Copy link
Contributor

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

@igchor
Copy link
Contributor

igchor commented Aug 14, 2025

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

Mostly because it's done in the library destructor which means we don't have an entry point from which we could return an error (we had urAdapterTeardown when leak checking was done in UR).

We could abort() but leaks are not really a critical failure so I think just parsing the output in tests is a better option.

@aelovikov-intel
Copy link
Contributor

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

Mostly because it's done in the library destructor which means we don't have an entry point from which we could return an error (we had urAdapterTeardown when leak checking was done in UR).

We could abort() but leaks are not really a critical failure so I think just parsing the output in tests is a better option.

Can we have one more env variable control to request that abort? I think that would still be much better than parsing output (that might be redirected and not available for parsing).

@igchor
Copy link
Contributor

igchor commented Aug 14, 2025

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

Mostly because it's done in the library destructor which means we don't have an entry point from which we could return an error (we had urAdapterTeardown when leak checking was done in UR).
We could abort() but leaks are not really a critical failure so I think just parsing the output in tests is a better option.

Can we have one more env variable control to request that abort? I think that would still be much better than parsing output (that might be redirected and not available for parsing).

@nrspruit What do you think about calling abort? Do you think there is any other way to report the leaks?

@nrspruit
Copy link
Contributor

The leak checking is now happening in the loader, during loader teardown, and so there is no place for us to throw the exception from anymore.

Because it's C and not C++? Can it abort?

Mostly because it's done in the library destructor which means we don't have an entry point from which we could return an error (we had urAdapterTeardown when leak checking was done in UR).
We could abort() but leaks are not really a critical failure so I think just parsing the output in tests is a better option.

Can we have one more env variable control to request that abort? I think that would still be much better than parsing output (that might be redirected and not available for parsing).

@nrspruit What do you think about calling abort? Do you think there is any other way to report the leaks?

So, the layers in the loader are meant to gracefully catch errors, by convention we avoid any and all aborts within the L0 loader and drivers unless it is an unrecoverable error.

In this case, even if you changed the validation layer, you will not get that update in the CI to fix this problem until you have a driver with that loader. The CI does not use a different loader than what is in your driver so you would not see that change for a couple of months unless one changed the loader only.

However, I recommend only adding the "abort" handling in case of a leak into the llvm-lit which already scrapes the logs to determine if a test passes. Because this is already done, the stdout/stderr in llvm-lit will never be redirected, otherwise many tests would fail so I see that as a non issue, noone is redirecting stdout/stderr in the llvm-lit testing unless they wanted all the tests to fail due to the logs being read already.....

@aelovikov-intel
Copy link
Contributor

Where is it proven that this patch works as intended?

@jiezzhang
Copy link
Contributor Author

Where is it proven that this patch works as intended?

For example, run USM cases by commenting its memory releasing

With this patch:

env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /localdisk2/zhangji4/ws1/build/USM/Output/fill.cpp.tmp1.out
# executed command: env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /localdisk2/zhangji4/ws1/build/USM/Output/fill.cpp.tmp1.out
# .---command stderr------------
# | Check balance of create/destroy calls
# | ----------------------------------------------------------
# |                zeContextCreate = 1     \--->              zeContextDestroy = 1
# |           zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
# |                 zeModuleCreate = 21    \--->               zeModuleDestroy = 21
# |                 zeKernelCreate = 21    \--->               zeKernelDestroy = 21
# |              zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
# |   zeCommandListCreateImmediate = 1     |
# |            zeCommandListCreate = 1     \--->          zeCommandListDestroy = 2
# |                  zeEventCreate = 2     \--->                zeEventDestroy = 2
# |                  zeFenceCreate = 1     \--->                zeFenceDestroy = 1
# |                  zeImageCreate = 0     |
# |           zeImageViewCreateExt = 0     \--->                zeImageDestroy = 0
# |                zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
# |               zeMemAllocDevice = 5     |
# |                 zeMemAllocHost = 6     |
# |               zeMemAllocShared = 18    \--->                     zeMemFree = 22
# |                                        \--->                  zeMemFreeExt = 0     ---> LEAK = 7
# `-----------------------------

--

********************
********************
Failed Tests (1):
  SYCL :: USM/fill.cpp


Testing Time: 6.30s

Total Discovered Tests: 1
  Failed: 1 (100.00%)

Without this patch:

env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /localdisk2/zhangji4/ws1/build/USM/Output/fill.cpp.tmp1.out
# executed command: env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /localdisk2/zhangji4/ws1/build/USM/Output/fill.cpp.tmp1.out
# .---command stderr------------
# | Check balance of create/destroy calls
# | ----------------------------------------------------------
# |                zeContextCreate = 1     \--->              zeContextDestroy = 1
# |           zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
# |                 zeModuleCreate = 21    \--->               zeModuleDestroy = 21
# |                 zeKernelCreate = 21    \--->               zeKernelDestroy = 21
# |              zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
# |   zeCommandListCreateImmediate = 1     |
# |            zeCommandListCreate = 1     \--->          zeCommandListDestroy = 2
# |                  zeEventCreate = 2     \--->                zeEventDestroy = 2
# |                  zeFenceCreate = 1     \--->                zeFenceDestroy = 1
# |                  zeImageCreate = 0     |
# |           zeImageViewCreateExt = 0     \--->                zeImageDestroy = 0
# |                zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
# |               zeMemAllocDevice = 5     |
# |                 zeMemAllocHost = 6     |
# |               zeMemAllocShared = 18    \--->                     zeMemFree = 22
# |                                        \--->                  zeMemFreeExt = 0     ---> LEAK = 7
# `-----------------------------

--

********************

Testing Time: 6.22s

Total Discovered Tests: 1
  Passed: 1 (100.00%)

@aelovikov-intel
Copy link
Contributor

For example, run USM cases by commenting its memory releasing

Can you please do the same but also modifying its RUN line like

// RUN: %{run} %t1.out 2>&1 | FileCheck %s
// CHECK-NOT: abracadabra

or something similar to verify that any pipes inside the test don't interfere with this approach?

@jiezzhang
Copy link
Contributor Author

For example, run USM cases by commenting its memory releasing

Can you please do the same but also modifying its RUN line like

// RUN: %{run} %t1.out 2>&1 | FileCheck %s
// CHECK-NOT: abracadabra

or something similar to verify that any pipes inside the test don't interfere with this approach?

Good suggestion! I find an invalid case when validating. Push a commit to fix such scenario

// RUN: %{run} %t.out
// RUN: %if level_zero %{%{l0_leak_check} %{run} %t.out 2>&1 | FileCheck %s --implicit-check-not=LEAK %}

@aelovikov-intel
Copy link
Contributor

Please paste the logs (-a) of that run.

@jiezzhang
Copy link
Contributor Author

@aelovikov-intel please check below output. The problem has been fixed.

env UR_L0_LEAKS_DEBUG=1 env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /localdisk2/zhangji4/ws/build/Graph/RecordReplay/Output/dotp_in_order_with_empty_nodes.cpp.tmp.out 2>&1 | /rdrive/ref/lit/tools/Linux/FileCheck /localdisk2/zhangji4/ws/llvm/sycl/test-e2e/Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp --implicit-check-not=LEAK
# executed command: env UR_L0_LEAKS_DEBUG=1 env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /localdisk2/zhangji4/ws/build/Graph/RecordReplay/Output/dotp_in_order_with_empty_nodes.cpp.tmp.out
# note: command had no output on stdout or stderr
# executed command: /rdrive/ref/lit/tools/Linux/FileCheck /localdisk2/zhangji4/ws/llvm/sycl/test-e2e/Graph/RecordReplay/dotp_in_order_with_empty_nodes.cpp --implicit-check-not=LEAK
# note: command had no output on stdout or stderr

--

********************

Testing Time: 3.54s

Total Discovered Tests: 1
  Passed: 1 (100.00%)

@aelovikov-intel
Copy link
Contributor

Why "PASS"? I asked for a modified case (with a leak) that uses pipe (both stdout and stderr) to see if your approach works for those tests. The logs (-a) should clearly show a fail with a leak in such scenario.

@jiezzhang
Copy link
Contributor Author

Done. It fails with expected errors with new method:

# RUN: at line 10
env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /localdisk2/zhangji4/ws/build/USM/Output/fill.cpp.tmp1.out
# executed command: env UR_L0_LEAKS_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /localdisk2/zhangji4/ws/build/USM/Output/fill.cpp.tmp1.out
# .---command stderr------------
# | Check balance of create/destroy calls
# | ----------------------------------------------------------
# |                zeContextCreate = 1     \--->              zeContextDestroy = 1
# |           zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
# |                 zeModuleCreate = 21    \--->               zeModuleDestroy = 21
# |                 zeKernelCreate = 21    \--->               zeKernelDestroy = 21
# |              zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
# |   zeCommandListCreateImmediate = 1     |
# |            zeCommandListCreate = 1     \--->          zeCommandListDestroy = 2
# |                  zeEventCreate = 2     \--->                zeEventDestroy = 2
# |                  zeFenceCreate = 1     \--->                zeFenceDestroy = 1
# |                  zeImageCreate = 0     |
# |           zeImageViewCreateExt = 0     \--->                zeImageDestroy = 0
# |                zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
# |               zeMemAllocDevice = 5     |
# |                 zeMemAllocHost = 6     |
# |               zeMemAllocShared = 18    \--->                     zeMemFree = 22
# |                                        \--->                  zeMemFreeExt = 0     ---> LEAK = 7
# `-----------------------------

--

********************
********************
Failed Tests (1):
  SYCL :: USM/fill.cpp

Copy link
Contributor

@myler myler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants