KAFKA-16072: JUnit 5 extension to detect thread leak #15101

wernerdv · 2024-01-01T21:23:54Z

Added LeakTestingExtension based on TestUtils#verifyNoUnexpectedThreads for verify that there are no lingering threads left over. Extension execute after every test.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

Added LeakTestingExtension based on TestUtils#verifyNoUnexpectedThreads

divijvaidya · 2024-01-02T11:16:59Z

Thank you for the change @wernerdv . As a next step, we need to fix the leaks that were detected by this PR before merging this. I already know of at least two PRs that will help: #15093 and #15077 . Let's wait for them to merge first and then we can rebase this on trunk and run again.

wernerdv · 2024-01-02T12:10:05Z

@divijvaidya Thanks for the reply. Now extension runs only for modules that:
testImplementation project(':core')

Is this expected behavior or does it require improvement?

showuon · 2024-01-02T12:17:14Z

@wernerdv , thanks for the smart way to detect the thread leaking!
For now, only core module will introduce thousands of test errors. We need to fix all the thread leaking before extending to other modules. Thanks.

gharris1727

Hi, thanks for the PR!

This test utility suffers from "cascading failures" where if one test leaks a thread, then every test run afterwards in that same JVM will also fail the assertion. This makes it difficult to blame failures on the test which leaked the resource, something that I found extremely important when tracking down leaks with #14783 .

I think this flaw was borderline acceptable when tests opted-in to the leak checking, but I think it is unacceptable if it is unconditionally applied to every test. I think we need to use a stateful approach to diff the threads created between Before and After.

ashwinpankaj · 2024-01-03T03:36:54Z

core/src/test/java/kafka/test/junit/LeakTestingExtension.java

+public class LeakTestingExtension implements AfterEachCallback {
+    @Override
+    public void afterEach(ExtensionContext extensionContext) {
+        TestUtils.verifyNoUnexpectedThreads("@AfterEach");


Thanks @wernerdv . I had a basic question about this change -
TestUtils.verifyNoUnexpectedThreads seems to check for the presence of specific type of threads. Will this check be sufficient to catch all types of thread leaks ?

refactoring

wernerdv · 2024-01-03T08:52:13Z

@gharris1727 @ashwinpankaj
I've updated extension to check for leaked threads relative to threads created before the test.
This looks more correct.

It might be worth adding more names to the set of expected thread names.

add expected thread names

gharris1727 · 2024-01-03T16:39:20Z

core/src/test/java/kafka/test/junit/LeakTestingExtension.java

+            Arrays.asList("junit-", "JMX", "feature-zk-node-event-process-thread", "ForkJoinPool", "executor-",
+                    "kafka-scheduler-", "metrics-meter-tick-thread", "ReplicaFetcherThread", "scala-", "pool-")
+    );
+    private Set<Thread> initialThreads;


I think using a state variable here is essentially a static variable, and is shared between tests running concurrently.
We don't run concurrent tests in the same JVM by default, but it is easy to change that with configuration. Can you use the extensionContext to store the state to avoid this breaking in the future? https://junit.org/junit5/docs/current/user-guide/#extensions-keeping-state

Cascading failure was fixed

gharris1727 · 2024-01-03T16:43:16Z

It might be worth adding more names to the set of expected thread names.

I think increasing the scope of this assertion by adding patterns/removing the pattern check/applying to tests outside of core are good ideas. We can land this change first and then explore those in follow-ups.

used ExtensionContext.Store

gharris1727

LGTM pending a build proving that this assertion passes for the existing test suites.

Thanks @wernerdv for the contribution!

divijvaidya · 2024-01-03T20:10:24Z

core/src/test/java/kafka/test/junit/LeakTestingExtension.java

+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class LeakTestingExtension implements BeforeEachCallback, AfterEachCallback {
+    private static final Set<String> EXPECTED_THREAD_NAMES = new HashSet<>(


Hy @gharris1727 @wernerdv
Can you please help me understand why it is expected to have kafka-scheduler or ExpirationReaper or ReplicaFetcherThread at the end or beginning of test? Isn't presence of these threads an indication of a thread leak? (for example, expiration reaper threads may mean that we don't close controllerAPIs correctly)

Oh I didn't notice that this is a set of threads that are allowed to leak, where the original TestUtils.verifyNoUnexpectedThreads was a set of threads which weren't allowed to leak. I assumed the mechanism was copy-pasted.

I don't know if this set needs to exist, especially now that the assertion is stateful. Perhaps we can empty this set completely to see what the effect is, and then add in what is necessary. I did see some flakiness with an early prototype of #14783 which sometimes detected the gradle runner opening sockets. Perhaps there are similar things which create background threads.

update EXPECTED_THREAD_NAMES

ashwinpankaj · 2024-01-04T05:47:50Z

core/src/test/java/kafka/test/junit/LeakTestingExtension.java

+    private Set<Thread> unexpectedThreads(Set<Thread> initialThreads) {
+        Set<Thread> finalThreads = Thread.getAllStackTraces().keySet();
+
+        if (initialThreads.size() != finalThreads.size()) {


intialThreads and finalThreads could have the same size but contain different threads. I think we should match the names always.

@ashwinpankaj removed size comparison and now thread names are checked.

refactoring

updated EXPECTED_THREAD_NAMES

wernerdv · 2024-01-10T18:38:00Z

@divijvaidya hello, I rerun the tests after merging #15093 and #15077.
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-15101/10/testReport/

There are failed tests from the org.apache.kafka.streams.integration package that have unexpected treads e.g. kafka-scheduler, Controller-0-to-broker-0-send-thread, ReplicaFetcherThread.
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-15101/10/testReport/org.apache.kafka.streams.integration/

It's also worth looking at the test results:
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-15101/10/testReport/kafka.utils/

The names of the expected threads are similar to https://github.com/apache/kafka/pull/15052/files#diff-b8f9f9d1b191457cbdb332a3429f0ad65b50fa4cef5af8562abcfd1f177a2cfeR2441

Is there anything I can do to improve my PR?

divijvaidya · 2024-01-11T08:56:39Z

One reason that some stream tests are failing because they use the same server for the duration of the entire test suite and don't create a server per test, for example MetricsIntegrationTest.shouldAddMetricsOnAllLevels().

Should we change our logic to test for these leaked threads at the end of every test class, instead of at the end of every test?

@gharris1727 thoughts?

gharris1727 · 2024-01-12T01:02:05Z

One reason that some stream tests are failing because they use the same server for the duration of the entire test suite and don't create a server per test, for example MetricsIntegrationTest.shouldAddMetricsOnAllLevels().
Should we change our logic to test for these leaked threads at the end of every test class, instead of at the end of every test?

@divijvaidya The stateful strategy of diffing the threads before and after each test should exclude the test-suite threads which are created before the test method begins. What is probably happening here is that there are some threads being lazily created as a side-effect of the actions of the test but which are not cleaned up immediately, but are later cleaned up by the whole cluster shutting down.

I wouldn't recommend immediately reducing the scope to the Class-only enforcement, as it makes it so much harder to blame specific tests which contain leaks. It's so much more helpful to get a "This test leaked a thread" warning than a "this suite leaked a test" warning, especially when the assertion doesn't tell you where the thread is being allocated.

As far as what to do next:

You could figure out how to disable the lazy threads with some development-only configuration
You could figure out how to clean up these lazy threads, either automatically or with a development-only hint method.
You could give these particular threads special privileges (by adding to the expected thread names)
You could give these particular tests special privileges (with an Ignore annotation)

These are technically leaks, and the extension is correctly finding them. It's up to us to figure out if they're worth fixing or if they're generally harmless.

add expected thread names

KAFKA-16072: JUnit 5 extension to detect thread leak

3eaa3ab

Added LeakTestingExtension based on TestUtils#verifyNoUnexpectedThreads

divijvaidya added the tests Test fixes (including flaky tests) label Jan 2, 2024

gharris1727 previously requested changes Jan 2, 2024

View reviewed changes

ashwinpankaj reviewed Jan 3, 2024

View reviewed changes

KAFKA-16072: JUnit 5 extension to detect thread leak

cfd1716

refactoring

wernerdv added 2 commits January 3, 2024 16:00

KAFKA-16072: JUnit 5 extension to detect thread leak

6387125

add expected thread names

KAFKA-16072: JUnit 5 extension to detect thread leak

5c707ad

add expected thread names

gharris1727 reviewed Jan 3, 2024

View reviewed changes

wernerdv added 2 commits January 3, 2024 22:10

Merge branch 'trunk' into KAFKA-16072

28c2626

KAFKA-16072: JUnit 5 extension to detect thread leak

7035b90

used ExtensionContext.Store

gharris1727 approved these changes Jan 3, 2024

View reviewed changes

divijvaidya reviewed Jan 3, 2024

View reviewed changes

KAFKA-16072: JUnit 5 extension to detect thread leak

9c4a431

update EXPECTED_THREAD_NAMES

ashwinpankaj reviewed Jan 4, 2024

View reviewed changes

wernerdv added 2 commits January 5, 2024 20:57

Merge branch 'trunk' into KAFKA-16072

5eedd69

KAFKA-16072: JUnit 5 extension to detect thread leak

7460dad

refactoring

wernerdv requested a review from ashwinpankaj January 5, 2024 21:12

wernerdv added 2 commits January 7, 2024 01:12

Merge branch 'trunk' into KAFKA-16072

ef44604

KAFKA-16072: JUnit 5 extension to detect thread leak

d26d948

updated EXPECTED_THREAD_NAMES

ashwinpankaj approved these changes Jan 9, 2024

View reviewed changes

Merge branch 'trunk' into KAFKA-16072

ef15ae1

wernerdv added 5 commits January 20, 2024 23:10

Merge branch 'trunk' into KAFKA-16072

c7f58fc

Merge branch 'trunk' into KAFKA-16072

776da95

Merge branch 'trunk' into KAFKA-16072

dc85ac5

KAFKA-16072: JUnit 5 extension to detect thread leak

93b5985

add expected thread names

Merge branch 'trunk' into KAFKA-16072

b1fb3ef

wernerdv closed this Mar 29, 2024

gharris1727 mentioned this pull request Jun 25, 2024

KAFKA-16791: Add thread detection to ClusterTestExtensions #16427

Closed

3 tasks

gharris1727 mentioned this pull request Jan 26, 2025

MINOR: KafkaProducerTest - Fix resource leakage and replace explicit invocation of close() method with try with resources #18678

Merged

3 tasks

KAFKA-16072: JUnit 5 extension to detect thread leak #15101

KAFKA-16072: JUnit 5 extension to detect thread leak #15101

Uh oh!

Conversation

wernerdv commented Jan 1, 2024

Committer Checklist (excluded from commit message)

Uh oh!

divijvaidya commented Jan 2, 2024

Uh oh!

wernerdv commented Jan 2, 2024

Uh oh!

showuon commented Jan 2, 2024

Uh oh!

gharris1727 left a comment

Choose a reason for hiding this comment

Uh oh!

ashwinpankaj Jan 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wernerdv commented Jan 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gharris1727 Jan 3, 2024

Choose a reason for hiding this comment

Uh oh!

gharris1727 commented Jan 3, 2024

Uh oh!

gharris1727 left a comment

Choose a reason for hiding this comment

Uh oh!

divijvaidya Jan 3, 2024

Choose a reason for hiding this comment

Uh oh!

gharris1727 Jan 3, 2024

Choose a reason for hiding this comment

Uh oh!

ashwinpankaj Jan 4, 2024

Choose a reason for hiding this comment

Uh oh!

wernerdv Jan 5, 2024

Choose a reason for hiding this comment

Uh oh!

wernerdv commented Jan 10, 2024

Uh oh!

divijvaidya commented Jan 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gharris1727 commented Jan 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ashwinpankaj Jan 3, 2024 •

edited

Loading

wernerdv commented Jan 3, 2024 •

edited

Loading

divijvaidya commented Jan 11, 2024 •

edited

Loading