Skip to content

Conversation

@wernerdv
Copy link
Contributor

@wernerdv wernerdv commented Jan 1, 2024

Added LeakTestingExtension based on TestUtils#verifyNoUnexpectedThreads for verify that there are no lingering threads left over. Extension execute after every test.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Added LeakTestingExtension based on TestUtils#verifyNoUnexpectedThreads
@divijvaidya divijvaidya added the tests Test fixes (including flaky tests) label Jan 2, 2024
@divijvaidya
Copy link
Member

Thank you for the change @wernerdv . As a next step, we need to fix the leaks that were detected by this PR before merging this. I already know of at least two PRs that will help: #15093 and #15077 . Let's wait for them to merge first and then we can rebase this on trunk and run again.

@wernerdv
Copy link
Contributor Author

wernerdv commented Jan 2, 2024

@divijvaidya Thanks for the reply. Now extension runs only for modules that:
testImplementation project(':core')

Is this expected behavior or does it require improvement?

@showuon
Copy link
Member

showuon commented Jan 2, 2024

@wernerdv , thanks for the smart way to detect the thread leaking!
For now, only core module will introduce thousands of test errors. We need to fix all the thread leaking before extending to other modules. Thanks.

Copy link
Contributor

@gharris1727 gharris1727 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the PR!

This test utility suffers from "cascading failures" where if one test leaks a thread, then every test run afterwards in that same JVM will also fail the assertion. This makes it difficult to blame failures on the test which leaked the resource, something that I found extremely important when tracking down leaks with #14783 .

I think this flaw was borderline acceptable when tests opted-in to the leak checking, but I think it is unacceptable if it is unconditionally applied to every test. I think we need to use a stateful approach to diff the threads created between Before and After.

public class LeakTestingExtension implements AfterEachCallback {
@Override
public void afterEach(ExtensionContext extensionContext) {
TestUtils.verifyNoUnexpectedThreads("@AfterEach");
Copy link
Contributor

@ashwinpankaj ashwinpankaj Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wernerdv . I had a basic question about this change -
TestUtils.verifyNoUnexpectedThreads seems to check for the presence of specific type of threads. Will this check be sufficient to catch all types of thread leaks ?

@wernerdv
Copy link
Contributor Author

wernerdv commented Jan 3, 2024

@gharris1727 @ashwinpankaj
I've updated extension to check for leaked threads relative to threads created before the test.
This looks more correct.

It might be worth adding more names to the set of expected thread names.

Arrays.asList("junit-", "JMX", "feature-zk-node-event-process-thread", "ForkJoinPool", "executor-",
"kafka-scheduler-", "metrics-meter-tick-thread", "ReplicaFetcherThread", "scala-", "pool-")
);
private Set<Thread> initialThreads;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using a state variable here is essentially a static variable, and is shared between tests running concurrently.
We don't run concurrent tests in the same JVM by default, but it is easy to change that with configuration. Can you use the extensionContext to store the state to avoid this breaking in the future? https://junit.org/junit5/docs/current/user-guide/#extensions-keeping-state

@gharris1727 gharris1727 dismissed their stale review January 3, 2024 16:39

Cascading failure was fixed

@gharris1727
Copy link
Contributor

It might be worth adding more names to the set of expected thread names.

I think increasing the scope of this assertion by adding patterns/removing the pattern check/applying to tests outside of core are good ideas. We can land this change first and then explore those in follow-ups.

Copy link
Contributor

@gharris1727 gharris1727 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending a build proving that this assertion passes for the existing test suites.

Thanks @wernerdv for the contribution!

import static org.junit.jupiter.api.Assertions.assertTrue;

public class LeakTestingExtension implements BeforeEachCallback, AfterEachCallback {
private static final Set<String> EXPECTED_THREAD_NAMES = new HashSet<>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hy @gharris1727 @wernerdv
Can you please help me understand why it is expected to have kafka-scheduler or ExpirationReaper or ReplicaFetcherThread at the end or beginning of test? Isn't presence of these threads an indication of a thread leak? (for example, expiration reaper threads may mean that we don't close controllerAPIs correctly)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't notice that this is a set of threads that are allowed to leak, where the original TestUtils.verifyNoUnexpectedThreads was a set of threads which weren't allowed to leak. I assumed the mechanism was copy-pasted.

I don't know if this set needs to exist, especially now that the assertion is stateful. Perhaps we can empty this set completely to see what the effect is, and then add in what is necessary. I did see some flakiness with an early prototype of #14783 which sometimes detected the gradle runner opening sockets. Perhaps there are similar things which create background threads.

private Set<Thread> unexpectedThreads(Set<Thread> initialThreads) {
Set<Thread> finalThreads = Thread.getAllStackTraces().keySet();

if (initialThreads.size() != finalThreads.size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intialThreads and finalThreads could have the same size but contain different threads. I think we should match the names always.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwinpankaj removed size comparison and now thread names are checked.

@wernerdv wernerdv requested a review from ashwinpankaj January 5, 2024 21:12
@wernerdv
Copy link
Contributor Author

@divijvaidya hello, I rerun the tests after merging #15093 and #15077.
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-15101/10/testReport/

There are failed tests from the org.apache.kafka.streams.integration package that have unexpected treads e.g. kafka-scheduler, Controller-0-to-broker-0-send-thread, ReplicaFetcherThread.
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-15101/10/testReport/org.apache.kafka.streams.integration/

It's also worth looking at the test results:
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-15101/10/testReport/kafka.utils/

The names of the expected threads are similar to https://github.com/apache/kafka/pull/15052/files#diff-b8f9f9d1b191457cbdb332a3429f0ad65b50fa4cef5af8562abcfd1f177a2cfeR2441

Is there anything I can do to improve my PR?

@divijvaidya
Copy link
Member

divijvaidya commented Jan 11, 2024

One reason that some stream tests are failing because they use the same server for the duration of the entire test suite and don't create a server per test, for example MetricsIntegrationTest.shouldAddMetricsOnAllLevels().

Should we change our logic to test for these leaked threads at the end of every test class, instead of at the end of every test?

@gharris1727 thoughts?

@gharris1727
Copy link
Contributor

One reason that some stream tests are failing because they use the same server for the duration of the entire test suite and don't create a server per test, for example MetricsIntegrationTest.shouldAddMetricsOnAllLevels().
Should we change our logic to test for these leaked threads at the end of every test class, instead of at the end of every test?

@divijvaidya The stateful strategy of diffing the threads before and after each test should exclude the test-suite threads which are created before the test method begins. What is probably happening here is that there are some threads being lazily created as a side-effect of the actions of the test but which are not cleaned up immediately, but are later cleaned up by the whole cluster shutting down.

I wouldn't recommend immediately reducing the scope to the Class-only enforcement, as it makes it so much harder to blame specific tests which contain leaks. It's so much more helpful to get a "This test leaked a thread" warning than a "this suite leaked a test" warning, especially when the assertion doesn't tell you where the thread is being allocated.

As far as what to do next:

  1. You could figure out how to disable the lazy threads with some development-only configuration
  2. You could figure out how to clean up these lazy threads, either automatically or with a development-only hint method.
  3. You could give these particular threads special privileges (by adding to the expected thread names)
  4. You could give these particular tests special privileges (with an Ignore annotation)

These are technically leaks, and the extension is correctly finding them. It's up to us to figure out if they're worth fixing or if they're generally harmless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants