Skip to content

Conversation

@gharris1727
Copy link
Contributor

@gharris1727 gharris1727 commented Nov 16, 2023

There are currently many tests which instantiate sockets and channels, but don't close them. The vast majority of these sockets and channels are created via Selector and SocketServer, and are associated with clients and servers which are not closed properly. Many of these leaks are silent, and have gone unnoticed for years.

To allow us to detect these leaks and prevent future ones, we should have a Junit5 Extension which can detect when tests leak clients, and fail those tests with a diagnostic message.

Implementation note: I tried to keep the main changes as small as possible, making use of the javax.net Factory classes to create a layer of indirection in the NetworkContext. This is a much less invasive change than dependency-injecting the factories and possibly having to add more constructors to the Kafka clients.

While trying this out, I applied it automatically to all test suites with Jupiter's automatic registration mechanism. I added opt-out IgnoreAll and IgnoreEach annotations. I found that many test suites create a single Kafka cluster for the whole test, and this would cause the Each extension to generate false positives. This lead me to believe that Each should be opt-in instead, and a developer can add it temporarily with @ExtendWith(LeakTestingExtension.Each.class).

However, using automatic registration also means that consumers of the clients test-jar may unintentionally turn on this leak testing. We could avoid this by moving the extension to a different test-utils jar that is only for internal-use, or by disabling automatic registration and making this extension opt-in only.

This PR will cause many test failures for tests that are not compliant, and should not be merged as-is. For the set of leaks I found with this testing methodology, see:

If you have a leaked resource, it looks something like this (from KafkaProducerTest):

This test contains a resource leak. Close the resources instantiated in the test, or add the @LeakTestingExtension.Ignore annotation to the method and/or class.
java.lang.AssertionError: This test contains a resource leak. Close the resources instantiated in the test, or add the @LeakTestingExtension.Ignore annotation to the method and/or class.
	at org.apache.kafka.common.network.LeakTestingExtension.after(LeakTestingExtension.java:82)
	at org.apache.kafka.common.network.LeakTestingExtension.access$100(LeakTestingExtension.java:31)
	at org.apache.kafka.common.network.LeakTestingExtension$Each.afterEach(LeakTestingExtension.java:125)
	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeAfterEachCallbacks$12(TestMethodTestDescriptor.java:261)
<snipped>
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Caused by: java.lang.AssertionError: Leak check failed
	at org.apache.kafka.common.utils.LeakTester.lambda$combine$0(LeakTester.java:87)
	at org.apache.kafka.common.network.LeakTestingExtension.after(LeakTestingExtension.java:80)
	... 70 more
	Suppressed: java.lang.AssertionError: AbstractSelector instances left open
		at org.apache.kafka.common.utils.PredicateLeakTester.lambda$start$0(PredicateLeakTester.java:90)
		at org.apache.kafka.common.utils.LeakTester.lambda$combine$0(LeakTester.java:84)
		... 71 more
		Suppressed: java.lang.Exception: Opened sun.nio.ch.KQueueSelectorImpl
			at org.apache.kafka.common.utils.PredicateLeakTester.open(PredicateLeakTester.java:59)
			at org.apache.kafka.common.network.NetworkContextLeakTester$RecordingSelectorProvider.openSelector(NetworkContextLeakTester.java:110)
			at org.apache.kafka.common.network.Selector.<init>(Selector.java:159)
			at org.apache.kafka.common.network.Selector.<init>(Selector.java:212)
			at org.apache.kafka.common.network.Selector.<init>(Selector.java:224)
			at org.apache.kafka.common.network.Selector.<init>(Selector.java:228)
			at org.apache.kafka.clients.ClientUtils.createNetworkClient(ClientUtils.java:219)
			at org.apache.kafka.clients.ClientUtils.createNetworkClient(ClientUtils.java:160)
			at org.apache.kafka.clients.producer.KafkaProducer.newSender(KafkaProducer.java:514)
			at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:455)
			at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:289)
			at org.apache.kafka.clients.producer.KafkaProducerTest.lambda$testDeliveryTimeoutAndLingerMsConfig$54(KafkaProducerTest.java:2427)
			at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:71)
			at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:58)
			at org.junit.jupiter.api.Assertions.assertDoesNotThrow(Assertions.java:3224)
			at org.apache.kafka.clients.producer.KafkaProducerTest.testDeliveryTimeoutAndLingerMsConfig(KafkaProducerTest.java:2427)
<snipped>
			... 60 more

It shows the exact stack trace of where the socket or selector was instantiated, and attributes the failure to the test which caused it. I found that this was sufficient to find and fix leaks in parts of the codebase I had never seen before, so it should be very helpful for people who already know their way around the test and the surrounding code.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@gharris1727 gharris1727 added the tests Test fixes (including flaky tests) label Nov 16, 2023
@gharris1727 gharris1727 marked this pull request as ready for review January 2, 2024 16:53
@github-actions
Copy link

github-actions bot commented May 6, 2024

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added stale Stale PRs and removed stale Stale PRs labels May 6, 2024
…network resources

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727 gharris1727 force-pushed the kafka-15845-socket-leak-extension branch from b7bfba7 to be561e8 Compare August 6, 2024 00:05
Signed-off-by: Greg Harris <greg.harris@aiven.io>
…annotations

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727 gharris1727 force-pushed the kafka-15845-socket-leak-extension branch from be561e8 to 59a58df Compare August 7, 2024 18:06
@github-actions
Copy link

github-actions bot commented Jan 3, 2025

This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@divijvaidya
Copy link
Member

@gharris1727 shall we try to push it across the finish line? We can begin to enable this, at least for tests which are passing this check now.

@github-actions
Copy link

This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added the stale Stale PRs label Apr 25, 2025
@github-actions
Copy link

This PR has been closed since it has not had any activity in 120 days. If you feel like this
was a mistake, or you would like to continue working on it, please feel free to re-open the
PR and ask for a review.

@github-actions github-actions bot added the closed-stale PRs that were closed due to inactivity label May 25, 2025
@github-actions github-actions bot closed this May 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

closed-stale PRs that were closed due to inactivity stale Stale PRs tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants