Conversation

@C0urante
Contributor

These tests currently create threads that block forever until the JVM is shut down. This change unblocks those threads once their respective test cases are finished.

This is valuable not only for general code hygiene and resource utilization, but also for laying the groundwork for reusing an embedded Connect cluster across each of these test cases, which would drastically reduce test time. That's left for a follow-up PR, though.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ijuma
Member

ijuma commented Jun 15, 2022

@kkonstantine can you please review this?

@ijuma ijuma requested a review from kkonstantine June 15, 2022 16:47
@C0urante
Contributor Author

C0urante commented Jun 28, 2022

@ijuma it appears that @kkonstantine is busy at the moment.

@cadonna would you be willing to take a look, since you were previously in the neighborhood? This change is much less involved but touches on the same BlockingConnectorTest suite.

@ijuma
Member

ijuma commented Aug 10, 2022

@C0urante It's a bit difficult to review since there is no explanation on what each latch does. If you add some explanatory comments to the code, I can try to review it.

@C0urante C0urante force-pushed the kafka-12657-thread-leak branch from ddd55b2 to f936f46 Compare August 16, 2022 16:50
@C0urante
Contributor Author

Thanks @ijuma, I've updated the PR with some more details on the purpose of each latch type, and addressed a small bug that would have caused connectors/tasks to become unblocked incorrectly during testing.

@C0urante
Contributor Author

C0urante commented Jun 6, 2023

@tombentley @viktorsomogyi if you have a moment, would you mind taking a look? Thanks!

@viktorsomogyi viktorsomogyi self-requested a review June 19, 2023 11:16
Contributor

@viktorsomogyi viktorsomogyi left a comment

@C0urante I'm not that familiar (yet) with this but I hope the comments make (some) sense 🙂

Comment on lines 449 to 518
    while (true) {
        try {
-           Thread.sleep(Long.MAX_VALUE);
+           blockLatch.await();
+           log.debug("Instructed to stop blocking; will resume normal execution");
+           return;
        } catch (InterruptedException e) {
-           // No-op. Just keep blocking.
+           log.debug("Interrupted while blocking; will continue blocking until instructed to stop");
        }
    }
Contributor

Wouldn't this while loop prevent the normal shutdown of Connect, given the order of operations in BlockingConnectorTest (connect.stop is called before Block.reset)? For instance, the way Worker shuts down, it expects WorkerConnectors to respond to an interrupt.

Code reference:

ThreadUtils.shutdownExecutorServiceQuietly(executor, EXECUTOR_SHUTDOWN_TERMINATION_TIMEOUT_MS, TimeUnit.MILLISECONDS);

Contributor Author

Sort of--this prevents the Worker class's executor field from shutting down gracefully (i.e., when we invoke awaitTermination on it in ThreadUtils::shutdownExecutorServiceQuietly), but it doesn't prevent the Connect worker from shutting down, since we put a bound on how long we wait for the executor to shut down before moving on.

This is why the tests on trunk (which also have this kind of while (true) loop to simulate connector/task blocks) don't hang.
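The bounded-wait behavior described here can be reproduced in a small standalone sketch (this is an illustration with assumed timeout values, not the Worker's actual code; `stubbornTask` is an invented stand-in for a connector that ignores interrupts the way the test's block loop does):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedShutdownDemo {

    // Invented stand-in for a connector/task that swallows interrupts
    static Runnable stubbornTask() {
        return () -> {
            while (true) {
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException e) {
                    // Keep blocking; ignore the interrupt
                }
            }
        };
    }

    // Returns true if the caller was able to move on even though the
    // submitted task never terminated
    public static boolean shutdownMovesOn() {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // daemon, so this demo JVM can still exit
            return t;
        });
        executor.execute(stubbornTask());
        executor.shutdownNow(); // interrupts the task, which ignores it
        try {
            // Bounded wait: returns false on timeout instead of hanging forever
            boolean terminated = executor.awaitTermination(200, TimeUnit.MILLISECONDS);
            return !terminated;
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("caller proceeded despite blocked task: " + shutdownMovesOn());
    }
}
```

The executor itself never terminates gracefully, but because awaitTermination is given a bound, the caller's shutdown sequence completes anyway.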

    connect.stop();
-   Block.resetBlockLatch();
+   // unblock everything so that we don't leak threads after each test run
+   Block.reset();
Contributor

WDYT about resetting before stopping the workers, to allow a normal shutdown to happen?

Contributor Author

It may be valuable to ensure that workers can shut down gracefully under these circumstances. Thoughts?

Contributor

I think that should take place in a test then, not in the cleanup.

The reason I bring this up is that if I were to assert that the clients/threads are all stopped immediately after Block.reset() (as implemented in #14783) there's no synchronization to ensure that cleanup takes place before the assertion fires. The "asynchronous cleanup" initiated by Block.reset could exceed the lifetime of the test, still leaking the threads but only temporarily.

Contributor Author

I think that should take place in a test then, not in the cleanup.

It's easier to handle in a single method rather than copy over to 11 test cases, though. And I also don't necessarily see why @After-annotated methods need to be used exclusively for cleanup.

The concern about threads leaking (even if for a short period) beyond the scope of the test definitely seems valid. I've pushed a tweak that adds logic to wait for the blocked threads to complete in Block::reset. LMK if this seems clean enough; if not, I can bite the bullet and reverse the order of operations in BlockingConnectorTest::close and then see about adding more explicit checks for graceful worker shutdown in other places.
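The tweak described above can be sketched roughly like this (a minimal standalone illustration, not the PR's actual code; `doneLatch` and the method shapes are invented for the sketch). The point is that reset() both releases the blocked threads and waits for them to resume, so a leak check immediately after cleanup does not race with them:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BlockSketch {
    private static final CountDownLatch blockLatch = new CountDownLatch(1);
    private static final CountDownLatch doneLatch = new CountDownLatch(1);

    // Called from the connector/task thread: blocks until released, then
    // reports that it has resumed
    public static void block() {
        try {
            blockLatch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        doneLatch.countDown();
    }

    // Called from the test thread during cleanup: releases the blocked
    // thread and waits for it to actually resume before returning
    public static boolean reset(long timeoutMs) {
        blockLatch.countDown(); // unblock everything
        try {
            return doneLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        new Thread(BlockSketch::block).start();
        System.out.println("cleanup finished in time: " + reset(5_000));
    }
}
```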

Contributor

It's easier to handle in a single method rather than copy over to 11 test cases

Oh I see, I thought it would be sufficient to add one test case that called stop() to verify that one type of blocked thread still allows shutdown to complete, rather than verifying it for all of the different ways of blocking threads. That would have less coverage than the current tests on trunk.

I've pushed a tweak that adds logic to wait for the blocked threads to complete in Block::reset.

I think this is probably the better solution. The leak tester can separately verify that resources were closed properly now that the test ensures the threads stop. 👍

@@ -350,13 +353,16 @@ private void assertRequestTimesOut(String requestDescription, ThrowingRunnable r
}

private static class Block {
Contributor

can you make this public to allow OffsetsApiIntegrationTest to use the latch?

and do you think that maybe these connectors should be moved out of this test to a common reusable class?

Contributor Author

can you make this public to allow OffsetsApiIntegrationTest to use the latch?

Yep, done 👍

and do you think that maybe these connectors should be moved out of this test to a common reusable class?

I do think this would be cleaner, but it'd be fairly involved. Do you think it's alright to merge this as-is without blocking on that?

Contributor

Do you think it's alright to merge this as-is without blocking on that?

Yep not a blocker.

}

if (awaitBlockLatch == null) {
throw new IllegalArgumentException("No connector has been created yet");
Contributor

Is this an opportunity for a flaky failure, if the test thread advances before the connector is created? It seems very rare; I don't see any instances on the Gradle dashboard.

Contributor Author

Good point--given that we're not guaranteed that, e.g., Connector::start has been invoked after a REST request to create a connector has returned, this does seem like a chance for a flaky failure.

I've tweaked this part to handle the case when awaitBlockLatch is null gracefully, without risking blocking forever.
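One way to handle a not-yet-created connector without the risk of hanging is to bound the entire wait under a single deadline, polling for the latch to appear first. This is a standalone sketch under assumed names (`connectorCreated`, `connectorBlocked`, and the 10 ms poll interval are invented for illustration), not the PR's exact fix:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class WaitForBlockSketch {
    private static volatile CountDownLatch awaitBlockLatch; // set when a connector is instantiated

    public static void connectorCreated() {
        awaitBlockLatch = new CountDownLatch(1);
    }

    public static void connectorBlocked() {
        awaitBlockLatch.countDown();
    }

    // Polls for the latch to exist, then waits on it, all under one deadline,
    // so a connector that never appears cannot block the test forever
    public static boolean waitForBlock(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        try {
            while (awaitBlockLatch == null) {
                if (System.currentTimeMillis() >= deadline)
                    return false; // connector never showed up; give up instead of hanging
                Thread.sleep(10);
            }
            long remaining = Math.max(0, deadline - System.currentTimeMillis());
            return awaitBlockLatch.await(remaining, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        connectorCreated();
        connectorBlocked();
        System.out.println("block observed: " + waitForBlock(1_000));
    }
}
```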

log.debug("Waiting for connector to block");
    log.debug("Waiting for connector to block");
-   if (!blockLatch.await(CONNECTOR_BLOCK_TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
+   if (!awaitBlockLatch.await(CONNECTOR_BLOCK_TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
        throw new TimeoutException("Timed out waiting for connector to block.");
Contributor

Since scanning creates connector instances, and validation caches the connector instance, how do you ensure that the right awaitBlockLatch is being waited on here?

Contributor Author

Right now it's a matter of writing test code carefully, with the assumption that if any connector or task instance has hit the block, it's the one we're interested in. So far I believe this holds for all the tests; let me know if you've found any exceptions, though.

Contributor

If you add this stanza before the log.debug("Connector should now be blocked") the tests still pass:

boolean retry;
synchronized (Block.class) {
    retry = Block.awaitBlockLatch != null && Block.awaitBlockLatch != awaitBlockLatch;
}
if (retry) {
    log.debug("New blocking instance was created, retrying wait");
    waitForBlock();
}

For me, I see this being printed in:

  • testBlockInSinkTaskStart
  • testBlockInConnectorStart
  • testWorkerRestartWithBlockInConnectorStart
  • testBlockInSourceTaskStart
  • testBlockInConnectorInitialize

This leads me to believe that this function is normally exiting before the blocking method of the last-instantiated instance happens.

I don't immediately see how this could cause flakiness, but it's at least an instance of the method not doing what it says it does.

Contributor

@gharris1727 gharris1727 Dec 13, 2023

WDYT about a Block.prepare() called before the test starts that creates the awaitBlockLatch, instead of having the Block constructor initialize it? That could eliminate the wait-notify mechanism, since only one thread (the test thread) would be responsible for setting/clearing the awaitBlockLatch.

edit: Would this also allow you to block in methods used during plugin scanning, if you only started blocking if the awaitBlockLatch had been prepared first?
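A rough sketch of what that prepare() pattern could look like (hypothetical names, not code from the PR): only the test thread ever writes the latch, and connector instances created before prepare() is called, such as those made during plugin scanning, simply skip blocking.

```java
import java.util.concurrent.CountDownLatch;

public class PrepareSketch {
    private static volatile CountDownLatch awaitBlockLatch; // written only by the test thread

    // Test thread calls this before the test body runs
    public static void prepare() {
        awaitBlockLatch = new CountDownLatch(1);
    }

    // Connector thread: blocks only if the test has prepared a latch, so
    // instances created during plugin scanning (before prepare()) skip it
    public static boolean maybeBlock() {
        CountDownLatch latch = awaitBlockLatch;
        if (latch == null)
            return false; // not prepared yet; don't block
        latch.countDown(); // signal that the block point was reached
        return true;
    }

    public static void main(String[] args) {
        prepare();
        System.out.println("after prepare: " + maybeBlock());
    }
}
```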

Contributor Author

I don't believe these changes are necessary here since the portions they address are not affected by the PR. If you would like to do this cleanup in a separate PR, I'd be happy to review.

@C0urante C0urante force-pushed the kafka-12657-thread-leak branch from f936f46 to ef89fdb Compare December 13, 2023 14:56
-   blockLatch.countDown();
+   CountDownLatch blockLatch;
+   synchronized (Block.class) {
+       awaitBlockLatch.countDown();
Contributor

nit: small NPE here under this sequence of calls:

  1. new Block(s)
  2. Block.reset()
  3. block.maybeBlockOn(s)
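The guard the nit asks for can be sketched like this (illustrative names, not the PR's code): snapshot the shared latch under the lock and null-check it, so a reset() that fires between construction and maybeBlockOn() results in a no-op rather than an NPE.

```java
import java.util.concurrent.CountDownLatch;

public class NullSafeBlock {
    private static CountDownLatch awaitBlockLatch;

    public static synchronized void prepare() {
        awaitBlockLatch = new CountDownLatch(1);
    }

    public static synchronized void reset() {
        awaitBlockLatch = null; // cleanup between tests
    }

    // Returns true if a latch was present and signaled; false (instead of
    // throwing an NPE) if reset() already cleared it
    public static boolean maybeSignalBlocked() {
        CountDownLatch latch;
        synchronized (NullSafeBlock.class) {
            latch = awaitBlockLatch; // snapshot under the lock
        }
        if (latch == null)
            return false; // guards the new Block -> reset -> maybeBlockOn sequence
        latch.countDown();
        return true;
    }

    public static void main(String[] args) {
        prepare();
        reset();
        System.out.println("signaled after reset: " + maybeSignalBlocked());
    }
}
```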

C0urante and others added 5 commits January 12, 2024 15:27
- Reset block latch in OffsetsApiIntegrationTest cases that use the blocking
  connector
- Harden Block::waitForBlock against possible race condition caused by connector
  creation latency
@C0urante C0urante force-pushed the kafka-12657-thread-leak branch from 4e9e7bc to a2fa402 Compare January 12, 2024 20:40
@C0urante C0urante merged commit a989329 into apache:trunk Jan 18, 2024
@C0urante C0urante deleted the kafka-12657-thread-leak branch January 18, 2024 16:11
dajac pushed a commit to dajac/kafka that referenced this pull request Jan 19, 2024
Reviewers: Kvicii <kvicii.yu@gmail.com>, Viktor Somogyi-Vass <viktorsomogyi@gmail.com>, Greg Harris <greg.harris@aiven.io>
showuon pushed a commit to showuon/kafka that referenced this pull request Jan 22, 2024
drawxy pushed a commit to drawxy/kafka that referenced this pull request Jan 23, 2024
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
Phuc-Hong-Tran pushed a commit to Phuc-Hong-Tran/kafka that referenced this pull request Jun 6, 2024