KAFKA-15827: Prevent KafkaBasedLog subclasses from leaking passed-in clients #14763

gharris1727 · 2023-11-14T20:57:29Z

The KafkaBasedLog normally creates clients during start() and closes them in stop().
Some KafkaBasedLog subclasses accept already-created clients, and close them in stop() if start() is called first.
These clients should also be closed if stop() is called without first calling start().
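To illustrate the lifecycle described above, here is a minimal sketch. `PassedInClientLog` and its fields are hypothetical stand-ins, not the actual KafkaBasedLog code, and the clients are modeled as plain `AutoCloseable` instances. The point it shows is that `stop()` closes the passed-in clients whether or not `start()` was ever called.

```java
/** Hypothetical stand-in for a log that is handed already-created clients. */
class PassedInClientLog {
    private final AutoCloseable producer;
    private final AutoCloseable consumer;
    private boolean started;

    PassedInClientLog(AutoCloseable producer, AutoCloseable consumer) {
        this.producer = producer;
        this.consumer = consumer;
    }

    void start() {
        // A real implementation would begin reading the topic here.
        started = true;
    }

    void stop() {
        if (started) {
            // Work that only exists after start() (threads, callbacks) is torn down here.
            started = false;
        }
        // The clients were created by the caller and handed over in the constructor,
        // so they are closed unconditionally, even if start() never ran.
        closeQuietly(producer);
        closeQuietly(consumer);
    }

    private static void closeQuietly(AutoCloseable client) {
        if (client == null) {
            return;
        }
        try {
            client.close();
        } catch (Exception e) {
            System.err.println("Failed to close client: " + e);
        }
    }
}
```

Calling `stop()` on an instance that was never started should still close both clients, which is the case this PR is about.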

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

Signed-off-by: Greg Harris <greg.harris@aiven.io>

C0urante · 2023-11-25T22:23:26Z

Can we add a comment to the class explaining why we're explicitly closing the clients even though it also looks like they're closed in the superclass's stop method?

Also, is it worth adding a test (or augmenting one or more existing test cases) for this?

gharris1727 · 2023-11-27T18:31:54Z

> Also, is it worth adding a test (or augmenting one or more existing test cases) for this?

I added a test to KafkaBasedLogTest, but found that it was difficult to set up the same test for the duplicate implementation in OffsetSyncStore. Instead, I eliminated the OffsetSyncStore implementation because I felt it duplicated the withExistingClients method. I think we had some discussion about why adding a new constructor to KafkaBasedLog was not viable, but would you consider changing the signature of the withExistingClients method?

connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java

C0urante

LGTM, thanks Greg! Left one nit, feel free to address or ignore at your discretion.

I think it's fine to tweak the withExistingClients factory method. Hopefully nobody's relying on that in their connectors, and if they are, maybe it's time for a KIP where we officially declare part or all of this class as public API and have to start making future changes in other places or only after follow-up KIPs.

gharris1727 · 2023-12-11T22:08:07Z

Hi @C0urante, I found another very similar problem: InternalTopicsIntegrationTest was causing us to leak clients.

There, the "bad topics" cause startServices to throw an exception, preventing stopServices from being called in halt(). I implemented nearly the same fix, by adding a stopServices call to the DistributedHerder::stop method, similar to what is already happening in the StandaloneHerder.

PTAL thanks!

C0urante · 2023-12-11T22:54:57Z

...ct/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java

        ThreadUtils.shutdownExecutorServiceQuietly(herderExecutor, herderExecutorTimeoutMs(), TimeUnit.MILLISECONDS);
        ThreadUtils.shutdownExecutorServiceQuietly(forwardRequestExecutor, FORWARD_REQUEST_SHUTDOWN_TIMEOUT_MS, TimeUnit.MILLISECONDS);
        ThreadUtils.shutdownExecutorServiceQuietly(startAndStopExecutor, START_AND_STOP_SHUTDOWN_TIMEOUT_MS, TimeUnit.MILLISECONDS);
+        stopServices();


Would putting stopServices in the finally block in halt achieve the same thing? I'm a little worried about this exacerbating ungraceful shutdowns in scenarios other than the one we're trying to address with this change.

gharris1727 · Dec 11, 2023

> Would putting stopServices in the finally block in halt achieve the same thing?

Do you mean finally in DistributedHerder#run? halt is never called if startServices throws.

I considered putting stopServices in catch in DistributedHerder#run, but I saw that stopServices was already called in StandaloneHerder#stop and followed the same pattern.

> I'm a little worried about this exacerbating ungraceful shutdowns in scenarios other than the one we're trying to address with this change.

For ungraceful shutdowns that end with the herder thread throwing an exception, it should still call exit and kill the process, without ever running this code.

For ungraceful shutdowns that don't throw an exception, I suppose the herder could block and ThreadUtils.shutdownExecutorServiceQuietly(herderExecutor, ...) would time out, and this stopServices would close the KafkaBasedLogs underneath the running herder, which could be bad.

C0urante · Dec 12, 2023

> Do you mean finally in DistributedHerder#run? halt is never called if startServices throws.

Ugh sorry, yes. This is my reward for rushing a review before I close the laptop for the day!

> For ungraceful shutdowns that don't throw an exception, I suppose the herder could block and ThreadUtils.shutdownExecutorServiceQuietly(herderExecutor, ...) would time out, and this stopServices would close the KafkaBasedLogs underneath the running herder, which could be bad.

Yeah, this is what I'm afraid of.

Alternatives also include wrapping startServices with try/catch logic to automatically invoke stopServices if anything goes wrong (and then proceed to throw the original exception).
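A rough sketch of that alternative, for illustration only: `ServiceLifecycle`, `Service`, and `startOrCleanUp` are hypothetical names, not anything in DistributedHerder. It stops the service if starting fails, then rethrows the original exception, attaching any cleanup failure as a suppressed exception.

```java
class ServiceLifecycle {
    /** Hypothetical stand-in for whatever startServices/stopServices manage. */
    interface Service {
        void start();
        void stop();
    }

    /** Starts the service; on failure, stops it and rethrows the original exception. */
    static void startOrCleanUp(Service service) {
        try {
            service.start();
        } catch (RuntimeException | Error startFailure) {
            try {
                // Release whatever start() managed to allocate before it failed.
                service.stop();
            } catch (RuntimeException stopFailure) {
                // Keep the original failure as the primary exception.
                startFailure.addSuppressed(stopFailure);
            }
            throw startFailure;
        }
    }
}
```

The caller still sees the exception from `start()`, so existing error handling (including any exit procedure) is unaffected by the added cleanup.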

Also, looking at this once more--doesn't the herderMetrics field never get closed in the same scenario that we're addressing with this PR?

BTW, this is starting to feel similar to the problem touched on in #11608. Generally, both involve failure to clean up resources when operations fail in the same scope that the resources were initialized in.

gharris1727 · Jan 17, 2024

I changed this to move the stopServices call from stop() to the catch block in run. This is because finally doesn't execute if exit() is called in the catch block, which is typical in production.

I moved the stop/close for all of the non-started resources to closeResources, which is always called even if stopServices() throws. Before, herderMetrics and member were only closed on the halt() happy path; now they are also closed if an exception kills the herder thread.
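The shape of that change can be sketched roughly as follows. This is a simplified model, not the DistributedHerder code: `HerderSketch`, the `events` list, the `failOnStart` flag, and the injected `exitProcedure` are all assumptions made so the control flow can be exercised without killing the JVM.

```java
import java.util.ArrayList;
import java.util.List;

class HerderSketch {
    final List<String> events = new ArrayList<>();
    private final boolean failOnStart;
    private final Runnable exitProcedure; // in production this would terminate the JVM

    HerderSketch(boolean failOnStart, Runnable exitProcedure) {
        this.failOnStart = failOnStart;
        this.exitProcedure = exitProcedure;
    }

    void run() {
        try {
            startServices();
            // ... the main loop would run here until the herder is asked to stop ...
            halt();
        } catch (Throwable t) {
            // Clean up in the catch block, before exiting: a finally block on this
            // try would never run once the exit procedure terminates the JVM.
            try {
                stopServicesAndCloseResources();
            } catch (Throwable cleanupFailure) {
                t.addSuppressed(cleanupFailure);
            }
            exitProcedure.run();
        }
    }

    private void halt() {
        stopServicesAndCloseResources();
    }

    private void stopServicesAndCloseResources() {
        try {
            stopServices();
        } finally {
            // Runs even if stopServices() throws.
            closeResources();
        }
    }

    private void startServices() {
        events.add("startServices");
        if (failOnStart) {
            throw new IllegalStateException("simulated startup failure");
        }
    }

    private void stopServices() {
        events.add("stopServices");
    }

    private void closeResources() {
        events.add("closeResources");
    }
}
```

In this model, both the happy path and the failed-startup path end with `stopServices` followed by `closeResources`; only the failure path invokes the exit procedure.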

C0urante

Thanks Greg, the approach here looks really clean. There's one failing unit test, DistributedHerderTest.testHaltCleansUpWorker, which doesn't like that the happy path for DistributedHerder::halt invokes WorkerGroupMember::stop twice. That behavior feels a little strange, but since WorkerGroupMember::stop is idempotent (as long as it's invoked from the same thread), I don't think we have to block on changing it, so feel free to just tweak the tests with atLeastOnce() and possibly a comment on idempotency for WorkerGroupMember::stop if that's preferable.

LGTM as long as the failing unit test is fixed and the rest of the CI build looks good 👍
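For readers unfamiliar with the idempotency point, a minimal sketch of an idempotent stop method follows. `IdempotentMember` is a hypothetical class, and the `AtomicBoolean` guard is an assumption chosen for the sketch; it is not how WorkerGroupMember::stop is implemented.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class IdempotentMember {
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    // Counts how many times the underlying resources were actually released.
    final AtomicInteger releases = new AtomicInteger();

    void stop() {
        // Only the first call flips the flag; later calls return without doing anything.
        if (!stopped.compareAndSet(false, true)) {
            return;
        }
        releases.incrementAndGet();
    }
}
```

With a stop method like this, a test that tolerates more than one invocation (for example via a Mockito `atLeastOnce()` verification) does not hide a double release, because the second call is a no-op.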

gharris1727 · 2024-01-19T20:46:52Z

Test failures appear unrelated, and the runtime tests pass locally for me.

gharris1727 added 2 commits November 14, 2023 12:54

KAFKA-15827: Close clients passed into KafkaBasedLog.withExistingClients

b37e31e

Signed-off-by: Greg Harris <greg.harris@aiven.io>

fixup: OffsetSyncStore leaks consumer

4566180

Signed-off-by: Greg Harris <greg.harris@aiven.io>

gharris1727 added connect mirror-maker-2 labels Nov 14, 2023

gharris1727 mentioned this pull request Nov 16, 2023

KAFKA-15845: Detect leaked Kafka clients and servers with LeakTestingExtension #14783

Closed

3 tasks

gharris1727 added 2 commits November 27, 2023 10:06

fixup: comments, tests

2c2523a

Signed-off-by: Greg Harris <greg.harris@aiven.io>

Reduce duplication by having withExistingClients accept a Predicate<TopicPartition>

a940f60

Signed-off-by: Greg Harris <greg.harris@aiven.io>

C0urante reviewed Dec 4, 2023

connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java

C0urante approved these changes Dec 4, 2023

gharris1727 added 2 commits December 4, 2023 10:43

fixup: unconditionally close clients, fix checkstyle

f285102

Signed-off-by: Greg Harris <greg.harris@aiven.io>

DistributedHerder should stop services during stop

02c5440

Signed-off-by: Greg Harris <greg.harris@aiven.io>

C0urante reviewed Dec 11, 2023

fixup: move resource cleanup from stop to run

8b31662

Signed-off-by: Greg Harris <greg.harris@aiven.io>

C0urante approved these changes Jan 18, 2024

fixup: remove duplicate member stop call

230c0a8

Signed-off-by: Greg Harris <greg.harris@aiven.io>

gharris1727 merged commit 9397146 into apache:trunk Jan 19, 2024

drawxy pushed a commit to drawxy/kafka that referenced this pull request Jan 23, 2024

KAFKA-15827: Prevent KafkaBasedLog subclasses from leaking passed-in clients (apache#14763)

9983b97

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Reviewers: Chris Egerton <chrise@aiven.io>

yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024

KAFKA-15827: Prevent KafkaBasedLog subclasses from leaking passed-in clients (apache#14763)

2dc0b34

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Reviewers: Chris Egerton <chrise@aiven.io>

clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024

KAFKA-15827: Prevent KafkaBasedLog subclasses from leaking passed-in clients (apache#14763)

1c680f5

Signed-off-by: Greg Harris <greg.harris@aiven.io>
Reviewers: Chris Egerton <chrise@aiven.io>
