Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][client] Fix producer/consumer stop to reconnect or Pub/Sub due to IO thread race-condition #23499

Merged
merged 6 commits into from
Oct 23, 2024

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Oct 22, 2024

Motivation

Issue 1: see the test testUnstableNetWorkWhenCreatingProducer

The producer fails to start if the network is unstable.

time socket closed start producer
1 get a pooled connection from the pool
2 before registering the producer to the connection
3 socket closed
4 reconnect all producers, which is empty now
5 register the producer to a closed connection
6 failed to send Command.newProducer to brokers
7 users receive an error io.netty.channel.StacklessClosedChannelException: null

**Issue 2**: see the test: `testSendAfterCreateProducerAsync` - Start a producer/consumer async - Call a sync method in the callback method of method `createAsync`, such as `producer.send`, and `consumer.receive`. - The IO thread encounters a deadlock

Modifications

  • fix the issues above

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on addressing a hard problem, @poorbarcode! I added a few minor review comments.

@poorbarcode poorbarcode requested a review from lhotari October 22, 2024 16:03
@poorbarcode poorbarcode requested a review from lhotari October 22, 2024 17:02
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work @poorbarcode.

Copy link
Member

@nodece nodece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue 1:

It looks the CompletableFuture.supplyAsync(() -> clientCnx, clientCnx.ctx().executor()); can fix this issue, but I don’t understand this fix way.

In the old code, Isn't this callback thread clientCnx.ctx().executor()? I didn't double-check this part of the code.

LGTM

Issue 2:

You call the sync method in the async callback thread, this is an incorrect way. I don't suggest that you switch the thread to complete the callback.

In the past, we have removed many sync calls in the Pulsar.

@lhotari
Copy link
Member

lhotari commented Oct 22, 2024

You call the sync method in the async callback thread, this is an incorrect way. I don't suggest that you switch the thread to complete the callback.

@nodece please provide the exact code location that you are referring to. Thanks!

@lhotari
Copy link
Member

lhotari commented Oct 22, 2024

It looks the CompletableFuture.supplyAsync(() -> clientCnx, clientCnx.ctx().executor()); can fix this issue, but I don’t understand this fix way.

Indeed, it would be more consistent to use thenComposeAsync to switch the executor for all code branches instead of just a single branch.

@nodece
Copy link
Member

nodece commented Oct 23, 2024

You call the sync method in the async callback thread, this is an incorrect way. I don't suggest that you switch the thread to complete the callback.

@nodece please provide the exact code location that you are referring to. Thanks!

        client.newProducer().topic("test").createAsync().thenApply(producer -> {
            producer.send("hello-1".getBytes(StandardCharsets.UTF_8)); // incorrect, which will block the pulsar client thread.
            producer.sendAsync("hello-1".getBytes(StandardCharsets.UTF_8)); // correct
        });

User should not call a sync method in the callback of the async method.

+1

@nodece nodece changed the title [fix] [client] Fix producer/consumer stop to reconnect or Pub/Sub due to IO thread race-condition [fix][client] Fix producer/consumer stop to reconnect or Pub/Sub due to IO thread race-condition Oct 23, 2024
@Technoboy- Technoboy- merged commit ff4a25e into apache:master Oct 23, 2024
55 of 56 checks passed
lhotari pushed a commit that referenced this pull request Oct 23, 2024
…to IO thread race-condition (#23499)

(cherry picked from commit ff4a25e)
lhotari pushed a commit that referenced this pull request Oct 23, 2024
…to IO thread race-condition (#23499)

(cherry picked from commit ff4a25e)
lhotari pushed a commit that referenced this pull request Oct 23, 2024
…to IO thread race-condition (#23499)

(cherry picked from commit ff4a25e)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Oct 28, 2024
…to IO thread race-condition (apache#23499)

(cherry picked from commit ff4a25e)
(cherry picked from commit 8bb0df0)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Oct 28, 2024
…to IO thread race-condition (apache#23499)

(cherry picked from commit ff4a25e)
(cherry picked from commit 8bb0df0)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Nov 6, 2024
…to IO thread race-condition (apache#23499)

(cherry picked from commit ff4a25e)
(cherry picked from commit 8bb0df0)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Nov 7, 2024
…to IO thread race-condition (apache#23499)

(cherry picked from commit ff4a25e)
(cherry picked from commit 8bb0df0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants