Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [broker] fix mismatch between dispatcher.consumerList and dispatcher.consumerSet #22283

Merged
merged 7 commits into from
Mar 25, 2024

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Mar 16, 2024

Motivation

time client broker
1 Unload topic
2 Subscribe(1st) after disconnect
3 Initialize ManagedLedger. This operation is slowly
4 Subscribe(1st) timeout, try to close the consumer
5 Received the command CloseConsumer, then remove consumerFuture(1st) from ServerCnx, then the mechanism to prevent registering a consumer with the same id is invalid now
6 Initialize ManagedLedger finish, add consumer(1st) to Dispatcher.
  - dispatcher.consumerSet.size: 1
  - dispatcher.consumerList.size: 1
7 Retry to subscribe(2nd)
8 add consumer(2nd) to Dispatcher.
  - dispatcher.consumerSet.size: 1
  - dispatcher.consumerList.size: 2
9 Since the future of consumer(1st) is failed, remove it
  - dispatcher.consumerSet.size: 0
  - dispatcher.consumerList.size: 1

This issue may cause a stuck of dispatchers closing/unsubscribing and topic closing/deleting. See more context: #22270

BTW, the scenario of producer registration has no issue because the producers with the same ServerCnx + producerId will override each other.

Modifications

  • The subsequent subscribing will always fail until the previous one is done even if the previous subscribing was timed out in the client view, which guarantees that there is no race condition about twice subscribing.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Mar 16, 2024
@poorbarcode poorbarcode added this to the 3.3.0 milestone Mar 16, 2024
@poorbarcode poorbarcode added the type/bug The PR fixed a bug or issue reported a bug label Mar 16, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 16, 2024
Copy link
Contributor

@Technoboy- Technoboy- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same issue with #20583, right ?

@poorbarcode
Copy link
Contributor Author

poorbarcode commented Mar 19, 2024

The same issue with #20583, right ?

Yes, and this one #22270. Left a comment under #20583 here

@BewareMyPower
Copy link
Contributor

BewareMyPower commented Mar 20, 2024

It seems that consumerSet is only used for an efficient implementation of canUnsubscribe.

    public synchronized boolean canUnsubscribe(Consumer consumer) {
        return consumerList.size() == 1 && consumerSet.contains(consumer);
    }

However, I think it's over-designed. When the consumerList.size() is 1, iterating the consumerList should be quick. I'm going to look deeper to see if it's safe to remove the consumerSet, so I don't leave a review comment that could block this PR.

@BewareMyPower
Copy link
Contributor

I see the purpose of consumerSet is to make consumerFlow and removeConsumer more efficient now. So it won't be an issue.

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After checking more details. I believe the root cause of this issue is the new subscribe request should not remove the failed request of the previous subscribe request. This is the responsibility of the previous subscribe request.

} else if (existingConsumerFuture.isCompletedExceptionally()){
ServerError error = getErrorCodeWithErrorLog(existingConsumerFuture, true,
String.format("Consumer subscribe failure. remoteAddress: %s, subscription: %s",
remoteAddress, subscriptionName));
consumers.remove(consumerId, existingConsumerFuture);
commandSender.sendErrorResponse(requestId, error,
"Consumer that failed is already present on the connection");

We are using a failed future to ensure the new subscribe request from the same consumer should only be accepted once the old subscribe request is done. So I think the reasonable fix it to avoid the new subscribe request remove the failed future from the old subscribe request.

And we already have the mechanism to remove the failed future finally by the previous subscribe request.

try {
consumer.close();
log.info("[{}] Cleared consumer created after timeout on client side {}",
remoteAddress, consumer);
} catch (BrokerServiceException e) {
log.warn(
"[{}] Error closing consumer created"
+ " after timeout on client side {}: {}",
remoteAddress, consumer, e.getMessage());
}
consumers.remove(consumerId, consumerFuture);

@poorbarcode
Copy link
Contributor Author

@codelipenghui

We are using a failed future to ensure the new subscribe request from the same consumer should only be accepted once the old subscribe request is done. So I think the reasonable fix it to avoid the new subscribe request remove the failed future from the old subscribe request.
And we already have the mechanism to remove the failed future finally by the previous subscribe request.

Good suggestion, already changed the solution: The subsequent subscribing will always fail until the previous one is done even if the previous subscribing was timed out in the client view, which guarantees that there is no race condition about twice subscribing.

@poorbarcode
Copy link
Contributor Author

@poorbarcode Pease help write some comments about this change to say this is not expected. And I think in this case, close the dispatcher or topic might a better solution, just to let the consumers to reconnect again, because we don't think what exactly happened at that time.

Added.

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

@poorbarcode poorbarcode merged commit a52945b into apache:master Mar 25, 2024
50 checks passed
poorbarcode added a commit that referenced this pull request Mar 26, 2024
poorbarcode added a commit that referenced this pull request Mar 26, 2024
poorbarcode added a commit that referenced this pull request Mar 26, 2024
poorbarcode added a commit that referenced this pull request Mar 26, 2024
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 15, 2024
…tcher.consumerSet (apache#22283)

(cherry picked from commit a52945b)
(cherry picked from commit bec3be2)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 17, 2024
…tcher.consumerSet (apache#22283)

(cherry picked from commit a52945b)
(cherry picked from commit bec3be2)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 17, 2024
…tcher.consumerSet (apache#22283)

(cherry picked from commit a52945b)
(cherry picked from commit bec3be2)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 19, 2024
…tcher.consumerSet (apache#22283)

(cherry picked from commit a52945b)
(cherry picked from commit bec3be2)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 23, 2024
…tcher.consumerSet (apache#22283)

(cherry picked from commit a52945b)
(cherry picked from commit bec3be2)
nodece pushed a commit to ascentstream/pulsar that referenced this pull request May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants