[fix][broker] One topic can be closed multiple times concurrently #17524
This PR should merge into these branches:
/pulsarbot rerun-failure-checks
The pr had no activity for 30 days, mark with Stale label.
Since we will start the RC version of
So drag this PR to
The pr had no activity for 30 days, mark with Stale label.
@poorbarcode please rebase (or merge changes from master to) this PR
@poorbarcode please rebase this PR. I guess this is still relevant and in the same category as #20540.
Force-pushed from 41010f3 to 9aee3df.
Rebased
Thanks for fixing this issue.
However, I think this Topic state management needs a serious refactoring.
I suggest defining a TopicState and revisiting topic state transitions in a state-machine manner.
Agree with you
@heesung-sn I agree that a state machine style would result in a more maintainable solution. We can handle that in a second step. There is urgency to address the long outstanding topic closing issues and this PR makes good progress in that area.
@poorbarcode looks like OneWayReplicatorTest.testUnFenceTopicToReuse fails
Sorry, I found a behavior change (before: the broker tries to unfence the topic for reuse when closing clients fails; after: this mechanism does not work), and it is difficult to fix gracefully. I will try to fix it tomorrow.
Force-pushed from 473bbd4 to 057fe1d.
Fixed, the code is ugly now, sorry. Please review it again. Thanks. @lhotari @heesung-sn
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #17524 +/- ##
============================================
+ Coverage 73.57% 73.93% +0.35%
- Complexity 32624 32640 +16
============================================
Files 1877 1885 +8
Lines 139502 140679 +1177
Branches 15299 15465 +166
============================================
+ Hits 102638 104010 +1372
+ Misses 28908 28622 -286
- Partials 7956 8047 +91
…ache#17524) (cherry picked from commit 93afd89) (cherry picked from commit 620fe9b)
Motivation
Using the transaction feature, we sent and received messages while executing `admin API: unload namespace` 1000 times. Then the problem occurred: the consumer could not receive any messages, and there was no error log. We then called `admin API: get topic stats`; the response showed that only producers were registered on the topic and no consumers, yet the consumer state was `Ready` in the client. This means the consumer's state is inconsistent between the broker and the client.
Locating the problem
We then found the cause: two `PersistentTopic` instances with the same name were registered on one broker node. The consumer was registered on one of them (call it `topic-c`) and the producer on the other (call it `topic-p`). At this point, when we send messages, the data flows like this:
[Figure: data flow when sending messages]
But the consumer is actually registered on the other topic, `topic-c`, so the consumer cannot receive any messages.
Reproduce
Run `transaction buffer recover`, `admin unload namespace`, `client create consumer`, and `client create producer` at the same time. The process flows like this (the problem occurs at step 11):
[Step table not recoverable from the source. Surviving column headers: `transaction buffer recover`, `admin unload namespace`, `client create consumer`, `client create producer`. Surviving cell fragments refer to `topic-c`, `topic-p`, and `brokerService.topics`.]
`Time` is used only to indicate the order of each step, not the actual time.
Even if the persistent topic property `isClosingOrDeleting` has already changed to `true`, `close` can still be executed one more time; see line 1247:
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java
Lines 1240 to 1249 in f230d15
Whether `close` can be executed depends on two predicates: whether the topic is already closing, and whether the parameter `closeWithoutWaitingClientDisconnect` is `true`. This means `topic.close` can be re-entered when `closeWithoutWaitingClientDisconnect` is `true`, and the implementation of `admin API: unload namespace` passes exactly `true` for this parameter:
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java
Lines 723 to 725 in f230d15
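The re-entrancy described above can be illustrated with a minimal sketch. It is not the real `PersistentTopic` code: the class `ReentrantCloseSketch` and the counter `closeRuns` are hypothetical, and the guard is an assumption that mirrors the two predicates as the description states them.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a topic; NOT the real PersistentTopic class.
class ReentrantCloseSketch {
    private boolean isClosingOrDeleting = false;
    // Counts how many times the close body actually ran.
    final AtomicInteger closeRuns = new AtomicInteger();

    // Assumed guard, following the description: the body runs if the topic is
    // not yet closing OR the caller passes closeWithoutWaitingClientDisconnect = true.
    synchronized boolean close(boolean closeWithoutWaitingClientDisconnect) {
        if (!isClosingOrDeleting || closeWithoutWaitingClientDisconnect) {
            isClosingOrDeleting = true;
            closeRuns.incrementAndGet();
            return true;
        }
        return false; // rejected: a close is already in progress
    }

    public static void main(String[] args) {
        ReentrantCloseSketch topic = new ReentrantCloseSketch();
        topic.close(false); // first close: runs
        topic.close(false); // second close without the flag: rejected
        topic.close(true);  // with the flag: runs again although already closing
        System.out.println("close body ran " + topic.closeRuns.get() + " times"); // 2
    }
}
```

With the flag set to `true`, the "already closing" check no longer protects the close body, which is the situation the unload path is described as creating.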
So when `transaction buffer recover` fails and `admin unload namespace` is executed at the same time, with the recovery failure happening before the unload, the topic is removed from `brokerService.topics` twice. The current implementation of `BrokerService.removeTopicFromCache` uses `map.remove(key)` rather than `map.remove(key, value)`, so it removes whatever value is mapped to the key, even if it is not the intended one:
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1956 in f230d15
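The difference between the two removal calls can be shown with a plain `ConcurrentHashMap`. This is only a sketch: `TopicCacheSketch`, `removeByKey`, and `removeIfSame` are hypothetical names, the topics are plain `Object` instances, and the topic name is illustrative. It does not reproduce the real `BrokerService` code.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the broker's topic cache; NOT the real BrokerService.
class TopicCacheSketch {
    final ConcurrentHashMap<String, Object> topics = new ConcurrentHashMap<>();

    // Unconditional removal: drops whatever instance is currently mapped to the name.
    void removeByKey(String name) {
        topics.remove(name);
    }

    // Conditional removal: drops the entry only if it still maps to this exact instance.
    boolean removeIfSame(String name, Object topic) {
        return topics.remove(name, topic);
    }

    public static void main(String[] args) {
        String name = "persistent://tenant/ns/topic"; // illustrative name
        Object oldTopic = new Object();
        Object newTopic = new Object();

        TopicCacheSketch cache = new TopicCacheSketch();
        cache.topics.put(name, oldTopic);
        cache.removeByKey(name);          // first close of oldTopic
        cache.topics.put(name, newTopic); // a client reconnects; a new instance is registered
        cache.removeByKey(name);          // second close of oldTopic evicts newTopic
        System.out.println(cache.topics.containsKey(name)); // false

        cache.topics.put(name, newTopic);
        boolean removed = cache.removeIfSame(name, oldTopic); // stale close is a no-op
        System.out.println(removed + " " + (cache.topics.get(name) == newTopic)); // false true
    }
}
```

Once the newer instance has been evicted by the stale removal, a later lookup would create yet another instance under the same name, which matches the `topic-c` / `topic-p` split described earlier.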
To sum up, we should make these two changes:
- Make `topic.close` non-reentrant. Also prevent re-entrance between `topic.close` and `topic.delete`.
- Use `map.remove(key, value)` instead of `map.remove(key)` in the implementation of `BrokerService.removeTopicFromCache`. This change applies to both scenarios: `topic.close` and `topic.delete`.
Modifications
- Use `map.remove(key, value)` instead of `map.remove(key)` in the implementation of `BrokerService.`
Documentation
- doc-required
- doc-not-needed
- doc
- doc-complete