-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker]Fix deadlock of metadata store #20189
Conversation
354ef5f
to
05e27e0
Compare
c66ebb5
to
a928f4a
Compare
@poorbarcode @Technoboy- is this related to #20130 changes and review comments in any way? |
Sorry, I didn't notice this PR before, but I sense that the two problems are very similar. This PR is not related to #20130. #20130 is used to solve the deadlock of the |
Now the current modification is problematic because the I will fix it tomorrow |
Codecov Report
@@ Coverage Diff @@
## master #20189 +/- ##
=============================================
+ Coverage 37.61% 72.98% +35.37%
- Complexity 12589 31971 +19382
=============================================
Files 1691 1868 +177
Lines 129028 138604 +9576
Branches 14066 15240 +1174
=============================================
+ Hits 48530 101166 +52636
+ Misses 74183 29399 -44784
- Partials 6315 8039 +1724
Flags with carried forward coverage won't be shown. Click here to find out more.
|
/** | ||
* @Deprecated This method is only used by test. call "isServiceUnitActiveAsync" is better. | ||
*/ | ||
@Deprecated | ||
public boolean isServiceUnitActive(TopicName topicName) { | ||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to use something like isServiceUnitActiveAsync(topicName).get(conf.getMetadataStoreOperationTimeoutSeconds(), SECONDS);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a good suggestion. Already edit this method to make the test that calls the method works better.
Could you take a look again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Motivation: This task loadOrCreatePersistentTopic occupied the event thread of the ZK client so that other ZK tasks could not be finished anymore(Including the task itself), and it calls bundlesCache.synchronous().get(nsname) which is a blocking method. Modification: Since the method getBundle(topic) will eventually call the method bundlesCache.synchronous().get(nsname), use getBundleAsync(topic) instead of getBundle(topic) to avoid blocking the thread. (cherry picked from commit 4678c36)
Motivation: This task loadOrCreatePersistentTopic occupied the event thread of the ZK client so that other ZK tasks could not be finished anymore(Including the task itself), and it calls bundlesCache.synchronous().get(nsname) which is a blocking method. Modification: Since the method getBundle(topic) will eventually call the method bundlesCache.synchronous().get(nsname), use getBundleAsync(topic) instead of getBundle(topic) to avoid blocking the thread. (cherry picked from commit 4678c36) (cherry picked from commit 8a1a4be)
Motivation: This task loadOrCreatePersistentTopic occupied the event thread of the ZK client so that other ZK tasks could not be finished anymore(Including the task itself), and it calls bundlesCache.synchronous().get(nsname) which is a blocking method. Modification: Since the method getBundle(topic) will eventually call the method bundlesCache.synchronous().get(nsname), use getBundleAsync(topic) instead of getBundle(topic) to avoid blocking the thread.
Motivation: This task loadOrCreatePersistentTopic occupied the event thread of the ZK client so that other ZK tasks could not be finished anymore(Including the task itself), and it calls bundlesCache.synchronous().get(nsname) which is a blocking method. Modification: Since the method getBundle(topic) will eventually call the method bundlesCache.synchronous().get(nsname), use getBundleAsync(topic) instead of getBundle(topic) to avoid blocking the thread. (cherry picked from commit 4678c36)
Motivation
loadOrCreatePersistentTopic
[1]: it callsgetBundles(load policies from ZK)
afterLockManagerImpl.lambda$acquireLock
. This all runs on the threadmain-EventThread.
It stuck at https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/common/naming/NamespaceBundleFactory.java#L233bundlesCache.synchronous().get(nsname)
which is a blocking method.splitBundle
[2]: remove old bundles and create news. It stuck at https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/ModularLoadManagerImpl.java#L1143 caused bymain-EventThread
is busy.lookup
[3]: refresh policies and get metadata. It stuck at https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/ModularLoadManagerImpl.java#L383 caused bybundlesCache
is loading.Note:
To reproduce
2.11.x
) using KOPPub & Sub.
[1]Task: load or create topics
[2] Task: split bundle
[3] Task: lookup
Modifications
Since the method
getBundle(topic)
will eventually call the methodbundlesCache.synchronous().get(nsname)
, usegetBundleAsync(topic)
instead ofgetBundle(topic)
to avoid blocking the thread.Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: