
[fix][broker]Consumer can't consume messages because there has two sames topics in one broker #17526

Merged

Conversation

poorbarcode
Contributor

@poorbarcode poorbarcode commented Sep 7, 2022

Motivation

With the transaction feature enabled, we send and receive messages and, at the same time, execute the admin API unload namespace 1000 times. Then the problem occurs: the consumer cannot receive any messages, and there is no error log. After that we tried the admin API get topic stats, and the response showed that only producers were registered on the topic and no consumers were registered on it, while the consumer state was Ready on the client. This means that the state of the consumer is inconsistent between the broker and the client.

Locating the problem

Then we found the problem: two PersistentTopic instances with the same name were registered on one broker node; the consumer was registered on one of them (call it topic-c), and the producer was registered on the other (call it topic-p). At this point, when we send messages, the data flow looks like this:

client: producer sends a message

broker: handle cmd-send

broker: find the topic by name, it is "topic-p"

broker: find all subscriptions registered on "topic-p"

broker: found one subscription, but it has no consumers registered

broker: no need to send the message to the client

But the consumer is actually registered on the other topic, topic-c, so the consumer could not receive any messages.

Reproduce

How do two topics with the same name get registered on the same broker node?

Run transaction buffer recover, admin unload namespace, client create consumer, and client create producer at the same time; the process flows like this (the problem occurs at step 11):

| Time | transaction buffer recover | admin unload namespace | client create consumer | client create producer |
|------|----------------------------|------------------------|------------------------|------------------------|
| 1    | TB recover                 |                        |                        |                        |
| 2    | TB recover failure         | topic.unload           |                        |                        |
| 3    | topic.close(false)         | topic.close(true)      |                        |                        |
| 4    | brokerService.topics.remove(topicName) |            |                        |                        |
| 5    | remove topic finish        |                        | lookup                 |                        |
| 6    |                            |                        | create topic-c         |                        |
| 7    |                            |                        | consumer registered on topic-c |                |
| 8    |                            | brokerService.topics.remove(topic) |            |                        |
| 9    |                            | remove topic-c finish  |                        | lookup                 |
| 10   |                            |                        |                        | create topic-p         |
| 11   |                            |                        |                        | producer registered on topic-p |
  • Each column represents an individual process, e.g. client create consumer, client create producer.
  • Multiple processes run at the same time, and all of them affect brokerService.topics.
  • The Time column only indicates the order of the steps, not actual time.
  • The important steps are explained below:

step 3

Even if the persistent topic property isClosingOrDeleting has already been changed to true, close can still be executed once more, see line 1247:

public CompletableFuture<Void> close(boolean closeWithoutWaitingClientDisconnect) {
    CompletableFuture<Void> closeFuture = new CompletableFuture<>();
    lock.writeLock().lock();
    try {
        // closing managed-ledger waits until all producers/consumers/replicators get closed. Sometimes, broker
        // forcefully wants to close managed-ledger without waiting all resources to be closed.
        if (!isClosingOrDeleting || closeWithoutWaitingClientDisconnect) {
            fenceTopicToCloseOrDelete();
        } else {

Whether close executes depends on two predicates: the topic is not already closing or deleting, or @param closeWithoutWaitingClientDisconnect is true. This means that topic.close can be executed re-entrantly when @param closeWithoutWaitingClientDisconnect is true, and in the implementation of the admin API unload namespace the parameter closeWithoutWaitingClientDisconnect is exactly true:

public CompletableFuture<Void> unloadNamespaceBundle(NamespaceBundle bundle, long timeout, TimeUnit timeoutUnit) {
    return unloadNamespaceBundle(bundle, timeout, timeoutUnit, true);
}

So when transaction buffer recover and admin unload namespace are executed at the same time, and the transaction buffer recovery fails before the unload, the topic will be removed from brokerService.topics twice.
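As a minimal illustration of that check (a hypothetical standalone class, not the Pulsar source), the predicate !isClosingOrDeleting || closeWithoutWaitingClientDisconnect lets a second close(true) run even after a first close has already fenced the topic:

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: mirrors only the guard condition quoted above.
class CloseGuardSketch {
    private final AtomicBoolean isClosingOrDeleting = new AtomicBoolean(false);

    boolean tryClose(boolean closeWithoutWaitingClientDisconnect) {
        // Same predicate as the snippet from PersistentTopic#close: the second
        // operand bypasses the fence that the first close already set.
        if (!isClosingOrDeleting.get() || closeWithoutWaitingClientDisconnect) {
            isClosingOrDeleting.set(true);
            return true; // the close body (including the topics-map removal) runs
        }
        return false;
    }

    public static void main(String[] args) {
        CloseGuardSketch topic = new CloseGuardSketch();
        System.out.println(topic.tryClose(false)); // transaction buffer recover failure: true, fences the topic
        System.out.println(topic.tryClose(true));  // admin unload namespace: true again, so close runs a second time
    }
}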

step-4 / step-8

Because the current implementation of BrokerService.removeTopicFromCache uses map.remove(key) rather than map.remove(key, value), the call can remove any value stored under that key, even if it is not the one the caller intended to remove.

To sum up, we should make these two changes:

  • Make the method topic.close non-reentrant, and also prevent re-entrance between topic.close and topic.delete.
  • Use map.remove(key, value) instead of map.remove(key) in the implementation of BrokerService.removeTopicFromCache. This change applies to both scenarios: topic.close and topic.delete (see the sketch below).
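To illustrate the second change, here is a minimal self-contained sketch (topic names and values are hypothetical; this is not the PR's code) of why ConcurrentHashMap.remove(key, value) is the safe variant for removeTopicFromCache:

import java.util.concurrent.ConcurrentHashMap;

public class TopicCacheRemoveSketch {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Object> topics = new ConcurrentHashMap<>();
        Object oldTopic = new Object(); // the instance a stale close()/delete() still references
        Object newTopic = new Object(); // the instance a concurrent lookup has re-created

        topics.put("persistent://tenant/ns/topic", newTopic);

        // Unsafe: map.remove(key) would evict newTopic even though the caller
        // only meant to remove its own, already-closed instance.
        // topics.remove("persistent://tenant/ns/topic");

        // Safe: map.remove(key, value) is a no-op here, because the cached value
        // is not the caller's instance.
        boolean removed = topics.remove("persistent://tenant/ns/topic", oldTopic);
        System.out.println(removed);                                            // false
        System.out.println(topics.containsKey("persistent://tenant/ns/topic")); // true: newTopic survives
    }
}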

Modifications

Documentation

  • doc-required

  • doc-not-needed

  • doc

  • doc-complete

Matching PR in forked repository

PR in forked repository (all checks passed):

@poorbarcode
Contributor Author

This PR should merge into these branches:

  • branch-2.8
  • branch-2.9
  • branch-2.10
  • branch-2.11
  • master

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 7, 2022
@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

Contributor

@liangyepianzhou liangyepianzhou left a comment


Good work!

if (!createTopicFuture.isDone()) {
    return CompletableFuture.completedFuture(null);
}
return createTopicFuture.thenCompose(topicOptional -> {
Contributor


The createTopicFuture might be completed with an exception?

Contributor Author


This is a good point.

I have covered this logic branch: if the future in the cache completed exceptionally, the topic instance in the cache is not the same as the @param topic, so the delete will return success.
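A minimal sketch of that branch (hypothetical helper and parameter names, not the PR's actual code): a createTopicFuture that completed exceptionally cannot hold the caller's topic instance, so the removal can complete successfully without evicting anything:

import java.util.Optional;
import java.util.concurrent.CompletableFuture;

class RemoveFromCacheSketch {
    static CompletableFuture<Void> removeIfSameInstance(
            CompletableFuture<Optional<Object>> createTopicFuture, Object topic) {
        if (!createTopicFuture.isDone()) {
            // Still being created: it cannot hold the caller's (already created) topic instance.
            return CompletableFuture.completedFuture(null);
        }
        return createTopicFuture
                // A future that completed exceptionally yields no topic instance here.
                .exceptionally(ex -> Optional.empty())
                .thenAccept(topicOptional -> {
                    if (topicOptional.isPresent() && topicOptional.get() == topic) {
                        // only in this case would the cache entry actually be evicted
                    }
                    // otherwise there is nothing to do and the removal completes successfully
                });
    }
}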

@poorbarcode poorbarcode force-pushed the fix/topic_repeat_registry_split_1 branch from d7475f6 to 33e4f11 September 14, 2022 15:57
@poorbarcode
Contributor Author

rebase master

@poorbarcode poorbarcode requested review from codelipenghui and removed request for Technoboy- and gaoran10 September 14, 2022 15:59
@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

@codelipenghui codelipenghui merged commit 260f5c6 into apache:master Sep 21, 2022
@poorbarcode poorbarcode deleted the fix/topic_repeat_registry_split_1 branch September 21, 2022 14:36
Technoboy- pushed a commit that referenced this pull request Sep 26, 2022
@Technoboy- Technoboy- modified the milestones: 2.12.0, 2.11.0 Sep 26, 2022
Jason918 pushed a commit that referenced this pull request Sep 26, 2022
…mes topics in one broker (#17526)

(cherry picked from commit 260f5c6)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Sep 28, 2022
…mes topics in one broker (apache#17526)

(cherry picked from commit 260f5c6)
(cherry picked from commit ddd642e)
congbobo184 pushed a commit that referenced this pull request Nov 14, 2022
…mes topics in one broker (#17526)

(cherry picked from commit 260f5c6)
@congbobo184 congbobo184 added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Nov 14, 2022
congbobo184 pushed a commit that referenced this pull request Nov 26, 2022
…mes topics in one broker (#17526)

(cherry picked from commit 260f5c6)
lhotari added a commit to lhotari/pulsar that referenced this pull request Jun 7, 2023
… race conditions

- solution was introduced in apache#17526
- however, it was accidentally replaced with a call to the incorrect
  method signature in apache#17736
Contributor

@heesung-sn heesung-sn left a comment


Do we use the same thread to update states of the same pulsar resource?

I think we can make the same thread update the same resource to minimize update conflicts.

One idea is to use resource update task queues and have one thread consume and run the tasks for the same topic.

For example,
// topic op task publish
TopicRemoveTask task = new TopicRemoveTask();
int qid = hash(topic);
queues(qid).add(task); // we can have a set data structure to dedup the same topic operations

// topic op task consume
TopicTask task = queues(this.thread.id).poll();
run(task);
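A minimal runnable sketch of that idea (hypothetical class, not Pulsar code): hash the topic name to one of a fixed set of single-threaded executors, so that all operations on the same topic run in submission order and never race with each other:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerTopicSerializer {
    private final ExecutorService[] lanes;

    PerTopicSerializer(int parallelism) {
        lanes = new ExecutorService[parallelism];
        for (int i = 0; i < parallelism; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    void submit(String topic, Runnable task) {
        // Same topic -> same lane -> tasks execute one at a time, in order.
        int lane = Math.floorMod(topic.hashCode(), lanes.length);
        lanes[lane].execute(task);
    }

    void shutdown() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
    }

    public static void main(String[] args) {
        PerTopicSerializer serializer = new PerTopicSerializer(4);
        serializer.submit("persistent://tenant/ns/topic", () -> System.out.println("remove topic from cache"));
        serializer.submit("persistent://tenant/ns/topic", () -> System.out.println("re-create topic"));
        serializer.shutdown();
    }
}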
