Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [broker] Part-1: Replicator can not created successfully due to an orphan replicator in the previous topic owner #21946

Merged
merged 22 commits into from
Apr 23, 2024

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Jan 22, 2024

Motivation

There is a race condition that makes an orphan replicator in the original owner of a topic, and causes the new owner of the topic can not start a replicator due to org.apache.pulsar.broker.service.BrokerServiceException$NamingException Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic.

Scenario 1

  • Thread-1: start/restart the producer of the replicator.
  • Thread-2: unloading bundles.

Scenario 2

  • Thread-1: start a new replicator after updated replication_clusters.
  • Thread-2: unloading bundles.

Current PR is focusing on Scenario 1.

Steps of Scenario 1

time thread start replicator thread unload bundle
1 Initialize cursor pulsar.repl
2 Start producer
3 Start producer failed, add a scheduled task to retry
4 Mark topic as closing
4 Close clients: replicator.disconnect
5 Skip to close the producer because the producer is null, and set replicator.stat --> Stopped
6 Retry to start the producer
7 Set replicator.stat --> Starting
8 Create producer success and set replicator.stat --> Started
9 Trigger a readMoreEntries, since there is no entries to read, just pending this request
10 Close cursor pulsar.repl
11 Close managed ledger
12 An orphan replicator is there, and the next topic owner could not start a replicator due to Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic

Modifications

  • Split the state of Replicator.State.Stopped into Producer_Stopped and Closed.
    • The producer can be restart again after the producer closed due to read entries error or ack messages error.
    • The Replicator can not be started again after it was closed due to the topic being closed or having disabled replication.
  • Add a new method terminate to close the Replicator.
    • The old method disconnect only used to close the internal producer.

A case that hit this issue

2024-01-19T22:18:24,498+0000 [pulsar-io-4-23] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://xxx_stress.Default-226] [pulsar.repl.c1-->c2] Failed to create producer: {"errorMsg":"org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'pulsar.repl.c1-->c2' is already connected to topic","reqId":885581189144276683, "remote":"xxx/xxx:6651", "local":"/xxx:59416"}	
...

Picture-1: An orphan producer was left in old broker, it is not associated with any topic/replicator
Screenshot 2024-01-24 at 15 16 44

Picture-2: After the topic is transferred to new broker, it can not start a new Replicator successfully
Screenshot 2024-01-24 at 15 14 42

Since the scenario is too complex, I can not add a test. But I reproduced the Scenario 1 locally.
Screenshot 2024-01-23 at 00 36 48
Screenshot 2024-01-23 at 00 35 33

#21948 fixes the following issues:

  • How to perfectly start a new Replicator after the replication has been enabled again.
  • How to perfectly start a new Replicator when calling topic.unfenceTopicToResume after topic.close failed.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Jan 22, 2024
@poorbarcode poorbarcode added this to the 3.3.0 milestone Jan 22, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 22, 2024
@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug release/2.10.6 category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost release/3.0.3 release/2.11.4 release/3.1.3 and removed doc-not-needed Your PR changes do not impact docs labels Jan 22, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 22, 2024
@poorbarcode poorbarcode changed the title [draft] [fix] [broker] Fix orphan replicator after unload topic [draft] [fix] [broker] Replicator can not created successfully due to an orphan replicator in the previous topic owner Jan 22, 2024
@poorbarcode poorbarcode changed the title [draft] [fix] [broker] Replicator can not created successfully due to an orphan replicator in the previous topic owner [draft] [fix] [broker] Part-1: Replicator can not created successfully due to an orphan replicator in the previous topic owner Jan 22, 2024
@poorbarcode poorbarcode changed the title [draft] [fix] [broker] Part-1: Replicator can not created successfully due to an orphan replicator in the previous topic owner [draft] [fix] [broker] Part-3: Replicator can not created successfully due to an orphan replicator in the previous topic owner Jan 22, 2024
@poorbarcode poorbarcode changed the title [draft] [fix] [broker] Part-3: Replicator can not created successfully due to an orphan replicator in the previous topic owner [draft] [fix] [broker] Part-1: Replicator can not created successfully due to an orphan replicator in the previous topic owner Jan 22, 2024
@poorbarcode poorbarcode changed the title [draft] [fix] [broker] Part-1: Replicator can not created successfully due to an orphan replicator in the previous topic owner [fix] [broker] Part-1: Replicator can not created successfully due to an orphan replicator in the previous topic owner Jan 22, 2024
@Jason918
Copy link
Contributor

@poorbarcode Does this PR fix the issue mentioned in #21203 ?

@poorbarcode
Copy link
Contributor Author

@Jason918

@poorbarcode Does this PR fix the issue mentioned in #21203 ?

Yes, the current PR also fixed the issue that #21203 tries to fix.

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a test to cover this case?

And it looks like we can simplify the fix by adding a new method terminate() to the replicator so that we don't need to mix the closeProducer and closeReplicator logic.

@poorbarcode poorbarcode force-pushed the fix/orphan_replicator_2 branch from 05de423 to 257f163 Compare April 15, 2024 16:41
@poorbarcode
Copy link
Contributor Author

Rebase master

@poorbarcode poorbarcode force-pushed the fix/orphan_replicator_2 branch from 257f163 to 3bb81fa Compare April 19, 2024 12:17
@poorbarcode poorbarcode force-pushed the fix/orphan_replicator_2 branch from a42bd91 to 5793ca1 Compare April 22, 2024 16:50
@poorbarcode poorbarcode merged commit 4924052 into apache:master Apr 23, 2024
48 of 50 checks passed
Technoboy- pushed a commit to Technoboy-/pulsar that referenced this pull request Apr 24, 2024
poorbarcode added a commit that referenced this pull request May 7, 2024
… an orphan replicator in the previous topic owner (#21946)

(cherry picked from commit 4924052)
poorbarcode added a commit that referenced this pull request May 7, 2024
… an orphan replicator in the previous topic owner (#21946)

(cherry picked from commit 4924052)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request May 13, 2024
… an orphan replicator in the previous topic owner (apache#21946)

(cherry picked from commit 4924052)
(cherry picked from commit 670aff0)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request May 16, 2024
… an orphan replicator in the previous topic owner (apache#21946)

(cherry picked from commit 4924052)
(cherry picked from commit 670aff0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost cherry-picked/branch-3.0 cherry-picked/branch-3.2 doc-not-needed Your PR changes do not impact docs ready-to-test release/2.11.4 release/3.0.3 release/3.2.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants