
Avoid triggering startReplProducer on newAddProducer as it may flip replicator state wrongly #232

Merged
merged 3 commits into apache:master from repl_sync on Feb 24, 2017

Conversation

rdhabalia
Contributor

Motivation

Race condition: when replicationCluster in the namespace policy is changed multiple times in quick succession (a repl-cluster is added and then removed), the replication producer cannot be closed successfully.

  1. A new repl-cluster is added: a Replicator is added to replicators and it starts creating its Producer asynchronously.
  2. The repl-cluster is removed before the Producer is created: the cursor is closed and the replicator's state is flipped to Stopped because the Producer does not exist yet; once the Producer is eventually created it will be closed if the state is still Stopped.
  3. However, if any other producer is created in between, it tries to start the replicators again, which flips the replicator's state back to Starting. The Producer from step 1, once created, is therefore not closed and instead reads from the already-closed cursor, failing with the following exception (see the sketch after this list):
    Error reading entries at null. Retrying to read in 67.938s. (Cursor was already closed) (7)
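
A minimal, self-contained sketch of this interleaving; the class, method, and state names are hypothetical and this is not Pulsar's actual PersistentReplicator code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the interleaving described above (illustrative names only).
public class ReplicatorRaceSketch {
    enum State { Stopped, Starting, Started }

    private final AtomicReference<State> state = new AtomicReference<>(State.Stopped);

    // Step 1: repl-cluster added -> replicator begins creating its producer async.
    void startProducer() {
        if (state.compareAndSet(State.Stopped, State.Starting)) {
            System.out.println("producer creation scheduled, state=Starting");
        }
    }

    // Step 2: repl-cluster removed before the producer exists -> cursor closed and
    // state flipped to Stopped so the producer gets closed once it is created.
    void removeReplicator() {
        state.set(State.Stopped);
        System.out.println("cursor closed, state=Stopped");
    }

    // Step 3: an unrelated producer connects and triggers startProducer() again,
    // flipping the state back to Starting; when the producer from step 1 finally
    // materialises, it is not closed and reads from the already-closed cursor.
    public static void main(String[] args) {
        ReplicatorRaceSketch repl = new ReplicatorRaceSketch();
        repl.startProducer();      // step 1
        repl.removeReplicator();   // step 2
        repl.startProducer();      // step 3: state wrongly flipped to Starting
        System.out.println("final state: " + repl.state.get());
    }
}
```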

Modifications

The broker should not try to start the replicator again when a new producer is created, because that call flips the replicator's state.
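
A rough sketch of the resulting flow, with invented class and method names (only startReplProducers/addProducer come from the discussion above; the rest is hypothetical):

```java
// Rough sketch with invented names; only meant to show where the call is removed.
class PersistentTopicSketch {
    void addProducer(Object producer) {
        // register the producer on the topic ...
        // NOTE: no startReplProducers() call here any more, so a newly connecting
        // producer can no longer flip a replicator that is being removed back to
        // Starting.
    }

    void onReplicationPoliciesChanged() {
        startReplProducers();   // replication producers are (re)started from policy changes
    }

    void startReplProducers() {
        // create/resume a replication producer per configured remote cluster
    }
}
```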

Result

This avoids flipping the replicator's state wrongly, so the replicator's producer can be closed when the replicator is removed.

@rdhabalia rdhabalia added the type/bug label Feb 22, 2017
@rdhabalia rdhabalia added this to the 1.17 milestone Feb 22, 2017
@rdhabalia rdhabalia self-assigned this Feb 22, 2017
@merlimat
Contributor

@rdhabalia But we need to resume the replicator if it was stopped. The replicator will be paused when no one is connected and there are no local subscriptions.

Now, whenever a new producer comes in, we need to restart the replicator. How would that work with this patch?

@rdhabalia
Contributor Author

But we need to resume the replicator if it was stopped. The replicator will be paused when no one is connected and there are no local subscriptions.

Aren't we starting the replicator on topic loading? And when GC closes the replicator due to inactivity, it also deletes the topic, so the next time a producer/consumer connects it forces the topic to be loaded, which also starts the replicator. I think GC is the only use case where the broker pauses the replicator.
Also, if remove-replicator signals the producer to be closed after creation, then we should prevent addProducer from overwriting the state. Another option could be to check for TopicAlreadyClosed on read failure.

@merlimat
Contributor

Aren't we starting the replicator on topic loading? And when GC closes the replicator due to inactivity, it also deletes the topic, so the next time a producer/consumer connects it forces the topic to be loaded, which also starts the replicator. I think GC is the only use case where the broker pauses the replicator.

The replicator will be paused and the topic might not be deleted (e.g. an incoming replicator is still connected and keeps the topic alive).

@merlimat
Contributor

Change LGTM. Should we at least have a test that adds and removes a cluster from the list multiple times, to stress the race condition?

@rdhabalia
Contributor Author

Actually, this change will still hit the same issue. If someone (addProducer) calls startProducer before the producer is created, then, because the state is Stopping, it will schedule startProducer again; as soon as the producer is closed the state is flipped to Stopped, so startProducer will again try to create the producer and read entries using the closed cursor.

I think the fundamental issue is that removeReplicator() disconnects producers and removes the replicator from the cache (replicators) asynchronously, and this is not atomic. So, if someone (addProducer) tries to startReplProducer in between, it creates inconsistent behavior.

So startReplProducer should read the replication clusters directly from the policies rather than from the cache (replicators).
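
A small sketch of that idea, with invented names (in the broker this information would come from the namespace Policies object; the cluster names are placeholders):

```java
import java.util.Set;

// Sketch: choose which replication producers to start from the namespace
// policies, not from the in-memory 'replicators' cache that removeReplicator()
// drains asynchronously. All names are illustrative.
class StartReplProducersSketch {
    private final String localCluster = "us-west";

    void startReplProducers(Set<String> policyReplicationClusters) {
        for (String cluster : policyReplicationClusters) {
            if (!cluster.equals(localCluster)) {
                // only clusters still listed in the policy are (re)started, so a
                // replicator that is mid-removal is not accidentally resumed
                System.out.println("starting replication producer for " + cluster);
            }
        }
    }

    public static void main(String[] args) {
        new StartReplProducersSketch().startReplProducers(Set.of("us-west", "us-east"));
    }
}
```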

I will add a test case if the change looks fine.

@rdhabalia rdhabalia force-pushed the repl_sync branch 2 times, most recently from 44a1c09 to 69d84e1 on February 22, 2017 20:12
@rdhabalia
Contributor Author

Added a test case to verify this race condition.
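
Not the PR's actual test, but a rough outline of the stress shape suggested above (the admin/topic calls are indicated only in comments, since the exact test setup isn't shown here):

```java
import java.util.Arrays;
import java.util.List;

// Outline only: repeatedly toggle the namespace replication clusters while
// producers keep connecting, then assert the replicator shut down cleanly.
public class ReplicationClusterToggleOutline {
    public static void main(String[] args) {
        List<String> withRemote = Arrays.asList("us-west", "us-east");
        List<String> localOnly  = Arrays.asList("us-west");

        for (int i = 0; i < 100; i++) {
            // 1. set the namespace replication clusters to withRemote (via the admin API)
            // 2. immediately set them back to localOnly
            // 3. connect a fresh producer to the topic during the toggle
            System.out.println("iteration " + i + ": toggled " + withRemote + " -> " + localOnly);
        }
        // finally: verify the replicator's cursor/producer are closed and no
        // "Cursor was already closed" read retries are logged
    }
}
```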

@saandrews
Contributor

LGTM 👍 @merlimat Can you take a look?


@merlimat merlimat left a comment


👍

@merlimat merlimat merged commit 99cf1dc into apache:master Feb 24, 2017
rdhabalia added a commit that referenced this pull request Feb 24, 2017
…replicator state wrongly (#232)

* Avoid triggering startReplProducer on newAddProducer as it may flips replicator state wrongly

* Signal replicator is stopping if porducer is not created yet

* read repl-cluster from policies to avoid restart of closing-replicator
@rdhabalia rdhabalia deleted the repl_sync branch June 21, 2017 18:55
sijie added a commit to sijie/pulsar that referenced this pull request Mar 4, 2018
hangc0276 pushed a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
…ting a lot of requests. (apache#232)

KafkaHeaderAndRequest contains native memory references. When receiving a lot of requests from the Kafka client, for example 10 million per second, those redundant references prevent GC collection and memory usage keeps increasing.
It would be tricky to provide a unit test for this patch, and this PR seems to be a minor change, so I didn't write a test case for it.
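
As a general illustration of the pattern being described, a hedged sketch using Netty-style reference counting (this is not the KoP patch itself; apart from ReferenceCountUtil, the names are invented):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.ReferenceCountUtil;

// Sketch: release the native (direct) buffer as soon as the request is handled,
// so the wrapper object no longer pins memory that GC cannot reclaim.
public class ReleaseAfterHandleSketch {
    static void handleRequest(ByteBuf payload) {
        try {
            // ... decode and process the request ...
        } finally {
            ReferenceCountUtil.release(payload);   // drop the native memory reference promptly
        }
    }

    public static void main(String[] args) {
        handleRequest(Unpooled.directBuffer(64));
    }
}
```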