
Avoid triggering startReplProducer on newAddProducer as it may flip replicator state wrongly #232

Merged
merged 3 commits into apache:master from repl_sync on Feb 24, 2017

Conversation

rdhabalia
Contributor

Motivation

Race condition: when replicationCluster in the namespace policy is changed multiple times in quick succession (a repl-cluster is added and then removed), the replication producer cannot be closed successfully.

  1. A new repl-cluster is added: a Replicator is added to replicators and it starts creating its Producer asynchronously.
  2. The repl-cluster is removed before the Producer is created: the cursor is closed and the replicator's state is flipped to Stopped because the Producer does not exist yet; once the Producer is eventually created it will be closed if the state is still Stopped.
  3. However, if any other producer is created in between, it tries to start the replicators again, which flips the replicator's state back to Starting. The Producer from step 1, once created, is therefore not closed and instead reads from the already-closed cursor, failing with the following exception (see the sketch after this list):
    Error reading entries at null. Retrying to read in 67.938s. (Cursor was already closed) (7)
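
A minimal, self-contained sketch of this interleaving; the class, method, and state names are hypothetical and this is not Pulsar's actual PersistentReplicator code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the interleaving described above (illustrative names only).
public class ReplicatorRaceSketch {
    enum State { Stopped, Starting, Started }

    private final AtomicReference<State> state = new AtomicReference<>(State.Stopped);

    // Step 1: repl-cluster added -> replicator begins creating its producer async.
    void startProducer() {
        if (state.compareAndSet(State.Stopped, State.Starting)) {
            System.out.println("producer creation scheduled, state=Starting");
        }
    }

    // Step 2: repl-cluster removed before the producer exists -> cursor closed and
    // state flipped to Stopped so the producer gets closed once it is created.
    void removeReplicator() {
        state.set(State.Stopped);
        System.out.println("cursor closed, state=Stopped");
    }

    // Step 3: an unrelated producer connects and triggers startProducer() again,
    // flipping the state back to Starting; when the producer from step 1 finally
    // materialises, it is not closed and reads from the already-closed cursor.
    public static void main(String[] args) {
        ReplicatorRaceSketch repl = new ReplicatorRaceSketch();
        repl.startProducer();      // step 1
        repl.removeReplicator();   // step 2
        repl.startProducer();      // step 3: state wrongly flipped to Starting
        System.out.println("final state: " + repl.state.get());
    }
}
```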

Modifications

The broker should not try to start the replicator again when a new producer is created, because that call flips the replicator's state.
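
A rough sketch of the resulting flow, with invented class and method names (only startReplProducers/addProducer come from the discussion above; the rest is hypothetical):

```java
// Rough sketch with invented names; only meant to show where the call is removed.
class PersistentTopicSketch {
    void addProducer(Object producer) {
        // register the producer on the topic ...
        // NOTE: no startReplProducers() call here any more, so a newly connecting
        // producer can no longer flip a replicator that is being removed back to
        // Starting.
    }

    void onReplicationPoliciesChanged() {
        startReplProducers();   // replication producers are (re)started from policy changes
    }

    void startReplProducers() {
        // create/resume a replication producer per configured remote cluster
    }
}
```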

Result

This avoids flipping the replicator's state wrongly, so the replicator's producer can be closed when the replicator is removed.

@rdhabalia rdhabalia added the type/bug label Feb 22, 2017
@rdhabalia rdhabalia added this to the 1.17 milestone Feb 22, 2017
@rdhabalia rdhabalia self-assigned this Feb 22, 2017
@merlimat
Contributor

@rdhabalia But we need to resume the replicator if it was stopped. The replicator will be paused when no one is connected and there are no local subscriptions.

Now, whenever a new producer comes in, we need to restart the replicator. How would that work with this patch?

@rdhabalia
Contributor Author

But we need to resume the replicator if it was stopped. The replicator will be paused when no one is connected and there are no local subscriptions.

Aren't we starting the replicator on topic loading? And when GC closes the replicator due to inactivity, it also deletes the topic, so the next time a producer/consumer connects it forces the topic to be loaded, which also starts the replicator. I think GC is the only use case where the broker pauses the replicator.
Also, if remove-replicator signals the producer to be closed after creation, then we should prevent addProducer from overwriting the state. Another option could be to check for TopicAlreadyClosed on read failure.

@merlimat
Contributor

Aren't we starting the replicator on topic loading? And when GC closes the replicator due to inactivity, it also deletes the topic, so the next time a producer/consumer connects it forces the topic to be loaded, which also starts the replicator. I think GC is the only use case where the broker pauses the replicator.

The replicator will be paused and the topic might not be deleted (e.g. an incoming replicator is still connected and keeps the topic alive).

@merlimat
Contributor

Change LGTM. Should we at least have a test that adds and removes a cluster from the list multiple times, to stress the race condition?

@rdhabalia
Contributor Author

Actually, this change will still hit the same issue. If someone (addProducer) calls startProducer before the producer is created, then, because the state is Stopping, it will schedule startProducer again; as soon as the producer is closed the state is flipped to Stopped, so startProducer will again try to create the producer and read entries using the closed cursor.

I think the fundamental issue is that removeReplicator() disconnects producers and removes the replicator from the cache (replicators) asynchronously, and this is not atomic. So, if someone (addProducer) tries to startReplProducer in between, it creates inconsistent behavior.

So startReplProducer should read the replication clusters directly from the policies rather than from the cache (replicators).
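
A small sketch of that idea, with invented names (in the broker this information would come from the namespace Policies object; the cluster names are placeholders):

```java
import java.util.Set;

// Sketch: choose which replication producers to start from the namespace
// policies, not from the in-memory 'replicators' cache that removeReplicator()
// drains asynchronously. All names are illustrative.
class StartReplProducersSketch {
    private final String localCluster = "us-west";

    void startReplProducers(Set<String> policyReplicationClusters) {
        for (String cluster : policyReplicationClusters) {
            if (!cluster.equals(localCluster)) {
                // only clusters still listed in the policy are (re)started, so a
                // replicator that is mid-removal is not accidentally resumed
                System.out.println("starting replication producer for " + cluster);
            }
        }
    }

    public static void main(String[] args) {
        new StartReplProducersSketch().startReplProducers(Set.of("us-west", "us-east"));
    }
}
```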

I will add a test case if the change looks fine.

@rdhabalia rdhabalia force-pushed the repl_sync branch 2 times, most recently from 44a1c09 to 69d84e1 on February 22, 2017 20:12
@rdhabalia
Contributor Author

Added a test case to verify this race condition.
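
Not the PR's actual test, but a rough outline of the stress shape suggested above (the admin/topic calls are indicated only in comments, since the exact test setup isn't shown here):

```java
import java.util.Arrays;
import java.util.List;

// Outline only: repeatedly toggle the namespace replication clusters while
// producers keep connecting, then assert the replicator shut down cleanly.
public class ReplicationClusterToggleOutline {
    public static void main(String[] args) {
        List<String> withRemote = Arrays.asList("us-west", "us-east");
        List<String> localOnly  = Arrays.asList("us-west");

        for (int i = 0; i < 100; i++) {
            // 1. set the namespace replication clusters to withRemote (via the admin API)
            // 2. immediately set them back to localOnly
            // 3. connect a fresh producer to the topic during the toggle
            System.out.println("iteration " + i + ": toggled " + withRemote + " -> " + localOnly);
        }
        // finally: verify the replicator's cursor/producer are closed and no
        // "Cursor was already closed" read retries are logged
    }
}
```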

@saandrews
Contributor

LGTM 👍 @merlimat Can you take a look?


@merlimat merlimat left a comment


👍

@merlimat merlimat merged commit 99cf1dc into apache:master Feb 24, 2017
rdhabalia added a commit that referenced this pull request Feb 24, 2017
…replicator state wrongly (#232)

* Avoid triggering startReplProducer on newAddProducer as it may flips replicator state wrongly

* Signal replicator is stopping if porducer is not created yet

* read repl-cluster from policies to avoid restart of closing-replicator
@rdhabalia rdhabalia deleted the repl_sync branch June 21, 2017 18:55
sijie added a commit to sijie/pulsar that referenced this pull request Mar 4, 2018
hangc0276 pushed a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
…ting a lot of requests. (apache#232)

KafkaHeaderAndRequest contains native memory references. When receiving a lot of requests from the Kafka client, for example 10 million per second, those redundant references prevent GC collection and memory usage keeps increasing.
It would be tricky to provide a unit test for this patch, and this PR seems to be a minor change, so I didn't write a test case for it.
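
As a general illustration of the pattern being described, a hedged sketch using Netty-style reference counting (this is not the KoP patch itself; apart from ReferenceCountUtil, the names are invented):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.ReferenceCountUtil;

// Sketch: release the native (direct) buffer as soon as the request is handled,
// so the wrapper object no longer pins memory that GC cannot reclaim.
public class ReleaseAfterHandleSketch {
    static void handleRequest(ByteBuf payload) {
        try {
            // ... decode and process the request ...
        } finally {
            ReferenceCountUtil.release(payload);   // drop the native memory reference promptly
        }
    }

    public static void main(String[] args) {
        handleRequest(Unpooled.directBuffer(64));
    }
}
```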