-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[improve][broker] PIP-192 Made only the leader consume TopBundlesLoadDataStore #19730
Conversation
@heesung-sn Please help rebase PR to the master branch. |
7474440
to
ed949b3
Compare
Done. |
this.pulsar.getLoadManagerExecutor().execute(() -> { | ||
try { | ||
loadStoreInitWaiter.await(); | ||
} catch (InterruptedException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After being interrupted, do we still need to start the table view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimistically, yes.
If the load store has been initialized, then it will successfully start table view.
If not, in the worst case, the table view start will throw an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I think we can retry when interrupted.
@@ -61,30 +70,58 @@ public CompletableFuture<Void> removeAsync(String key) { | |||
|
|||
@Override | |||
public Optional<T> get(String key) { | |||
validateTableViewStart(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For TopKBundle tableView, it seems that if the check validateTableViewStart
is failed, the UnloadScheduler will also fail to find bundles that are needed to unload, can we try to re-initialize the tableView in the method validateTableViewStart
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When ExtensibleLoadManager.start() and playLeader() are complete, then topBundlesLoadStore.validateTableViewStart() should never fail from the leader broker(from UnloadScheduler).
Currently, follower brokers should never access topBundlesLoadDataStore.tableview, and if they do, we should throw an exception in this case instead of re-initializing the tableView(followers don't run UnloadScheduler).
I also moved the UnloadScheduler.start()/close() logic into the playLeader() and playFollower().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to this code block, even if it failed to start the topBundlesLoadDataStore
, the playLeader
still can complete, if the UnloadScheduler
tries to find bundles that need to unload, it will try to get topKBundlesData, at this moment it will encounter IllegalStateException("table view has not been started").
private void playLeader() {
this.pulsar.getLoadManagerExecutor().execute(() -> {
serviceUnitStateChannel.scheduleOwnershipMonitor();
waitForLoadStoreInit();
try {
topBundlesLoadDataStore.startTableView();
} catch (Throwable e) {
log.error("The new leader:{} failed to start topBundlesLoadDataStore tableview",
pulsar.getLookupServiceAddress(), e);
}
...
log.info("This broker:{} is the leader now.", pulsar.getLookupServiceAddress());
});
}
TransferShedder.java
public UnloadDecision findBundlesForUnloading(LoadManagerContext context,
Map<String, Long> recentlyUnloadedBundles,
Map<String, Long> recentlyUnloadedBrokers) {
try {
...
// If topBundlesLoadDataStore is not initialized, IllegalStateException will be thrown.
Optional<TopBundlesLoadData> bundlesLoadData = context.topBundleLoadDataStore().get(maxBroker);
...
} catch (Throwable e) {
log.error("Failed to process unloading. ", e);
decision.fail();
}
return decision;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExtensibleLoadManager.start() and playLeader() are complete
I meant this complete
as everything initialized correctly without an error.
Yes. Currently, if topBundlesLoadDataStore.startTableView();
fails, the leader broker still proceeds, and the TransferShedder will constantly make UnloadDecision(fail).
However, lazy-init(tableview.start ) in validateTableViewStart could also fail and lead to the same situation (TransferShedder will constantly make UnloadDecision(fail)). Besides, this lazy-init can lead to unknown behaviors(many consumers on this topBundlesLoad topic) when other brokers access the tableview.
The bigger question is how Pulsar handles system state change failures. In this case, the state change is from playLeader or playFollower.
I can think of the following options:
Option 1: retry-forever
Option 2: fail-fast(shutdown broker)
Option 3: ignore (leader or followers will run in the invalid state, and there will be many error logs and metrics. Eventually, the operator needs to capture this and fix it, probably by broker restart)
AFAIK, upon system state change failures, other systems immediately fail fast with meaningful logs and even with coredump. However, I don't think this fail-fast
is the practice in the Pulsar project.
I can update the code for option 1 (retry-forever).
Please let me know if you have more concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your explanation.
Obviously, if everything is initialized correctly without an error, it's OK, but you caught exceptions while starting topBundlesLoadDataStore before.
However, lazy-init(tableview.start ) in validateTableViewStart could also fail and lead to the same situation (TransferShedder will constantly make UnloadDecision(fail))
Yes, the lazy initialization in validateTableViewStart could also fail, but it will not fail forever. If we only caught the exception, it will fail forever.
I can update the code for option 1 (retry-forever).
I think it's ok, we can merge it first.
519626d
to
889f44e
Compare
@heesung-sn Please help rebase this PR to the master branch to resolve the conflicts. |
Lets merge the other PRs and rebase this PR again. |
2123c31
to
c142d2f
Compare
@Demogorgon314 rebase is done. |
@Demogorgon314 Please merge this PR if looking good. |
/pulsarbot rerun-failure-checks |
1 similar comment
/pulsarbot rerun-failure-checks |
Master Issue: #16691
Motivation
Raising a PR to implement #16691.
Based on the current design, only the leader needs to consume the topk bundle load data.
Modifications
This PR
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete
We will have separate PRs to update the Doc later.
Matching PR in forked repository
PR in forked repository: heesung-sn#37