Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker] PIP-192 Made only the leader consume TopBundlesLoadDataStore #19730

Merged
merged 3 commits into from
Mar 16, 2023

Conversation

heesung-sn
Copy link
Contributor

@heesung-sn heesung-sn commented Mar 7, 2023

Master Issue: #16691

Motivation

Raising a PR to implement #16691.

Based on the current design, only the leader needs to consume the topk bundle load data.

Modifications

This PR

  • made only the leader consume TopBundlesLoadDataStore
  • moved LeaderElectionService to ExtensibleLoadManager from ServiceUnitStateChannel.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • *Updated unit tests.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

We will have separate PRs to update the Doc later.

Matching PR in forked repository

PR in forked repository: heesung-sn#37

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 7, 2023
@Demogorgon314
Copy link
Member

@heesung-sn Please help rebase PR to the master branch.

@heesung-sn
Copy link
Contributor Author

@heesung-sn Please help rebase PR to the master branch.

Done.

this.pulsar.getLoadManagerExecutor().execute(() -> {
try {
loadStoreInitWaiter.await();
} catch (InterruptedException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After being interrupted, do we still need to start the table view?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimistically, yes.

If the load store has been initialized, then it will successfully start table view.

If not, in the worst case, the table view start will throw an exception.

Copy link
Contributor Author

@heesung-sn heesung-sn Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I think we can retry when interrupted.

@Technoboy- Technoboy- added this to the 3.0.0 milestone Mar 9, 2023
@Technoboy- Technoboy- closed this Mar 9, 2023
@Technoboy- Technoboy- reopened this Mar 9, 2023
@@ -61,30 +70,58 @@ public CompletableFuture<Void> removeAsync(String key) {

@Override
public Optional<T> get(String key) {
validateTableViewStart();
Copy link
Contributor

@gaoran10 gaoran10 Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TopKBundle tableView, it seems that if the check validateTableViewStart is failed, the UnloadScheduler will also fail to find bundles that are needed to unload, can we try to re-initialize the tableView in the method validateTableViewStart?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When ExtensibleLoadManager.start() and playLeader() are complete, then topBundlesLoadStore.validateTableViewStart() should never fail from the leader broker(from UnloadScheduler).

Currently, follower brokers should never access topBundlesLoadDataStore.tableview, and if they do, we should throw an exception in this case instead of re-initializing the tableView(followers don't run UnloadScheduler).

I also moved the UnloadScheduler.start()/close() logic into the playLeader() and playFollower().

Copy link
Contributor

@gaoran10 gaoran10 Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this code block, even if it failed to start the topBundlesLoadDataStore, the playLeader still can complete, if the UnloadScheduler tries to find bundles that need to unload, it will try to get topKBundlesData, at this moment it will encounter IllegalStateException("table view has not been started").

    private void playLeader() {
        this.pulsar.getLoadManagerExecutor().execute(() -> {
            serviceUnitStateChannel.scheduleOwnershipMonitor();
            waitForLoadStoreInit();
            try {
                topBundlesLoadDataStore.startTableView();
            } catch (Throwable e) {
                log.error("The new leader:{} failed to start topBundlesLoadDataStore tableview",
                        pulsar.getLookupServiceAddress(), e);
            }
            ...
            log.info("This broker:{} is the leader now.", pulsar.getLookupServiceAddress());
        });
    }

TransferShedder.java

public UnloadDecision findBundlesForUnloading(LoadManagerContext context,
                                                  Map<String, Long> recentlyUnloadedBundles,
                                                  Map<String, Long> recentlyUnloadedBrokers) {
    try {
        ...
        // If topBundlesLoadDataStore is not initialized, IllegalStateException will be thrown.
        Optional<TopBundlesLoadData> bundlesLoadData = context.topBundleLoadDataStore().get(maxBroker);
        ...
    }  catch (Throwable e) {
        log.error("Failed to process unloading. ", e);
        decision.fail();
    }
    return decision;

}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExtensibleLoadManager.start() and playLeader() are complete

I meant this complete as everything initialized correctly without an error.

Yes. Currently, if topBundlesLoadDataStore.startTableView(); fails, the leader broker still proceeds, and the TransferShedder will constantly make UnloadDecision(fail).

However, lazy-init(tableview.start ) in validateTableViewStart could also fail and lead to the same situation (TransferShedder will constantly make UnloadDecision(fail)). Besides, this lazy-init can lead to unknown behaviors(many consumers on this topBundlesLoad topic) when other brokers access the tableview.

The bigger question is how Pulsar handles system state change failures. In this case, the state change is from playLeader or playFollower.

I can think of the following options:

Option 1: retry-forever
Option 2: fail-fast(shutdown broker)
Option 3: ignore (leader or followers will run in the invalid state, and there will be many error logs and metrics. Eventually, the operator needs to capture this and fix it, probably by broker restart)

AFAIK, upon system state change failures, other systems immediately fail fast with meaningful logs and even with coredump. However, I don't think this fail-fast is the practice in the Pulsar project.

I can update the code for option 1 (retry-forever).

Please let me know if you have more concerns.

Copy link
Contributor

@gaoran10 gaoran10 Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explanation.

Obviously, if everything is initialized correctly without an error, it's OK, but you caught exceptions while starting topBundlesLoadDataStore before.

However, lazy-init(tableview.start ) in validateTableViewStart could also fail and lead to the same situation (TransferShedder will constantly make UnloadDecision(fail))

Yes, the lazy initialization in validateTableViewStart could also fail, but it will not fail forever. If we only caught the exception, it will fail forever.

I can update the code for option 1 (retry-forever).

I think it's ok, we can merge it first.

@heesung-sn heesung-sn force-pushed the pip-192-topk-exclude branch from 519626d to 889f44e Compare March 9, 2023 20:32
@Demogorgon314
Copy link
Member

@heesung-sn Please help rebase this PR to the master branch to resolve the conflicts.

@heesung-sn
Copy link
Contributor Author

@heesung-sn Please help rebase this PR to the master branch to resolve the conflicts.

Lets merge the other PRs and rebase this PR again.

@heesung-sn heesung-sn force-pushed the pip-192-topk-exclude branch from 2123c31 to c142d2f Compare March 14, 2023 17:44
@heesung-sn
Copy link
Contributor Author

@heesung-sn Please help rebase this PR to the master branch to resolve the conflicts.

Lets merge the other PRs and rebase this PR again.

@Demogorgon314 rebase is done.

@heesung-sn
Copy link
Contributor Author

@Demogorgon314 Please merge this PR if looking good.

@Technoboy-
Copy link
Contributor

/pulsarbot rerun-failure-checks

1 similar comment
@heesung-sn
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

@Demogorgon314 Demogorgon314 merged commit 8cff123 into apache:master Mar 16, 2023
@heesung-sn heesung-sn deleted the pip-192-topk-exclude branch April 2, 2024 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants