Skip to content

Try to fix flaky LeaderElection test.#9443

Merged
merlimat merged 4 commits intoapache:masterfrom
MarvinCai:zxc/eaderElection-test
Feb 6, 2021
Merged

Try to fix flaky LeaderElection test.#9443
merlimat merged 4 commits intoapache:masterfrom
MarvinCai:zxc/eaderElection-test

Conversation

@MarvinCai
Copy link
Contributor

Fixes #9408

Modifications

Seems the test is flaky because when it kills current leader and wait for new leader, it only check single pulsar to verify leader has changed, but cache for other pulsars may not refreshed by the ZK watch yet, hence we got the Optional.Empty which is leader path doesn't exist and they try to become leader.

So update to wait for leader changed for all pulsar, then verify they're all see same leader.

Comment on lines 184 to 185
while (loopCount < MAX_RETRIES) {
Thread.sleep(1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to replace this loop with Awaitility. An example of using Awaitility for this type of case is here: lhotari@dc16960#diff-ff0def4f209a344e2ebc9f9d3a6ab32bf67ed12b9c7c6a968a110a6e69c7ed07R192-R196

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhotari PTAL

@lhotari
Copy link
Member

lhotari commented Feb 3, 2021

@MarvinCai good work! I added a suggestion of using Awaitility for the waiting loop.

I have some concerns that the production code has some issues. I wrote about my observations in #9408 (comment) . There was a NPE stack trace with maximum 1024 stack frames.

The NPE is easy to fix), but the depth of the stack is the concern. I did a quick check in the code and it looks like backoff with some delay is missing in some cases and the leader election can get into a very tight loop. I was hoping that @merlimat could check the comments in #9408 (comment) before #9408 is closed. I guess the other option is that I file a separate bug report about the observation.

@sijie sijie added this to the 2.8.0 milestone Feb 3, 2021
// Check if the all active pulsar see a new leader
for (PulsarService pulsar : activePulsars) {
Optional<LeaderBroker> leader = pulsar.getLeaderElectionService().readCurrentLeader().join();
if (leader.isPresent() && leader.get().equals(oldLeader)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job ,I have a question:
Since there will be a single follower who cannot see the new leader until timeout, why traversing all the followers can solve this problem?
The current logic is: as long as there is no old leader in all followers, it will be successful, but there will still be cases where empty is returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@315157973 I think the old logic is it pick the last follower it seen and check if it sees the new leader, which is in this chunk of code (ref1, ref2).
And then it just do the check which can't guarantee all follower already saw a new leader since for some follower which try to become leader, it'll first see empty path "/loadbalance/leader", read it in cache as [Optional.Empty] then try to create the znode, but after it fail(other node already create that znode and become leader) and before it's cache get updated by zk watch, there're might be some delay so old test can still see that [Optinal.Empty]. So loop through all followers and making sure all of them already a new leader, then check all of them see the same leader can solve the problem.
Does it make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to extend the waiting time, and some followers still could not see the new leader. That is: when a leader switch occurs, some followers can never see the new leader, and all they read are empty

Copy link
Contributor

@315157973 315157973 Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be the reason. Therefore, the unit test should cover the scenario where the leader of the follower is always empty.
#9460

@MarvinCai
Copy link
Contributor Author

@MarvinCai good work! I added a suggestion of using Awaitility for the waiting loop.

I have some concerns that the production code has some issues. I wrote about my observations in #9408 (comment) . There was a NPE stack trace with maximum 1024 stack frames.

The NPE is easy to fix), but the depth of the stack is the concern. I did a quick check in the code and it looks like backoff with some delay is missing in some cases and the leader election can get into a very tight loop. I was hoping that @merlimat could check the comments in #9408 (comment) before #9408 is closed. I guess the other option is that I file a separate bug report about the observation.

We can probably wait for couple days and if we don't get reply from him we can probably open another issue for that NPE. I did also see this exception couple of times, but not sure if it's causing any real problem.

@merlimat
Copy link
Contributor

merlimat commented Feb 3, 2021

I'll be looking at the tight loop in the async calls there

@MarvinCai MarvinCai requested a review from 315157973 February 3, 2021 18:36
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good work @MarvinCai

@lhotari
Copy link
Member

lhotari commented Feb 4, 2021

/pulsarbot run-failure-checks

@MarvinCai
Copy link
Contributor Author

MarvinCai commented Feb 4, 2021

Please don't merge for now, found some more issue we need to fix.

@lhotari
Copy link
Member

lhotari commented Feb 4, 2021

Please don't merge for now, found some more issue we need to fix.

@MarvinCai You are probably already aware of the fix #9460 that is also needed. If that's the case, I think these 2 changes could be merged separately and having a new PR to improve further. WDYT?

@MarvinCai
Copy link
Contributor Author

@lhotari Oh, I didn't see that change, but was trying to do exactly the same thing, manually invalidating the cache entry after an election, cause I found the test will still fail quite often due to the out dated cache entry.
I'll rebase my PR if that one get merged first.

@lhotari
Copy link
Member

lhotari commented Feb 4, 2021

I'll rebase my PR if that one get merged first.

@MarvinCai I meant to say earlier that it's fine that you don't rebase, it's better to get your changes merged asap and follow up any remaining issues later. The reason I'm saying this is that the build queue usually gets up to several hours and with all the flakiness, it's again a lot of retrying until these changes get merged...
You can immediately locally test the changes by cherry-picking #9460 changes to a temporary local branch and checking if the flakiness gets fixed together with the changes in this PR.

@MarvinCai
Copy link
Contributor Author

/pulsarbot run-failure-checks

@MarvinCai
Copy link
Contributor Author

/pulsarbot run-failure-checks

@315157973
Copy link
Contributor

@MarvinCai This PR is merged #9460, please go on

@sijie
Copy link
Member

sijie commented Feb 5, 2021

/pulsarbot run-failure-checks

@MarvinCai
Copy link
Contributor Author

/pulsarbot run-failure-checks

@MarvinCai
Copy link
Contributor Author

/pulsarbot run-failure-checks

@merlimat merlimat merged commit a958ee9 into apache:master Feb 6, 2021
merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 6, 2021
* Try to fix flaky LeaderElection test.

* Change to use Awaitility.

* Fix condition to check for leader not empty && not equal to old leader.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky-test: org.apache.pulsar.broker.loadbalance.LoadBalancerTest.testLeaderElection

5 participants