Try to fix flaky LeaderElection test. by MarvinCai · Pull Request #9443 · apache/pulsar

MarvinCai · 2021-02-03T07:01:10Z

Modifications

The test seems to be flaky because, after it kills the current leader and waits for a new one, it checks only a single Pulsar instance to verify that the leader has changed. The caches of the other Pulsar instances may not have been refreshed by the ZK watch yet, so they still return Optional.empty (meaning the leader path doesn't exist) and try to become leader themselves.

So update the test to wait until the leader has changed for all Pulsar instances, then verify that they all see the same leader.
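
The "all see the same leader" part of the check can be sketched roughly as below. This is a simplified illustration, not the test code from this PR: the class `LeaderAgreement` and its method are hypothetical, and each broker's view of the leader is modeled as a plain `Optional<String>` rather than `Optional<LeaderBroker>`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper; leaders are modeled as plain strings for illustration. */
final class LeaderAgreement {

    /** Returns true only if every view holds a leader and all views name the same one. */
    static boolean allSeeSameLeader(List<Optional<String>> views) {
        Set<String> distinct = new HashSet<>();
        for (Optional<String> view : views) {
            // An empty view means this broker has no leader in its (possibly stale) cache.
            if (!view.isPresent()) {
                return false;
            }
            distinct.add(view.get());
        }
        return distinct.size() == 1;
    }

    public static void main(String[] args) {
        List<Optional<String>> stale = List.of(Optional.of("broker-2"), Optional.empty());
        List<Optional<String>> settled = List.of(Optional.of("broker-2"), Optional.of("broker-2"));
        System.out.println(allSeeSameLeader(stale));
        System.out.println(allSeeSameLeader(settled));
    }
}
```

An empty list is treated as "no agreement" here, which is a choice made for this sketch.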

lhotari · 2021-02-03T08:47:45Z

pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java

        while (loopCount < MAX_RETRIES) {
            Thread.sleep(1000);


It would be better to replace this loop with Awaitility. An example of using Awaitility for this type of case is here: lhotari@dc16960#diff-ff0def4f209a344e2ebc9f9d3a6ab32bf67ed12b9c7c6a968a110a6e69c7ed07R192-R196

Yeah, I'll do that.

@lhotari PTAL
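
For readers unfamiliar with the pattern behind the suggestion above: instead of sleeping a fixed time per retry and counting loops, the test polls a condition until it holds or a deadline passes. The sketch below uses only the JDK so that it stays self-contained. `Poll.waitUntil` is a hypothetical stand-in and is not Awaitility's API; in the actual test this role would be played by Awaitility.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a polling library; not Awaitility's API. */
final class Poll {

    /** Polls the condition until it holds (true) or the timeout elapses (false). */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // The condition becomes true roughly 200 ms after start.
        boolean ok = waitUntil(() -> System.currentTimeMillis() - start >= 200,
                Duration.ofSeconds(5), Duration.ofMillis(20));
        System.out.println(ok);
    }
}
```

The benefit over a fixed `Thread.sleep(1000)` loop is that the wait ends as soon as the condition holds, and the timeout is stated explicitly.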

lhotari · 2021-02-03T08:54:43Z

@MarvinCai good work! I added a suggestion of using Awaitility for the waiting loop.

I have some concerns that the production code has issues. I wrote about my observations in #9408 (comment). There was an NPE stack trace with the maximum of 1024 stack frames.

The NPE is easy to fix, but the depth of the stack is the concern. I did a quick check in the code and it looks like backoff with some delay is missing in some cases and the leader election can get into a very tight loop. I was hoping that @merlimat could check the comments in #9408 (comment) before #9408 is closed. I guess the other option is that I file a separate bug report about the observation.
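
To illustrate the kind of backoff meant here, the following is a generic sketch and not the Pulsar leader election code. The class `BackoffRetry`, its method names, and the delay values are all hypothetical. The idea is that each retry is handed to a scheduler with a growing delay instead of being invoked directly, so retries neither spin in a tight loop nor pile up stack frames.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch; not the Pulsar leader election code. */
final class BackoffRetry {

    /** Delay doubles with each attempt and is capped at maxMillis. */
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Retries on the scheduler until the attempt succeeds or attempts run out. */
    static void retry(Supplier<Boolean> attempt, int attemptNo, int maxAttempts,
                      ScheduledExecutorService scheduler, CompletableFuture<Integer> result) {
        if (attempt.get()) {
            result.complete(attemptNo);
        } else if (attemptNo + 1 >= maxAttempts) {
            result.completeExceptionally(new IllegalStateException("gave up"));
        } else {
            // schedule() returns immediately, so the call stack does not grow per retry.
            scheduler.schedule(
                    () -> retry(attempt, attemptNo + 1, maxAttempts, scheduler, result),
                    delayMillis(attemptNo, 10, 1000), TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CompletableFuture<Integer> result = new CompletableFuture<>();
        int[] calls = {0};
        retry(() -> ++calls[0] >= 4, 0, 10, scheduler, result);
        System.out.println("succeeded on attempt index " + result.get(5, TimeUnit.SECONDS));
        scheduler.shutdown();
    }
}
```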

315157973 · 2021-02-03T16:34:27Z

pulsar-broker/src/test/java/org/apache/pulsar/broker/loadbalance/LoadBalancerTest.java

+            // Check if the all active pulsar see a new leader
+            for (PulsarService pulsar : activePulsars) {
+                Optional<LeaderBroker> leader = pulsar.getLeaderElectionService().readCurrentLeader().join();
+                if (leader.isPresent() && leader.get().equals(oldLeader)) {


Good job, I have a question:
Since there can be a single follower that cannot see the new leader until the timeout, why does traversing all the followers solve this problem?
The current logic is: as long as none of the followers reports the old leader, the check succeeds, but there will still be cases where empty is returned.
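
The difference between the two conditions under discussion can be made concrete with a small sketch. The class and method names below are hypothetical and leaders are modeled as plain strings; the stricter predicate corresponds to the later commit "Fix condition to check for leader not empty && not equal to old leader."

```java
import java.util.Optional;

/** Hypothetical predicates; leaders are modeled as plain strings. */
final class LeaderChangeCondition {

    /** Weaker check: only rules out the old leader, so an empty read also passes. */
    static boolean notOldLeader(Optional<String> seen, String oldLeader) {
        return !(seen.isPresent() && seen.get().equals(oldLeader));
    }

    /** Stricter check: a leader must be present and must differ from the old one. */
    static boolean seesNewLeader(Optional<String> seen, String oldLeader) {
        return seen.isPresent() && !seen.get().equals(oldLeader);
    }

    public static void main(String[] args) {
        Optional<String> staleRead = Optional.empty();
        // The two predicates disagree exactly on the empty (stale cache) case.
        System.out.println(notOldLeader(staleRead, "broker-1"));
        System.out.println(seesNewLeader(staleRead, "broker-1"));
    }
}
```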

@315157973 I think the old logic picks the last follower it has seen and checks whether that follower sees the new leader, which is in this chunk of code (ref1, ref2).
It then runs the check right away, which can't guarantee that all followers have already seen a new leader. A follower that tries to become leader first sees the empty path "/loadbalance/leader" and reads it into its cache as [Optional.Empty], then tries to create the znode. After that fails (another node has already created the znode and become leader) and before its cache gets updated by the ZK watch, there might be some delay, so the old test can still see that [Optional.Empty]. So looping through all followers and making sure all of them already see a new leader, then checking that all of them see the same leader, can solve the problem.
Does it make sense?

I tried to extend the waiting time, and some followers still could not see the new leader. That is, when a leader switch occurs, some followers can never see the new leader, and all they read is empty.

That seems to be the reason. Therefore, the unit test should cover the scenario where the leader seen by a follower is always empty.
#9460

MarvinCai · 2021-02-03T18:02:02Z

> @MarvinCai good work! I added a suggestion of using Awaitility for the waiting loop.
>
> I have some concerns that the production code has issues. I wrote about my observations in #9408 (comment). There was an NPE stack trace with the maximum of 1024 stack frames.
>
> The NPE is easy to fix, but the depth of the stack is the concern. I did a quick check in the code and it looks like backoff with some delay is missing in some cases and the leader election can get into a very tight loop. I was hoping that @merlimat could check the comments in #9408 (comment) before #9408 is closed. I guess the other option is that I file a separate bug report about the observation.

We can probably wait a couple of days, and if we don't get a reply from him, we can open another issue for that NPE. I also saw this exception a couple of times, but I'm not sure whether it causes any real problem.

merlimat · 2021-02-03T18:09:38Z

I'll be looking at the tight loop in the async calls there

lhotari

LGTM. Good work @MarvinCai

lhotari · 2021-02-04T06:18:46Z

/pulsarbot run-failure-checks

MarvinCai · 2021-02-04T06:32:59Z

Please don't merge for now, I found some more issues we need to fix.

lhotari · 2021-02-04T06:46:06Z

> Please don't merge for now, I found some more issues we need to fix.

@MarvinCai You are probably already aware of the fix #9460 that is also needed. If that's the case, I think these 2 changes could be merged separately, with a new PR to improve things further. WDYT?

MarvinCai · 2021-02-04T06:57:50Z

@lhotari Oh, I didn't see that change, but I was trying to do exactly the same thing, manually invalidating the cache entry after an election, because I found the test would still fail quite often due to the outdated cache entry.
I'll rebase my PR if that one gets merged first.
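
The stale-entry problem and the effect of manual invalidation can be modeled with a toy read-through cache. Everything in this sketch is hypothetical: `CachedLeaderView` is not Pulsar's metadata cache, and a plain map stands in for ZooKeeper. Only the path "/loadbalance/leader" is taken from the discussion above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of a read-through cache in front of a shared store. */
final class CachedLeaderView {
    private final Map<String, String> store; // stands in for ZooKeeper
    private final Map<String, Optional<String>> cache = new ConcurrentHashMap<>();

    CachedLeaderView(Map<String, String> store) {
        this.store = store;
    }

    /** Reads through the cache; an absent path is cached as Optional.empty(). */
    Optional<String> read(String path) {
        return cache.computeIfAbsent(path, p -> Optional.ofNullable(store.get(p)));
    }

    /** Drops the cached entry so that the next read goes back to the store. */
    void invalidate(String path) {
        cache.remove(path);
    }

    public static void main(String[] args) {
        String path = "/loadbalance/leader";
        Map<String, String> store = new ConcurrentHashMap<>();
        CachedLeaderView follower = new CachedLeaderView(store);

        System.out.println(follower.read(path)); // caches the empty result
        store.put(path, "broker-2");             // another broker wins the election
        System.out.println(follower.read(path)); // still served from the stale cache
        follower.invalidate(path);
        System.out.println(follower.read(path)); // now reads the store again
    }
}
```

In the real system a ZK watch would eventually refresh the entry; the sketch only shows why an explicit invalidation after the election removes the window in which the stale empty value is visible.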

lhotari · 2021-02-04T07:06:03Z

> I'll rebase my PR if that one gets merged first.

@MarvinCai I meant to say earlier that it's fine that you don't rebase, it's better to get your changes merged asap and follow up any remaining issues later. The reason I'm saying this is that the build queue usually gets up to several hours and with all the flakiness, it's again a lot of retrying until these changes get merged...
You can immediately locally test the changes by cherry-picking #9460 changes to a temporary local branch and checking if the flakiness gets fixed together with the changes in this PR.

MarvinCai · 2021-02-04T08:52:47Z

/pulsarbot run-failure-checks

MarvinCai · 2021-02-04T18:51:16Z

/pulsarbot run-failure-checks

315157973 · 2021-02-05T05:45:17Z

@MarvinCai PR #9460 has been merged, please go ahead.

sijie · 2021-02-05T08:06:56Z

/pulsarbot run-failure-checks

MarvinCai · 2021-02-05T17:40:38Z

/pulsarbot run-failure-checks

MarvinCai · 2021-02-06T00:36:41Z

/pulsarbot run-failure-checks

* Try to fix flaky LeaderElection test.
* Change to use Awaitility.
* Fix condition to check for leader not empty && not equal to old leader.

Try to fix flaky LeaderElection test.

8d268e7

lhotari reviewed Feb 3, 2021

sijie assigned MarvinCai Feb 3, 2021

sijie added this to the 2.8.0 milestone Feb 3, 2021

sijie added the type/flaky-tests label Feb 3, 2021

315157973 reviewed Feb 3, 2021

Change to use Awaitility.

a49432c

MarvinCai requested a review from 315157973 February 3, 2021 18:36

lhotari approved these changes Feb 3, 2021

merlimat approved these changes Feb 3, 2021

Fix condition to check for leader not empty && not equal to old leader.

80a6679

lhotari mentioned this pull request Feb 4, 2021

[Build] Cancelling workflows with airflow-cancel-workflow-runs action within each workflow doesn't seem to work #9479

Closed

sijie mentioned this pull request Feb 4, 2021

ISSUE-9479: [Build] Cancelling workflows with airflow-cancel-workflow-runs action within each workflow doesn't seem to work streamnative/pulsar-archived#2126

Closed

sijie approved these changes Feb 5, 2021

Merge remote-tracking branch 'apache-pulsar/master' into eaderElectio…

23f133c

…n-test

315157973 approved these changes Feb 6, 2021

merlimat merged commit a958ee9 into apache:master Feb 6, 2021

merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 6, 2021

Try to fix flaky LeaderElection test. (apache#9443)

f410fee
