Try to fix flaky LeaderElection test.#9443
Conversation
| while (loopCount < MAX_RETRIES) { | ||
| Thread.sleep(1000); |
There was a problem hiding this comment.
It would be better to replace this loop with Awaitility. An example of using Awaitility for this type of case is here: lhotari@dc16960#diff-ff0def4f209a344e2ebc9f9d3a6ab32bf67ed12b9c7c6a968a110a6e69c7ed07R192-R196
There was a problem hiding this comment.
Yeah, I'll do that.
|
@MarvinCai good work! I added a suggestion of using Awaitility for the waiting loop. I have some concerns that the production code has some issues. I wrote about my observations in #9408 (comment) . There was a NPE stack trace with maximum 1024 stack frames. The NPE is easy to fix), but the depth of the stack is the concern. I did a quick check in the code and it looks like backoff with some delay is missing in some cases and the leader election can get into a very tight loop. I was hoping that @merlimat could check the comments in #9408 (comment) before #9408 is closed. I guess the other option is that I file a separate bug report about the observation. |
| // Check if the all active pulsar see a new leader | ||
| for (PulsarService pulsar : activePulsars) { | ||
| Optional<LeaderBroker> leader = pulsar.getLeaderElectionService().readCurrentLeader().join(); | ||
| if (leader.isPresent() && leader.get().equals(oldLeader)) { |
There was a problem hiding this comment.
Good job ,I have a question:
Since there will be a single follower who cannot see the new leader until timeout, why traversing all the followers can solve this problem?
The current logic is: as long as there is no old leader in all followers, it will be successful, but there will still be cases where empty is returned.
There was a problem hiding this comment.
@315157973 I think the old logic is it pick the last follower it seen and check if it sees the new leader, which is in this chunk of code (ref1, ref2).
And then it just do the check which can't guarantee all follower already saw a new leader since for some follower which try to become leader, it'll first see empty path "/loadbalance/leader", read it in cache as [Optional.Empty] then try to create the znode, but after it fail(other node already create that znode and become leader) and before it's cache get updated by zk watch, there're might be some delay so old test can still see that [Optinal.Empty]. So loop through all followers and making sure all of them already a new leader, then check all of them see the same leader can solve the problem.
Does it make sense?
There was a problem hiding this comment.
I tried to extend the waiting time, and some followers still could not see the new leader. That is: when a leader switch occurs, some followers can never see the new leader, and all they read are empty
There was a problem hiding this comment.
Seems to be the reason. Therefore, the unit test should cover the scenario where the leader of the follower is always empty.
#9460
We can probably wait for couple days and if we don't get reply from him we can probably open another issue for that NPE. I did also see this exception couple of times, but not sure if it's causing any real problem. |
|
I'll be looking at the tight loop in the async calls there |
|
/pulsarbot run-failure-checks |
|
Please don't merge for now, found some more issue we need to fix. |
@MarvinCai You are probably already aware of the fix #9460 that is also needed. If that's the case, I think these 2 changes could be merged separately and having a new PR to improve further. WDYT? |
|
@lhotari Oh, I didn't see that change, but was trying to do exactly the same thing, manually invalidating the cache entry after an election, cause I found the test will still fail quite often due to the out dated cache entry. |
@MarvinCai I meant to say earlier that it's fine that you don't rebase, it's better to get your changes merged asap and follow up any remaining issues later. The reason I'm saying this is that the build queue usually gets up to several hours and with all the flakiness, it's again a lot of retrying until these changes get merged... |
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
|
@MarvinCai This PR is merged #9460, please go on |
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
* Try to fix flaky LeaderElection test. * Change to use Awaitility. * Fix condition to check for leader not empty && not equal to old leader.
Fixes #9408
Modifications
Seems the test is flaky because when it kills current leader and wait for new leader, it only check single pulsar to verify leader has changed, but cache for other pulsars may not refreshed by the ZK watch yet, hence we got the Optional.Empty which is leader path doesn't exist and they try to become leader.
So update to wait for leader changed for all pulsar, then verify they're all see same leader.