Leader step down when removing itself #248

Open
jabolina wants to merge 1 commit into master

Conversation

jabolina
Member


The idea here is to check the election implementation after removing a member. If the current leader is removing itself, it steps down locally to stop accepting subsequent requests. Then, the old leader starts the voting thread to collect information from the other RAFT members. More details are listed in the issue tracker, but the idea here borrows a bit from the leader transfer in the Raft dissertation.

A leader needs to have an up-to-date log. After the membership change is committed, a majority of nodes have a recent log. Therefore, we can safely start an election round with the remaining members.

The change in the election is to send vote requests to the RAFT members instead of the view members. Running the test and Jepsen suites doesn't show any problems with this restriction. We could enhance the test coverage here; one interesting scenario would enqueue a few requests and then execute the leader removal. I plan to extend the test suite in future changes, so I'll make a note of that.
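
To make the flow concrete, here is a minimal, self-contained sketch of the idea. Class and method names (`StepDownOnSelfRemoval`, `onMembershipCommitted`, `startElection`) are illustrative, not the actual jgroups-raft API: when a committed membership change removes the current leader, it steps down locally and starts the voting thread restricted to the remaining Raft members.

```java
import java.util.List;

// Hypothetical sketch, not the jgroups-raft implementation.
public class StepDownOnSelfRemoval {

    private final String localAddress;
    private volatile boolean leader;
    private long currentTerm;

    StepDownOnSelfRemoval(String localAddress, boolean leader, long term) {
        this.localAddress = localAddress;
        this.leader = leader;
        this.currentTerm = term;
    }

    // Invoked after the membership command is committed and applied locally.
    void onMembershipCommitted(List<String> newRaftMembers) {
        if (leader && !newRaftMembers.contains(localAddress)) {
            leader = false;                    // stop accepting client requests
            startElection(newRaftMembers);     // voting thread over Raft members only
        }
    }

    private void startElection(List<String> raftMembers) {
        long electionTerm = ++currentTerm;
        // A majority of these members already has the committed membership entry,
        // so at least one of them has an up-to-date log and can win the election.
        for (String member : raftMembers)
            sendVoteRequest(member, electionTerm);
    }

    private void sendVoteRequest(String member, long term) {
        System.out.printf("vote-request -> %s (term=%d)%n", member, term);
    }
}
```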

Closes #246.

@jabolina jabolina added this to the 1.0.13 milestone Jan 28, 2024
@jabolina
Member Author

I think we might have a bug with only this change. After the leader steps down and starts the voting thread, the remote nodes might not have committed the membership change yet, as the message is still in flight.

I think the node should start the voting thread only after the commit index of the other nodes advances. Also, the dissertation mentions that reconfiguration is applied when received, even before it is committed. That might be something at play here.

I'll need to dig a little more.

@jabolina
Member Author

jabolina commented Feb 4, 2024

This is quite tricky. Imagine a scenario with nodes A and B, where A is the leader. If A removes itself and shuts down before B applies the command, the system becomes unavailable: node B still thinks the members are {A, B}, so it needs two votes for a majority but can only ever collect its own.

In one approach, we could delay the command completion until the leader identifies that a majority of followers have applied the command locally. Note that this only delays the CompletableFuture completion and the return to the client; the command is still applied in the same order. This would lead to a leader that knows it was removed but keeps replicating until a majority applies the command to their local state machines. Only after that does the node step down, trigger the election round, and then complete the CF. This would eliminate any possibility of unavailability.
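
A rough sketch of what this first approach could look like, assuming a hypothetical callback that fires when a follower reports having applied the removal command (none of these names come from the codebase):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch of delaying the client completion until a majority has applied.
public class DelayedRemovalCompletion {

    private final int majority;                                   // majority of the new membership size
    private final AtomicInteger applied = new AtomicInteger();
    private final CompletableFuture<Void> clientFuture = new CompletableFuture<>();

    DelayedRemovalCompletion(int newClusterSize) {
        this.majority = newClusterSize / 2 + 1;
    }

    // Called when a follower acknowledges it applied the removal command to its
    // state machine (not merely appended it to the log).
    void onFollowerApplied() {
        if (applied.incrementAndGet() >= majority && !clientFuture.isDone()) {
            stepDownAndStartElection();                           // only now the old leader leaves
            clientFuture.complete(null);                          // and the client call returns
        }
    }

    private void stepDownAndStartElection() {
        // step down and trigger the voting thread among the remaining members
    }

    CompletableFuture<Void> future() {
        return clientFuture;
    }
}
```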

Another approach follows the dissertation. The nodes adopt the new configuration as soon as they know about it. This would require still more changes. There is a bug in this approach [1], corrected by making a new leader commit an internal no-op entry before applying any membership change. But honestly, I can't see how that would help in a scenario like the example I gave.

The last approach I see takes advantage of JGroups. The operation follows the usual path. After the operation is committed, the leader steps down (this stops the multicast of AppendEntries to the followers) and starts the voting thread. Once a new leader is elected, it starts the multicast of AppendEntries again, which includes the still-not-applied-in-followers operation removing the previous leader. Since we utilize the multicast from JGroups, the previous leader will still receive the request. We can change the leader's handling of append responses so that non-members are not counted. This means the previous leader would still reply to requests, even though it is outside the Raft member list, until the membership operation is applied by the new leader. However, the system becomes unavailable if the old leader shuts down before the new leader applies the membership operation!
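
To illustrate the non-member filtering piece of this approach, a hedged sketch (field and method names are hypothetical): the leader only counts append responses from the current Raft members when advancing the commit index, so replies from an already-removed node are ignored.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch, not the jgroups-raft implementation.
public class MajorityFromRaftMembersOnly {

    private final Set<String> raftMembers = new HashSet<>();      // committed Raft membership
    private final Map<String, Long> matchIndex = new TreeMap<>(); // highest replicated index per node

    void onAppendResponse(String sender, long lastAppended) {
        if (!raftMembers.contains(sender))
            return;                                                // discard responses from non-members
        matchIndex.merge(sender, lastAppended, Math::max);
    }

    // Highest index replicated on a majority of the current Raft members.
    long computeCommitIndex() {
        if (raftMembers.isEmpty())
            return 0;
        long[] indexes = raftMembers.stream()
                .mapToLong(m -> matchIndex.getOrDefault(m, 0L))
                .sorted()
                .toArray();
        // With n sorted match indexes, indexes[(n - 1) / 2] is the highest
        // value replicated on at least a majority of the members.
        return indexes[(indexes.length - 1) / 2];
    }
}
```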

I would argue that reconfiguration commands are only complete after the followers apply the command to their state machines. I still need to think about it. Do you have any ideas? Hopefully, I am just overthinking. Reconfiguration is a PITA 🤕.

[1] https://groups.google.com/g/raft-dev/c/t4xj6dJTP6E/m/d2D9LrWRza8J

@jabolina
Member Author

jabolina commented Feb 5, 2024

Applied changes to implement the last option, taking advantage of the multicast. The leader now discards responses from non-members, so they don't count towards the majority. The contrived scenario above could happen with any operation, not only membership changes, and since the operation was committed, it doesn't mean data loss.

In addition to the changes, I added some warnings to the documentation. This should be ready now.

After executing a membership operation where the leader removes itself,
it will step down and trigger an election round with the remaining
members.

This takes some inspiration from the Raft dissertation mechanism to
transfer leadership. After the membership operation is committed, we know
there is a node with an up-to-date log, so we can safely start an
election. This also means that, after committing a removal, we don't
drop below a majority.

After the leader is removed, it executes the election round to elect a
new leader. This helps to reduce unavailability, since it is not
necessary to shut down the node to trigger the view change. We also
update the documentation regarding the membership operations.
@belaban
Member

belaban commented Feb 9, 2024

I'm pretty swamped with work and won't have time to look into this in the near future.
Ask yourself a few questions:

  • Is this a real bug that affects correctness? Or is it just a performance fix?
  • How often does this occur? Pretty regularly or only in edge cases?
  • You mentioned 'tricky' and 'complicated': is the fix itself going to complicate code, a.k.a. introduce new bugs / impact legibility and maintenance of the code?

If you introduce complexity for no obvious gain (correctness! we don't care that much about performance), then I'd not do it. Otherwise, go for it!

@jabolina
Member Author

Thanks, Bela! Your points gave me some ideas. Unfortunately, this is both a liveness and a safety issue. In the current main version, removing the leader doesn't trigger a new leader election, and the (removed) leader can keep replicating entries (note, a non-Raft member at this point!). However, this is a rare enough event: it happens only when an operator executes a reconfiguration to remove the current leader.

I'll try adding a few more tests and ensure the changes are as minimal as possible to reduce the surface for new bugs.

@jabolina jabolina modified the milestones: 1.0.13, 1.0.14 Aug 11, 2024
Successfully merging this pull request may close these issues.

Leader stepping down with membership change