Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transfer leadership when establishLeadership fails #5247

Merged
merged 16 commits into from
Jun 19, 2019

Conversation

hanshasselberg
Copy link
Member

@hanshasselberg hanshasselberg commented Jan 22, 2019

Fixes #5047. I am throwing this out here to have a place to discuss the approaches.

TLDR; transition leadership is now implemented in raft which is being used in this PR.

Background:

Leadership is established in raft and there is basically no way to influence that from consul. And there has been no need to. But when we setup leadership in consul in establishLeadership things can go wrong and in the past we retried to fix it. This is problematic since even though raft considers this server the leader, it is not fully prepared! Consistent reads for example are not possible.

Abandoned Ideas:

  • calling revokeLeadership: that tears down some consul things, but has no influence on raft leadership
  • quitting leaderLoop: that resets the consul side of things, but doesn't change the raft leadership

Solutions:

  • this PR: calling DemoteVoter as the leader, will make this server step down by marking the current leader as nonvoter. The same server is promoted to a voter afterwards again by autopilot from the new leader:
    func (a *Autopilot) handlePromotions(promotions []raft.Server) error {
    // This used to wait to only promote to maintain an odd quorum of
    // servers, but this was at odds with the dead server cleanup when doing
    // rolling updates (add one new server, wait, and then kill an old
    // server). The dead server cleanup would still count the old server as
    // a peer, which is conservative and the right thing to do, and this
    // would wait to promote, so you could get into a stalemate. It is safer
    // to promote early than remove early, so by promoting as soon as
    // possible we have chosen that as the solution here.
    for _, server := range promotions {
    a.logger.Printf("[INFO] autopilot: Promoting %s to voter", fmtServer(server))
    addFuture := a.delegate.Raft().AddVoter(server.ID, server.Address, 0, 0)
    if err := addFuture.Error(); err != nil {
    return fmt.Errorf("failed to add raft peer: %v", err)
    }
    }
    // If we promoted a server, trigger a check to remove dead servers.
    if len(promotions) > 0 {
    select {
    case a.removeDeadCh <- struct{}{}:
    default:
    }
    }
    return nil
    }
    The idea to that approach came from Suggestion: control over leader identity raft#218.
  • alternatively we could implement leadership transfer in raft as specified in https://ramcloud.stanford.edu/~ongaro/thesis.pdf chapter 3.10.

Concerns with demotevoter:

  • in an environment where not all servers are running autopilot, it is possible that the leader doesn't run it and doesn't promote the demoted server to be a voter again. Lets assume we would be left with 2 voters and 1 nonvoter. That would make electing a leader harder, but raft would figure it out eventually. During an upgrade, the same could happen one last time: 1 voter and 2 nonvoters left which is the worst case. But when this happens, electing a leader is easy again. As soon as the leader runs autopilot, all nonvoters will be promoted to a voter again. establishLeadership would have to fail often during that process to make this scenario possible. Without this fix though, we would be stuck with a leader that is not performing all leader tasks, without a chance of recovery.
  • raft state is modified for this and we rely on consul for promoting the previous leader to be a voter again - feels fishy.

Concerns with leadership transfer:

none

This PR uses the leadership transfer feature. It requires a revendoring though and fails because of it.

@hanshasselberg hanshasselberg added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Jan 22, 2019
@hashicorp-cla
Copy link

hashicorp-cla commented Jan 22, 2019

CLA assistant check
All committers have signed the CLA.

@hanshasselberg hanshasselberg changed the title Demote itself when establishLeadership fails Transfer leadership when establishLeadership fails Feb 14, 2019
hanshasselberg added a commit to hashicorp/raft that referenced this pull request May 17, 2019
This PR is implementing the leadership transfer extension described in the thesis chap 3.10.

Background:

Consul is performing some setup after acquiring leadership. It is possible that the setup fails, but there is no good way to step down as a leader. It is possible to use DemoteVoter as show in hashicorp/consul#5247, but this is suboptimal because it relies on Consul's autopilot to promote the old leader to a voter again.
Since there is a perfectly fine way described in the thesis: leadership transfer extension, we decided to implement that instead. Doing it this way also helps other teams, since it is more generic.

The necessary steps to perform are:

1. Leader picks target to transition to
2. Leader stops accepting client requests
3. Leader makes sure to replicate logs to the target
4. Leader sends TimeoutNow RPC request
5. Target receives TimeoutNow request, which triggers an election
6a. If the election is successful, a message with the new term will make the old leader step down
6b. if after electiontimeout the leadership transfer did not complete, the old leader resumes operation

Resources:

https://github.com/etcd-io/etcd/tree/master/raft
@hanshasselberg hanshasselberg force-pushed the establish_leader branch 2 times, most recently from 833070e to b8e9fa5 Compare May 23, 2019 15:41
Level: hclog.LevelFromString(s.config.LogLevel),
Output: s.config.LogOutput,
})
s.config.RaftConfig.Logger = raftLogger
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raft logging changed and we need to provide the hclogger now.

@@ -178,7 +195,7 @@ RECONCILE:
if err := s.revokeLeadership(); err != nil {
s.logger.Printf("[ERR] consul: failed to revoke leadership: %v", err)
}
goto WAIT
return err
Copy link
Member Author

@hanshasselberg hanshasselberg Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first usecase: we couldn't establish leadership and instead of retrying it after interval in WAIT, an error is returned which leads to a raft leadership transfer.

err := reassert()
errCh <- err
if err != nil {
return err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the second usecase: we couldn't reassert and instead of retrying it after interval, an error is returned which leads to a raft leadership transfer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, when does this happen? I don't recall the specifics and wonder if by failing and stepping down we might end up causing new leadership stability issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// reassertLeaderCh is used to signal the leader loop should re-run
// leadership actions after a snapshot restore.

This happens after snapshot restore. Since the agent revokes leadership and immediately tries to establish it again, there is the possibility that it fails. When it does we are in the same situation as above - raft thinks this agent is a leader, but consul disagrees.

@hanshasselberg hanshasselberg requested a review from a team June 7, 2019 09:22
@hanshasselberg hanshasselberg removed the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Jun 7, 2019
Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just have a couple questions inline 🔍

agent/consul/leader.go Show resolved Hide resolved
agent/consul/leader.go Show resolved Hide resolved
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about these Hans but just wondering if this solves the problem fully vs just making it more likely to get out of the bad state.

Please let me know if I'm missing something though - it's all pretty subtle!

err := reassert()
errCh <- err
if err != nil {
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, when does this happen? I don't recall the specifics and wonder if by failing and stepping down we might end up causing new leadership stability issues?

s.leaderLoop(ch)
err := s.leaderLoop(ch)
if err != nil {
s.leadershipTransfer()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if leadership transfer fails? We exit the leader loop anyway and no longer act as the leader, but does raft still think we are the leader? That seems like a bad case to be in - roughly the same as the bug this is meant to be fixing although we at least made an attempt at telling raft we didn't want to be leader any more.

I can think of a couple of possible options here:

  • Make a method in Raft lib that allows the leader to "StepDown" - it will stop running heartbeats etc. but not stop being a voter - basically, force a leader into follower state.
  • Keep trying the leadership Transfer indefinitely because the cluster is broken until it works anyway.
  • Keep retrying for a limited length of time and then crash the whole process to force a step down.

Copy link
Member Author

@hanshasselberg hanshasselberg Jun 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a couple of changes to my PR, so that leaderLoop is only left if leadershipTransfer was successful. I think that mitigates the issues you are talking about.

What happens now is that in case leadershipTransfer fails, the agent stays in leaderLoop and waits until ReconcileInterval to retry to establishLeadership (and to transferLeadership in case it fails).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK although ReconcileInterval is 60 seconds which seems a lot to wait around for when we know the cluster is down. 🤔

Should we instead consider looping indefinitely retrying every 5 seconds or something?

I think this is way better than before and eventually it should recover so it's good just wondering if it's easy to make that better quicker?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reset the interval to 5 seconds when transfer leadership fails so that it retries establishLeadership again faster than before.

@banks
Copy link
Member

banks commented Jun 12, 2019

@i0rek could you also double check how the hclog integration pans out. It came up today that Vault and Nomad have different date formats in logs which tripped someone up. Can you make sure we don't make it even worse by having raft log lines log with a different date format to the rest of Consul's logs please? I'm not sure if that is the case or not as I didn't look hard at where the plumbing/config sits for that but would just like to sanity check it.

// leader, but consul disagrees.
if err != nil {
if err := s.leadershipTransfer(); err != nil {
goto WAIT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now we attempt to transfer 3 times but if it fails we still hang out in non-leader limbo land for a bit before retrying?

I guess this is what i mentioned as "retry indefinitiely" and it should really work immediately if rest of the cluster is in an OK state so I think this is good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that since leadershipTransfer failed, we try to establishLeadership again. I think in general establishLeadership is more likely to succeed than transferLeadership. I think if I make the interval smaller - like 5 seconds - before it retries it is the better solution than trying to transfer leadership indefinitely.

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I'll leave it at request changes to make sure we don't forget to check out the log formatting for raft before merge.

@hanshasselberg
Copy link
Member Author

@banks I checked the logs and fixed the timestamps, but sometimes there are two whitespaces before raft:

master:

2019/06/13 22:11:45 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2019/06/13 22:11:45 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2019/06/13 22:11:45 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2019/06/13 22:11:45 [INFO] agent: started state syncer
2019/06/13 22:11:45 [INFO] agent: Started gRPC server on 127.0.0.1:8502 (tcp)
2019/06/13 22:11:45 [WARN] raft: Heartbeat timeout from "" reached, starting election
2019/06/13 22:11:45 [INFO] raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
2019/06/13 22:11:45 [DEBUG] raft: Votes needed: 1
2019/06/13 22:11:45 [DEBUG] raft: Vote granted from 2ef30f59-536b-3efa-80d8-5bd3afb67585 in term 2. Tally: 1
2019/06/13 22:11:45 [INFO] raft: Election won. Tally: 1

pr:

2019/06/13 22:10:58 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2019/06/13 22:10:58 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2019/06/13 22:10:58 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2019/06/13 22:10:58 [INFO] agent: started state syncer
2019/06/13 22:10:58 [INFO] agent: Started gRPC server on 127.0.0.1:8502 (tcp)
2019/06/13 22:10:58 [WARN]  raft: Heartbeat timeout from "" reached, starting election
2019/06/13 22:10:58 [INFO]  raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
2019/06/13 22:10:58 [DEBUG] raft: Votes needed: 1
2019/06/13 22:10:58 [DEBUG] raft: Vote granted from bd52484a-2349-6478-2e1f-4eb4c64c924c in term 2. Tally: 1
2019/06/13 22:10:58 [INFO]  raft: Election won. Tally: 1

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

// establishedLeader needs to be set to
// false.
establishedLeader = false
interval = time.After(5 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK because we'll wait for 5 seconds in next wait loop and then timeout and goto RECONCILE. Right after that label we will hit interval := time.After(s.config.ReconcileInterval) again which presumably will set the interval back again.

I must admit I'm a little confused about how := works after a GOTO - it's reassigning the same variable name but in the same scope on subsequent jumps? I wonder if this is some strange variant of variable shadowing even though they are in the same scope? Maybe Go just has a special case to allow this when using GOTO but not in serial code? If it works I guess it's fine 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean. WAIT is well after the RECONCILE and the interval variable declaration and the code should just be using the same variable.

I reproed your question in a playground: https://play.golang.org/p/6AqssHXg3Wt. If you jump before a variable declaration golang will create a new variable. Or did you mean something else?

Copy link
Member

@banks banks Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I did a similar repro and it's fine, just was a strange one.

The path here is:

  • we set interval to be a timer chan that will go off in 5 seconds
  • we goto WAIT which enters a select on a few things including that chan
  • when the chan fires, that select branch does goto RECONCILE which immeidately re-assigns a timer chan for the original ReconcileInterval (using :=).

My original concern was that we might end up regaining leadership and then doing reconcile every 5 seconds after that but it's not the case due to the path mentioned above.

It also occurs to me that we always have had a re-assignment after goto RECONCILE so it's not really any different than before, it's just that was the only assignment before and I wondered if some strange form of shadowing might cause issues. That appears not to be the case so I think this is fine!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fun - Go does create a new variable with the same name.

You can see that here: https://play.golang.org/p/FU0ZxictDXE capturing the variable in a lambda and then looping with GOTO leaves the lamda holding the original value not the redefined one after the GOTO jump.

It's just weird to me because it's in the same scope - shadowing across scopes seems fine but this seems to be a special case you can't normally do outside of GOTO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, now I see what you mean.

@silenceper
Copy link

silenceper commented Jun 22, 2019

@mkeeler
This pr add vendoring of github.com/hashicorp/consul
conflict with #5943

@mkeeler
Copy link
Member

mkeeler commented Jun 22, 2019

@silenceper Looks like you are correct. One of the other engineers has been working on a circle ci check to ensure we aren't accidentally doing this (which appears to be a common mistake for us).

FrontMageew2 pushed a commit to FrontMageew2/raft that referenced this pull request Jan 31, 2024
This PR is implementing the leadership transfer extension described in the thesis chap 3.10.

Background:

Consul is performing some setup after acquiring leadership. It is possible that the setup fails, but there is no good way to step down as a leader. It is possible to use DemoteVoter as show in hashicorp/consul#5247, but this is suboptimal because it relies on Consul's autopilot to promote the old leader to a voter again.
Since there is a perfectly fine way described in the thesis: leadership transfer extension, we decided to implement that instead. Doing it this way also helps other teams, since it is more generic.

The necessary steps to perform are:

1. Leader picks target to transition to
2. Leader stops accepting client requests
3. Leader makes sure to replicate logs to the target
4. Leader sends TimeoutNow RPC request
5. Target receives TimeoutNow request, which triggers an election
6a. If the election is successful, a message with the new term will make the old leader step down
6b. if after electiontimeout the leadership transfer did not complete, the old leader resumes operation

Resources:

https://github.com/etcd-io/etcd/tree/master/raft
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Leader doesn't step down when establishLeadership fails
6 participants