
consul: Cluster doesn't recover after deleting all pods #1143

Closed
munnerz opened this issue May 23, 2017 · 12 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@munnerz
Collaborator

munnerz commented May 23, 2017

I'm trying to use the stable/consul Helm chart to deploy a consul cluster.

With the default configuration (minus setting an appropriate storage class), on GKE v1.6.2, the cluster comes up fine.
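
For reference, an install along those lines looks roughly like this sketch (the storage-class value key is an assumption and may differ between chart versions, so check the chart's values.yaml):

    # Helm 2 style install of the stable/consul chart, overriding only the
    # storage class (value key is an assumption -- check the chart's values.yaml).
    helm install stable/consul --name con --set StorageClass=standard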

It seems however, if I delete any pods then they are unable to rejoin the cluster and leader election starts to fail:

Waiting for con-consul-0.con-consul to come up
Waiting for con-consul-1.con-consul to come up
Waiting for con-consul-2.con-consul to come up
==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> WARNING: WAN keyring exists but -encrypt given, using keyring
==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
           Version: 'v0.7.5'
           Node ID: 'c64a1df2-1b1c-4b6c-d49d-cfa2bdb4a637'
         Node name: 'con-consul-0'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.0.0.21 (LAN: 8301, WAN: 8302)
    Gossip encrypt: true, RPC-TLS: false, TLS-Incoming: false
             Atlas: <disabled>

==> Log data will now stream in as it occurs:

    2017/05/23 19:58:48 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.0.3.19:8300 Address:10.0.3.19:8300} {Suffrage:Voter ID:10.0.0.20:8300 Address:10.0.0.20:8300} {Suffrage:Voter ID:10.0.1.7:8300 Address:10.0.1.7:8300}]
    2017/05/23 19:58:48 [INFO] serf: EventMemberJoin: con-consul-0 10.0.0.21
    2017/05/23 19:58:48 [INFO] serf: EventMemberJoin: con-consul-0.dc1 10.0.0.21
    2017/05/23 19:58:48 [INFO] raft: Node at 10.0.0.21:8300 [Follower] entering Follower state (Leader: "")
    2017/05/23 19:58:48 [INFO] serf: Attempting re-join to previously known node: con-consul-1: 10.0.1.7:8301
    2017/05/23 19:58:48 [INFO] consul: Adding LAN server con-consul-0 (Addr: tcp/10.0.0.21:8300) (DC: dc1)
    2017/05/23 19:58:48 [INFO] consul: Raft data found, disabling bootstrap mode
    2017/05/23 19:58:48 [WARN] serf: Failed to re-join any previously known node
    2017/05/23 19:58:48 [INFO] consul: Adding WAN server con-consul-0.dc1 (Addr: tcp/10.0.0.21:8300) (DC: dc1)
    2017/05/23 19:58:48 [INFO] agent: Joining cluster...
    2017/05/23 19:58:48 [INFO] agent: (LAN) joining: [10.0.0.21 10.0.1.7 10.0.3.20]
    2017/05/23 19:58:48 [INFO] serf: EventMemberJoin: con-consul-1 10.0.1.7
    2017/05/23 19:58:48 [INFO] serf: Re-joined to previously known node: con-consul-1: 10.0.1.7:8301
    2017/05/23 19:58:48 [INFO] consul: Adding LAN server con-consul-1 (Addr: tcp/10.0.1.7:8300) (DC: dc1)
    2017/05/23 19:58:48 [INFO] agent: (LAN) joined: 2 Err: <nil>
    2017/05/23 19:58:48 [INFO] agent: Join completed. Synced with 2 initial agents
    2017/05/23 19:58:48 [INFO] serf: EventMemberJoin: con-consul-2 10.0.3.20
    2017/05/23 19:58:48 [INFO] consul: Adding LAN server con-consul-2 (Addr: tcp/10.0.3.20:8300) (DC: dc1)
    2017/05/23 19:58:55 [ERR] agent: failed to sync remote state: No cluster leader
    2017/05/23 19:58:57 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/05/23 19:58:57 [INFO] raft: Node at 10.0.0.21:8300 [Candidate] entering Candidate state in term 3
    2017/05/23 19:58:57 [INFO] raft: Node at 10.0.0.21:8300 [Follower] entering Follower state (Leader: "")
    2017/05/23 19:59:00 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.0.20:8300 10.0.0.20:8300}: dial tcp 10.0.0.20:8300: getsockopt: no route to host
    2017/05/23 19:59:00 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.3.19:8300 10.0.3.19:8300}: dial tcp 10.0.3.19:8300: getsockopt: no route to host
    2017/05/23 19:59:06 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/05/23 19:59:06 [INFO] raft: Node at 10.0.0.21:8300 [Candidate] entering Candidate state in term 5
    2017/05/23 19:59:06 [INFO] raft: Node at 10.0.0.21:8300 [Follower] entering Follower state (Leader: "")
    2017/05/23 19:59:06 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.3.19:8300 10.0.3.19:8300}: dial tcp 10.0.3.19:8300: getsockopt: no route to host
    2017/05/23 19:59:09 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.0.20:8300 10.0.0.20:8300}: dial tcp 10.0.0.20:8300: getsockopt: no route to host
    2017/05/23 19:59:12 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/05/23 19:59:12 [INFO] raft: Node at 10.0.0.21:8300 [Candidate] entering Candidate state in term 7
    2017/05/23 19:59:13 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.3.19:8300 10.0.3.19:8300}: dial tcp 10.0.3.19:8300: getsockopt: no route to host
==> Newer Consul version available: 0.8.3 (currently running: 0.7.5)
    2017/05/23 19:59:15 [ERR] agent: coordinate update error: No cluster leader
    2017/05/23 19:59:15 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.0.20:8300 10.0.0.20:8300}: dial tcp 10.0.0.20:8300: getsockopt: no route to host
    2017/05/23 19:59:22 [WARN] raft: Election timeout reached, restarting election
    2017/05/23 19:59:22 [INFO] raft: Node at 10.0.0.21:8300 [Candidate] entering Candidate state in term 8
    2017/05/23 19:59:24 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.3.19:8300 10.0.3.19:8300}: dial tcp 10.0.3.19:8300: getsockopt: no route to host
    2017/05/23 19:59:25 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.0.20:8300 10.0.0.20:8300}: dial tcp 10.0.0.20:8300: getsockopt: no route to host
    2017/05/23 19:59:30 [ERR] agent: failed to sync remote state: No cluster leader
    2017/05/23 19:59:31 [WARN] raft: Election timeout reached, restarting election
    2017/05/23 19:59:31 [INFO] raft: Node at 10.0.0.21:8300 [Candidate] entering Candidate state in term 9
    2017/05/23 19:59:32 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.3.19:8300 10.0.3.19:8300}: dial tcp 10.0.3.19:8300: getsockopt: no route to host
    2017/05/23 19:59:34 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.0.20:8300 10.0.0.20:8300}: dial tcp 10.0.0.20:8300: getsockopt: no route to host
    2017/05/23 19:59:37 [WARN] raft: Election timeout reached, restarting election
    2017/05/23 19:59:37 [INFO] raft: Node at 10.0.0.21:8300 [Candidate] entering Candidate state in term 10
    2017/05/23 19:59:37 [INFO] raft: Node at 10.0.0.21:8300 [Follower] entering Follower state (Leader: "")
    2017/05/23 19:59:37 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.0.20:8300 10.0.0.20:8300}: dial tcp 10.0.0.20:8300: getsockopt: no route to host
    2017/05/23 19:59:38 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.3.19:8300 10.0.3.19:8300}: dial tcp 10.0.3.19:8300: getsockopt: no route to host
    2017/05/23 19:59:42 [WARN] raft: Heartbeat timeout from "" reached, starting election

I'm not sure if anyone else has had a similar experience and has any advice here. In the meantime I'm going to read up on how to set up a Consul cluster and try to work it out for myself.

@munnerz
Collaborator Author

munnerz commented May 23, 2017

Hmm - on closer inspection, it appears my new consul pods have different IPs.

Could consul have somehow cached the old pod IPs? How does it store peer names?

NAME           READY     STATUS    RESTARTS   AGE       IP          NODE
con-consul-0   1/1       Running   0          10m       10.0.0.21   gke-gs-staging-gke-default-pool-f57f3bd2-rjgd
con-consul-1   1/1       Running   0          11m       10.0.1.7    gke-gs-staging-gke-default-pool-f57f3bd2-xwqc
con-consul-2   1/1       Running   0          10m       10.0.3.20   gke-gs-staging-gke-default-pool-f57f3bd2-6hzr
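
One way to compare the two (a sketch; the pod names come from the output above, and on Consul 0.7.x the peer listing uses the hyphenated -list-peers flag rather than the 0.8+ subcommand form):

    # IPs the pods currently have, according to Kubernetes
    kubectl get pods -o wide | grep con-consul

    # Peer addresses Consul has persisted in its Raft configuration
    kubectl exec con-consul-0 -- consul operator raft -list-peers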

@munnerz
Collaborator Author

munnerz commented May 23, 2017

So it appears the old nodes aren't properly leaving the cluster before exiting.

From reading the Consul docs, it appears that to trigger Consul's automatic sending of a leave event, you must send it a SIGINT. Kubernetes sends SIGTERM when killing pods, so I think the member never notifies the cluster that it is leaving.

I attempted to create a preStop hook that runs consul leave, which appeared to work when I deleted a single node, but when all nodes were deleted the same issue occurred.

I've found a corresponding issue here: hashicorp/consul#1580, which suggests a move to Raft protocol v3 may help; however, it did not resolve everything in my testing (nodes are still listed as part of the cluster, and duplicate-node errors are thrown).

I think we need to set the leave_on_terminate config var, or alternatively trap the SIGTERM signal from k8s and swap it for a SIGINT.
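
A rough sketch of both ideas, assuming the official consul image with a config dir at /etc/consul.d (the paths and the wrapper itself are assumptions, not what the chart does today):

    #!/bin/sh
    # Option 1: let the agent leave gracefully on SIGTERM via its config
    # (config dir path is an assumption).
    printf '{"leave_on_terminate": true, "skip_leave_on_interrupt": false}\n' \
      > /etc/consul.d/leave.json

    # Option 2: wrap the agent and translate Kubernetes' SIGTERM into a
    # graceful "consul leave" before the process exits.
    consul agent -server -config-dir=/etc/consul.d "$@" &
    pid=$!
    trap 'consul leave; wait "$pid"' TERM
    wait "$pid"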

@lachie83
Contributor

lachie83 commented Jun 2, 2017

Thanks for raising this. I'd like you to retest after this PR gets merged: #1126

@lachie83
Contributor

lachie83 commented Jun 8, 2017

@munnerz Can you run some tests again using the updated chart?

@mitom

mitom commented Jun 13, 2017

@lachie83 I am facing the same issue (although on AWS, not GCE), using the latest version of the chart mentioned above. As a note, even in Consul 0.8 the Raft protocol version defaults to 2, not 3; it needs to be set explicitly in the config or as a flag. The issue remains on both Raft v2 and v3, though.

I found that if all nodes are terminated, then once they come back each of them tries to rejoin as an existing node, which works. They then update their own IP address, which also works. However, they don't know about each other: each one thinks it is in a cluster with the set of old nodes. They try to elect a new leader, which starts an infinite loop in which the election complains that it can't reach the voters (the old IP addresses of the servers) and times out.

I have tried updating the init script to use the pod DNS names instead (i.e. leaving the resolution to Consul), hoping that it would then store the DNS names instead of IPs, but no luck. The resolution itself works, so the init script could be simplified this way anyway, but it doesn't help with this issue: Consul still stores the IPs.
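
For reference, the join-by-DNS attempt described above amounts to something like this (hostnames assume the chart's headless service in the default namespace; Consul resolves them itself, so the Raft peer set still ends up as IPs):

    # Join by the stable per-pod DNS names instead of pre-resolving them to IPs.
    exec consul agent -server \
      -retry-join=con-consul-0.con-consul.default.svc.cluster.local \
      -retry-join=con-consul-1.con-consul.default.svc.cluster.local \
      -retry-join=con-consul-2.con-consul.default.svc.cluster.local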

Recovery of the cluster is possible, but it's far from automated, and I'm not sure it can be automated sanely. The procedure Consul recommends is to stop all servers, create a peers.json file (the structure differs between Raft v2 and v3) containing the information of all servers, put it in a known location inside the data dir on every server, and then restart them. The cluster then enters recovery, takes a snapshot, and overrides the cluster member list.

This is somewhat awkward on k8s because we can't just stop servers; they keep coming back. The workaround I found is to scale the cluster down to 1 instance, create the peers.json file on it with only its own information, and then kill it. When k8s reschedules the pod, it still has the file, enters recovery, and starts the cluster up, electing itself as leader. I could then scale the cluster back up to the desired size and the nodes would rejoin.

It's also worth noting that when recovery happens, the last changes from the Raft log are committed unconditionally, which makes this unsafe, or at least not something that should be done every time Consul starts up, only when the cluster is actually broken already. While a healthy cluster shouldn't lose all of its nodes at once, this is still an annoying process to go through to recover it.
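
As a sketch, the scale-to-one recovery above looks something like the following (not a tested script; the data dir path and the Raft v2 peers.json layout are assumptions, and Raft v3 expects objects with id/address/non_voter instead of plain address strings):

    # 1. Scale the cluster down to a single server.
    kubectl scale statefulset con-consul --replicas=1

    # 2. Write a peers.json containing only that server's own address
    #    (Raft protocol v2 format; data dir path is an assumption).
    kubectl exec con-consul-0 -- sh -c \
      'echo "[\"$(hostname -i):8300\"]" > /var/lib/consul/raft/peers.json'

    # 3. Delete the pod; the rescheduled pod still has the file on its PVC,
    #    enters recovery, and elects itself leader.
    kubectl delete pod con-consul-0

    # 4. Scale back up once the single node reports itself as leader.
    kubectl scale statefulset con-consul --replicas=3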

I have also tried the leave_on_terminate config setting as suggested by @munnerz, which works partially, but in the end it's not possible to have all members leave and then restart the cluster with a new set. The documentation suggests that removing old servers should only happen after new servers have been added.

I have also tried running consul info and consul join in the containers to see whether it helps at all; both give a 403 permission denied, and I'm unsure why, since authentication is not enabled. Some API endpoints do work: they tell me the leader is "" and that the list of peers is the 2 nodes from when the cluster was healthy (interestingly, not even the new IP of the current node is present). That's as much information as I could extract from it.

@munnerz Could you please share what results you got? I could not get my cluster functional without a manual recovery even using Raft v3, not even to the point of duplicate-node errors. Or was that not in Kubernetes?

@slackpad I am not sure if you can help shed some light on this, but is there a way (from Consul's perspective) to have all servers terminated and then let them bring the cluster back up when all of them have different IPs? The node UUIDs stay the same, all data stays the same, and the DNS names still resolve to the correct nodes, but every node ends up with a different IP.

@nrvale0
Contributor

nrvale0 commented Jun 16, 2017

I've not dug into this, but is there a chance that the IP address skew is a result of the Consul pods in the StatefulSet binding to data directories / PersistentVolumeClaims with stale data from a previous instantiation of the StatefulSet? I believe the default behavior is that PVCs are not automatically cleaned up on a StatefulSet rebuild.

@mitom

mitom commented Jun 19, 2017

@nrvale0 I don't think that would solve the issue. There are 2 aspects to it (in my opinion):

  1. If all the pods are deleted at once, the cluster is left without a leader. When a new instance joins, it tries to trigger an election with all of the offline instances; since it can't connect to any of them, it can't get a quorum, so the election fails and restarts.
  2. If all pods are deleted at once and the pods were to come back with a clean data directory, then all PVCs would be wiped, losing all of the data in the cluster.

edit: I may have misunderstood your point regarding stale data and addressed it as if it were about the same StatefulSet, but with new pods.

In case you meant the opposite and were talking about 2 separate Consul clusters with the same name, you are right, the PVCs must be cleaned up manually. In my case it's the same StatefulSet with fresh pods that is causing the issue.
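
For that separate-cluster case, the stale claims have to be removed by hand, along these lines (the claim names assume the chart's datadir volumeClaimTemplate):

    # StatefulSet PVCs survive a helm delete; remove them explicitly before
    # reinstalling a cluster under the same release name.
    kubectl get pvc
    kubectl delete pvc datadir-con-consul-0 datadir-con-consul-1 datadir-con-consul-2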

@mitom

mitom commented Jun 19, 2017

I've also noticed that running the cluster with a single replica seems to be able to recover from the pod being deleted; however, it ends up in a half-correct state:

2017-06-19T08:03:42.695392134Z ==> Consul agent running!
2017-06-19T08:03:42.695407651Z            Version: 'v0.8.3'
2017-06-19T08:03:42.695411698Z            Node ID: '19867e2c-5eca-bacd-0d43-b741fbcfff9d'
2017-06-19T08:03:42.695415108Z          Node name: 'consul-vault-consul-0'
2017-06-19T08:03:42.695418066Z         Datacenter: 'dc0'
2017-06-19T08:03:42.695420988Z             Server: true (bootstrap: true)
2017-06-19T08:03:42.695425656Z        Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600)
2017-06-19T08:03:42.695428716Z       Cluster Addr: 100.96.2.7 (LAN: 8301, WAN: 8302)
2017-06-19T08:03:42.695431583Z     Gossip encrypt: true, RPC-TLS: false, TLS-Incoming: false
2017-06-19T08:03:42.695434565Z              Atlas: <disabled>
consul-vault-consul-0:/# hostname -i
100.96.2.7
consul-vault-consul-0:/# consul members -token=*****
Node                   Address          Status  Type    Build  Protocol  DC
consul-vault-consul-0  100.96.2.7:8301  alive   server  0.8.3  2         dc0
consul-vault-consul-0:/# consul operator raft list-peers -token=******
Node       ID                                    Address          State     Voter  RaftProtocol
(unknown)  19867e2c-5eca-bacd-0d43-b741fbcfff9d  100.96.2.4:8300  follower  true   <=1

Note that the cluster address IP is set correctly and appears correctly in the member list as well, but the Raft IP is still the one from the old pod. This results in the pod crashing and restarting every 10 minutes or so. edit: the restarts happen because the liveness probe gets a permission denied error due to ACLs.

After performing the recovery as described previously:

consul-vault-consul-0:/# hostname -i
100.96.2.7
consul-vault-consul-0:/# consul members -token=*****
Node                   Address          Status  Type    Build  Protocol  DC
consul-vault-consul-0  100.96.2.7:8301  alive   server  0.8.3  2         dc0
consul-vault-consul-0:/# consul operator raft list-peers -token=*****
Node                   ID                                    Address          State   Voter  RaftProtocol
consul-vault-consul-0  19867e2c-5eca-bacd-0d43-b741fbcfff9d  100.96.2.7:8300  leader  true   3

I am beginning to wonder whether this is something we can solve at the Kubernetes level or not. The only way I can see to alleviate it is by somehow forcing the pods to retain their private IPs.

@slackpad

On the Consul side the root cause of this should be tracked by hashicorp/consul#1580. We've got support for handling servers changing IPs at the Raft level, but it's not completely plumbed up into the upper layers of Consul.

One interesting bit though is that we still need a quorum to make changes to the configuration, so this would work if a minority of servers was restarted with new IPs, but it still won't work if you have a majority of servers restart with new IPs.
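
To make that concrete: with 3 servers the quorum is 2, so at most one server can come back with a new IP at a time. A rolling replacement that stays within that constraint might look like this sketch (Consul 0.8+ CLI syntax; add -token=... if ACLs are enabled):

    # Replace one server at a time, waiting for a leader to be re-established
    # (and the new IP absorbed) before touching the next pod.
    for i in 2 1 0; do
      kubectl delete pod "con-consul-$i"
      until kubectl exec con-consul-0 -- consul operator raft list-peers 2>/dev/null \
          | grep -q leader; do
        sleep 5
      done
    done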

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 29, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
