consul: Cluster doesn't recover after deleting all pods #1143
Hmm - on closer inspection, it appears my new consul pods have different IPs. Could consul have somehow cached the old pod IPs? How does it store peer names?
So it appears the old nodes aren't properly leaving the cluster before exiting. From reading the Consul docs, it appears that in order to trigger consul's automatic sending of a leave message, the agent has to shut down gracefully rather than simply being killed, so I attempted to create a preStop hook to run a graceful leave. I've also found this corresponding issue: hashicorp/consul#1580, which suggests a move to Raft protocol v3 may help; however, not all issues are resolved in my testing (nodes are still part of the cluster, and duplicate-node errors are being thrown). I think we need to set the relevant Consul configuration explicitly in the chart.
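Something like the following in the StatefulSet's pod template is what I have in mind; this is only a rough sketch, and the image tag plus the assumption that `consul leave` is the right command for a graceful leave are mine rather than taken from the chart:

```yaml
# Rough sketch of a preStop hook on the consul container (image tag is an
# assumption, not taken from the chart). The hook asks the agent to
# gracefully leave the cluster before the pod is terminated.
spec:
  terminationGracePeriodSeconds: 30   # give the agent time to finish leaving
  containers:
    - name: consul
      image: consul:0.8.3
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "consul leave"]
```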
Thanks for raising this. I would like to have you retest after this PR gets merged: #1126
@munnerz Can you run some tests again using the updated chart?
@lachie83 I am facing the same issue (although on AWS, not GCE), using the latest version of the chart mentioned above. Also, as a note, even in Consul 0.8 the Raft protocol version defaults to 2, not 3; it needs to be explicitly set in the config or as a flag. The issue remains on both Raft v2 and v3, though.

I found that if all nodes are terminated, once they come back they will each try to rejoin as an existing node, which works. They will then update their IP address, which also works. They will not know about each other, though: each thinks it is in a cluster with the set of old nodes. They will try to elect a new leader, which starts an infinite loop complaining that it can't reach voters (the old IP addresses of the servers) and the election times out.

I have tried updating the init script to use the pod DNS names instead (i.e. leaving the resolution to Consul), hoping it would then store the DNS names instead of IPs, but no luck. The resolution works, so the init script could be simplified this way anyway, but it doesn't help with this issue; Consul will still store the IPs.

Recovery of the cluster is possible, but it's far from automated and I'm not sure it's possible to automate it sanely. The way recommended by Consul is to stop all servers, create a peers.json file (different structure for Raft v2 and v3) with the information of all servers, put it on all servers in a location in the data dir, then restart them. The cluster will then enter recovery, take a snapshot and override the cluster member list. This is somewhat difficult on k8s as we can't just stop servers; they will keep coming back. The solution I found is to scale the cluster down to 1 instance, create the peers.json file on it with only its own information, and then kill it. When k8s reschedules it, it will still have the file, enter recovery and start up the cluster, electing itself as leader. I could then scale the cluster back up to the desired state and the nodes would re-enter. It's also worth noting that when recovery happens, the outstanding entries in the Raft log are committed unconditionally, making this unsafe, or at least not something that should be done every time Consul starts up, only when the cluster is actually broken already. While I think a healthy cluster shouldn't lose all nodes at once, this is still an annoying process to go through to recover it.

I have tried a few other options and manual workarounds as well, without success.

@munnerz Could you please share what results you got? I could not get my cluster functional without recovery even using Raft v3, not even to the point of duplicate-node errors. Or was that not in Kubernetes?

@slackpad I am not sure if you can help shed some light on this, but is there a way (from Consul's perspective) to have all servers terminated and let them start up the cluster again with all of them having different IPs? Node UUIDs stay the same, all data stays the same, DNS names still resolve to the correct nodes, but all of them will have a different IP.
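For anyone following along, here is a sketch of the single-replica recovery described above. The StatefulSet/pod names, the data dir path and the node ID/address are placeholders and need to be adapted to the actual deployment; the peers.json layout is the documented Raft protocol v3 format.

```sh
# 1. Scale the cluster down to a single server.
kubectl scale statefulset consul --replicas=1

# 2. Drop a peers.json (Raft protocol v3 format) into the raft directory,
#    listing only this server. "id" must be the server's node ID and
#    "address" its advertise address with the server RPC port (8300).
kubectl exec consul-0 -- sh -c 'cat > /var/lib/consul/raft/peers.json <<EOF
[
  {
    "id": "11111111-2222-3333-4444-555555555555",
    "address": "10.32.0.5:8300",
    "non_voter": false
  }
]
EOF'

# 3. Kill the pod. When it is rescheduled it finds peers.json, performs the
#    manual recovery and elects itself leader.
kubectl delete pod consul-0

# 4. Scale back up; the remaining servers rejoin the now-healthy cluster.
kubectl scale statefulset consul --replicas=3
```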
I've not dug into this, but is there a chance that the IP address skew could be a result of the Consul pods in the StatefulSet binding to data directories/PersistentVolumeClaims with stale data from a previous instantiation of the StatefulSet? I believe the default behavior is that PVCs are not automagically cleaned up on a StatefulSet rebuild.
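If stale PVCs were the cause, they could be checked and cleaned up manually; the label selector and claim names below are guesses based on a typical volumeClaimTemplate, not copied from the chart:

```sh
# List PVCs left behind by a previous instantiation of the StatefulSet
# (label selector is an assumption).
kubectl get pvc -l app=consul

# Deleting the StatefulSet does not remove them; they have to be deleted
# explicitly for a truly fresh cluster (claim names are assumptions).
kubectl delete pvc datadir-consul-0 datadir-consul-1 datadir-consul-2
```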
@nrvale0 I don't think that would solve the issue; there are two aspects to it (in my opinion).
edit: I may have misunderstood your point regarding stale data and addressed it as if it were the same StatefulSet but with new pods. In case you meant the opposite and were talking about two separate Consul clusters that were named the same, you are right: the PVCs must be manually cleaned up. In my case it's the same StatefulSet with fresh pods that is causing the issue.
I've also noticed that running the cluster with a single replica seems to be able to recover from the pod being deleted; however, it ends up in a half-correct state: the cluster addr IP is correctly set and appears correct in the member list as well, but the Raft peer IP is still that of the old pod. After performing the recovery as described previously, the Raft configuration reflects the new IP as well.
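One way to see that discrepancy on the surviving pod is to compare the Serf member list with the Raft peer set (the pod name is a placeholder):

```sh
# Serf member list: shows the node under its new pod IP.
kubectl exec consul-0 -- consul members

# Raft peer set: may still list the old pod IP for the same server.
kubectl exec consul-0 -- consul operator raft list-peers
```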
I am beginning to wonder whether this is something we can solve at the Kubernetes level or not. The only way I see this alleviated is by somehow forcing the pods to retain their private IPs.
On the Consul side the root cause of this should be tracked by hashicorp/consul#1580. We've got support for handling servers changing IPs at the Raft level, but it's not completely plumbed up into the upper layers of Consul. One interesting bit though is that we still need a quorum to make changes to the configuration, so this would work if a minority of servers was restarted with new IPs, but it still won't work if you have a majority of servers restart with new IPs.
I'm trying to use the stable/consul Helm chart to deploy a Consul cluster. With the default configuration (minus setting an appropriate storage class), on GKE v1.6.2, the cluster comes up fine.
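For reference, the deployment was essentially the chart with its defaults, along these lines; the storage class value name is a guess at the chart's values rather than copied from it:

```sh
# Helm 2 style install of the chart, overriding only the storage class
# (value name is an assumption; check the chart's values.yaml).
helm install stable/consul --name consul --set StorageClass=standard
```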
It seems, however, that if I delete any pods they are unable to rejoin the cluster and leader election starts to fail.
I'm not too sure if anyone else has had a similar experience and has some advice here? I'm going to look into how to set up a Consul cluster to work it out for myself.