This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

consul - unable to recover cluster #965

Closed
ajohnstone opened this issue Apr 25, 2017 · 3 comments

Comments

@ajohnstone
Contributor

ajohnstone commented Apr 25, 2017

If you increase and then decrease the cluster size, you lose the cluster leader and are then unable to recover.
I raised this and submitted a PR while the chart was still in the incubator, but it never reached a final solution.

Issue: kelseyhightower/consul-on-kubernetes#6
PR: kelseyhightower/consul-on-kubernetes#7

Tested again against the consul chart at 0.2.0:

kubectl --namespace=consul patch statefulset/consul-consul -p '{"spec":{"replicas": 10}}'
sleep 100
kubectl --namespace=consul patch statefulset/consul-consul -p '{"spec":{"replicas": 5}}'
sleep 60
for i in $(seq 0 9); do kubectl --namespace=consul exec consul-consul-$i -- sh -c 'consul operator raft -list-peers'; done
...
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)
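As an aside, Consul's outage-recovery guide (https://www.consul.io/docs/guides/outage.html) describes a manual escape hatch for exactly this leaderless state: stop the surviving servers, write a raft/peers.json file into each server's data directory listing the remaining peers, and restart. A rough sketch of the file, assuming Raft protocol v2 (the default at the time) and placeholder pod IPs:

```json
["10.32.0.1:8300", "10.32.0.2:8300", "10.32.0.3:8300"]
```

Each entry is server-IP:server-port (8300 is the default server RPC port); the file is consumed and removed when the agent starts.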

@lachie83 Any further ideas on this?

The following is a good suggestion, but does not fix the underlying problem.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: consul
spec:
  minAvailable: {{default 3 .Values.Replicas}} # -1 unsure how to do math in the templates?
  selector:
    matchLabels:
      heritage: {{.Release.Service | quote }}
      release: {{.Release.Name | quote }}
      chart: "{{.Chart.Name}}-{{.Chart.Version}}"
      component: "{{.Release.Name}}-{{.Values.Component}}"
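On the template-math question in the comment above: Helm templates include Sprig functions, among them sub, so a hedged sketch (assuming .Values.Replicas is an integer) could compute minAvailable as replicas minus one:

```yaml
# Hypothetical sketch: minAvailable = Replicas - 1 via Sprig's `sub`,
# falling back to 3 replicas when the value is unset.
minAvailable: {{ sub (default 3 .Values.Replicas) 1 }}
```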
@lachie83
Contributor

Thanks for raising this @ajohnstone. This is actually part of a bigger issue with Consul when using the bootstrap-expect parameter. It seems that the solution is to get Autopilot working in 0.8.0. Here's the upstream issue: hashicorp/consul#993

@ajohnstone
Contributor Author

ajohnstone commented Apr 25, 2017

Thanks for the quick response. It would be good to add a PodDisruptionBudget and to enable this feature by default on >=0.8.0. When searching before I got lost in very old tickets 👍

Further details here: https://github.com/hashicorp/consul/blob/master/website/source/docs/guides/autopilot.html.markdown

Simply adding

{
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "server_stabilization_time": "10s"
}
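For context, in the agent configuration file these keys live under an autopilot stanza (per the Autopilot guide linked above). A minimal sketch of a server config, with the surrounding keys assumed rather than taken from this thread:

```json
{
  "server": true,
  "raft_protocol": 3,
  "autopilot": {
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "server_stabilization_time": "10s"
  }
}
```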

Or using the CLI option to set Raft protocol 3, documented here:
https://www.consul.io/docs/agent/options.html#_raft_protocol

@lachie83
Contributor

There is a Consul chart PR out for this: #1126. It is recoverable.

PDB sounds like a good idea. Would be happy to review a PR.
