This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

consul - unable to recover cluster #965

Closed
ajohnstone opened this issue Apr 25, 2017 · 3 comments

Comments

@ajohnstone
Contributor

ajohnstone commented Apr 25, 2017

If you increase and then decrease the cluster size, you lose the cluster leader and are then unable to recover.
I raised this and submitted a PR while the chart was still in the incubator, but it never reached a final solution.

Issue: kelseyhightower/consul-on-kubernetes#6
PR: kelseyhightower/consul-on-kubernetes#7

Tested again against the consul chart at 0.2.0:

kubectl --namespace=consul patch statefulset/consul-consul -p '{"spec":{"replicas": 10}}'
sleep 100
kubectl --namespace=consul patch statefulset/consul-consul -p '{"spec":{"replicas": 5}}'
sleep 60
for i in $(seq 0 9); do kubectl --namespace=consul exec consul-consul-$i -- sh -c 'consul operator raft -list-peers'; done
...
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)
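As an aside, Consul's outage-recovery guide (https://www.consul.io/docs/guides/outage.html) describes a manual escape hatch for exactly this leaderless state: stop the surviving servers, write a raft/peers.json file into each server's data directory listing the remaining peers, and restart. A rough sketch of the file, assuming Raft protocol v2 (the default at the time) and placeholder pod IPs:

```json
["10.32.0.1:8300", "10.32.0.2:8300", "10.32.0.3:8300"]
```

Each entry is server-IP:server-port (8300 is the default server RPC port); the file is consumed and removed when the agent starts.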

@lachie83 Any further ideas on this?

The following is a good suggestion, but does not fix the underlying problem.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: consul
spec:
  minAvailable: {{default 3 .Values.Replicas}} # -1 unsure how to do math in the templates?
  selector:
    matchLabels:
      heritage: {{.Release.Service | quote }}
      release: {{.Release.Name | quote }}
      chart: "{{.Chart.Name}}-{{.Chart.Version}}"
      component: "{{.Release.Name}}-{{.Values.Component}}"
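On the template-math question in the comment above: Helm templates include Sprig functions, among them sub, so a hedged sketch (assuming .Values.Replicas is an integer) could compute minAvailable as replicas minus one:

```yaml
# Hypothetical sketch: minAvailable = Replicas - 1 via Sprig's `sub`,
# falling back to 3 replicas when the value is unset.
minAvailable: {{ sub (default 3 .Values.Replicas) 1 }}
```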
@lachie83
Contributor

Thanks for raising this @ajohnstone. This is actually part of a bigger issue with Consul when using the bootstrap-expect parameter. It seems that the solution is to get Autopilot working in 0.8.0. Here's the upstream issue: hashicorp/consul#993

@ajohnstone
Contributor Author

ajohnstone commented Apr 25, 2017

Thanks for the quick response. It would be good to add a PodDisruptionBudget and to enable this feature by default on >=0.8.0. When searching before I got lost in very old tickets 👍

Further details here: https://github.com/hashicorp/consul/blob/master/website/source/docs/guides/autopilot.html.markdown

Simply adding

{
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "server_stabilization_time": "10s"
}
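For context, in the agent configuration file these keys live under an autopilot stanza (per the Autopilot guide linked above). A minimal sketch of a server config, with the surrounding keys assumed rather than taken from this thread:

```json
{
  "server": true,
  "raft_protocol": 3,
  "autopilot": {
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "server_stabilization_time": "10s"
  }
}
```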

Or using the CLI option to set Raft protocol 3, documented here:
https://www.consul.io/docs/agent/options.html#_raft_protocol

@lachie83
Contributor

There is a Consul chart PR out for this: #1126. It is recoverable.

PDB sounds like a good idea. Would be happy to review a PR.
