
Resilient server cluster in an environment where containers get new IPs #1306

Closed
peterbroadhurst opened this issue Oct 15, 2015 · 10 comments

@peterbroadhurst

I've been reading through #993 #454 and other posts, and would echo the comment made by @pikeas here:
#993 (comment)

We have had some failures in our test environments, where we've been able to recover the cluster.
We have some specific constraints, which might be unique to us today, but I think are potentially common to a number of implementations going forwards:

  • We have to run consul in a docker container
  • All our containers get dynamically assigned an IP address when they start
  • When containers terminate, we cannot ensure the consul agent receives a signal

We've currently got the following design (a rough agent invocation is sketched after the list), and would like some feedback on how valid it is and how we can improve it:

  • a server cluster of 3 (planning to extend to 5, or 7)
  • bootstrap-expect set to 3 on all servers
  • leave_on_terminate is true
  • if the whole container fails, we discard the filesystem - data-dir is wiped out
  • if we need to create a new server node, we give it a new name and it joins with a new IP
  • we restart consul inside the docker container with the same parameters, if just the process exits
  • we edit peers.json manually when we cycle a new container into the server cluster
  • we can cope with the complete loss of the consul cluster, but with a significant impact (hours of downtime)
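
For context, a rough sketch of the agent invocation this design implies (the image name, data dir, and join address are placeholders, not our actual values; leave_on_terminate has no CLI flag, so it lives in the config file):

    # Placeholders: our-consul-image, /opt/consul, $EXISTING_SERVER_IP
    docker run -d --name consul-server \
      -v /opt/consul:/data \
      our-consul-image agent -server \
        -bootstrap-expect=3 \
        -data-dir=/data \
        -node="consul-$(hostname)" \
        -retry-join="$EXISTING_SERVER_IP" \
        -config-file=/config/server.json    # sets "leave_on_terminate": true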

Our main practical concerns are:

  • what is the right step to take after a server crashes (or is manually deleted to patch in a new docker image), to keep the rest of the cluster healthy?
  • if we ever lose quorum, is there any real alternative to completely deleting the cluster and creating a new one? (We've not been very successful in recovering.)
@josdirksen

We're pretty much in the same scenario. We also run Consul in docker containers (on AWS EC2) and basically want to have a healthy system in all cases. So when one instance goes down (out of three) I just want to provision a new one automatically without any manual interaction.

For our other components (especially Cassandra) we've got a good setup, but with Consul we also run into the issues you mentioned (and the ones mentioned in the posts you referenced). I'm going to automate the process of editing the peers.json files and cycling the affected Consul instances.

The most annoying thing is that we also use Consul DNS for service discovery, so we might run into a few failed lookups if a DNS TTL expiry happens to coincide with a restart.

As an addition, I'm exploring how force-leave affects the peers.json file, and so far this seems to be a nice alternative. Basically, when a machine goes down it takes a couple of minutes to come back up. During this time, my other Consul nodes will see the terminated node as failed. I check whether the node's IP is no longer part of our ASG and, if so, force-leave it on the other Consul nodes. This way, I seem to be able to terminate and reattach Consul nodes without losing quorum and without having to cycle the healthy nodes.
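
A rough sketch of that ASG check (assuming the AWS CLI is available; the group name and variable names are placeholders, and the force-leave is run on each Consul node, via docker exec in our case):

    # List the private IPs of instances currently in the ASG
    ASG_IPS=$(aws ec2 describe-instances \
      --filters "Name=tag:aws:autoscaling:groupName,Values=consul-asg" \
      --query 'Reservations[].Instances[].PrivateIpAddress' --output text)

    # If the failed node's IP is no longer in the ASG, force-leave it
    if ! echo "$ASG_IPS" | grep -qwF "$FAILED_NODE_IP"; then
      consul force-leave "$FAILED_NODE_NAME"
    fi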

@slackpad slackpad added thinking More time is needed to research by the Consul Contributors type/docs Documentation needs to be created/updated/clarified labels Jan 9, 2016
@slackpad slackpad self-assigned this Jan 9, 2016
@jmspring

+1 to track.

Running into this right now.

Re: automating the editing of peers.json, I've got a setup where, when a node goes down, the replacement likely knows the address of the peer it is replacing (network-attached storage). The script I have sets up a new container but also issues a force-leave on the prior node's IP. My understanding is that this impacts the upper (Serf) layer but not the Raft layer (peers.json).

@josdirksen - how are you using force-leave? I'm trying to use it as described above, yet I don't see peers.json updated on the remaining healthy nodes, and thus run into leader-election issues.

@josdirksen

I run the following every five minutes to check whether there are any dead nodes. I do this on each node of the consul cluster, since (if I remember correctly) force-leave states don't automatically propagate between cluster members.

# Get this host's internal IP address
IP=$(ip addr | grep 'state UP' -A2 | grep ' eth' | tail -n 1 | awk '{print $2}' | cut -f1 -d '/')

# Set up docker location
export DOCKER_HOST=$IP:5000

# Find the local Consul container
DOCKER_OUTPUT=$(docker ps --format "{{.ID}},{{.Names}}" | grep -i consul)
IFS=',' read -r -a NAME_MAP <<< "$DOCKER_OUTPUT"
CONSUL_ID="${NAME_MAP[0]}"

# Collect the names of members in the failed state
ALL_MEMBERS=$(docker exec "$CONSUL_ID" consul members)
FAILED_MEMBERS=$(echo "$ALL_MEMBERS" | grep -i "failed" | tr -s ' ' | cut -d ' ' -f 1)

echo "Following members are in failed state: $FAILED_MEMBERS"

# Force-leave each failed member
while read -r TOBEREMOVED; do
    if [ -n "$TOBEREMOVED" ]; then
      echo "Removing host from cluster: $TOBEREMOVED"
      docker exec "$CONSUL_ID" consul force-leave "$TOBEREMOVED"
    fi
done <<< "$FAILED_MEMBERS"
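
Wired up via cron, e.g. (the script path is a placeholder):

    */5 * * * * /opt/scripts/consul-cleanup-failed.sh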

Not that exciting, but it seems to work in our scenario and keeps peers.json set up correctly without having to change it manually. But I'll do a double-check based on your comment.

@CpuID

CpuID commented Apr 19, 2016

Just a note, force-leave seems to do part of the job, but the failed Raft peer remains behind and still needs to be cleaned up.

See below (sanitised slightly) - showing that RequestVote calls are still attempted after the force-leave:

    2016/04/19 22:38:43 [INFO] Force leaving node: b65e2549a220
    2016/04/19 22:38:43 [INFO] serf: EventMemberLeave (forced): b65e2549a220 10.X.0.254
    2016/04/19 22:38:43 [INFO] consul: removing LAN server b65e2549a220 (Addr: 10.X.0.254:8300) (DC: dcname)
    2016/04/19 22:38:43 [ERR] raft: Failed to make RequestVote RPC to 10.X.10.176:8300: dial tcp 10.X.10.176:8300: getsockopt: no route to host
    2016/04/19 22:38:44 [WARN] raft: Heartbeat timeout reached, starting election
    2016/04/19 22:38:44 [INFO] raft: Node at 10.X.10.107:8300 [Candidate] entering Candidate state
    2016/04/19 22:38:45 [INFO] raft: Node at 10.X.10.107:8300 [Follower] entering Follower state
    2016/04/19 22:38:45 [ERR] raft: Failed to make RequestVote RPC to 10.X.30.179:8300: dial tcp 10.X.30.179:8300: i/o timeout
    2016/04/19 22:38:45 [ERR] raft: Failed to make RequestVote RPC to 10.X.0.254:8300: dial tcp 10.X.0.254:8300: i/o timeout
    2016/04/19 22:38:45 [ERR] raft: Failed to make RequestVote RPC to 10.X.20.71:8300: dial tcp 10.X.20.71:8300: i/o timeout
    2016/04/19 22:38:46 [WARN] raft: Heartbeat timeout reached, starting election
    2016/04/19 22:38:46 [INFO] raft: Node at 10.X.10.107:8300 [Candidate] entering Candidate state
    2016/04/19 22:38:47 [INFO] raft: Node at 10.X.10.107:8300 [Follower] entering Follower state
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.30.179:8300: dial tcp 10.X.30.179:8300: i/o timeout
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.0.254:8300: dial tcp 10.X.0.254:8300: i/o timeout
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.20.71:8300: dial tcp 10.X.20.71:8300: i/o timeout
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.10.176:8300: dial tcp 10.X.10.176:8300: getsockopt: no route to host
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.10.176:8300: dial tcp 10.X.10.176:8300: getsockopt: no route to host

In addition, raft/peers.json retains the failed IPs (also sanitised slightly):

/ # cat /data/raft/peers.json
["10.X.30.37:8300","10.X.20.71:8300","10.X.0.58:8300","10.X.10.107:8300","10.X.10.108:8300","10.X.0.254:8300","10.X.10.176:8300","10.X.30.179:8300"]
/ # consul members -rpc-addr=127.0.0.1:8401 | grep server
Node          Address             Status  Type    Build  Protocol  DC  
243a3f2f830f  10.X.10.107:8303  alive   server  0.6.4  2         dcname
2e65154cdba1  10.X.0.58:8303    alive   server  0.6.4  2         dcname
50a6ece6daab  10.X.30.179:8303  left    server  0.6.4  2         dcname
60de5aab3ea2  10.X.10.108:8303  alive   server  0.6.4  2         dcname
9b2a79a7b0f9  10.X.0.221:8303   alive   server  0.6.4  2         dcname
b65e2549a220  10.X.0.254:8303   failed  server  0.6.4  2         dcname
b88597ea25de  10.X.20.71:8303   left    server  0.6.4  2         dcname
e93578b3e10a  10.X.30.37:8303   alive   server  0.6.4  2         dcname
/ # consul force-leave -rpc-addr=127.0.0.1:8401 b65e2549a220
/ # cat /data/raft/peers.json
["10.X.30.37:8300","10.X.20.71:8300","10.X.0.58:8300","10.X.10.107:8300","10.X.10.108:8300","10.X.0.254:8300","10.X.10.176:8300","10.X.30.179:8300"]
/ # consul members -rpc-addr=127.0.0.1:8401 | grep server
Node          Address             Status  Type    Build  Protocol  DC  
243a3f2f830f  10.X.10.107:8303  alive   server  0.6.4  2         dcname
2e65154cdba1  10.X.0.58:8303    alive   server  0.6.4  2         dcname
50a6ece6daab  10.X.30.179:8303  left    server  0.6.4  2         dcname
60de5aab3ea2  10.X.10.108:8303  alive   server  0.6.4  2         dcname
9b2a79a7b0f9  10.X.0.221:8303   alive   server  0.6.4  2         dcname
b65e2549a220  10.X.0.254:8303   left    server  0.6.4  2         dcname
b88597ea25de  10.X.20.71:8303   left    server  0.6.4  2         dcname
e93578b3e10a  10.X.30.37:8303   alive   server  0.6.4  2         dcname

So while I think the force-leave approach is a good stepping stone, the manual cleanup of the raft/peers.json is still required.
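
To illustrate, a rough sketch of that manual cleanup (assumes jq is available; the container name and data dir are placeholders, and the agent should be stopped first so the edit isn't overwritten):

    DEAD_PEER="10.X.0.254:8300"    # Raft address of the peer to purge
    docker stop consul-server
    # Drop the dead peer from the Raft peer list
    jq --arg dead "$DEAD_PEER" 'map(select(. != $dead))' /data/raft/peers.json \
      > /data/raft/peers.json.new \
      && mv /data/raft/peers.json.new /data/raft/peers.json
    docker start consul-server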

A big nice-to-have would be the ability to remove a Raft peer programmatically (via an RPC/API call or CLI command, like most other operations), modifying the PeerStore implementation with something like SetPeers, or perhaps a dedicated RemovePeer() function. I'm not sure how feasible it is to add/remove peers online at the Raft layer (I haven't reviewed the spec in enough detail to comment), but ideally it would update the underlying JSON file as well. Alternatively, allow a SIGHUP on the Consul agent to re-read raft/peers.json, so the file can be modified externally without requiring a process restart (and the potential short-term query failures noted in #1306 (comment)).

@CpuID

CpuID commented Apr 20, 2016

Looks like my request above (re managing peers.json) exists as an issue - #1417

@CpuID

CpuID commented Jun 7, 2016

@slackpad are you able to provide any feedback on this one?

@CpuID

CpuID commented Jun 8, 2016

Related #1562 (comment) (3-4 comments there)

@CpuID

CpuID commented Jun 8, 2016

@sean- @slackpad - how would you feel about augmenting RemoveFailedNode to support cleaning up the Raft entry in addition to cleaning up the Serf side, for force-leave operations?

I wonder if the correct way to do this is to use the Serf layer to broadcast a Raft removal which is actioned on each node locally?

Or potentially hook into the lanNodeFailed/wanNodeFailed hooks as a Plan B? Not 100% sure if these are called on RemoveFailedNode scenarios right now.

@CpuID

CpuID commented Jun 10, 2016

I have managed to improve the resiliency of my cluster so far with the following:

  • as I run within Docker, using --net=host has eliminated all instances of the log line 2016/04/02 23:07:05 [WARN] memberlist: Was able to reach node3 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
  • I was relying heavily on leave_on_terminate, which in some scenarios does not complete, and you get something like the below:
SIGTERM received, initiate graceful shutdown.
==> Caught signal: terminated
==> Gracefully shutting down agent...
    2016/06/10 14:14:12 [INFO] consul: server starting leave
SIGTERM received, initiate graceful shutdown.
    2016/06/10 14:14:12 [INFO] agent: requesting shutdown
    2016/06/10 14:14:12 [INFO] consul: shutting down server
    2016/06/10 14:14:12 [WARN] serf: Shutdown without a Leave
    2016/06/10 14:14:12 [WARN] serf: Shutdown without a Leave
Terminated
    2016/06/10 14:14:12 [INFO] agent: shutdown complete

Instead of just a SIGTERM, I now perform a consul leave before the SIGTERM, which has drastically reduced the failed-node scenarios (nodes enter a left state instead). force-leave also seems slightly more reliable in the rare scenarios when it's required (probably because the cluster state isn't fubar).
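
The shutdown sequence is roughly the following (container name and stop timeout are placeholders):

    # Ask the agent to leave the cluster cleanly before the container gets its SIGTERM
    docker exec consul-server consul leave
    # Then stop the container (SIGTERM, escalating to SIGKILL after 30s)
    docker stop --time 30 consul-server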

@slackpad

slackpad commented May 5, 2017

Closing this out now that we've got Autopilot, which can remove dead peers automatically. More general forward work for supporting IP address changes is captured in #1580.
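
For anyone landing here later, the dead-server cleanup can be inspected and tuned via the operator CLI on versions that ship Autopilot, and a stale Raft peer can still be removed explicitly if needed:

    # Inspect / tune Autopilot's automatic dead-server cleanup
    consul operator autopilot get-config
    consul operator autopilot set-config -cleanup-dead-servers=true

    # Remove a stale Raft peer by hand
    consul operator raft list-peers
    consul operator raft remove-peer -address=10.X.0.254:8300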

@slackpad slackpad closed this as completed May 5, 2017