
Resilient server cluster in an environment where containers get new IPs #1306

Closed
peterbroadhurst opened this issue Oct 15, 2015 · 10 comments

@peterbroadhurst

I've been reading through #993 #454 and other posts, and would echo the comment made by @pikeas here:
#993 (comment)

We have had some failures in our test environments, where we've been able to recover the cluster.
We have some specific constraints, which might be unique to us today, but I think are potentially common to a number of implementations going forwards:

  • We have to run consul in a docker container
  • All our containers get dynamically assigned an IP address when they start
  • When containers terminate, we cannot ensure the consul agent receives a signal

We've currently got the following design (a rough agent invocation is sketched after the list), and would like some feedback on how valid it is and how we can improve it:

  • a server cluster of 3 (planning to extend to 5, or 7)
  • bootstrap-expect set to 3 on all servers
  • leave_on_terminate is true
  • if the whole container fails, we discard the filesystem - data-dir is wiped out
  • if we need to create a new server node, we give it a new name and it joins with a new IP
  • we restart consul inside the docker container with the same parameters, if just the process exits
  • we edit peers.json manually when we cycle a new container into the server cluster
  • we can cope with the complete loss of the consul cluster, but with a significant impact (hours of downtime)
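
For context, a rough sketch of the agent invocation this design implies (the image name, data dir, and join address are placeholders, not our actual values; leave_on_terminate has no CLI flag, so it lives in the config file):

    # Placeholders: our-consul-image, /opt/consul, $EXISTING_SERVER_IP
    docker run -d --name consul-server \
      -v /opt/consul:/data \
      our-consul-image agent -server \
        -bootstrap-expect=3 \
        -data-dir=/data \
        -node="consul-$(hostname)" \
        -retry-join="$EXISTING_SERVER_IP" \
        -config-file=/config/server.json    # sets "leave_on_terminate": true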

Our main practical concerns are:

  • what is the right step to take after a server crashes (or is manually deleted to patch in a new docker image), to keep the rest of the cluster healthy?
  • if we ever lose quorum, is there any real alternative to completely deleting the cluster and creating a new one? (We've not been very successful in recovering.)
@josdirksen

We're pretty much in the same scenario. We also run Consul in docker containers (on AWS EC2) and basically want to have a healthy system in all cases. So when one instance goes down (out of three) I just want to provision a new one automatically without any manual interaction.

For our other components (especially Cassandra) we've got a good setup, but with Consul we also run into the issues you mentioned (and the ones mentioned in the posts you referenced). I'm going to automate the process of editing the peers.json files and cycling the affected Consul instances.

The most annoying thing is that we also use Consul DNS for service discovery, so we might run into a few failed lookups if a DNS TTL expiry happens to coincide with a restart.

As an addition, I'm exploring how force-leave affects the peers.json file, and so far this seems to be a nice alternative. Basically, when a machine goes down it takes a couple of minutes to come back up. During this time, my other Consul nodes will see the terminated node as failed. I check whether the node's IP is no longer part of our ASG and, if so, force-leave it on the other Consul nodes. This way, I seem to be able to terminate and reattach Consul nodes without losing quorum and without having to cycle the healthy nodes.
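
A rough sketch of that ASG check (assuming the AWS CLI is available; the group name and variable names are placeholders, and the force-leave is run on each Consul node, via docker exec in our case):

    # List the private IPs of instances currently in the ASG
    ASG_IPS=$(aws ec2 describe-instances \
      --filters "Name=tag:aws:autoscaling:groupName,Values=consul-asg" \
      --query 'Reservations[].Instances[].PrivateIpAddress' --output text)

    # If the failed node's IP is no longer in the ASG, force-leave it
    if ! echo "$ASG_IPS" | grep -qwF "$FAILED_NODE_IP"; then
      consul force-leave "$FAILED_NODE_NAME"
    fi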

@slackpad slackpad added thinking More time is needed to research by the Consul Contributors type/docs Documentation needs to be created/updated/clarified labels Jan 9, 2016
@slackpad slackpad self-assigned this Jan 9, 2016
@jmspring

+1 to track.

Running into this right now.

Re: automating the editing of peers.json, I've got a setup where, when a node goes down, the replacement likely knows the address of the peer it is replacing (network-attached storage). The script I have sets up a new container but also issues a force-leave on the prior node's IP. My understanding is that this impacts the upper (Serf) layer but not the Raft layer (peers.json).

@josdirksen - how are you using force-leave? I'm trying to use it as described above, yet I don't see peers.json updated on the remaining healthy nodes, and thus run into leader-election issues.

@josdirksen

I run the following every five minutes to check whether there are any dead nodes. I do this on each node of the consul cluster, since (if I remember correctly) force-leave states don't automatically propagate between cluster members.

# Get this host's internal IP address
IP=$(ip addr | grep 'state UP' -A2 | grep ' eth' | tail -n 1 | awk '{print $2}' | cut -f1 -d '/')

# Set up docker location
export DOCKER_HOST=$IP:5000

# Find the local Consul container
DOCKER_OUTPUT=$(docker ps --format "{{.ID}},{{.Names}}" | grep -i consul)
IFS=',' read -r -a NAME_MAP <<< "$DOCKER_OUTPUT"
CONSUL_ID="${NAME_MAP[0]}"

# Collect the names of members in the failed state
ALL_MEMBERS=$(docker exec "$CONSUL_ID" consul members)
FAILED_MEMBERS=$(echo "$ALL_MEMBERS" | grep -i "failed" | tr -s ' ' | cut -d ' ' -f 1)

echo "Following members are in failed state: $FAILED_MEMBERS"

# Force-leave each failed member
while read -r TOBEREMOVED; do
    if [ -n "$TOBEREMOVED" ]; then
      echo "Removing host from cluster: $TOBEREMOVED"
      docker exec "$CONSUL_ID" consul force-leave "$TOBEREMOVED"
    fi
done <<< "$FAILED_MEMBERS"
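
Wired up via cron, e.g. (the script path is a placeholder):

    */5 * * * * /opt/scripts/consul-cleanup-failed.sh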

Not that exciting, but it seems to work in our scenario and keeps peers.json set up correctly without having to change it manually. But I'll do a double-check based on your comment.

@CpuID

CpuID commented Apr 19, 2016

Just a note, force-leave seems to do part of the job, but the failed Raft peer remains behind and still needs to be cleaned up.

See below (sanitised slightly) - showing that RequestVote calls are still attempted after the force-leave:

    2016/04/19 22:38:43 [INFO] Force leaving node: b65e2549a220
    2016/04/19 22:38:43 [INFO] serf: EventMemberLeave (forced): b65e2549a220 10.X.0.254
    2016/04/19 22:38:43 [INFO] consul: removing LAN server b65e2549a220 (Addr: 10.X.0.254:8300) (DC: dcname)
    2016/04/19 22:38:43 [ERR] raft: Failed to make RequestVote RPC to 10.X.10.176:8300: dial tcp 10.X.10.176:8300: getsockopt: no route to host
    2016/04/19 22:38:44 [WARN] raft: Heartbeat timeout reached, starting election
    2016/04/19 22:38:44 [INFO] raft: Node at 10.X.10.107:8300 [Candidate] entering Candidate state
    2016/04/19 22:38:45 [INFO] raft: Node at 10.X.10.107:8300 [Follower] entering Follower state
    2016/04/19 22:38:45 [ERR] raft: Failed to make RequestVote RPC to 10.X.30.179:8300: dial tcp 10.X.30.179:8300: i/o timeout
    2016/04/19 22:38:45 [ERR] raft: Failed to make RequestVote RPC to 10.X.0.254:8300: dial tcp 10.X.0.254:8300: i/o timeout
    2016/04/19 22:38:45 [ERR] raft: Failed to make RequestVote RPC to 10.X.20.71:8300: dial tcp 10.X.20.71:8300: i/o timeout
    2016/04/19 22:38:46 [WARN] raft: Heartbeat timeout reached, starting election
    2016/04/19 22:38:46 [INFO] raft: Node at 10.X.10.107:8300 [Candidate] entering Candidate state
    2016/04/19 22:38:47 [INFO] raft: Node at 10.X.10.107:8300 [Follower] entering Follower state
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.30.179:8300: dial tcp 10.X.30.179:8300: i/o timeout
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.0.254:8300: dial tcp 10.X.0.254:8300: i/o timeout
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.20.71:8300: dial tcp 10.X.20.71:8300: i/o timeout
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.10.176:8300: dial tcp 10.X.10.176:8300: getsockopt: no route to host
    2016/04/19 22:38:47 [ERR] raft: Failed to make RequestVote RPC to 10.X.10.176:8300: dial tcp 10.X.10.176:8300: getsockopt: no route to host

In addition, raft/peers.json retains the failed IPs (also sanitised slightly):

/ # cat /data/raft/peers.json
["10.X.30.37:8300","10.X.20.71:8300","10.X.0.58:8300","10.X.10.107:8300","10.X.10.108:8300","10.X.0.254:8300","10.X.10.176:8300","10.X.30.179:8300"]
/ # consul members -rpc-addr=127.0.0.1:8401 | grep server
Node          Address             Status  Type    Build  Protocol  DC  
243a3f2f830f  10.X.10.107:8303  alive   server  0.6.4  2         dcname
2e65154cdba1  10.X.0.58:8303    alive   server  0.6.4  2         dcname
50a6ece6daab  10.X.30.179:8303  left    server  0.6.4  2         dcname
60de5aab3ea2  10.X.10.108:8303  alive   server  0.6.4  2         dcname
9b2a79a7b0f9  10.X.0.221:8303   alive   server  0.6.4  2         dcname
b65e2549a220  10.X.0.254:8303   failed  server  0.6.4  2         dcname
b88597ea25de  10.X.20.71:8303   left    server  0.6.4  2         dcname
e93578b3e10a  10.X.30.37:8303   alive   server  0.6.4  2         dcname
/ # consul force-leave -rpc-addr=127.0.0.1:8401 b65e2549a220
/ # cat /data/raft/peers.json
["10.X.30.37:8300","10.X.20.71:8300","10.X.0.58:8300","10.X.10.107:8300","10.X.10.108:8300","10.X.0.254:8300","10.X.10.176:8300","10.X.30.179:8300"]
/ # consul members -rpc-addr=127.0.0.1:8401 | grep server
Node          Address             Status  Type    Build  Protocol  DC  
243a3f2f830f  10.X.10.107:8303  alive   server  0.6.4  2         dcname
2e65154cdba1  10.X.0.58:8303    alive   server  0.6.4  2         dcname
50a6ece6daab  10.X.30.179:8303  left    server  0.6.4  2         dcname
60de5aab3ea2  10.X.10.108:8303  alive   server  0.6.4  2         dcname
9b2a79a7b0f9  10.X.0.221:8303   alive   server  0.6.4  2         dcname
b65e2549a220  10.X.0.254:8303   left    server  0.6.4  2         dcname
b88597ea25de  10.X.20.71:8303   left    server  0.6.4  2         dcname
e93578b3e10a  10.X.30.37:8303   alive   server  0.6.4  2         dcname

So while I think the force-leave approach is a good stepping stone, the manual cleanup of the raft/peers.json is still required.
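
To illustrate, a rough sketch of that manual cleanup (assumes jq is available; the container name and data dir are placeholders, and the agent should be stopped first so the edit isn't overwritten):

    DEAD_PEER="10.X.0.254:8300"    # Raft address of the peer to purge
    docker stop consul-server
    # Drop the dead peer from the Raft peer list
    jq --arg dead "$DEAD_PEER" 'map(select(. != $dead))' /data/raft/peers.json \
      > /data/raft/peers.json.new \
      && mv /data/raft/peers.json.new /data/raft/peers.json
    docker start consul-server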

A big nice-to-have would be the ability to remove a Raft peer programmatically (via an RPC/API call or CLI command, like most other operations), modifying the PeerStore implementation with something like SetPeers, or perhaps a dedicated RemovePeer() function. I'm not sure how feasible it is to add/remove peers online at the Raft layer (I haven't reviewed the spec in enough detail to comment), but ideally it would update the underlying JSON file as well. Alternatively, allow a SIGHUP on the Consul agent to re-read raft/peers.json, so the file can be modified externally without requiring a process restart (and the potential short-term query failures noted in #1306 (comment)).

@CpuID

CpuID commented Apr 20, 2016

Looks like my request above (re managing peers.json) exists as an issue - #1417

@CpuID

CpuID commented Jun 7, 2016

@slackpad are you able to provide any feedback on this one?

@CpuID

CpuID commented Jun 8, 2016

Related #1562 (comment) (3-4 comments there)

@CpuID

CpuID commented Jun 8, 2016

@sean- @slackpad - how would you feel about augmenting RemoveFailedNode to support cleaning up the Raft entry in addition to cleaning up the Serf side, for force-leave operations?

I wonder if the correct way to do this is to use the Serf layer to broadcast a Raft removal which is actioned on each node locally?

Or potentially hook into the lanNodeFailed/wanNodeFailed hooks as a Plan B? Not 100% sure if these are called on RemoveFailedNode scenarios right now.

@CpuID

CpuID commented Jun 10, 2016

I have managed to improve the resiliency of my cluster so far with the following:

  • as I run within Docker, using --net=host has eliminated all instances of the log line 2016/04/02 23:07:05 [WARN] memberlist: Was able to reach node3 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
  • I was relying heavily on leave_on_terminate, which in some scenarios does not complete, and you get something like the below:
SIGTERM received, initiate graceful shutdown.
==> Caught signal: terminated
==> Gracefully shutting down agent...
    2016/06/10 14:14:12 [INFO] consul: server starting leave
SIGTERM received, initiate graceful shutdown.
    2016/06/10 14:14:12 [INFO] agent: requesting shutdown
    2016/06/10 14:14:12 [INFO] consul: shutting down server
    2016/06/10 14:14:12 [WARN] serf: Shutdown without a Leave
    2016/06/10 14:14:12 [WARN] serf: Shutdown without a Leave
Terminated
    2016/06/10 14:14:12 [INFO] agent: shutdown complete

Instead of just a SIGTERM, I now perform a consul leave before the SIGTERM, which has drastically reduced the failed-node scenarios (nodes enter a left state instead). force-leave also seems slightly more reliable in the rare scenarios when it's required (probably because the cluster state isn't fubar).
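
The shutdown sequence is roughly the following (container name and stop timeout are placeholders):

    # Ask the agent to leave the cluster cleanly before the container gets its SIGTERM
    docker exec consul-server consul leave
    # Then stop the container (SIGTERM, escalating to SIGKILL after 30s)
    docker stop --time 30 consul-server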

@slackpad

slackpad commented May 5, 2017

Closing this out now that we've got Autopilot, which can remove dead peers automatically. More general forward work for supporting IP address changes is captured in #1580.
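
For anyone landing here later, the dead-server cleanup can be inspected and tuned via the operator CLI on versions that ship Autopilot, and a stale Raft peer can still be removed explicitly if needed:

    # Inspect / tune Autopilot's automatic dead-server cleanup
    consul operator autopilot get-config
    consul operator autopilot set-config -cleanup-dead-servers=true

    # Remove a stale Raft peer by hand
    consul operator raft list-peers
    consul operator raft remove-peer -address=10.X.0.254:8300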

@slackpad slackpad closed this as completed May 5, 2017