
Tricky to bootstrap cluster after stopped gracefully #750

Closed
pepov opened this issue Mar 3, 2015 · 26 comments
@pepov

pepov commented Mar 3, 2015

I've observed that I'm unable to restart a gracefully stopped server cluster (using the official 0.5 build). It seems that the peers.json content is set to null during a graceful leave on each node, which breaks a subsequent bootstrap. I only managed to start the cluster by setting the proper value in peers.json after the agent had been stopped.

Is this a bug or am I missing something?

@highlyunavailable
Contributor

This happened for me in 0.4.1 as well. I actually disabled leave-on-terminate for my "server" role Consul agents because of this after the cluster ate itself a few times.

@pepov
Author

pepov commented Mar 3, 2015

Why do nodes write null in it when leaving?

@highlyunavailable
Contributor

They do not write null, they remove themselves - that's the point of a graceful leave. In 0.4.1, the contents of the file were simply [] (an empty array) once all nodes had gracefully left. Are you seeing an actual null value instead?

@pepov
Author

pepov commented Mar 3, 2015

Yes, null, even if I stop only one node. I mean the contents on the other nodes are ok, but the one that left leaves only a null behind.

cat /opt/consul/consul_dc_0/data/raft/peers.json 
null

@armon
Member

armon commented Mar 4, 2015

@pepov It has to do with the semantics of leave. With a leave you are basically telling Consul "please announce to the cluster your intention to leave, and do not re-join this cluster again unless explicitly told to." As part of that, the node leaves the raft replication group and broadcasts a gossip message with its intent to leave. When we leave the raft group, the peers are purposely set to null to avoid bothering the existing cluster on restart. Does that make sense?

@pepov
Author

pepov commented Mar 4, 2015

@armon I understand, but I'm confused because I thought I was being explicit by setting start_join in the config. Why is the gossip layer allowed to join back together while only the raft peer set is handled differently in this case? Is there a config flag I should use to avoid this, or should I simply avoid leaving on server nodes?

@armon
Member

armon commented Mar 5, 2015

@pepov The gossip layer is simpler since joins are idempotent and there are no issues of data safety. The raft layer is much more sensitive, and hence a little more challenging to work with. Typically, you avoid this by never having servers leave. It's rather strange operationally to have servers come and go.

@j1n6

j1n6 commented Jun 8, 2015

Typically, you avoid this by never having servers leave.

Maybe I misunderstood your comments. After reading this thread and #454, it seems recovery after planned or unplanned failure was not designed to be covered. That's probably not a good assumption when designing an HA system; virtual instances and bare-metal deployments would run into all sorts of issues. With that assumption there's no difference from deploying to a single node, which seems to me to defeat the purpose of building a distributed system...

@awheeler
Contributor

When updating to a new version of Consul, the servers leave, which is expected, correct? I accidentally updated all 3 servers simultaneously, ran into this issue, and could not get the cluster to bootstrap until I created an Atlas account.
So I gather the error is having the servers leave when the consul agent stops?

@slackpad
Contributor

@awheeler If you shut down the servers gracefully they will leave the cluster and not re-join when you restart. You'll need to join each new server as it comes up with the new version of Consul (or follow the process discussed in #993 by fixing up the peers.json file if you accidentally have all 3 nodes leave simultaneously).
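For reference, in pre-0.7 Consul the peers.json recovery file was a plain JSON array of server raft addresses (IP plus server RPC port, 8300 by default). A minimal sketch for a three-server cluster, with placeholder addresses:

```json
[
  "10.0.1.10:8300",
  "10.0.1.11:8300",
  "10.0.1.12:8300"
]
```

With all agents stopped, an identical copy of this file goes into each server's raft/ directory before the servers are restarted.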

@theonewolf

I ran into this same issue today. Is there no way to gracefully take down an entire Consul cluster and gracefully bring it back up?

I am trying to run consul join .... with all of my node IPs while they are live and it doesn't work at all. The raft peer count remains at 0, and the peers.json file still contains the value null.

How do we, for example, reboot an experimental Consul cluster? Process termination seems nasty.

@armon
Member

armon commented Jul 22, 2015

@theonewolf The experience around this can definitely be improved, as Consul was designed for continuous operation. It really dislikes being stopped, so the notion of a graceful shutdown of an entire cluster really wasn't planned for. There is handling for a total disaster failure (power loss), but not as something people do intentionally. That said, you can just kill -9 everything and it will survive.

@j1n6

j1n6 commented Jul 22, 2015

I can confirm kill -9 works well for now. I have exercised killing and recovering a server or agent under failure scenarios (e.g. DC power cut, server crash).


@theonewolf

@armon I see. I think my confusion (and others') comes from incomplete documentation.

I didn't realize ctrl-c took a node out of the cluster in a deep sense. But the design makes sense to me.

My confusion came in when I saw things like "graceful shutdown", which I took to imply the possibility of a "graceful startup" and rejoining of the cluster even if all nodes were brought down.

So for upgrades and resets we should be sending a different signal to the Consul process?

I think this signal, or all of the signals Consul expects and the outcomes of those signals should be very well documented somewhere.

Maybe a table listing the signals and Consul states post-signal?

@armon
Member

armon commented Jul 23, 2015

@theonewolf Partially, I think this was a mistake UX-wise. I think that the "deep remove" should not have been the default behavior, but we are here now. It's a damned if you do, damned if you don't situation.

The problem with the signals table is that the behavior is configurable. It responds to 3 signals:

  • SIGHUP - Reload, always. Only some things can be reloaded.
  • SIGINT - Graceful leave by default, can be configured to be non-graceful.
  • SIGTERM - Non-graceful leave. Can be configured to be graceful.
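The configurable part maps to two agent options: skip_leave_on_interrupt (controls SIGINT) and leave_on_terminate (controls SIGTERM). A sketch of a server config that avoids a graceful leave on both signals (values are illustrative, not the defaults of every release):

```json
{
  "server": true,
  "skip_leave_on_interrupt": true,
  "leave_on_terminate": false
}
```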

@theonewolf

I see! This does lead to the potential for a healthy lifecycle if the signals are used appropriately.

I still think such a table should exist. I'd like to know all the external run-time events that could affect my running Consul processes. Even if an engineer initiated them by accident for example.

@biran0079

It really dislikes being stopped so the notion of a graceful shutdown of an entire cluster really wasn't planned for.

What if I accidentally "gracefully shutdown the whole cluster"? Is there any way to recover?

@awheeler
Contributor

Yes, and it's covered in #993 -- you manually populate the peers.json file. See @slackpad's example there.
I've used it several times since I read that, and it makes the whole situation a non-event.

@Dmitry1987

It's a bit of an old thread, but I'm getting confusing results regarding cluster recovery:

  1. When playing in Vagrant (VirtualBox) I can spin up from 3 to 6 servers, set all of them to bootstrap_expect 3, and kill them in any way while running ('consul leave', service start/stop, killing the process), even all nodes simultaneously, and the cluster is always easy to recover! Sometimes just by a service restart and sometimes it needs a peers.json fix, but it's always easy to bring back up, get a leader elected, and have the cluster operational.

  2. Trying to do the same in AWS, with 3 machines, plus another 5 machines as a WAN cluster (the 5 are alive and stable; the 3 are experimental and join the big cluster over WAN), those 3 machines always die completely. Any type of kill, graceful or not, breaks the cluster on the first attempt in a way I can't recover with peers.json! Those 3 machines, once interrupted and quorum is lost, never elect a leader again until I delete their data folder! :(

I use version 0.6.3.
This is sooo weird... any thoughts?

@slackpad
Contributor

slackpad commented Mar 9, 2016

Hi @Dmitry1987 that definitely should not be the case. Can you share a gist with your configuration and some log messages from when this happens?

@Dmitry1987

Hi @slackpad, I think I figured this out by adding some parameters that say "do not gracefully leave in any case, hard or soft kill". I don't remember exactly, but the problematic configuration BEFORE I found that solution was:

{
  "acl_datacenter": "dc-main",
  "acl_default_policy": "deny",
  "advertise_addr": "22.22.22.22",
  "advertise_addr_wan": "11.11.11.11",
  "bootstrap_expect": 3,
  "ca_file": "/zzzzz",
  "cert_file": "/zzzzzz",
  "data_dir": "/var/consul",
  "datacenter": "dc-new",
  "disable_remote_exec": true,
  "domain": "my.domain.consul",
  "enable_syslog": true,
  "encrypt": "zzzzzzzzzz",
  "key_file": "/zzzzzz",
  "log_level": "INFO",
  "server": true,
  "start_join": [
    "x.compute.amazonaws.com",
    "y.compute.amazonaws.com",
    "z.compute.amazonaws.com"
  ],
  "start_join_wan": [
    "zzz.compute.amazonaws.com",
    "yyz.compute.amazonaws.com"
  ],
  "syslog_facility": "LOCAL5",
  "ui": false,
  "verify_incoming": true,
  "verify_outgoing": true
}

@Dmitry1987

I don't remember exactly which two directives they were, but it was two settings that tell the server not to delete itself from the raft peers file, no matter which kill signal it receives (or "consul leave"). After that it was working well.
But we're no longer using Consul after that POC, too busy with other things... maybe we'll finally implement it one day :) .
Thanks!

@slackpad
Contributor

slackpad commented May 2, 2017

I'm going to close this out. Here's what we've done related to this:

  • Changed the default for leave_on_terminate and skip_leave_on_interrupt for servers to not make them leave when shut down, which is safer by default.
  • Removed the peer store, so there's not the confusing null.
  • Improved the outage documentation with more details and a peers.json example.
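For anyone landing here later: the peers.json recovery format described in the outage guide depends on the raft protocol version. With raft protocol 2 it remains an array of "ip:port" strings; with raft protocol 3 (introduced in the 0.8 line) each entry is an object carrying the server's node ID. A sketch with placeholder ID and address values:

```json
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  }
]
```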

@slackpad slackpad closed this as completed May 2, 2017
@theonewolf

@slackpad could you reference the most recent release with these features?

Thanks!

@slackpad
Contributor

slackpad commented May 2, 2017

@theonewolf 0.8.1 has them all :-) They came in 0.7.0, though, so 0.7.5 is fine too if you are on that series.

@theonewolf

@slackpad awesome. We are 0.7.5.
