
Tricky to bootstrap cluster after stopped gracefully #750

Closed
pepov opened this issue Mar 3, 2015 · 26 comments
@pepov

pepov commented Mar 3, 2015

I've observed that I'm unable to restart a gracefully stopped server cluster (using the official 0.5 build). It seems that the peers.json content is set to null during a graceful leave on each node, which breaks a subsequent bootstrap. I only managed to start the cluster by setting the proper value in peers.json after the agent had been stopped.

Is this a bug or am I missing something?

@highlyunavailable
Contributor

This happened for me in 0.4.1 as well. I actually disabled leave-on-terminate for my "server" role Consul agents because of this after the cluster ate itself a few times.

@pepov
Author

pepov commented Mar 3, 2015

Why do nodes write null in it when leaving?

@highlyunavailable
Contributor

They do not write null, they remove themselves - that's the point of a graceful leave. In 0.4.1, the contents of the file were simply [] (an empty array) once all nodes had gracefully left. Are you seeing an actual null value instead?

@pepov
Author

pepov commented Mar 3, 2015

Yes, null, even if I stop only one node. I mean the contents on the other nodes are ok, but the one that left leaves only a null behind.

cat /opt/consul/consul_dc_0/data/raft/peers.json 
null

@armon
Member

armon commented Mar 4, 2015

@pepov It has to do with the semantics of leave. With a leave you are basically telling Consul "please announce to the cluster your intention to leave, and do not re-join this cluster again unless explicitly told to." As part of that, the node leaves the raft replication group and broadcasts a gossip message with its intent to leave. When we leave the raft group, the peers are purposely set to null to avoid bothering the existing cluster on restart. Does that make sense?

@pepov
Author

pepov commented Mar 4, 2015

@armon I understand, but I'm confused because I thought I was being explicit by setting start_join in the config. Why is the gossip layer allowed to join back together while only the raft peer set is handled differently in this case? Is there a config flag I should use to avoid this, or should I simply avoid leaving on server nodes?

@armon
Member

armon commented Mar 5, 2015

@pepov The gossip layer is simpler since joins are idempotent and there are no issues of data safety. The raft layer is much more sensitive, and hence a little more challenging to work with. Typically, you avoid this by never having servers leave. It's rather strange operationally to have servers come and go.

@j1n6

j1n6 commented Jun 8, 2015

Typically, you avoid this by never having servers leave.

Maybe I misunderstood your comments. After reading this thread and #454, it seems recovery after planned or unplanned failure was not designed to be covered. That's probably not a good assumption when designing an HA system; virtual instances and bare-metal deployments would run into all sorts of issues. With that assumption there's no difference from deploying to a single node, which seems to me to defeat the purpose of building a distributed system...

@awheeler
Contributor

When updating to a new version of Consul, the servers leave, which is expected, correct? I accidentally updated all 3 servers simultaneously, ran into this issue, and could not get the cluster to bootstrap until I created an Atlas account.
So I gather the error is having the servers leave when the consul agent stops?

@slackpad
Contributor

@awheeler If you shut down the servers gracefully they will leave the cluster and not re-join when you restart. You'll need to join each new server as it comes up with the new version of Consul (or follow the process discussed in #993 by fixing up the peers.json file if you accidentally have all 3 nodes leave simultaneously).
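For reference, in pre-0.7 Consul the peers.json recovery file was a plain JSON array of server raft addresses (IP plus server RPC port, 8300 by default). A minimal sketch for a three-server cluster, with placeholder addresses:

```json
[
  "10.0.1.10:8300",
  "10.0.1.11:8300",
  "10.0.1.12:8300"
]
```

With all agents stopped, an identical copy of this file goes into each server's raft/ directory before the servers are restarted.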

@theonewolf

I ran into this same issue today. Is there no way to gracefully take down an entire Consul cluster and gracefully bring it back up?

I am trying to run consul join .... with all of my node IPs while they are live and it doesn't work at all. The raft peer count remains at 0, and the peers.json file still contains the value null.

How do we, for example, reboot an experimental Consul cluster? Process termination seems nasty.

@armon
Member

armon commented Jul 22, 2015

@theonewolf The experience around this can definitely be improved, as Consul was designed for continuous operation. It really dislikes being stopped, so the notion of a graceful shutdown of an entire cluster really wasn't planned for. There is handling for a total disaster failure (power loss), but not as something people do intentionally. That said, you can just kill -9 everything and it will survive.

@j1n6

j1n6 commented Jul 22, 2015

I can confirm kill -9 works well for now. I have exercised killing and recovering a server or agent under failure scenarios (e.g. DC power cut, server crash).


@theonewolf

@armon I see. I think my confusion (and others') comes from incomplete documentation.

I didn't realize ctrl-c took a node out of the cluster in a deep sense. But the design makes sense to me.

My confusion came in when I saw things like "graceful shutdown", which I took to imply the possibility of a "graceful startup" and rejoining of the cluster even if all nodes were brought down.

So for upgrades and resets we should be sending a different signal to the Consul process?

I think this signal, or all of the signals Consul expects and the outcomes of those signals should be very well documented somewhere.

Maybe a table listing the signals and Consul states post-signal?

@armon
Member

armon commented Jul 23, 2015

@theonewolf Partially, I think this was a mistake UX-wise. I think that the "deep remove" should not have been the default behavior, but we are here now. It's a damned if you do, damned if you don't situation.

The problem with the signals table is that the behavior is configurable. It responds to 3 signals:

  • SIGHUP - Reload, always. Only some things can be reloaded.
  • SIGINT - Graceful leave by default, can be configured to be non-graceful.
  • SIGTERM - Non-graceful leave. Can be configured to be graceful.
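The configurable part maps to two agent options: skip_leave_on_interrupt (controls SIGINT) and leave_on_terminate (controls SIGTERM). A sketch of a server config that avoids a graceful leave on both signals (values are illustrative, not the defaults of every release):

```json
{
  "server": true,
  "skip_leave_on_interrupt": true,
  "leave_on_terminate": false
}
```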

@theonewolf

I see! This does lead to the potential for a healthy lifecycle if the signals are used appropriately.

I still think such a table should exist. I'd like to know all the external run-time events that could affect my running Consul processes. Even if an engineer initiated them by accident for example.

@biran0079

It really dislikes being stopped so the notion of a graceful shutdown of an entire cluster really wasn't planned for.

What if I accidentally "gracefully shutdown the whole cluster"? Is there any way to recover?

@awheeler
Contributor

Yes, and it's covered in #993 -- you manually populate the peers.json file. See @slackpad's example there.
I've used it several times since I read that, and it makes the whole situation a non-event.

@Dmitry1987

It's a bit of an old thread, but I'm getting confusing results regarding cluster recovery:

  1. When playing in Vagrant (VirtualBox) I can spin up from 3 to 6 servers, set all of them to bootstrap_expect 3, and kill them in any way while running ('consul leave', service start/stop, killing the process), even all nodes simultaneously, and the cluster is always easy to recover! Sometimes just by a service restart and sometimes it needs a peers.json fix, but it's always easy to bring back up, get a leader elected, and have the cluster operational.

  2. Trying to do the same in AWS, with 3 machines, plus another 5 machines as a WAN cluster (the 5 are alive and stable; the 3 are experimental and join the big cluster over WAN), those 3 machines always die completely. Any type of kill, graceful or not, breaks the cluster on the first attempt in a way I can't recover with peers.json! Those 3 machines, once interrupted and quorum is lost, never elect a leader again until I delete their data folder! :(

I use version 0.6.3.
This is sooo weird... any thoughts?

@slackpad
Contributor

slackpad commented Mar 9, 2016

Hi @Dmitry1987 that definitely should not be the case. Can you share a gist with your configuration and some log messages from when this happens?

@Dmitry1987

Hi @slackpad, I think I figured this out by adding some parameters that say "do not gracefully leave in any case, hard or soft kill". I don't remember exactly, but the problematic configuration BEFORE I found that solution was:

{
  "acl_datacenter": "dc-main",
  "acl_default_policy": "deny",
  "advertise_addr": "22.22.22.22",
  "advertise_addr_wan": "11.11.11.11",
  "bootstrap_expect": 3,
  "ca_file": "/zzzzz",
  "cert_file": "/zzzzzz",
  "data_dir": "/var/consul",
  "datacenter": "dc-new",
  "disable_remote_exec": true,
  "domain": "my.domain.consul",
  "enable_syslog": true,
  "encrypt": "zzzzzzzzzz",
  "key_file": "/zzzzzz",
  "log_level": "INFO",
  "server": true,
  "start_join": [
    "x.compute.amazonaws.com",
    "y.compute.amazonaws.com",
    "z.compute.amazonaws.com"
  ],
  "start_join_wan": [
    "zzz.compute.amazonaws.com",
    "yyz.compute.amazonaws.com"
  ],
  "syslog_facility": "LOCAL5",
  "ui": false,
  "verify_incoming": true,
  "verify_outgoing": true
}

@Dmitry1987

I don't remember exactly which two directives they were, but it was two settings that tell the server not to delete itself from the raft peers file, no matter which kill signal it receives (or "consul leave"). After that it was working well.
But we're no longer using Consul after that POC, too busy with other things... maybe we'll finally implement it one day :) .
Thanks!

@slackpad
Contributor

slackpad commented May 2, 2017

I'm going to close this out. Here's what we've done related to this:

  • Changed the default for leave_on_terminate and skip_leave_on_interrupt for servers to not make them leave when shut down, which is safer by default.
  • Removed the peer store, so there's not the confusing null.
  • Improved the outage documentation with more details and a peers.json example.
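For anyone landing here later: the peers.json recovery format described in the outage guide depends on the raft protocol version. With raft protocol 2 it remains an array of "ip:port" strings; with raft protocol 3 (introduced in the 0.8 line) each entry is an object carrying the server's node ID. A sketch with placeholder ID and address values:

```json
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  }
]
```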

@slackpad slackpad closed this as completed May 2, 2017
@theonewolf

@slackpad could you reference the most recent release with these features?

Thanks!

@slackpad
Contributor

slackpad commented May 2, 2017

@theonewolf 0.8.1 has them all :-) They came in 0.7.0, though, so 0.7.5 is fine too if you are on that series.

@theonewolf

@slackpad awesome. We are 0.7.5.
