Consul autopilot does not leave nodes correctly (version 1.0.0) #3611

Closed
dpgaspar opened this issue Oct 24, 2017 · 12 comments
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner
type/bug Feature does not function as expected
Milestone
1.0.1

Comments

@dpgaspar

Hi,

We are testing Consul 1.0.0, but CloudFormation rolling updates leave the cluster without a leader (no quorum).

The same behaviour does not happen when using 0.9.3 with the exact same recipe (and config).

consul version for both Client and Server

Client: 1.0.0
Server: 1.0.0

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 51ea240
	version = 1.0.0
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.0.0.157:8300
	server = true
raft:
	applied_index = 131
	commit_index = 131
	fsm_pending = 0
	last_contact = 17.683093ms
	last_log_index = 131
	last_log_term = 8
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:06549143-59f3-677d-8f5f-456d77768fe1 Address:10.0.0.157:8300} {Suffrage:Voter ID:6a8a7e2f-bee1-e0eb-d277-dd94775ff099 Address:10.0.0.194:8300} {Suffrage:Voter ID:59bb00ee-f229-685d-1100-7166a7fea60b Address:10.0.0.131:8300}]
	latest_configuration_index = 25
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 8
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 91
	max_procs = 2
	os = linux
	version = go1.9.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 4
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 11
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 9
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

OS: AMZN Linux
3 Nodes on autoscaling group

Description of the Issue (and unexpected/desired result)

After performing a CloudFormation rolling update (which replaces each instance one by one), the cluster loses quorum.

Some Consul members (server agents) that are shut down stay in the failed state, and the Raft quorum size increases as the rolling update proceeds, until quorum is no longer possible.

This behaviour does not happen on 0.9.3, using the exact same steps.

Reproduction steps

Shut down and replace each node of a 3-node cluster.

Server config

{
  "bootstrap_expect": 3,
  "server": true,
  "leave_on_terminate": true,
  "datacenter": "us-west-2",
  "acl_datacenter": "us-west-2",
  "acl_master_token": "XXXXX",
  "acl_agent_token": "XXXXX",
  "acl_default_policy": "deny",
  "acl_down_policy": "extend-cache",
  "disable_remote_exec": false,
  "data_dir": "/opt/consul/data",
  "encrypt": "XXXXX",
  "log_level": "INFO",
  "enable_syslog": true,
  "ui": true,
  "retry_join": ["provider=aws tag_key=Name tag_value=dev-consul-cluster"],
  "telemetry": {
    "dogstatsd_addr": "127.0.0.1:8125"
  },
  "performance": {
    "raft_multiplier": 1
  },
  "ports": {
    "http": 8500,
    "dns": 8600,
    "serf_lan": 8301,
    "serf_wan": 8302,
    "server": 8300
  }
}

Log Fragments or Link to gist

@slackpad
Contributor

slackpad commented Oct 24, 2017

Hi @dpgaspar Consul 1.0.0 defaults to Raft protocol 3 which introduces an extra stabilization period before servers are promoted to voters. How does your rolling update determine when it's safe to roll another node - does it use feedback from Consul or an open loop timeout? The Autopilot Read Health endpoint can tell you when it's safe to proceed when FailureTolerance > 0.
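
For reference, a minimal close-the-loop sketch in Ruby (the same tooling used later in this thread), assuming the local agent's HTTP API at 127.0.0.1:8500 and an ACL token passed via the X-Consul-Token header; it simply polls the Autopilot health endpoint until the cluster reports Healthy with FailureTolerance > 0. The host, port, and timings are illustrative assumptions, not part of the original report.

require 'net/http'
require 'json'
require 'uri'

# Illustrative only: poll the Autopilot Read Health endpoint until the
# cluster can tolerate the loss of one server.
def wait_for_failure_tolerance(token, timeout: 300, interval: 5)
  uri = URI('http://127.0.0.1:8500/v1/operator/autopilot/health')
  deadline = Time.now + timeout
  while Time.now < deadline
    begin
      req = Net::HTTP::Get.new(uri)
      req['X-Consul-Token'] = token
      res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
      if res.is_a?(Net::HTTPSuccess)
        health = JSON.parse(res.body)
        return true if health['Healthy'] && health['FailureTolerance'] > 0
      end
    rescue StandardError
      # agent not reachable yet; retry below
    end
    sleep interval
  end
  false
end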

@dpgaspar
Author

Hi @slackpad,

Thank you for the quick reply!

Our rolling update kills the old server and launches a new one; the old server is terminated when the provisioning script finishes, which is after the Consul service starts.

But the issue seems to be that the "old", shut-down instance's state is not set to "leave" but remains "failed", and it is still listed by 'consul operator raft list-peers'. This behaviour is not observed on 0.9.3.

Do you think that we should only send a SUCCESS signal when the REST endpoint:
/operator/autopilot/health returns FailureTolerance > 0 ?

Note that the rolling procedure takes some time. I will run another test and gather some logs to help us out.

@slackpad
Contributor

Do you think that we should only send a SUCCESS signal when the REST endpoint:
/operator/autopilot/health returns FailureTolerance > 0 ?

That's a good way to close the loop to know it's safe to roll another server. You can also look at the Healthy boolean to make sure the other Autopilot checks are passing as well.

With 0.9.3 it was listed as failed too, but the new server would get added as a voter right away so autopilot would clean it up really quickly. With 1.0 while it's waiting to stabilize (which usually takes ~10 seconds, but you can tune it with https://www.consul.io/docs/guides/autopilot.html#configuration) the old server will show as failed because autopilot is waiting for the new server to become usable.
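
For example, the stabilization window is set in the server config's autopilot stanza (the values below are illustrative; see the linked guide for the full set of options):

"autopilot": {
  "cleanup_dead_servers": true,
  "server_stabilization_time": "10s"
}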

@dpgaspar
Author

dpgaspar commented Oct 24, 2017

After the first new node starts and finishes provisioning, and shutdown of the old node is executed:

[root@dev-consul-cluster-i-0794b2d1d613ddbff ~]# consul members -token=XXXX
Node                                    Address             Status  Type    Build  Protocol  DC         Segment
dev-consul-cluster-i-02eefc37322c86523  10.0.0.131:8301  alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0794b2d1d613ddbff  10.0.0.75:8301   alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0d49cf947afc526c9  10.0.0.194:8301  failed  server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0fc1c5b8b586f9831  10.0.0.157:8301  alive   server  1.0.0  2         us-west-2  <all>

At the end:

[root@dev-consul-cluster-i-0794b2d1d613ddbff ~]# consul members -token=XXXX

Node                                    Address             Status  Type    Build  Protocol  DC         Segment
dev-consul-cluster-i-0200d484fdf17380d  10.0.0.212:8301  alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-02eefc37322c86523  10.0.0.131:8301  failed  server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0794b2d1d613ddbff  10.0.0.75:8301   alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0be67304415d08fe1  10.0.0.151:8301  alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0d49cf947afc526c9  10.0.0.194:8301  failed  server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0fc1c5b8b586f9831  10.0.0.157:8301  failed  server  1.0.0  2         us-west-2  <all>

consul operator raft list-peers -token=XXXX
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

Logs from CF:

18:29:30 UTC+0100 | UPDATE_COMPLETE | AWS::AutoScaling::AutoScalingGroup | ServerGroup |
18:29:27 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Successfully terminated instance(s) [i-02eefc37322c86523] (Progress 100%).
18:29:25 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Terminating instance(s) [i-02eefc37322c86523]; replacing with 0 new instance(s).
18:29:24 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Received SUCCESS signal with UniqueId i-0200d484fdf17380d
18:26:40 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT15M.
18:26:40 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Successfully terminated instance(s) [i-0fc1c5b8b586f9831] (Progress 67%).
18:25:47 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Terminating instance(s) [i-0fc1c5b8b586f9831]; replacing with 1 new instance(s).
18:25:46 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Received SUCCESS signal with UniqueId i-0be67304415d08fe1
18:23:02 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT15M.
18:23:02 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Successfully terminated instance(s) [i-0d49cf947afc526c9] (Progress 33%).
18:21:44 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Terminating instance(s) [i-0d49cf947afc526c9]; replacing with 1 new instance(s).
18:21:43 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Received SUCCESS signal with UniqueId i-0794b2d1d613ddbff
18:19:58 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT15M.
18:19:05 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Temporarily setting autoscaling group MinSize and DesiredCapacity to 4.
18:19:04 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Rolling update initiated. Terminating 3 obsolete instance(s) in batches of 1, while keeping at least 3 instance(s) in service. Waiting on resource signals with a timeout of PT15M when new instances are added to the autoscaling group.

@dpgaspar
Author

dpgaspar commented Oct 24, 2017

"That's a good way to close the loop to know it's safe to roll another server. You can also look at the Healthy boolean to make sure the other Autopilot checks are passing as well."

Will do, but it seems odd since, as you can see from the CF logs, provisioning an instance takes 2 to 3 minutes, way more than 10s.

On 0.9.3, the instances that are terminated move to the "leave" state immediately (or almost immediately).

Are you still convinced the FailureTolerance > 0 wait loop will solve the issue?

@slackpad
Contributor

I think it's likely that it will. It's this interval:

18:21:44 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Terminating instance(s) [i-0d49cf947afc526c9]; replacing with 1 new instance(s).
18:21:43 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Received SUCCESS signal with UniqueId i-0794b2d1d613ddbff

In that case it got a SUCCESS signal and terminated the old server ~1 second later, even though that new server wasn't going to be a useful part of the quorum for ~10 seconds or more. During that window when the old server is terminated you are down 1, so you'll go into an outage if one other server is terminated before that process completes.

You are right that the long time to spin up the next instance should kind of pave over that window, so it's possible that things are taking longer than expected to converge and promote the new server to a voter. Adding that check should make it safe, and it would be interesting to see how much additional time it's waiting; there might be some things we could look at there.

@dpgaspar
Author

Hi @slackpad,

I've changed the 'recipe' and made the following class to call the API: https://gist.github.com/dpgaspar/9143124a58508bfa7e9675e276030a82

The final step in the recipe is:


ruby_block 'wait_for_cluster' do
    block do
        require_relative "../libraries/consul"
        consul = Consul.new(node, node['mc-consul']['acl_master_token'])
        unless consul.wait_for_failure_tolerance
            raise 'MC FailureTolerance wait has timeout'
        end
    end
    action :run
end

Got the same results: when the third instance replacement starts, there is no cluster leader.

@slackpad
Contributor

Hi @dpgaspar sorry that didn't fix things. I just realized two things about the Autopilot health endpoint. First, it won't be available until you are on Raft protocol 3, so it won't necessarily help for your first transition onto 1.0 (after that it will be available). Second, when you go from 3 to 4 servers, the failure tolerance will already be at 1, so it won't actually delay. I think the right "safe to roll another server" check is to look at that response and make sure you have the expected number of servers (4), marked healthy, and that they are all voters.

Trying this locally, though, I think I see a problem if a 1.0 server steps up and becomes leader during the roll. It looks like there's an issue where it won't properly remove the failed servers on the old Raft protocol version, so I'll mark this as a bug and look at that.

If you need an immediate workaround, the easiest thing is probably to configure your Consul 1.0 servers with raft_protocol set to 2, which should avoid this issue.
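
In the server config shown earlier, that workaround would be a single additional key (placement illustrative):

"raft_protocol": 2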

@slackpad slackpad added type/bug Feature does not function as expected theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner labels Oct 25, 2017
@slackpad slackpad added this to the 1.0.1 milestone Oct 25, 2017
@slackpad
Contributor

I think the right "safe to roll another server" check is to look at that response and make sure you have the expected number of servers (4), marked healthy, and that they are all voters.

Seems like we should add a HealthyVoters sum at the top to make this easy to consume.
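
Until such a field exists, a client-side sketch of that stricter check in Ruby, assuming the same health endpoint and the Healthy/Voter fields on each entry in its Servers list (host, port, and the helper name are illustrative assumptions):

require 'net/http'
require 'json'
require 'uri'

# Illustrative only: count servers reported by Autopilot that are both
# healthy and voters; it should be safe to roll another server once this
# reaches the expected cluster size (4 during a one-at-a-time roll of 3).
def healthy_voter_count(token)
  uri = URI('http://127.0.0.1:8500/v1/operator/autopilot/health')
  req = Net::HTTP::Get.new(uri)
  req['X-Consul-Token'] = token
  res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
  return 0 unless res.is_a?(Net::HTTPSuccess)
  servers = JSON.parse(res.body).fetch('Servers', [])
  servers.count { |s| s['Healthy'] && s['Voter'] }
end

# e.g. proceed with the rolling update only when:
#   healthy_voter_count(token) >= 4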

@dpgaspar
Author

Hi @slackpad,

Thank you for the quick reply, great support!

Just a small comment about this:

"...it won't be available until you are on Raft protocol 3..."
I'm not upgrading servers; I'm deploying a new Consul 1.0.0 cluster and then, after the deploy, forcing a rolling update.
So I guess I'm always at Raft protocol 3.

"Trying this locally, though, I think I see a problem if a 1.0 server steps up and becomes leader during the roll. It looks like there's an issue where it won't properly remove the failed servers on the old Raft protocol version, so I'll mark this as a bug and look at that."

Great!! Thanks.

@slackpad
Contributor

I'm not upgrading servers; I'm deploying a new Consul 1.0.0 cluster and then, after the deploy, forcing a rolling update.

Thanks for clarifying that. It narrows this down to a possible Autopilot bug with 1.0, which I think would manifest the same way I repro'd it. Will take a look.

slackpad added a commit that referenced this issue Oct 26, 2017
When we defaulted the Raft protocol version to 3 in #3477 we made
the numPeers() routine more strict to only count voters (this is
more conservative and more correct). This had the side effect of
breaking rolling updates because it's at odds with the Autopilot
non-voter promotion logic.

That logic used to wait to only promote to maintain an odd quorum
of servers. During a rolling update (add one new server, wait, and
then kill an old server) the dead server cleanup would still count
the old server as a peer, which is conservative and the right thing
to do, and no longer count the non-voter. This would wait to promote,
so you could get into a stalemate. It is safer to promote early than
remove early, so by promoting as soon as possible we have chosen
that as the solution here.

Fixes #3611
@slackpad
Contributor

Figured out what was going on here and teed up a fix in #3623. @kyhavlov and @preetapan can you please take a look?

slackpad added a commit that referenced this issue Oct 27, 2017
When we defaulted the Raft protocol version to 3 in #3477 we made
the numPeers() routine more strict to only count voters (this is
more conservative and more correct). This had the side effect of
breaking rolling updates because it's at odds with the Autopilot
non-voter promotion logic.

That logic used to wait to only promote to maintain an odd quorum
of servers. During a rolling update (add one new server, wait, and
then kill an old server) the dead server cleanup would still count
the old server as a peer, which is conservative and the right thing
to do, and no longer count the non-voter. This would wait to promote,
so you could get into a stalemate. It is safer to promote early than
remove early, so by promoting as soon as possible we have chosen
that as the solution here.

Fixes #3611
slackpad added a commit that referenced this issue Oct 31, 2017
* Relaxes Autopilot promotion logic.

When we defaulted the Raft protocol version to 3 in #3477 we made
the numPeers() routine more strict to only count voters (this is
more conservative and more correct). This had the side effect of
breaking rolling updates because it's at odds with the Autopilot
non-voter promotion logic.

That logic used to wait to only promote to maintain an odd quorum
of servers. During a rolling update (add one new server, wait, and
then kill an old server) the dead server cleanup would still count
the old server as a peer, which is conservative and the right thing
to do, and no longer count the non-voter. This would wait to promote,
so you could get into a stalemate. It is safer to promote early than
remove early, so by promoting as soon as possible we have chosen
that as the solution here.

Fixes #3611

* Gets rid of unnecessary extra not-a-voter check.