Consul autopilot does not leave nodes correctly (version 1.0.0) #3611

Closed
dpgaspar opened this issue Oct 24, 2017 · 12 comments
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner
type/bug Feature does not function as expected
Milestone
1.0.1

Comments

@dpgaspar

Hi,

We are testing Consul 1.0.0, but CloudFormation rolling updates leave the cluster without a leader (no quorum).

The same behaviour does not happen when using 0.9.3 with the exact same recipe (and config).

consul version for both Client and Server

Client: 1.0.0
Server: 1.0.0

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 51ea240
	version = 1.0.0
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.0.0.157:8300
	server = true
raft:
	applied_index = 131
	commit_index = 131
	fsm_pending = 0
	last_contact = 17.683093ms
	last_log_index = 131
	last_log_term = 8
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:06549143-59f3-677d-8f5f-456d77768fe1 Address:10.0.0.157:8300} {Suffrage:Voter ID:6a8a7e2f-bee1-e0eb-d277-dd94775ff099 Address:10.0.0.194:8300} {Suffrage:Voter ID:59bb00ee-f229-685d-1100-7166a7fea60b Address:10.0.0.131:8300}]
	latest_configuration_index = 25
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 8
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 91
	max_procs = 2
	os = linux
	version = go1.9.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 4
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 11
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 9
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

OS: AMZN Linux
3 Nodes on autoscaling group

Description of the Issue (and unexpected/desired result)

After performing a CloudFormation rolling update (which replaces each instance one by one), the cluster loses quorum.

Some Consul members (server agents) that are shut down stay in the failed state, and the Raft quorum size increases as the rolling update proceeds, until quorum is no longer possible.

This behaviour does not happen on 0.9.3, using the exact same steps.

Reproduction steps

Shut down and replace each node of a 3-node cluster.

Server config

{
  "bootstrap_expect": 3,
  "server": true,
  "leave_on_terminate": true,
  "datacenter": "us-west-2",
  "acl_datacenter": "us-west-2",
  "acl_master_token": "XXXXX",
  "acl_agent_token": "XXXXX",
  "acl_default_policy": "deny",
  "acl_down_policy": "extend-cache",
  "disable_remote_exec": false,
  "data_dir": "/opt/consul/data",
  "encrypt": "XXXXX",
  "log_level": "INFO",
  "enable_syslog": true,
  "ui": true,
  "retry_join": ["provider=aws tag_key=Name tag_value=dev-consul-cluster"],
  "telemetry": {
    "dogstatsd_addr": "127.0.0.1:8125"
  },
  "performance": {
    "raft_multiplier": 1
  },
  "ports": {
    "http": 8500,
    "dns": 8600,
    "serf_lan": 8301,
    "serf_wan": 8302,
    "server": 8300
  }
}

Log Fragments or Link to gist

@slackpad
Contributor

slackpad commented Oct 24, 2017

Hi @dpgaspar Consul 1.0.0 defaults to Raft protocol 3 which introduces an extra stabilization period before servers are promoted to voters. How does your rolling update determine when it's safe to roll another node - does it use feedback from Consul or an open loop timeout? The Autopilot Read Health endpoint can tell you when it's safe to proceed when FailureTolerance > 0.
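
For reference, a minimal close-the-loop sketch in Ruby (the same tooling used later in this thread), assuming the local agent's HTTP API at 127.0.0.1:8500 and an ACL token passed via the X-Consul-Token header; it simply polls the Autopilot health endpoint until the cluster reports Healthy with FailureTolerance > 0. The host, port, and timings are illustrative assumptions, not part of the original report.

require 'net/http'
require 'json'
require 'uri'

# Illustrative only: poll the Autopilot Read Health endpoint until the
# cluster can tolerate the loss of one server.
def wait_for_failure_tolerance(token, timeout: 300, interval: 5)
  uri = URI('http://127.0.0.1:8500/v1/operator/autopilot/health')
  deadline = Time.now + timeout
  while Time.now < deadline
    begin
      req = Net::HTTP::Get.new(uri)
      req['X-Consul-Token'] = token
      res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
      if res.is_a?(Net::HTTPSuccess)
        health = JSON.parse(res.body)
        return true if health['Healthy'] && health['FailureTolerance'] > 0
      end
    rescue StandardError
      # agent not reachable yet; retry below
    end
    sleep interval
  end
  false
end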

@dpgaspar
Author

Hi @slackpad,

Thank you for the quick reply!

Our rolling update kills the old server and launches a new one; the old server is terminated when the provisioning script finishes, which is after the Consul service starts.

But the issue seems to be that the "old", shut-down instance's state is not set to "leave" but remains "failed", and it is still listed by 'consul operator raft list-peers'. This behaviour is not observed on 0.9.3.

Do you think that we should only send a SUCCESS signal when the REST endpoint:
/operator/autopilot/health returns FailureTolerance > 0 ?

Note that the rolling procedure takes some time. I will run another test and gather some logs to help us out.

@slackpad
Contributor

Do you think that we should only send a SUCCESS signal when the REST endpoint:
/operator/autopilot/health returns FailureTolerance > 0 ?

That's a good way to close the loop to know it's safe to roll another server. You can also look at the Healthy boolean to make sure the other Autopilot checks are passing as well.

With 0.9.3 it was listed as failed too, but the new server would get added as a voter right away so autopilot would clean it up really quickly. With 1.0 while it's waiting to stabilize (which usually takes ~10 seconds, but you can tune it with https://www.consul.io/docs/guides/autopilot.html#configuration) the old server will show as failed because autopilot is waiting for the new server to become usable.
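
For example, the stabilization window is set in the server config's autopilot stanza (the values below are illustrative; see the linked guide for the full set of options):

"autopilot": {
  "cleanup_dead_servers": true,
  "server_stabilization_time": "10s"
}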

@dpgaspar
Author

dpgaspar commented Oct 24, 2017

After the first new node starts and finishes provisioning, and shutdown of the old node is executed:

[root@dev-consul-cluster-i-0794b2d1d613ddbff ~]# consul members -token=XXXX
Node                                    Address             Status  Type    Build  Protocol  DC         Segment
dev-consul-cluster-i-02eefc37322c86523  10.0.0.131:8301  alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0794b2d1d613ddbff  10.0.0.75:8301   alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0d49cf947afc526c9  10.0.0.194:8301  failed  server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0fc1c5b8b586f9831  10.0.0.157:8301  alive   server  1.0.0  2         us-west-2  <all>

At the end:

[root@dev-consul-cluster-i-0794b2d1d613ddbff ~]# consul members -token=XXXX

Node                                    Address             Status  Type    Build  Protocol  DC         Segment
dev-consul-cluster-i-0200d484fdf17380d  10.0.0.212:8301  alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-02eefc37322c86523  10.0.0.131:8301  failed  server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0794b2d1d613ddbff  10.0.0.75:8301   alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0be67304415d08fe1  10.0.0.151:8301  alive   server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0d49cf947afc526c9  10.0.0.194:8301  failed  server  1.0.0  2         us-west-2  <all>
dev-consul-cluster-i-0fc1c5b8b586f9831  10.0.0.157:8301  failed  server  1.0.0  2         us-west-2  <all>

consul operator raft list-peers -token=XXXX
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

Logs from CF:

18:29:30 UTC+0100 | UPDATE_COMPLETE | AWS::AutoScaling::AutoScalingGroup | ServerGroup |
18:29:27 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Successfully terminated instance(s) [i-02eefc37322c86523] (Progress 100%).
18:29:25 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Terminating instance(s) [i-02eefc37322c86523]; replacing with 0 new instance(s).
18:29:24 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Received SUCCESS signal with UniqueId i-0200d484fdf17380d
18:26:40 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT15M.
18:26:40 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Successfully terminated instance(s) [i-0fc1c5b8b586f9831] (Progress 67%).
18:25:47 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Terminating instance(s) [i-0fc1c5b8b586f9831]; replacing with 1 new instance(s).
18:25:46 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Received SUCCESS signal with UniqueId i-0be67304415d08fe1
18:23:02 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT15M.
18:23:02 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Successfully terminated instance(s) [i-0d49cf947afc526c9] (Progress 33%).
18:21:44 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Terminating instance(s) [i-0d49cf947afc526c9]; replacing with 1 new instance(s).
18:21:43 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Received SUCCESS signal with UniqueId i-0794b2d1d613ddbff
18:19:58 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT15M.
18:19:05 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Temporarily setting autoscaling group MinSize and DesiredCapacity to 4.
18:19:04 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Rolling update initiated. Terminating 3 obsolete instance(s) in batches of 1, while keeping at least 3 instance(s) in service. Waiting on resource signals with a timeout of PT15M when new instances are added to the autoscaling group.

@dpgaspar
Author

dpgaspar commented Oct 24, 2017

"That's a good way to close the loop to know it's safe to roll another server. You can also look at the Healthy boolean to make sure the other Autopilot checks are passing as well."

Will do, but it seems odd since, as you can see from the CF logs, provisioning an instance takes 2 to 3 minutes, way more than 10s.

On 0.9.3, the instances that are terminated move to the "leave" state immediately (or almost immediately).

Are you still convinced the FailureTolerance > 0 wait loop will solve the issue?

@slackpad
Contributor

I think it's likely that it will. It's this interval:

18:21:44 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Terminating instance(s) [i-0d49cf947afc526c9]; replacing with 1 new instance(s).
18:21:43 UTC+0100 | UPDATE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | ServerGroup | Received SUCCESS signal with UniqueId i-0794b2d1d613ddbff

In that case it got a SUCCESS signal and terminated the old server ~1 second later, even though that new server wasn't going to be a useful part of the quorum for ~10 seconds or more. During that window when the old server is terminated you are down 1, so you'll go into an outage if one other server is terminated before that process completes.

You are right that the long time to spin up the next instance should kind of pave over that window, so it's possible that things are taking longer than expected to converge and promote the new server to a voter. Adding that check should make it safe, and it would be interesting to see how much additional time it's waiting; there might be some things we could look at there.

@dpgaspar
Author

Hi @slackpad,

I've changed the 'recipe' and made the following class to call the API: https://gist.github.com/dpgaspar/9143124a58508bfa7e9675e276030a82

The final step in the recipe is:


ruby_block 'wait_for_cluster' do
    block do
        require_relative "../libraries/consul"
        consul = Consul.new(node, node['mc-consul']['acl_master_token'])
        unless consul.wait_for_failure_tolerance
            raise 'MC FailureTolerance wait has timeout'
        end
    end
    action :run
end

Got the same results: when the third instance replacement starts, there is no cluster leader.

@slackpad
Contributor

Hi @dpgaspar sorry that didn't fix things. I just realized two things about the Autopilot health endpoint. First, it won't be available until you are on Raft protocol 3, so it won't necessarily help for your first transition onto 1.0 (after that it will be available). Second, when you go from 3 to 4 servers, the failure tolerance will already be at 1, so it won't actually delay. I think the right "safe to roll another server" check is to look at that response and make sure you have the expected number of servers (4), marked healthy, and that they are all voters.

Trying this locally, though, I think I see a problem if a 1.0 server steps up and becomes leader during the roll. It looks like there's an issue where it won't properly remove the failed servers on the old Raft protocol version, so I'll mark this as a bug and look at that.

If you need an immediate workaround, the easiest thing is probably to configure your Consul 1.0 servers with raft_protocol set to 2, which should avoid this issue.
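
In the server config shown earlier, that workaround would be a single additional key (placement illustrative):

"raft_protocol": 2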

@slackpad slackpad added type/bug Feature does not function as expected theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner labels Oct 25, 2017
@slackpad slackpad added this to the 1.0.1 milestone Oct 25, 2017
@slackpad
Contributor

I think the right "safe to roll another server" check is to look at that response and make sure you have the expected number of servers (4), marked healthy, and that they are all voters.

Seems like we should add a HealthyVoters sum at the top to make this easy to consume.
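
Until such a field exists, a client-side sketch of that stricter check in Ruby, assuming the same health endpoint and the Healthy/Voter fields on each entry in its Servers list (host, port, and the helper name are illustrative assumptions):

require 'net/http'
require 'json'
require 'uri'

# Illustrative only: count servers reported by Autopilot that are both
# healthy and voters; it should be safe to roll another server once this
# reaches the expected cluster size (4 during a one-at-a-time roll of 3).
def healthy_voter_count(token)
  uri = URI('http://127.0.0.1:8500/v1/operator/autopilot/health')
  req = Net::HTTP::Get.new(uri)
  req['X-Consul-Token'] = token
  res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
  return 0 unless res.is_a?(Net::HTTPSuccess)
  servers = JSON.parse(res.body).fetch('Servers', [])
  servers.count { |s| s['Healthy'] && s['Voter'] }
end

# e.g. proceed with the rolling update only when:
#   healthy_voter_count(token) >= 4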

@dpgaspar
Author

Hi @slackpad,

Thank you for the quick reply, great support!

Just a small comment about this:

"...it won't be available until you are on Raft protocol 3..."
I'm not upgrading servers; I'm deploying a new Consul 1.0.0 cluster and then, after the deploy, forcing a rolling update.
So I guess I'm always at Raft protocol 3.

"Trying this locally, though, I think I see a problem if a 1.0 server steps up and becomes leader during the roll. It looks like there's an issue where it won't properly remove the failed servers on the old Raft protocol version, so I'll mark this as a bug and look at that."

Great!! Thanks.

@slackpad
Contributor

I'm not upgrading servers; I'm deploying a new Consul 1.0.0 cluster and then, after the deploy, forcing a rolling update.

Thanks for clarifying that. It narrows this down to a possible Autopilot bug with 1.0, which I think would manifest the same way I repro'd it. Will take a look.

slackpad added a commit that referenced this issue Oct 26, 2017
When we defaulted the Raft protocol version to 3 in #3477 we made
the numPeers() routine more strict to only count voters (this is
more conservative and more correct). This had the side effect of
breaking rolling updates because it's at odds with the Autopilot
non-voter promotion logic.

That logic used to wait to only promote to maintain an odd quorum
of servers. During a rolling update (add one new server, wait, and
then kill an old server) the dead server cleanup would still count
the old server as a peer, which is conservative and the right thing
to do, and no longer count the non-voter. This would wait to promote,
so you could get into a stalemate. It is safer to promote early than
remove early, so by promoting as soon as possible we have chosen
that as the solution here.

Fixes #3611
@slackpad
Contributor

Figured out what was going on here and teed up a fix in #3623. @kyhavlov and @preetapan can you please take a look?

slackpad added a commit that referenced this issue Oct 27, 2017
When we defaulted the Raft protocol version to 3 in #3477 we made
the numPeers() routine more strict to only count voters (this is
more conservative and more correct). This had the side effect of
breaking rolling updates because it's at odds with the Autopilot
non-voter promotion logic.

That logic used to wait to only promote to maintain an odd quorum
of servers. During a rolling update (add one new server, wait, and
then kill an old server) the dead server cleanup would still count
the old server as a peer, which is conservative and the right thing
to do, and no longer count the non-voter. This would wait to promote,
so you could get into a stalemate. It is safer to promote early than
remove early, so by promoting as soon as possible we have chosen
that as the solution here.

Fixes #3611
slackpad added a commit that referenced this issue Oct 31, 2017
* Relaxes Autopilot promotion logic.

When we defaulted the Raft protocol version to 3 in #3477 we made
the numPeers() routine more strict to only count voters (this is
more conservative and more correct). This had the side effect of
breaking rolling updates because it's at odds with the Autopilot
non-voter promotion logic.

That logic used to wait to only promote to maintain an odd quorum
of servers. During a rolling update (add one new server, wait, and
then kill an old server) the dead server cleanup would still count
the old server as a peer, which is conservative and the right thing
to do, and no longer count the non-voter. This would wait to promote,
so you could get into a stalemate. It is safer to promote early than
remove early, so by promoting as soon as possible we have chosen
that as the solution here.

Fixes #3611

* Gets rid of unnecessary extra not-a-voter check.