
node failure, stale reads and monitoring #3285

Open
rhuddleston opened this issue Jul 17, 2017 · 1 comment
Labels
theme/telemetry Anything related to telemetry or observability
type/bug Feature does not function as expected

Comments

@rhuddleston

If you have a question, please direct it to the consul mailing list if it hasn't been addressed in either the FAQ or in one of the Consul Guides.

When filing a bug, please include the following:

consul version for both Client and Server

Client: 0.8.3
Server: 0.8.3

consul info for both Client and Server

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 14
	services = 14
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 63
	max_procs = 2
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 8821
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 233
	members = 7
	query_queue = 0
	query_time = 1

Server:

agent:
	check_monitors = 0
	check_ttls = 2
	checks = 9
	services = 10
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.0.5.232:8300
	server = true
raft:
	applied_index = 244114834
	commit_index = 244114834
	fsm_pending = 0
	last_contact = 18.139613ms
	last_log_index = 244114834
	last_log_term = 13028
	last_snapshot_index = 244109217
	last_snapshot_term = 13028
	latest_configuration = [{Suffrage:Voter ID:10.0.5.232:8300 Address:10.0.5.232:8300} {Suffrage:Voter ID:10.0.5.75:8300 Address:10.0.5.75:8300} {Suffrage:Voter ID:10.0.5.132:8300 Address:10.0.5.132:8300}]
	latest_configuration_index = 160125876
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 13028
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 120
	max_procs = 1
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 8821
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 233
	members = 7
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 19
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

Ubuntu 16.04.2 LTS

Description of the Issue (and unexpected/desired result)

There are several potential issues, but the main ones are the following:

  1. Electing a new master didn't recover the previous master, which was left in a bad state.
  2. We query with ?stale= to allow reads from nodes that are not perfectly in sync, but in this case it returned values over a day old even though all nodes in the cluster agreed on who the master was. If I removed the ?stale= parameter, every node returned the correct values. Since everything queries a local agent, results really shouldn't come from a node that is not replicating.
  3. There doesn't seem to be anything to monitor on, so I can't be alerted that one of my nodes is in this state.

See more notes below in the log section:

Log Fragments or Link to gist

Node1: previous master

https://gist.github.com/rhuddleston/7f45ced3de6117d1ee1a6ec146e43b7a

Node2:

https://gist.github.com/rhuddleston/72cff6c4b4e3cbe118d754400a866021

Node3: new master

https://gist.github.com/rhuddleston/425d29a92ae331fb03134ec1cb867ab9

"Failed to get log at index 242089339: log not found" this message continues on until node1 consul was wiped and restaged on 7-16.

So from looking at the logs, node1 rejected the vote for the new master:

"Rejecting vote request from 10.0.5.232:8300 since our last index is greater". Though when I checked later, node1 correctly knew that the master was node3. Node3 repeatedly tried to send logs to node1, but apparently it didn't have the correct logs. Perhaps this is related to node1 being further ahead than node3 at the time of failover?

Assuming I'm okay with a node not properly failing over when a new leader is elected, the next issue is: what is the correct way to alert on this condition? It appears that nothing in the Consul UI, consul-alerts, or consul-exporter (Prometheus) let us know that anything was wrong with node1, even though it was clearly in a bad state. If we were alerted to the issue we could manually shut down that node and correct it.

The next issue is related to how ?stale= works. Every node in the cluster agreed that node3 was the master, yet asking some of the local consul agents returned very stale data:

127.0.0.1:8500/v1/health/service/conversation-manager?stale=

which returned:
< Content-Type: application/json
< X-Consul-Index: 242089207
< X-Consul-Knownleader: true
< X-Consul-Lastcontact: 45

I noticed that X-Consul-Index never changed, though the X-Consul-Lastcontact value did change, randomly, between 38 and 146. These values should have been much higher since replication hadn't been happening on that node for over a day. Also, it seems like if one node is way behind or not replicating, agents shouldn't bind to it.

If I removed ?stale= then all nodes would return the correct value.
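
For reference, here is a minimal sketch of the same comparison using the official Go client (github.com/hashicorp/consul/api), assuming a local agent on 127.0.0.1:8500; the service name is the one from this report, and QueryMeta.LastIndex / LastContact correspond to the X-Consul-Index / X-Consul-Lastcontact headers shown above:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Defaults to the local agent on 127.0.0.1:8500.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	health := client.Health()

	// Stale read: any server may answer, including one that has stopped replicating.
	_, staleMeta, err := health.Service("conversation-manager", "", false,
		&api.QueryOptions{AllowStale: true})
	if err != nil {
		log.Fatal(err)
	}

	// Default read: forwarded to the leader, so the index reflects the current state.
	_, defaultMeta, err := health.Service("conversation-manager", "", false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// LastIndex maps to X-Consul-Index, LastContact to X-Consul-Lastcontact.
	fmt.Printf("stale:   index=%d lastContact=%s\n", staleMeta.LastIndex, staleMeta.LastContact)
	fmt.Printf("default: index=%d lastContact=%s\n", defaultMeta.LastIndex, defaultMeta.LastContact)
}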

I can potentially write these up as separate issues if wanted; I just wanted an expert to look at this failure scenario and give their feedback.

Also note that node1, which was not replicating, was slowly leaking memory and eventually used up all the memory on the node. After restarting Consul and then manually removing the Consul data dir we were able to recover the node.

So again, the main issues:

  1. We need a way to monitor when a node is in a bad state so someone can manually correct it before it makes the problem worse.
  2. Should nodes whose replication is not running be allowed to serve queries from agents? It seems agents should not query a node whose replication is more than x seconds behind (see the client-side sketch after this list).
  3. As a bonus, why didn't failover work in this circumstance? But the main issues are the two above.
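
Until something like point 2 exists server-side, one possible client-side workaround (a sketch of my own, not a Consul feature) is to accept a stale read only when the reported last-contact is under a threshold, and otherwise fall back to a leader-forwarded read. Note this relies on the LastContact value being accurate, which is part of what this report questions; the maxStaleness threshold below is arbitrary:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

// maxStaleness is an arbitrary, illustrative threshold.
const maxStaleness = 5 * time.Second

// healthyInstances tries a cheap stale read first and only trusts it when the
// reported LastContact is under maxStaleness; otherwise it falls back to a
// leader-forwarded read.
func healthyInstances(client *api.Client, service string) ([]*api.ServiceEntry, error) {
	health := client.Health()

	entries, meta, err := health.Service(service, "", true,
		&api.QueryOptions{AllowStale: true})
	if err == nil && meta.LastContact <= maxStaleness {
		return entries, nil
	}

	// Stale answer failed or looks too old; ask the leader instead.
	entries, _, err = health.Service(service, "", true, nil)
	return entries, err
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	if _, err := healthyInstances(client, "conversation-manager"); err != nil {
		log.Fatal(err)
	}
}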
@slackpad slackpad added the type/bug Feature does not function as expected label Aug 29, 2017
@slackpad
Contributor

Hi @rhuddleston, the old leader that got stuck in a bad state was hit by #3230, which was fixed in Consul 0.9.0.

To detect situations like this, we added https://www.consul.io/api/operator/autopilot.html#read-health, which includes a check to make sure replication is working on each node (it's based on the LastIndex shown in the output), so that server would have shown Healthy as false. That rolls up via that API into an overall healthy status for the cluster that you can alert on, which is also available in telemetry as consul.autopilot.healthy. This feature does require you to configure the latest Raft protocol to enable it, so please see https://www.consul.io/docs/guides/autopilot.html for details.
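
For example, you can poll that endpoint with the Go client's Operator API; a minimal sketch, assuming the servers are configured with raft_protocol = 3 as described in that guide (field names are from github.com/hashicorp/consul/api):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Wraps GET /v1/operator/autopilot/health on the local agent.
	reply, err := client.Operator().AutopilotServerHealth(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Overall rollup; the same signal is exposed in telemetry as consul.autopilot.healthy.
	if !reply.Healthy {
		fmt.Println("ALERT: Consul cluster reports unhealthy")
	}

	// A server that has stopped replicating shows Healthy=false and a stale LastIndex.
	for _, s := range reply.Servers {
		fmt.Printf("%s healthy=%v leader=%v lastIndex=%d\n",
			s.Name, s.Healthy, s.Leader, s.LastIndex)
	}
}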

We currently don't have a mechanism for clients to stop using a server if it's not healthy, though that would be an interesting idea. Also, we should look at the X-Consul-Lastcontact header, since that is incorrectly making things seem fresher than they are, so I'll mark this as a bug for that.

Hope that helps!

@jsosulska jsosulska added the theme/telemetry Anything related to telemetry or observability label Jul 21, 2020