node failure, stale reads and monitoring #3285
Labels
theme/telemetry: Anything related to telemetry or observability
type/bug: Feature does not function as expected
If you have a question, please direct it to the consul mailing list if it hasn't been addressed in either the FAQ or in one of the Consul Guides.
When filing a bug, please include the following:
consul version for both Client and Server
Client: 0.8.3
Server: 0.8.3
consul info for both Client and Server
Client:
Server:
Operating system and Environment details
Ubuntu 16.04.2 LTS
Description of the Issue (and unexpected/desired result)
There are several potential issues, but the main ones are the following (more notes below in the log section):
1. After the leader election, node1 never caught up on raft log replication and kept logging "log not found" errors until it was rebuilt.
2. Nothing in the Consul UI or our monitoring surfaced that node1 was in a bad state.
3. Agents bound to node1 kept serving very stale data on ?stale= reads.
Log Fragments or Link to gist
Node1: previous master
https://gist.github.com/rhuddleston/7f45ced3de6117d1ee1a6ec146e43b7a
Node2:
https://gist.github.com/rhuddleston/72cff6c4b4e3cbe118d754400a866021
Node3: new master
https://gist.github.com/rhuddleston/425d29a92ae331fb03134ec1cb867ab9
"Failed to get log at index 242089339: log not found" this message continues on until node1 consul was wiped and restaged on 7-16.
Looking at the logs, node1 rejected the vote for the new master:
"Rejecting vote request from 10.0.5.232:8300 since our last index is greater." Yet when I checked later, node1 correctly knew that the master was node3. Node3 tried repeatedly to send logs to node1, but apparently it didn't have the correct logs. Perhaps this is related to node1 being further ahead than node3 at the time of the failover?
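For anyone looking into this, the raft peer set and the leader each agent believes in can be checked with two standard endpoints (both present in 0.8.x); a quick sketch:

curl -s 127.0.0.1:8500/v1/operator/raft/configuration   # the entry with "Leader": true should be node3
curl -s 127.0.0.1:8500/v1/status/leader                 # leader address as the local agent sees it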
Assuming I'm okay with a node not properly failing over and electing a new leader, the next issue is: what is the correct way to alert on this condition? Nothing in the Consul UI, consul-alerts, or consul-exporter (Prometheus) let us know that anything was wrong with node1, even though it was clearly in a bad state. If we had been alerted to the issue, we could have manually shut down that node and corrected it.
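In the meantime, an out-of-band check along these lines would have caught node1. This is only a rough sketch; it assumes SSH access to each server, 0.8.x's consul info output format, an illustrative server list, and an arbitrary 5000-entry threshold:

#!/bin/sh
# Collect raft.last_log_index from every server and flag any server that
# trails the highest value by a wide margin (node1 would trip this once
# the new leader's log grew past node1's stuck index).
SERVERS="node1 node2 node3"
INDEXES=$(for s in $SERVERS; do
    printf '%s %s\n' "$s" "$(ssh "$s" consul info | awk '/last_log_index/ {print $3}')"
done)
MAX=$(printf '%s\n' "$INDEXES" | awk '{print $2}' | sort -n | tail -1)
printf '%s\n' "$INDEXES" | while read -r name idx; do
    if [ $((MAX - idx)) -gt 5000 ]; then
        echo "CRITICAL: $name raft last_log_index is $((MAX - idx)) entries behind" >&2
    fi
done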
The next issue relates to how ?stale= works. Every node in the cluster agreed that node3 was the master, yet when querying some of the local consul agents, very stale data came back:
127.0.0.1:8500/v1/health/service/conversation-manager?stale=
which returned:
< Content-Type: application/json
< X-Consul-Index: 242089207
< X-Consul-Knownleader: true
< X-Consul-Lastcontact: 45
I noticed X-Consul-Index never changed, though the X-Consul-Lastcontact value did vary randomly between 38 and 146. Both values should have been much higher, since replication hadn't been happening on that node for over a day. It also seems like agents shouldn't bind to a node that is far behind or not replicating.
If I removed ?stale=, then all nodes returned the correct value.
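As a client-side workaround, the X-Consul-Index from a consistent read (always answered by the leader) can be compared against the one from a stale read (answered by whichever server the agent is bound to). This is a sketch of my own, not a built-in Consul feature, and the 1000-index threshold is arbitrary:

#!/bin/sh
# idx fetches the X-Consul-Index header for the given consistency mode.
idx() {
    curl -s -o /dev/null -D - "127.0.0.1:8500/v1/catalog/nodes?$1" \
        | tr -d '\r' | awk -F': ' 'tolower($1) == "x-consul-index" {print $2}'
}
LEADER_IDX=$(idx consistent=)
STALE_IDX=$(idx stale=)
if [ $((LEADER_IDX - STALE_IDX)) -gt 1000 ]; then
    echo "stale reads are $((LEADER_IDX - STALE_IDX)) indexes behind the leader" >&2
    exit 2
fi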
I can write these up as separate issues if wanted; I mainly wanted an expert to look at this failure scenario and give feedback.
Also note that node1, while it was not replicating, was slowly leaking memory and eventually exhausted the memory on the node. After restarting consul and then manually removing the consul data dir, we were able to recover the node.
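For reference, the recovery boiled down to roughly the following (the service name, data-dir path, and join address are illustrative assumptions, not our exact commands):

# Stop the stuck agent, move the data dir aside (keeping it for debugging),
# then restart with an empty data dir and rejoin through a healthy member.
systemctl stop consul
mv /var/lib/consul /var/lib/consul.bad
systemctl start consul
consul join 10.0.5.232   # address of any healthy cluster member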
So again, the main issues are the three listed at the top: the failed log replication after the election, the missing alerting signal, and the stale reads serving day-old data.