Service checks revert to failed state from server #818
Comments
I just got back from vacation and thought I would check in with an update on what I am seeing. I restarted all the nodes before I left and everything was in good order, but today I found it back in the same state, with the service check failed even though redis is running fine. Here is the output from consul monitor on one of the three servers (testconsul02). It has testconsul01/03 in its member list, but they are unable to elect a leader. Should I just clear all the nodes' data directories? Any thoughts? Am I missing something simple? Thanks! testconsul02 server output
Hey @alpduhuez, sorry for the delay. I'm not sure exactly what's going on here. Deleting the data directories could help us at least understand if this could be some corruption issue; however, it's a fairly destructive action and should only be done if the cluster state and data are unimportant. Were you able to get this sorted out? There are some further improvements to Consul and Raft in the master branch, which should be released soon as 0.5.1, so it might also be worthwhile to give that a try if you can.
@ryanuber thank you for looking. Deleting the data is in fact what I did to get back to square one. I eventually found that the state was being caused by Chef runs: every time the nodes converged, the consul recipe scheduled a restart, and over time the restarts on all three servers would eventually line up and cause a down state. If I did the manual recovery steps, it would come back. As far as the health checks go, it was also a Chef recipe bug that would sometimes hork up the service check. I've fixed our recipes so they only install Consul and don't modify it, and fixed those bugs as well. Since then, the cluster has been happy.
I am seeing a case where a script service check will run, get stuck in the error state, and not go back to healthy. If I run the check directly it is fine, and I can start and stop the service and watch the check script correctly report the right value. But each time it is run by consul, it reports an error. The only way to fix it is to restart consul on that agent; the check will then pass again for a little while, but it starts failing again.
I have 3 servers and just 6 agents at the moment, running Consul 0.5.
I was poking around in the data.mdb file on one of the servers and I can see the old error state in there, listed as the reason the service is failed. That old error is the one reported in the UI as the reason the service is failing.
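For context, a minimal sketch of what a Consul 0.5-era service definition with a script check looks like; the service name, port, script path, interval, and config directory below are illustrative assumptions, not the actual values from my configs (which are attached further down):

```sh
# Sketch only: the service name, port, script path, interval, and config
# directory are assumptions for illustration, not the real configs below.
cat > /etc/consul.d/redis.json <<'EOF'
{
  "service": {
    "name": "redis",
    "port": 6379,
    "check": {
      "script": "/usr/local/bin/check_redis.sh",
      "interval": "10s"
    }
  }
}
EOF
```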
Observed:
The service check returns 0 on the agent, but after a crash, a failed state cached on the servers overwrites the successful check on the agent. The agent has to be restarted to correct the service check.
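One way to see the discrepancy is to compare the agent's local view of the check with what the servers report. A sketch, assuming the default HTTP API port 8500 and a service named redis:

```sh
# Local agent view: reflects the most recent script run on this node.
curl -s http://localhost:8500/v1/agent/checks

# Server-side view: what the catalog and UI report for the service.
# ("redis" is an assumed service name here.)
curl -s http://localhost:8500/v1/health/checks/redis
```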
Agent Config
Service Config
Server Config
Agent Monitor Log
You can see that the check was failing. From another SSH session, all I did was restart Consul, and the local check took precedence.
Crashing Agent Monitor Log
You can see in this log that there appears to be a crash. After that, the service check is in error. This seems to be a case where the stale server data takes precedence; only restarting Consul on the agent will correct it.
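For reference, the only recovery that works here is bouncing the agent and letting the local check state re-sync to the servers. A sketch; the service-manager command is an assumption and depends on how Consul is installed:

```sh
# Restart the Consul agent on the affected node (the init-script name
# is an assumption; adjust for your service manager).
sudo service consul restart

# Confirm the local check state after the restart.
curl -s http://localhost:8500/v1/agent/checks
```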