Agents can't find new server(s) #9235
Comments
I'm facing the same issue. A workaround hack is to write a systemd service that curls `v1/status/peers` and, if it returns no known Consul servers, restarts the Consul service and emits a metric. That's shady; I would prefer a native flag to restart on this error.
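The workaround described in that comment could be sketched roughly as follows. The `/v1/status/peers` endpoint is part of Consul's HTTP API; the script name, HTTP address, and restart command are assumptions for a typical systemd-managed agent:

```shell
#!/bin/sh
# consul-watchdog.sh -- sketch of the watchdog workaround described above.
# The HTTP address and the restart command are assumptions for a
# typical systemd-managed agent; adjust them to your setup.

CONSUL_HTTP_ADDR="${CONSUL_HTTP_ADDR:-http://127.0.0.1:8500}"

# Succeeds (returns 0) when the agent reports no known servers,
# or when the HTTP API is unreachable at all.
peers_missing() {
  peers="$(curl -sf "$CONSUL_HTTP_ADDR/v1/status/peers")" || return 0
  # An agent that has lost its servers answers with an empty JSON array.
  [ "$peers" = "[]" ] || [ -z "$peers" ]
}

# Call this periodically (e.g. from cron or a systemd timer); it
# restarts the agent only when the peer list is empty. A metric
# could be emitted here as well, as the comment suggests.
watchdog_tick() {
  if peers_missing; then
    systemctl restart consul
  fi
}
```

Whether restarting blindly like this is safe depends on the deployment; it papers over the reconnect bug rather than fixing it, which is why the commenter calls it shady.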

Overview of the Issue
I have a very similar issue to #6672 but the proposed solution doesn't work for me.
I run a single-node Consul server cluster inside k8s. If the pod is recreated, the clients outside of k8s show weird errors and cannot reconnect. If I restart the agents using `systemctl restart consul`, they connect fine. The solution proposed in #6672 is to add an option to the config. As far as I understand, this is the default for the server anyway, and adding it to the client didn't help.
I do understand that the whole server should not go down, but even in a k8s setup this can happen. Why can't the agents handle this? All that would need to happen is an `exit 1`, since systemd would start it again anyway. How can I achieve that? Or what's the proper solution? Btw., a 3-node Consul server setup inside k8s didn't work for me due to #7750, but as mentioned, a complete server outage could happen there too.
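The restart half of what is asked for above can be expressed as a systemd drop-in; a sketch, assuming the unit is named `consul.service`. Note that, as the report describes, the agent does not actually exit when it loses its servers, so this only takes effect once something makes the process exit with a non-zero status:

```ini
# /etc/systemd/system/consul.service.d/restart.conf
# Sketch of a drop-in (unit name consul.service is an assumption):
# if the agent ever exits non-zero, systemd brings it back up.
[Service]
Restart=on-failure
RestartSec=5
```

On its own this does not address the reconnect bug; it only covers the "systemd would start it again anyway" part.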
FWIW: I'm only using consul for its DNS capabilities.
Reproduction Steps
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
The server runs inside a Kubernetes cluster on DigitalOcean on the host network. It was installed using the Helm chart v0.25.0 with this PR applied. I'm using the Consul beta release, but the same error appeared with the stable release. The agent is configured using `retry_join` with cloud auto-joining configured.
Log Fragments
The client repeatedly logs this (log level `trace`) when the server pod is restarted:
Actually, when I tried yesterday (with log level `debug`), it repeatedly printed the following: