
Consul Node stuck at Leaving status #6882

Open
anshitabharti opened this issue Dec 4, 2019 · 8 comments
Labels
theme/internals: Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
theme/kubernetes: Consul-helm/kubernetes related questions
theme/operator-usability: Replaces UX. Anything related to making things easier for the practitioner

Comments

@anshitabharti

Overview of the issue:

  1. Sometimes one of the nodes' SerfStatus gets stuck in the leaving state. Even though the agent is initially started with retry-join, once it falls out of the cluster it is unable to join back. When the node drops out of the cluster, the container is still up and running. To recover, the container has to be restarted manually, which we want to avoid.

  2. Even if just one of the nodes is in Leaving status, the monitoring API v1/operator/autopilot/health responds with Healthy: false, even though all k/v operations can be executed without any issues. Because of Healthy: false, alerts kick in and create panic as if the cluster were actually unhealthy. What is the rationale behind considering the cluster unhealthy?

Consul version: 1.5.3, running inside Docker containers on OpenStack VMs.
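
For reference, a minimal sketch of how the two symptoms above can be observed from a shell on one of the agents (local address and jq usage are assumptions, not part of the original report):

# Symptom 1: a member stuck in the "leaving" state
consul members | grep -i leaving

# Symptom 2: autopilot reports the cluster as unhealthy
curl -s http://127.0.0.1:8500/v1/operator/autopilot/health | jq '.Healthy'

# ...while k/v operations still succeed
consul kv put test/probe ok && consul kv get test/probe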

@KalenWessel

I've been having the same issue with Consul 1.6.2 running on k8s. I can do a rolling redeploy via a statefulset update and sometimes one of the nodes will show SerfStatus leaving and autopilot shows unhealthy. Like you said, only after I delete the container manually does it come back up again as healthy. Did you ever figure out what the problem was?
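
For context, the manual workaround described here looks roughly like the following sketch (the pod name and label are placeholders, assuming the agents run as a StatefulSet on Kubernetes):

# Find the pod whose agent is stuck in "leaving"
kubectl get pods -l app=consul

# Deleting it lets the StatefulSet recreate the pod; the fresh agent rejoins as alive
kubectl delete pod consul-2

# Verify autopilot reports healthy again
curl -s http://127.0.0.1:8500/v1/operator/autopilot/health | jq '.Healthy'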

@jsosulska added the theme/internals, theme/kubernetes, and theme/operator-usability labels on May 4, 2020
@lwei-wish

We are having the same issue: one of the Consul members becomes leaving when the underlying node is terminated. I have to kill the pod manually to bring it back to alive.

@mssawant

I am facing a similar issue, where one of the Consul clients sees another as leaving while the latter is alive.

@Amier3
Contributor

Amier3 commented Mar 15, 2022

Hey @lwei-wish & @mssawant

May I ask which version(s) of Consul y'all are running? It would also be helpful to have any logs, if you have them.
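
A minimal sketch of how that information could be collected on the affected node (these are standard Consul CLI commands, but exact log locations will depend on the deployment):

consul version
consul members                     # note which member is stuck in "leaving"
consul monitor -log-level=debug    # stream agent logs while the issue reproduces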

@mssawant

Hi @Amier3, I am running version 1.9.1. Whenever I delete a pod running a Consul client agent, on restart it fails to resolve the node name to the new IP address, and all the other nodes see this restarted pod as failed.
We have included leave_on_terminate in the configuration, but the agent still does not seem to leave the cluster.

{
  "enable_local_script_checks": true,
  "leave_on_terminate": true,

Any help will be appreciated.
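
One possible stopgap, sketched below, is to force the stale entry out so the restarted pod can register under its new IP; the node name is a placeholder, and this is a workaround rather than a fix for the underlying behaviour:

# Remove the member that other agents still see as failed/leaving
consul force-leave consul-client-0

# Confirm the restarted pod shows up as alive again
consul members | grep consul-client-0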

@anshitabharti
Author

anshitabharti commented Mar 15, 2022

Hello!

It has been a while and I do not recollect exactly which part helped us solve the problem. I'm pasting the docker-compose and Consul config below in case that helps.

compose:

version: '2'
services:
  {{ workload }}:
    network_mode: host
    build:
      args:
        - UID={{ ansible_user_uid }}
        - GID={{ ansible_user_gid }}
      context: .
    image: "{{ image_tag }}"
    container_name: {{ container_name }}
    hostname: "{{ ansible_host }}"
    ports:
      - {{ consul_http_port }}:{{ consul_http_port }}
      - {{ consul_rpc_port }}:{{ consul_rpc_port }}
      - {{ consul_dns_port }}:{{ consul_dns_port }}
      - {{ consul_lan_serf_port }}:{{ consul_lan_serf_port }}/tcp
      - {{ consul_wan_serf_port }}:{{ consul_wan_serf_port }}/tcp
      - {{ consul_lan_serf_port }}:{{ consul_lan_serf_port }}/udp
      - {{ consul_wan_serf_port }}:{{ consul_wan_serf_port }}/udp
    command: sh -c "/bin/consul agent -ui -config-file={{ consul_config_file }} >> {{ consul_log_file_path }} 2>&1"
    volumes:
      - {{ consul_config_file_path }}:{{ consul_config_file }}
      - {{ workload_data_dir }}:{{ consul_data_dir }}
      - {{ workload_log_dir }}:{{ consul_log_file_dir }}
    environment:
      CONSUL_ACL_TOKEN: {{ acl_master_token }}
      CONSUL_BIND_INTERFACE: eth0
      CONSUL_CLIENT_INTERFACE: eth0

config:

{
  "datacenter": "{{ consul_datacenter }}",
  "bootstrap_expect": {{ bootstrap_expect }},
  "advertise_addr": "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}",
  "client_addr": "{{ client_addr }}",
  "server": {{ is_server }},
  "data_dir": "{{ consul_data_dir }}",
  "retry_join": [
    "{{ groups[consul_host_group] | map('extract', hostvars, ['ansible_host']) | join("\", \"") }}"
  ],
  "encrypt": "{{ consul_encrypt }}",
  "log_level": "{{ consul_log_level }}",
  "enable_syslog": {{ consul_enable_syslog }},
  "check_update_interval": "{{ consul_check_interval }}",
  "acl_datacenter": "{{ consul_datacenter }}",
  "acl_default_policy": "{{ acl_policy }}",
  "acl_down_policy": "{{ acl_down_policy }}",
  "acl_master_token": "{{ acl_master_token }}",
  "acl_agent_token": "{{ acl_agent_token }}",
  "performance": {
    "raft_multiplier": {{ raft_multiplier }}
  },
  "gossip_lan": {
    "probe_timeout": "{{ probe_timeout }}",
    "probe_interval": "{{ probe_interval }}"
  }
}

@mssawant

Thanks @anshitabharti, I thought advertise_addr would help but no luck. Trying probe_timeout and probe_interval.
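
If it helps, one way to confirm the gossip tuning actually took effect is to inspect the running agent's resolved configuration; the DebugConfig field names below are an assumption and may differ between versions:

# Hedged sketch: dump the agent's resolved gossip_lan settings from /v1/agent/self
curl -s http://127.0.0.1:8500/v1/agent/self \
  | jq '.DebugConfig | {GossipLANProbeInterval, GossipLANProbeTimeout}'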

@chymy

chymy commented May 7, 2022

We are having the same issue: one of the Consul clients sees another as leaving while the latter is alive.
Consul version: 1.6.2
