
Consul Node stuck at Leaving status #6882

Open
anshitabharti opened this issue Dec 4, 2019 · 8 comments
Labels
theme/internals: Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
theme/kubernetes: Consul-helm/kubernetes related questions
theme/operator-usability: Replaces UX. Anything related to making things easier for the practitioner

Comments

@anshitabharti

Overview of the issue:

  1. Sometimes one of the nodes' SerfStatus gets stuck in the leaving state. Even though the agent is initially started with retry-join, once it falls out of the cluster it is unable to join back. When the node drops out of the cluster, the container is still up and running. To recover, the container has to be restarted manually, which we want to avoid.

  2. Even if just one of the nodes is in Leaving status, the monitoring API v1/operator/autopilot/health responds with Healthy: false, even though all k/v operations can be executed without any issues. Because of Healthy: false, alerts kick in and create panic as if the cluster were actually unhealthy. What is the rationale behind considering the cluster unhealthy?

Consul version: 1.5.3, running inside Docker containers on OpenStack VMs.
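
For reference, a minimal sketch of how the two symptoms above can be observed from a shell on one of the agents (local address and jq usage are assumptions, not part of the original report):

# Symptom 1: a member stuck in the "leaving" state
consul members | grep -i leaving

# Symptom 2: autopilot reports the cluster as unhealthy
curl -s http://127.0.0.1:8500/v1/operator/autopilot/health | jq '.Healthy'

# ...while k/v operations still succeed
consul kv put test/probe ok && consul kv get test/probe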

@KalenWessel

I've been having the same issue with Consul 1.6.2 running on k8s. I can do a rolling redeploy via a statefulset update and sometimes one of the nodes will show SerfStatus leaving and autopilot shows unhealthy. Like you said, only after I delete the container manually does it come back up again as healthy. Did you ever figure out what the problem was?
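
For context, the manual workaround described here looks roughly like the following sketch (the pod name and label are placeholders, assuming the agents run as a StatefulSet on Kubernetes):

# Find the pod whose agent is stuck in "leaving"
kubectl get pods -l app=consul

# Deleting it lets the StatefulSet recreate the pod; the fresh agent rejoins as alive
kubectl delete pod consul-2

# Verify autopilot reports healthy again
curl -s http://127.0.0.1:8500/v1/operator/autopilot/health | jq '.Healthy'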

@jsosulska added the theme/internals, theme/kubernetes, and theme/operator-usability labels on May 4, 2020
@lwei-wish

We are having the same issue: one of the Consul members becomes leaving when the underlying node is terminated. I have to kill the pod manually to bring it back to alive.

@mssawant

I am facing a similar issue, where one of the Consul clients sees another as leaving while the latter is alive.

@Amier3
Contributor

Amier3 commented Mar 15, 2022

Hey @lwei-wish & @mssawant

May I ask which version(s) of Consul y'all are running? It would also be helpful to have any logs, if you have them.
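
A minimal sketch of how that information could be collected on the affected node (these are standard Consul CLI commands, but exact log locations will depend on the deployment):

consul version
consul members                     # note which member is stuck in "leaving"
consul monitor -log-level=debug    # stream agent logs while the issue reproduces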

@mssawant

Hi @Amier3, I am running version 1.9.1. Whenever I delete a pod running a Consul client agent, on restart it fails to resolve the node name to the new IP address, and all the other nodes see this restarted pod as failed.
We have included leave_on_terminate in the configuration, but the agent still does not seem to leave the cluster.

{
  "enable_local_script_checks": true,
  "leave_on_terminate": true,

Any help will be appreciated.
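
One possible stopgap, sketched below, is to force the stale entry out so the restarted pod can register under its new IP; the node name is a placeholder, and this is a workaround rather than a fix for the underlying behaviour:

# Remove the member that other agents still see as failed/leaving
consul force-leave consul-client-0

# Confirm the restarted pod shows up as alive again
consul members | grep consul-client-0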

@anshitabharti
Author

anshitabharti commented Mar 15, 2022

Hello!

It has been a while and I do not recollect exactly which part helped us solve the problem. I'm pasting the docker-compose and Consul config below in case that helps.

compose:

version: '2'
services:
  {{ workload }}:
    network_mode: host
    build:
      args:
        - UID={{ ansible_user_uid }}
        - GID={{ ansible_user_gid }}
      context: .
    image: "{{ image_tag }}"
    container_name: {{ container_name }}
    hostname: "{{ ansible_host }}"
    ports:
      - {{ consul_http_port }}:{{ consul_http_port }}
      - {{ consul_rpc_port }}:{{ consul_rpc_port }}
      - {{ consul_dns_port }}:{{ consul_dns_port }}
      - {{ consul_lan_serf_port }}:{{ consul_lan_serf_port }}/tcp
      - {{ consul_wan_serf_port }}:{{ consul_wan_serf_port }}/tcp
      - {{ consul_lan_serf_port }}:{{ consul_lan_serf_port }}/udp
      - {{ consul_wan_serf_port }}:{{ consul_wan_serf_port }}/udp
    command: sh -c "/bin/consul agent -ui -config-file={{ consul_config_file }} >> {{ consul_log_file_path }} 2>&1"
    volumes:
      - {{ consul_config_file_path }}:{{ consul_config_file }}
      - {{ workload_data_dir }}:{{ consul_data_dir }}
      - {{ workload_log_dir }}:{{ consul_log_file_dir }}
    environment:
      CONSUL_ACL_TOKEN: {{ acl_master_token }}
      CONSUL_BIND_INTERFACE: eth0
      CONSUL_CLIENT_INTERFACE: eth0

config:

{
  "datacenter": "{{ consul_datacenter }}",
  "bootstrap_expect": {{ bootstrap_expect }},
  "advertise_addr": "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}",
  "client_addr": "{{ client_addr }}",
  "server": {{ is_server }},
  "data_dir": "{{ consul_data_dir }}",
  "retry_join": [
    "{{ groups[consul_host_group] | map('extract', hostvars, ['ansible_host']) | join("\", \"") }}"
  ],
  "encrypt": "{{ consul_encrypt }}",
  "log_level": "{{ consul_log_level }}",
  "enable_syslog": {{ consul_enable_syslog }},
  "check_update_interval": "{{ consul_check_interval }}",
  "acl_datacenter": "{{ consul_datacenter }}",
  "acl_default_policy": "{{ acl_policy }}",
  "acl_down_policy": "{{ acl_down_policy }}",
  "acl_master_token": "{{ acl_master_token }}",
  "acl_agent_token": "{{ acl_agent_token }}",
  "performance": {
    "raft_multiplier": {{ raft_multiplier }}
  },
  "gossip_lan": {
    "probe_timeout": "{{ probe_timeout }}",
    "probe_interval": "{{ probe_interval }}"
  }
}

@mssawant

Thanks @anshitabharti, I thought advertise_addr would help but no luck. Trying probe_timeout and probe_interval.
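
If it helps, one way to confirm the gossip tuning actually took effect is to inspect the running agent's resolved configuration; the DebugConfig field names below are an assumption and may differ between versions:

# Hedged sketch: dump the agent's resolved gossip_lan settings from /v1/agent/self
curl -s http://127.0.0.1:8500/v1/agent/self \
  | jq '.DebugConfig | {GossipLANProbeInterval, GossipLANProbeTimeout}'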

@chymy

chymy commented May 7, 2022

We are having the same issue: one of the Consul clients sees another as leaving while the latter is alive.
Consul version: 1.6.2
