Agents can't find new server(s) #9235
Open
moreinhardt opened this issue Nov 19, 2020 · 3 comments

Overview of the Issue

I have a very similar issue to #6672 but the proposed solution doesn't work for me.

I run a single-node consul server cluster inside k8s. If the server pod is recreated, the clients outside of k8s log errors ("No known Consul servers", see the log fragments below) and cannot reconnect. If I restart the agents using systemctl restart consul they connect fine.

The solution proposed in #6672 is to add

        "skip_leave_on_interrupt": true,
        "leave_on_terminate" : false,

to the config. As far as I understand this is the default for the server anyway. Adding it to the client didn't help.
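
For context, a minimal client config carrying these flags would look roughly like this (datacenter, data_dir and the retry_join target are placeholders, not the actual values from this setup):

    {
      "server": false,
      "datacenter": "dc1",
      "data_dir": "/opt/consul",
      "retry_join": ["<server address or cloud auto-join string>"],
      "skip_leave_on_interrupt": true,
      "leave_on_terminate": false
    }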

I do understand that the whole server should not go down, but even in a k8s setup this could happen. Why can the agents not handle this? All that would need to happen is for the agent to exit with code 1, since systemd would start it again anyway. How can I achieve that? Or what's the proper solution?
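
For what it's worth, the systemd side of that would just be a drop-in created with systemctl edit consul; this is only a sketch, and it only helps if the agent actually exits, which it currently does not:

    [Service]
    Restart=on-failure
    RestartSec=5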

Btw. a 3-node consul server setup inside k8s didn't work for me due to #7750, but as mentioned, a complete server outage could happen there as well.

FWIW: I'm only using consul for its DNS capabilities.

Reproduction Steps

  1. Create a cluster with (at least?) 1 client node and 1 server node. The server runs inside kubernetes, using the helm chart and host network; the client runs outside.
  2. Destroy the server pod (a command sketch follows this list). K8s will recreate it.
  3. The client will log errors and not join the new server. If I restart the client it connects fine.
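
For step 2, the deletion amounts to something like the following, assuming the chart is installed in the consul namespace and the server pod is named consul-server-0 (names are assumptions, adjust to your release):

    kubectl -n consul delete pod consul-server-0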

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease = 
	revision = 12b16df3
	version = 1.8.4
consul:
	acl = disabled
	known_servers = 1
	server = false
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 47
	max_procs = 8
	os = linux
	version = go1.14.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 24296
	members = 6
	query_queue = 0
	query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = beta3
        revision = a00f8cce
        version = 1.9.0
consul:
        acl = disabled
        bootstrap = true
        known_datacenters = 1
        leader = true
        leader_addr = 10.110.0.11:8300
        server = true
raft:
        applied_index = 79921
        commit_index = 79921
        fsm_pending = 0
        last_contact = 0
        last_log_index = 79921
        last_log_term = 7
        last_snapshot_index = 65591
        last_snapshot_term = 7
        latest_configuration = [{Suffrage:Voter ID:170ef5fb-0aa0-1f1c-64ad-1f73fe020bcc Address:10.110.0.11:8300}]
        latest_configuration_index = 0
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 7
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 118
        max_procs = 8
        os = linux
        version = go1.14.11
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 1
        member_time = 24335
        members = 6
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 6
        members = 1
        query_queue = 0
        query_time = 1

Operating system and Environment details

The server runs inside a kubernetes cluster on DigitalOcean, on host network. It was installed using the helm chart v0.25.0 with this PR applied. I'm using the consul beta release but the same error appeared with the stable release. The agent is configured using retry_join with cloud auto-join configured.
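
For reference, a retry_join entry using DigitalOcean cloud auto-join is roughly of this shape (region, tag name and token are placeholders, not the real values from this cluster):

    "retry_join": ["provider=digitalocean region=fra1 tag_name=consul-server api_token=<token>"]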

Log Fragments

The client repeatedly logs this (log level trace) when the server pod is restarted:

2020-11-19T11:31:41.536Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-19T11:31:55.323Z [ERROR] agent: Coordinate update error: error="No known Consul servers"
2020-11-19T11:32:04.413Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-19T11:32:11.170Z [ERROR] agent: Coordinate update error: error="No known Consul servers"
2020-11-19T11:32:29.474Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-19T11:32:33.289Z [ERROR] agent: Coordinate update error: error="No known Consul servers"
2020-11-19T11:32:50.384Z [ERROR] agent: Coordinate update error: error="No known Consul servers"
2020-11-19T11:32:55.866Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-19T11:33:08.380Z [ERROR] agent: Coordinate update error: error="No known Consul servers"

Actually, when I tried yesterday (with log level debug), it repeatedly printed the following:

2020-11-18T13:43:40.455Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-18T13:43:45.278Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-11-18T13:43:45.278Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-11-18T13:43:45.278Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
@jsosulska jsosulska added the theme/kubernetes Consul-helm/kubernetes related questions label Nov 19, 2020
eatwithforks commented Sep 9, 2021

I'm facing the same issue. A workaround hack is to write a systemd service that curls /v1/status/peers and, if it returns "No known Consul servers", restarts the consul service and emits a metric. That's shady; I would prefer a native flag to restart on this error.
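
A rough sketch of that watchdog (script name and schedule are made up; it assumes the agent's HTTP API is on 127.0.0.1:8500 and that the error text comes back in the response body, as the agent.http log lines in this issue suggest):

    #!/bin/sh
    # consul-watchdog.sh - hypothetical helper implementing the workaround above.
    # Run it every minute or so from cron or a systemd timer.
    if curl -s http://127.0.0.1:8500/v1/status/peers | grep -q "No known Consul servers"; then
        systemctl restart consul
        # emit a metric/alert here in whatever form your monitoring expects
    fi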

@snirkatriel

Hi, any update on this one? It seems like the Consul server IP change (which is regular kubernetes behavior when a pod is restarted) causes this issue.
Is it possible to rely on a unique ID instead of the consul server IP?

@Amier3 Amier3 added the type/bug Feature does not function as expected label Mar 15, 2022
@Amier3 Amier3 self-assigned this Mar 15, 2022
AadiManchekar commented Oct 24, 2024

Same issue here.
In a K8s env (I deploy consul via helm, hashicorp/consul --version 1.4.1): one consul server, one consul client, one sync catalog.
I have consul-server on node1, and consul-client and sync catalog running on node2.

If the consul-server-0 pod gets deleted it gets recreated by k8s (as it's a daemonset), but the client goes into an error state and starts logging:
2024-10-24T13:42:53.466Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:51854 error="No known Consul servers"
2024-10-24T13:43:03.503Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:58888 error="No known Consul servers"
2024-10-24T13:43:04.883Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:58902 error="No known Consul servers"
2024-10-24T13:43:13.601Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:43962 error="No known Consul servers"
2024-10-24T13:43:15.838Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2024-10-24T13:43:15.961Z [ERROR] agent: Coordinate update error: error="No known Consul servers"

If I restart the client then it's okay.

But what I noticed was that the consul sync catalog auto-joined the server:
2024-10-24T13:32:37.539Z [ERROR] consul-server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.2.0.3:8502: i/o timeout\""
2024-10-24T13:32:39.323Z [INFO] consul-server-connection-manager: trying to connect to a Consul server
2024-10-24T13:32:39.326Z [INFO] consul-server-connection-manager: discovered Consul servers: addresses=[10.2.0.7:8502]
2024-10-24T13:32:39.326Z [INFO] consul-server-connection-manager: current prioritized list of known Consul servers: addresses=[10.2.0.7:8502]
2024-10-24T13:32:40.160Z [INFO] consul-server-connection-manager: connected to Consul server: address=10.2.0.7:8502
2024-10-24T13:32:40.162Z [INFO] consul-server-connection-manager: updated known Consul servers from watch stream: addresses=[10.2.0.3:8502]

But this issue is not observed in hashicorp/consul --version 1.2.3 (with the same configuration); there the clients automatically reconnect to consul.

@zalimeni Hope it helps in debugging this issue
