Agents can't find new server(s) #9235
Open
moreinhardt opened this issue Nov 19, 2020 · 3 comments

Overview of the Issue

I have a very similar issue to #6672 but the proposed solution doesn't work for me.

I run a single-node consul server cluster inside k8s. If the server pod is recreated, the clients outside of k8s log errors ("No known Consul servers", see the log fragments below) and cannot reconnect. If I restart the agents using systemctl restart consul they connect fine.

The solution proposed in #6672 is to add

        "skip_leave_on_interrupt": true,
        "leave_on_terminate" : false,

to the config. As far as I understand this is the default for the server anyway. Adding it to the client didn't help.
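
For context, a minimal client config carrying these flags would look roughly like this (datacenter, data_dir and the retry_join target are placeholders, not the actual values from this setup):

    {
      "server": false,
      "datacenter": "dc1",
      "data_dir": "/opt/consul",
      "retry_join": ["<server address or cloud auto-join string>"],
      "skip_leave_on_interrupt": true,
      "leave_on_terminate": false
    }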

I do understand that the whole server should not go down, but even in a k8s setup this could happen. Why can the agents not handle this? All that would need to happen is for the agent to exit with code 1, since systemd would start it again anyway. How can I achieve that? Or what's the proper solution?
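
For what it's worth, the systemd side of that would just be a drop-in created with systemctl edit consul; this is only a sketch, and it only helps if the agent actually exits, which it currently does not:

    [Service]
    Restart=on-failure
    RestartSec=5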

Btw. a 3-node consul server setup inside k8s didn't work for me due to #7750, but as mentioned, a complete server outage could happen there as well.

FWIW: I'm only using consul for its DNS capabilities.

Reproduction Steps

  1. Create a cluster with (at least?) 1 client node and 1 server node. The server runs inside kubernetes, using the helm chart and host network; the client runs outside.
  2. Destroy the server pod (a command sketch follows this list). K8s will recreate it.
  3. The client will log errors and not join the new server. If I restart the client it connects fine.
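
For step 2, the deletion amounts to something like the following, assuming the chart is installed in the consul namespace and the server pod is named consul-server-0 (names are assumptions, adjust to your release):

    kubectl -n consul delete pod consul-server-0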

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease = 
	revision = 12b16df3
	version = 1.8.4
consul:
	acl = disabled
	known_servers = 1
	server = false
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 47
	max_procs = 8
	os = linux
	version = go1.14.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 24296
	members = 6
	query_queue = 0
	query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = beta3
        revision = a00f8cce
        version = 1.9.0
consul:
        acl = disabled
        bootstrap = true
        known_datacenters = 1
        leader = true
        leader_addr = 10.110.0.11:8300
        server = true
raft:
        applied_index = 79921
        commit_index = 79921
        fsm_pending = 0
        last_contact = 0
        last_log_index = 79921
        last_log_term = 7
        last_snapshot_index = 65591
        last_snapshot_term = 7
        latest_configuration = [{Suffrage:Voter ID:170ef5fb-0aa0-1f1c-64ad-1f73fe020bcc Address:10.110.0.11:8300}]
        latest_configuration_index = 0
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 7
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 118
        max_procs = 8
        os = linux
        version = go1.14.11
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 1
        member_time = 24335
        members = 6
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 6
        members = 1
        query_queue = 0
        query_time = 1

Operating system and Environment details

The server runs inside a kubernetes cluster on DigitalOcean, on host network. It was installed using the helm chart v0.25.0 with this PR applied. I'm using the consul beta release but the same error appeared with the stable release. The agent is configured using retry_join with cloud auto-join configured.
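
For reference, a retry_join entry using DigitalOcean cloud auto-join is roughly of this shape (region, tag name and token are placeholders, not the real values from this cluster):

    "retry_join": ["provider=digitalocean region=fra1 tag_name=consul-server api_token=<token>"]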

Log Fragments

The client repeatedly logs this (log level trace) when the server pod is restarted:

2020-11-19T11:31:41.536Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-19T11:31:55.323Z [ERROR] agent: Coordinate update error: error="No known Consul servers"
2020-11-19T11:32:04.413Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-19T11:32:11.170Z [ERROR] agent: Coordinate update error: error="No known Consul servers"
2020-11-19T11:32:29.474Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-19T11:32:33.289Z [ERROR] agent: Coordinate update error: error="No known Consul servers"
2020-11-19T11:32:50.384Z [ERROR] agent: Coordinate update error: error="No known Consul servers"
2020-11-19T11:32:55.866Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-19T11:33:08.380Z [ERROR] agent: Coordinate update error: error="No known Consul servers"

Actually, when I tried yesterday (with log level debug), it repeatedly printed the following:

2020-11-18T13:43:40.455Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-11-18T13:43:45.278Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-11-18T13:43:45.278Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
2020-11-18T13:43:45.278Z [ERROR] agent.dns: rpc error: error="No known Consul servers"
@jsosulska jsosulska added the theme/kubernetes Consul-helm/kubernetes related questions label Nov 19, 2020
eatwithforks commented Sep 9, 2021

I'm facing the same issue. A workaround hack is to write a systemd service that curls /v1/status/peers and, if it returns "No known Consul servers", restarts the consul service and emits a metric. That's shady; I would prefer a native flag to restart on this error.
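
A rough sketch of that watchdog (script name and schedule are made up; it assumes the agent's HTTP API is on 127.0.0.1:8500 and that the error text comes back in the response body, as the agent.http log lines in this issue suggest):

    #!/bin/sh
    # consul-watchdog.sh - hypothetical helper implementing the workaround above.
    # Run it every minute or so from cron or a systemd timer.
    if curl -s http://127.0.0.1:8500/v1/status/peers | grep -q "No known Consul servers"; then
        systemctl restart consul
        # emit a metric/alert here in whatever form your monitoring expects
    fi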

@snirkatriel

Hi, any update on this one? It seems like the Consul server IP change (which is regular kubernetes behavior when a pod is restarted) causes this issue.
Is it possible to rely on a unique ID instead of the consul server IP?

@Amier3 Amier3 added the type/bug Feature does not function as expected label Mar 15, 2022
@Amier3 Amier3 self-assigned this Mar 15, 2022
AadiManchekar commented Oct 24, 2024

Same issue here.
In a K8s env (I deploy consul via helm, hashicorp/consul --version 1.4.1): one consul server, one consul client, one sync catalog.
I have consul-server on node1, and consul-client and sync catalog running on node2.

If the consul-server-0 pod gets deleted it gets recreated by k8s (as it's a daemonset), but the client goes into an error state and starts logging:
2024-10-24T13:42:53.466Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:51854 error="No known Consul servers"
2024-10-24T13:43:03.503Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:58888 error="No known Consul servers"
2024-10-24T13:43:04.883Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:58902 error="No known Consul servers"
2024-10-24T13:43:13.601Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:43962 error="No known Consul servers"
2024-10-24T13:43:15.838Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2024-10-24T13:43:15.961Z [ERROR] agent: Coordinate update error: error="No known Consul servers"

If I restart the client then it's okay.

But what I noticed was that the consul sync catalog auto-joined the server:
2024-10-24T13:32:37.539Z [ERROR] consul-server-connection-manager: connection error: error="fetching supported dataplane features: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.2.0.3:8502: i/o timeout\""
2024-10-24T13:32:39.323Z [INFO] consul-server-connection-manager: trying to connect to a Consul server
2024-10-24T13:32:39.326Z [INFO] consul-server-connection-manager: discovered Consul servers: addresses=[10.2.0.7:8502]
2024-10-24T13:32:39.326Z [INFO] consul-server-connection-manager: current prioritized list of known Consul servers: addresses=[10.2.0.7:8502]
2024-10-24T13:32:40.160Z [INFO] consul-server-connection-manager: connected to Consul server: address=10.2.0.7:8502
2024-10-24T13:32:40.162Z [INFO] consul-server-connection-manager: updated known Consul servers from watch stream: addresses=[10.2.0.3:8502]

But this issue is not observed in hashicorp/consul --version 1.2.3 (with the same configuration); there the clients automatically reconnect to consul.

@zalimeni Hope it helps in debugging this issue
