
consul 0.9.2 - [ERR] memberlist: Failed fallback ping: write tcp 172.17.0.5:45890->a.b.c.d:8301: i/o timeout #3411

Closed
mnuic opened this issue Aug 24, 2017 · 21 comments
Labels
theme/operator-usability (Replaces UX. Anything related to making things easier for the practitioner)
type/enhancement (Proposed improvement or new feature)
Milestone
Unplanned

Comments

mnuic commented Aug 24, 2017

Running consul docker image 0.9.2

consul version for both Client and Server

Client: 0.7.5 -> updated to 0.9.2
Server: 0.7.5 -> updated to 0.9.2

consul info for both Client and Server

Client:

# consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 7
	services = 18
build:
	prerelease =
	revision = 75ca2ca
	version = 0.9.2
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 84
	max_procs = 4
	os = linux
	version = go1.8.3
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 4625
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6897
	members = 12
	query_queue = 0
	query_time = 2

Server:

# consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 5
	services = 13
build:
	prerelease =
	revision = 75ca2ca
	version = 0.9.2
consul:
	bootstrap = false
	known_datacenters = 8
	leader = false
	leader_addr = 192.168.10.237:8300
	server = true
raft:
	applied_index = 431457249
	commit_index = 431457249
	fsm_pending = 0
	last_contact = 27.164445ms
	last_log_index = 431457249
	last_log_term = 23227
	last_snapshot_index = 431453186
	last_snapshot_term = 23227
	latest_configuration = [{Suffrage:Voter ID:a.b.c.d1:8300 Address:a.b.c.d1:8300} {Suffrage:Voter ID:a.b.c.d2:8300 Address:a.b.c.d2:8300} {Suffrage:Voter ID:a.b.c.d3:8300 Address:a.b.c.d3:8300}]
	latest_configuration_index = 359053270
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 23227
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 359
	max_procs = 8
	os = linux
	version = go1.8.3
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 4625
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6897
	members = 12
	query_queue = 0
	query_time = 2
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 767
	members = 13
	query_queue = 0
	query_time = 1

Operating system and Environment details

Ubuntu 16.04.3 LTS, docker 17.05.0-ce

Description of the Issue (and unexpected/desired result)

After upgrading Consul to v0.9.2, we are seeing a lot of these messages in the logs on every host, at random:

[ERR] memberlist: Failed fallback ping: write tcp 172.17.0.5:45890-> a.b.c.d:8301: i/o timeout

Reproduction steps

Upgrade the Consul docker image from 0.7.5 to v0.9.2; after that, the fallback ping log messages appear at random.

Tried -log-level=TRACE, but it is impossible to predict on which host it will happen next; it is totally random.

All docker ports are open, as far as I can see:

"Ports": {
                "8300/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8300"
                    }
                ],
                "8301/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8301"
                    }
                ],
                "8301/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8301"
                    }
                ],
                "8302/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8302"
                    }
                ],
                "8302/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8302"
                    }
                ],
                "8400/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8400"
                    }
                ],
                "8500/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8500"
                    }
                ],
                "8600/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8600"
                    }
                ],
                "8600/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8600"
                    }
                ]
            },

On my test environment I installed 2 docker hosts with consul 0.7.5, then upgraded to v0.8.5 and then to v0.9.0, and the fallback ping messages started. So I think this is caused by something in version 0.9.x.

No firewall, no iptables, nothing that could block the connection and cause a timeout.
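(For reference, a minimal way to sanity-check TCP and UDP reachability of the Serf LAN port from another node; the address is a placeholder, and the UDP probe is best-effort:)

    # TCP check of the Serf LAN port
    nc -vz -w 3 a.b.c.d 8301

    # UDP check; only reports failure if an ICMP "port unreachable" comes back
    nc -vzu -w 3 a.b.c.d 8301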

Edit:
Also, seeing a lot of this on 2 servers on the same subnet/network:

2017/08/24 13:12:04 [DEBUG] memberlist: Initiating push/pull sync with: a.b.c.d:8301
2017/08/24 13:12:10 [DEBUG] memberlist: Failed ping: SERVER (timeout reached)
2017/08/24 13:12:11 [DEBUG] memberlist: Failed ping: SERVER (timeout reached)

aroca commented Sep 1, 2017

The same thing is happening here. I'm on EC2 instances, no docker.

Consul v0.9.0

It is a 3-server cluster only, no clients. Ports 8300-8500 are allowed for both UDP and TCP, but not 8600.

    2017/09/01 19:40:52 [DEBUG] memberlist: Stream connection from=10.0.3.237:44932
    2017/09/01 19:40:52 [DEBUG] memberlist: Failed ping: [server] (timeout reached)

The Consul cluster is alive and healthy. I just don't understand those logs.


aroca commented Sep 1, 2017

For what it's worth, I rechecked the ACLs in AWS and the UDP ports were missing. The log is not that helpful, though. I remember that in previous versions it stated that UDP was not reachable and it was falling back to TCP; now the ping message isn't very helpful. Perhaps that changed in the new 0.9.2.
Cheers!
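(For anyone who hits the same gap, a hedged sketch of opening the Serf LAN UDP port with the AWS CLI; the security-group ID and CIDR below are placeholders:)

    # allow Serf LAN gossip (UDP 8301) between cluster members
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol udp \
        --port 8301 \
        --cidr 10.0.0.0/16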


mnuic commented Sep 1, 2017

Don't have any ACLs, and all ports are open, but I still get random timeout messages.


mnuic commented Sep 19, 2017

This morning we changed our infrastructure so that the consul container uses host networking and CONSUL_ALLOW_PRIVILEGED_PORTS=1. We are still seeing a lot of the same log messages:

[ERR] memberlist: Failed fallback ping: write tcp 10.0.0.1:49826->10.0.0.5:8301: i/o timeout

I found the explanation of the fallback ping and can see the use of it, so I would not like to disable it, but it is a little too excessive; the log lines are full of it for no obvious reason: https://github.com/hashicorp/consul/blob/v0.6.4/vendor/github.com/hashicorp/memberlist/state.go#L275-L299
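(For context, a minimal sketch of the kind of docker invocation described above; the image tag, bind address, and join address are placeholders, and the exact flags depend on the setup:)

    # host networking so gossip uses the host's interfaces and ports directly;
    # CONSUL_ALLOW_PRIVILEGED_PORTS=1 lets the entrypoint bind low ports if needed
    docker run -d --name consul --net=host \
        -e CONSUL_ALLOW_PRIVILEGED_PORTS=1 \
        consul:0.9.2 agent -bind=10.0.0.1 -retry-join=10.0.0.5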

@slackpad can you help?

@slackpad slackpad added type/enhancement Proposed improvement or new feature theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner labels Dec 19, 2017
@slackpad slackpad added this to the Unplanned milestone Dec 19, 2017
@slackpad
Contributor

Hmm, that error message did get more generic after a refactoring. We should look at making these messages more specific and actionable (and less spammy).


mnuic commented Dec 19, 2017

That would be great @slackpad.

And also, can you do something about this type of message? We get them every couple of minutes, even on the latest version, 1.0.2:

 [ERR] yamux: keepalive failed: session shutdown


mnuic commented Jan 8, 2018

Hi @slackpad,

Is there a chance this could be resolved in the next release?
And could the log level for the yamux "keepalive failed: session shutdown" message be lowered?

I'm asking because we have a lot of nodes and these log messages are becoming too spammy.

Thanks

@charlieoleary

Seeing the same thing in 1.0.2. More specifically, the nodes having the issue are in different VPCs, but the VPCs are peered and I have verified they can reach each other bidirectionally on all of the required ports. The only thing I can think of is that since all of the nodes are in a private subnet behind a NAT, that is somehow causing interference, but they have appropriate direct routes set up. Debug messages don't shed any additional light.


mnuic commented Jun 28, 2018

Hi,

Consul version 1.2.0, on the same LAN; every few minutes the logs are filled with:

2018/06/27 13:00:31 [ERR] memberlist: Failed fallback ping: read tcp 10.0.66.150:35168->10.0.66.192:8302: i/o timeout
2018/06/27 13:03:41 [ERR] memberlist: Failed fallback ping: read tcp 10.0.66.150:53268->10.0.66.192:8302: i/o timeout

@shantanugadgil
Contributor

I have been seeing the same for quite some time (and across versions) between my on-premise server and a cloud server.

I have verified all ports back and forth using telnet, netcat, and iperf3.

consul version 1.2.2


linydquantil commented Sep 5, 2018

+1 consul version 1.2.2

@ervikrant06

Encountered the same issue with 1.2.2:

docker@consulserver:~$ docker exec -it 5cecf4554a0d consul members --http-addr=192.168.99.100:8500
Node          Address              Status  Type    Build  Protocol  DC        Segment
consulserver  192.168.99.100:8301  alive   server  1.2.2  2         labsetup  <all>
consulclient  192.168.99.101:8301  alive   client  1.2.2  2         labsetup  <default>


    2018/09/08 11:14:05 [INFO] consul: member 'consulclient' joined, marking health alive
    2018/09/08 11:14:15 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
    2018/09/08 11:14:22 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
    2018/09/08 11:14:29 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
    2018/09/08 11:14:35 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured


lorierp commented Dec 27, 2018

+1 consul version 1.4.0

@shantanugadgil
Contributor

The errors I was seeing have since gone away.
The issue was how the VPN was set up between the two endpoints.

Previously it was a software-based VPN (StrongSwan).
Once a site-to-site VPN was set up between the on-premise firewall and AWS, this error went away.


qrgeng commented Dec 5, 2019

(quoting @ervikrant06's comment above)

Hi, I am hitting the same problem in consul 1.5.1. Have you found a solution? Thanks.
Log:
memberlist: Was able to connect to FSKY_Client but other probes failed, network may be misconfigured

@analytically

Seeing the same issues with 1.8.3 on AWS peered VPCs

@yevgeniyo-ps

Seeing it too with consul:1.8.5; it causes the pod to not be ready.

@shantanugadgil
Contributor

Haven't used it in a while, but I would first confirm that all the necessary ports are reachable back and forth, both TCP and UDP.

@justas147

justas147 commented Jan 18, 2022

When I deployed consul with the consul-k8s helm chart, my problem was that the memory limit was set too low and the server pods were crashing. That is why the ping requests returned with a timeout.
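(As an illustration only, a hedged sketch of raising the server memory limit with Helm; the release name, values, and pod name below are placeholders, and it assumes a chart version where server.resources is a structured value:)

    # raise the server pods' memory request/limit so they are not OOM-killed
    helm upgrade consul hashicorp/consul \
        --set server.resources.requests.memory=200Mi \
        --set server.resources.limits.memory=500Mi

    # a crashing server pod shows OOMKilled in its last state
    kubectl describe pod consul-server-0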


david-yu commented Nov 2, 2022

Based on #3411 (comment) it sounds like the issues were more network related. Closing, as this issue is quite old; please open a new issue if further investigation is needed.

@david-yu david-yu closed this as completed Nov 2, 2022
@TheGoderGuy

@justas147 we ran into that problem too and your answer helped us a lot. I took the extra steps for a (maybe declined) PR so others won't have that problem in the future. This problem had cost us several days, and it took a precise, lucky kubectl describe at the right time to find: hashicorp/consul-k8s#1696
