
consul 0.9.2 - [ERR] memberlist: Failed fallback ping: write tcp 172.17.0.5:45890->a.b.c.d:8301: i/o timeout #3411

Closed
mnuic opened this issue Aug 24, 2017 · 21 comments
Labels
theme/operator-usability (Replaces UX. Anything related to making things easier for the practitioner)
type/enhancement (Proposed improvement or new feature)
Milestone
Unplanned

Comments

mnuic commented Aug 24, 2017

Running consul docker image 0.9.2

consul version for both Client and Server

Client: 0.7.5 -> updated to 0.9.2
Server: 0.7.5 -> updated to 0.9.2

consul info for both Client and Server

Client:

# consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 7
	services = 18
build:
	prerelease =
	revision = 75ca2ca
	version = 0.9.2
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 84
	max_procs = 4
	os = linux
	version = go1.8.3
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 4625
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6897
	members = 12
	query_queue = 0
	query_time = 2

Server:

# consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 5
	services = 13
build:
	prerelease =
	revision = 75ca2ca
	version = 0.9.2
consul:
	bootstrap = false
	known_datacenters = 8
	leader = false
	leader_addr = 192.168.10.237:8300
	server = true
raft:
	applied_index = 431457249
	commit_index = 431457249
	fsm_pending = 0
	last_contact = 27.164445ms
	last_log_index = 431457249
	last_log_term = 23227
	last_snapshot_index = 431453186
	last_snapshot_term = 23227
	latest_configuration = [{Suffrage:Voter ID:a.b.c.d1:8300 Address:a.b.c.d1:8300} {Suffrage:Voter ID:a.b.c.d2:8300 Address:a.b.c.d2:8300} {Suffrage:Voter ID:a.b.c.d3:8300 Address:a.b.c.d3:8300}]
	latest_configuration_index = 359053270
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 23227
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 359
	max_procs = 8
	os = linux
	version = go1.8.3
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 4625
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6897
	members = 12
	query_queue = 0
	query_time = 2
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 767
	members = 13
	query_queue = 0
	query_time = 1

Operating system and Environment details

Ubuntu 16.04.3 LTS, docker 17.05.0-ce

Description of the Issue (and unexpected/desired result)

After upgrading Consul to v0.9.2, we are seeing a lot of these messages in the logs on every host, at random:

[ERR] memberlist: Failed fallback ping: write tcp 172.17.0.5:45890-> a.b.c.d:8301: i/o timeout

Reproduction steps

Upgrade the Consul docker image from 0.7.5 to v0.9.2; after that, the fallback ping log messages appear at random.

Tried -log-level=TRACE, but it is impossible to predict on which host it will happen next; it is totally random.

All docker ports are open, as far as I can see:

"Ports": {
                "8300/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8300"
                    }
                ],
                "8301/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8301"
                    }
                ],
                "8301/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8301"
                    }
                ],
                "8302/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8302"
                    }
                ],
                "8302/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8302"
                    }
                ],
                "8400/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8400"
                    }
                ],
                "8500/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8500"
                    }
                ],
                "8600/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8600"
                    }
                ],
                "8600/udp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8600"
                    }
                ]
            },

On my test environment I installed 2 docker hosts with consul 0.7.5, then upgraded to v0.8.5 and then to v0.9.0, and the fallback ping messages started. So I think this is caused by something in version 0.9.x.

No firewall, no iptables, nothing that could block the connection and cause a timeout.
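(For reference, a minimal way to sanity-check TCP and UDP reachability of the Serf LAN port from another node; the address is a placeholder, and the UDP probe is best-effort:)

    # TCP check of the Serf LAN port
    nc -vz -w 3 a.b.c.d 8301

    # UDP check; only reports failure if an ICMP "port unreachable" comes back
    nc -vzu -w 3 a.b.c.d 8301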

Edit:
Also, seeing a lot of this on 2 servers on the same subnet/network:

2017/08/24 13:12:04 [DEBUG] memberlist: Initiating push/pull sync with: a.b.c.d:8301
2017/08/24 13:12:10 [DEBUG] memberlist: Failed ping: SERVER (timeout reached)
2017/08/24 13:12:11 [DEBUG] memberlist: Failed ping: SERVER (timeout reached)

aroca commented Sep 1, 2017

The same thing is happening here. I'm on EC2 instances, no docker.

Consul v0.9.0

It is a 3-server cluster only, no clients. Ports 8300-8500 are allowed for both UDP and TCP, but not 8600.

    2017/09/01 19:40:52 [DEBUG] memberlist: Stream connection from=10.0.3.237:44932
    2017/09/01 19:40:52 [DEBUG] memberlist: Failed ping: [server] (timeout reached)

The Consul cluster is alive and healthy. I just don't understand those logs.


aroca commented Sep 1, 2017

For what it's worth, I rechecked the ACLs in AWS and the UDP ports were missing. The log is not that helpful, though. I remember that in previous versions it stated that UDP was not reachable and it was falling back to TCP; now the ping message isn't very helpful. Perhaps that changed in the new 0.9.2.
Cheers!
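(For anyone who hits the same gap, a hedged sketch of opening the Serf LAN UDP port with the AWS CLI; the security-group ID and CIDR below are placeholders:)

    # allow Serf LAN gossip (UDP 8301) between cluster members
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol udp \
        --port 8301 \
        --cidr 10.0.0.0/16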


mnuic commented Sep 1, 2017

Don't have any ACLs, and all ports are open, but I still get random timeout messages.


mnuic commented Sep 19, 2017

This morning we changed our infrastructure so that the consul container uses host networking and CONSUL_ALLOW_PRIVILEGED_PORTS=1. We are still seeing a lot of the same log messages:

[ERR] memberlist: Failed fallback ping: write tcp 10.0.0.1:49826->10.0.0.5:8301: i/o timeout

I found the explanation of the fallback ping and can see the use of it, so I would not like to disable it, but it is a little too excessive; the log lines are full of it for no obvious reason: https://github.com/hashicorp/consul/blob/v0.6.4/vendor/github.com/hashicorp/memberlist/state.go#L275-L299
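(For context, a minimal sketch of the kind of docker invocation described above; the image tag, bind address, and join address are placeholders, and the exact flags depend on the setup:)

    # host networking so gossip uses the host's interfaces and ports directly;
    # CONSUL_ALLOW_PRIVILEGED_PORTS=1 lets the entrypoint bind low ports if needed
    docker run -d --name consul --net=host \
        -e CONSUL_ALLOW_PRIVILEGED_PORTS=1 \
        consul:0.9.2 agent -bind=10.0.0.1 -retry-join=10.0.0.5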

@slackpad can you help?

@slackpad slackpad added type/enhancement Proposed improvement or new feature theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner labels Dec 19, 2017
@slackpad slackpad added this to the Unplanned milestone Dec 19, 2017
@slackpad
Contributor

Hmm, that error message did get more generic after a refactoring. We should look at making these messages more specific and actionable (and less spammy).


mnuic commented Dec 19, 2017

That would be great @slackpad.

And also, can you do something about this type of message? We get them every couple of minutes, even on the latest version, 1.0.2:

 [ERR] yamux: keepalive failed: session shutdown


mnuic commented Jan 8, 2018

Hi @slackpad,

Is there a chance this could be resolved in the next release?
And could the log level for the yamux "keepalive failed: session shutdown" message be lowered?

I'm asking because we have a lot of nodes and these log messages are becoming too spammy.

Thanks

@charlieoleary

Seeing the same thing in 1.0.2. More specifically, the nodes having the issue are in different VPCs, but the VPCs are peered and I have verified they can reach each other bidirectionally on all of the required ports. The only thing I can think of is that since all of the nodes are in a private subnet behind a NAT, that is somehow causing interference, but they have appropriate direct routes set up. Debug messages don't shed any additional light.


mnuic commented Jun 28, 2018

Hi,

Consul version 1.2.0, on the same LAN; every few minutes the logs are filled with:

2018/06/27 13:00:31 [ERR] memberlist: Failed fallback ping: read tcp 10.0.66.150:35168->10.0.66.192:8302: i/o timeout
2018/06/27 13:03:41 [ERR] memberlist: Failed fallback ping: read tcp 10.0.66.150:53268->10.0.66.192:8302: i/o timeout

@shantanugadgil
Contributor

I have been seeing the same for quite some time (and across versions) between my on-premise server and a cloud server.

I have verified all ports back and forth using telnet, netcat, and iperf3.

consul version 1.2.2


linydquantil commented Sep 5, 2018

+1 consul version 1.2.2

@ervikrant06

Encountered the same issue with 1.2.2:

docker@consulserver:~$ docker exec -it 5cecf4554a0d consul members --http-addr=192.168.99.100:8500
Node          Address              Status  Type    Build  Protocol  DC        Segment
consulserver  192.168.99.100:8301  alive   server  1.2.2  2         labsetup  <all>
consulclient  192.168.99.101:8301  alive   client  1.2.2  2         labsetup  <default>


    2018/09/08 11:14:05 [INFO] consul: member 'consulclient' joined, marking health alive
    2018/09/08 11:14:15 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
    2018/09/08 11:14:22 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
    2018/09/08 11:14:29 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
    2018/09/08 11:14:35 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured


lorierp commented Dec 27, 2018

+1 consul version 1.4.0

@shantanugadgil
Contributor

The errors I was seeing have since gone away.
The issue was how the VPN was set up between the two endpoints.

Previously it was a software-based VPN (StrongSwan).
Once a site-to-site VPN was set up between the on-premise firewall and AWS, this error went away.


qrgeng commented Dec 5, 2019

(quoting @ervikrant06's comment above)

Hi, I am hitting the same problem in consul 1.5.1. Have you found a solution? Thanks.
Log:
memberlist: Was able to connect to FSKY_Client but other probes failed, network may be misconfigured

@analytically

Seeing the same issues with 1.8.3 on AWS peered VPCs

@yevgeniyo-ps

Seeing it too with consul:1.8.5; it causes the pod to not be ready.

@shantanugadgil
Contributor

Haven't used it in a while, but I would first confirm that all the necessary ports are reachable back and forth, both TCP and UDP.

@justas147

justas147 commented Jan 18, 2022

When I deployed consul with the consul-k8s helm chart, my problem was that the memory limit was set too low and the server pods were crashing. That is why the ping requests returned with a timeout.
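(As an illustration only, a hedged sketch of raising the server memory limit with Helm; the release name, values, and pod name below are placeholders, and it assumes a chart version where server.resources is a structured value:)

    # raise the server pods' memory request/limit so they are not OOM-killed
    helm upgrade consul hashicorp/consul \
        --set server.resources.requests.memory=200Mi \
        --set server.resources.limits.memory=500Mi

    # a crashing server pod shows OOMKilled in its last state
    kubectl describe pod consul-server-0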


david-yu commented Nov 2, 2022

Based on #3411 (comment) it sounds like the issues were more network related. Closing, as this issue is quite old; please open a new issue if further investigation is needed.

@david-yu david-yu closed this as completed Nov 2, 2022
@TheGoderGuy

@justas147 we ran into that problem too and your answer helped us a lot. I took the extra steps for a (maybe declined) PR so others won't have that problem in the future. This problem had cost us several days, and it took a precise, lucky kubectl describe at the right time to find: hashicorp/consul-k8s#1696
