
Log shows frequent connection deleted message #3609

Closed
rongshen opened this issue Mar 6, 2019 · 8 comments

Comments

@rongshen

rongshen commented Mar 6, 2019

What you expected to happen?

No error message in the log

What happened?

In our Kubernetes cluster, the following error messages appear in the log every 3 minutes. Connections between nodes seem fine, but we saw a duplicate IP once. After following the instructions to delete the DB file under /var/lib/docker and reboot, the duplicate IP issue is gone; however, we still see the following error messages in the log. Will such errors cause any problems?

INFO: 2019/03/06 03:09:00.662976 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection fully established
INFO: 2019/03/06 03:09:00.663144 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/06 03:09:00.663758 sleeve ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: Effective MTU verified at 1438
INFO: 2019/03/06 03:09:30.159068 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection shutting down due to error: read tcp4 192.168.2.241:57674->192.168.2.185:6783: read: connection reset by peer
INFO: 2019/03/06 03:09:30.159129 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection deleted
INFO: 2019/03/06 03:09:30.159675 ->[192.168.2.185:6783] attempting connection
INFO: 2019/03/06 03:09:30.160302 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection ready; using protocol version 2
INFO: 2019/03/06 03:09:30.160355 overlay_switch ->[82:af:45:b6:6c:28(master.k8s)] using fastdp
INFO: 2019/03/06 03:09:30.160374 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection added
INFO: 2019/03/06 03:09:30.160918 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection fully established
INFO: 2019/03/06 03:09:30.661941 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/06 03:09:30.662531 sleeve ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: Effective MTU verified at 1438
INFO: 2019/03/06 03:10:04.388455 Discovered remote MAC f2:aa:b0:da:66:0a at ca:d2:c5:72:76:8f(2-110.k8s)
INFO: 2019/03/06 03:10:04.484069 Discovered remote MAC e6:4f:62:93:2b:96 at 16:7e:4e:eb:c8:db(2-112.k8s)
INFO: 2019/03/06 03:10:04.598544 Discovered remote MAC 5e:dc:72:a2:ba:f1 at 0a:5e:b6:a5:f0:66(77.k8s)
INFO: 2019/03/06 03:10:04.822022 Discovered remote MAC 3a:db:cd:98:ef:56 at 2a:69:e0:f7:6e:07(he-461-1a-1.k8s)
INFO: 2019/03/06 03:10:47.023263 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection shutting down due to error: Merge of incoming data causes: Entry 172.18.208.0-172.18.212.0 reporting too much free space: 2045 > 1024
INFO: 2019/03/06 03:10:47.023352 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection deleted
INFO: 2019/03/06 03:10:47.024384 ->[192.168.2.185:6783] attempting connection
INFO: 2019/03/06 03:10:47.029746 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection ready; using protocol version 2
INFO: 2019/03/06 03:10:47.029835 overlay_switch ->[82:af:45:b6:6c:28(master.k8s)] using fastdp
INFO: 2019/03/06 03:10:47.029860 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection added
INFO: 2019/03/06 03:10:47.531140 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection fully established
INFO: 2019/03/06 03:10:47.531472 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/06 03:10:47.532906 sleeve ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: Effective MTU verified at 1438
INFO: 2019/03/06 03:11:17.023015 ->[192.168.2.185:6783|82:af:45:b6:6c:28(master.k8s)]: connection shutting down due to error: Merge of incoming data causes: Entry 172.18.208.0-172.18.212.0 reporting too much free space: 2045 > 1024

How to reproduce it?

Not exactly sure when this happens, but it seems to have started after we removed 2 nodes from the k8s cluster.

Anything else we need to know?

Our cluster runs on bare metal and we deploy Weave with the k8s YAML file. Let me know if you want to see the configuration YAML; we only added the env variable IPALLOC_RANGE=172.18.128.0/17.
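
For illustration, one documented way to set such an env variable is through the Weave Net manifest URL parameters; the exact URL and version query below are an assumption, since we actually apply the YAML from a file:

$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&env.IPALLOC_RANGE=172.18.128.0/17"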

Versions:

$ weave version
2.5.1
$ docker version
18.06.1
$ uname -a
4.4.159-1.el7.elrepo.x86_64
$ kubectl version
1.12.1

@murali-reddy
Contributor

Thanks for reporting this issue.

There is a description of the situations in which this can arise:

#2083 (comment)
#1946

Please try the solution described in #1946 (comment)

@rongshen
Author

rongshen commented Mar 7, 2019

Thank you for getting back to me.
I'm quite new to Weave. I tried to follow the instructions in #1946 (comment), but I'm confused about how to do it in the following areas:

  1. How do I identify impacted nodes? Every node in my k8s cluster has this error in the log; does that mean every node is impacted?
  2. There is no longer a weave stop-router command. I tried weave --local stop in the container, but it is ignored:
/home/weave # ./weave --local stop
Weave is not running (ignore on Kubernetes).

Since it's ignored on a k8s cluster, how do I proceed?

One of my ideas is to delete every node from the k8s cluster and delete the Weave daemonset, then reinstall the daemonset and add each node back one by one. Do you think that would help? I'd like to avoid doing this because it requires a maintenance window, but if there is no other solution, I'll have to go this way.

Thank you

@murali-reddy
Contributor

How do I identify impacted nodes? Every node in my k8s cluster has this error in the log; does that mean every node is impacted?

It's odd that you are seeing this error on every node in the cluster. Just confirming: the error connection shutting down due to error: Merge of incoming data causes is what you are referring to, right?

Could you please share weave report from one or more nodes where you see this error?
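
Something along these lines should pull the report out of a given pod; the pod name is a placeholder, and the in-container path matches the one used later in this thread:

$ kubectl -n kube-system exec <weave-net-pod> -c weave -- /home/weave/weave --local report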

One of my ideas is to delete every node from the k8s cluster and delete the Weave daemonset, then reinstall the daemonset and add each node back one by one. Do you think that would help? I'd like to avoid doing this because it requires a maintenance window, but if there is no other solution, I'll have to go this way.

If you are on Kubernetes, then triggering a rolling update would be less disruptive. The manifest already has updateStrategy set to RollingUpdate; just edit the weave-net daemonset and change a trivial configuration value (like CPU requests).
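
A rough sketch (the container index and field path in the patch assume the stock weave-net manifest layout):

$ kubectl -n kube-system edit daemonset weave-net   # e.g. bump resources.requests.cpu on the weave container
# or as a one-liner:
$ kubectl -n kube-system patch daemonset weave-net --type=json \
    -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"15m"}]'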

@rongshen
Author

rongshen commented Mar 7, 2019

Thank you very much for the reply and the rolling-update idea.

It's odd that you are seeing this error on every node in the cluster. Just confirming: the error connection shutting down due to error: Merge of incoming data causes is what you are referring to, right?

I double-checked all nodes: 23 out of 26 nodes have this error, including the master node. The remaining 3 nodes have connection-reset errors on multiple entries to the master node, and 1 of them has a weird line saying:

connection shutting down due to error: Peer 92:c0:56:89:7c:fb says it owns the IP range from 172.18.212.0, which I think I own

Could you please share weave report from one or more nodes where you see this error?

Here is the weave status. It's quite long because we have 26 nodes in the k8s cluster, so I added it as a gist:
https://gist.github.com/rongshen/adf76f44a467d70c7213d77b8e7dea00

@murali-reddy
Contributor

I double-checked all nodes: 23 out of 26 nodes have this error, including the master node.

That sounds like the IPAM data is in an inconsistent state across the cluster nodes. Please delete the DB file on each node and restart the corresponding weave-net pod, or perform a rolling upgrade. See if this helps before you attempt to delete and re-add the nodes.
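
Roughly, per node (the DB path is the one used elsewhere in this thread; the name=weave-net label is an assumption about the stock manifest):

# on the affected node
$ sudo rm /var/lib/weave/*.db
# then find the weave-net pod scheduled on that node and delete it so it restarts with a clean state
$ kubectl -n kube-system get pods -l name=weave-net -o wide
$ kubectl -n kube-system delete pod <weave-net-pod-on-that-node>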

@rongshen
Author

I tried a rolling update but it didn't seem to work, so in the end I had to open a maintenance window, delete the daemonset, and reinstall the Weave daemonset.
For each node, I stopped the kubelet service, deleted /opt/cni/bin/weave-* and /var/lib/weave/*.db, and restarted the kubelet service (roughly as sketched below).
Afterwards, the errors are gone. Our cluster seems healthy again.
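
Per node (systemctl assumed, since we are on EL7; paths as above):

$ systemctl stop kubelet
$ rm -f /opt/cni/bin/weave-* /var/lib/weave/*.db
$ systemctl start kubelet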
Thank you for all your help!

@alok87
Contributor

alok87 commented Jan 26, 2020

@murali-reddy
We keep getting merge problems in one of our clusters.

The command below runs weave status connections in each weave pod and greps for Merge.
On every run the data is different; most of the time nothing shows up.

$ kn get pods -n kube-system -o wide | grep weave | awk '{print $1}' | xargs -I {} kubectln exec {} -c weave -n kube-system -- /home/weave/weave --local status connections | grep Merge
-> 10.2.20.25:6783       failed      Merge of incoming data causes: Entry 100.96.0.0-100.96.0.1 reporting too much free space: 2030 > 1, retry: 2020-01-26 15:24:58.430309658 +0000 UTC m=+57784.687751066
-> 10.2.21.252:6783      failed      Merge of incoming data causes: Entry 100.96.0.0-100.96.0.1 reporting too much free space: 2030 > 1, retry: 2020-01-26 15:24:58.528000782 +0000 UTC m=+28655.705965770

How do I find which node is causing the inconsistency?

@murali-reddy
Contributor

@alok87 please open a separate issue describing the symptoms you are seeing.

How do I find which node is causing the inconsistency?

IPAM inconsistency is not necessarily the result of a single node. Some of the cases are outlined in #1962.

If you are running 2.6, we have debug logging specific to IPAM to help find the sequence of IP allocations and gossip resulting in the inconsistency.
