This repository was archived by the owner on Jun 20, 2024. It is now read-only.

Peers not deleted in 2.5.X on Kubernetes #3602

Closed
sl4dy opened this issue Feb 25, 2019 · 12 comments

@sl4dy

sl4dy commented Feb 25, 2019

What did you expect to happen?

  • When deleting the node via kubectl delete node stg2-k8s-master01.int.na.xxx.com, I expect all peers to remove that peer.

What happened?

The peer gets deleted only on some nodes:

weave-net-2vb4f,stg2-k8s-worker01.int.na.xxx.com
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) (17 active)
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) 
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)     2048 IPs (00.8% of total) 
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total) 
22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)   221184 IPs (84.4% of total) - unreachable!
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) 
weave-net-54gjz,stg2-k8s-worker02.int.na.xxx.com
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total) (15 active)
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) 
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)     2048 IPs (00.8% of total) 
22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)   221184 IPs (84.4% of total) - unreachable!
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) 
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) 
weave-net-59xwv,stg2-k8s-master02.int.na.xxx.com
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) (6 active)
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) 
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) 
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total) 
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)   223232 IPs (85.2% of total) 
weave-net-chvvs,stg2-k8s-worker03.int.na.xxx.com
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)   223232 IPs (85.2% of total) (11 active)
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) 
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total) 
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) 
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) 
weave-net-mjrwh,stg2-k8s-worker04.int.na.xxx.com
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) (15 active)
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)   223232 IPs (85.2% of total) 
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) 
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) 
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total)
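For context, this per-node view can be collected with something like the following (a sketch assuming the standard weave-kube DaemonSet, i.e. pods labelled name=weave-net with a container named weave):

$ for pod in $(kubectl -n kube-system get pods -l name=weave-net -o jsonpath='{.items[*].metadata.name}'); do
    # print "<pod>,<node>" followed by that peer's view of IPAM
    node=$(kubectl -n kube-system get pod "$pod" -o jsonpath='{.spec.nodeName}')
    echo "$pod,$node"
    kubectl -n kube-system exec "$pod" -c weave -- /home/weave/weave --local status ipam
  done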

I see in the logs that stg2-k8s-worker03.int.na.xxx.com reclaims IPs of stg2-k8s-master01.int.na.xxx.com.

How to reproduce it?

Just delete the node via kubectl delete node. However, it does not happen every time.
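A minimal check for the symptom right after the delete (the weave pod name below is a placeholder for any pod on a surviving node):

$ kubectl delete node stg2-k8s-master01.int.na.xxx.com
$ # give the peer cleanup a few seconds, then look for a stale entry on a surviving node
$ kubectl -n kube-system exec <weave-pod> -c weave -- /home/weave/weave --local status ipam | grep unreachable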

Anything else we need to know?

Kubeadm-deployed cluster (multi-master setup) on a private cloud, running on CentOS 7.X.

Versions:

$ weave version
weave 2.5.1

$ docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:23:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:25:29 2018
  OS/Arch:          linux/amd64
  Experimental:     false

$ uname -a
Linux pi-k8s-master01 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:57:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

Logs:

{"log":"INFO: 2019/02/20 15:51:40.890280 [kube-peers] rmpeer of 22:c5:16:38:47:94: 221184 IPs taken over from 22:c5:16:38:47:94\n","stream":"stderr","time":"2019-02-20T15:51:40.890464184Z"}
{"log":"INFO: 2019/02/20 15:51:40.892571 -\u003e[172.26.78.28:42028|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Received update for IP range I own at 172.16.0.0 v167: incoming message says owner c2:f0:a1:67:d0:20 v168\n","stream":"stderr","time":"2019-02-20T15:51:40.893125576Z"}
{"log":"INFO: 2019/02/20 15:51:40.892635 -\u003e[172.26.78.28:6783|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Inconsistent entries for 172.16.192.0: owned by 22:c5:16:38:47:94 but incoming message says c2:f0:a1:67:d0:20\n","stream":"stderr","time":"2019-02-20T15:51:40.893431276Z"}
{"log":"INFO: 2019/02/20 15:51:40.892772 -\u003e[172.26.77.196:52867|0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)]: connection shutting down due to error: read tcp4 172.26.78.28:6783-\u003e172.26.77.196:52867: read: connection reset by peer\n","stream":"stderr","time":"2019-02-20T15:51:40.897326239Z"}
{"log":"INFO: 2019/02/20 15:51:40.892927 -\u003e[172.26.76.72:6783|22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)]: connection shutting down due to error: read tcp4 172.26.78.28:42028-\u003e172.26.76.72:6783: read: connection reset by peer\n","stream":"stderr","time":"2019-02-20T15:51:40.897347148Z"}
{"log":"INFO: 2019/02/20 15:51:40.899620 -\u003e[172.26.78.28:52318|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com) added to 0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.900784973Z"}
{"log":"INFO: 2019/02/20 15:51:40.901842 -\u003e[172.26.76.72:53086|22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to 22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com) added to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.902156463Z"}
{"log":"INFO: 2019/02/20 15:51:40.901989 -\u003e[172.26.77.196:6783|0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to 0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com) added to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.902171107Z"}
{"log":"INFO: 2019/02/20 15:51:40.902731 -\u003e[172.26.78.28:53357|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Inconsistent entries for 172.16.192.0: owned by 22:c5:16:38:47:94 but incoming message says c2:f0:a1:67:d0:20\n","stream":"stderr","time":"2019-02-20T15:51:40.902960342Z"}
{"log":"INFO: 2019/02/20 15:51:40.902778 -\u003e[172.26.79.241:6783|de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)]: connection shutting down due to error: read tcp4 172.26.78.28:53357-\u003e172.26.79.241:6783: read: connection reset by peer\n","stream":"stderr","time":"2019-02-20T15:51:40.902987536Z"}
{"log":"INFO: 2019/02/20 15:51:40.903152 -\u003e[172.26.78.28:6783|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com) added to 22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.904315944Z"}
{"log":"INFO: 2019/02/20 15:51:40.909251 -\u003e[172.26.79.241:34272|de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com) added to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.909383472Z"}
{"log":"INFO: 2019/02/20 15:51:40.911994 -\u003e[172.26.78.28:6783|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com) added to de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.912066662Z"}

Network:

@murali-reddy
Contributor

@sl4dy thanks for reporting the issue.

When a node is deleted, weave rmpeer and weave forget are performed automatically. As noted in the comment, it appears this action is somehow not being performed reliably.
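(For anyone who needs to clean up by hand, the manual equivalent is roughly the following, run against the weave container; the pod name and old node address are placeholders:)

$ # reclaim the IP ranges owned by the removed peer, identified by its peer ID
$ # note: rmpeer should be run on a single node only
$ kubectl -n kube-system exec <weave-pod> -c weave -- /home/weave/weave --local rmpeer 22:c5:16:38:47:94
$ # stop trying to reconnect to the removed node (can be run on each remaining node)
$ kubectl -n kube-system exec <weave-pod> -c weave -- /home/weave/weave --local forget <old-node-address>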

I will try again to reproduce the problem. Last time I tested, I used an AWS ASG to scale the nodes down/up several times and still could not reproduce it.

Just delete the node via kubectl delete node. However, it does not happen every time.

Are you able to reproduce this issue? How often are you able to observe this behaviour?

@sl4dy
Author

sl4dy commented Feb 25, 2019

@murali-reddy I am able to reproduce this issue. It happens in roughly 1 in 10 node deletions. However, I do not have a 100% reliable reproducer. If you need any extra debug info, just let me know.

@murali-reddy
Contributor

murali-reddy commented Feb 26, 2019

@sl4dy Are you trying this in a cloud environment or on bare metal? I tried again today on AWS, using an ASG to scale the nodes up and down, but was not able to reproduce it.

@sl4dy
Author

sl4dy commented Feb 26, 2019

@murali-reddy It is bare metal.

@murali-reddy
Contributor

@sl4dy Could you please share a snippet of the weave container logs from the time when you run kubectl delete node and can reproduce the issue? I tried to reproduce it with kubectl delete node followed by kubeadm reset & join in a loop, with no luck.
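(For reference, such a loop is roughly the following; the node name, API server endpoint, token and CA hash are placeholders:)

$ # on the control plane: remove the node from the cluster
$ kubectl delete node <node-name>
$ # on the removed node: wipe its kubeadm state and rejoin
$ sudo kubeadm reset -f
$ sudo kubeadm join <api-server>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>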

@sl4dy
Author

sl4dy commented Feb 27, 2019

@murali-reddy Here are the logs from all weave containers; it is a CSV export from Splunk. It starts just after the node deletion: Feb 20 16:51:40 stg2-k8s-master01 sudo: rundeck : TTY=unknown ; PWD=/home/rundeck ; USER=root ; COMMAND=/bin/bash -c kubectl delete node stg2-k8s-master01.int.na.xxx.com severity=notice

https://gist.githubusercontent.com/sl4dy/8b313e7f39c54861ae322074c52d208e/raw/65c6f18bef056ef2872091bcc0c4e61dc47d8504/weave.csv

@murali-reddy
Contributor

@sl4dy thanks for sharing the logs. It seems only one node identifies the disappeared node; the rest of the nodes don't see any disappeared nodes, so no peer reclaim is performed on them. Will investigate further.

$ cat weave.csv | grep "Nodes that have disappeared"
"{""log"":""DEBU: 2019/02/20 15:51:40.881564 [kube-peers] Nodes that have disappeared: map[22:c5:16:38:47:94:{22:c5:16:38:47:94 stg2-k8s-master01.int.na.xxx.com}]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:40.881715233Z""}","2019-02-20T16:51:40.881+0100",15,20,51,february,40,wednesday,2019,0,,"stg2-k8s-worker03.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:40.881564 [kube-peers] Nodes that have disappeared: map[22:c5:16:38:47:94:{22:c5:16:38:47:94 stg2-k8s-master01.int.na.xxx.com}]
"{""log"":""DEBU: 2019/02/20 15:51:40.920212 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:40.920281342Z""}","2019-02-20T16:51:40.920+0100",15,20,51,february,40,wednesday,2019,0,,"stg2-k8s-worker03.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:40.920212 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:41.318984 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:41.328668243Z""}","2019-02-20T16:51:41.318+0100",15,20,51,february,41,wednesday,2019,0,,"stg2-k8s-worker04.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:41.318984 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:41.726823 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:41.727850274Z""}","2019-02-20T16:51:41.726+0100",15,20,51,february,41,wednesday,2019,0,,"stg2-k8s-master02.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:41.726823 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:41.727433 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:41.728478726Z""}","2019-02-20T16:51:41.727+0100",15,20,51,february,41,wednesday,2019,0,,"stg2-k8s-master01.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:41.727433 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:42.792382 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:42.792599956Z""}","2019-02-20T16:51:42.792+0100",15,20,51,february,42,wednesday,2019,0,,"stg2-k8s-worker02.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:42.792382 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:42.948470 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:42.94941031Z""}","2019-02-20T16:51:42.948+0100",15,20,51,february,42,wednesday,2019,0,,"stg2-k8s-worker01.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:42.948470 [kube-peers] Nodes that have disappeared: map[]

@sl4dy
Author

sl4dy commented Mar 5, 2019

@murali-reddy Not sure if it is relevant, but we run with a custom, slightly smaller --ipalloc-range=172.16.0.0/14.
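(The effective range can be double-checked as below, assuming the range is set via the usual IPALLOC_RANGE environment variable on the weave-net DaemonSet; the pod name is a placeholder:)

$ kubectl -n kube-system describe ds weave-net | grep IPALLOC_RANGE
$ # the router also reports the range it is actually using
$ kubectl -n kube-system exec <weave-pod> -c weave -- /home/weave/weave --local status | grep Range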

@murali-reddy murali-reddy added this to the 2.6 milestone Mar 12, 2019
@murali-reddy
Contributor

OK, I am able to reproduce the issue. It happens only once in several retries, but it does happen. I guess the root cause is the same as for #3444.

@sl4dy
Author

sl4dy commented Mar 19, 2019

@murali-reddy Thanks for the info. Is there any better workaround than deleting the IPAM data on the affected nodes, as mentioned here? https://www.weave.works/docs/net/latest/tasks/ipam/troubleshooting-ipam/
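(For reference, the workaround referred to amounts to roughly the following on each affected node; the on-host path assumes the standard weave-kube manifest, which persists IPAM data under /var/lib/weave, so verify it against your deployment:)

$ # on the affected node: remove the persisted IPAM/ring state
$ sudo rm /var/lib/weave/weave-netdata.db
$ # restart the weave pod on that node so it rejoins with a clean allocation
$ kubectl -n kube-system delete pod <weave-pod-on-that-node>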

@murali-reddy
Contributor

@sl4dy Unfortunately, there is no better workaround at the moment until #1962 is addressed.

@murali-reddy
Contributor

Having worked on fixes for a couple of IPAM issues lately, I now have more clarity on this issue. A snippet from the logs:

{"log":"INFO: 2019/02/20 15:51:40.890280 [kube-peers] rmpeer of 22:c5:16:38:47:94: 221184 IPs taken over from 22:c5:16:38:47:94\n","stream":"stderr","time":"2019-02-20T15:51:40.890464184Z"}
{"log":"INFO: 2019/02/20 15:51:40.892635 -\u003e[172.26.78.28:6783|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Inconsistent entries for 172.16.192.0: owned by 22:c5:16:38:47:94 but incoming message says c2:f0:a1:67:d0:20\n","stream":"stderr","time":"2019-02-20T15:51:40.893431276Z"}

#3635 addresses this issue. The transfer of the range from one peer to another (performed in this case due to rmpeer) now takes precedence because we bump up the version by a large value, so there is no conflict on any peer receiving the update.

Closing this issue as the root cause (conflicting entries after rmpeer is performed) is fixed.

@sl4dy the fix will be available as part of 2.5.2.

@bboreham bboreham modified the milestones: 2.6, 2.5.2 Nov 4, 2019