This repository was archived by the owner on Jun 20, 2024. It is now read-only.

Peers not deleted in 2.5.X on Kubernetes #3602

Closed
sl4dy opened this issue Feb 25, 2019 · 12 comments

@sl4dy

sl4dy commented Feb 25, 2019

What did you expect to happen?

  • When deleting the node via kubectl delete node stg2-k8s-master01.int.na.xxx.com, I expect all peers to remove that peer.

What happened?

The peer gets deleted only on some nodes:

weave-net-2vb4f,stg2-k8s-worker01.int.na.xxx.com
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) (17 active)
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) 
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)     2048 IPs (00.8% of total) 
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total) 
22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)   221184 IPs (84.4% of total) - unreachable!
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) 
weave-net-54gjz,stg2-k8s-worker02.int.na.xxx.com
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total) (15 active)
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) 
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)     2048 IPs (00.8% of total) 
22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)   221184 IPs (84.4% of total) - unreachable!
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) 
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) 
weave-net-59xwv,stg2-k8s-master02.int.na.xxx.com
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) (6 active)
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) 
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) 
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total) 
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)   223232 IPs (85.2% of total) 
weave-net-chvvs,stg2-k8s-worker03.int.na.xxx.com
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)   223232 IPs (85.2% of total) (11 active)
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) 
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total) 
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) 
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) 
weave-net-mjrwh,stg2-k8s-worker04.int.na.xxx.com
ea:32:7c:16:dd:3c(stg2-k8s-worker04.int.na.xxx.com)    10240 IPs (03.9% of total) (15 active)
c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)   223232 IPs (85.2% of total) 
de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)     8192 IPs (03.1% of total) 
8e:61:2b:d5:a0:4f(stg2-k8s-master02.int.na.xxx.com)    16384 IPs (06.2% of total) 
0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)     4096 IPs (01.6% of total)
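For context, this per-node view can be collected with something like the following (a sketch assuming the standard weave-kube DaemonSet, i.e. pods labelled name=weave-net with a container named weave):

$ for pod in $(kubectl -n kube-system get pods -l name=weave-net -o jsonpath='{.items[*].metadata.name}'); do
    # print "<pod>,<node>" followed by that peer's view of IPAM
    node=$(kubectl -n kube-system get pod "$pod" -o jsonpath='{.spec.nodeName}')
    echo "$pod,$node"
    kubectl -n kube-system exec "$pod" -c weave -- /home/weave/weave --local status ipam
  done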

I see in the logs that stg2-k8s-worker03.int.na.xxx.com reclaims IPs of stg2-k8s-master01.int.na.xxx.com.

How to reproduce it?

Just delete the node via kubectl delete node. However, it does not happen every time.
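A minimal check for the symptom right after the delete (the weave pod name below is a placeholder for any pod on a surviving node):

$ kubectl delete node stg2-k8s-master01.int.na.xxx.com
$ # give the peer cleanup a few seconds, then look for a stale entry on a surviving node
$ kubectl -n kube-system exec <weave-pod> -c weave -- /home/weave/weave --local status ipam | grep unreachable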

Anything else we need to know?

Kubeadm-deployed cluster (multi-master setup) on a private cloud, running on CentOS 7.X.

Versions:

$ weave version
weave 2.5.1

$ docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:23:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:25:29 2018
  OS/Arch:          linux/amd64
  Experimental:     false

$ uname -a
Linux pi-k8s-master01 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:57:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:46:57Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

Logs:

{"log":"INFO: 2019/02/20 15:51:40.890280 [kube-peers] rmpeer of 22:c5:16:38:47:94: 221184 IPs taken over from 22:c5:16:38:47:94\n","stream":"stderr","time":"2019-02-20T15:51:40.890464184Z"}
{"log":"INFO: 2019/02/20 15:51:40.892571 -\u003e[172.26.78.28:42028|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Received update for IP range I own at 172.16.0.0 v167: incoming message says owner c2:f0:a1:67:d0:20 v168\n","stream":"stderr","time":"2019-02-20T15:51:40.893125576Z"}
{"log":"INFO: 2019/02/20 15:51:40.892635 -\u003e[172.26.78.28:6783|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Inconsistent entries for 172.16.192.0: owned by 22:c5:16:38:47:94 but incoming message says c2:f0:a1:67:d0:20\n","stream":"stderr","time":"2019-02-20T15:51:40.893431276Z"}
{"log":"INFO: 2019/02/20 15:51:40.892772 -\u003e[172.26.77.196:52867|0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)]: connection shutting down due to error: read tcp4 172.26.78.28:6783-\u003e172.26.77.196:52867: read: connection reset by peer\n","stream":"stderr","time":"2019-02-20T15:51:40.897326239Z"}
{"log":"INFO: 2019/02/20 15:51:40.892927 -\u003e[172.26.76.72:6783|22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)]: connection shutting down due to error: read tcp4 172.26.78.28:42028-\u003e172.26.76.72:6783: read: connection reset by peer\n","stream":"stderr","time":"2019-02-20T15:51:40.897347148Z"}
{"log":"INFO: 2019/02/20 15:51:40.899620 -\u003e[172.26.78.28:52318|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com) added to 0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.900784973Z"}
{"log":"INFO: 2019/02/20 15:51:40.901842 -\u003e[172.26.76.72:53086|22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to 22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com) added to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.902156463Z"}
{"log":"INFO: 2019/02/20 15:51:40.901989 -\u003e[172.26.77.196:6783|0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to 0e:93:27:8d:a0:02(stg2-k8s-worker02.int.na.xxx.com) added to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.902171107Z"}
{"log":"INFO: 2019/02/20 15:51:40.902731 -\u003e[172.26.78.28:53357|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Inconsistent entries for 172.16.192.0: owned by 22:c5:16:38:47:94 but incoming message says c2:f0:a1:67:d0:20\n","stream":"stderr","time":"2019-02-20T15:51:40.902960342Z"}
{"log":"INFO: 2019/02/20 15:51:40.902778 -\u003e[172.26.79.241:6783|de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)]: connection shutting down due to error: read tcp4 172.26.78.28:53357-\u003e172.26.79.241:6783: read: connection reset by peer\n","stream":"stderr","time":"2019-02-20T15:51:40.902987536Z"}
{"log":"INFO: 2019/02/20 15:51:40.903152 -\u003e[172.26.78.28:6783|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com) added to 22:c5:16:38:47:94(stg2-k8s-master01.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.904315944Z"}
{"log":"INFO: 2019/02/20 15:51:40.909251 -\u003e[172.26.79.241:34272|de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com) added to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.909383472Z"}
{"log":"INFO: 2019/02/20 15:51:40.911994 -\u003e[172.26.78.28:6783|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Multiple connections to c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com) added to de:73:44:a6:02:7f(stg2-k8s-worker01.int.na.xxx.com)\n","stream":"stderr","time":"2019-02-20T15:51:40.912066662Z"}

Network:

@murali-reddy
Contributor

@sl4dy thanks for reporting the issue.

When a node is deleted, weave rmpeer and weave forget are performed automatically. As noted in the comment, it appears this action is somehow not being performed reliably.
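(For anyone who needs to clean up by hand, the manual equivalent is roughly the following, run against the weave container; the pod name and old node address are placeholders:)

$ # reclaim the IP ranges owned by the removed peer, identified by its peer ID
$ # note: rmpeer should be run on a single node only
$ kubectl -n kube-system exec <weave-pod> -c weave -- /home/weave/weave --local rmpeer 22:c5:16:38:47:94
$ # stop trying to reconnect to the removed node (can be run on each remaining node)
$ kubectl -n kube-system exec <weave-pod> -c weave -- /home/weave/weave --local forget <old-node-address>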

I will try again to reproduce the problem. Last time I tested, I used an AWS ASG to scale the nodes down/up several times and still could not reproduce it.

Just delete the node via kubectl delete node. However, it does not happen every time.

Are you able to reproduce this issue? How often are you able to observe this behaviour?

@sl4dy
Author

sl4dy commented Feb 25, 2019

@murali-reddy I am able to reproduce this issue. It happens in roughly 1 in 10 node deletions. However, I do not have a 100% reliable reproducer. If you need any extra debug info, just let me know.

@murali-reddy
Contributor

murali-reddy commented Feb 26, 2019

@sl4dy Are you trying this in a cloud environment or on bare metal? I tried again today on AWS, using an ASG to scale the nodes up and down, but was not able to reproduce it.

@sl4dy
Author

sl4dy commented Feb 26, 2019

@murali-reddy It is bare metal.

@murali-reddy
Contributor

@sl4dy Could you please share a snippet of the weave container logs from the time when you run kubectl delete node and can reproduce the issue? I tried to reproduce it with kubectl delete node followed by kubeadm reset & join in a loop, with no luck.
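(For reference, such a loop is roughly the following; the node name, API server endpoint, token and CA hash are placeholders:)

$ # on the control plane: remove the node from the cluster
$ kubectl delete node <node-name>
$ # on the removed node: wipe its kubeadm state and rejoin
$ sudo kubeadm reset -f
$ sudo kubeadm join <api-server>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>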

@sl4dy
Author

sl4dy commented Feb 27, 2019

@murali-reddy Here are the logs from all weave containers; it is a CSV export from Splunk. It starts just after the node deletion: Feb 20 16:51:40 stg2-k8s-master01 sudo: rundeck : TTY=unknown ; PWD=/home/rundeck ; USER=root ; COMMAND=/bin/bash -c kubectl delete node stg2-k8s-master01.int.na.xxx.com severity=notice

https://gist.githubusercontent.com/sl4dy/8b313e7f39c54861ae322074c52d208e/raw/65c6f18bef056ef2872091bcc0c4e61dc47d8504/weave.csv

@murali-reddy
Contributor

@sl4dy thanks for sharing the logs. It seems only one node identifies the disappeared node; the rest of the nodes don't see any disappeared nodes, so no peer reclaim is performed on them. Will investigate further.

$ cat weave.csv | grep "Nodes that have disappeared"
"{""log"":""DEBU: 2019/02/20 15:51:40.881564 [kube-peers] Nodes that have disappeared: map[22:c5:16:38:47:94:{22:c5:16:38:47:94 stg2-k8s-master01.int.na.xxx.com}]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:40.881715233Z""}","2019-02-20T16:51:40.881+0100",15,20,51,february,40,wednesday,2019,0,,"stg2-k8s-worker03.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:40.881564 [kube-peers] Nodes that have disappeared: map[22:c5:16:38:47:94:{22:c5:16:38:47:94 stg2-k8s-master01.int.na.xxx.com}]
"{""log"":""DEBU: 2019/02/20 15:51:40.920212 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:40.920281342Z""}","2019-02-20T16:51:40.920+0100",15,20,51,february,40,wednesday,2019,0,,"stg2-k8s-worker03.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:40.920212 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:41.318984 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:41.328668243Z""}","2019-02-20T16:51:41.318+0100",15,20,51,february,41,wednesday,2019,0,,"stg2-k8s-worker04.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:41.318984 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:41.726823 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:41.727850274Z""}","2019-02-20T16:51:41.726+0100",15,20,51,february,41,wednesday,2019,0,,"stg2-k8s-master02.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:41.726823 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:41.727433 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:41.728478726Z""}","2019-02-20T16:51:41.727+0100",15,20,51,february,41,wednesday,2019,0,,"stg2-k8s-master01.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:41.727433 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:42.792382 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:42.792599956Z""}","2019-02-20T16:51:42.792+0100",15,20,51,february,42,wednesday,2019,0,,"stg2-k8s-worker02.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:42.792382 [kube-peers] Nodes that have disappeared: map[]
"{""log"":""DEBU: 2019/02/20 15:51:42.948470 [kube-peers] Nodes that have disappeared: map[]\n"",""stream"":""stderr"",""time"":""2019-02-20T15:51:42.94941031Z""}","2019-02-20T16:51:42.948+0100",15,20,51,february,42,wednesday,2019,0,,"stg2-k8s-worker01.int.na.xxx.com","dev-gdc",1,"DEBU: 2019/02/20 15:51:42.948470 [kube-peers] Nodes that have disappeared: map[]

@sl4dy
Author

sl4dy commented Mar 5, 2019

@murali-reddy Not sure if it is relevant, but we run with a custom, slightly smaller --ipalloc-range=172.16.0.0/14.
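(The effective range can be double-checked as below, assuming the range is set via the usual IPALLOC_RANGE environment variable on the weave-net DaemonSet; the pod name is a placeholder:)

$ kubectl -n kube-system describe ds weave-net | grep IPALLOC_RANGE
$ # the router also reports the range it is actually using
$ kubectl -n kube-system exec <weave-pod> -c weave -- /home/weave/weave --local status | grep Range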

@murali-reddy murali-reddy added this to the 2.6 milestone Mar 12, 2019
@murali-reddy
Contributor

OK, I am able to reproduce the issue. It happens only once in several retries, but it does happen. I guess the root cause is the same as for #3444.

@sl4dy
Author

sl4dy commented Mar 19, 2019

@murali-reddy Thanks for the info. Is there any better workaround than deleting the IPAM data on the affected nodes, as mentioned here? https://www.weave.works/docs/net/latest/tasks/ipam/troubleshooting-ipam/
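(For reference, the workaround referred to amounts to roughly the following on each affected node; the on-host path assumes the standard weave-kube manifest, which persists IPAM data under /var/lib/weave, so verify it against your deployment:)

$ # on the affected node: remove the persisted IPAM/ring state
$ sudo rm /var/lib/weave/weave-netdata.db
$ # restart the weave pod on that node so it rejoins with a clean allocation
$ kubectl -n kube-system delete pod <weave-pod-on-that-node>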

@murali-reddy
Contributor

@sl4dy Unfortunately, there is no better workaround at the moment until #1962 is addressed.

@murali-reddy
Contributor

Having worked on fixes for a couple of IPAM issues lately, I now have more clarity on this issue. A snippet from the logs:

{"log":"INFO: 2019/02/20 15:51:40.890280 [kube-peers] rmpeer of 22:c5:16:38:47:94: 221184 IPs taken over from 22:c5:16:38:47:94\n","stream":"stderr","time":"2019-02-20T15:51:40.890464184Z"}
{"log":"INFO: 2019/02/20 15:51:40.892635 -\u003e[172.26.78.28:6783|c2:f0:a1:67:d0:20(stg2-k8s-worker03.int.na.xxx.com)]: connection shutting down due to error: Inconsistent entries for 172.16.192.0: owned by 22:c5:16:38:47:94 but incoming message says c2:f0:a1:67:d0:20\n","stream":"stderr","time":"2019-02-20T15:51:40.893431276Z"}

#3635 addresses this issue. The transfer of the range from one peer to another (performed in this case due to rmpeer) now takes precedence because we bump up the version by a large value, so there is no conflict on any peer receiving the update.

Closing this issue as the root cause (conflicting entries after rmpeer is performed) is fixed.

@sl4dy the fix will be available as part of 2.5.2.

@bboreham bboreham modified the milestones: 2.6, 2.5.2 Nov 4, 2019