Peers not deleted in 2.5.X on Kubernetes #3602
Comments
@sl4dy thanks for reporting the issue. When a node is deleted, the remaining peers are meant to automatically perform the equivalent of `rmpeer` to reclaim its address space. I will try again to reproduce the problem. Last time I tested I used an AWS ASG to scale the nodes down and up several times and still could not reproduce it.
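For reference, here is roughly how the result of that automatic reclaim can be checked from a running weave pod. This is only a sketch: the `name=weave-net` pod label, the `weave` container name and the `/home/weave/weave` path are assumptions based on the stock weave-net DaemonSet manifest, so adjust them for your deployment.

```sh
# Pick one weave-net pod (label and container name assumed from the stock DaemonSet)
POD=$(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[0].metadata.name}')

# IPAM view: after the automatic reclaim, the deleted node should no longer
# own any of the address space listed here
kubectl exec -n kube-system "$POD" -c weave -- /home/weave/weave --local status ipam
```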
Are you able to reproduce this issue? How often are you able to observe this behaviour?
@murali-reddy I am able to reproduce this issue. It happens in roughly 1 out of 10 node deletions, but I do not have a 100% reliable reproducer. If you need some extra debug info or anything else, just let me know.
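When it does reproduce, this is roughly how it can be spotted: ask every weave pod for its peer list and compare. A rough sketch, with the same assumptions about the stock DaemonSet label, container name and script path as above:

```sh
# Ask every weave pod for its peer list; when the bug hits, the deleted node
# keeps showing up on some pods but not on others
for POD in $(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $POD =="
  kubectl exec -n kube-system "$POD" -c weave -- /home/weave/weave --local status peers
done
```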
@sl4dy Are you trying this in a cloud environment or on bare metal? I tried again today on AWS, using an ASG to scale the nodes up and down, but was not able to reproduce it.
@murali-reddy It is bare metal.
@sl4dy Could you please share a snippet of the weave container logs from around the time you perform the `kubectl delete node`?
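If it helps, something along these lines should pull the weave container logs from every node into per-node files (again assuming the stock `name=weave-net` label and `weave` container name):

```sh
# Dump the weave container log from every weave-net pod into a per-node file
for POD in $(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[*].metadata.name}'); do
  NODE=$(kubectl get pod -n kube-system "$POD" -o jsonpath='{.spec.nodeName}')
  kubectl logs -n kube-system "$POD" -c weave > "weave-${NODE}.log"
done
```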
@murali-reddy Here are the logs from all weave containers; it is a CSV export from Splunk. It starts just after the node deletion.
@sl4dy thanks for sharing the logs. It seems only one node identifies the disappeared node; the rest of the nodes don't see any disappeared nodes, so there is no peer removal performed on them.
@murali-reddy Not sure if it is relevant, but we run on a custom, slightly smaller
OK, I am able to reproduce the issue. It happens once in several retries, but it does happen. I guess the root cause is the same as for #3444.
@murali-reddy Thanks for the info. Is there any better workaround than deleting the IPAM data on the affected nodes, as mentioned here? https://www.weave.works/docs/net/latest/tasks/ipam/troubleshooting-ipam/
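(For context, the workaround described on that page boils down to something like the sketch below. The on-host path `/var/lib/weave/weave-netdata.db` and the placeholders are assumptions taken from that documentation and the stock DaemonSet; deleting the file discards the node's persisted IPAM state, so use with care.)

```sh
# On the affected node: drop the persisted IPAM database, then restart the
# weave pod so it rejoins the cluster with a clean IPAM state
ssh <affected-node> sudo rm /var/lib/weave/weave-netdata.db
kubectl delete pod -n kube-system <weave-pod-on-that-node>   # the DaemonSet recreates it
kubectl get pods -n kube-system -l name=weave-net -o wide    # wait for it to come back Ready
```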
Having worked on fixes for a couple of IPAM issues lately, I have more clarity into this issue now. Snippet from the logs:
#3635 addressed this issue. The transfer of a range from one peer to another (performed in this case due to `rmpeer`) now takes precedence, as we bump up the version by a large value, so there is no conflict on any peer receiving the update. Closing this issue, as the root cause (conflicting entries after `rmpeer` is performed) is fixed. @sl4dy the fix will be available as part of 2.5.2.
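Once you are on 2.5.2, a quick sanity check along these lines should confirm the behaviour. As before, the label, container name and script path are assumptions from the stock DaemonSet, and the grep relies on `weave status ipam` marking space held by peers it can no longer reach as unreachable.

```sh
# After upgrading: delete a node, give the peers a moment, then make sure no
# pod still reports address space for an unreachable peer
kubectl delete node <node-name>
sleep 60
for POD in $(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $POD =="
  kubectl exec -n kube-system "$POD" -c weave -- /home/weave/weave --local status ipam | grep -i unreachable
done
```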
What you expected to happen?
kubectl delete node stg2-k8s-master01.int.na.xxx.com
I expect that all peers delete that one peer.

What happened?
The peer gets deleted only on some nodes:
I see in the logs that stg2-k8s-worker03.int.na.xxx.com reclaims the IPs of stg2-k8s-master01.int.na.xxx.com.

How to reproduce it?
Just delete the node via `kubectl delete node`. However, it does not always happen.

Anything else we need to know?
Kubeadm-deployed cluster (multi-master setup) on a private cloud, running on CentOS 7.X.
Versions:
Logs:
Network: