Remove deleted k8s nodes from Weave Net #2797
Comments
@pmcq suggests (in #2807) to use a hook. However, it seems that the hook would run on every termination, even when k8s is just restarting the Pod; that would disconnect all containers running on such a node from the cluster. It would also open a window for IPAM races. In addition, the hook won't run if a machine is stopped non-gracefully. One very complicated solution is to subscribe to the k8s API server and run Paxos to elect a leader which would run the removal. |
We don't need to run Paxos ourselves; Kubernetes is built on a fully-consistent store, which we can use for arbitrary purposes via annotations. Example code (that's using an annotation on a service; I guess we can use one on the DaemonSet) |
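To make the annotation idea concrete, here is a minimal sketch of using the API server's optimistic concurrency as a once-only lock; it assumes kubectl and jq are available in the pod, and the rmpeer-lock annotation key is purely illustrative, not part of Weave Net:

#!/bin/bash
# Sketch: use the weave-net DaemonSet's annotations plus a resourceVersion
# compare-and-swap as a lock, so only one peer runs the removal.
set -eu
ns=kube-system
ds=weave-net

obj=$(kubectl -n "$ns" get daemonset "$ds" -o json)
rv=$(echo "$obj" | jq -r '.metadata.resourceVersion')
holder=$(echo "$obj" | jq -r '.metadata.annotations["rmpeer-lock"] // empty')

if [ -z "$holder" ]; then
  # The annotate call fails if someone else updated the object after we read it
  # (stale resourceVersion) or already set the key, so at most one peer wins.
  if kubectl -n "$ns" annotate daemonset "$ds" --resource-version="$rv" "rmpeer-lock=$(hostname)"; then
    echo "lock acquired; safe to run the peer removal from this pod"
  fi
fi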
What happens if you run rmpeer for the same node more than once? Does that actually break something, or does it just mean one of the rmpeer calls fails harmlessly? |
|
This is covered in the FAQ:
|
Oh, I see, thanks. Apologies, I didn't see that |
If anyone else is being hit by this on a prod cluster and wants an interim fix, this is what we hacked together today: https://gist.github.com/mikebryant/f5b25f9b14e5d6275ff0d3e934f73f12 Assumes all of your weave peers are Kubernetes nodes, and assumes none of your node names are strict prefixes of other node names (Leader election by cloud-provider volume mounting) |
Is this something that is planned? It has become unmanageable. Every time I remove a node from the cluster I need to do this manually. Writing a script that does it is fine, but this is very hacky, especially given that most other Kubernetes components handle their duties by themselves. |
Implementation is under way at #3022, but see the comments for some ugly complications. Could you say why you call |
This bit us hard today. Our K8s masters all struggled to come up, making our cluster unstable. One of the etcd peers was timing out trying to access the other peer at an incorrect IP ... which meant all the masters were screwed. Guided by #2797 (comment), we got into a weave pod and ran this:

#!/bin/bash
set -eu
# install kubectl
kubectl_version="1.6.7"
curl -o /usr/local/bin/kubectl "https://storage.googleapis.com/kubernetes-release/release/v${kubectl_version}/bin/linux/amd64/kubectl"
chmod +x /usr/local/bin/kubectl
# get list of nicknames from weave
curl -H "Accept: application/json" http://localhost:6784/report | jq -r .IPAM.Entries[].Nickname | sort | uniq > /tmp/nicknames
# get list of available nodes from kubernetes
kubectl get node -o custom-columns=name:.metadata.name --no-headers | xargs -n1 -I '{}' echo '{}' | cut -d'.' -f1 | sort > /tmp/node-names
# diff, basically what's unavailable
grep -F -x -v -f /tmp/node-names /tmp/nicknames > /tmp/nodes-unavailable
# rmpeer unavailable nodes
cat /tmp/nodes-unavailable | xargs -n 1 -I '{}' curl -H "Accept: application/json" -X DELETE 'http://localhost:6784/peer/{}'

This is what we had before:
And now we have 😰:
|
@bboreham based on the result in #2797 (comment) ... what are the implications of this ... 97.9% of total on one host.
|
That is what you achieved by reclaiming all the "unreachable" space on that one peer. As other peers run out of space, or start anew, they will request space from that one. It's an ok state to be in, unless you shut down that peer for good without telling Weave Net, in which case you will be back in the previous situation. |
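A quick way to see how that ownership is spread is to count IPAM ring entries per nickname from the same report endpoint used in the script above; entry counts are only a rough proxy for the percentages, since entries cover ranges of different sizes:

# Count IPAM ring entries per peer nickname from the local weave report.
curl -s -H "Accept: application/json" http://localhost:6784/report \
  | jq -r '.IPAM.Entries[].Nickname' | sort | uniq -c | sort -rn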
@bboreham great ... thanks for the explanation. |
After removing the peers, I now have this:
What are the implications of that? Should I be worried? |
It turns out I should, things are not working :D |
BTW: I had to remove all things related to weave and re-create it. Domains were not being resolved anymore to the outside world. Not sure if it is related to this problem or not. This was a fun Friday night. |
@caarlos0 99.9% of your cluster was unreachable ...
@mikebryant shared how he recovered from this in #2797 (comment). I've tried to make his solution clearer in #2797 (comment). @bboreham explains what's happening in #2797 (comment). |
@itskingori yeah, this all-unreachable state was after I removed the peers that didn't exist anymore, using the scripts provided. I did that, and pods started to launch again, but DNS to the "outside world" inside the containers wasn't working. Because pods kept restarting, my entire cluster entered a broken state, where nodes were failing. Ultimately, I had to terminate all nodes, remove the weave daemon-set and re-create it (also upgrading from 1.9.4 to 2.0.4). |
@bboreham It seems that reclaiming IP addresses to a single node creates a potential single point of failure. When the node that holds all of the IPAM allocations dies, how do they get reclaimed? Sure, if we are constantly running the script, they would theoretically be reclaimed by another running node once k8s realized the script's pod is no longer up and rescheduled it. This could take several precarious seconds, assuming all the pieces fall correctly. Wouldn't it be better if the IPs were allocated evenly across the cluster instead of hoarded by one node? Is there a way to do this? |
IP ownership is generally evenly spread, so at any one time we would be reclaiming some fraction of all IPs from those nodes that have gone away without telling us. Re-running the reclaim periodically, instead of just when a node starts, is a worthwhile improvement to reduce the window. You've certainly identified an edge case, @natewarr, but I think I'd want to see evidence that it can happen for real before making the implementation much more complicated. |
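A minimal sketch of such a periodic re-run, assuming the reclaim steps above are saved as a script (the path and interval below are illustrative, not existing files or flags):

#!/bin/bash
# Re-run the reclaim on an interval rather than only at node start,
# to shrink the window in which departed peers still hold address space.
set -u
interval="${RECLAIM_INTERVAL:-300}"        # seconds; illustrative default
while true; do
  /usr/local/bin/reclaim-gone-peers.sh || echo "reclaim failed; will retry" >&2
  sleep "$interval"
done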
I probably need to change my name to "TheEdgeCase". We will find an acceptable workaround. Thanks for your work on this bug! |
Any suggestions on this situation? It looks like a new node got spun up on the same IP as the old node, and that happens to be the node on which we are running the rmpeers.
|
@natewarr I can see that AWS is re-using IPs and hence hostnames; Weave Net internally works off the "unique peer ID", which is generated in various ways. I am unclear what you need suggestions for, sorry. |
FWIW this issue with peers unreachable happens a lot more if you have an elastic cluster (obviously, considering more instances are launched and terminated). |
I see. I was getting hung up on the use of the Nicknames as used in the gist hack. I was able to run this to reclaim that peer with the duplicate IP.
|
@natewarr ok, that makes perfect sense. So my suggestion would have been to drop to the peer-id (the hex number that looks like a mac address), which you did 😄 |
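For anyone else hitting the re-used-IP case, the same HTTP endpoint used in the script above also takes the peer ID in place of the nickname, which is how @natewarr resolved it (the ID below is made up):

# Remove one specific peer by its peer ID (the hex, MAC-like identifier)
# so that a node re-using the same hostname/IP is left alone.
curl -H "Accept: application/json" -X DELETE \
  'http://localhost:6784/peer/aa:bb:cc:dd:ee:ff'   # illustrative peer ID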
Please note the code to remove deleted Kubernetes peers from a cluster was released today, in Weave Net version 2.1.1. |
@bboreham @natewarr we are also running an older version of weave and were facing this issue in our staging cluster, so we ran the script for the prod cluster just to be safe (as there were many unreachable IPs there too). But now 87.9% of the IPs are present on a single node. How do we avoid this? This node going down would recreate the problem.
|
@alok87 that's as far as the workaround they wrote up will get you. The other nodes can request that this node share with them as needed, so it's not technically an error condition. If you lose that node without reclaiming the addresses somehow, you are back in the error condition. I imagine the weave guys will just tell you to update past 2.1.1. |
@alok87 try spreading the clearing out across different instances of weave (the script just clears everything from one weave instance ... and the weave that runs the clearing claims the cleared IPs). The fundamental command is mentioned in #2797 (comment). |
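One hedged way to do that spreading, assuming the weave pods carry the usual name=weave-net label, the container is called weave and has curl available, and the /tmp/nodes-unavailable list from the earlier script is at hand:

#!/bin/bash
# Round-robin the peer removals across all running weave pods so the
# reclaimed address space isn't concentrated on a single peer.
set -eu
pods=( $(kubectl -n kube-system get pods -l name=weave-net -o jsonpath='{.items[*].metadata.name}') )
i=0
while read -r peer; do
  pod="${pods[$((i % ${#pods[@]}))]}"
  kubectl -n kube-system exec "$pod" -c weave -- \
    curl -s -H "Accept: application/json" -X DELETE "http://localhost:6784/peer/${peer}"
  i=$((i + 1))
done < /tmp/nodes-unavailable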
Still having this issue in a new kubernetes 1.8.8 cluster (launched with kops 1.8) and weave 2.2.0 on a pre-existing VPC. The cluster was still small (went from 3 nodes to 2 nodes - plus master, so, 4 to 3 in total) - maybe that is the reason (not enough instances for a quorum)? On the other hand, this also still happens on two old kubernetes 1.5.x clusters running weave 2.2.0 on the same VPC. Those clusters have more nodes - one has between 6 and 10, commonly 7/8, and the other between 5 and 8, commonly 6. All 3 clusters run cluster-autoscaler (different versions due to kubernetes version limitations). Is there something I should be looking at? Any guesses on what's the reason for this problem? |
ohh, so that's it: a node goes down, and the cleanup doesn't run until a new node is up... OK then, no issue 👍 thanks |
I have seen this same issue with weave 2.2.0, kops 1.9.0-beta-2, k8s 1.9.3 |
I'm also seeing this issue with weave 2.3.0, kops 1.9.0, k8s 1.9.3 |
Please don’t comment on old, closed issues. Open a new issue and provide the details which will allow your issue to be debugged. |
weave-kube adds all current nodes as peers at startup, but never checks back to see if some nodes have been deleted. In a situation such as a regularly expanding and contracting auto-scale group, the IPAM ring will eventually become clogged with peers that have gone away.

We need to do weave rmpeer on deleted nodes, and, less importantly, weave forget. We will need some interlock to ensure the weave rmpeer is only done once.

How do we even detect that a Weave Net peer originated as a Kubernetes node? This logic should be resilient to users adding non-Kubernetes peers to the network, even on a host that was previously a Kubernetes peer.