Remove deleted k8s nodes from Weave Net #2797
Comments
@pmcq suggests (in #2807) to use a hook. However, it seems that the hook would run on every termination, even when k8s is just restarting the Pod; that would disconnect all containers running on such a node from the cluster. It would also open a window for IPAM races. In addition, the hook won't run if a machine is stopped non-gracefully. One very complicated solution is to subscribe to the k8s API server and run Paxos to elect a leader which would run the removal. |
We don't need to run Paxos ourselves; Kubernetes is built on a fully-consistent store, which we can use for arbitrary purposes via annotations. Example code (that's using an annotation on a service; I guess we can use one on the DaemonSet) |
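To make the annotation idea concrete, here is a minimal sketch of using the API server's optimistic concurrency as a once-only lock; it assumes kubectl and jq are available in the pod, and the rmpeer-lock annotation key is purely illustrative, not part of Weave Net:

#!/bin/bash
# Sketch: use the weave-net DaemonSet's annotations plus a resourceVersion
# compare-and-swap as a lock, so only one peer runs the removal.
set -eu
ns=kube-system
ds=weave-net

obj=$(kubectl -n "$ns" get daemonset "$ds" -o json)
rv=$(echo "$obj" | jq -r '.metadata.resourceVersion')
holder=$(echo "$obj" | jq -r '.metadata.annotations["rmpeer-lock"] // empty')

if [ -z "$holder" ]; then
  # The annotate call fails if someone else updated the object after we read it
  # (stale resourceVersion) or already set the key, so at most one peer wins.
  if kubectl -n "$ns" annotate daemonset "$ds" --resource-version="$rv" "rmpeer-lock=$(hostname)"; then
    echo "lock acquired; safe to run the peer removal from this pod"
  fi
fi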
What happens if you run rmpeer for the same node more than once? Does that actually break something, or does it just mean one of the rmpeer calls fails harmlessly? |
|
This is covered in the FAQ:
|
Oh, I see, thanks. Apologies, I didn't see that |
If anyone else is being hit by this on a prod cluster and wants an interim fix, this is what we hacked together today: https://gist.github.com/mikebryant/f5b25f9b14e5d6275ff0d3e934f73f12 Assumes all of your weave peers are Kubernetes nodes, and assumes none of your node names are strict prefixes of other node names (Leader election by cloud-provider volume mounting) |
Is this something that is planned? It has become unmanageable. Every time I remove a node from the cluster I need to do this manually. Writing a script that does it is fine, but this is very hacky, especially given that most other Kubernetes components handle their duties by themselves. |
Implementation is under way at #3022, but see the comments for some ugly complications. Could you say why you call |
This bit us hard today. Our K8s masters all struggled to come up, making our cluster unstable. One of the etcd peers was timing out trying to access the other peer at an incorrect IP ... which meant all the masters were screwed. Guided by #2797 (comment), we got into a weave pod and ran this:

#!/bin/bash
set -eu
# install kubectl
kubectl_version="1.6.7"
curl -o /usr/local/bin/kubectl "https://storage.googleapis.com/kubernetes-release/release/v${kubectl_version}/bin/linux/amd64/kubectl"
chmod +x /usr/local/bin/kubectl
# get list of nicknames from weave
curl -H "Accept: application/json" http://localhost:6784/report | jq -r .IPAM.Entries[].Nickname | sort | uniq > /tmp/nicknames
# get list of available nodes from kubernetes
kubectl get node -o custom-columns=name:.metadata.name --no-headers | xargs -n1 -I '{}' echo '{}' | cut -d'.' -f1 | sort > /tmp/node-names
# diff, basically what's unavailable
grep -F -x -v -f /tmp/node-names /tmp/nicknames > /tmp/nodes-unavailable
# rmpeer unavailable nodes
cat /tmp/nodes-unavailable | xargs -n 1 -I '{}' curl -H "Accept: application/json" -X DELETE 'http://localhost:6784/peer/{}'

This is what we had before:
And now we have 😰:
|
@bboreham based on the result in #2797 (comment) ... what are the implications of this ... 97.9% of total on one host.
|
That is what you achieved by reclaiming all the "unreachable" space on that one peer. As other peers run out of space, or start anew, they will request space from that one. It's an ok state to be in, unless you shut down that peer for good without telling Weave Net, in which case you will be back in the previous situation. |
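A quick way to see how that ownership is spread is to count IPAM ring entries per nickname from the same report endpoint used in the script above; entry counts are only a rough proxy for the percentages, since entries cover ranges of different sizes:

# Count IPAM ring entries per peer nickname from the local weave report.
curl -s -H "Accept: application/json" http://localhost:6784/report \
  | jq -r '.IPAM.Entries[].Nickname' | sort | uniq -c | sort -rn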
@bboreham great ... thanks for the explanation. |
After removing the peers, I now have this:
What are the implications of that? Should I be worried? |
It turns out I should, things are not working :D |
BTW: I had to remove all things related to weave and re-create it. Domains were not being resolved anymore to the outside world. Not sure if it is related to this problem or not. This was a fun Friday night. |
@caarlos0 99.9% of your cluster was unreachable ...
@mikebryant shared how he recovered from this in #2797 (comment). I've tried to make his solution clearer in #2797 (comment). @bboreham explains what's happening in #2797 (comment). |
@itskingori yeah, this all-unreachable state was after I removed the peers that didn't exist anymore, using the scripts provided. I did that, and pods started to launch again, but DNS to the "outside world" inside the containers wasn't working. Because pods kept restarting, my entire cluster entered a broken state, where nodes were failing. Ultimately, I had to terminate all nodes, remove the weave daemon-set and re-create it (also upgrading from 1.9.4 to 2.0.4). |
@bboreham It seems that reclaiming IP addresses to a single node creates a potential single point of failure. When the node that holds all of the IPAM allocations dies, how do they get reclaimed? Sure, if we are constantly running the script, they would theoretically be reclaimed by another running node once k8s realized the script's pod is no longer up and rescheduled it. This could take several precarious seconds, assuming all the pieces fall correctly. Wouldn't it be better if the IPs were allocated evenly across the cluster instead of hoarded by one node? Is there a way to do this? |
IP ownership is generally evenly spread, so at any one time we would be reclaiming some fraction of all IPs from those nodes that have gone away without telling us. Re-running the reclaim periodically, instead of just when a node starts, is a worthwhile improvement to reduce the window. You've certainly identified an edge case, @natewarr, but I think I'd want to see evidence that it can happen for real before making the implementation much more complicated. |
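A minimal sketch of such a periodic re-run, assuming the reclaim steps above are saved as a script (the path and interval below are illustrative, not existing files or flags):

#!/bin/bash
# Re-run the reclaim on an interval rather than only at node start,
# to shrink the window in which departed peers still hold address space.
set -u
interval="${RECLAIM_INTERVAL:-300}"        # seconds; illustrative default
while true; do
  /usr/local/bin/reclaim-gone-peers.sh || echo "reclaim failed; will retry" >&2
  sleep "$interval"
done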
I probably need to change my name to "TheEdgeCase". We will find an acceptable workaround. Thanks for your work on this bug! |
Any suggestions on this situation? It looks like a new node got spun up on the same IP as the old node, and that happens to be the node on which we are running the rmpeers.
|
@natewarr I can see that AWS is re-using IPs and hence hostnames; Weave Net internally works off the "unique peer ID", which is generated in various ways. I am unclear what you need suggestions for, sorry. |
FWIW this issue with peers unreachable happens a lot more if you have an elastic cluster (obviously, considering more instances are launched and terminated). |
I see. I was getting hung up on the use of the Nicknames as used in the gist hack. I was able to run this to reclaim that peer with the duplicate IP.
|
@natewarr ok, that makes perfect sense. So my suggestion would have been to drop to the peer-id (the hex number that looks like a mac address), which you did 😄 |
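For anyone else hitting the re-used-IP case, the same HTTP endpoint used in the script above also takes the peer ID in place of the nickname, which is how @natewarr resolved it (the ID below is made up):

# Remove one specific peer by its peer ID (the hex, MAC-like identifier)
# so that a node re-using the same hostname/IP is left alone.
curl -H "Accept: application/json" -X DELETE \
  'http://localhost:6784/peer/aa:bb:cc:dd:ee:ff'   # illustrative peer ID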
Please note the code to remove deleted Kubernetes peers from a cluster was released today, in Weave Net version 2.1.1. |
@bboreham @natewarr we are also running an older version of weave and were facing this issue in our staging cluster, so we ran the script for the prod cluster just to be safe (as there were many unreachable IPs there too). But now 87.9% of the IPs are present on a single node. How do we avoid this? This node going down would recreate the problem.
|
@alok87 that's as far as the workaround they wrote up will get you. The other nodes can request that this node share with them as needed, so it's not technically an error condition. If you lose that node without reclaiming the addresses somehow, you are back in the error condition. I imagine the weave guys will just tell you to update past 2.1.1. |
@alok87 try spreading the clearing out across different instances of weave (the script just clears everything from one weave instance ... and the weave that runs the clearing claims the cleared IPs). The fundamental command is mentioned in #2797 (comment). |
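One hedged way to do that spreading, assuming the weave pods carry the usual name=weave-net label, the container is called weave and has curl available, and the /tmp/nodes-unavailable list from the earlier script is at hand:

#!/bin/bash
# Round-robin the peer removals across all running weave pods so the
# reclaimed address space isn't concentrated on a single peer.
set -eu
pods=( $(kubectl -n kube-system get pods -l name=weave-net -o jsonpath='{.items[*].metadata.name}') )
i=0
while read -r peer; do
  pod="${pods[$((i % ${#pods[@]}))]}"
  kubectl -n kube-system exec "$pod" -c weave -- \
    curl -s -H "Accept: application/json" -X DELETE "http://localhost:6784/peer/${peer}"
  i=$((i + 1))
done < /tmp/nodes-unavailable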
Still having this issue in a new kubernetes 1.8.8 cluster (launched with kops 1.8) and weave 2.2.0 on a pre-existing VPC. The cluster was still small (went from 3 nodes to 2 nodes - plus master, so, 4 to 3 in total) - maybe that is the reason (not enough instances for a quorum)? On the other hand, this also still happens on two old kubernetes 1.5.x clusters running weave 2.2.0 on the same VPC. Those clusters have more nodes - one has between 6 and 10, commonly 7/8, and the other between 5 and 8, commonly 6. All 3 clusters run cluster-autoscaler (different versions due to kubernetes version limitations). Is there something I should be looking at? Any guesses on what's the reason for this problem? |
ohh, so that's it: a node goes down, and the cleanup doesn't run until a new node is up... OK then, no issue 👍 thanks |
I have seen this same issue with weave 2.2.0, kops 1.9.0-beta-2, k8s 1.9.3 |
I'm also seeing this issue with weave 2.3.0, kops 1.9.0, k8s 1.9.3 |
Please don’t comment on old, closed issues. Open a new issue and provide the details which will allow your issue to be debugged. |
weave-kube adds all current nodes as peers at startup, but never checks back to see if some nodes have been deleted. In a situation such as a regularly expanding and contracting auto-scale group, the IPAM ring will eventually become clogged with peers that have gone away.

We need to do weave rmpeer on deleted nodes, and, less importantly, weave forget. We will need some interlock to ensure the weave rmpeer is only done once.

How do we even detect that a Weave Net peer originated as a Kubernetes node? This logic should be resilient to users adding non-Kubernetes peers to the network, even on a host that was previously a Kubernetes peer.