weave attempting to connect to deleted nodes, and failing with functioning ones #3021

deitch · 2017-06-18T14:16:18Z

What you expected to happen?

weave-kube should be able to connect consistently to functional nodes and ignore deleted ones

What happened?

Logs show "flapping", in/out, but constantly tries to connect to deleted nodes

How to reproduce it?

Unsure. I am a little nervous to delete the DaemonSet for fear of losing the problem until it shows up again in prod

Anything else we need to know?

Running on kube on AWS.

Versions:

$ weave version
1.9.8  # image version; running `weave version` doesn't really work on kube, in weave-kube container gives "docker: Cannot connect to the Docker daemon. Is the docker daemon running on this host?."
$ docker version
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:      linux/amd64
$ uname -a
Linux ip-10-50-20-214.ec2.internal 4.7.3-coreos-r2 #1 SMP Sun Jan 8 00:32:25 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz GenuineIntel GNU/Linux
$ kubectl version
1.6.2

Logs:

$ docker logs weave

or, if using Kubernetes:

$ kubectl logs -n kube-system <weave-net-pod> weave
INFO: 2017/06/18 13:45:59.381055 ->[10.50.21.194:54097] connection accepted
INFO: 2017/06/18 13:45:59.382893 ->[10.50.21.194:54097|b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)]: connection ready; using protocol version 2
INFO: 2017/06/18 13:45:59.383012 overlay_switch ->[b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)] using fastdp
INFO: 2017/06/18 13:45:59.383055 ->[10.50.21.194:54097|b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)]: connection added (new peer)
INFO: 2017/06/18 13:45:59.386618 ->[10.50.21.194:54097|b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)]: connection fully established
INFO: 2017/06/18 13:45:59.387055 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2017/06/18 13:45:59.388897 sleeve ->[10.50.21.194:6783|b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)]: Effective MTU verified at 8939
INFO: 2017/06/18 13:45:59.561696 Discovered remote MAC 9e:7d:e6:dd:3b:60 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:45:59.949617 Discovered remote MAC 36:37:42:ca:44:59 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:00.144553 Discovered remote MAC f6:58:af:e6:75:a0 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:00.144899 Discovered remote MAC ba:db:51:ef:5c:f5 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:00.164246 Discovered remote MAC 72:27:02:61:0b:c5 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:01.111513 Discovered remote MAC 3e:ab:04:52:eb:c8 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:01.565576 Discovered remote MAC 0e:db:2c:c9:f4:e6 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:01.736969 Discovered remote MAC 8e:e3:b0:fe:8c:db at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
ERRO: 2017/06/18 13:46:02.432131 Captured frame from MAC (3e:ab:04:52:eb:c8) to (fa:d2:49:2d:12:73) associated with another peer b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
ERRO: 2017/06/18 13:46:02.432232 Captured frame from MAC (3e:ab:04:52:eb:c8) to (fa:d2:49:2d:12:73) associated with another peer b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:02.531194 Discovered remote MAC 4a:46:51:1b:3c:ee at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:02.896173 Discovered remote MAC 82:93:18:bf:3b:dd at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:04.847660 Discovered remote MAC e6:44:7a:30:34:15 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:08.734704 Discovered remote MAC 72:1e:b2:82:93:c4 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:46:12.827627 Discovered remote MAC b2:b0:8e:9e:92:01 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:48:26.035363 ->[10.50.21.91:6783] attempting connection
INFO: 2017/06/18 13:48:43.133397 ->[10.50.20.63:6783] attempting connection
INFO: 2017/06/18 13:48:46.139066 ->[10.50.20.63:6783] error during connection attempt: dial tcp4 :0->10.50.20.63:6783: getsockopt: no route to host
INFO: 2017/06/18 13:50:33.249061 ->[10.50.21.91:6783] error during connection attempt: dial tcp4 :0->10.50.21.91:6783: getsockopt: connection timed out
INFO: 2017/06/18 13:52:21.554919 ->[10.50.21.226:6783] attempting connection
INFO: 2017/06/18 13:54:28.769024 ->[10.50.21.226:6783] error during connection attempt: dial tcp4 :0->10.50.21.226:6783: getsockopt: connection timed out
INFO: 2017/06/18 13:54:49.098034 Expired MAC a6:4c:16:6a:67:3a at c6:f8:08:48:aa:a4(ip-10-50-20-144.ec2.internal)
INFO: 2017/06/18 13:54:49.098071 Expired MAC 9e:d4:27:92:01:2b at 12:4a:1a:49:54:d8(ip-10-50-20-214.ec2.internal)
INFO: 2017/06/18 13:55:49.098291 Expired MAC 7a:12:89:41:90:fe at c6:f8:08:48:aa:a4(ip-10-50-20-144.ec2.internal)
INFO: 2017/06/18 13:55:49.098333 Expired MAC c6:f8:08:48:aa:a4 at c6:f8:08:48:aa:a4(ip-10-50-20-144.ec2.internal)
INFO: 2017/06/18 13:56:06.702540 ->[10.50.20.63:6783] attempting connection
INFO: 2017/06/18 13:56:09.709158 ->[10.50.20.63:6783] error during connection attempt: dial tcp4 :0->10.50.20.63:6783: getsockopt: no route to host
INFO: 2017/06/18 13:56:30.050728 ->[10.50.21.91:6783] attempting connection
INFO: 2017/06/18 13:56:49.098541 Expired MAC 12:4a:1a:49:54:d8 at 12:4a:1a:49:54:d8(ip-10-50-20-214.ec2.internal)
INFO: 2017/06/18 13:58:37.345030 ->[10.50.21.91:6783] error during connection attempt: dial tcp4 :0->10.50.21.91:6783: getsockopt: connection timed out
INFO: 2017/06/18 13:59:18.085603 ->[10.50.20.63:6783] attempting connection
INFO: 2017/06/18 13:59:21.091068 ->[10.50.20.63:6783] error during connection attempt: dial tcp4 :0->10.50.20.63:6783: getsockopt: no route to host
INFO: 2017/06/18 13:59:49.099264 Expired MAC 3e:ab:04:52:eb:c8 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:59:49.099324 Expired MAC 8e:e3:b0:fe:8c:db at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:59:49.099352 Expired MAC 9e:7d:e6:dd:3b:60 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:59:49.099377 Expired MAC f6:58:af:e6:75:a0 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:59:49.099425 Expired MAC 82:93:18:bf:3b:dd at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:59:49.099438 Expired MAC ba:db:51:ef:5c:f5 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:59:49.099450 Expired MAC 4a:46:51:1b:3c:ee at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:59:49.099462 Expired MAC e6:44:7a:30:34:15 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 13:59:49.099474 Expired MAC 72:27:02:61:0b:c5 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 14:01:17.985491 Discovered local MAC ba:b6:78:61:2d:9d
INFO: 2017/06/18 14:02:22.074616 ->[10.50.20.63:6783] attempting connection
INFO: 2017/06/18 14:02:25.081171 ->[10.50.20.63:6783] error during connection attempt: dial tcp4 :0->10.50.20.63:6783: getsockopt: no route to host
INFO: 2017/06/18 14:02:55.986007 ->[10.50.21.226:6783] attempting connection
INFO: 2017/06/18 14:04:50.546119 ->[10.50.21.91:6783] attempting connection
INFO: 2017/06/18 14:05:03.265054 ->[10.50.21.226:6783] error during connection attempt: dial tcp4 :0->10.50.21.226:6783: getsockopt: connection timed out
INFO: 2017/06/18 14:06:57.825098 ->[10.50.21.91:6783] error during connection attempt: dial tcp4 :0->10.50.21.91:6783: getsockopt: connection timed out
INFO: 2017/06/18 14:09:35.270454 ->[10.50.21.226:6783] attempting connection
INFO: 2017/06/18 14:10:00.046305 ->[10.50.20.63:6783] attempting connection
INFO: 2017/06/18 14:10:03.053167 ->[10.50.20.63:6783] error during connection attempt: dial tcp4 :0->10.50.20.63:6783: getsockopt: no route to host
INFO: 2017/06/18 14:11:42.497042 ->[10.50.21.226:6783] error during connection attempt: dial tcp4 :0->10.50.21.226:6783: getsockopt: connection timed out
INFO: 2017/06/18 14:11:49.102565 Expired MAC 0e:db:2c:c9:f4:e6 at b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)
INFO: 2017/06/18 14:11:49.102624 Expired MAC ba:b6:78:61:2d:9d at 12:4a:1a:49:54:d8(ip-10-50-20-214.ec2.internal)

In the above, 10.50.21.91 and 10.50.21.226 and 10.50.20.63 are old nodes that have since been deleted from kube (not NotReady but actually deleted).

$ kubectl get nodes
NAME           STATUS    AGE       VERSION
10.50.20.144   Ready     1h        v1.6.2
10.50.20.214   Ready     1h        v1.6.2
10.50.21.194   Ready     1h        v1.6.2

I do not know if this is affecting some oddness with other pods, like services being unable to find endpoints even though they are fine, and sometimes nodes being unable to find the internal kube API service endpoint 10.200.0.1:443. FWIW, weave-kube accesses it just fine:

core@ip-10-50-20-214 ~ $ docker exec -it f7dcf35635b55 sh
/home/weave # curl -ik https://10.200.0.1
HTTP/1.1 403 Forbidden
Content-Type: text/plain
X-Content-Type-Options: nosniff
Date: Sun, 18 Jun 2017 14:15:59 GMT
Content-Length: 79

User "system:anonymous" cannot get  at the cluster scope.: "No policy matched."
/home/weave #

The text was updated successfully, but these errors were encountered:

deitch · 2017-06-18T14:17:21Z

FYI, the most troublesome other services are:

kubernetes-dashboard pod cannot reach 10.200.0.1:443
traefik ingress controller service cannot find any endpoints even though they are active and match

deitch · 2017-06-18T15:33:26Z

Feh. CoreOS rebooted itself, now it doesn't appear to be giving those messages anymore. Instead, every minute I am getting:

WARN: 2017/06/18 15:32:54.281972 [allocator]: Delete: no addresses for de2d98a2e55b160ff509f791082fe12794a8073571c75d5e37e4121f469b1068
WARN: 2017/06/18 15:32:54.299032 [allocator]: Delete: no addresses for ebcb280918b417125ea61d643d917a3b7fb5c93cbc74453830062cce1341f942
WARN: 2017/06/18 15:32:54.306000 [allocator]: Delete: no addresses for 797e3b6e829de452a76630708837634795ea0defed82fd0659d1a11e1d73f515
WARN: 2017/06/18 15:32:54.312916 [allocator]: Delete: no addresses for ace8f925262c2d4d1aecc3272b592e23651ad4ba5a4fcb4e9373fa1830a20e4a
WARN: 2017/06/18 15:32:54.319678 [allocator]: Delete: no addresses for 317571c42ec7652591172b0b4d73c619ea7a5e3fff059a0510b9c8d1846b0125
WARN: 2017/06/18 15:32:54.326228 [allocator]: Delete: no addresses for 04681d4ac460d454136c1a7942db6c319b6a9bdd937d370b66a5b0494a233757
WARN: 2017/06/18 15:32:54.332627 [allocator]: Delete: no addresses for 7d166aa34f3e931ec4452f5184ea2518876ac1cae084495bee2c8f44a7d1c229
WARN: 2017/06/18 15:32:54.339040 [allocator]: Delete: no addresses for 14686e71b4da7c15a14ec0f6e3fdb967ecf06cdb2790245ae58483c12cf2082f

What can I do to help here?

marccarre · 2017-06-20T10:50:02Z

Hi @deitch, thanks for reporting this issue.
What does the below command output?

$ docker exec -ti <weave CID> ./weave --local status ipam

deitch · 2017-06-20T12:01:30Z

Output from all three workers:

core@ip-10-50-21-194 ~ $ docker exec -it 970e1a24b8b8 ./weave --local status ipam
b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)    16384 IPs (25.0% of total) (10 active)
12:4a:1a:49:54:d8(ip-10-50-20-214.ec2.internal)    32768 IPs (50.0% of total)
c6:f8:08:48:aa:a4(ip-10-50-20-144.ec2.internal)    16384 IPs (25.0% of total)

core@ip-10-50-20-144 ~ $ docker exec -it 13c7f9cfbc9f ./weave --local status ipam
c6:f8:08:48:aa:a4(ip-10-50-20-144.ec2.internal)    16384 IPs (25.0% of total) (13 active)
b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)    16384 IPs (25.0% of total)
12:4a:1a:49:54:d8(ip-10-50-20-214.ec2.internal)    32768 IPs (50.0% of total)

core@ip-10-50-20-214 ~ $ docker exec -it e8403ca7e85a ./weave --local status ipam
12:4a:1a:49:54:d8(ip-10-50-20-214.ec2.internal)    32768 IPs (50.0% of total) (18 active)
b2:b0:8e:9e:92:01(ip-10-50-21-194.ec2.internal)    16384 IPs (25.0% of total)
c6:f8:08:48:aa:a4(ip-10-50-20-144.ec2.internal)    16384 IPs (25.0% of total)

renewooller · 2017-11-06T04:17:56Z

I have similar problems, however my status looks a lot worse:

root@ip-10-20-52-110:/home/admin# docker exec -it 29605c63985d ./weave --local status ipam
72:07:06:06:16:3c(ip-10-20-52-110.us-west-1.compute.internal)     4096 IPs (00.2% of total) (24 active)
a6:e9:b1:63:0c:60(ip-10-20-53-217.us-west-1.compute.internal)     8192 IPs (00.4% of total) 
e2:cd:be:ab:a6:d1(ip-10-20-40-87.us-west-1.compute.internal)     4096 IPs (00.2% of total) 
a2:71:2b:05:a5:25(ip-10-20-36-3.us-west-1.compute.internal)   131072 IPs (06.2% of total) - unreachable!
ba:2d:dc:a6:b6:54(ip-10-20-55-87.us-west-1.compute.internal)     3072 IPs (00.1% of total) 
9a:f1:cb:a7:f9:87(ip-10-20-61-57.us-west-1.compute.internal)     1024 IPs (00.0% of total) 
32:79:72:b4:17:20(ip-10-20-49-181.us-west-1.compute.internal)   262144 IPs (12.5% of total) - unreachable!
1a:3d:2d:6c:83:40(ip-10-20-62-110.us-west-1.compute.internal)   524288 IPs (25.0% of total) - unreachable!
66:db:5d:c4:ae:48(ip-10-20-57-148.us-west-1.compute.internal)   262144 IPs (12.5% of total) - unreachable!
1a:f7:a9:61:df:8a(ip-10-20-60-34.us-west-1.compute.internal)   393216 IPs (18.8% of total) - unreachable!
4e:a0:da:62:66:de(ip-10-20-32-10.us-west-1.compute.internal)    32768 IPs (01.6% of total) - unreachable!
be:52:fc:14:cf:12(ip-10-20-37-161.us-west-1.compute.internal)    12288 IPs (00.6% of total) - unreachable!
1e:75:38:78:8e:da(ip-10-20-59-142.us-west-1.compute.internal)   262144 IPs (12.5% of total) - unreachable!
16:4f:1f:56:43:64(ip-10-20-35-94.us-west-1.compute.internal)    49152 IPs (02.3% of total) - unreachable!
5e:f8:14:99:de:0d(ip-10-20-47-181.us-west-1.compute.internal)    16384 IPs (00.8% of total) - unreachable!
92:98:89:e8:62:b9(ip-10-20-41-4.us-west-1.compute.internal)   131072 IPs (06.2% of total) - unreachable!

SharpEdgeMarshall · 2017-11-10T11:49:16Z

Same here, using kops after a rolling-update

5e:bf:f1:92:37:8a(ip-10-0-33-17.eu-west-1.compute.internal)    65536 IPs (03.1% of total) (29 active)
7a:02:c3:8d:9d:1e(ip-10-0-47-107.eu-west-1.compute.internal)  1048576 IPs (50.0% of total) - unreachable!
fa:20:f4:51:c3:44(ip-10-0-42-95.eu-west-1.compute.internal)   524288 IPs (25.0% of total) - unreachable!
fa:82:77:11:31:1d(ip-10-0-35-28.eu-west-1.compute.internal)   131072 IPs (06.2% of total)
e2:d1:e2:d7:da:89(ip-10-0-68-148.eu-west-1.compute.internal)    65536 IPs (03.1% of total)
56:ce:08:53:60:de(ip-10-0-94-247.eu-west-1.compute.internal)   262144 IPs (12.5% of total) - unreachable!

bboreham · 2017-11-10T16:17:52Z

@renewooller @SharpEdgeMarshall I think the point you are raising is tracked at #2797, and we will shortly release a new version which should improve matters.

"attempting to connect" should not impact function in any way; would be nice to avoid it but not critical.

bboreham · 2017-11-17T15:05:39Z

Please note the code to remove deleted Kubernetes peers from the IP ownership list was released today, in Weave Net version 2.1.1.

deitch · 2017-12-02T16:53:05Z

Somehow I missed this notification. So as of 2.1.1, if a peer is unreachable and no longer listed among kubernetes nodes, weave will remove it automatically and reap its addresses?

bboreham · 2017-12-04T12:16:19Z

@deitch yes; the reaping happens the next time a peer starts up.

Note there is a further fix pending release #3192

deitch · 2017-12-04T12:20:21Z

the reaping happens the next time a peer starts up

Ah, but not on some sort of schedule? I guess in a cloud environment, if one is going away, others are coming back with some regularity.

further fix pending release #3192

I see. ETA/targeted release?

bboreham · 2017-12-04T15:02:45Z

v2.1.3 is released now.

deitch · 2017-12-04T15:31:12Z

Thanks @bboreham . We will use that. Closing this issue, then.

bboreham · 2017-12-04T15:45:33Z

Note older peers will still attempt to connect to deleted nodes, because that part hasn't changed. When they (eventually) restart they will pick up an updated list of peers and stop trying to connect to the deleted ones.

deitch · 2017-12-04T15:47:13Z

older peers will still attempt to connect to deleted nodes

When you say "older peers", you mean "peers running pre-2.1.3 versions of weave"?

bboreham · 2017-12-04T16:16:34Z

"older" meaning "started prior to the time a node was deleted".

The new release addresses the point made at #3021 (comment) ; it does not address "attempting to connect" which was the original subject of this issue.

deitch · 2017-12-05T06:24:28Z

started prior to the time a node was deleted

I start 3 nodes: A, B, C
I delete A, and auto-scaling starts D. New state: B, C, D.

Result: B and C will continue attempting to connect to A, while D will not.

I delete B, auto-scaling starts E.

Result: C will continue attempting to connect to A and B, while D and E will not.

Is that it?

And if so, the IP addresses nonetheless will be reaped, because D (and later E), running 2.1.3 will recognize the disappearance of A (later B) and reap those addresses, which get synced up among all nodes?

The new release addresses the point made at #3021 (comment) ; it does not address "attempting to connect" which was the original subject of this issue

True, but a- part of the issue was deleted nodes; b- this is old enough and older releases enough ago (is that English?), that I am willing to close the issue I opened, will open a new one if I see recurrence in 2.1.3+

But I only opened the issue, do not maintain; if you want to reopen to track, by all means.

bboreham · 2017-12-05T11:44:31Z

@deitch yes, you have it.

deitch · 2017-12-05T11:45:42Z

yes, you have it.

Excellent, thanks. We should be upgrading our weave pods in the next few days... and rather looking forward.

deitch closed this as completed Dec 4, 2017

bboreham mentioned this issue May 14, 2018

Lost K8S autoscaled nodes contacted indefinitely #3300

Closed

weave attempting to connect to deleted nodes, and failing with functioning ones #3021

weave attempting to connect to deleted nodes, and failing with functioning ones #3021

Comments

deitch commented Jun 18, 2017

What you expected to happen?

What happened?

How to reproduce it?

Anything else we need to know?

Versions:

Logs:

deitch commented Jun 18, 2017

Uh oh!

deitch commented Jun 18, 2017

Uh oh!

marccarre commented Jun 20, 2017

Uh oh!

deitch commented Jun 20, 2017

Uh oh!

renewooller commented Nov 6, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SharpEdgeMarshall commented Nov 10, 2017

Uh oh!

bboreham commented Nov 10, 2017

Uh oh!

bboreham commented Nov 17, 2017

Uh oh!

deitch commented Dec 2, 2017

Uh oh!

bboreham commented Dec 4, 2017

Uh oh!

deitch commented Dec 4, 2017

Uh oh!

bboreham commented Dec 4, 2017

Uh oh!

deitch commented Dec 4, 2017

Uh oh!

bboreham commented Dec 4, 2017

Uh oh!

deitch commented Dec 4, 2017

Uh oh!

bboreham commented Dec 4, 2017

Uh oh!

deitch commented Dec 5, 2017

Uh oh!

bboreham commented Dec 5, 2017

Uh oh!

deitch commented Dec 5, 2017

Uh oh!

renewooller commented Nov 6, 2017 •

edited

Loading