weave launch-router fails when trying to start after a stop #1772

aneeth · 2015-12-10T12:01:57Z

If you stop the weave router for any reason and then try to launch it again, it fails. Interestingly, it starts without any issues on the next launch attempt.

How to reproduce on CentOS 7.1 and weave 1.3.1

[root@localhost ~]# weave launch-router
5ce0aabe6a7de4115eeb2dd015f18374136740998cf23e6398a0a1d1c0238c60
[root@localhost ~]# weave stop-router
[root@localhost ~]# weave launch-router
The weave container has died. Consult the container logs for further details.

Docker logs show the following:

[root@localhost ~]# docker logs weave
INFO: 2015/12/03 00:19:08.580529 Command line options: map[ipalloc-range:10.32.0.0/12 nickname:localhost.localdomain datapath:weave dns-listen-address:172.17.0.1:53 http-addr:127.0.0.1:6784 port:6783 docker-api:unix:///var/run/docker.sock dns-effective-listen-address:172.17.0.1 name:fe:45:8d:b8:e4:d0]
INFO: 2015/12/03 00:19:08.580626 Command line peers: []
FATA: 2015/12/03 00:19:08.580904 netlink error response: address already in use

So naturally I checked the interfaces:

[root@localhost ~]# ip addr show weave
7: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN 
    link/ether fe:45:8d:b8:e4:d0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc45:8dff:feb8:e4d0/64 scope link 
       valid_lft forever preferred_lft forever

However, the router launches successfully on the next attempt:

[root@localhost ~]# weave launch-router
6efe86ee795698476d587a3e50f9850e34ebb4813b6c2c71b6232c3fd8e8a104
[root@localhost ~]#

Workaround

Here's the workaround that I've been using to launch the router (without a launch failure) after stopping it for any reason: manually delete the weave interface/bridge before starting the router.

(stopping the router since it's already running)

[root@localhost ~]# weave stop-router

Manually removing the interface/bridge

[root@localhost ~]# docker run --rm --privileged --net=host weaveworks/weave --delete-datapath --datapath=weave

Starting weave

[root@localhost ~]# weave launch-router
c5ccb0d10e36ec2bab42336354ef231bf3f6b3faba6788235fa6b5f57c9be626
[root@localhost ~]#

Starts without an issue

If this is a bug, it would be great to have a fix for it soon.

rade · 2015-12-10T12:09:16Z

We had a report of this before, but not with a repro. The same sequence of steps works fine in our dev/test envs, so it might be something kernel-version specific. What kernel are you running?

aneeth · 2015-12-10T12:11:58Z

[root@localhost ~]# uname -a
Linux localhost.localdomain 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

dpw · 2015-12-10T14:04:13Z

I can reproduce this now.

Note that a second weave launch-router succeeds. Straight after the first (failed) attempt, the odp tool shows that the vxlan vports are gone, and netstat shows the port is not in use.

Looking at the CentOS kernel source, it looks like the release of the vxlan UDP socket occurs asynchronously. So when we delete a vxlan vport and recreate it, the old one is not quite gone.

However, the relevant section of kernel code (vxlan_sock_release) has not changed substantially, so that does not explain why we don't see this issue everywhere. It may be something to do with how different kernel versions schedule queued work.

awh · 2015-12-10T14:26:22Z

@dpw and I have just discussed a fix for this, which is to implement a limited retry loop with a short sleep (e.g. 5 tries, 10ms) on that particular error.

leon-strong · 2015-12-14T21:26:06Z

It appears this happens on fresh installs also. I've just tried to install on a fresh install and it refused to start. Looking at the container logs, I saw the error; removing the bridge manually and starting again fixed it right up.

awh · 2015-12-15T11:29:04Z

Is it the exact same error? If you've got an existing bridge on the machine from an earlier version of weave, and then try to use fast datapath, you will need to run weave reset first...

rade added the bug label Dec 10, 2015

rade added this to the 1.3.2 milestone Dec 10, 2015

awh self-assigned this Dec 15, 2015

awh modified the milestones: 1.4.0, 1.3.2 Dec 15, 2015

awh mentioned this issue Dec 16, 2015

Introduce limited retry on vxlan vport creation #1795

Merged

bboreham closed this as completed in #1795 Dec 16, 2015

bboreham added a commit that referenced this issue Dec 16, 2015

Merge pull request #1795 from /issues/1772-fix-fastdp-address-already…

e6e414a

…-in-use Introduce limited retry on vxlan vport creation. LGTM; fixes #1772.

ydye mentioned this issue May 7, 2020

[Kubespray] Failed to start k8s cluster microsoft/pai#4480

Closed