This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

weave launch-router fails when trying to start after a stop #1772

Closed
aneeth opened this issue Dec 10, 2015 · 6 comments

aneeth commented Dec 10, 2015

If you stop the weave router for any reason and then try to launch it again, it fails. Interestingly, it starts without any issues on the next launch attempt.

How to reproduce on CentOS 7.1 and weave 1.3.1

[root@localhost ~]# weave launch-router
5ce0aabe6a7de4115eeb2dd015f18374136740998cf23e6398a0a1d1c0238c60
[root@localhost ~]# weave stop-router
[root@localhost ~]# weave launch-router
The weave container has died. Consult the container logs for further details.

The Docker logs show the following:

[root@localhost ~]# docker logs weave
INFO: 2015/12/03 00:19:08.580529 Command line options: map[ipalloc-range:10.32.0.0/12 nickname:localhost.localdomain datapath:weave dns-listen-address:172.17.0.1:53 http-addr:127.0.0.1:6784 port:6783 docker-api:unix:///var/run/docker.sock dns-effective-listen-address:172.17.0.1 name:fe:45:8d:b8:e4:d0]
INFO: 2015/12/03 00:19:08.580626 Command line peers: []
FATA: 2015/12/03 00:19:08.580904 netlink error response: address already in use

So naturally I checked the interfaces:

[root@localhost ~]# ip addr show weave
7: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN 
    link/ether fe:45:8d:b8:e4:d0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc45:8dff:feb8:e4d0/64 scope link 
       valid_lft forever preferred_lft forever

However, the router launches successfully on the next attempt:

[root@localhost ~]# weave launch-router
6efe86ee795698476d587a3e50f9850e34ebb4813b6c2c71b6232c3fd8e8a104
[root@localhost ~]# 

Workaround

Here's the workaround that I've been using to launch the router (without a launch failure) after stopping it for any reason. You need to manually delete the weave interface/bridge before starting the router.

(Stopping the router first, since it's already running.)

[root@localhost ~]# weave stop-router

Manually removing the interface/bridge

[root@localhost ~]# docker run --rm --privileged --net=host weaveworks/weave --delete-datapath --datapath=weave

Starting weave

[root@localhost ~]# weave launch-router
c5ccb0d10e36ec2bab42336354ef231bf3f6b3faba6788235fa6b5f57c9be626
[root@localhost ~]# 

Starts without an issue
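
For repeated use, the three manual steps above could be wrapped in a small shell function. This is only a sketch: the function name relaunch_weave_router is made up for illustration, the weave and docker invocations are copied verbatim from the steps above, and it assumes both commands are on the PATH and are run as root.

```shell
# Hypothetical wrapper around the manual workaround; the function name is
# invented for illustration, the commands are the ones shown above.
relaunch_weave_router() {
    # Stop the router first; carry on even if this step reports an error.
    weave stop-router || true

    # Delete the leftover weave datapath so the next launch starts clean.
    docker run --rm --privileged --net=host weaveworks/weave \
        --delete-datapath --datapath=weave || return 1

    # Launch the router again; its exit status becomes the function's status.
    weave launch-router
}
```

After sourcing a file containing this function, calling relaunch_weave_router takes the place of a plain weave launch-router after a stop.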

If this is a bug, it would be great to have a fix for it soon.

rade added the bug label Dec 10, 2015
rade added this to the 1.3.2 milestone Dec 10, 2015
Member

rade commented Dec 10, 2015

We had a report of this before, but not with a repro. The same sequence of steps works fine in our dev/test envs, so it might be something kernel-version specific. What kernel are you running?

Author

aneeth commented Dec 10, 2015

[root@localhost ~]# uname -a
Linux localhost.localdomain 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Contributor

dpw commented Dec 10, 2015

I can reproduce this now.

Note that a second weave launch-router succeeds, and straight after the first attempt, the odp tool shows that the vxlan vports are gone, and netstat shows the port is not in use.

Looking at the CentOS kernel source, it looks like the release of the vxlan UDP socket occurs asynchronously. So when we delete a vxlan vport and recreate it, the old one is not quite gone.

However, the relevant section of kernel code (vxlan_sock_release) has not changed substantially, so that does not explain why we don't see this issue everywhere. It may be something to do with how different kernel versions schedule queued work.

Contributor

awh commented Dec 10, 2015

@dpw and I have just discussed a fix for this, which is to implement a limited retry loop with a short sleep (e.g. 5 tries, 10ms) on that particular error.
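
The fix discussed here concerns vxlan vport creation inside the router, and its code is not shown in this thread. As a rough illustration of the same technique only, here is a shell sketch of a limited retry loop that retries solely when the failure output contains a given error text. The helper name retry_on_error and its argument order are assumptions made for this example.

```shell
# Hypothetical helper (not part of weave): run a command up to TRIES times,
# sleeping DELAY seconds between attempts, and retry only when the failure
# output contains PATTERN.
# Usage: retry_on_error TRIES DELAY PATTERN command [args...]
retry_on_error() {
    local tries="$1" delay="$2" pattern="$3"
    shift 3
    local attempt=1 output="" status=1
    while [ "$attempt" -le "$tries" ]; do
        # Capture stdout and stderr together; the && / || form records the
        # exit status without tripping "set -e" in a calling script.
        output=$("$@" 2>&1) && status=0 || status=$?
        if [ "$status" -eq 0 ]; then
            [ -z "$output" ] || printf '%s\n' "$output"
            return 0
        fi
        # Any failure other than the expected error is not retried.
        case $output in
            *"$pattern"*) ;;
            *) break ;;
        esac
        # Sleep only if another attempt will follow.
        if [ "$attempt" -lt "$tries" ]; then sleep "$delay"; fi
        attempt=$((attempt + 1))
    done
    # Out of attempts, or an unexpected error: report the last output.
    [ -z "$output" ] || printf '%s\n' "$output" >&2
    return "$status"
}
```

With the numbers suggested above, a call would look like retry_on_error 5 0.01 "address already in use" some_command, where some_command is a placeholder for whatever operation reports that error. Matching on the error text keeps unrelated failures from being retried, which mirrors the idea of retrying only "on that particular error".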

@leon-strong

It appears this happens on fresh installs as well. I've just tried to install on a fresh machine and it refused to start. Looking at the container logs, I saw the error; removing the bridge manually and starting again fixed it right up.

Contributor

awh commented Dec 15, 2015

Is it the exact same error? If you've got an existing bridge on the machine from an earlier version of weave and then try to use fast datapath, you will need to run weave reset first...

awh self-assigned this Dec 15, 2015
awh modified the milestones: 1.4.0, 1.3.2 Dec 15, 2015
bboreham added a commit that referenced this issue Dec 16, 2015
…-in-use

Introduce limited retry on vxlan vport creation. LGTM; fixes #1772.