weave not deleting network interfaces #3406
As you reported, this error is the result of reaching the hard limit of 1024 bridge ports (as noted in #3258 as well). But we need to find out why the interfaces for deleted containers were not cleaned up. Could you please share the logs, or check in them whether there was an attempt to delete the interfaces of those containers, or a failure to do so?
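A quick way to check how close a node is to that port limit is to count the veth interfaces currently attached to the weave bridge. A minimal sketch, assuming iproute2 is available and the bridge is named weave:

# Count host-side veths enslaved to the "weave" bridge; the kernel caps a
# bridge at 1024 ports, after which attaches fail with "exchange full".
ip -o link show master weave | grep -c vethwepl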
Noting that you run Docker 1.11.2, which can leak network namespaces and prevent the interfaces from being removed (moby/moby#32090).
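If that Docker bug is in play, leaked network namespaces should be visible on the host. A rough check, assuming a recent util-linux and that Docker keeps its netns mounts under /var/run/docker/netns:

# Network namespaces the kernel knows about, with the owning PID/command.
lsns -t net
# Namespace mounts created by Docker; entries without a corresponding live
# container suggest a leaked netns that is keeping a veth pair alive.
ls -l /var/run/docker/netns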
I just created a 4-node weave 2.4.0 cluster with Docker 18.06.0-ce (no Mesos etc.).
Logs:
@brb this should be enough info to remove the label, right?
I'm also experiencing this issue.
Weave version: 2.4.1
Logs:
Potentially interesting factors:
Also knocked up a quick script to delete the dangling links (requires pyroute2 and a recent util-linux for lsns):

#!/usr/bin/env python
from __future__ import print_function

import subprocess

from pyroute2 import IPRoute

THRESHOLD = 100

if __name__ == "__main__":
    ipr = IPRoute()
    # Enumerate the links, create an ifname lookup table
    links = {x['index']: dict(x['attrs']) for x in ipr.get_links('all')}
    ifs = {v['IFLA_IFNAME']: k for k, v in links.items()}
    # Fetch the index of the 'weave' bridge interface and enumerate its children
    bridge = ifs['weave']
    veths = [k for k, v in links.items()
             if v['IFLA_IFNAME'].startswith('vethwepl')
             and v.get('IFLA_MASTER') == bridge]
    # Look up valid netnsids - should find a native way to do this
    output = subprocess.check_output(['lsns', '-t', 'net'], universal_newlines=True)
    lines = output.split("\n")
    valid_netnsids = set(x.split()[5] for x in lines[1:-1])
    int_netnsids = set(int(x) for x in valid_netnsids if x != 'unassigned')
    # Check whether the netnsid on each veth is valid
    valid_veths = set()
    invalid_veths = set()
    for idx in veths:
        if 'IFLA_LINK_NETNSID' not in links[idx]:
            continue
        if links[idx]['IFLA_LINK_NETNSID'] in int_netnsids:
            valid_veths.add(idx)
        else:
            invalid_veths.add(idx)
    print("Found {} valid, {} invalid veth pairs".format(len(valid_veths), len(invalid_veths)))
    # Only act if we're above a threshold, to make reproducing easier
    if len(invalid_veths) > THRESHOLD:
        print("More than {} invalid veths; culling them".format(THRESHOLD))
        for idx in invalid_veths:
            ipr.link('del', index=idx)
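For reference, one way to run such a cleanup on a schedule; the script path and log file below are placeholders, not part of the original report:

# Hypothetical root crontab entry running the culling script every 10 minutes.
*/10 * * * * /usr/local/bin/cull-weave-veths.py >> /var/log/cull-weave-veths.log 2>&1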
We're using a bash script:

#!/bin/bash
# Extract host-side weave veth names (they show up as 'vethweplNNN@ifN' in `ip a`).
ip a | grep -oP 'vethwepl.*@' | while read -r line ; do
    veth=${line::-1}    # strip the trailing '@'
    if [[ $veth =~ [0-9] ]]; then
        echo "check $veth"
        # weave names the veth after the PID of the container it belongs to.
        pid=$(echo "$veth" | tr -dc '0-9')
        if ! ps -p "$pid" > /dev/null; then
            echo "deleting $veth"
            ip link delete "$veth" >&2
        else
            echo "$veth still running"
        fi
    else
        echo "$veth has no number in it and will not be deleted"
    fi
done
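Both workarounds attack the same symptom from different angles: the Python script checks whether each veth's peer network namespace (IFLA_LINK_NETNSID) still exists, while the bash script relies on weave encoding the container's PID in the interface name (vethwepl<PID>) and checks whether that process is still alive. Either way, this only cleans up the leaked interfaces; it does not address why they leak in the first place.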
Also having this issue. We run weave with Nomad in a 3-node cluster and various satellite applications connecting in over weave. In our case it was triggered by a periodic job that ran every minute; eventually that job stopped running, but only after taking out the entire cluster 💯 Pretty much identical configuration and logs to @predakanga. The containers are being removed and there are no DNS entries left behind. Kinda ironic running the script above as a periodic job to work around periodic jobs not working lol
@MikeMichel @predakanga @thetooth thanks for providing the logs. I will try to reproduce and do root cause analysis.
Just followed the steps and I am able to reproduce the issue. Will try to identify the cause.
Possibly related: I just SSHed in to find that several (about 10) of our task containers had hung during their launch over a period of 4 days. Each hung container looks the same, so I picked a random one to gather information. Docker lists the container as "running".
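For anyone hitting the same hang, a couple of quick checks; the container name below is a placeholder:

# Hypothetical container name; substitute one of the hung containers.
docker inspect -f '{{.State.Status}} pid={{.State.Pid}}' hung-task
# If the reported PID is 0 or no longer exists, the "running" state is stale.
ps -p "$(docker inspect -f '{{.State.Pid}}' hung-task)"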
@jzaefferer
Fixes #3406 weave not deleting network interfaces
Must have been a good-looking script ;)
I have found the cause of this problem in the overnode script. The leaked IP interfaces are left behind when a container is created via the weave Docker proxy but killed and removed via Docker directly. Once I moved the kill and rm to go through the weave proxy too, I have not seen this problem anymore.
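For anyone with a similar setup, the distinction is roughly the following; a minimal sketch assuming the standard weave proxy environment and hypothetical container/image names:

# Creation goes through the weave proxy, which wires up the veth on attach.
eval $(weave env)                   # points DOCKER_HOST at the weave proxy
docker run -d --name app myimage    # "app" and "myimage" are placeholders

# Teardown should go through the proxy as well; stopping/removing against the
# plain Docker socket (DOCKER_HOST unset) is what left the interfaces behind.
docker stop app && docker rm app    # run with the proxy env still in place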
What you expected to happen?
Weave removes network interfaces after containers are stopped.
What happened?
A lot of interfaces stay forever, until the hard limit of 1024 is reached and no more containers can be started:
docker: Error response from daemon: attaching veth "vethwepl23795" to "weave": exchange full.
How to reproduce it?
We have hundreds of container starts on our DC/OS cluster. Hard to say why/when it happens.
Logs:
We used weave 1.8.0 before and never saw this problem.