Kubernetes weave container killed OOM #3650
Comments
For the "steady growth" portion, we do not expect that. Would you be able to grab a memory profile next time it does that? We do expect some management data and buffers per connection, so depending on the number of nodes in your cluster 200MB may be too low. Note that, as the memory usage gets close to 200MB, Linux will demand-page the This line indicates a particular problem:
Weave Net cannot tolerate inconsistent data like this, so connections will be broken continuously. In the current version there is no solution except to delete the bad data and restart.
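(Concretely, assuming the stock weave-net DaemonSet, that clean-up looks something like the sketch below; the data path and pod name are assumptions rather than anything taken from this issue, so check your own manifest first.)

```sh
# On the node whose weave pod logs the inconsistent-data errors:

# Remove weave's persisted data file (path assumed from the stock weave-net
# manifest's hostPath mount; verify it in your own DaemonSet before deleting).
sudo rm /var/lib/weave/weave-netdata.db

# Delete that node's weave pod so the DaemonSet recreates it and it rejoins
# the cluster with fresh data ("weave-net-xxxxx" is a placeholder pod name).
kubectl -n kube-system delete pod weave-net-xxxxx
```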
I see one log. Can you clarify what I should be seeing?
@bboreham thanks for your response. I'll do my best to grab a memory profile, but the challenge will be noticing it is happening before the container gets killed. Additionally, I've never performed a memory profile on a docker container - do you have any suggestions on how to do that? Should I exec into the pod? Do you have a specific tool you suggest using?

Our kubernetes cluster isn't particularly large. At most we're maybe 20-25 nodes, and usually closer to 10-15. I expect weave should have no scaling issues at that size?

As for the other logs - I apologize. Those are consolidated logs pulled from Elasticsearch for all weave-net containers across the cluster. In the format I posted here you lose the context that they come from different nodes/containers. I was just showing the rest of the network reacting to the OOM'd container dying. I was unsure what data would be relevant to the bug, so I figured I'd give you as much as possible, but it might be totally irrelevant.
10-15 nodes should run fine within 200MB. Above 100 I'd expect issues.

To grab a memory profile, run this on the host: […]

The memory growth could be connected with the messages "error: Received update for IP range I own".
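(For reference, a heap profile can be pulled with something like the sketch below; the port and pprof path are assumptions based on weave's HTTP status endpoint and may differ in your setup.)

```sh
# Grab a heap profile from the weave router on the affected host.
# Assumes weave serves pprof on its status port 6784; adjust if yours differs.
curl -s http://127.0.0.1:6784/debug/pprof/heap -o weave-heap.prof

# Optional: inspect the profile locally with Go's pprof tool before attaching it.
go tool pprof -top weave-heap.prof
```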
This seems related to my experience (see #3659 (comment)). We've increased the limit to 300MB as the weave container operates close to 200MB. The interesting thing is that the container's usage really dropped after increasing the limit 👇

Waiting for another memory spike so that I can profile it.
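For reference, we bumped the limit roughly as sketched below; the container and DaemonSet names assume the stock weave-net manifest, and if kops manages the manifest the equivalent setting belongs in the cluster spec instead so it isn't reverted.

```sh
# Raise the memory limit on the weave container of the stock weave-net DaemonSet.
kubectl -n kube-system set resources daemonset/weave-net \
  -c weave --limits=memory=300Mi
```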
I think it's important to note that the above profile is from weave 2.5.2, and we've just hit the same issue (without it hitting 300MB), so it's not a memory issue (at least for 2.5.2). We've opted to roll back to 2.5.1 but keep the 300MB limit.
We changed a few things in 2.6.0 to avoid OOMs, so I'm going to close this. If you see steady growth again and can get a heap profile, please re-open.
What you expected to happen?
Weave container to have stable memory usage. I've done some searching and it appears weave had a memory leak in a previous version (2.3?) but I'm on 2.5.1. Wondering if there are other sources of runaway memory usage.
What happened?
TL;DR: The weave container on one of our Kubernetes nodes grew slowly in memory until it was killed (OOM), causing network issues in the cluster.
Long version: We've been experiencing periodic (every couple of weeks) network issues in our cluster. Networking would fail in parts of the cluster, and the cause was unclear; I think it may be related to this. Basically, we can see from pod metrics that the weave container slowly grew in memory over the course of a few hours until it OOMed at 200 MiB. After the crash, it appears that, while the container was restarted, the cluster did not recover to a healthy state. I think other weave containers (on other nodes) were failing to recognize that the node had recovered. Or perhaps the container restart didn't fully recover the node?
You can see that even after the container restarted, network traffic didn't recover until a few hours later. It's unclear what caused it to recover, as we were attempting multiple things - but none of them were directly related to weave. Mostly it was restarting other pods.
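One way to check whether the other peers actually re-established their connections after a restart like this is sketched below; the pod label and the in-container weave script path are assumptions based on the stock weave-net manifest.

```sh
# Print each weave pod's view of its peer connections; 'failed' or 'retrying'
# entries here would explain traffic not recovering after the OOM restart.
for pod in $(kubectl -n kube-system get pods -l name=weave-net -o name); do
  echo "== $pod"
  kubectl -n kube-system exec "$pod" -c weave -- \
    /home/weave/weave --local status connections
done
```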
How to reproduce it?
Unknown
Anything else we need to know?
AWS, Kops v. 1.12, Kubernetes version 1.12.9, Weave container v. 2.5.1.
From our cluster spec, this is about all I can see referencing weave: […]
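(If useful, the limit that is actually applied can also be read straight off the running DaemonSet; the names below assume the stock weave-net manifest.)

```sh
# Show the resource requests/limits currently set on the weave container.
kubectl -n kube-system get daemonset weave-net \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="weave")].resources}'
```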
Versions:
Logs:
Failed container logs
Here are the logs from all of the other weave containers shortly after the bad container crashed. You'll see they all start removing the crashed peer.
Other weave containers
Network: