unreachable IPs in 2.5.0 #3600
Thanks for reporting. As you noticed, the 2.5 release has the fix to automate performing rmpeer, and you should see corresponding entries in the logs.
From the logs shared by others who reported a similar issue, I could not find any such activity for the peers listed as unreachable in the ipam status. If performing rmpeer fails, an error is logged, but I don't see any error either. It appears that for some reason the API node-delete event was never sent.
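For anyone checking their own logs for this, something along these lines will surface any reclaim activity; the grep patterns are rough guesses rather than the exact weave log strings, and the name=weave-net label is the one used by the standard DaemonSet manifest:

```sh
# Grep every weave pod's logs for peer-removal / reclaim activity around the
# time a node was terminated. The patterns below are assumptions, not the
# canonical weave log messages.
for pod in $(kubectl get pods -n kube-system -l name=weave-net -o name); do
  echo "== $pod =="
  kubectl logs -n kube-system "$pod" -c weave | grep -iE 'rmpeer|reclaim|delete' || true
done
```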
Versions:
What we see
Weave status
Weave Connection Status
Kube Nodes vs Weave Unreachable Nodes
Thus, we can see that none of the unreachable nodes are currently in the cluster.
Reclaim Issues
Post discussion with @murali.reddy, at this point we saw that no logs turned up for:
However, we did see a bunch of
We also saw:
I ran a diff between the set of unreachable peers and the peers listed above; a sketch of how to reproduce that comparison is below.
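For reference, the comparison can be reproduced with something like the script below; it assumes the standard name=weave-net label, the /home/weave/weave script path inside the container, and that unreachable peers are flagged as unreachable in the status ipam output, so adjust it for your deployment:

```sh
# Pick any healthy weave pod and dump its view of IPAM.
WEAVE_POD=$(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[0].metadata.name}')

# Peers weave currently considers unreachable (the nickname in parentheses is usually the node name).
kubectl exec -n kube-system "$WEAVE_POD" -c weave -- \
  /home/weave/weave --local status ipam | grep -i unreachable | sort > weave-unreachable.txt

# Nodes currently registered with the Kubernetes API.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | sort > kube-nodes.txt

# Any unreachable peer whose nickname still matches a current node deserves a closer look;
# the rest are gone from the cluster but still hold IPAM space.
grep -F -f kube-nodes.txt weave-unreachable.txt || echo "no unreachable peer matches a current node"
```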
We can see that:
@JaveriaK I also see different errors. Perhaps it is better to follow up on these as different issues.
This is unrelated to the IPAM issue. Please get the kubelet logs to see why the weave-net pod is getting restarted.
Again, this is a different issue. Please check whether it is happening on a particular node or across the cluster.
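A quick way to check that, assuming the DaemonSet uses the usual name=weave-net label:

```sh
# Restart counts and node placement for every weave pod: one pod with a high
# RESTARTS count points at a single bad node, roughly equal counts point at a
# cluster-wide problem.
kubectl get pods -n kube-system -l name=weave-net -o wide \
  --sort-by='.status.containerStatuses[0].restartCount'
```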
I've been watching one unreachable entry today for quite some time, and it looks like it hasn't managed to remove itself.
I don't see any
So I managed to debug a problematic node where weave was getting killed, and it turns out it is being OOM-killed by the kernel. Here are the relevant kernel traces:
This one bad weave pod was also affecting the rest of the cluster: I saw all the other weave pods spike their memory usage up to the limit every few minutes.
Also, are there resource recommendations we should follow for higher-traffic clusters?
Thanks @JaveriaK for sharing the kernel logs. As discussed in Slack, there are scaling issues with large clusters, but close to 1GB of usage for a 70-100 node cluster seems to indicate either a memory leak or unbounded usage. If you run into this issue again, please share a memory profile of the weaver process. Unfortunately, the one you shared is probably from a healthy node, so there is not much insight I can draw from it.
Does it take time to get into that state, or is it instant?
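If it helps answer that, a crude watch like the one below shows whether usage grows steadily or jumps right before the OOM kill; it needs metrics-server, and the label selector is again an assumption about a standard weave-net deployment:

```sh
# Sample weave pod memory every minute and append it to a log; reviewing the
# log later shows gradual growth vs. a sudden spike before the kill.
while true; do
  date
  kubectl top pods -n kube-system -l name=weave-net --no-headers
  sleep 60
done | tee -a weave-memory.log
```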
I can confirm it with 2.5.2 as well
@kostyrev Do you have older logs? It would be interesting to see the log from the point where it should have performed the reclaim. What you are showing is further evidence that something went wrong, but nothing we can use to track down the problem directly.
I've got logs starting from 2019/10/12, but there are no mentions of unreachable nodes.
That appears to be 33 minutes older, and does not cover the period when any of those nodes disappeared.
Release 2.6 fixed a lot of things that would cause the symptoms described here.
What you expected to happen?
According to kubernetes/kops#4327 (comment) I expected the issue with reclaiming IPs to be resolved. This cluster has autoscaling enabled so nodes get terminated frequently and the cluster networking layer needs to be able to handle this dynamically.
For now, I've added some automation around the rmpeer solution so that we can recover more quickly the next time this happens; a rough sketch of that automation is below.
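Roughly, the automation is along these lines. This is only a sketch: the label selector, the weave script path inside the container, the unreachable marker in the status ipam output, and the nickname-equals-node-name assumption all need to be verified against your own deployment before running it:

```sh
#!/usr/bin/env bash
# Sketch: reclaim IPAM space held by peers that weave marks unreachable and
# that no longer exist as Kubernetes nodes.
set -eu

WEAVE_POD=$(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[0].metadata.name}')
NODES=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
IPAM=$(kubectl exec -n kube-system "$WEAVE_POD" -c weave -- /home/weave/weave --local status ipam)

echo "$IPAM" | grep -i unreachable | while read -r line; do
  peer=$(echo "$line" | awk '{print $1}')               # e.g. ce:31:e0:06:45:1a(node-name)
  peer_id=$(echo "$peer" | cut -d'(' -f1)               # the peer identifier
  nickname=$(echo "$peer" | cut -d'(' -f2 | tr -d ')')  # usually the node name
  if ! echo "$NODES" | grep -qx "$nickname"; then
    echo "reclaiming $peer_id ($nickname): node is no longer in the cluster"
    kubectl exec -n kube-system "$WEAVE_POD" -c weave -- \
      /home/weave/weave --local rmpeer "$peer_id"
  fi
done
```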
What happened?
Noticed pods on two nodes in the cluster were having networking issues: running pods were not resolving DNS to other pods or to the kubernetes API.
New pods were not starting with:
The weave pods on these nodes were continuously restarting. I can provide full logs from these pods as well.
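For reference, those logs can be pulled with something like the following, including the previous (crashed) container instance; the pod name here is just a placeholder:

```sh
# Current and previous container logs from one of the restarting weave pods.
kubectl logs -n kube-system weave-net-xxxxx -c weave
kubectl logs -n kube-system weave-net-xxxxx -c weave --previous
```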
How to reproduce it?
This issue itself has become less frequent since the last weave upgrade to 2.5.0 (only seen three times since).
Anything else we need to know?
kops version 1.10.0, deployed to AWS
Versions:
Logs:
Here are a few relevant bits from the restarting/crashing weave pods (full logs from these pods can also be provided):
ipam status from the time of the problem.