Weave-net always renders one node unhealthy in the cluster #3243
Comments
Could you perhaps try to get the weave logs from the unreachable node via the AWS console?
To debug this further, perhaps run some
@leth great, thanks for the quick help. So, we saw that the ASG removed the bad node and brought up a new one, and that new node was healthy; we killed all 3 nodes again and then one became unhealthy. Here are the logs from AWS Actions -> Instance Settings -> Get System Log: https://gist.github.com/rasheedamir/394fd3bfc4771c7f106571876912f6e5 Please let me know if we can grab other logs from somewhere.
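For reference, a minimal sketch of fetching that same system log non-interactively with the AWS CLI; the instance ID below is a placeholder:

```
# Fetch the EC2 system console output for the affected node
aws ec2 get-console-output \
  --instance-id i-0123456789abcdef0 \
  --output text > node-console.log
```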
@leth But in the other scenario we can't even SSH into the node to check. We aren't running any other software that could change the iptables rules.
So, after a short while it became healthy again. Here are the logs from the AWS console: https://gist.github.com/rasheedamir/5f6b552057270d3745ca0c5e55a35cbf
@leth here is a bit of the dmesg log:
It looks like that log is truncated at a certain width, could you re-post it? thanks
What instance type are you using? The above are hallmarks of throttling.
@rade we have:
ok. not throttling then.
@leth during this time slot the node wasn't healthy. Also, if you look, there is a gap of 10 minutes where nothing was logged ... not sure if it means something
Scenario 1 seems to be an instance of #3133, which should be fixed in the 2.2.0 release.

Scenario 2: Try creating VMs with more memory. From the dmesg log ^^ you can see that the kernel is struggling to allocate physical memory for OS processes, so the system might become unresponsive due to this.
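A minimal sketch of how one might confirm that memory pressure on the node; the node name is a placeholder and none of these commands come from the issue itself:

```
# On the affected node: look for OOM-killer activity and check free memory
dmesg | grep -iE 'out of memory|oom'
free -m

# From a workstation: check whether Kubernetes reports MemoryPressure on the node
kubectl describe node <node-name> | grep -i pressure
```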
thanks for the response @brb. Wondering why does k8s schedule more pods than a node can handle? Also, we have seen that in some cases the node is removed with this message:

Does it give any indication?
Not sure. I think it's better to ask the Kubernetes community (via a GitHub issue / Slack).

EC2 status checks might fail for many reasons. Anyway, I would suggest consulting the AWS docs.
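As an aside, a sketch of how one could compare a node's allocatable capacity with what its pods actually request; if pods declare no memory requests, the scheduler can place more of them than the node can really sustain. The node name is a placeholder:

```
# Allocatable capacity vs. resources requested by pods already scheduled there
kubectl describe node <node-name> | grep -A 8 'Allocated resources'

# Pods running on that node, to inspect their (possibly missing) resource requests
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
```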
What you expected to happen?
Using weavenet in a Kubernetes cluster, created via Kops, with 3 masters and 3 nodes.
I was expecting all nodes to be healthy with no network issues.
What happened?
In both scenarios, other nodes seem to work absolutely fine.
How to reproduce it?
Anything else we need to know?
Cloud Provider: AWS
Kubernetes configuration: Kops -> 3 masters and 3 nodes
Versions:
Logs:
Weave pod logs:
The weave pod on that node was unreachable so we got logs from another pod:
https://gist.github.com/hazim1093/c7aba837dc94d9b66e24f227f9cd4d6f
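For completeness, a sketch of how the weave logs are usually gathered when the pod is reachable; the label and container names assume the standard weave-net daemonset manifest:

```
# Find the weave-net pod running on the faulty node
kubectl get pods -n kube-system -l name=weave-net -o wide

# Dump logs from the router and the network-policy-controller containers
kubectl logs -n kube-system <weave-net-pod> -c weave
kubectl logs -n kube-system <weave-net-pod> -c weave-npc
```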
Node status in Kubernetes Dashboard:
What we can see in Weave scope:
NOTE: The node seen as `Unmanaged` in the images above is the faulty one.

Network:
Output of the `sudo iptables-save` command on the faulty node:

We were first facing this issue while using weavenet 2.0.1 and then 2.0.5, so we upgraded to 2.2.0. But Scenario 2 is still easily reproducible for us by just killing all nodes at once.
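When comparing iptables state between a healthy node and the faulty one, a filter like the sketch below (the chain-name pattern is an assumption) narrows the diff to the chains Weave manages:

```
# Show only the Weave-related chains and rules from the saved ruleset
sudo iptables-save | grep -i weave
```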