"Troubleshooting Blocked Connections" -> "connection from x to y blocked by Weave NPC" lacks reason #3839
Comments
The way Weave-NPC works is that it sets up a rule to drop everything by default, then adds further rules that allow traffic according to the network policies. So we can see which rules are allowing things, but we cannot see anything specific about the absence of a rule, which is what causes something to be blocked. If you look at the logs, do those blocking occasions happen right after the pod IP is notified, or a long time later?
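To make that concrete, here is a minimal sketch of how to look at those rules yourself, assuming the standard kube-addon install where weave-npc runs as a container in the weave-net DaemonSet in kube-system:

```sh
# Pick one weave-net pod (sketch; assumes the standard kube-addon labels).
WEAVE_POD=$(kubectl get pods -n kube-system -l name=weave-net -o name | head -n1)

# List the NPC chain with packet/byte counters; traffic not matched by an
# allow rule falls through to the default drop.
kubectl exec -n kube-system "$WEAVE_POD" -c weave-npc -- iptables -L WEAVE-NPC -n -v

# Show the ipsets that the allow rules match against.
kubectl exec -n kube-system "$WEAVE_POD" -c weave-npc -- ipset list
```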
@Bregor Details absolutely matter, and without more detail we cannot say whether this issue is the same as any other.
Sure. Didn't want to offend anyone :)
Thinking about how hard it is to diagnose these issues (look at #3829 for instance), some tools could make it easier:
All of the above would be generic: the rules should be the same for any Kubernetes NetworkPolicy implementation. Specifically for Weave Net, we could then say which iptables rules and ipsets are supposed to implement a given policy, maybe even check that the ipset contains the correct address, and show how hit-counts on rules are changing to indicate whether a rule is operating.
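To illustrate the hit-count idea, a rough sketch (my own, not an existing Weave tool): snapshot the packet counters on the NPC chain twice and print the rules whose counters moved, which tells you whether a given allow rule is actually matching traffic.

```sh
# Sketch: watch hit-counters on the Weave NPC chain to see which rules fire.
# Run on a node (or inside the weave-npc container, as shown earlier).
before=$(iptables -L WEAVE-NPC -n -v -x)
sleep 10
after=$(iptables -L WEAVE-NPC -n -v -x)

# Rules whose packet counters changed in the interval were actively matching.
diff <(echo "$before") <(echo "$after") | grep '^[<>]'
```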
I don't have any network policies:
We have a bunch of integration test machines, where this is the biggest source of flakiness. There we install Helm charts (into distinct namespaces), run stuff, and tear things down. Quite often we experience either delays or connection failures for gRPC services. A connection is retried for up to 5 min; I would not expect weave-npc to spend more than 5 min setting up iptables rules, right? The warnings seem to appear whenever a test starts to run. I'll try to find a way to correlate events; if that is not easy, we'll probably disable weave-npc. Given the number of users that seem to hit this, it would be good to invest more time here to reproduce this on your side, otherwise people just turn off weave-npc and you never hear from them again. Maybe consider disabling it by default, so that only people who actually use it run into this.
Here is my repro:
The grep for chartassignment is our controller that creates a namespace and installs the YAML (https://github.com/googlecloudrobotics/core/tree/master/src/go/pkg/controller/chartassignment). With that we get the output below, where we can see that the warnings are printed right after a new 'app' got installed. From the output I can't tell whether weave-npc is still busy applying policies, though.
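A hedged sketch of the kind of correlation meant here, assuming the standard weave-net DaemonSet in kube-system; the controller deployment name and grep patterns are placeholders, not the exact setup:

```sh
# Interleave weave-npc warnings with the controller's install events so the
# timestamps can be compared side by side.
kubectl logs -n kube-system -l name=weave-net -c weave-npc --timestamps \
  | grep 'blocked by Weave NPC' > npc-warnings.log

# Placeholder deployment name; substitute the real controller workload.
kubectl logs deploy/chartassignment-controller --timestamps \
  | grep 'chartassignment' > installs.log

# Merge both streams in timestamp order.
sort -m npc-warnings.log installs.log | less
```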
A few observations:
Isn't there a way for weave-npc to defer actual pod startup until you are done applying the (empty) policy?
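One workaround in that direction, as a hedged sketch (this is not a weave-npc feature; the pod name, image, and probe target are placeholders): gate the main container on an initContainer that polls until traffic actually flows, so the app only starts once the rules are in place.

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: wait-for-network-example
spec:
  initContainers:
  - name: wait-for-network
    image: busybox:1.36
    # Poll until DNS (which also traverses the pod network) answers; replace
    # with a check against the service your app actually depends on.
    command: ['sh', '-c', 'until nslookup kubernetes.default.svc.cluster.local; do sleep 2; done']
  containers:
  - name: app
    image: busybox:1.36
    command: ['sleep', '3600']
EOF
```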
What you expected to happen?
The weave-npc logs are quite busy and not super easy for me to act on. I'd like to see which messages indicate a problem and which are just status. I get frequent messages like the one below at WARN level, which makes me think something is bad. The troubleshooting guide https://www.weave.works/docs/net/latest/kubernetes/kube-addon/#-troubleshooting-blocked-connections also stops where it gets interesting; it could at least say whether one should ever see these messages under normal circumstances.
Next steps would be to explain why connections get blocked, to help people figure out what to do when a connection should not have been blocked.
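For example, a doc could walk through checking whether a blocked pod's IP is actually in the ipset an allow rule matches on. A minimal sketch, run inside the weave-npc container as shown earlier; the set name is a placeholder, taken from the real output:

```sh
# See which ipset each allow rule matches on, then test whether the blocked
# pod's IP is present in it. 'weave-XXXXXXXX' is a placeholder set name.
iptables -S WEAVE-NPC
ipset test weave-XXXXXXXX 10.32.0.5   # exits 0 if the address is in the set
```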
Some additional steps for the docs could be e.g.:
(this leads me to the question: why does core-dns get blocked?)
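If DNS is being dropped by a default-deny policy, a doc example like the following could help. A hedged sketch, assuming CoreDNS runs in kube-system with the usual k8s-app=kube-dns label and that the cluster auto-labels namespaces with kubernetes.io/metadata.name; adjust namespace and labels to your cluster:

```sh
# Sketch: allow all pods in a namespace to reach cluster DNS, so a
# default-deny egress policy does not silently break name resolution.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: default
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
```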
What happened?
Seeing lots of those in the logs: